
Fix docs for reward signals (#2320)

/develop-generalizationTraining-TrainerController
GitHub, 5 years ago
Current commit: b6bbfea4
2 files changed, with 14 insertions and 13 deletions
1. docs/ML-Agents-Overview.md (18 changes)
2. docs/Training-ML-Agents.md (9 changes)

docs/ML-Agents-Overview.md (18 changes)


- **Learning** - where decisions are made using an embedded
[TensorFlow](Background-TensorFlow.md) model. The embedded TensorFlow model
represents a learned policy and the Brain directly uses this model to
determine the action for each Agent. You can train a **Learning Brain**
by dragging it into the Academy's `Broadcast Hub` with the `Control`
checkbox checked.
- **Player** - where decisions are made using real input from a keyboard or
controller. Here, a human player is controlling the Agent and the observations

As mentioned previously, the ML-Agents toolkit ships with several
implementations of state-of-the-art algorithms for training intelligent agents.
In this mode, the only Brain used is a **Learning Brain**. More
specifically, during training, all the medics in the
scene send their observations to the Python API through the External
Communicator (this is the behavior with an External Brain). The Python API

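To make the loop above concrete, here is a minimal sketch of how an external trainer might drive the scene through the Python API. The build path, Brain handling, and action size are placeholders, and the method names follow the low-level `mlagents.envs` API of this release, which may differ in other versions:

```python
import numpy as np
from mlagents.envs import UnityEnvironment  # low-level Python API (module path may vary by release)

ACTION_SIZE = 2  # placeholder: set to the Brain's vector action size

# Placeholder path to a compiled Unity environment.
env = UnityEnvironment(file_name="MyEnvironment")

# Reset in training mode; the result maps each external Brain name to its Agents' data.
brain_infos = env.reset(train_mode=True)
brain_name = env.external_brain_names[0]

for _ in range(1000):
    info = brain_infos[brain_name]
    observations = info.vector_observations                  # one row per Agent using this Brain
    # A real trainer would feed the observations through its policy; here we act randomly.
    actions = np.random.uniform(-1, 1, (len(info.agents), ACTION_SIZE))
    brain_infos = env.step({brain_name: actions})

env.close()
```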
To summarize: our built-in implementations are based on TensorFlow; thus, during
training the Python API uses the observations it receives to learn a TensorFlow
model. This model is then embedded within the Learning Brain during inference to
generate the optimal actions for all Agents linked to that Brain.
The
[Getting Started with the 3D Balance Ball Example](Getting-Started-with-Balance-Ball.md)

In the previous mode, the Learning Brain was used for training to generate
a TensorFlow model that the Learning Brain can later use. However,
any user of the ML-Agents toolkit can leverage their own algorithms for
training. In this case, the Brain type would be set to Learning and be linked
to the `Broadcast Hub` (with the `Control` checkbox checked)
and the behaviors of all the Agents in the scene will be controlled within Python.
You can even turn your environment into a [gym](../gym-unity/README.md).
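A minimal sketch of the gym route, assuming the `UnityEnv` wrapper described in the linked gym-unity README (constructor arguments are kept to a minimum here because they differ between releases):

```python
from gym_unity.envs import UnityEnv  # gym wrapper shipped with the toolkit

# Placeholder build path; wraps the Unity environment as a standard gym environment.
env = UnityEnv("MyEnvironment", worker_id=0)

observation = env.reset()
done = False
while not done:
    action = env.action_space.sample()                 # any gym-compatible algorithm can act here
    observation, reward, done, info = env.step(action)

env.close()
```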

actions from the human player to learn a policy. [Video
Link](https://youtu.be/kpb8ZkMBFYs).
The [Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
covers this training mode with the **Banana Collector** sample environment.
ML-Agents provides ways to both learn directly from demonstrations as well as
use demonstrations to help speed up reward-based training. The
[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
covers these features in more depth.
## Flexible Training Scenarios

- **Broadcasting** - As discussed earlier, a Learning Brain sends the
observations for all its Agents to the Python API when dragged into the
Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful
for training and later inference. Broadcasting is a feature which can be
enabled for all types of Brains (Player, Learning, Heuristic) where the Agent
observations and actions are also sent to the Python API (despite the fact
that the Agent is **not** controlled by the Python API). This feature is

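As a rough illustration of broadcasting, the sketch below reads the data of a non-controlled Brain through the same Python API; the Brain name is hypothetical, and the `BrainInfo` attribute names are assumptions based on this release that may differ elsewhere:

```python
from mlagents.envs import UnityEnvironment

env = UnityEnvironment(file_name="MyEnvironment")   # placeholder build path
brain_infos = env.reset(train_mode=False)

# With broadcasting enabled on a Player (or Heuristic) Brain, its data appears here
# read-only: Python can observe it but cannot send actions back to it.
player_info = brain_infos["PlayerBrain"]            # hypothetical Brain name
print(player_info.vector_observations)              # what the human-controlled Agents observed
print(player_info.previous_vector_actions)          # the actions the player actually took

env.close()
```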
docs/Training-ML-Agents.md (9 changes)


| **Setting** | **Description** | **Applies To Trainer\*** |
| :-- | :-- | :-- |
| brain\_to\_imitate | For online imitation learning, the name of the GameObject containing the Brain component to imitate. | (online)BC |
| demo_path | For offline imitation learning, the file path of the recorded demonstration file | (offline)BC |
| buffer_size | The number of experiences to collect before updating the policy model. | PPO |
| curiosity\_enc\_size | The size of the encoding to use in the forward and inverse models in the Curiosity module. | PPO |
| curiosity_strength | Magnitude of intrinsic reward generated by Intrinsic Curiosity Module. | PPO |
| gamma | The reward discount rate for the Generalized Advantage Estimator (GAE). | PPO |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, BC |
| lambd | The regularization parameter. | PPO |
| learning_rate | The initial learning rate for gradient descent. | PPO, BC |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
| num_layers | The number of hidden layers in the neural network. | PPO, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Training-RewardSignals.md) for configuration options. | PPO |
| trainer | The type of training to perform: "ppo" or "imitation". | PPO, BC |
| use_curiosity | Train using an additional intrinsic reward signal generated from Intrinsic Curiosity Module. | PPO |
| trainer | The type of training to perform: "ppo", "offline_bc" or "online_bc". | PPO, BC |
\*PPO = Proximal Policy Optimization, BC = Behavioral Cloning (Imitation)
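For context on the `reward_signals` and `pretraining` rows above, the sketch below shows one plausible shape for these settings, written as a Python dict for consistency with the other examples here (the actual trainer configuration file is YAML); the key names and values are illustrative and should be checked against [Reward Signals](Training-RewardSignals.md) and [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations):

```python
# Illustrative PPO trainer entry; the real trainer configuration uses the same nesting in YAML.
ppo_trainer_settings = {
    "trainer": "ppo",
    "buffer_size": 2048,                       # placeholder value
    "reward_signals": {
        "extrinsic": {                         # the reward defined by the environment itself
            "strength": 1.0,
            "gamma": 0.99,
        },
        "curiosity": {                         # intrinsic reward from the Curiosity module
            "strength": 0.01,
            "gamma": 0.99,
            "encoding_size": 128,
        },
    },
    "pretraining": {                           # optional: bootstrap the policy from a demonstration file
        "demo_path": "./demos/Expert.demo",    # placeholder path
        "strength": 0.5,
        "steps": 10000,
    },
}
```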
