
Fix docs for reward signals (#2320)

Branch: develop-generalizationTraining-TrainerController
GitHub, 6 years ago
Current commit: b6bbfea4
2 changed files with 14 additions and 13 deletions
  1. docs/ML-Agents-Overview.md (18 changed lines)
  2. docs/Training-ML-Agents.md (9 changed lines)

docs/ML-Agents-Overview.md (18 changed lines)


- **Learning** - where decisions are made using an embedded
[TensorFlow](Background-TensorFlow.md) model. The embedded TensorFlow model
represents a learned policy and the Brain directly uses this model to
determine the action for each Agent. You can train a **Learning Brain**
by dragging it into the Academy's `Broadcast Hub` with the `Control`
checkbox checked.
- **Player** - where decisions are made using real input from a keyboard or
controller. Here, a human player is controlling the Agent and the observations

As mentioned previously, the ML-Agents toolkit ships with several
implementations of state-of-the-art algorithms for training intelligent agents.
In this mode, the only Brain used is a **Learning Brain**. More
specifically, during training, all the medics in the
scene send their observations to the Python API through the External
Communicator (this is the behavior with an External Brain). The Python API

To summarize: our built-in implementations are based on TensorFlow, thus, during
training the Python API uses the observations it receives to learn a TensorFlow
model. This model is then embedded within the Learning Brain during inference to
generate the optimal actions for all Agents linked to that Brain.
The
[Getting Started with the 3D Balance Ball Example](Getting-Started-with-Balance-Ball.md)
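
To make this loop concrete, the sketch below drives a built environment from Python roughly the way a trainer does: it resets the environment, reads observations, and sends actions back each step. It is only an illustration, not the toolkit's training code: the `UnityEnvironment` calls follow the `mlagents.envs` package of this release series, the build name is hypothetical, and random actions stand in for a learned policy.

```python
import numpy as np
from mlagents.envs import UnityEnvironment  # Python API shipped with the toolkit

# Connect to a built Unity environment (build name is hypothetical).
env = UnityEnvironment(file_name="MyEnvironment")
brain_name = env.brain_names[0]               # the Learning Brain under Python control
info = env.reset(train_mode=True)[brain_name]

for _ in range(1000):
    # A trainer would feed info.vector_observations to its policy network;
    # random continuous actions of (hypothetical) size 2 stand in for that here.
    actions = np.random.uniform(-1.0, 1.0, size=(len(info.agents), 2))
    info = env.step(actions)[brain_name]      # rewards arrive in info.rewards

env.close()
```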

In the previous mode, the Learning Brain was used for training to generate
a TensorFlow model that the Learning Brain can later use. However,
any user of the ML-Agents toolkit can leverage their own algorithms for
training. In this case, the Brain type would be set to Learning and be linked
to the Academy's `Broadcast Hub` (with the `Control` checkbox checked)
and the behaviors of all the Agents in the scene will be controlled within Python.
You can even turn your environment into a [gym.](../gym-unity/README.md)
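
As a rough sketch of that gym route (the `UnityEnv` wrapper and its arguments come from the gym-unity package of this release series, and the build name is hypothetical), any gym-compatible algorithm can then drive the environment:

```python
from gym_unity.envs import UnityEnv  # provided by the gym-unity package

# Wrap a built Unity environment as a standard gym environment.
env = UnityEnv("MyEnvironment", worker_id=0, use_visual=False)

obs = env.reset()
for _ in range(100):
    # Any gym-compatible algorithm can be dropped in here; random actions
    # simply exercise the observation/reward loop.
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()
```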

actions from the human player to learn a policy. [Video
Link](https://youtu.be/kpb8ZkMBFYs).
ML-Agents provides ways both to learn directly from demonstrations and to
use demonstrations to help speed up reward-based training. The
[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
covers these features in more depth, using the **Banana Collector** sample
environment.
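
For example, using demonstrations to speed up reward-based training is configured with a `pretraining` block in the trainer configuration. The excerpt below is only a sketch with illustrative values and a hypothetical demonstration path; see [Training-PPO.md](Training-PPO.md) for the authoritative options.

```yaml
BananaLearning:                    # hypothetical Brain name
    trainer: ppo
    pretraining:
        demo_path: ./demos/ExpertBanana.demo   # hypothetical recorded demonstration
        strength: 0.5              # weight of the cloning loss relative to the PPO loss
        steps: 10000               # how long pretraining stays active
```
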
## Flexible Training Scenarios

- **Broadcasting** - As discussed earlier, a Learning Brain sends the
observations for all its Agents to the Python API when dragged into the
Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful
for training and later inference. Broadcasting is a feature which can be
enabled for all types of Brains (Player, Learning, Heuristic) where the Agent
observations and actions are also sent to the Python API (despite the fact
that the Agent is **not** controlled by the Python API). This feature is

docs/Training-ML-Agents.md (9 changed lines)


| **Setting** | **Description** | **Applies To Trainer\*** |
| :------ | :------ | :------ |
| brain\_to\_imitate | For online imitation learning, the name of the GameObject containing the Brain component to imitate. | (online)BC |
| demo_path | For offline imitation learning, the file path of the recorded demonstration file. | (offline)BC |
| buffer_size | The number of experiences to collect before updating the policy model. | PPO |
| curiosity\_enc\_size | The size of the encoding to use in the forward and inverse models in the Curiosity module. | PPO |
| curiosity_strength | Magnitude of the intrinsic reward generated by the Intrinsic Curiosity Module. | PPO |
| gamma | The reward discount rate for the Generalized Advantage Estimator (GAE). | PPO |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, BC |
| lambd | The regularization parameter. | PPO |
| learning_rate | The initial learning rate for gradient descent. | PPO, BC |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
| num_layers | The number of hidden layers in the neural network. | PPO, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Training-RewardSignals.md) for configuration options. | PPO |
| trainer | The type of training to perform: "ppo", "offline_bc" or "online_bc". | PPO, BC |
| use_curiosity | Train using an additional intrinsic reward signal generated from the Intrinsic Curiosity Module. | PPO |
\*PPO = Proximal Policy Optimization, BC = Behavioral Cloning (Imitation)
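
To tie the table to a concrete file, the excerpt below sketches how a few of these settings, including `reward_signals`, might look in a trainer configuration. The Brain name and demonstration path are hypothetical and the values are illustrative; see [Reward Signals](Training-RewardSignals.md) for the per-signal options.

```yaml
BananaLearning:                    # hypothetical Brain name
    trainer: ppo
    buffer_size: 10240
    hidden_units: 128
    num_layers: 2
    learning_rate: 3.0e-4
    reward_signals:
        extrinsic:                 # the reward defined by the environment
            strength: 1.0
            gamma: 0.99
        curiosity:                 # intrinsic curiosity reward
            strength: 0.02
            gamma: 0.99
            encoding_size: 256
        gail:                      # adversarial imitation from a demonstration file
            strength: 0.01
            gamma: 0.99
            demo_path: ./demos/ExpertBanana.demo   # hypothetical recording
```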
