
Release mm GitHub docs (#3864)

* Improvements to Key Components section of ML-Agents Overview

- Moved some documentation from Learning-Environment-Design.
- Added the trainers vs LL-API separation.
- Made a note about gym-unity.
- Some updates to the Agent/Behavior sections
- Updated diagrams to reflect new side channels. Made Behavior type a consistent color.

* Reorganizing the overview file and creating new (empty) sections

This change defines the new structure for the overview doc. Subsequent commits will fill in the sections and rewrite existing sections.

* Reorganizing the main Training ML-Agents page

Re-organizes into feature-specific sections that somewhat mirror the previous commit of reorganizing the overview doc.

Subsequent commits will populate these empty sections.

* Adding Deep RL

- Update ML-Agents-Overview with description of DeepRL training algorithms
- Describe the common and trainer-specific hyperparams in Training-ML-Agents.
- Removed ...
/release_1_branch
GitHub · 4 years ago
Current commit
0dff739b
30 files changed, with 1,798 insertions and 2,241 deletions
  1. README.md (123)
  2. com.unity.ml-agents/Runtime/Agent.cs (6)
  3. com.unity.ml-agents/Runtime/Demonstrations/DemonstrationRecorder.cs (2)
  4. docs/Getting-Started.md (10)
  5. docs/Glossary.md (6)
  6. docs/Learning-Environment-Create-New.md (19)
  7. docs/Learning-Environment-Design-Agents.md (97)
  8. docs/Learning-Environment-Design.md (135)
  9. docs/Learning-Environment-Executable.md (2)
  10. docs/ML-Agents-Overview.md (693)
  11. docs/Migrating.md (55)
  12. docs/Python-API.md (2)
  13. docs/Readme.md (13)
  14. docs/Training-ML-Agents.md (473)
  15. docs/Using-Docker.md (2)
  16. docs/Using-Tensorboard.md (65)
  17. docs/images/learning_environment_basic.png (123)
  18. docs/images/learning_environment_example.png (251)
  19. docs/images/learning_environment_full.png (167)
  20. docs/Training-Configuration-File.md (216)
  21. docs/Feature-Memory.md (48)
  22. docs/Feature-Monitor.md (50)
  23. docs/Training-Using-Concurrent-Unity-Instances.md (25)
  24. docs/Training-Imitation-Learning.md (104)
  25. docs/Reward-Signals.md (205)
  26. docs/Training-Self-Play.md (159)
  27. docs/Training-Environment-Parameter-Randomization.md (171)
  28. docs/Training-PPO.md (350)
  29. docs/Training-SAC.md (356)
  30. docs/Training-Curriculum-Learning.md (111)

README.md (123)


<img src="docs/images/image-banner.png" align="middle" width="3000"/>
# Unity ML-Agents Toolkit (Beta)
[![docs badge](https://img.shields.io/badge/docs-reference-blue.svg)](https://github.com/Unity-Technologies/ml-agents/tree/latest_release/docs/)
[![license badge](https://img.shields.io/badge/license-Apache--2.0-green.svg)](LICENSE)

## Features
* Unity environment control from Python
* 15+ sample Unity environments
* Two deep reinforcement learning algorithms,
[Proximal Policy Optimization](docs/Training-PPO.md)
(PPO) and [Soft Actor-Critic](docs/Training-SAC.md)
(SAC)
* Support for multiple environment configurations and training scenarios
* Self-play mechanism for training agents in adversarial scenarios
* Train memory-enhanced agents using deep reinforcement learning
* Easily definable Curriculum Learning and Generalization scenarios
* Built-in support for [Imitation Learning](docs/Training-Imitation-Learning.md) through Behavioral Cloning or Generative Adversarial Imitation Learning
* Flexible agent control with On Demand Decision Making
* Visualizing network outputs within the environment
* Wrap learning environments as a gym
* Utilizes the Unity Inference Engine
* Train using concurrent Unity environment instances
- Unity environment control from Python
- 15+ sample Unity environments
- Two deep reinforcement learning algorithms, Proximal Policy Optimization (PPO)
and Soft Actor-Critic (SAC)
- Support for multiple environment configurations and training scenarios
- Self-play mechanism for training agents in adversarial scenarios
- Train memory-enhanced agents using deep reinforcement learning
- Easily definable Curriculum Learning and Generalization scenarios
- Built-in support for Imitation Learning through Behavioral Cloning or
Generative Adversarial Imitation Learning
- Flexible agent control with On Demand Decision Making
- Wrap learning environments as a gym
- Utilizes the Unity Inference Engine
- Train using concurrent Unity environment instances
**Our latest, stable release is `Release 1`. Click [here](docs/Readme.md) to
get started with the latest release of ML-Agents.**

details of the changes between versions.
* If you have used an earlier version of the ML-Agents Toolkit, we strongly recommend our
[guide on migrating from earlier versions](docs/Migrating.md).
| **Version** | **Release Date** | **Source** | **Documentation** | **Download** |
|:-------:|:------:|:-------------:|:-------:|:------------:|

## Citation
If you are a researcher interested in a discussion of Unity as an AI platform, see a pre-print
of our [reference paper on Unity and the ML-Agents Toolkit](https://arxiv.org/abs/1809.02627).
If you use Unity or the ML-Agents Toolkit to conduct research, we ask that you cite the following
paper as a reference:
If you are a researcher interested in a discussion of Unity as an AI platform,
see a pre-print of our
[reference paper on Unity and the ML-Agents Toolkit](https://arxiv.org/abs/1809.02627).
Juliani, A., Berges, V., Vckay, E., Gao, Y., Henry, H., Mattar, M., Lange, D. (2018). Unity: A General Platform for Intelligent Agents. *arXiv preprint arXiv:1809.02627.* https://github.com/Unity-Technologies/ml-agents.
If you use Unity or the ML-Agents Toolkit to conduct research, we ask that you
cite the following paper as a reference:
Juliani, A., Berges, V., Vckay, E., Gao, Y., Henry, H., Mattar, M., Lange, D.
(2018). Unity: A General Platform for Intelligent Agents. _arXiv preprint
arXiv:1809.02627._ https://github.com/Unity-Technologies/ml-agents.
* (February 28, 2020) [Training intelligent adversaries using self-play with ML-Agents](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
* (November 11, 2019) [Training your agents 7 times faster with ML-Agents](https://blogs.unity3d.com/2019/11/11/training-your-agents-7-times-faster-with-ml-agents/)
* (October 21, 2019) [The AI@Unity interns help shape the world](https://blogs.unity3d.com/2019/10/21/the-aiunity-interns-help-shape-the-world/)
* (April 15, 2019) [Unity ML-Agents Toolkit v0.8: Faster training on real games](https://blogs.unity3d.com/2019/04/15/unity-ml-agents-toolkit-v0-8-faster-training-on-real-games/)
* (March 1, 2019) [Unity ML-Agents Toolkit v0.7: A leap towards cross-platform inference](https://blogs.unity3d.com/2019/03/01/unity-ml-agents-toolkit-v0-7-a-leap-towards-cross-platform-inference/)
* (December 17, 2018) [ML-Agents Toolkit v0.6: Improved usability of Brains and Imitation Learning](https://blogs.unity3d.com/2018/12/17/ml-agents-toolkit-v0-6-improved-usability-of-brains-and-imitation-learning/)
* (October 2, 2018) [Puppo, The Corgi: Cuteness Overload with the Unity ML-Agents Toolkit](https://blogs.unity3d.com/2018/10/02/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit/)
* (September 11, 2018) [ML-Agents Toolkit v0.5, new resources for AI researchers available now](https://blogs.unity3d.com/2018/09/11/ml-agents-toolkit-v0-5-new-resources-for-ai-researchers-available-now/)
* (June 26, 2018) [Solving sparse-reward tasks with Curiosity](https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/)
* (June 19, 2018) [Unity ML-Agents Toolkit v0.4 and Udacity Deep Reinforcement Learning Nanodegree](https://blogs.unity3d.com/2018/06/19/unity-ml-agents-toolkit-v0-4-and-udacity-deep-reinforcement-learning-nanodegree/)
* (May 24, 2018) [Imitation Learning in Unity: The Workflow](https://blogs.unity3d.com/2018/05/24/imitation-learning-in-unity-the-workflow/)
* (March 15, 2018) [ML-Agents Toolkit v0.3 Beta released: Imitation Learning, feedback-driven features, and more](https://blogs.unity3d.com/2018/03/15/ml-agents-v0-3-beta-released-imitation-learning-feedback-driven-features-and-more/)
* (December 11, 2017) [Using Machine Learning Agents in a real game: a beginner’s guide](https://blogs.unity3d.com/2017/12/11/using-machine-learning-agents-in-a-real-game-a-beginners-guide/)
* (December 8, 2017) [Introducing ML-Agents Toolkit v0.2: Curriculum Learning, new environments, and more](https://blogs.unity3d.com/2017/12/08/introducing-ml-agents-v0-2-curriculum-learning-new-environments-and-more/)
* (September 19, 2017) [Introducing: Unity Machine Learning Agents Toolkit](https://blogs.unity3d.com/2017/09/19/introducing-unity-machine-learning-agents/)
* Overviewing reinforcement learning concepts
- (February 28, 2020)
[Training intelligent adversaries using self-play with ML-Agents](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
- (November 11, 2019)
[Training your agents 7 times faster with ML-Agents](https://blogs.unity3d.com/2019/11/11/training-your-agents-7-times-faster-with-ml-agents/)
- (October 21, 2019)
[The AI@Unity interns help shape the world](https://blogs.unity3d.com/2019/10/21/the-aiunity-interns-help-shape-the-world/)
- (April 15, 2019)
[Unity ML-Agents Toolkit v0.8: Faster training on real games](https://blogs.unity3d.com/2019/04/15/unity-ml-agents-toolkit-v0-8-faster-training-on-real-games/)
- (March 1, 2019)
[Unity ML-Agents Toolkit v0.7: A leap towards cross-platform inference](https://blogs.unity3d.com/2019/03/01/unity-ml-agents-toolkit-v0-7-a-leap-towards-cross-platform-inference/)
- (December 17, 2018)
[ML-Agents Toolkit v0.6: Improved usability of Brains and Imitation Learning](https://blogs.unity3d.com/2018/12/17/ml-agents-toolkit-v0-6-improved-usability-of-brains-and-imitation-learning/)
- (October 2, 2018)
[Puppo, The Corgi: Cuteness Overload with the Unity ML-Agents Toolkit](https://blogs.unity3d.com/2018/10/02/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit/)
- (September 11, 2018)
[ML-Agents Toolkit v0.5, new resources for AI researchers available now](https://blogs.unity3d.com/2018/09/11/ml-agents-toolkit-v0-5-new-resources-for-ai-researchers-available-now/)
- (June 26, 2018)
[Solving sparse-reward tasks with Curiosity](https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/)
- (June 19, 2018)
[Unity ML-Agents Toolkit v0.4 and Udacity Deep Reinforcement Learning Nanodegree](https://blogs.unity3d.com/2018/06/19/unity-ml-agents-toolkit-v0-4-and-udacity-deep-reinforcement-learning-nanodegree/)
- (May 24, 2018)
[Imitation Learning in Unity: The Workflow](https://blogs.unity3d.com/2018/05/24/imitation-learning-in-unity-the-workflow/)
- (March 15, 2018)
[ML-Agents Toolkit v0.3 Beta released: Imitation Learning, feedback-driven features, and more](https://blogs.unity3d.com/2018/03/15/ml-agents-v0-3-beta-released-imitation-learning-feedback-driven-features-and-more/)
- (December 11, 2017)
[Using Machine Learning Agents in a real game: a beginner’s guide](https://blogs.unity3d.com/2017/12/11/using-machine-learning-agents-in-a-real-game-a-beginners-guide/)
- (December 8, 2017)
[Introducing ML-Agents Toolkit v0.2: Curriculum Learning, new environments, and more](https://blogs.unity3d.com/2017/12/08/introducing-ml-agents-v0-2-curriculum-learning-new-environments-and-more/)
- (September 19, 2017)
[Introducing: Unity Machine Learning Agents Toolkit](https://blogs.unity3d.com/2017/09/19/introducing-unity-machine-learning-agents/)
- Overviewing reinforcement learning concepts
In addition to our own documentation, here are some additional, relevant articles:
In addition to our own documentation, here are some additional, relevant
articles:
* [A Game Developer Learns Machine Learning](https://mikecann.co.uk/machine-learning/a-game-developer-learns-machine-learning-intent/)
* [Explore Unity Technologies ML-Agents Exclusively on Intel Architecture](https://software.intel.com/en-us/articles/explore-unity-technologies-ml-agents-exclusively-on-intel-architecture)
* [ML-Agents Penguins tutorial](https://learn.unity.com/project/ml-agents-penguins)
- [A Game Developer Learns Machine Learning](https://mikecann.co.uk/machine-learning/a-game-developer-learns-machine-learning-intent/)
- [Explore Unity Technologies ML-Agents Exclusively on Intel Architecture](https://software.intel.com/en-us/articles/explore-unity-technologies-ml-agents-exclusively-on-intel-architecture)
- [ML-Agents Penguins tutorial](https://learn.unity.com/project/ml-agents-penguins)
## Community and Feedback

For problems with the installation and setup of the ML-Agents Toolkit, or
discussions about how to best set up or train your agents, please create a new
thread on the [Unity ML-Agents forum](https://forum.unity.com/forums/ml-agents.453/)
and make sure to include as much detail as possible.
If you run into any other problems using the ML-Agents Toolkit, or have a specific
feature request, please [submit a GitHub issue](https://github.com/Unity-Technologies/ml-agents/issues).
thread on the
[Unity ML-Agents forum](https://forum.unity.com/forums/ml-agents.453/) and make
sure to include as much detail as possible. If you run into any other problems
using the ML-Agents Toolkit, or have a specific feature request, please
[submit a GitHub issue](https://github.com/Unity-Technologies/ml-agents/issues).
Your opinion matters a great deal to us. Only by hearing your thoughts on the Unity ML-Agents
Toolkit can we continue to improve and grow. Please take a few minutes to
Your opinion matters a great deal to us. Only by hearing your thoughts on the
Unity ML-Agents Toolkit can we continue to improve and grow. Please take a few
minutes to
For any other questions or feedback, connect directly with the ML-Agents
team at ml-agents@unity3d.com.
For any other questions or feedback, connect directly with the ML-Agents team at
ml-agents@unity3d.com.
## License

com.unity.ml-agents/Runtime/Agent.cs (6)


/// Imitation Learning (GAIL) with rewards supplied through this method.
///
/// [Agents - Rewards]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Learning-Environment-Design-Agents.md#rewards
/// [Reward Signals]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Reward-Signals.md
/// [Reward Signals]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/ML-Agents-Overview.md#a-quick-note-on-reward-signals
/// </remarks>
/// <param name="reward">The new value of the reward.</param>
public void SetReward(float reward)

/// Imitation Learning (GAIL) with rewards supplied through this method.
///
/// [Agents - Rewards]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Learning-Environment-Design-Agents.md#rewards
/// [Reward Signals]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Reward-Signals.md
/// [Reward Signals]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/ML-Agents-Overview.md#a-quick-note-on-reward-signals
///</remarks>
/// <param name="increment">Incremental reward value.</param>
public void AddReward(float increment)

/// implementing a simple heuristic function can aid in debugging agent actions and interactions
/// with its environment.
///
/// [Demonstration Recorder]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Training-Imitation-Learning.md#recording-demonstrations
/// [Demonstration Recorder]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Learning-Environment-Design-Agents.md#recording-demonstrations
/// [Actions]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Learning-Environment-Design-Agents.md#actions
/// [GameObject]: https://docs.unity3d.com/Manual/GameObjects.html
/// </remarks>

com.unity.ml-agents/Runtime/Demonstrations/DemonstrationRecorder.cs (2)


/// See [Imitation Learning - Recording Demonstrations] for more information.
///
/// [GameObject]: https://docs.unity3d.com/Manual/GameObjects.html
/// [Imitation Learning - Recording Demonstrations]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Training-Imitation-Learning.md#recording-demonstrations
/// [Imitation Learning - Recording Demonstrations]: https://github.com/Unity-Technologies/ml-agents/blob/release_1_docs/docs/Learning-Environment-Design-Agents.md#recording-demonstrations
/// </remarks>
[RequireComponent(typeof(Agent))]
[AddComponentMenu("ML Agents/Demonstration Recorder", (int)MenuGroup.Default)]

docs/Getting-Started.md (10)


**Note** : You can modify multiple game objects in a scene by selecting them
all at once using the search bar in the Scene Hierarchy.
1. Set the **Inference Device** for this model to `CPU`.
1. Click the :arrow_forward: button in the Unity Editor and you will see the
platforms balance the balls using the pre-trained model.
1. Click the **Play** button in the Unity Editor and you will see the platforms
balance the balls using the pre-trained model.
## Training a new model with Reinforcement Learning

all our example environments, including 3DBall.
- `run-id` is a unique name for this training session.
1. When the message _"Start training by pressing the Play button in the Unity
Editor"_ is displayed on the screen, you can press the :arrow_forward: button
in Unity to start training in the Editor.
Editor"_ is displayed on the screen, you can press the **Play** button in
Unity to start training in the Editor.
If `mlagents-learn` runs correctly and starts training, you should see something
like this:

1. Select the **3DBall** prefab Agent object.
1. Drag the `<behavior_name>.nn` file from the Project window of the Editor to
the **Model** placeholder in the **Ball3DAgent** inspector window.
1. Press the :arrow_forward: button at the top of the Editor.
1. Press the **Play** button at the top of the Editor.
## Next Steps

docs/Glossary.md (6)


agent’s action within the current state of the environment.
- **State** - The underlying properties of the environment (including all agents
within it) at a given time.
- **Step** - Corresponds to each `FixedUpdate` call of the game engine. Is the
smallest atomic change to the state possible.
- **Step** - Corresponds to an atomic change of the engine that happens between
Agent decisions.
- **Experience** - Corresponds to a tuple of [Agent observations, actions,
rewards] of a single Agent obtained after a Step.
- **Update** - Unity function called each time a frame is rendered. ML-Agents
logic should not be placed here.
- **External Coordinator** - ML-Agents class responsible for communication with

docs/Learning-Environment-Create-New.md (19)


1. In the Unity Project window, double-click the `RollerAgent` script to open it
in your code editor.
1. In the editor, add the `using Unity.MLAgents;` and `using Unity.MLAgents.Sensors`
statements and then change the base class from `MonoBehaviour` to `Agent`.
1. In the editor, add the `using Unity.MLAgents;` and
`using Unity.MLAgents.Sensors` statements and then change the base class from
`MonoBehaviour` to `Agent`.
1. Delete the `Update()` method, but leave the `Start()` method alone for now;
we will use it later.
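After these edits, the script might look something like the following minimal sketch (the `Rigidbody` field and `Start()` body are shown as an assumption of where the tutorial goes next):

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class RollerAgent : Agent
{
    Rigidbody rBody;

    void Start()
    {
        // Cache the Rigidbody so later steps can apply forces to the agent.
        rBody = GetComponent<Rigidbody>();
    }
}
```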

`Behavior Type` to `Heuristic Only` in the `Behavior Parameters` of the
RollerAgent.
Press :arrow_forward: to run the scene and use the arrow keys to move the Agent
around the platform. Make sure that there are no errors displayed in the Unity
Editor Console window and that the Agent resets when it reaches its target or
falls from the platform. Note that for more involved debugging, the ML-Agents
SDK includes a convenient [Monitor](Feature-Monitor.md) class that you can use
to easily display Agent status information in the Game window.
Press **Play** to run the scene and use the arrow keys to move the Agent around
the platform. Make sure that there are no errors displayed in the Unity Editor
Console window and that the Agent resets when it reaches its target or falls
from the platform.
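For reference, a minimal `Heuristic()` sketch for an agent with two continuous actions driven by the keyboard axes (the tutorial's actual implementation may differ) goes inside the `RollerAgent` class:

```csharp
public override void Heuristic(float[] actionsOut)
{
    // Map the arrow keys / WASD axes to the two continuous actions.
    actionsOut[0] = Input.GetAxis("Horizontal");
    actionsOut[1] = Input.GetAxis("Vertical");
}
```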
## Training the Environment

decisions the training algorithm has to consider and, in this simple
environment, speeds up training.
To train your agent, run the following command before pressing :arrow_forward:
in the Editor:
To train your agent, run the following command before pressing **Play** in the
Editor:
mlagents-learn config/rollerball_config.yaml --run-id=RollerBall

docs/Learning-Environment-Design-Agents.md (97)


# Agents
**Table of Contents:**
- [Decisions](#decisions)
- [Observations and Sensors](#observations-and-sensors)
- [Vector Observations](#vector-observations)
- [One-hot encoding categorical information](#one-hot-encoding-categorical-information)
- [Normalization](#normalization)
- [Vector Observation Summary & Best Practices](#vector-observation-summary--best-practices)
- [Visual Observations](#visual-observations)
- [Visual Observation Summary & Best Practices](#visual-observation-summary--best-practices)
- [Raycast Observations](#raycast-observations)
- [RayCast Observation Summary & Best Practices](#raycast-observation-summary--best-practices)
- [Actions](#actions)
- [Continuous Action Space](#continuous-action-space)
- [Discrete Action Space](#discrete-action-space)
- [Masking Discrete Actions](#masking-discrete-actions)
- [Actions Summary & Best Practices](#actions-summary--best-practices)
- [Rewards](#rewards)
- [Examples](#examples)
- [Rewards Summary & Best Practices](#rewards-summary--best-practices)
- [Agent Properties](#agent-properties)
- [Destroying an Agent](#destroying-an-agent)
- [Defining Teams for Multi-agent Scenarios](#defining-teams-for-multi-agent-scenarios)
- [Recording Demonstrations](#recording-demonstrations)
An agent is an entity that can observe its environment, decide on the best
course of action using those observations, and execute those actions within its
environment. Agents can be created in Unity by extending the `Agent` class. The

agent to take the optimally informed decision, and ideally no extraneous
information.
- In cases where Vector Observations need to be remembered or compared over
time, either an LSTM (see [here](Feature-Memory.md)) should be used in the
model, or the `Stacked Vectors` value in the agent GameObject's
`Behavior Parameters` should be changed.
time, either an RNN should be used in the model, or the `Stacked Vectors`
value in the agent GameObject's `Behavior Parameters` should be changed.
- Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done
automatically using the `AddOneHotObservation()` method of the `VectorSensor`.
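As an illustration, a `CollectObservations()` sketch for the Sword/Shield/Bow example (the `itemType` field is hypothetical) could be:

```csharp
public override void CollectObservations(VectorSensor sensor)
{
    // itemType is assumed to be 0 (Sword), 1 (Shield) or 2 (Bow).
    // This appends three observations, exactly one of which is 1.
    sensor.AddOneHotObservation(itemType, 3);
}
```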

not sufficient.
- Image size should be kept as small as possible, without the loss of needed
details for decision making.
- Images should be made greyscale in situations where color information is not
- Images should be made grayscale in situations where color information is not
needed for making informed decisions.
### Raycast Observations

- `Behavior Parameters` - The parameters dictating what Policy the Agent will
receive.
- `Behavior Name` - The identifier for the behavior. Agents with the same
behavior name will learn the same policy. If you're using
[curriculum learning](Training-Curriculum-Learning.md), this is used as the
top-level key in the config.
behavior name will learn the same policy.
- `Vector Observation`
- `Space Size` - Length of vector observation for the Agent.
- `Stacked Vectors` - The number of previous vector observations that will

otherwise they will perform inference.
- `Heuristic Only` - the Agent will always use the `Heuristic()` method.
- `Inference Only` - the Agent will always perform inference.
- `Team ID` - Used to define the team for [self-play](Training-Self-Play.md)
- `Team ID` - Used to define the team for self-play
## Monitoring Agents
We created a helpful `Monitor` class that enables visualizing variables within a
Unity environment. While this was built for monitoring an agent's value function
throughout the training process, we imagine it can be more broadly useful. You
can learn more [here](Feature-Monitor.md).
## Destroying an Agent
You can destroy an Agent GameObject during the simulation. Make sure that there

## Defining Teams for Multi-agent Scenarios
Self-play is triggered by including the self-play hyperparameter hierarchy in
the [trainer configuration](Training-ML-Agents.md#training-configurations). To
distinguish opposing agents, set the team ID to different integer values in the
behavior parameters script on the agent prefab.
![Team ID](images/team_id.png)
**_Team ID must be a non-negative integer._**
In symmetric games, since all agents (even on opposing teams) will share the
same policy, they should have the same 'Behavior Name' in their Behavior
Parameters Script. In asymmetric games, they should have a different Behavior
Name in their Behavior Parameters script. Note, in asymmetric games, the agents
must have both different Behavior Names _and_ different team IDs!
For examples of how to use this feature, you can see the trainer configurations
and agent prefabs for our Tennis and Soccer environments. Tennis and Soccer
provide examples of symmetric games. To train an asymmetric game, specify
trainer configurations for each of your behavior names and include the self-play
hyperparameter hierarchy in both.
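Team ID is normally set on the Behavior Parameters component in the Inspector; as a rough sketch, the equivalent from a script (assuming the component lives on the same GameObject) might be:

```csharp
using Unity.MLAgents.Policies;
using UnityEngine;

public class AssignTeam : MonoBehaviour
{
    public int teamId; // e.g. 0 for one team, 1 for the opposing team

    void Awake()
    {
        // BehaviorParameters is the same component that holds the Behavior Name.
        GetComponent<BehaviorParameters>().TeamId = teamId;
    }
}
```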
## Recording Demonstrations
In order to record demonstrations from an agent, add the
`Demonstration Recorder` component to a GameObject in the scene which contains
an `Agent` component. Once added, it is possible to name the demonstration that
will be recorded from the agent.
<p align="center">
<img src="images/demo_component.png"
alt="Demonstration Recorder"
width="375" border="10" />
</p>
When `Record` is checked, a demonstration will be created whenever the scene is
played from the Editor. Depending on the complexity of the task, anywhere from a
few minutes to a few hours of demonstration data may be necessary to be useful
for imitation learning. When you have recorded enough data, end the Editor play
session. A `.demo` file will be created in the `Assets/Demonstrations` folder
(by default). This file contains the demonstrations. Clicking on the file will
provide metadata about the demonstration in the inspector.
<p align="center">
<img src="images/demo_inspector.png"
alt="Demonstration Inspector"
width="375" border="10" />
</p>
You can then specify the path to this file in your training configurations.

docs/Learning-Environment-Design.md (135)


# Reinforcement Learning in Unity
Reinforcement learning is an artificial intelligence technique that trains
_agents_ to perform tasks by rewarding desirable behavior. During reinforcement
learning, an agent explores its environment, observes the state of things, and,
based on those observations, takes an action. If the action leads to a better
state, the agent receives a positive reward. If it leads to a less desirable
state, then the agent receives no reward or a negative reward (punishment). As
the agent learns during training, it optimizes its decision making so that it
receives the maximum reward over time.
The ML-Agents Toolkit uses a reinforcement learning technique called
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
PPO uses a neural network to approximate the ideal function that maps an agent's
observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).
**Note:** if you aren't studying machine and reinforcement learning as a subject
and just want to train agents to accomplish tasks, you can treat PPO training as
a _black box_. There are a few training-related parameters to adjust inside
Unity as well as on the Python training side, but you do not need in-depth
knowledge of the algorithm itself to successfully create and train agents.
Step-by-step procedures for running the training process are provided in the
[Training section](Training-ML-Agents.md).
class. The Academy works with Agent objects in the scene to step
through the simulation.
class. The Academy works with Agent objects in the scene to step through the
simulation.
neural network model. When training is completed
successfully, you can add the trained model file to your Unity project for later
use.
neural network model. When training is completed successfully, you can add the
trained model file to your Unity project for later use.
2. Calls the `OnEpisodeBegin()` function for each Agent in the scene.
3. Calls the `CollectObservations(VectorSensor sensor)` function for each Agent in the scene.
4. Uses each Agent's Policy to decide on the Agent's next action.
5. Calls the `OnActionReceived()` function for each Agent in the scene, passing in
the action chosen by the Agent's Policy.
6. Calls the Agent's `OnEpisodeBegin()` function if the Agent has reached its `Max
Step` count or has otherwise marked itself as `EndEpisode()`.
To create a training environment, extend the Agent class to
implement the above methods; whether you need to implement them or not depends on
your specific scenario.
1. Calls the `OnEpisodeBegin()` function for each Agent in the scene.
1. Calls the `CollectObservations(VectorSensor sensor)` function for each Agent
in the scene.
1. Uses each Agent's Policy to decide on the Agent's next action.
1. Calls the `OnActionReceived()` function for each Agent in the scene, passing
in the action chosen by the Agent's Policy.
1. Calls the Agent's `OnEpisodeBegin()` function if the Agent has reached its
`Max Step` count or has otherwise marked itself as `EndEpisode()`.
**Note:** The API used by the Python training process to communicate with
and control the Academy during training can be used for other purposes as well.
For example, you could use the API to use Unity as the simulation engine for
your own machine learning algorithms. See [Python API](Python-API.md) for more
information.
To create a training environment, extend the Agent class to implement the above
methods; whether you need to implement them or not depends on your specific
scenario.
To train and use the ML-Agents Toolkit in a Unity scene, the scene must contain as many Agent subclasses as you need.
Agent instances should be attached to the GameObject representing that Agent.
To train and use the ML-Agents Toolkit in a Unity scene, the scene must contain
as many Agent subclasses as you need. Agent instances should be attached to the
GameObject representing that Agent.
The Academy is a singleton which orchestrates Agents and their decision making processes. Only
a single Academy exists at a time.
The Academy is a singleton which orchestrates Agents and their decision making
processes. Only a single Academy exists at a time.
To alter the environment at the start of each episode, add your method to the Academy's OnEnvironmentReset action.
To alter the environment at the start of each episode, add your method to the
Academy's OnEnvironmentReset action.
```csharp
public class MySceneBehavior : MonoBehaviour

}
```
For example, you might want to reset an Agent to its starting
position or move a goal to a random position. An environment resets when the
`reset()` method is called on the Python `UnityEnvironment`.
For example, you might want to reset an Agent to its starting position or move a
goal to a random position. An environment resets when the `reset()` method is
called on the Python `UnityEnvironment`.
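A minimal sketch of what the `MySceneBehavior` class above might contain (the callback name is illustrative):

```csharp
using Unity.MLAgents;
using UnityEngine;

public class MySceneBehavior : MonoBehaviour
{
    void Awake()
    {
        // Invoked at the start of every episode, e.g. when Python calls reset().
        Academy.Instance.OnEnvironmentReset += ResetScene;
    }

    void ResetScene()
    {
        // Reposition agents, move goals to random positions, etc.
    }
}
```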
When you reset an environment, consider the factors that should change so that
training is generalizable to different conditions. For example, if you were

### Environment Parameters
Curriculum learning and environment parameter randomization are two training
methods that control specific parameters in your environment. As such, it is
important to ensure that your environment parameters are updated at each step to
the correct values. To enable this, we expose a `EnvironmentParameters` C# class
that you can use to retrieve the values of the parameters defined in the
training configurations for both of those features.
We recommend modifying the environment from the Agent's `OnEpisodeBegin()`
function by leveraging `Academy.Instance.EnvironmentParameters`. See the
WallJump example environment for a sample usage (specifically,
[WallJumpAgent.cs](../Project/Assets/ML-Agents/Examples/WallJump/Scripts/WallJumpAgent.cs)
).
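For example, a hypothetical agent that reads a `wall_height` parameter (both the class and the parameter key are illustrative, not taken from the toolkit docs) might do:

```csharp
using Unity.MLAgents;
using UnityEngine;

public class WallAgent : Agent
{
    public Transform wall; // assigned in the Inspector

    public override void OnEpisodeBegin()
    {
        // Falls back to 1.0 when the trainer does not set the parameter.
        float wallHeight =
            Academy.Instance.EnvironmentParameters.GetWithDefault("wall_height", 1.0f);
        wall.localScale = new Vector3(1f, wallHeight, 1f);
    }
}
```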
### Agent
The Agent class represents an actor in the scene that collects observations and

To create an Agent, extend the Agent class and implement the essential
`CollectObservations(VectorSensor sensor)` and `OnActionReceived()` methods:
* `CollectObservations(VectorSensor sensor)` — Collects the Agent's observation of its environment.
* `OnActionReceived()` — Carries out the action chosen by the Agent's Policy and
- `CollectObservations(VectorSensor sensor)` — Collects the Agent's observation
of its environment.
- `OnActionReceived()` — Carries out the action chosen by the Agent's Policy and
assigns a reward to the current state.
Your implementations of these functions determine how the Behavior Parameters

manually terminate an Agent episode in your `OnActionReceived()` function when the Agent
has finished (or irrevocably failed) its task by calling the `EndEpisode()` function.
You can also set the Agent's `Max Steps` property to a positive value and the
Agent will consider the episode over after it has taken that many steps. You can
use the `Agent.OnEpisodeBegin()` function to prepare the Agent to start again.
manually terminate an Agent episode in your `OnActionReceived()` function when
the Agent has finished (or irrevocably failed) its task by calling the
`EndEpisode()` function. You can also set the Agent's `Max Steps` property to a
positive value and the Agent will consider the episode over after it has taken
that many steps. You can use the `Agent.OnEpisodeBegin()` function to prepare
the Agent to start again.
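Putting those pieces together, a rough sketch of episode handling in an Agent subclass (the class name and the success/failure flags are hypothetical) could look like:

```csharp
using Unity.MLAgents;
using UnityEngine;

public class GoalSeekerAgent : Agent
{
    bool reachedGoal;     // assumed to be set by trigger/collision logic
    bool fellOffPlatform; // assumed to be set elsewhere

    public override void OnActionReceived(float[] vectorAction)
    {
        // ... apply vectorAction to the agent here ...

        if (reachedGoal)
        {
            SetReward(1.0f);
            EndEpisode(); // `Max Steps` is the other way an episode can end
        }
        else if (fellOffPlatform)
        {
            EndEpisode();
        }
    }

    public override void OnEpisodeBegin()
    {
        // Return the Agent to a valid starting state for the next episode.
        transform.localPosition = Vector3.zero;
        reachedGoal = false;
        fellOffPlatform = false;
    }
}
```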
An _environment_ in the ML-Agents Toolkit can be any scene built in Unity. The
Unity scene provides the environment in which agents observe, act, and learn.
How you set up the Unity scene to serve as a learning environment really depends
on your goal. You may be trying to solve a specific reinforcement learning
problem of limited scope, in which case you can use the same scene for both
training and for testing trained agents. Or, you may be training agents to
operate in a complex game or simulation. In this case, it might be more
efficient and practical to create a purpose-built training scene.
* The training scene must start automatically when your Unity application is
- The training scene must start automatically when your Unity application is
* The Academy must reset the scene to a valid starting point for each episode of
- The Academy must reset the scene to a valid starting point for each episode of
* A training episode must have a definite end — either using `Max Steps` or by
- A training episode must have a definite end — either using `Max Steps` or by
## Recording Statistics
We offer developers a mechanism to record statistics from within their Unity
environments. These statistics are aggregated and generated during the training
process. To record statistics, see the `StatsRecorder` C# class.
See the FoodCollector example environment for a sample usage (specifically,
[FoodCollectorSettings.cs](../Project/Assets/ML-Agents/Examples/FoodCollector/Scripts/FoodCollectorSettings.cs)
).
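As a brief sketch (the class and stat key are illustrative), recording a custom statistic could look like:

```csharp
using Unity.MLAgents;
using UnityEngine;

public class EpisodeStats : MonoBehaviour
{
    public void ReportScore(float score)
    {
        // The value appears in TensorBoard under the given key,
        // aggregated over the summary period.
        Academy.Instance.StatsRecorder.Add("Environment/Score", score);
    }
}
```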

docs/Learning-Environment-Executable.md (2)


1. Select the **3DBall** prefab from the Project window and select **Agent**.
1. Drag the `<behavior_name>.nn` file from the Project window of the Editor to
the **Model** placeholder in the **Ball3DAgent** inspector window.
1. Press the :arrow_forward: button at the top of the editor.
1. Press the **Play** button at the top of the Editor.

docs/ML-Agents-Overview.md (693)


# ML-Agents Toolkit Overview
**Table of Contents**
- [Running Example: Training NPC Behaviors](#running-example-training-npc-behaviors)
- [Key Components](#key-components)
- [Training Modes](#training-modes)
- [Built-in Training and Inference](#built-in-training-and-inference)
- [Custom Training and Inference](#custom-training-and-inference)
- [Flexible Training Scenarios](#flexible-training-scenarios)
- [Training Methods: Environment-agnostic](#training-methods-environment-agnostic)
- [A Quick Note on Reward Signals](#a-quick-note-on-reward-signals)
- [Deep Reinforcement Learning](#deep-reinforcement-learning)
- [Curiosity for Sparse-reward Environments](#curiosity-for-sparse-reward-environments)
- [Imitation Learning](#imitation-learning)
- [GAIL (Generative Adversarial Imitation Learning)](#gail-generative-adversarial-imitation-learning)
- [Behavioral Cloning (BC)](#behavioral-cloning-bc)
- [Recording Demonstrations](#recording-demonstrations)
- [Summary](#summary)
- [Training Methods: Environment-specific](#training-methods-environment-specific)
- [Training in Multi-Agent Environments with Self-Play](#training-in-multi-agent-environments-with-self-play)
- [Solving Complex Tasks using Curriculum Learning](#solving-complex-tasks-using-curriculum-learning)
- [Training Robust Agents using Environment Parameter Randomization](#training-robust-agents-using-environment-parameter-randomization)
- [Model Types](#model-types)
- [Learning from Vector Observations](#learning-from-vector-observations)
- [Learning from Cameras using Convolutional Neural Networks](#learning-from-cameras-using-convolutional-neural-networks)
- [Memory-enhanced Agents using Recurrent Neural Networks](#memory-enhanced-agents-using-recurrent-neural-networks)
- [Additional Features](#additional-features)
- [Summary and Next Steps](#summary-and-next-steps)
open-source Unity plugin that enables games and simulations to serve as
environments for training intelligent agents. Agents can be trained using
reinforcement learning, imitation learning, neuroevolution, or other machine
learning methods through a simple-to-use Python API. We also provide
implementations (based on TensorFlow) of state-of-the-art algorithms to enable
game developers and hobbyists to easily train intelligent agents for 2D, 3D and
VR/AR games. These trained agents can be used for multiple purposes, including
controlling NPC behavior (in a variety of settings such as multi-agent and
adversarial), automated testing of game builds and evaluating different game
design decisions pre-release. The ML-Agents Toolkit is mutually beneficial for
both game developers and AI researchers as it provides a central platform where
advances in AI can be evaluated on Unity’s rich environments and then made
accessible to the wider research and game developer communities.
open-source project that enables games and simulations to serve as environments
for training intelligent agents. Agents can be trained using reinforcement
learning, imitation learning, neuroevolution, or other machine learning methods
through a simple-to-use Python API. We also provide implementations (based on
TensorFlow) of state-of-the-art algorithms to enable game developers and
hobbyists to easily train intelligent agents for 2D, 3D and VR/AR games. These
trained agents can be used for multiple purposes, including controlling NPC
behavior (in a variety of settings such as multi-agent and adversarial),
automated testing of game builds and evaluating different game design decisions
pre-release. The ML-Agents Toolkit is mutually beneficial for both game
developers and AI researchers as it provides a central platform where advances
in AI can be evaluated on Unity’s rich environments and then made accessible to
the wider research and game developer communities.
that include overviews and helpful resources on the [Unity
Engine](Background-Unity.md), [machine learning](Background-Machine-Learning.md)
and [TensorFlow](Background-TensorFlow.md). We **strongly** recommend browsing
the relevant background pages if you're not familiar with a Unity scene, basic
that include overviews and helpful resources on the
[Unity Engine](Background-Unity.md),
[machine learning](Background-Machine-Learning.md) and
[TensorFlow](Background-TensorFlow.md). We **strongly** recommend browsing the
relevant background pages if you're not familiar with a Unity scene, basic
subsequent documentation pages provide examples of _how_ to use ML-Agents.
subsequent documentation pages provide examples of _how_ to use ML-Agents. To
get started, watch this
[demo video of ML-Agents in action](https://www.youtube.com/watch?v=fiQsmdwEGT8&feature=youtu.be).
## Running Example: Training NPC Behaviors

## Key Components
The ML-Agents Toolkit is a Unity plugin that contains three high-level
components:
The ML-Agents Toolkit contains five high-level components:
characters.
- **Python API** - which contains all the machine learning algorithms that are
used for training (learning a behavior or policy). Note that, unlike the
characters. The Unity scene provides the environment in which agents observe,
act, and learn. How you set up the Unity scene to serve as a learning
environment really depends on your goal. You may be trying to solve a specific
reinforcement learning problem of limited scope, in which case you can use the
same scene for both training and for testing trained agents. Or, you may be
training agents to operate in a complex game or simulation. In this case, it
might be more efficient and practical to create a purpose-built training
scene. The ML-Agents Toolkit includes an ML-Agents Unity SDK
(`com.unity.ml-agents` package) that enables you to transform any Unity scene
into a learning environment by defining the agents and their behaviors.
- **Python Low-Level API** - which contains a low-level Python interface for
interacting and manipulating a learning environment. Note that, unlike the
and communicates with Unity through the External Communicator.
and communicates with Unity through the Communicator. This API is contained in
a dedicated `mlagents_envs` Python package and is used by the Python training
process to communicate with and control the Academy during training. However,
it can be used for other purposes as well. For example, you could use the API
to use Unity as the simulation engine for your own machine learning
algorithms. See [Python API](Python-API.md) for more information.
Python API. It lives within the Learning Environment.
Python Low-Level API. It lives within the Learning Environment.
- **Python Trainers** which contains all the machine learning algorithms that
enable training agents. The algorithms are implemented in Python and are part
of their own `mlagents` Python package. The package exposes a single
command-line utility `mlagents-learn` that supports all the training methods
and options outlined in this document. The Python Trainers interface solely
with the Python Low-Level API.
- **Gym Wrapper** (not pictured). A common way in which machine learning
researchers interact with simulation environments is via a wrapper provided by
OpenAI called [gym](https://github.com/openai/gym). We provide a gym wrapper
in a dedicated `gym-unity` Python package and
[instructions](../gym-unity/README.md) for using it with existing machine
learning algorithms which utilize gym.
width="700" border="10" />
width="600"
border="10" />
The Learning Environment contains an additional component that helps
organize the Unity scene:
The Learning Environment contains two Unity Components that help organize the
Unity scene:
- **Behavior** - defines specific attributes of the agent such as the number of
actions that agent can take. Each Behavior is uniquely identified by a
`Behavior Name` field. A Behavior can be thought of as a function that receives
observations and rewards from the Agent and returns actions. A Behavior can be
of one of three types: Learning, Heuristic or Inference. A Learning Behavior
is one that is not, yet, defined but about to be trained. A Heuristic Behavior
is one that is defined by a hard-coded set of rules implemented in code. An
Inference Behavior is one that includes a trained Neural Network file. In
essence, after a Learning Behavior is trained, it becomes an Inference
Behavior.
Every Learning Environment will always have one Agent for
every character in the scene. While each Agent must be linked to a Behavior, it is
possible for Agents that have similar observations and actions to have
the same Behavior. In our sample game, we have two teams each with their own medic.
Thus we will have two Agents in our Learning Environment, one for each medic,
but both of these medics can have the same Behavior. Note that these two
medics have the same Behavior. This does not mean that at each instance they will have
identical observation and action _values_. If we expanded our game to include
tank driver NPCs, then the Agent
attached to those characters cannot share its Behavior with the Agent linked to the
medics (medics and drivers have different actions).
Every Learning Environment will always have one Agent for every character in the
scene. While each Agent must be linked to a Behavior, it is possible for Agents
that have similar observations and actions to have the same Behavior. In our
sample game, we have two teams each with their own medic. Thus we will have two
Agents in our Learning Environment, one for each medic, but both of these medics
can have the same Behavior. This does not mean that at each instance they will
have identical observation and action _values_.
width="700"
We have yet to discuss how the ML-Agents Toolkit trains behaviors, and what role
the Python API and External Communicator play. Before we dive into those
details, let's summarize the earlier components. Each character is attached to
an Agent, and each Agent has a Behavior. The Behavior can be thought of as a function
that receives observations
and rewards from the Agent and returns actions. The Learning Environment through
the Academy (not represented in the diagram) ensures that all the
Agents are in sync in addition to controlling environment-wide
settings.
Note that in a single environment, there can be multiple Agents and multiple
Behaviors at the same time. For example, if we expanded our game to include tank
driver NPCs, then the Agent attached to those characters cannot share its
Behavior with the Agent linked to the medics (medics and drivers have different
actions). The Learning Environment through the Academy (not represented in the
diagram) ensures that all the Agents are in sync in addition to controlling
environment-wide settings.
Note that in a single environment, there can be multiple Agents and multiple Behaviors
at the same time. These Behaviors can communicate with Python through the communicator
but can also use a pre-trained _Neural Network_ or a _Heuristic_. Note that it is also
possible to communicate data with Python without using Agents through _Side Channels_.
One example of using _Side Channels_ is to exchange data with Python about
_Environment Parameters_. The following diagram illustrates the above.
Lastly, it is possible to exchange data between Unity and Python outside of the
machine learning loop through _Side Channels_. One example of using _Side
Channels_ is to exchange data with Python about _Environment Parameters_. The
following diagram illustrates the above.
<p align="center">
<img src="images/learning_environment_full.png"

As mentioned previously, the ML-Agents Toolkit ships with several
implementations of state-of-the-art algorithms for training intelligent agents.
More specifically, during training, all the medics in the
scene send their observations to the Python API through the External
Communicator. The Python API
More specifically, during training, all the medics in the scene send their
observations to the Python API through the External Communicator. The Python API
during the inference phase, we use the
TensorFlow model generated from the training phase. Now during the inference
phase, the medics still continue to generate their observations, but instead of
being sent to the Python API, they will be fed into their (internal, embedded)
model to generate the _optimal_ action for each medic to take at every point in
time.
during the inference phase, we use the TensorFlow model generated from the
training phase. Now during the inference phase, the medics still continue to
generate their observations, but instead of being sent to the Python API, they
will be fed into their (internal, embedded) model to generate the _optimal_
action for each medic to take at every point in time.
The [Getting Started Guide](Getting-Started.md)
tutorial covers this training mode with the **3D Balance Ball** sample environment.
The [Getting Started Guide](Getting-Started.md) tutorial covers this training
mode with the **3D Balance Ball** sample environment.
In the previous mode, the Agents were used for training to generate
a TensorFlow model that the Agents can later use. However,
any user of the ML-Agents Toolkit can leverage their own algorithms for
training. In this case, the behaviors of all the Agents in the scene
will be controlled within Python.
You can even turn your environment into a [gym](../gym-unity/README.md).
In the previous mode, the Agents were used for training to generate a TensorFlow
model that the Agents can later use. However, any user of the ML-Agents Toolkit
can leverage their own algorithms for training. In this case, the behaviors of
all the Agents in the scene will be controlled within Python. You can even turn
your environment into a [gym](../gym-unity/README.md).
We do not currently have a tutorial highlighting this mode, but you can learn
more about the Python API [here](Python-API.md).
## Flexible Training Scenarios
While the discussion so-far has mostly focused on training a single agent, with
ML-Agents, several training scenarios are possible. We are excited to see what
kinds of novel and fun environments the community creates. For those new to
training intelligent agents, below are a few examples that can serve as
inspiration:
- Single-Agent. A single agent, with its own reward signal. The traditional way
of training an agent. An example is any single-player game, such as Chicken.
- Simultaneous Single-Agent. Multiple independent agents with independent reward
signals with same `Behavior Parameters`. A parallelized version of the
traditional training scenario, which can speed up and stabilize the training
process. Helpful when you have multiple versions of the same character in an
environment who should learn similar behaviors. An example might be training a
dozen robot-arms to each open a door simultaneously.
- Adversarial Self-Play. Two interacting agents with inverse reward signals. In
two-player games, adversarial self-play can allow an agent to become
increasingly more skilled, while always having the perfectly matched opponent:
itself. This was the strategy employed when training AlphaGo, and more
recently used by OpenAI to train a human-beating 1-vs-1 Dota 2 agent.
- Cooperative Multi-Agent. Multiple interacting agents with a shared reward
signal with same or different `Behavior Parameters`. In this scenario, all
agents must work together to accomplish a task that cannot be done alone.
Examples include environments where each agent only has access to partial
information, which needs to be shared in order to accomplish the task or
collaboratively solve a puzzle.
- Competitive Multi-Agent. Multiple interacting agents with inverse reward
signals with same or different `Behavior Parameters`. In this scenario, agents
must compete with one another to either win a competition, or obtain some
limited set of resources. All team sports fall into this scenario.
- Ecosystem. Multiple interacting agents with independent reward signals with
same or different `Behavior Parameters`. This scenario can be thought of as
creating a small world in which animals with different goals all interact,
such as a savanna in which there might be zebras, elephants and giraffes, or
an autonomous driving simulation within an urban environment.
## Training Methods: Environment-agnostic
The remaining sections overview the various state-of-the-art machine learning
algorithms that are part of the ML-Agents Toolkit. If you aren't studying
machine and reinforcement learning as a subject and just want to train agents to
accomplish tasks, you can treat these algorithms as _black boxes_. There are a
few training-related parameters to adjust inside Unity as well as on the Python
training side, but you do not need in-depth knowledge of the algorithms
themselves to successfully create and train agents. Step-by-step procedures for
running the training process are provided in the
[Training ML-Agents](Training-ML-Agents.md) page.
This section specifically focuses on the training methods that are available
regardless of the specifics of your learning environment.
#### A Quick Note on Reward Signals
In this section we introduce the concepts of _intrinsic_ and _extrinsic_
rewards, which helps explain some of the training methods.
In reinforcement learning, the end goal for the Agent is to discover a behavior
(a Policy) that maximizes a reward. You will need to provide the agent one or
more reward signals to use during training. Typically, a reward is defined by
your environment, and corresponds to reaching some goal. These are what we refer
to as _extrinsic_ rewards, as they are defined outside of the learning
algorithm.
Rewards, however, can be defined outside of the environment as well, to
encourage the agent to behave in certain ways, or to aid the learning of the
true extrinsic reward. We refer to these rewards as _intrinsic_ reward signals.
The total reward that the agent will learn to maximize can be a mix of extrinsic
and intrinsic reward signals.
The ML-Agents Toolkit allows reward signals to be defined in a modular way, and
we provide three reward signals that can be mixed and matched to help shape
your agent's behavior:
- `extrinsic`: represents the rewards defined in your environment, and is
enabled by default
- `gail`: represents an intrinsic reward signal that is defined by GAIL (see
below)
- `curiosity`: represents an intrinsic reward signal that encourages exploration
in sparse-reward environments that is defined by the Curiosity module (see
below).
### Deep Reinforcement Learning
ML-Agents provides implementations of two reinforcement learning algorithms:
- [Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/)
- [Soft Actor-Critic (SAC)](https://bair.berkeley.edu/blog/2018/12/14/sac/)
The default algorithm is PPO. This is a method that has been shown to be more
general purpose and stable than many other RL algorithms.
In contrast with PPO, SAC is _off-policy_, which means it can learn from
experiences collected at any time during the past. As experiences are collected,
they are placed in an experience replay buffer and randomly drawn during
training. This makes SAC significantly more sample-efficient, often requiring
5-10 times fewer samples to learn the same task as PPO. However, SAC tends to
require more model updates. SAC is a good choice for heavier or slower
environments (about 0.1 seconds per step or more). SAC is also a "maximum
entropy" algorithm, and enables exploration in an intrinsic way. Read more about
maximum entropy RL
[here](https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/).
#### Curiosity for Sparse-reward Environments
In environments where the agent receives rare or infrequent rewards (i.e.
sparse-reward), an agent may never receive a reward signal on which to bootstrap
its training process. This is a scenario where the use of an intrinsic reward
signal can be valuable. Curiosity is one such signal which can help the agent
explore when extrinsic rewards are sparse.
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an
implementation of the approach described in
[Curiosity-driven Exploration by Self-supervised Prediction](https://pathak22.github.io/noreward-rl/)
by Pathak, et al. It trains two networks:
- an inverse model, which takes the current and next observation of the agent,
encodes them, and uses the encoding to predict the action that was taken
between the observations
- a forward model, which takes the encoded current observation and action, and
predicts the next encoded observation.
The loss of the forward model (the difference between the predicted and actual
encoded observations) is used as the intrinsic reward, so the more surprised the
model is, the larger the reward will be.
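Concretely (following Pathak et al.; the notation below is ours, not from the toolkit documentation), if $\phi$ is the learned encoder and $f$ the forward model, the intrinsic reward at step $t$ is proportional to the forward-model error:

$$
r^{\text{curiosity}}_t \;\propto\; \big\lVert\, f\big(\phi(s_t), a_t\big) - \phi(s_{t+1}) \,\big\rVert_2^2
$$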
For more information, see our dedicated
[blog post on the Curiosity module](https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/).
### Imitation Learning
It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.
For example, instead of indirectly training a medic with the help of a reward
function, we can give the medic real world examples of observations from the
game and actions from a game controller to guide the medic's behavior. Imitation
Learning uses pairs of observations and actions from a demonstration to learn a
policy. See this [video demo](https://youtu.be/kpb8ZkMBFYs) of imitation
learning.
Imitation learning can either be used alone or in conjunction with reinforcement
learning. If used alone it can provide a mechanism for learning a specific type
of behavior (i.e. a specific style of solving the task). If used in conjunction
with reinforcement learning it can dramatically reduce the time the agent takes
to solve the environment. This can be especially pronounced in sparse-reward
environments. For instance, on the
[Pyramids environment](Learning-Environment-Examples.md#pyramids), using 6
episodes of demonstrations can reduce training steps by more than 4 times. See
Behavioral Cloning + GAIL + Curiosity + RL below.
<p align="center">
<img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="700" border="0" />
</p>
The ML-Agents Toolkit provides a way to learn directly from demonstrations, as
well as use them to help speed up reward-based training (RL). We include two
algorithms called Behavioral Cloning (BC) and Generative Adversarial Imitation
Learning (GAIL). In most scenarios, you can combine these two features (a
configuration sketch follows this list):
- If you want to help your agents learn (especially with environments that have
sparse rewards) using pre-recorded demonstrations, you can generally enable
both GAIL and Behavioral Cloning at low strengths in addition to having an
extrinsic reward. An example of this is provided for the Pyramids example
environment under `PyramidsLearning` in `config/gail_config.yaml`.
- If you want to train purely from demonstrations, GAIL and BC _without_ an
extrinsic reward signal is the preferred approach. An example of this is
provided for the Crawler example environment under `CrawlerStaticLearning` in
`config/gail_config.yaml`.
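As a hedged sketch of the first case (not a drop-in replacement for the
provided `config/gail_config.yaml`), the relevant fragment of a Behavior's
configuration might look like the following, with the `demo_path` and numeric
values purely illustrative:

```yaml
behavioral_cloning:
  demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
  strength: 0.5     # low strength: demonstrations only nudge the policy
  steps: 150000     # stop applying BC after this many steps
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  gail:
    strength: 0.01  # weak GAIL signal alongside the extrinsic reward
    gamma: 0.99
    demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```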
#### GAIL (Generative Adversarial Imitation Learning)
GAIL, or
[Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476),
uses an adversarial approach to reward your Agent for behaving similar to a set
of demonstrations. GAIL can be used with or without environment rewards, and
works well when there are a limited number of demonstrations. In this framework,
a second neural network, the discriminator, is taught to distinguish whether an
observation/action is from a demonstration or produced by the agent. This
discriminator can then examine a new observation/action and provide it a reward
based on how close it believes this new observation/action is to the provided
demonstrations.
At each training step, the agent tries to learn how to maximize this reward.
Then, the discriminator is trained to better distinguish between demonstrations
and agent state/actions. In this way, while the agent gets better and better at
mimicking the demonstrations, the discriminator keeps getting stricter and
stricter and the agent must try harder to "fool" it.
This approach learns a _policy_ that produces states and actions similar to the
demonstrations, requiring fewer demonstrations than direct cloning of the
actions. In addition to learning purely from demonstrations, the GAIL reward
signal can be mixed with an extrinsic reward signal to guide the learning
process.
#### Behavioral Cloning (BC)
BC trains the Agent's policy to exactly mimic the actions shown in a set of
demonstrations. The BC feature can be enabled on the PPO or SAC trainers. As BC
cannot generalize past the examples shown in the demonstrations, BC tends to
work best when demonstrations exist for nearly all of the states that the
agent can experience, or in conjunction with GAIL and/or an extrinsic reward.
#### Recording Demonstrations
Demonstrations of agent behavior can be recorded from the Unity Editor or build,
and saved as assets. These demonstrations contain information on the
observations, actions, and rewards for a given agent during the recording
session. They can be managed in the Editor, as well as used for training with BC
and GAIL.
### Summary
To summarize, we provide 3 training methods: BC, GAIL and RL (PPO or SAC) that
can be used independently or together:
- BC can be used on its own or as a pre-training step before GAIL and/or RL
- GAIL can be used with or without extrinsic rewards
- RL can be used on its own (either PPO or SAC) or in conjunction with BC and/or
GAIL.
Leveraging either BC or GAIL requires recorded demonstrations to be provided as
input to the training algorithms.
## Training Methods: Environment-specific
In addition to the three environment-agnostic training methods introduced in the
previous section, the ML-Agents Toolkit provides additional methods that can aid
in training behaviors for specific types of environments.
### Training in Multi-Agent Environments with Self-Play
ML-Agents provides the functionality to train both symmetric and asymmetric
adversarial games with
[Self-Play](https://openai.com/blog/competitive-self-play/). A symmetric game is
one in which opposing agents are equal in form, function and objective. Examples
of symmetric games are our Tennis and Soccer example environments. In
reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so _they can share the
same policy_. In asymmetric games, this is not the case. An example of an
asymmetric game is Hide and Seek. Agents in these types of games do not always
have the same observation or action spaces and so sharing policy networks is not
necessarily ideal.
With self-play, an agent learns in adversarial games by competing against fixed,
past versions of its opponent (which could be itself as in symmetric games) to
provide a more stable, stationary learning environment. This is compared to
competing against the current, best opponent in every episode, which is
constantly changing (because it's learning).
Self-play can be used with our implementations of both Proximal Policy
Optimization (PPO) and Soft Actor-Critic (SAC). However, from the perspective of
an individual agent, these scenarios appear to have non-stationary dynamics
because the opponent is often changing. This can cause significant issues in the
experience replay mechanism used by SAC. Thus, we recommend that users use PPO.
For further reading on this issue in particular, see the paper
[Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf).
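A hedged sketch of the `self_play` block that enables this mode is shown below;
the parameter names and values match the sample trainer configuration described
later in [Training ML-Agents](Training-ML-Agents.md) and are illustrative, and
the comments are brief glosses rather than full definitions.

```yaml
self_play:
  window: 10                           # number of past policy snapshots kept as potential opponents
  play_against_latest_model_ratio: 0.5 # chance of facing the current policy instead of a past snapshot
  save_steps: 50000                    # steps between saving new opponent snapshots
  swap_steps: 50000                    # steps between swapping in a different opponent snapshot
  team_change: 100000                  # steps between switching which team is learning
```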
### Solving Complex Tasks using Curriculum Learning
This mode is an extension of _Built-in Training and Inference_, and is
particularly helpful when training intricate behaviors for complex environments.
Curriculum learning is a way of training a machine learning model where more
difficult aspects of a problem are gradually introduced in such a way that the
model is always optimally challenged. This idea has been around for a long time,
and it is how we humans typically learn: training on easier tasks can provide a
scaffolding for harder tasks in the future.
<p align="center">
  <img src="images/math.png"
       alt="Example Math Curriculum"
       width="700"
       border="10" />
</p>

_Example of a mathematics curriculum. Lessons progress from simpler topics to
more complex ones, with each building on the last._

Imagine training the medic to scale a wall to arrive at a wounded team member.
The starting point when training a medic to accomplish this task will be a
random policy. That starting policy will have the medic running in circles, and
will likely never, or only very rarely, scale the wall properly to revive their
team member (and achieve the reward). If we start with a simpler task, such as
moving toward an unobstructed team member, then the medic can easily learn to
accomplish the task. From there, we can slowly add to the difficulty of the task
by increasing the size of the wall until the medic can complete the initially
near-impossible task of scaling the wall. We have included an environment to
demonstrate this with ML-Agents, called
[Wall Jump](Learning-Environment-Examples.md#wall-jump).

![Wall](images/curriculum.png)

_Demonstration of a hypothetical curriculum training scenario in which a
progressively taller wall obstructs the path to the goal._
When we think about how reinforcement learning actually works, the reward signal
is received only occasionally throughout training. In complex environments, a
random starting policy will achieve that reward rarely, if ever. Thus by
simplifying the environment at the beginning of training, we allow the agent to
quickly update the random policy to a more meaningful one that is successively
improved as the environment gradually increases in complexity. In our example,
we can imagine first training the medic when each team only contains one player,
and then iteratively increasing the number of players (i.e. the environment
complexity). The ML-Agents Toolkit supports setting custom environment
parameters within the Academy. This allows elements of the environment related
to difficulty or complexity to be dynamically adjusted based on training
progress.
The [Training with Curriculum Learning](Training-Curriculum-Learning.md)
tutorial covers this training mode with the **Wall Jump** sample environment.
_[**Note**: The example provided above is for instructional purposes, and was
based on an early version of the
[Wall Jump example environment](Learning-Environment-Examples.md). As such, it
is not possible to directly replicate the results here using that environment.]_
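For concreteness, a hedged sketch of a curriculum definition for such a
wall-height progression is shown below; it mirrors the Wall Jump curriculum
config presented later in [Training ML-Agents](Training-ML-Agents.md), with the
thresholds and heights as illustrative values.

```yaml
BigWallJump:
  measure: progress                   # advance lessons by fraction of max_steps completed
  thresholds: [0.1, 0.3, 0.5]         # lesson boundaries on the measure
  min_lesson_length: 100              # minimum episodes before the lesson can change
  signal_smoothing: true
  parameters:
    big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
    big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
```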
### Training Robust Agents using Environment Parameter Randomization
An agent trained on a specific environment may be unable to generalize to any
tweaks or variations in the environment (in machine learning this is referred to
as overfitting). This becomes problematic in cases where environments are
instantiated with varying objects or properties. One mechanism to alleviate this
and train more robust agents that can generalize to unseen variations of the
environment is to expose them to these variations during training. Similar to
Curriculum Learning, where environments become more difficult as the agent
learns, the ML-Agents Toolkit provides a way to randomly sample parameters of
the environment during training. We refer to this approach as **Environment
Parameter Randomization**. For those familiar with Reinforcement Learning
research, this approach is based on the concept of Domain Randomization (you can
read more about it [here](https://arxiv.org/abs/1703.06907)). By using parameter
randomization during training, the agent can be better suited to adapt (with
higher performance) to future unseen variations of the environment.
_Example of variations of the 3D Ball environment._
| Ball scale of 0.5 | Ball scale of 4 |
| :--------------------------: | :------------------------: |
| ![](images/3dball_small.png) | ![](images/3dball_big.png) |
In the 3D ball environment example displayed in the figure above, the
environment parameters are `gravity`, `ball_mass` and `ball_scale`.
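A hedged sketch of a sampler file for those parameters is shown below; the
`--sampler` syntax is detailed later in [Training ML-Agents](Training-ML-Agents.md),
the ranges here are placeholders, and the parameter names must match what the
environment actually exposes.

```yaml
resampling-interval: 5000   # steps between drawing new parameter values

gravity:
    sampler-type: "uniform"
    min_value: 7
    max_value: 12
ball_mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 4
ball_scale:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 3
```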
## Model Types
Regardless of the training method deployed, there are a few model types that
users can train using the ML-Agents Toolkit. This is due to the flexibility in
defining agent observations, which can include vector, ray cast and visual
observations. You can learn more about how to instrument an agent's observation
in the [Designing Agents](Learning-Environment-Design-Agents.md) guide.
### Learning from Vector Observations
Whether an agent's observations are ray cast or vector, the ML-Agents Toolkit
provides a fully connected neural network model to learn from those
observations. At training time you can configure different aspects of this model
such as the number of hidden units and number of layers.
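These aspects correspond to a couple of settings in the trainer configuration
file. As a rough, hedged sketch (values are illustrative, not recommendations),
the relevant fragment of a Behavior's configuration might look like:

```yaml
# Fragment of a Behavior's trainer configuration (illustrative values)
hidden_units: 128   # units per fully connected hidden layer
num_layers: 2       # number of hidden layers
normalize: false    # optionally normalize vector observations
```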
### Learning from Cameras using Convolutional Neural Networks
Unlike other platforms, where the agent’s observation might be limited to a
single vector or image, the ML-Agents Toolkit allows multiple cameras to be used
for observations per agent. This enables agents to learn to integrate
information from multiple visual streams. This can be helpful in several
scenarios such as training a self-driving car which requires multiple cameras
with different viewpoints, or a navigational agent which might need to integrate
aerial and first-person visuals. You can learn more about adding visual
observations to an agent
[here](Learning-Environment-Design-Agents.md#multiple-visual-observations).
When visual observations are utilized, the ML-Agents Toolkit leverages
convolutional neural networks (CNN) to learn from the input images. We offer
three network architectures:
- a simple encoder which consists of two convolutional layers
- the implementation proposed by
[Mnih et al.](https://www.nature.com/articles/nature14236), consisting of
three convolutional layers,
- the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three
stacked layers, each with two residual blocks, making a much larger network
than the other two.
The choice of the architecture depends on the visual complexity of the scene and
the available computational resources.
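The encoder is selected via the `vis_encoder_type` setting shown in the sample
trainer configuration later in this documentation. A hedged sketch is below;
only the `simple` value is taken from that sample, and the identifiers for the
other two architectures are assumptions that should be checked against the
[Training Configuration File](Training-Configuration-File.md) reference for
your version.

```yaml
# Visual encoder choice for agents with camera observations (illustrative)
vis_encoder_type: simple       # two convolutional layers (from the sample config)
# vis_encoder_type: nature_cnn # assumed name for the Mnih et al. encoder
# vis_encoder_type: resnet     # assumed name for the IMPALA ResNet encoder
```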
### Memory-enhanced Agents using Recurrent Neural Networks

Have you ever entered a room to get something and immediately forgot what you
were looking for? Don't let that happen to your agents.

![Inspector](images/ml-agents-LSTM.png)

In some scenarios, agents must learn to remember the past in order to take the
best decision. When an agent only has partial observability of the environment,
keeping track of past observations can help the agent learn. Deciding what the
agents should remember in order to solve a task is not easy to do by hand, but
our training algorithms can learn to keep track of what is important to remember
with [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).

## Flexible Training Scenarios

While the discussion so far has mostly focused on training a single agent, with
ML-Agents, several training scenarios are possible. We are excited to see what
kinds of novel and fun environments the community creates. For those new to
training intelligent agents, below are a few examples that can serve as
inspiration:

- Single-Agent. A single agent, with its own reward signal. The traditional way
  of training an agent. An example is any single-player game, such as Chicken.
  [Video Link](https://www.youtube.com/watch?v=fiQsmdwEGT8&feature=youtu.be).
- Simultaneous Single-Agent. Multiple independent agents with independent reward
  signals with the same `Behavior Parameters`. A parallelized version of the
  traditional training scenario, which can speed up and stabilize the training
  process. Helpful when you have multiple versions of the same character in an
  environment who should learn similar behaviors. An example might be training a
  dozen robot-arms to each open a door simultaneously.
  [Video Link](https://www.youtube.com/watch?v=fq0JBaiCYNA).
- Adversarial Self-Play. Two interacting agents with inverse reward signals. In
  two-player games, adversarial self-play can allow an agent to become
  increasingly more skilled, while always having the perfectly matched opponent:
  itself. This was the strategy employed when training AlphaGo, and more
  recently used by OpenAI to train a human-beating 1-vs-1 Dota 2 agent.
- Cooperative Multi-Agent. Multiple interacting agents with a shared reward
  signal with the same or different `Behavior Parameters`. In this scenario, all
  agents must work together to accomplish a task that cannot be done alone.
  Examples include environments where each agent only has access to partial
  information, which needs to be shared in order to accomplish the task or
  collaboratively solve a puzzle.
- Competitive Multi-Agent. Multiple interacting agents with inverse reward
  signals with the same or different `Behavior Parameters`. In this scenario,
  agents must compete with one another to either win a competition, or obtain
  some limited set of resources. All team sports fall into this scenario.
- Ecosystem. Multiple interacting agents with independent reward signals with
  the same or different `Behavior Parameters`. This scenario can be thought of
  as creating a small world in which animals with different goals all interact,
  such as a savanna in which there might be zebras, elephants and giraffes, or
  an autonomous driving simulation within an urban environment.
## Additional Features

- **Monitoring Agent’s Decision Making** - Since communication in ML-Agents is a
two-way street, we provide an Agent Monitor class in Unity which can display
  aspects of the trained Agent, such as the Agent's perception of how well it is
doing (called **value estimates**) within the Unity environment itself. By
leveraging Unity as a visualization tool and providing these outputs in
real-time, researchers and developers can more easily debug an Agent’s
behavior. You can learn more about using the Monitor class
[here](Feature-Monitor.md).
- **Concurrent Unity Instances** - We enable developers to run concurrent,
parallel instances of the Unity executable during training. For certain
scenarios, this should speed up training.
- **Recording Statistics from Unity** - We enable developers to record
statistics from within their Unity environments. These statistics are
aggregated and generated during the training process.
- **Custom Side Channels** - We enable developers to create custom side channels
to manage data transfer between Unity and Python that is unique to their
training workflow and/or environment.
- **Custom Samplers** - We enable developers to create custom sampling methods
for Environment Parameter Randomization. This enables users to customize this
training method for their particular environment.
## Summary and Next Steps

(and enhance) machine learning within Unity.
To help you use ML-Agents, we've created several in-depth tutorials for
[installing ML-Agents](Installation.md), [getting started](Getting-Started.md)
with the 3D Balance Ball environment (one of our many
[sample environments](Learning-Environment-Examples.md)) and
[making your own environment](Learning-Environment-Create-New.md).

55
docs/Migrating.md


# Migrating
## Migrating from Release 1 to latest
## Migrating from 0.15 to Release 1
### Important changes

create an `EnvironmentParametersChannel` instead.
- `SideChannel.OnMessageReceived` is now a protected method (was public)
- SideChannel IncomingMessages methods now take an optional default argument,
which is used when trying to read more data than the message contains.
  (and other python StatsWriters). To do this from your code, use
  `Academy.Instance.StatsRecorder.Add(key, value)` (#3660)
- `num_updates` and `train_interval` for SAC have been replaced with
  `steps_per_update`.
`UnityToGymWrapper` and no longer creates the `UnityEnvironment`. Instead, the
`UnityEnvironment` must be passed as input to the constructor of
`UnityToGymWrapper`
- Public fields and properties on several classes were renamed to follow Unity's
C# style conventions. All public fields and properties now use "PascalCase"
instead of "camelCase"; for example, `Agent.maxStep` was renamed to

`public override void Heuristic(float[] actionsOut)` and assign values to
`actionsOut` instead of returning an array.
- If you used `SideChannels` you must:
  - Replace `Academy.FloatProperties` with
    `Academy.Instance.EnvironmentParameters`.
    removed. Use `SideChannelManager.RegisterSideChannel` and
    `SideChannelManager.UnregisterSideChannel` instead.
- Set `steps_per_update` to be around equal to the number of agents in your
  environment, times `num_updates` and divided by `train_interval`.
- Replace `UnityEnv` with `UnityToGymWrapper` in your code. The constructor no
  longer takes a file name as input but a fully constructed `UnityEnvironment`
  instead.
- If you have a custom `ISensor` implementation, you will need to change the
  signature of its `Write()` method to use `ObservationWriter` instead of
  `WriteAdapter`.
## Migrating from 0.14 to 0.15

The Academy class no longer has a `ResetParameters`. To access shared float
properties with Python, use the new `FloatProperties` field on the Academy.
- Offline Behavioral Cloning has been removed. To learn from demonstrations, use
  the GAIL and Behavioral Cloning features with either PPO or SAC.
- `mlagents.envs` was renamed to `mlagents_envs`. The previous repo layout
depended on [PEP420](https://www.python.org/dev/peps/pep-0420/), which caused
problems with some of our tooling such as mypy and pylint.

- `use_curiosity`, `curiosity_strength`, `curiosity_enc_size`: Define a
`curiosity` reward signal and set its `strength` to `curiosity_strength`,
and `encoding_size` to `curiosity_enc_size`. Give it the same `gamma` as
  your `extrinsic` signal to mimic previous behavior.
- TensorBoards generated when running multiple environments in v0.8 are not
comparable to those generated in v0.9 in terms of step count. Multiply your
v0.8 step count by `num_envs` for an approximate comparison. You may need to

[trainer_config.yaml](../config/trainer_config.yaml). An example of passing a
trainer configuration to `mlagents-learn` is shown above.
- The environment name is now passed through the `--env` option.
- Curriculum learning has been changed. In summary:
- Curriculum files for the same environment must now be placed into a folder.
Each curriculum file should be named after the Brain whose curriculum it
specifies.

[here](Training-ML-Agents.md#training-with-mlagents-learn).
- Hyperparameters for training Brains are now stored in the
`trainer_config.yaml` file. For more information on using this file, see
[here](Training-ML-Agents.md#training-configurations).
### Unity API

2
docs/Python-API.md


or properties. More on them in the [Modifying the environment from Python](Python-API.md#modifying-the-environment-from-python) section.
If you want to directly interact with the Editor, you need to use
`file_name=None`, then press the **Play** button in the Editor when the
message _"Start training by pressing the Play button in the Unity Editor"_ is
displayed on the screen

13
docs/Readme.md


### Advanced Usage
- [Using the Monitor](Feature-Monitor.md)
- [Reward Signals](Reward-Signals.md)
- [Training Using Concurrent Unity Instances](Training-Using-Concurrent-Unity-Instances.md)
- [Training with Proximal Policy Optimization](Training-PPO.md)
- [Training with Soft Actor-Critic](Training-SAC.md)
- [Training with Self-Play](Training-Self-Play.md)
### Advanced Training Methods
- [Training with Curriculum Learning](Training-Curriculum-Learning.md)
- [Training with Imitation Learning](Training-Imitation-Learning.md)
- [Training with LSTM](Feature-Memory.md)
- [Training with Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md)
## Inference

473
docs/Training-ML-Agents.md


# Training ML-Agents
**Table of Contents**
- [Training with mlagents-learn](#training-with-mlagents-learn)
- [Starting Training](#starting-training)
- [Observing Training](#observing-training)
- [Stopping and Resuming Training](#stopping-and-resuming-training)
- [Loading an Existing Model](#loading-an-existing-model)
- [Training Configurations](#training-configurations)
- [Trainer Config File](#trainer-config-file)
- [Curriculum Learning](#curriculum-learning)
- [Specifying Curricula](#specifying-curricula)
- [Training with a Curriculum](#training-with-a-curriculum)
- [Environment Parameter Randomization](#environment-parameter-randomization)
- [Included Sampler Types](#included-sampler-types)
- [Defining a New Sampler Type](#defining-a-new-sampler-type)
- [Training with Environment Parameter Randomization](#training-with-environment-parameter-randomization)
- [Training Using Concurrent Unity Instances](#training-using-concurrent-unity-instances)
For a broad overview of reinforcement learning, imitation learning and all the
training scenarios, methods and options within the ML-Agents Toolkit, see
[ML-Agents Toolkit Overview](ML-Agents-Overview.md).

- `<trainer-config-file>` is the file path of the trainer configuration yaml.
This contains all the hyperparameter values. We offer a detailed guide on the
structure of this file and the meaning of the hyperparameters (and advice on
how to set them) in the dedicated
[Training Configurations](#training-configurations) section below.
Editor. Press the **Play** button in Unity when the message _"Start training
by pressing the Play button in the Unity Editor"_ is displayed on the screen.
- `<run-identifier>` is a unique name you can use to identify the results of
your training runs.

`--initialize-from=<run-identifier>`, where `<run-identifier>` is the old run
ID.
## Training Configurations
The Unity ML-Agents Toolkit provides a wide range of training scenarios, methods
and options. As such, specific training runs may require different training

The training config files `config/trainer_config.yaml`,
`config/sac_trainer_config.yaml`, `config/gail_config.yaml` and
`config/offline_bc_config.yaml` specify the training method, the
hyperparameters, and a few additional values to use when training with Proximal
Policy Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative Adversarial
Imitation Learning) with PPO/SAC, and Behavioral Cloning (BC)/Imitation with
PPO/SAC. These files are divided into sections. The **default** section defines
the default values for all the available settings. You can also add new
sections to override these defaults to train specific Behaviors. Name each of
these override sections after the appropriate `Behavior Name`. Sections for the
example environments are included in the provided config files.
More specifically, this section offers a detailed guide on four command-line
flags for `mlagents-learn` that control the training configurations:
- `<trainer-config-file>`: defines the training hyperparameters for each
Behavior in the scene
- `--curriculum`: defines the set-up for Curriculum Learning
- `--sampler`: defines the set-up for Environment Parameter Randomization
- `--num-envs`: number of concurrent Unity instances to use during training
Reminder that a detailed description of all command-line options can be found by
using the help utility:
```sh
mlagents-learn --help
```
It is important to highlight that successfully training a Behavior in the
ML-Agents Toolkit involves tuning the training hyperparameters and
configuration. This guide contains some best practices for tuning the training
process when the default parameters don't seem to be giving the level of
performance you would like. We provide sample configuration files for our
example environments in the [config/](../config/) directory. The
`config/trainer_config.yaml` was used to train the 3D Balance Ball in the
[Getting Started](Getting-Started.md) guide. That configuration file uses the
PPO trainer, but we also have configuration files for SAC and GAIL.
Additionally, the set of configurations you provide depend on the training
functionalities you use (see [ML-Agents Toolkit Overview](ML-Agents-Overview.md)
for a description of all the training functionalities). Each functionality you
add typically has its own training configurations or additional configuration
files. For instance:
- Use PPO or SAC?
- Use Recurrent Neural Networks for adding memory to your agents?
- Use the intrinsic curiosity module?
- Ignore the environment reward signal?
- Pre-train using behavioral cloning? (Assuming you have recorded
demonstrations.)
- Include the GAIL intrinsic reward signals? (Assuming you have recorded
demonstrations.)
- Use self-play? (Assuming your environment includes multiple agents.)
The answers to the above questions will dictate the configuration files and the
parameters within them. The rest of this section breaks down the different
configuration files and explains the possible settings for each.
### Trainer Config File
We begin with the trainer config file, `<trainer-config-file>`, which includes a
set of configurations for each Behavior in your scene. Some of the
configurations are required while others are optional. To help us get started,
below is a sample file that includes all the possible settings if we're using a
PPO trainer with all the possible training functionalities enabled (memory,
behavioral cloning, curiosity, GAIL and self-play). You will notice that
curriculum and environment parameter randomization settings are not part of this
file, but their settings live in different files that we'll cover in subsequent
sections.
```yaml
BehaviorPPO:
trainer: ppo
# Trainer configs common to PPO/SAC (excluding reward signals)
batch_size: 1024
buffer_size: 10240
hidden_units: 128
learning_rate: 3.0e-4
learning_rate_schedule: linear
max_steps: 5.0e5
normalize: false
num_layers: 2
time_horizon: 64
vis_encoder_type: simple
# PPO-specific configs
beta: 5.0e-3
epsilon: 0.2
lambd: 0.95
num_epoch: 3
threaded: true
# memory
use_recurrent: true
sequence_length: 64
memory_size: 256
# behavior cloning
behavioral_cloning:
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
strength: 0.5
steps: 150000
batch_size: 512
num_epoch: 3
samples_per_update: 0
init_path:
reward_signals:
# environment reward
extrinsic:
strength: 1.0
gamma: 0.99
# curiosity module
curiosity:
strength: 0.02
gamma: 0.99
encoding_size: 256
learning_rate: 3e-4
# GAIL
gail:
strength: 0.01
gamma: 0.99
encoding_size: 128
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
learning_rate: 3e-4
use_actions: false
use_vail: false
# self-play
self_play:
window: 10
play_against_latest_model_ratio: 0.5
save_steps: 50000
swap_steps: 50000
team_change: 100000
```
Here is an equivalent file if we use an SAC trainer instead. Notice that the
configurations for the additional functionalities (memory, behavioral cloning,
curiosity and self-play) remain unchanged.
```yaml
BehaviorSAC:
trainer: sac
# Trainer configs common to PPO/SAC (excluding reward signals)
# same as PPO config
# SAC-specific configs (replaces the "PPO-specific configs" section above)
buffer_init_steps: 0
tau: 0.005
steps_per_update: 1
train_interval: 1
init_entcoef: 1.0
save_replay_buffer: false
# memory
# same as PPO config
# pre-training using behavior cloning
behavioral_cloning:
# same as PPO config
reward_signals:
reward_signal_num_update: 1 # only applies to SAC
# environment reward
extrinsic:
# same as PPO config
# curiosity module
curiosity:
# same as PPO config
# GAIL
gail:
# same as PPO config
# self-play
self_play:
# same as PPO config
```
We now break apart the components of the configuration file and describe what
each of these parameters mean and provide guidelines on how to set them. See
[Training Configuration File](Training-Configuration-File.md) for a detailed
description of all the configurations listed above.
### Curriculum Learning
To enable curriculum learning, you need to provide the `--curriculum` CLI option
and point to a YAML file that defines the curriculum. Here is one example file:
```yml
BehaviorY:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
wall_height: [1.5, 2.0, 2.5, 4.0]
```
Each group of Agents under the same `Behavior Name` in an environment can have a
corresponding curriculum. These curricula are held in what we call a
"metacurriculum". A metacurriculum allows different groups of Agents to follow
different curricula within the same environment.
#### Specifying Curricula
In order to define the curricula, the first step is to decide which parameters
of the environment will vary. In the case of the Wall Jump environment, the
height of the wall is what varies. Rather than adjusting it by hand, we will
create a YAML file which describes the structure of the curricula. Within it, we
can specify which points in the training process our wall height will change,
either based on the percentage of training steps which have taken place, or what
the average reward the agent has received in the recent past is. Below is an
example config for the curricula for the Wall Jump environment.
```yaml
BigWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
small_wall_height: [1.5, 2.0, 2.5, 4.0]
```
For reference, the table below summarizes the settings that can appear in a
trainer config file and indicates which trainer each applies to:

| **Setting** | **Description** | **Applies To Trainer\*** |
| :--------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | |
| beta | The strength of entropy regularization. | PPO |
| buffer_size | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer. | PPO, SAC |
| buffer_init_steps | The number of experiences to collect into the buffer before updating the policy model. | SAC |
| epsilon | Influences how rapidly the policy can evolve during training. | PPO |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC |
| init_entcoef | How much the agent should explore in the beginning of training. | SAC |
| lambd | The regularization parameter. | PPO |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC |
| learning_rate_schedule | Determines how learning rate changes over time. | PPO, SAC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
| normalize | Whether to automatically normalize observations. | PPO, SAC |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC |
| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC |
| save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC |
| tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC |
| steps_per_update | Ratio of agent steps per mini-batch update. | SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
| init_path | Initialize trainer from a previously saved model. | PPO, SAC |
| threaded | Run the trainer in a parallel thread from the environment steps. (Default: true) | PPO, SAC |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral
Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning

The curriculum for each Behavior has the following parameters:
| **Setting** | **Description** |
| :------------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `measure` | The metric used to measure learning progress and advancement in lessons.<br><br> `reward` uses a measure of received reward, while `progress` uses the ratio of steps/max_steps. |
| `thresholds` | Points in value of `measure` where lesson should be increased. |
| `min_lesson_length` | The minimum number of episodes that should be completed before the lesson can change. If `measure` is set to `reward`, the average cumulative reward of the last `min_lesson_length` episodes will be used to determine if the lesson should change. Must be nonnegative. <br><br> **Important**: the average reward that is compared to the thresholds is different than the mean reward that is logged to the console. For example, if `min_lesson_length` is `100`, the lesson will increment after the average cumulative reward of the last `100` episodes exceeds the current threshold. The mean reward logged to the console is dictated by the `summary_freq` parameter defined above. |
| `signal_smoothing` | Whether to weight the current progress measure by previous values. |
| `parameters` | Corresponds to environment parameters to control. Length of each array should be one greater than number of thresholds. |
#### Training with a Curriculum
Once we have specified our metacurriculum and curricula, we can launch
`mlagents-learn` using the `--curriculum` flag to point to the config file for
our curricula and PPO will train using Curriculum Learning. For example, to
train agents in the Wall Jump environment with curriculum learning, we can run:
```sh
mlagents-learn config/trainer_config.yaml --curriculum=config/curricula/wall_jump.yaml --run-id=wall-jump-curriculum
```
We can then keep track of the current lessons and progresses via TensorBoard.
**Note**: If you are resuming a training session that uses curriculum, please
pass the number of the last-reached lesson using the `--lesson` flag when
running `mlagents-learn`.
### Environment Parameter Randomization
To enable parameter randomization, you need to provide the `--sampler` CLI
option and point to a YAML file that defines the parameter samplers. Here is one
example file:
```yaml
resampling-interval: 5000

mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10

gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]

scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```
Note that `mass`, `gravity` and `scale` are the names of the environment
parameters that will be sampled. If a parameter specified in the file doesn't
exist in the environment, then this parameter will be ignored.
| **Setting** | **Description** |
| :--------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `resampling-interval` | Number of steps for the agent to train under a particular environment configuration before resetting the environment with a new sample of `Environment Parameters`. |
| `sampler-type` | Type of sampler use for this `Environment Parameter`. This is a string that should exist in the `Sampler Factory` (explained below). |
| `sampler-type-sub-arguments` | Specify the sub-arguments depending on the `sampler-type`. In the example above, this would correspond to the `intervals` under the `sampler-type` `multirange_uniform` for the `Environment Parameter` called `gravity`. The key name should match the name of the corresponding argument in the sampler definition (explained below). |
#### Included Sampler Types
Below is a list of included `sampler-type` as part of the toolkit.
- `uniform` - Uniform sampler
- Uniformly samples a single float value between defined endpoints. The
sub-arguments for this sampler to specify the interval endpoints are as
below. The sampling is done in the range of [`min_value`, `max_value`).
- **sub-arguments** - `min_value`, `max_value`
- `gaussian` - Gaussian sampler
- Samples a single float value from the distribution characterized by the mean
and standard deviation. The sub-arguments to specify the Gaussian
distribution to use are as below.
- **sub-arguments** - `mean`, `st_dev`
- `multirange_uniform` - Multirange uniform sampler
- Uniformly samples a single float value between the specified intervals.
  Samples by first performing a weighted pick of an interval from the list of
intervals (weighted based on interval width) and samples uniformly from the
selected interval (half-closed interval, same as the uniform sampler). This
sampler can take an arbitrary number of intervals in a list in the following
format: [[`interval_1_min`, `interval_1_max`], [`interval_2_min`,
`interval_2_max`], ...]
- **sub-arguments** - `intervals`
The implementation of the samplers can be found at
`ml-agents-envs/mlagents_envs/sampler_class.py`.
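For example, a hedged sketch of a sampler file entry using the `gaussian`
sampler for one of the parameters above (the mean and standard deviation values
are placeholders):

```yaml
gravity:
    sampler-type: "gaussian"
    mean: 9.8     # center of the sampled distribution
    st_dev: 2.0   # spread of the sampled values
```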
#### Defining a New Sampler Type
If you want to define your own sampler type, you must first inherit from the
_Sampler_ base class (included in the `sampler_class` file) and preserve its
interface. Once the new sampler class is defined, it must be registered in the
Sampler Factory. This is done by calling the _register_sampler_ method of the
`SamplerFactory`. The call is as follows:
`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)`
Once the Sampler Factory reflects the new registration, the new sampler type can
be used to sample any `Environment Parameter`. For example, let's say a new
sampler type was implemented as below and we register the `CustomSampler` class
with the string `custom-sampler` in the Sampler Factory.
```python
import numpy as np
from mlagents_envs.sampler_class import Sampler  # base class (path noted above)

class CustomSampler(Sampler):
    def __init__(self, argA, argB, argC):
        self.possible_vals = [argA, argB, argC]
    def sample_all(self):
        # Uniformly pick one of the three configured values.
        return np.random.choice(self.possible_vals)
```
Now we need to specify the new sampler type in the sampler YAML file. For
example, we use this new sampler type for the `Environment Parameter` _mass_.
```yaml
mass:
sampler-type: "custom-sampler"
argB: 1
argA: 2
argC: 3
```
#### Training with Environment Parameter Randomization
After the sampler YAML file is defined, we proceed by launching `mlagents-learn`
and specify our configured sampler file with the `--sampler` flag. For example,
if we wanted to train the 3D ball agent with parameter randomization using
`Environment Parameters` with `config/3dball_randomize.yaml` sampling setup, we
would run
```sh
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_randomize.yaml
--run-id=3D-Ball-randomize
```
We can observe progress and metrics via Tensorboard.
### Training Using Concurrent Unity Instances
In order to run concurrent Unity instances during training, set the number of
environment instances using the command line option `--num-envs=<n>` when you
invoke `mlagents-learn`. Optionally, you can also set the `--base-port`, which
is the starting port used for the concurrent Unity instances.
Some considerations:
- **Buffer Size** - If you are having trouble getting an agent to train, even
with multiple concurrent Unity instances, you could increase `buffer_size` in
the `config/trainer_config.yaml` file. A common practice is to multiply
`buffer_size` by `num-envs`.
- **Resource Constraints** - Invoking concurrent Unity instances is constrained
by the resources on the machine. Please use discretion when setting
`--num-envs=<n>`.
- **Result Variation Using Concurrent Unity Instances** - If you keep all the
hyperparameters the same, but change `--num-envs=<n>`, the results and model
would likely change.

2
docs/Using-Docker.md


- `<environment-name>` **(Optional)**: If you are training with a linux
executable, this is the name of the executable. If you are training in the
Editor, do not pass a `<environment-name>` argument and press the
**Play** button in Unity when the message _"Start training by pressing
the Play button in the Unity Editor"_ is displayed on the screen.
- `source`: Reference to the path in your host OS where you will store the Unity
executable.

65
docs/Using-Tensorboard.md


### Environment Statistics
- `Environment/Lesson` - Plots the progress from lesson to lesson. Only
  interesting when performing curriculum training.
- `Environment/Cumulative Reward` - The mean cumulative episode reward over all
agents. Should increase during a successful training session.

### Is Training
- `Is Training` - A boolean indicating if the agent is updating its model.
- `Policy/Entropy` (PPO; SAC) - How random the decisions of the model are.
  Should slowly decrease during a successful training process. If it decreases
  too quickly, the `beta` hyperparameter should be increased.
- `Policy/Learning Rate` (PPO; SAC) - How large a step the training algorithm
- `Policy/Entropy Coefficient` (SAC) - Determines the relative importance of the
entropy term. This value is adjusted automatically so that the agent retains
some amount of randomness during training.
- `Policy/Extrinsic Reward` (PPO; SAC) - This corresponds to the mean cumulative
reward received from the environment per-episode.
- `Policy/Value Estimate` (PPO; SAC) - The mean value estimate for all states
visited by the agent. Should increase during a successful training session.
- `Policy/Curiosity Reward` (PPO/SAC+Curiosity) - This corresponds to the mean
  cumulative intrinsic reward generated per-episode.
- `Policy/Curiosity Value Estimate` (PPO/SAC+Curiosity) - The agent's value
estimate for the curiosity reward.
- `Policy/GAIL Reward` (PPO/SAC+GAIL) - This corresponds to the mean cumulative
discriminator-based reward generated per-episode.
- `Policy/GAIL Value Estimate` (PPO/SAC+GAIL) - The agent's value estimate for
the GAIL reward.
- `Policy/GAIL Policy Estimate` (PPO/SAC+GAIL) - The discriminator's estimate
for states and actions generated by the policy.
- `Policy/GAIL Expert Estimate` (PPO/SAC+GAIL) - The discriminator's estimate
for states and actions drawn from expert demonstrations.
- `Losses/Policy Loss` (PPO; SAC) - The mean magnitude of policy loss function.
- `Losses/Value Loss` (PPO; SAC) - The mean loss of the value function update.
- `Losses/Forward Loss` (PPO/SAC+Curiosity) - The mean magnitude of the inverse
- `Losses/Inverse Loss` (PPO/SAC+Curiosity) - The mean magnitude of the forward
- `Losses/Pretraining Loss` (BC) - The mean magnitude of the behavioral cloning
- `Losses/GAIL Loss` (GAIL) - The mean magnitude of the GAIL discriminator loss.
Corresponds to how well the model imitates the demonstration data.
### Self-Play
- `Self-Play/ELO` (Self-Play) -
[ELO](https://en.wikipedia.org/wiki/Elo_rating_system) measures the relative
skill level between two players. In a proper training run, the ELO of the
agent should steadily increase.
`StatsRecorder`:
```csharp
var statsRecorder = Academy.Instance.StatsRecorder;
// Record a custom stat; "MyMetric" is an illustrative key, per StatsRecorder.Add(key, value).
statsRecorder.Add("MyMetric", 1.0f);
```

123
docs/images/learning_environment_basic.png

Width: 848  |  Height: 529  |  Size: 17 KiB

251
docs/images/learning_environment_example.png

Width: 1126  |  Height: 967  |  Size: 46 KiB

167
docs/images/learning_environment_full.png

Width: 1783  |  Height: 1019  |  Size: 69 KiB

216
docs/Training-Configuration-File.md


# Training Configuration File
**Table of Contents**
- [Common Trainer Configurations](#common-trainer-configurations)
- [Trainer-specific Configurations](#trainer-specific-configurations)
- [PPO-specific Configurations](#ppo-specific-configurations)
- [SAC-specific Configurations](#sac-specific-configurations)
- [Reward Signals](#reward-signals)
- [Extrinsic Rewards](#extrinsic-rewards)
- [Curiosity Intrinsic Reward](#curiosity-intrinsic-reward)
- [GAIL Intrinsic Reward](#gail-intrinsic-reward)
- [SAC-specific Reward Signal](#sac-specific-reward-signal)
- [Behavioral Cloning](#behavioral-cloning)
- [Memory-enhanced Agents using Recurrent Neural Networks](#memory-enhanced-agents-using-recurrent-neural-networks)
- [Self-Play](#self-play)
- [Note on Reward Signals](#note-on-reward-signals)
- [Note on Swap Steps](#note-on-swap-steps)
## Common Trainer Configurations
One of the first decisions you need to make regarding your training run is which
trainer to use: PPO or SAC. There are some training configurations that are
common to both trainers (which we review now) and others that depend on the
choice of the trainer (which we review in subsequent sections).
| **Setting** | **Description** |
| :----------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `trainer` | The type of training to perform: `ppo` or `sac` |
| `init_path` | Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `summary_freq` | Number of experiences that need to be collected before generating and displaying training statistics. This determines the granularity of the graphs in Tensorboard. |
| `batch_size` | Number of experiences in each iteration of gradient descent. **This should always be a fraction of the `buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value should be smaller (in order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
| `buffer_size` | Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be a multiple of `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. In SAC, this is the max size of the experience buffer, and should be on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `hidden_units` | Number of units in the hidden layers of the neural network. Correspond to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger. <br><br> Typical range: `32` - `512` |
| `learning_rate` | Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |
| `learning_rate_schedule` | Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` (default) decays the learning_rate linearly, reaching 0 at max_steps, while `constant` keeps the learning rate constant for the entire training run. |
| `max_steps` | Total number of experience points that must be collected from the simulation before ending the training process. <br><br>Typical range: `5e5` - `1e7` |
| `normalize` | Whether normalization is applied to the vector observation inputs. This normalization is based on the running average and variance of the vector observation. Normalization can be helpful in cases with complex continuous control problems, but may be harmful with simpler discrete control problems. |
| `num_layers` | The number of hidden layers in the neural network. Corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems. <br><br> Typical range: `1` - `3` |
| `time_horizon` | How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `vis_encoder_type` | Encoder type for encoding visual observations. <br><br> `simple` (default) uses a simple encoder which consists of two convolutional layers, `nature_cnn` uses the CNN implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers, and `resnet` uses the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. |
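For reference, here is a minimal sketch of how these common settings might be laid out for a single behavior in the trainer configuration file; the behavior name `MyBehavior` and all values are illustrative placeholders, not recommendations:
```yaml
MyBehavior:
    trainer: ppo
    summary_freq: 10000
    batch_size: 1024
    buffer_size: 10240
    hidden_units: 128
    learning_rate: 3.0e-4
    learning_rate_schedule: linear
    max_steps: 5.0e5
    normalize: false
    num_layers: 2
    time_horizon: 64
```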
## Trainer-specific Configurations
Depending on your choice of a trainer, there are additional trainer-specific
configurations. We present them below in two separate tables, but keep in mind
that you only need to include the configurations for the trainer selected (i.e.
the `trainer` setting above).
### PPO-specific Configurations
| **Setting** | **Description** |
| :---------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `beta` | Strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase beta. If entropy drops too slowly, decrease `beta`. <br><br>Typical range: `1e-4` - `1e-2` |
| `epsilon` | Influences how rapidly the policy can evolve during training. Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process. <br><br>Typical range: `0.1` - `0.3` |
| `lambd` | Regularization parameter (lambda) used when calculating the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process. <br><br>Typical range: `0.9` - `0.95` |
| `num_epoch` | Number of passes to make through the experience buffer when performing gradient descent optimization. The larger the `batch_size`, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning. <br><br>Typical range: `3` - `10` |
| `threaded` | (Optional, default = `true`) By default, PPO model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a 10-20% training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. |
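As a sketch, the PPO-specific settings sit alongside the common settings for the same behavior; the values below are placeholders chosen from within the typical ranges above:
```yaml
MyBehavior:
    trainer: ppo
    # ...common settings from the previous section...
    beta: 5.0e-3
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
    threaded: true
```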
### SAC-specific Configurations
| **Setting** | **Description** |
| :------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `buffer_init_steps` | Number of experiences to collect into the buffer before updating the policy model. As the untrained policy is fairly random, pre-filling the buffer with random actions is useful for exploration. Typically, at least several episodes of experiences should be pre-filled. <br><br>Typical range: `1000` - `10000` |
| `init_entcoef` | How much the agent should explore in the beginning of training. Corresponds to the initial entropy coefficient set at the beginning of training. In SAC, the agent is incentivized to make its actions entropic to facilitate better exploration. The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy coefficient is [automatically adjusted](https://arxiv.org/abs/1812.05905) to a preset target entropy, so the `init_entcoef` only corresponds to the starting value of the entropy bonus. Increase init_entcoef to explore more in the beginning, decrease to converge to a solution faster. <br><br>Typical range: (Continuous): `0.5` - `1.0`; (Discrete): `0.05` - `0.5` |
| `save_replay_buffer` | (Optional, default = `false`) Whether to save and load the experience replay buffer as well as the model when quitting and re-starting training. This may help resumes go more smoothly, as the experiences collected won't be wiped. Note that replay buffers can be very large, and will take up a considerable amount of disk space. For that reason, we disable this feature by default. |
| `tau` | How aggressively to update the target network used for bootstrapping value estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC model update. In SAC, there are two neural networks: the target and the policy. The target network is used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed while the policy is being updated. This target is then slowly updated according to tau. Typically, this value should be left at 0.005. For simple problems, increasing tau to 0.01 might reduce the time it takes to learn, at the cost of stability. <br><br>Typical range: `0.005` - `0.01` |
| `steps_per_update` | Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow. <br><br>Typical range: `1` - `20` |
| `train_interval` | Number of steps taken between each agent training event. Typically, we can train after every step, but if your environment's steps are very small and very frequent, there may not be any new interesting information between steps, and `train_interval` can be increased. <br><br>Typical range: `1` - `5` |
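Similarly, a sketch of the SAC-specific settings for a behavior, with placeholder values taken from the typical ranges above:
```yaml
MyBehavior:
    trainer: sac
    # ...common settings from the previous section...
    buffer_init_steps: 1000
    init_entcoef: 0.5
    save_replay_buffer: false
    tau: 0.005
    steps_per_update: 10
    train_interval: 1
```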
## Reward Signals
The `reward_signals` section enables the specification of settings for both
extrinsic (i.e. environment-based) and intrinsic reward signals (e.g. curiosity
and GAIL). Each reward signal should define at least two parameters, `strength`
and `gamma`, in addition to any class-specific hyperparameters. Note that to
remove a reward signal, you should delete its entry entirely from
`reward_signals`. At least one reward signal should be left defined at all
times. Provide the following configurations to design the reward signal for your
training run.
### Extrinsic Rewards
Enable these settings to ensure that your training run incorporates your
environment-based reward signal:
| **Setting** | **Description** |
| :--------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `extrinsic > strength` | Factor by which to multiply the reward given by the environment. Typical ranges will vary depending on the reward signal. <br><br>Typical range: `1.00` |
| `extrinsic > gamma` | Discount factor for future rewards coming from the environment. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller. Must be strictly smaller than 1. <br><br>Typical range: `0.8` - `0.995` |
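For example, a minimal `reward_signals` section that enables only the extrinsic signal might look like the following sketch (values are illustrative):
```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
```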
### Curiosity Intrinsic Reward
To enable curiosity, provide these settings:
| **Setting** | **Description** |
| :-------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `curiosity > strength` | Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.1` |
| `curiosity > gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
| `curiosity > encoding_size` | (Optional, default = `64`) Size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
| `curiosity > learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |
### GAIL Intrinsic Reward
To enable GAIL (assuming you have recorded demonstrations), provide these
settings:
| **Setting** | **Description** |
| :--------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `gail > strength` | Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br>Typical range: `0.01` - `1.0` |
| `gail > gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.9` |
| `gail > demo_path` | The path to your .demo file or directory of .demo files. |
| `gail > encoding_size` | (Optional, default = `64`) Size of the hidden layer used by the discriminator. This value should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br>Typical range: `64` - `256` |
| `gail > learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |
| `gail > use_actions` | (Optional, default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
| `gail > use_vail` | (Optional, default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |
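As a sketch, curiosity and GAIL entries can be added alongside the extrinsic entry; the values are illustrative and the demonstration path is a placeholder you would replace with your own `.demo` file or directory:
```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
    gail:
        strength: 0.01
        gamma: 0.99
        demo_path: <path_to_your_demo_file>
```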
### SAC-specific Reward Signal
All of the reward signals configurations described above apply to both PPO and
SAC. There is one configuration for reward signals that only applies to SAC.
| **Setting** | **Description** |
| :------------------------------------------ | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `reward_signals > reward_signal_num_update` | (Optional, default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_num_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_num_update` is set to `steps_per_update`. |
## Behavioral Cloning
To enable Behavioral Cloning as a pre-training option (assuming you have
recorded demonstrations), provide the following configurations under the
`behavioral_cloning` section:
| **Setting** | **Description** |
| :------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `demo_path` | The path to your .demo file or directory of .demo files. |
| `strength` | Learning rate of the imitation relative to the learning rate of PPO, and roughly corresponds to how strongly we allow BC to influence the policy. <br><br>Typical range: `0.1` - `0.5` |
| `steps` | During BC, it is often desirable to stop using demonstrations after the agent has "seen" rewards, and allow it to optimize past the available demonstrations and/or generalize outside of the provided demonstrations. steps corresponds to the training steps over which BC is active. The learning rate of BC will anneal over the steps. Set the steps to 0 for constant imitation over the entire training run. |
| `batch_size` | Number of demonstration experiences used for one iteration of a gradient descent update. If not specified, it will default to the `batch_size`. <br><br>Typical range: (Continuous): `512` - `5120`; (Discrete): `32` - `512` |
| `num_epoch` | Number of passes through the experience buffer during gradient descent. If not specified, it will default to the number of epochs set for PPO. <br><br>Typical range: `3` - `10` |
| `samples_per_update` | (Optional, default = `0`) Maximum number of samples to use during each imitation update. You may want to lower this if your demonstration dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 to train over all of the demonstrations at each update step. <br><br>Typical range: `buffer_size` |
| `init_path` | Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
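For instance, a sketch of a `behavioral_cloning` section with illustrative values and a placeholder demonstration path:
```yaml
behavioral_cloning:
    demo_path: <path_to_your_demo_file>
    strength: 0.5
    steps: 150000
```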
## Memory-enhanced Agents using Recurrent Neural Networks
You can enable your agents to use memory, by setting `use_recurrent` to `true`
and setting `memory_size` and `sequence_length`:
| **Setting** | **Description** |
| :---------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `use_recurrent` | Whether to enable this option or not. |
| `memory_size` | Size of the memory an agent must keep. In order to use a LSTM, training requires a sequence of experiences instead of single experiences. Corresponds to the size of the array of floating point numbers used to store the hidden state of the recurrent neural network of the policy. This value must be a multiple of 2, and should scale with the amount of information you expect the agent will need to remember in order to successfully complete the task. <br><br>Typical range: `32` - `256` |
| `sequence_length` | Defines how long the sequences of experiences must be while training. Note that if this number is too small, the agent will not be able to remember things over longer periods of time. If this number is too large, the neural network will take longer to train. <br><br>Typical range: `4` - `128` |
A few considerations when deciding to use memory:
- LSTM does not work well with continuous vector action space. Please use
discrete vector action space for better results.
- Since the memories must be sent back and forth between Python and Unity, using
too large `memory_size` will slow down training.
- Adding a recurrent layer increases the complexity of the neural network; it is
  recommended to decrease `num_layers` when using recurrent layers.
- It is required that `memory_size` be divisible by 4.
## Self-Play
Training with self-play adds additional confounding factors to the usual issues
faced by reinforcement learning. In general, the tradeoff is between the skill
level and generality of the final policy and the stability of learning. Training
against a set of slowly or unchanging adversaries with low diversity results in
a more stable learning process than training against a set of quickly changing
adversaries with high diversity. With this context, this guide discusses the
exposed self-play hyperparameters and intuitions for tuning them.
If your environment contains multiple agents that are divided into teams, you
can leverage our self-play training option by providing these configurations for
each Behavior:
| **Setting** | **Description** |
| :-------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `save_steps` | Number of _trainer steps_ between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. <br><br>A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. <br><br> Typical range: `10000` - `100000` |
| `team_change` | Number of _trainer steps_ between switching the learning team. This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents per team switch. <br><br>A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents, the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies and so the agent may fail against the next batch of opponents. <br><br> The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we recommend setting this value as a function of the `save_steps` parameter discussed previously. <br><br> Typical range: 4x-10x where x=`save_steps` |
| `swap_steps` | Number of _ghost steps_ (not trainer steps) between swapping the opponents policy with a different snapshot. A 'ghost step' refers to a step taken by an agent _that is following a fixed policy and not learning_. The reason for this distinction is that in asymmetric games, we may have teams with an unequal number of agents e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` agents during `team-change` total steps is: `(num_agents / num_opponent_agents) * (team_change / x)` <br><br> Typical range: `10000` - `100000` |
| `play_against_latest_model_ratio` | Probability an agent will play against the latest opponent policy. With probability 1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its opponent from a past iteration. <br><br> A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curriculum](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations, which may lead to a stronger final policy. <br><br> Typical range: `0.0` - `1.0` |
| `window` | Size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. Like in the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. <br><br> Typical range: `5` - `30` |
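As an illustration, a `self_play` section using these settings might look like the following sketch (all values are placeholders, not recommendations):
```yaml
self_play:
    save_steps: 50000
    team_change: 200000
    swap_steps: 50000
    play_against_latest_model_ratio: 0.5
    window: 10
```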
### Note on Reward Signals
We make the assumption that the final reward in a trajectory corresponds to the
outcome of an episode. A final reward of +1 indicates winning, -1 indicates
losing and 0 indicates a draw. The ELO calculation (discussed below) depends on
this final reward being either +1, 0, -1.
The reward signal should still be used as described in the documentation for the
other trainers. However, we encourage users to be a bit more conservative when
shaping reward functions due to the instability and non-stationarity of learning
in adversarial games. Specifically, we encourage users to begin with the
simplest possible reward function (+1 winning, -1 losing) and to allow for more
iterations of training to compensate for the sparsity of reward.
### Note on Swap Steps
As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times
during `team_change=200000` steps, the `swap_steps` for the team of one agent
is:

`swap_steps = (1 / 2) * (200000 / 4) = 25000`

The `swap_steps` for the team of two agents is:

`swap_steps = (2 / 1) * (200000 / 4) = 100000`

Note that with equal team sizes, the first term is equal to 1 and `swap_steps`
can be calculated by just dividing the total steps by the desired number of
swaps.
A larger value of `swap_steps` means that an agent will play against the same
fixed opponent for a larger number of training iterations. This results in a
more stable training scenario, but leaves the agent open to the risk of
overfitting its behavior to this particular opponent. Thus, when a new
opponent is swapped in, the agent may lose more often than expected.

48
docs/Feature-Memory.md


# Memory-enhanced agents using Recurrent Neural Networks
## What are memories used for?
Have you ever entered a room to get something and immediately forgot what you
were looking for? Don't let that happen to your agents.
It is now possible to give memories to your agents. When training, the agents
will be able to store a vector of floats to be used next time they need to make
a decision.
![Inspector](images/ml-agents-LSTM.png)
Deciding what the agents should remember in order to solve a task is not easy to
do by hand, but our training algorithms can learn to keep track of what is
important to remember with
[LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).
## How to use
When configuring the trainer parameters in the `config/trainer_config.yaml`
file, add the following parameters to the Behavior you want to use.
```yaml
use_recurrent: true
sequence_length: 64
memory_size: 256
```
* `use_recurrent` is a flag that notifies the trainer that you want to use a
Recurrent Neural Network.
* `sequence_length` defines how long the sequences of experiences must be while
training. In order to use a LSTM, training requires a sequence of experiences
instead of single experiences.
* `memory_size` corresponds to the size of the memory the agent must keep. Note
that if this number is too small, the agent will not be able to remember a lot
of things. If this number is too large, the neural network will take longer to
train.
## Limitations
* LSTM does not work well with continuous vector action space. Please use
discrete vector action space for better results.
* Since the memories must be sent back and forth between Python and Unity, using
too large `memory_size` will slow down training.
* Adding a recurrent layer increases the complexity of the neural network; it is
  recommended to decrease `num_layers` when using recurrent layers.
* It is required that `memory_size` be divisible by 4.

50
docs/Feature-Monitor.md


# Using the Monitor
![Monitor](images/monitor.png)
The monitor allows visualizing information related to the agents or training
process within a Unity scene.
You can track many different things both related and unrelated to the agents
themselves. By default, the Monitor is only active in the *inference* phase, so
not during training. To change this behavior, you can activate or deactivate it
by calling `SetActive(boolean)`. For example, to also show the monitor during
training, you can call it in the `Awake()` method of your `MonoBehaviour`:
```csharp
using Unity.MLAgents;
public class MyBehaviour : MonoBehaviour {
public void Awake()
{
Monitor.SetActive(true);
}
}
```
To add values to monitor, call the `Log` function anywhere in your code:
```csharp
Monitor.Log(key, value, target)
```
* `key` is the name of the information you want to display.
* `value` is the information you want to display. *`value`* can have different
types:
* `string` - The Monitor will display the string next to the key. It can be
useful for displaying error messages.
* `float` - The Monitor will display a slider. Note that the values must be
between -1 and 1. If the value is positive, the slider will be green, if the
value is negative, the slider will be red.
* `float[]` - The Monitor Log call can take an additional argument called
`displayType` that can be either `INDEPENDENT` (default) or `PROPORTION`:
* `INDEPENDENT` is used to display multiple independent floats as a
histogram. The histogram will be a sequence of vertical sliders.
* `PROPORTION` is used to see the proportions between numbers. For each
float in the array, a rectangle whose width is that float divided by the sum of
all values will be shown. It is best for visualizing values that sum to 1.
* `target` is the transform to which you want to attach information. If the
transform is `null` the information will be attached to the global monitor.
* **NB:** When adding a target transform that is not the global monitor, make
sure you have your main camera object tagged as `MainCamera` via the
inspector. This is needed to properly display the text onto the screen.

25
docs/Training-Using-Concurrent-Unity-Instances.md


# Training Using Concurrent Unity Instances
As part of release v0.8, we enabled developers to run concurrent, parallel instances of the Unity executable during training. For certain scenarios, this should speed up the training.
## How to Run Concurrent Unity Instances During Training
Please refer to the general instructions on [Training ML-Agents](Training-ML-Agents.md). In order to run concurrent Unity instances during training, set the number of environment instances using the command line option `--num-envs=<n>` when you invoke `mlagents-learn`. Optionally, you can also set the `--base-port`, which is the starting port used for the concurrent Unity instances.
## Considerations
### Buffer Size
If you are having trouble getting an agent to train, even with multiple concurrent Unity instances, you could increase `buffer_size` in the `config/trainer_config.yaml` file. A common practice is to multiply `buffer_size` by `num-envs`.
### Resource Constraints
Invoking concurrent Unity instances is constrained by the resources on the machine. Please use discretion when setting `--num-envs=<n>`.
### Using num-runs and num-envs
If you set `--num-runs=<n>` greater than 1 and are also invoking concurrent Unity instances using `--num-envs=<n>`, then the number of concurrent Unity instances is equal to `num-runs` times `num-envs`.
### Result Variation Using Concurrent Unity Instances
If you keep all the hyperparameters the same, but change `--num-envs=<n>`, the results and model would likely change.

104
docs/Training-Imitation-Learning.md


# Training with Imitation Learning
It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.
Consider our
[running example](ML-Agents-Overview.md#running-example-training-npc-behaviors)
of training a medic NPC. Instead of indirectly training a medic with the help
of a reward function, we can give the medic real world examples of observations
from the game and actions from a game controller to guide the medic's behavior.
Imitation Learning uses pairs of observations and actions from
a demonstration to learn a policy.
Imitation learning can also be used to help reinforcement learning. Especially in
environments with sparse (i.e., infrequent or rare) rewards, the agent may never see
the reward and thus not learn from it. Curiosity (which is available in the toolkit)
helps the agent explore, but in some cases
it is easier to show the agent how to achieve the reward. In these cases,
imitation learning combined with reinforcement learning can dramatically
reduce the time the agent takes to solve the environment.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
using 6 episodes of demonstrations can reduce the number of training steps required by more than a factor of 4.
See Behavioral Cloning + GAIL + Curiosity + RL below.
<p align="center">
<img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="700" border="0" />
</p>
The ML-Agents Toolkit provides two features that enable your agent to learn from demonstrations.
In most scenarios, you can combine these two features.
* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to
reward your Agent for behaving similarly to a set of demonstrations. To use GAIL, you can add the
[GAIL reward signal](Reward-Signals.md#gail-reward-signal). GAIL can be
used with or without environment rewards, and works well when there are a limited
number of demonstrations.
* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions
shown in a set of demonstrations.
The BC feature can be enabled on the [PPO](Training-PPO.md#optional-behavioral-cloning-using-demonstrations)
or [SAC](Training-SAC.md#optional-behavioral-cloning-using-demonstrations) trainer. As BC cannot generalize
past the examples shown in the demonstrations, BC tends to work best when there exist demonstrations
for nearly all of the states that the agent can experience, or in conjunction with GAIL and/or an extrinsic reward.
### What to Use
If you want to help your agents learn (especially with environments that have sparse rewards)
using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning
at low strengths in addition to having an extrinsic reward.
An example of this is provided for the Pyramids example environment under
`PyramidsLearning` in `config/gail_config.yaml`.
If you want to train purely from demonstrations, GAIL and BC _without_ an
extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler
example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.
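As a rough sketch (not the exact contents of `config/gail_config.yaml`), combining a weak GAIL signal and Behavioral Cloning with an extrinsic reward for a behavior might look like this; the demonstration paths are placeholders:
```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    gail:
        strength: 0.01
        gamma: 0.99
        demo_path: <path_to_your_demo_file>
behavioral_cloning:
    demo_path: <path_to_your_demo_file>
    strength: 0.5
    steps: 150000
```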
## Recording Demonstrations
Demonstrations of agent behavior can be recorded from the Unity Editor,
and saved as assets. These demonstrations contain information on the
observations, actions, and rewards for a given agent during the recording session.
They can be managed in the Editor, as well as used for training with BC and GAIL.
In order to record demonstrations from an agent, add the `Demonstration Recorder`
component to a GameObject in the scene which contains an `Agent` component.
Once added, it is possible to name the demonstration that will be recorded
from the agent.
<p align="center">
<img src="images/demo_component.png"
alt="Demonstration Recorder"
width="375" border="10" />
</p>
When `Record` is checked, a demonstration will be created whenever the scene
is played from the Editor. Depending on the complexity of the task, anywhere
from a few minutes to a few hours of demonstration data may be necessary to
be useful for imitation learning. When you have recorded enough data, end
the Editor play session. A `.demo` file will be created in the
`Assets/Demonstrations` folder (by default). This file contains the demonstrations.
Clicking on the file will provide metadata about the demonstration in the
inspector.
<p align="center">
<img src="images/demo_inspector.png"
alt="Demonstration Inspector"
width="375" border="10" />
</p>
You can then specify the path to this file as the `demo_path` in your `trainer_config.yaml` file
when using BC or GAIL. For instance, for BC:
```yaml
behavioral_cloning:
    demo_path: <path_to_your_demo_file>
    ...
```
And for GAIL:
```yaml
reward_signals:
    gail:
        demo_path: <path_to_your_demo_file>
        ...
```

205
docs/Reward-Signals.md


# Reward Signals
In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
external of the learning algorithm.
Rewards, however, can be defined outside of the environment as well, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can
be a mix of extrinsic and intrinsic reward signals.
ML-Agents allows reward signals to be defined in a modular way, and we provide three reward
signals that can be mixed and matched to help shape your agent's behavior. The `extrinsic` Reward
Signal represents the rewards defined in your environment, and is enabled by default.
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse.
## Enabling Reward Signals
Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml` and `config/gail_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the behavior name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward and a GAIL reward signal, you would define your `reward_signals` as follows:
```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
    gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```
Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.
## Reward Signal Types
As part of the toolkit, we provide three reward signal types as part of hyperparameters - Extrinsic, Curiosity, and GAIL.
### Extrinsic Reward Signal
The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.
#### Strength
`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.
Typical Range: `1.0`
#### Gamma
`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.
Typical Range: `0.8` - `0.995`
### Curiosity Reward Signal
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
next encoded observation.
The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.
For more information, see
* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/
#### Strength
In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise it should not be too large to overwhelm the extrinsic reward signal.
Typical Range: `0.001` - `0.1`
#### Gamma
`gamma` corresponds to the discount factor for future rewards.
Typical Range: `0.8` - `0.995`
#### (Optional) Encoding Size
`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but also not too small to prevent it from learning to differentiate between
expected and actual observations.
Default Value: `64`
Typical Range: `64` - `256`
#### (Optional) Learning Rate
`learning_rate` is the learning rate used to update the intrinsic curiosity module.
This should typically be decreased if training is unstable, and the curiosity loss is unstable.
Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
### GAIL Reward Signal
GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an
imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs
(Generative Adversarial Networks). In this framework, a second neural network, the
discriminator, is taught to distinguish whether an observation/action is from a demonstration or
produced by the agent. This discriminator can then examine a new observation/action and provide it a
reward based on how close it believes this new observation/action is to the provided demonstrations.
At each training step, the agent tries to learn how to maximize this reward. Then, the
discriminator is trained to better distinguish between demonstrations and agent state/actions.
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
This approach learns a _policy_ that produces states and actions similar to the demonstrations,
requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely
from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide
the learning process.
Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.
#### Strength
`strength` is the factor by which to multiply the raw reward. Note that when using GAIL
with an Extrinsic Signal, this value should be set lower if your demonstrations are
suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic
rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases.
Typical Range: `0.01` - `1.0`
#### Gamma
`gamma` corresponds to the discount factor for future rewards.
Typical Range: `0.8` - `0.9`
#### Demo Path
`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the [imitation learning guide](Training-Imitation-Learning.md).
#### (Optional) Encoding Size
`encoding_size` corresponds to the size of the hidden layer used by the discriminator.
This value should be small enough to encourage the discriminator to compress the original
observation, but also not too small to prevent it from learning to differentiate between
demonstrated and actual behavior. Dramatically increasing this size will also negatively affect
training times.
Default Value: `64`
Typical Range: `64` - `256`
#### (Optional) Learning Rate
`learning_rate` is the learning rate used to update the discriminator.
This should typically be decreased if training is unstable, and the GAIL loss is unstable.
Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
#### (Optional) Use Actions
`use_actions` determines whether the discriminator should discriminate based on both
observations and actions, or just observations. Set to `True` if you want the agent to
mimic the actions from the demonstrations, and `False` if you'd rather have the agent
visit the same states as in the demonstrations but with possibly different actions.
Setting to `False` is more likely to be stable, especially with imperfect demonstrations,
but may learn slower.
Default Value: `false`
#### (Optional) Variational Discriminator Bottleneck
`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the
GAIL discriminator. This forces the discriminator to learn a more general representation
and reduces its tendency to be "too good" at discriminating, making learning more stable.
However, it does increase training time. Enable this if you notice your imitation learning is
unstable, or unable to learn the task at hand.
Default Value: `false`

159
docs/Training-Self-Play.md


# Training with Self-Play
ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with
[Self-Play](https://openai.com/blog/competitive-self-play/).
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
this is not the case. An example of an asymmetric game is our Strikers Vs Goalie example environment. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not
necessarily ideal.
With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent
(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared
to competing against the current, best opponent in every episode, which is constantly changing (because it's learning).
Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critic (SAC)](Training-SAC.md).
However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing.
This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend that users use PPO. For further reading on
this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf).
For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md).
For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md).
Self-play is triggered by including the self-play hyperparameter hierarchy in the trainer configuration file. A detailed description of the self-play hyperparameters is contained below. Furthermore, to distinguish opposing agents, set the team ID to different integer values in the behavior parameters script on the agent prefab.
![Team ID](images/team_id.png)
***Team ID must be 0 or an integer greater than 0.***
In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their
Behavior Parameters Script. In asymmetric games, they should have a different Behavior Name in their Behavior Parameters script.
Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration
for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy!
For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis, Soccer, and
Strikers Vs Goalie environments.
Tennis and Soccer provide examples of symmetric games and Strikers Vs Goalie provides an example of an asymmetric game.
## Best Practices Training with Self-Play
Training with self-play adds additional confounding factors to the usual
issues faced by reinforcement learning. In general, the tradeoff is between
the skill level and generality of the final policy and the stability of learning.
Training against a set of slowly changing or unchanging adversaries with low diversity
results in a more stable learning process than training against a set of quickly
changing adversaries with high diversity. With this context, this guide discusses
the exposed self-play hyperparameters and intuitions for tuning them.
## Hyperparameters
### Reward Signals
We make the assumption that the final reward in a trajectory corresponds to the outcome of an episode.
A final reward greater than 0 indicates winning, less than 0 indicates losing and 0 indicates a draw.
The final reward determines the result of an episode (win, loss, or draw) in the ELO calculation.
The reward signal should still be used as described in the documentation for the other trainers and [reward signals.](Reward-Signals.md) However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward.
In problems that are too challenging to be solved by sparse rewards, it may be necessary to provide intermediate rewards to encourage useful instrumental behaviors.
For example, it may be difficult for a soccer agent to learn that kicking a ball into the net receives a reward because this sequence has a low probability
of occurring randomly. However, it will have a higher probability of occurring if the agent learns generally that kicking the ball has utility. So, we may be able
to speed up training by giving the agent intermediate reward for kicking the ball. However, we must be careful that the agent doesn't learn to undermine
its original objective of scoring goals e.g. if it scores a goal, the episode ends and it can no longer receive reward for kicking the ball. The behavior
that receives the most reward may be to keep the ball out of the net and to kick it indefinitely! To address this, we suggest
using a curriculum that allows the agents to learn the necessary intermediate behavior (i.e. colliding with a ball) and then
decays this reward signal to allow training on just the rewards of winning and losing. Please see our documentation on
how to use curriculum learning [here](./Training-Curriculum-Learning.md) and our SoccerTwos example environment.
### Save Steps
The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent.
Recommended Range : 10000-100000
### Team Change
The `team_change` parameter corresponds to the number of *trainer steps* between switching the learning team.
This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team
becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar
performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents
per team switch.
A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents
the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies
and so the agent may fail against the next batch of opponents.
The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we
recommend setting this value as a function of the `save_steps` parameter discussed previously.
Recommended Range : 4x-10x where x=`save_steps`
### Swap Steps
The `swap_steps` parameter corresponds to the number of *ghost steps* (not trainer steps) between swapping the opponent's policy with a different snapshot.
A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games,
we may have teams with an unequal number of agents e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if
a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents`
agents during `team_change` total steps is:
```
swap_steps = (num_agents / num_opponent_agents) * (team_change / x)
```
As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times during `team_change=200000` steps,
the `swap_steps` for the team of one agent is:
```
swap_steps = (1 / 2) * (200000 / 4) = 25000
```
The `swap_steps` for the team of two agents is:
```
swap_steps = (2 / 1) * (200000 / 4) = 100000
```
Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be calculated by just dividing the total steps by the desired number of swaps.
A larger value of `swap_steps` means that an agent will play against the same fixed opponent for more training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting its behavior to this particular opponent. Thus, when a new opponent is swapped in, the agent may lose more often than expected.
Recommended Range : 10000-100000
### Play against latest model ratio
The `play_against_latest_model_ratio` parameter corresponds to the probability
an agent will play against the latest opponent policy. With probability
1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its
opponent from a past iteration.
A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curriculum](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations which may lead to a stronger final policy.
Range : 0.0 - 1.0
### Window
The `window` parameter corresponds to the size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded.
A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. As with the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training.
Recommended Range : 5 - 30
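Putting the above hyperparameters together, a `self_play` block in the trainer configuration might look like the sketch below. The values are illustrative only, and the surrounding trainer options (e.g. `trainer: ppo`) are assumptions based on the rest of these docs.
```yaml
SoccerTwos:
    trainer: ppo
    # ... other PPO hyperparameters ...
    self_play:
        save_steps: 50000
        team_change: 200000
        swap_steps: 50000
        play_against_latest_model_ratio: 0.5
        window: 10
```
In an asymmetric game such as Strikers Vs Goalie, each Behavior Name gets its own entry with its own `swap_steps` value computed from the formula above (e.g. `25000` for the one-agent team and `100000` for the two-agent team in the 2v1 example).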
## Training Statistics
To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started.md#observing-training-progress).
### ELO
In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively.
We provide an implementation of the ELO rating system, a method for calculating the relative skill level between two players from a given population in a zero-sum game. For more information on ELO, please see [the ELO wiki](https://en.wikipedia.org/wiki/Elo_rating_system).
In a proper training run, the ELO of the agent should steadily increase. The absolute value of the ELO is less important than the change in ELO over training iterations.
Note, this implementation will support any number of teams but ELO is only applicable to games with two teams. It is ongoing work to implement
a reliable metric for measuring progress in scenarios with three or more teams. These scenarios can still train, though as of now, reward and qualitative observations
are the only metrics by which we can judge performance.

171
docs/Training-Environment-Parameter-Randomization.md


# Training With Environment Parameter Randomization
One of the challenges of training and testing agents on the same
environment is that the agents tend to overfit. The result is that the
agents are unable to generalize to any tweaks or variations in the environment.
This is analogous to a model being trained and tested on an identical dataset
in supervised learning. This becomes problematic in cases where environments
are instantiated with varying objects or properties.
To make agents more robust and better able to generalize to changes in the environment, the agent
can be trained over multiple variations of a given environment. We refer to this approach as **Environment Parameter Randomization**. For those familiar with Reinforcement Learning research, this approach is based on the concept of Domain Randomization (you can read more about it [here](https://arxiv.org/abs/1703.06907)). By using parameter randomization
during training, the agent can be better suited to adapt (with higher performance)
to future unseen variations of the environment.
_Example of variations of the 3D Ball environment._
Ball scale of 0.5 | Ball scale of 4
:-------------------------:|:-------------------------:
![](images/3dball_small.png) | ![](images/3dball_big.png)
To enable variations in the environments, we implemented `Environment Parameters`.
`Environment Parameters` are values in the `FloatPropertiesChannel` that can be read when setting
up the environment. We
also included different sampling methods and the ability to create new kinds of
sampling methods for each `Environment Parameter`. In the 3D ball environment example displayed
in the figure above, the environment parameters are `gravity`, `ball_mass` and `ball_scale`.
## How to Enable Environment Parameter Randomization
We first need to provide a way to modify the environment by supplying a set of `Environment Parameters`
and vary them over time, either deterministically or randomly.
This is done by assigning each `Environment Parameter` a `sampler-type` (such as a uniform sampler),
which determines how to sample an `Environment Parameter`. If a `sampler-type` isn't provided for an
`Environment Parameter`, the parameter maintains its default value throughout the
training procedure, remaining unchanged. The samplers for all the `Environment Parameters`
are handled by a **Sampler Manager**, which also handles the generation of new
values for the environment parameters when needed.
To setup the Sampler Manager, we create a YAML file that specifies how we wish to
generate new samples for each `Environment Parameter`. In this file, we specify the samplers and the
`resampling-interval` (the number of simulation steps after which environment parameters are
resampled). Below is an example of a sampler file for the 3D ball environment.
```yaml
resampling-interval: 5000
mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10
gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]
scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```
Below is an explanation of the fields in the above example.
* `resampling-interval` - Specifies the number of steps for the agent to
  train under a particular environment configuration before resetting the
  environment with a new sample of `Environment Parameters`.
* `Environment Parameter` - Name of the `Environment Parameter` like `mass`, `gravity` and `scale`. This should match the name
  specified in the `FloatPropertiesChannel` of the environment being trained. If a parameter specified in the file doesn't exist in the
  environment, then this parameter will be ignored. Within each `Environment Parameter`:
  * `sampler-type` - Specify the sampler type to use for the `Environment Parameter`.
    This is a string that should exist in the `Sampler Factory` (explained
    below).
  * `sampler-type-sub-arguments` - Specify the sub-arguments depending on the `sampler-type`.
    In the example above, this would correspond to the `intervals`
    under the `sampler-type` `"multirange_uniform"` for the `Environment Parameter` called `gravity`.
    The key name should match the name of the corresponding argument in the sampler definition.
    (See below)
The Sampler Manager allocates a sampler type for each `Environment Parameter` by using the *Sampler Factory*,
which maintains a dictionary mapping of string keys to sampler objects. The available sampler types
to be used for each `Environment Parameter` are listed in the Sampler Factory.
### Included Sampler Types
Below is a list of the `sampler-type` values included as part of the toolkit.
* `uniform` - Uniform sampler
  * Uniformly samples a single float value between defined endpoints.
    The sub-arguments for this sampler specify the interval
    endpoints. The sampling is done in the range of
    [`min_value`, `max_value`).
  * **sub-arguments** - `min_value`, `max_value`
* `gaussian` - Gaussian sampler
  * Samples a single float value from the distribution characterized by
    the mean and standard deviation. The sub-arguments specify the
    gaussian distribution to use.
  * **sub-arguments** - `mean`, `st_dev`
* `multirange_uniform` - Multirange uniform sampler
  * Uniformly samples a single float value between the specified intervals.
    Samples by first performing a weighted pick of an interval from the list
    of intervals (weighted based on interval width) and then sampling uniformly
    from the selected interval (half-closed interval, same as the uniform
    sampler). This sampler can take an arbitrary number of intervals in a
    list in the following format:
    [[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...]
  * **sub-arguments** - `intervals`
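As an illustration, the `mass` parameter from the example configuration above could instead be drawn from a Gaussian distribution using the `gaussian` sampler (a sketch; the values are arbitrary):
```yaml
mass:
    sampler-type: "gaussian"
    mean: 4.5
    st_dev: 1.5
```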
The implementation of the samplers can be found at `ml-agents-envs/mlagents_envs/sampler_class.py`.
### Defining a New Sampler Type
If you want to define your own sampler type, you must first inherit the *Sampler*
base class (included in the `sampler_class` file) and preserve the interface.
Once the custom sampler class is defined, it must be registered in the Sampler Factory.
This can be done by calling the *register_sampler* method of the SamplerFactory. The command
is as follows:
`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)`
Once the Sampler Factory reflects the new registration, the new sampler type can be used to sample any
`Environment Parameter`. For example, let's say a new sampler type was implemented as below and we register
the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory.
```python
import numpy as np
from mlagents_envs.sampler_class import Sampler

class CustomSampler(Sampler):
    def __init__(self, argA, argB, argC):
        self.possible_vals = [argA, argB, argC]

    def sample_all(self):
        return np.random.choice(self.possible_vals)
```
Now we need to specify the new sampler type in the sampler YAML file. For example, we use this new
sampler type for the `Environment Parameter` *mass*.
```yaml
mass:
    sampler-type: "custom-sampler"
    argB: 1
    argA: 2
    argC: 3
```
### Training with Environment Parameter Randomization
After the sampler YAML file is defined, we proceed by launching `mlagents-learn` and specify
our configured sampler file with the `--sampler` flag. For example, if we wanted to train the
3D ball agent with parameter randomization using `Environment Parameters` with `config/3dball_randomize.yaml`
sampling setup, we would run
```sh
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_randomize.yaml --run-id=3D-Ball-randomize
```
We can observe progress and metrics via TensorBoard.

350
docs/Training-PPO.md


# Training with Proximal Policy Optimization
ML-Agents provides an implementation of a reinforcement learning algorithm called
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
PPO uses a neural network to approximate the ideal function that maps an agent's
observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).
ML-Agents also provides an implementation of
[Soft Actor-Critic (SAC)](https://bair.berkeley.edu/blog/2018/12/14/sac/). SAC tends
to be more _sample-efficient_, i.e. require fewer environment steps,
than PPO, but may spend more time performing model updates. This can produce a large
speedup on heavy or slow environments. Check out how to train with
SAC [here](Training-SAC.md).
To train an agent, you will need to provide the agent one or more reward signals which
the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md)
for the available reward signals and the corresponding hyperparameters.
See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the
training program, `learn.py`.
If you are using the recurrent neural network (RNN) to utilize memory, see
[Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training
details.
If you are using curriculum training to pace the difficulty of the learning task
presented to an agent, see [Training with Curriculum
Learning](Training-Curriculum-Learning.md).
For information about imitation learning from demonstrations, see
[Training with Imitation Learning](Training-Imitation-Learning.md).
## Best Practices Training with PPO
Successfully training a Reinforcement Learning model often involves tuning the
training hyperparameters. This guide contains some best practices for tuning the
training process when the default parameters don't seem to be giving the level
of performance you would like.
## Hyperparameters
### Reward Signals
In reinforcement learning, the goal is to learn a Policy that maximizes reward.
At a base level, the reward is given by the environment. However, we could imagine
rewarding the agent for various different behaviors. For instance, we could reward
the agent for exploring new states, rather than just when an explicit reward is given.
Furthermore, we could mix reward signals to help the learning process.
Using `reward_signals` allows you to define [reward signals.](Reward-Signals.md)
The ML-Agents Toolkit provides three reward signals by default, the Extrinsic (environment)
reward signal, the Curiosity reward signal, which can be used to encourage exploration in
sparse extrinsic reward environments, and the GAIL reward signal. Please see [Reward Signals](Reward-Signals.md)
for additional details.
### Lambda
`lambd` corresponds to the `lambda` parameter used when calculating the
Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This
can be thought of as how much the agent relies on its current value estimate
when calculating an updated value estimate. Low values correspond to relying
more on the current value estimate (which can be high bias), and high values
correspond to relying more on the actual rewards received in the environment
(which can be high variance). The parameter provides a trade-off between the
two, and the right value can lead to a more stable training process.
Typical Range: `0.9` - `0.95`
### Buffer Size
`buffer_size` corresponds to how many experiences (agent observations, actions
and rewards obtained) should be collected before we do any learning or updating
of the model. **This should be a multiple of `batch_size`**. Typically a larger
`buffer_size` corresponds to more stable training updates.
Typical Range: `2048` - `409600`
### Batch Size
`batch_size` is the number of experiences used for one iteration of a gradient
descent update. **This should always be a fraction of the `buffer_size`**. If
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
Typical Range (Continuous): `512` - `5120`
Typical Range (Discrete): `32` - `512`
### Number of Epochs
`num_epoch` is the number of passes through the experience buffer during
gradient descent. The larger the `batch_size`, the larger it is acceptable to
make this. Decreasing this will ensure more stable updates, at the cost of
slower learning.
Typical Range: `3` - `10`
### Learning Rate
`learning_rate` corresponds to the strength of each gradient descent update
step. This should typically be decreased if training is unstable, and the reward
does not consistently increase.
Typical Range: `1e-5` - `1e-3`
### (Optional) Learning Rate Schedule
`learning_rate_schedule` corresponds to how the learning rate is changed over time.
For PPO, we recommend decaying learning rate until `max_steps` so learning converges
more stably. However, for some cases (e.g. training for an unknown amount of time)
this feature can be disabled.
Options:
* `linear` (default): Decay `learning_rate` linearly, reaching 0 at `max_steps`.
* `constant`: Keep learning rate constant for the entire training run.
Options: `linear`, `constant`
### Time Horizon
`time_horizon` corresponds to how many steps of experience to collect per-agent
before adding it to the experience buffer. When this limit is reached before the
end of an episode, a value estimate is used to predict the overall expected
reward from the agent's current state. As such, this parameter trades off
between a less biased, but higher variance estimate (long time horizon) and a more
biased, but less varied estimate (short time horizon). In cases where there are
frequent rewards within an episode, or episodes are prohibitively large, a
smaller number can be more ideal. This number should be large enough to capture
all the important behavior within a sequence of an agent's actions.
Typical Range: `32` - `2048`
### Max Steps
`max_steps` corresponds to how many steps of the simulation (multiplied by
frame-skip) are run during the training process. This value should be increased
for more complex problems.
Typical Range: `5e5` - `1e7`
### Beta
`beta` corresponds to the strength of the entropy regularization, which makes
the policy "more random." This ensures that agents properly explore the action
space during training. Increasing this will ensure more random actions are
taken. This should be adjusted such that the entropy (measurable from
TensorBoard) slowly decreases alongside increases in reward. If entropy drops
too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.
Typical Range: `1e-4` - `1e-2`
### Epsilon
`epsilon` corresponds to the acceptable threshold of divergence between the old
and new policies during gradient descent updating. Setting this value small will
result in more stable updates, but will also slow the training process.
Typical Range: `0.1` - `0.3`
### Normalize
`normalize` corresponds to whether normalization is applied to the vector
observation inputs. This normalization is based on the running average and
variance of the vector observation. Normalization can be helpful in cases with
complex continuous control problems, but may be harmful with simpler discrete
control problems.
### Number of Layers
`num_layers` corresponds to how many hidden layers are present after the
observation input, or after the CNN encoding of the visual observation. For
simple problems, fewer layers are likely to train faster and more efficiently.
More layers may be necessary for more complex control problems.
Typical range: `1` - `3`
### Hidden Units
`hidden_units` correspond to how many units are in each fully connected layer of
the neural network. For simple problems where the correct action is a
straightforward combination of the observation inputs, this should be small. For
problems where the action is a very complex interaction between the observation
variables, this should be larger.
Typical Range: `32` - `512`
### (Optional) Visual Encoder Type
`vis_encode_type` corresponds to the encoder type for encoding visual observations.
Valid options include:
* `simple` (default): a simple encoder which consists of two convolutional layers
* `nature_cnn`: [CNN implementation proposed by Mnih et al.](https://www.nature.com/articles/nature14236),
consisting of three convolutional layers
* `resnet`: [IMPALA Resnet implementation](https://arxiv.org/abs/1802.01561),
consisting of three stacked layers, each with two residual blocks, making a
much larger network than the other two.
Options: `simple`, `nature_cnn`, `resnet`
## (Optional) Recurrent Neural Network Hyperparameters
The below hyperparameters are only used when `use_recurrent` is set to true.
### Sequence Length
`sequence_length` corresponds to the length of the sequences of experience
passed through the network during training. This should be long enough to
capture whatever information your agent might need to remember over time. For
example, if your agent needs to remember the velocity of objects, then this can
be a small value. If your agent needs to remember a piece of information given
only once at the beginning of an episode, then this should be a larger value.
Typical Range: `4` - `128`
### Memory Size
`memory_size` corresponds to the size of the array of floating point numbers
used to store the hidden state of the recurrent neural network of the policy. This value must
be a multiple of 2, and should scale with the amount of information you expect
the agent will need to remember in order to successfully complete the task.
Typical Range: `32` - `256`
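For reference, a complete PPO entry in `trainer_config.yaml` might look like the sketch below. The behavior name `MyBehavior` is a placeholder, the values are illustrative rather than recommendations, and the exact set of keys may differ between ML-Agents versions; see [Training ML-Agents](Training-ML-Agents.md) for the authoritative format.
```yaml
MyBehavior:
    trainer: ppo
    batch_size: 1024
    buffer_size: 10240
    beta: 5.0e-3
    epsilon: 0.2
    lambd: 0.95
    learning_rate: 3.0e-4
    learning_rate_schedule: linear
    num_epoch: 3
    num_layers: 2
    hidden_units: 128
    time_horizon: 64
    max_steps: 5.0e5
    normalize: false
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
```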
## (Optional) Behavioral Cloning Using Demonstrations
In some cases, you might want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
training operations that mimic a demonstration rather than attempting to maximize reward.
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:
```
behavioral_cloning:
    demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
    strength: 0.5
    steps: 10000
```
Below are the available hyperparameters for BC.
### Strength
`strength` corresponds to the learning rate of the imitation relative to the learning
rate of PPO, and roughly corresponds to how strongly we allow BC
to influence the policy.
Typical Range: `0.1` - `0.5`
### Demo Path
`demo_path` is the path to your `.demo` file or directory of `.demo` files.
See the [imitation learning guide](Training-Imitation-Learning.md) for more on `.demo` files.
### Steps
During BC, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.
### (Optional) Batch Size
`batch_size` is the number of demonstration experiences used for one iteration of a gradient
descent update. If not specified, it will default to the `batch_size` defined for PPO.
Typical Range (Continuous): `512` - `5120`
Typical Range (Discrete): `32` - `512`
### (Optional) Number of Epochs
`num_epoch` is the number of passes through the experience buffer during
gradient descent. If not specified, it will default to the number of epochs set for PPO.
Typical Range: `3` - `10`
### (Optional) Samples Per Update
`samples_per_update` is the maximum number of samples
to use during each imitation update. You may want to lower this if your demonstration
dataset is very large to avoid overfitting the policy on demonstrations. Set to 0
to train over all of the demonstrations at each update step.
Default Value: `0` (all)
Typical Range: Approximately equal to PPO's `buffer_size`
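Putting the optional settings together, a more fully specified `behavioral_cloning` section might look like the sketch below; the values are illustrative only.
```yaml
behavioral_cloning:
    demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
    strength: 0.5
    steps: 10000
    batch_size: 512
    num_epoch: 3
    samples_per_update: 0
```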
### (Optional) Advanced: Initialize Model Path
`init_path` can be specified to initialize your model from a previous run before starting.
Note that the prior run should have used the same trainer configurations as the current run,
and have been saved with the same version of ML-Agents. You should provide the full path
to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`.
This option is provided in case you want to initialize different behaviors from different runs;
in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize
all models from the same run.
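If you do need per-behavior initialization, an `init_path` entry in the trainer configuration might look like the sketch below; the run id and behavior name are hypothetical.
```yaml
MyBehavior:
    trainer: ppo
    init_path: ./models/previous-run-id/MyBehavior
    # ... remaining hyperparameters ...
```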
### (Optional) Advanced: Disable Threading
By default, PPO model updates can happen while the environment is being stepped. This violates the
[on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms)
assumption of PPO slightly in exchange for a 10-20% training speedup. To maintain the
strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`.
Default Value: `true`
## Training Statistics
To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started.md#observing-training-progress).
### Cumulative Reward
The general trend in reward should consistently increase over time. Small ups
and downs are to be expected. Depending on the complexity of the task, a
significant increase in reward may not present itself until millions of steps
into the training process.
### Entropy
This corresponds to how random the decisions are. This should
consistently decrease during training. If it decreases too soon or not at all,
`beta` should be adjusted (when using discrete action space).
### Learning Rate
This will decrease over time on a linear schedule by default, unless `learning_rate_schedule`
is set to `constant`.
### Policy Loss
These values will oscillate during training. Generally they should be less than
1.0.
### Value Estimate
These values should increase as the cumulative reward increases. They correspond
to how much future reward the agent predicts itself receiving at any given
point.
### Value Loss
These values will increase as the reward increases, and then should decrease
once reward becomes stable.

356
docs/Training-SAC.md


# Training with Soft-Actor Critic
In addition to [Proximal Policy Optimization (PPO)](Training-PPO.md), ML-Agents also provides
[Soft Actor-Critic](http://bair.berkeley.edu/blog/2018/12/14/sac/) to perform
reinforcement learning.
In contrast with PPO, SAC is _off-policy_, which means it can learn from experiences collected
at any time during the past. As experiences are collected, they are placed in an
experience replay buffer and randomly drawn during training. This makes SAC
significantly more sample-efficient, often requiring 5-10 times fewer samples to learn
the same task as PPO. However, SAC tends to require more model updates. SAC is a
good choice for heavier or slower environments (about 0.1 seconds per step or more).
SAC is also a "maximum entropy" algorithm, and enables exploration in an intrinsic way.
Read more about maximum entropy RL [here](https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/).
To train an agent, you will need to provide the agent one or more reward signals which
the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md)
for the available reward signals and the corresponding hyperparameters.
## Best Practices when training with SAC
Successfully training a reinforcement learning model often involves tuning
hyperparameters. This guide contains some best practices for training
when the default parameters don't seem to be giving the level of performance
you would like.
## Hyperparameters
### Reward Signals
In reinforcement learning, the goal is to learn a Policy that maximizes reward.
In the most basic case, the reward is given by the environment. However, we could imagine
rewarding the agent for various different behaviors. For instance, we could reward
the agent for exploring new states, rather than only when an explicit reward is given.
Furthermore, we could mix reward signals to help the learning process.
`reward_signals` provides a section to define [reward signals.](Reward-Signals.md)
ML-Agents provides two reward signals by default, the Extrinsic (environment) reward, and the
Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
environments.
#### Steps Per Update for Reward Signal (Optional)
`reward_signal_steps_per_update` for the reward signals corresponds to the number of steps per mini batch sampled
and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated.
However, to imitate the training procedure in certain imitation learning papers (e.g.
[Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)),
we may want to update the reward signal (GAIL) M times for every update of the policy.
We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update`
under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to
`steps_per_update`.
Typical Range: `steps_per_update`
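As an illustration of the relationship described above, the sketch below updates the policy once every 10 agent steps (N = 10) and the GAIL discriminator twice per policy update (M = 2), so `reward_signal_steps_per_update` is set to N / M = 5. The nesting follows the description above, and the other keys are illustrative only.
```yaml
MyBehavior:
    trainer: sac
    steps_per_update: 10
    reward_signals:
        reward_signal_steps_per_update: 5
        extrinsic:
            strength: 1.0
            gamma: 0.99
        gail:
            strength: 0.01
            gamma: 0.99
            demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```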
### Buffer Size
`buffer_size` corresponds to the maximum number of experiences (agent observations, actions
and rewards obtained) that can be stored in the experience replay buffer. This value should be
large, on the order of thousands of times longer than your episodes, so that SAC
can learn from old as well as new experiences. It should also be much larger than
`batch_size`.
Typical Range: `50000` - `1000000`
### Buffer Init Steps
`buffer_init_steps` is the number of experiences to prefill the buffer with before attempting training.
As the untrained policy is fairly random, prefilling the buffer with random actions is
useful for exploration. Typically, at least several episodes of experiences should be
prefilled.
Typical Range: `1000` - `10000`
### Batch Size
`batch_size` is the number of experiences used for one iteration of a gradient
descent update. If
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
Typical Range (Continuous): `128` - `1024`
Typical Range (Discrete): `32` - `512`
### Initial Entropy Coefficient
`init_entcoef` refers to the initial entropy coefficient set at the beginning of training. In
SAC, the agent is incentivized to make its actions entropic to facilitate better exploration.
The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy
coefficient is [automatically adjusted](https://arxiv.org/abs/1812.05905) to a preset target
entropy, so the `init_entcoef` only corresponds to the starting value of the entropy bonus.
Increase `init_entcoef` to explore more in the beginning, decrease to converge to a solution faster.
Typical Range (Continuous): `0.5` - `1.0`
Typical Range (Discrete): `0.05` - `0.5`
### Train Interval
`train_interval` is the number of steps taken between each agent training event. Typically,
we can train after every step, but if your environment's steps are very small and very frequent,
there may not be any new interesting information between steps, and `train_interval` can be increased.
Typical Range: `1` - `5`
### Steps Per Update
`steps_per_update` corresponds to the average ratio of agent steps (actions) taken to updates made of the agent's
policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after
exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps.
Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will
improve sample efficiency (reduce the number of steps required to train)
but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example
environments) `steps_per_update` equal to the number of agents in the scene is a good balance.
For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed.
We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will
usually result in a slowdown unless the environment is very slow.
Typical Range: `1` - `20`
### Tau
`tau` corresponds to the magnitude of the target Q update during the SAC model update.
In SAC, there are two neural networks: the target and the policy. The target network is
used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed
while the policy is being updated. This target is then slowly updated according to `tau`.
Typically, this value should be left at `0.005`. For simple problems, increasing
`tau` to `0.01` might reduce the time it takes to learn, at the cost of stability.
Typical Range: `0.005` - `0.01`
### Learning Rate
`learning_rate` corresponds to the strength of each gradient descent update
step. This should typically be decreased if training is unstable, and the reward
does not consistently increase.
Typical Range: `1e-5` - `1e-3`
### (Optional) Learning Rate Schedule
`learning_rate_schedule` corresponds to how the learning rate is changed over time.
For SAC, we recommend holding learning rate constant so that the agent can continue to
learn until its Q function converges naturally.
Options:
* `linear`: Decay `learning_rate` linearly, reaching 0 at `max_steps`.
* `constant` (default): Keep learning rate constant for the entire training run.
Options: `linear`, `constant`
### Time Horizon
`time_horizon` corresponds to how many steps of experience to collect per-agent
before adding it to the experience buffer. This parameter is a lot less critical
to SAC than PPO, and can typically be set to approximately your episode length.
Typical Range: `32` - `2048`
### Max Steps
`max_steps` corresponds to how many steps of the simulation (multiplied by
frame-skip) are run during the training process. This value should be increased
for more complex problems.
Typical Range: `5e5` - `1e7`
### Normalize
`normalize` corresponds to whether normalization is applied to the vector
observation inputs. This normalization is based on the running average and
variance of the vector observation. Normalization can be helpful in cases with
complex continuous control problems, but may be harmful with simpler discrete
control problems.
### Number of Layers
`num_layers` corresponds to how many hidden layers are present after the
observation input, or after the CNN encoding of the visual observation. For
simple problems, fewer layers are likely to train faster and more efficiently.
More layers may be necessary for more complex control problems.
Typical range: `1` - `3`
### Hidden Units
`hidden_units` correspond to how many units are in each fully connected layer of
the neural network. For simple problems where the correct action is a
straightforward combination of the observation inputs, this should be small. For
problems where the action is a very complex interaction between the observation
variables, this should be larger.
Typical Range: `32` - `512`
### (Optional) Visual Encoder Type
`vis_encode_type` corresponds to the encoder type for encoding visual observations.
Valid options include:
* `simple` (default): a simple encoder which consists of two convolutional layers
* `nature_cnn`: [CNN implementation proposed by Mnih et al.](https://www.nature.com/articles/nature14236),
consisting of three convolutional layers
* `resnet`: [IMPALA Resnet implementation](https://arxiv.org/abs/1802.01561),
consisting of three stacked layers, each with two residual blocks, making a
much larger network than the other two.
Options: `simple`, `nature_cnn`, `resnet`
## (Optional) Recurrent Neural Network Hyperparameters
The below hyperparameters are only used when `use_recurrent` is set to true.
### Sequence Length
`sequence_length` corresponds to the length of the sequences of experience
passed through the network during training. This should be long enough to
capture whatever information your agent might need to remember over time. For
example, if your agent needs to remember the velocity of objects, then this can
be a small value. If your agent needs to remember a piece of information given
only once at the beginning of an episode, then this should be a larger value.
Typical Range: `4` - `128`
### Memory Size
`memory_size` corresponds to the size of the array of floating point numbers
used to store the hidden state of the recurrent neural network in the policy.
This value must be a multiple of 2, and should scale with the amount of information you expect
the agent will need to remember in order to successfully complete the task.
Typical Range: `32` - `256`
### (Optional) Save Replay Buffer
`save_replay_buffer` enables you to save and load the experience replay buffer as well as
the model when quitting and re-starting training. This may help resumes go more smoothly,
as the experiences collected won't be wiped. Note that replay buffers can be very large, and
will take up a considerable amount of disk space. For that reason, we disable this feature by
default.
Default: `False`
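For reference, a complete SAC entry in the trainer configuration might look like the sketch below. The behavior name is a placeholder, the values are illustrative rather than recommendations, and the exact set of keys may differ between ML-Agents versions; see [Training ML-Agents](Training-ML-Agents.md) for the authoritative format.
```yaml
MyBehavior:
    trainer: sac
    batch_size: 128
    buffer_size: 50000
    buffer_init_steps: 1000
    init_entcoef: 1.0
    train_interval: 1
    steps_per_update: 1
    tau: 0.005
    learning_rate: 3.0e-4
    learning_rate_schedule: constant
    num_layers: 2
    hidden_units: 128
    time_horizon: 64
    max_steps: 5.0e5
    normalize: false
    use_recurrent: false
    vis_encode_type: simple
    save_replay_buffer: false
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
```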
## (Optional) Behavioral Cloning Using Demonstrations
In some cases, you might want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
training operations that mimic a demonstration rather than attempting to maximize reward.
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:
```
behavioral_cloning:
    demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
    strength: 0.5
    steps: 10000
```
Below are the available hyperparameters for BC.
### Strength
`strength` corresponds to the learning rate of the imitation relative to the learning
rate of SAC, and roughly corresponds to how strongly we allow BC
to influence the policy.
Typical Range: `0.1` - `0.5`
### Demo Path
`demo_path` is the path to your `.demo` file or directory of `.demo` files.
See the [imitation learning guide](Training-Imitation-Learning.md) for more on `.demo` files.
### Steps
During BC, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.
### (Optional) Batch Size
`batch_size` is the number of demonstration experiences used for one iteration of a gradient
descent update. If not specified, it will default to the `batch_size` defined for SAC.
Typical Range (Continuous): `512` - `5120`
Typical Range (Discrete): `32` - `512`
### (Optional) Advanced: Initialize Model Path
`init_path` can be specified to initialize your model from a previous run before starting.
Note that the prior run should have used the same trainer configurations as the current run,
and have been saved with the same version of ML-Agents. You should provide the full path
to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`.
This option is provided in case you want to initialize different behaviors from different runs;
in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize
all models from the same run.
## Training Statistics
To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started.md#observing-training-progress).
### Cumulative Reward
The general trend in reward should consistently increase over time. Small ups
and downs are to be expected. Depending on the complexity of the task, a
significant increase in reward may not present itself until millions of steps
into the training process.
### Entropy Coefficient
SAC is a "maximum entropy" reinforcement learning algorithm, and agents trained using
SAC are incentivized to behave randomly while also solving the problem. The entropy
coefficient balances the incentive to behave randomly vs. maximizing the reward.
This value is adjusted automatically so that the agent retains some amount of randomness during
training. It should steadily decrease in the beginning of training, and reach some small
value where it will level off. If it decreases too soon or takes too
long to decrease, `init_entcoef` should be adjusted.
### Entropy
This corresponds to how random the decisions are. This should
initially increase during training, reach a peak, and should decline along
with the Entropy Coefficient. This is because in the beginning, the agent is
incentivized to be more random for exploration due to a high entropy coefficient.
If it decreases too soon or takes too long to decrease, `init_entcoef` should be adjusted.
### Learning Rate
This will stay a constant value by default, unless `learning_rate_schedule`
is set to `linear`.
### Policy Loss
These values may increase as the agent explores, but should decrease long-term
as the agent learns how to solve the task.
### Value Estimate
These values should increase as the cumulative reward increases. They correspond
to how much future reward the agent predicts itself receiving at any given
point. They may also increase at the beginning as the agent is rewarded for
being random (see: Entropy and Entropy Coefficient), but should decline as
Entropy Coefficient decreases.
### Value Loss
These values will increase as the reward increases, and then should decrease
once reward becomes stable.

111
docs/Training-Curriculum-Learning.md


# Training with Curriculum Learning
Curriculum learning is a feature of ML-Agents which allows for the properties of environments to be changed during the training process to aid in learning.
## An Instructional Example
*[**Note**: The example provided below is for instructional purposes, and was based on an early version of the [Wall Jump example environment](Learning-Environment-Examples.md).
As such, it is not possible to directly replicate the results here using that environment.]*
Imagine a task in which an agent needs to scale a wall to arrive at a goal. The
starting point when training an agent to accomplish this task will be a random
policy. That starting policy will have the agent running in circles, and will
likely never, or only very rarely, scale the wall properly to achieve the reward.
If we start with a simpler task, such as moving toward an unobstructed goal,
then the agent can easily learn to accomplish the task. From there, we can
slowly add to the difficulty of the task by increasing the size of the wall
until the agent can complete the initially near-impossible task of scaling the
wall.
![Wall](images/curriculum.png)
_Demonstration of a hypothetical curriculum training scenario in which a progressively taller
wall obstructs the path to the goal._
## How-To
Each group of Agents under the same `Behavior Name` in an environment can have
a corresponding curriculum. These curricula are held in what we call a "metacurriculum".
A metacurriculum allows different groups of Agents to follow different curricula within
the same environment.
### Specifying Curricula
In order to define the curricula, the first step is to decide which parameters of
the environment will vary. In the case of the Wall Jump environment,
the height of the wall is what varies. We define this as an `Environment Parameter`
that can be accessed in `Academy.Instance.EnvironmentParameters`, and by doing
so it becomes adjustable via the Python API.
Rather than adjusting it by hand, we will create a YAML file which
describes the structure of the curricula. Within it, we can specify at which
points in the training process our wall height will change, based either on the
percentage of training steps that have taken place, or on the average reward
the agent has received in the recent past. Below is an example config for the
curricula for the Wall Jump environment.
```yaml
BigWallJump:
    measure: progress
    thresholds: [0.1, 0.3, 0.5]
    min_lesson_length: 100
    signal_smoothing: true
    parameters:
        big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
        big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
    measure: progress
    thresholds: [0.1, 0.3, 0.5]
    min_lesson_length: 100
    signal_smoothing: true
    parameters:
        small_wall_height: [1.5, 2.0, 2.5, 4.0]
```
At the top level of the config is the behavior name. Note that this must be the
same as the Behavior Name in the [Agent's Behavior Parameters](Learning-Environment-Design-Agents.md#agent-properties).
The curriculum for each
behavior has the following parameters:
* `measure` - What to measure learning progress, and advancement in lessons, by.
  * `reward` - Uses a measure of received reward.
  * `progress` - Uses the ratio of steps/max_steps.
* `thresholds` (float array) - Points in value of `measure` where the lesson should
  be increased.
* `min_lesson_length` (int) - The minimum number of episodes that should be
  completed before the lesson can change. If `measure` is set to `reward`, the
  average cumulative reward of the last `min_lesson_length` episodes will be
  used to determine if the lesson should change. Must be nonnegative.
  __Important__: the average reward that is compared to the thresholds is
  different than the mean reward that is logged to the console. For example,
  if `min_lesson_length` is `100`, the lesson will increment after the average
  cumulative reward of the last `100` episodes exceeds the current threshold.
  The mean reward logged to the console is dictated by the `summary_freq`
  parameter in the
  [trainer configuration file](Training-ML-Agents.md#training-config-file).
* `signal_smoothing` (true/false) - Whether to weight the current progress
  measure by previous values.
  * If `true`, weighting will be 0.75 (new), 0.25 (old).
* `parameters` (dictionary of key:string, value:float array) - Corresponds to
  the Environment Parameters to control. The length of each array should be one
  greater than the number of thresholds.
Once our curriculum is defined, we have to use the environment parameters we defined
and modify the environment from the Agent's `OnEpisodeBegin()` function. See
[WallJumpAgent.cs](../Project/Assets/ML-Agents/Examples/WallJump/Scripts/WallJumpAgent.cs)
for an example.
### Training with a Curriculum
Once we have specified our metacurriculum and curricula, we can launch
`mlagents-learn` using the `--curriculum` flag to point to the config file
for our curricula and PPO will train using Curriculum Learning. For example,
to train agents in the Wall Jump environment with curriculum learning, you can run:
```sh
mlagents-learn config/trainer_config.yaml --curriculum=config/curricula/wall_jump.yaml --run-id=wall-jump-curriculum
```
You can then keep track of the current lesson and progress via TensorBoard.
__Note__: If you are resuming a training session that uses curriculum, please pass the number of the last-reached lesson using the `--lesson` flag when running `mlagents-learn`.