Release mm GitHub docs (#3864)
* Improvements to the Key Components section of ML-Agents Overview
  - Moved some documentation from Learning-Environment-Design.
  - Added the trainers vs. LL-API separation.
  - Made a note about gym-unity.
  - Some updates to the Agent/Behavior sections.
  - Updated diagrams to reflect new side channels. Made Behavior type a consistent color.
* Reorganizing the overview file and creating new (empty) sections
  This change defines the new structure for the overview doc. Subsequent commits will fill in the sections and rewrite existing sections.
* Reorganizing the main Training ML-Agents page
  Re-organizes into feature-specific sections that somewhat mirror the previous commit of reorganizing the overview doc. Subsequent commits will populate these empty sections.
* Adding Deep RL
  - Update ML-Agents-Overview with a description of the Deep RL training algorithms.
  - Describe the common and trainer-specific hyperparams in Training-ML-Agents.
  - Removed .../release_1_branch
Committed via GitHub, 5 years ago
Current commit: 0dff739b
30 files changed, with 1798 insertions and 2241 deletions.
- README.md (123 changes)
- com.unity.ml-agents/Runtime/Agent.cs (6 changes)
- com.unity.ml-agents/Runtime/Demonstrations/DemonstrationRecorder.cs (2 changes)
- docs/Getting-Started.md (10 changes)
- docs/Glossary.md (6 changes)
- docs/Learning-Environment-Create-New.md (19 changes)
- docs/Learning-Environment-Design-Agents.md (97 changes)
- docs/Learning-Environment-Design.md (135 changes)
- docs/Learning-Environment-Executable.md (2 changes)
- docs/ML-Agents-Overview.md (693 changes)
- docs/Migrating.md (55 changes)
- docs/Python-API.md (2 changes)
- docs/Readme.md (13 changes)
- docs/Training-ML-Agents.md (473 changes)
- docs/Using-Docker.md (2 changes)
- docs/Using-Tensorboard.md (65 changes)
- docs/images/learning_environment_basic.png (123 changes)
- docs/images/learning_environment_example.png (251 changes)
- docs/images/learning_environment_full.png (167 changes)
- docs/Training-Configuration-File.md (216 changes)
- docs/Feature-Memory.md (48 changes)
- docs/Feature-Monitor.md (50 changes)
- docs/Training-Using-Concurrent-Unity-Instances.md (25 changes)
- docs/Training-Imitation-Learning.md (104 changes)
- docs/Reward-Signals.md (205 changes)
- docs/Training-Self-Play.md (159 changes)
- docs/Training-Environment-Parameter-Randomization.md (171 changes)
- docs/Training-PPO.md (350 changes)
- docs/Training-SAC.md (356 changes)
- docs/Training-Curriculum-Learning.md (111 changes)
# Training Configuration File

**Table of Contents**

- [Common Trainer Configurations](#common-trainer-configurations)
- [Trainer-specific Configurations](#trainer-specific-configurations)
  - [PPO-specific Configurations](#ppo-specific-configurations)
  - [SAC-specific Configurations](#sac-specific-configurations)
- [Reward Signals](#reward-signals)
  - [Extrinsic Rewards](#extrinsic-rewards)
  - [Curiosity Intrinsic Reward](#curiosity-intrinsic-reward)
  - [GAIL Intrinsic Reward](#gail-intrinsic-reward)
  - [SAC-specific Reward Signal](#sac-specific-reward-signal)
- [Behavioral Cloning](#behavioral-cloning)
- [Memory-enhanced Agents using Recurrent Neural Networks](#memory-enhanced-agents-using-recurrent-neural-networks)
- [Self-Play](#self-play)
  - [Note on Reward Signals](#note-on-reward-signals)
  - [Note on Swap Steps](#note-on-swap-steps)

## Common Trainer Configurations

One of the first decisions you need to make regarding your training run is which
trainer to use: PPO or SAC. There are some training configurations that are
common to both trainers (which we review now) and others that depend on the
choice of the trainer (which we review in subsequent sections).

| **Setting** | **Description** |
| :---------- | :-------------- |
| `trainer` | The type of training to perform: `ppo` or `sac`. |
| `init_path` | Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `summary_freq` | Number of experiences that need to be collected before generating and displaying training statistics. This determines the granularity of the graphs in TensorBoard. |
| `batch_size` | Number of experiences in each iteration of gradient descent. **This should always be a fraction of the `buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value should be smaller (in the order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
| `buffer_size` | Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be a multiple of `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. In SAC, this is the max size of the experience buffer, and should be on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `hidden_units` | Number of units in the hidden layers of the neural network. Corresponds to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger. <br><br> Typical range: `32` - `512` |
| `learning_rate` | Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |
| `learning_rate_schedule` | Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until `max_steps` so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` (default) decays the `learning_rate` linearly, reaching 0 at `max_steps`, while `constant` keeps the learning rate constant for the entire training run. |
| `max_steps` | Total number of experience points that must be collected from the simulation before ending the training process. <br><br>Typical range: `5e5` - `1e7` |
| `normalize` | Whether normalization is applied to the vector observation inputs. This normalization is based on the running average and variance of the vector observation. Normalization can be helpful in cases with complex continuous control problems, but may be harmful with simpler discrete control problems. |
| `num_layers` | The number of hidden layers in the neural network. Corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems. <br><br> Typical range: `1` - `3` |
| `time_horizon` | How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and a more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be preferable. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `vis_encoder_type` | Encoder type for encoding visual observations. <br><br> `simple` (default) uses a simple encoder which consists of two convolutional layers, `nature_cnn` uses the CNN implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers, and `resnet` uses the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. |

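As an illustration, here is a minimal sketch of how these common settings might be laid out for a single behavior in a trainer configuration file such as `config/trainer_config.yaml`. The behavior name `MyBehavior` and the specific values are hypothetical placeholders, not tuned recommendations.

```yaml
MyBehavior:            # hypothetical behavior name
  trainer: ppo         # or sac
  max_steps: 5.0e5
  summary_freq: 10000
  batch_size: 1024
  buffer_size: 10240   # a multiple of batch_size
  learning_rate: 3.0e-4
  learning_rate_schedule: linear
  hidden_units: 128
  num_layers: 2
  normalize: false
  time_horizon: 64
  vis_encoder_type: simple
```
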
## Trainer-specific Configurations

Depending on your choice of a trainer, there are additional trainer-specific
configurations. We present them below in two separate tables, but keep in mind
that you only need to include the configurations for the trainer selected (i.e.
the `trainer` setting above).

### PPO-specific Configurations

| **Setting** | **Description** |
| :---------- | :-------------- |
| `beta` | Strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`. <br><br>Typical range: `1e-4` - `1e-2` |
| `epsilon` | Influences how rapidly the policy can evolve during training. Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process. <br><br>Typical range: `0.1` - `0.3` |
| `lambd` | Regularization parameter (lambda) used when calculating the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process. <br><br>Typical range: `0.9` - `0.95` |
| `num_epoch` | Number of passes to make through the experience buffer when performing gradient descent optimization. The larger the `batch_size`, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning. <br><br>Typical range: `3` - `10` |
| `threaded` | (Optional, default = `true`) By default, PPO model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a 10-20% training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. |

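If PPO is selected, these settings sit alongside the common ones above. A sketch with hypothetical, illustrative values:

```yaml
MyBehavior:            # hypothetical behavior name
  trainer: ppo
  # PPO-specific settings
  beta: 5.0e-3
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 3
  threaded: true
```
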
### SAC-specific Configurations

| **Setting** | **Description** |
| :---------- | :-------------- |
| `buffer_init_steps` | Number of experiences to collect into the buffer before updating the policy model. As the untrained policy is fairly random, pre-filling the buffer with random actions is useful for exploration. Typically, at least several episodes of experiences should be pre-filled. <br><br>Typical range: `1000` - `10000` |
| `init_entcoef` | How much the agent should explore in the beginning of training. Corresponds to the initial entropy coefficient set at the beginning of training. In SAC, the agent is incentivized to make its actions entropic to facilitate better exploration. The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy coefficient is [automatically adjusted](https://arxiv.org/abs/1812.05905) to a preset target entropy, so the `init_entcoef` only corresponds to the starting value of the entropy bonus. Increase `init_entcoef` to explore more in the beginning, decrease to converge to a solution faster. <br><br>Typical range: (Continuous): `0.5` - `1.0`; (Discrete): `0.05` - `0.5` |
| `save_replay_buffer` | (Optional, default = `false`) Whether to save and load the experience replay buffer as well as the model when quitting and re-starting training. This may help resumes go more smoothly, as the experiences collected won't be wiped. Note that replay buffers can be very large, and will take up a considerable amount of disk space. For that reason, we disable this feature by default. |
| `tau` | How aggressively to update the target network used for bootstrapping value estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC model update. In SAC, there are two neural networks: the target and the policy. The target network is used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed while the policy is being updated. This target is then slowly updated according to `tau`. Typically, this value should be left at `0.005`. For simple problems, increasing `tau` to `0.01` might reduce the time it takes to learn, at the cost of stability. <br><br>Typical range: `0.005` - `0.01` |
| `steps_per_update` | Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. We can also change `steps_per_update` to a value lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow. <br><br>Typical range: `1` - `20` |
| `train_interval` | Number of steps taken between each agent training event. Typically, we can train after every step, but if your environment's steps are very small and very frequent, there may not be any new interesting information between steps, and `train_interval` can be increased. <br><br>Typical range: `1` - `5` |

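Similarly, a hypothetical SAC configuration might add the following on top of the common settings; the values are placeholders within the typical ranges above:

```yaml
MyBehavior:            # hypothetical behavior name
  trainer: sac
  # SAC-specific settings
  buffer_init_steps: 1000
  init_entcoef: 0.5
  save_replay_buffer: false
  tau: 0.005
  steps_per_update: 1
  train_interval: 1
```
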
## Reward Signals

The `reward_signals` section enables the specification of settings for both
extrinsic (i.e. environment-based) and intrinsic reward signals (e.g. curiosity
and GAIL). Each reward signal should define at least two parameters, `strength`
and `gamma`, in addition to any class-specific hyperparameters. Note that to
remove a reward signal, you should delete its entry entirely from
`reward_signals`. At least one reward signal should be left defined at all
times. Provide the following configurations to design the reward signal for your
training run.

### Extrinsic Rewards

Enable these settings to ensure that your training run incorporates your
environment-based reward signal:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `extrinsic > strength` | Factor by which to multiply the reward given by the environment. Typical ranges will vary depending on the reward signal. <br><br>Typical range: `1.00` |
| `extrinsic > gamma` | Discount factor for future rewards coming from the environment. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller. Must be strictly smaller than 1. <br><br>Typical range: `0.8` - `0.995` |

### Curiosity Intrinsic Reward

To enable curiosity, provide these settings:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `curiosity > strength` | Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.1` |
| `curiosity > gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
| `curiosity > encoding_size` | (Optional, default = `64`) Size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
| `curiosity > learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |

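For instance, pairing the extrinsic signal with a small curiosity reward might look like the following sketch (values are illustrative, not tuned recommendations):

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 64
    learning_rate: 3.0e-4
```
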
### GAIL Intrinsic Reward

To enable GAIL (assuming you have recorded demonstrations), provide these
settings:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `gail > strength` | Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br>Typical range: `0.01` - `1.0` |
| `gail > gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.9` |
| `gail > demo_path` | The path to your .demo file or directory of .demo files. |
| `gail > encoding_size` | (Optional, default = `64`) Size of the hidden layer used by the discriminator. This value should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br>Typical range: `64` - `256` |
| `gail > learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |
| `gail > use_actions` | (Optional, default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
| `gail > use_vail` | (Optional, default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |

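A GAIL entry is added under `reward_signals` in the same way; this is a sketch only, using the same `<path_to_your_demo_file>` placeholder convention as the recording example later on this page:

```yaml
reward_signals:
  gail:
    strength: 0.1
    gamma: 0.99
    encoding_size: 128
    demo_path: <path_to_your_demo_file>
    use_actions: false
    use_vail: false
```
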
### SAC-specific Reward Signal

All of the reward signal configurations described above apply to both PPO and
SAC. There is one configuration for reward signals that only applies to SAC.

| **Setting** | **Description** |
| :---------- | :-------------- |
| `reward_signals > reward_signal_num_update` | (Optional, default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`. |

## Behavioral Cloning

To enable Behavioral Cloning as a pre-training option (assuming you have
recorded demonstrations), provide the following configurations under the
`behavioral_cloning` section:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `demo_path` | The path to your .demo file or directory of .demo files. |
| `strength` | Learning rate of the imitation relative to the learning rate of PPO, and roughly corresponds to how strongly we allow BC to influence the policy. <br><br>Typical range: `0.1` - `0.5` |
| `steps` | During BC, it is often desirable to stop using demonstrations after the agent has "seen" rewards, and allow it to optimize past the available demonstrations and/or generalize outside of the provided demonstrations. `steps` corresponds to the training steps over which BC is active. The learning rate of BC will anneal over the steps. Set the steps to 0 for constant imitation over the entire training run. |
| `batch_size` | Number of demonstration experiences used for one iteration of a gradient descent update. If not specified, it will default to the `batch_size`. <br><br>Typical range: (Continuous): `512` - `5120`; (Discrete): `32` - `512` |
| `num_epoch` | Number of passes through the experience buffer during gradient descent. If not specified, it will default to the number of epochs set for PPO. <br><br>Typical range: `3` - `10` |
| `samples_per_update` | (Optional, default = `0`) Maximum number of samples to use during each imitation update. You may want to lower this if your demonstration dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 to train over all of the demonstrations at each update step. <br><br>Typical range: `buffer_size` |
| `init_path` | Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |

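For reference, a sketch of how these settings might appear, following the `behavioral_cloning` block shown in the demonstration-recording example later on this page; the values are illustrative only:

```yaml
behavioral_cloning:
  demo_path: <path_to_your_demo_file>
  strength: 0.5
  steps: 150000
  batch_size: 512
  num_epoch: 3
  samples_per_update: 0
```
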
## Memory-enhanced Agents using Recurrent Neural Networks

You can enable your agents to use memory by setting `use_recurrent` to `true`
and setting `memory_size` and `sequence_length`:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `use_recurrent` | Whether to enable this option or not. |
| `memory_size` | Size of the memory an agent must keep. In order to use an LSTM, training requires a sequence of experiences instead of single experiences. Corresponds to the size of the array of floating point numbers used to store the hidden state of the recurrent neural network of the policy. This value must be a multiple of 2, and should scale with the amount of information you expect the agent will need to remember in order to successfully complete the task. <br><br>Typical range: `32` - `256` |
| `sequence_length` | Defines how long the sequences of experiences must be while training. Note that if this number is too small, the agent will not be able to remember things over longer periods of time. If this number is too large, the neural network will take longer to train. <br><br>Typical range: `4` - `128` |

A few considerations when deciding to use memory:

- LSTM does not work well with continuous vector action space. Please use
  discrete vector action space for better results.
- Since the memories must be sent back and forth between Python and Unity, using
  too large a `memory_size` will slow down training.
- Adding a recurrent layer increases the complexity of the neural network; it is
  recommended to decrease `num_layers` when using recurrent.
- It is required that `memory_size` be divisible by 4.

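Putting the settings and the considerations above together, a sketch for a single behavior (hypothetical name and illustrative values):

```yaml
MyBehavior:            # hypothetical behavior name
  use_recurrent: true
  sequence_length: 64
  memory_size: 128     # divisible by 4
  num_layers: 1        # fewer layers, per the consideration above
```
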
## Self-Play

Training with self-play adds additional confounding factors to the usual issues
faced by reinforcement learning. In general, the tradeoff is between the skill
level and generality of the final policy and the stability of learning. Training
against a set of slowly changing or unchanging adversaries with low diversity
results in a more stable learning process than training against a set of quickly
changing adversaries with high diversity. With this context, this guide discusses
the exposed self-play hyperparameters and intuitions for tuning them.

If your environment contains multiple agents that are divided into teams, you
can leverage our self-play training option by providing these configurations for
each Behavior:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `save_steps` | Number of _trainer steps_ between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. <br><br>A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. <br><br> Typical range: `10000` - `100000` |
| `team_change` | Number of _trainer steps_ between switching the learning team. This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents per team switch. <br><br>A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies and so the agent may fail against the next batch of opponents. <br><br> The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we recommend setting this value as a function of the `save_steps` parameter discussed previously. <br><br> Typical range: 4x-10x where x=`save_steps` |
| `swap_steps` | Number of _ghost steps_ (not trainer steps) between swapping the opponent's policy with a different snapshot. A 'ghost step' refers to a step taken by an agent _that is following a fixed policy and not learning_. The reason for this distinction is that in asymmetric games, we may have teams with an unequal number of agents e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` agents during `team_change` total steps is: `(num_agents / num_opponent_agents) * (team_change / x)` <br><br> Typical range: `10000` - `100000` |
| `play_against_latest_model_ratio` | Probability an agent will play against the latest opponent policy. With probability 1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its opponent from a past iteration. <br><br> A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curriculum](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations which may lead to a stronger final policy. <br><br> Typical range: `0.0` - `1.0` |
| `window` | Size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. As with the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training. <br><br> Typical range: `5` - `30` |

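A sketch of a `self_play` block for a behavior, with illustrative values chosen from the typical ranges above (not a recommendation for any particular game):

```yaml
MyBehavior:                       # hypothetical behavior name
  self_play:
    save_steps: 50000
    team_change: 200000           # a multiple of save_steps
    swap_steps: 50000
    play_against_latest_model_ratio: 0.5
    window: 10
```
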
### Note on Reward Signals

We make the assumption that the final reward in a trajectory corresponds to the
outcome of an episode. A final reward of +1 indicates winning, -1 indicates
losing and 0 indicates a draw. The ELO calculation (discussed below) depends on
this final reward being either +1, 0, or -1.

The reward signal should still be used as described in the documentation for the
other trainers. However, we encourage users to be a bit more conservative when
shaping reward functions due to the instability and non-stationarity of learning
in adversarial games. Specifically, we encourage users to begin with the
simplest possible reward function (+1 winning, -1 losing) and to allow for more
iterations of training to compensate for the sparsity of reward.

### Note on Swap Steps

As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times during
`team_change=200000` steps, the `swap_steps` for the team of one agent is:

swap_steps = (1 / 2) \* (200000 / 4) = 25000

The `swap_steps` for the team of two agents is:

swap_steps = (2 / 1) \* (200000 / 4) = 100000

Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be
calculated by just dividing the total steps by the desired number of swaps.

A larger value of `swap_steps` means that an agent will play against the same
fixed opponent for a longer number of training iterations. This results in a
more stable training scenario, but leaves the agent open to the risk of
overfitting its behavior to this particular opponent. Thus, when a new
opponent is swapped in, the agent may lose more often than expected.

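The same arithmetic can be recorded directly in the configuration as comments. A sketch for the hypothetical 2v1 setup above (behavior names are placeholders):

```yaml
# Hypothetical 2v1 setup: x = 4 swaps during team_change = 200000 steps
TeamOfOne:                  # hypothetical behavior name
  self_play:
    team_change: 200000
    swap_steps: 25000       # (1 / 2) * (200000 / 4)
TeamOfTwo:                  # hypothetical behavior name
  self_play:
    team_change: 200000
    swap_steps: 100000      # (2 / 1) * (200000 / 4)
```
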
# Memory-enhanced agents using Recurrent Neural Networks

## What are memories used for?

Have you ever entered a room to get something and immediately forgot what you
were looking for? Don't let that happen to your agents.

It is now possible to give memories to your agents. When training, the agents
will be able to store a vector of floats to be used next time they need to make
a decision.

![Inspector](images/ml-agents-LSTM.png)

Deciding what the agents should remember in order to solve a task is not easy to
do by hand, but our training algorithms can learn to keep track of what is
important to remember with
[LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).

## How to use

When configuring the trainer parameters in the `config/trainer_config.yaml`
file, add the following parameters to the Behavior you want to use.

```yaml
use_recurrent: true
sequence_length: 64
memory_size: 256
```

* `use_recurrent` is a flag that notifies the trainer that you want to use a
  Recurrent Neural Network.
* `sequence_length` defines how long the sequences of experiences must be while
  training. In order to use an LSTM, training requires a sequence of experiences
  instead of single experiences.
* `memory_size` corresponds to the size of the memory the agent must keep. Note
  that if this number is too small, the agent will not be able to remember a lot
  of things. If this number is too large, the neural network will take longer to
  train.

## Limitations

* LSTM does not work well with continuous vector action space. Please use
  discrete vector action space for better results.
* Since the memories must be sent back and forth between Python and Unity, using
  too large a `memory_size` will slow down training.
* Adding a recurrent layer increases the complexity of the neural network; it is
  recommended to decrease `num_layers` when using recurrent.
* It is required that `memory_size` be divisible by 4.

# Using the Monitor

![Monitor](images/monitor.png)

The monitor allows visualizing information related to the agents or training
process within a Unity scene.

You can track many different things both related and unrelated to the agents
themselves. By default, the Monitor is only active in the *inference* phase, so
not during training. To change this behavior, you can activate or deactivate it
by calling `SetActive(boolean)`. For example, to also show the monitor during
training, you can call it in the `Awake()` method of your `MonoBehaviour`:

```csharp
using Unity.MLAgents;
using UnityEngine;

public class MyBehaviour : MonoBehaviour {
    public void Awake()
    {
        // Keep the Monitor visible during training as well as inference.
        Monitor.SetActive(true);
    }
}
```

To add values to the monitor, call the `Log` function anywhere in your code:

```csharp
Monitor.Log(key, value, target)
```

* `key` is the name of the information you want to display.
* `value` is the information you want to display. *`value`* can have different
  types:
  * `string` - The Monitor will display the string next to the key. It can be
    useful for displaying error messages.
  * `float` - The Monitor will display a slider. Note that the values must be
    between -1 and 1. If the value is positive, the slider will be green; if the
    value is negative, the slider will be red.
  * `float[]` - The Monitor Log call can take an additional argument called
    `displayType` that can be either `INDEPENDENT` (default) or `PROPORTIONAL`:
    * `INDEPENDENT` is used to display multiple independent floats as a
      histogram. The histogram will be a sequence of vertical sliders.
    * `PROPORTIONAL` is used to see the proportions between numbers. For each
      float in values, a rectangle whose width is that value divided by the sum
      of all values will be shown. It is best for visualizing values that sum to 1.
* `target` is the transform to which you want to attach information. If the
  transform is `null` the information will be attached to the global monitor.
  * **NB:** When adding a target transform that is not the global monitor, make
    sure you have your main camera object tagged as `MainCamera` via the
    inspector. This is needed to properly display the text onto the screen.

# Training Using Concurrent Unity Instances

As part of release v0.8, we enabled developers to run concurrent, parallel instances of the Unity executable during training. For certain scenarios, this should speed up the training.

## How to Run Concurrent Unity Instances During Training

Please refer to the general instructions on [Training ML-Agents](Training-ML-Agents.md). In order to run concurrent Unity instances during training, set the number of environment instances using the command line option `--num-envs=<n>` when you invoke `mlagents-learn`. Optionally, you can also set the `--base-port`, which is the starting port used for the concurrent Unity instances.

## Considerations

### Buffer Size

If you are having trouble getting an agent to train, even with multiple concurrent Unity instances, you could increase `buffer_size` in the `config/trainer_config.yaml` file. A common practice is to multiply `buffer_size` by `num-envs`.

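As a rough sketch of that scaling, assuming a hypothetical base `buffer_size` of 2048 and eight concurrent instances (`--num-envs=8`):

```yaml
MyBehavior:              # hypothetical behavior name
  buffer_size: 16384     # 2048 * 8, scaled with --num-envs=8
```
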
### Resource Constraints

Invoking concurrent Unity instances is constrained by the resources on the machine. Please use discretion when setting `--num-envs=<n>`.

### Using num-runs and num-envs

If you set `--num-runs=<n>` greater than 1 and are also invoking concurrent Unity instances using `--num-envs=<n>`, then the number of concurrent Unity instances is equal to `num-runs` times `num-envs`.

### Result Variation Using Concurrent Unity Instances

If you keep all the hyperparameters the same, but change `--num-envs=<n>`, the results and model would likely change.

|||
# Training with Imitation Learning |
|||
|
|||
It is often more intuitive to simply demonstrate the behavior we want an agent |
|||
to perform, rather than attempting to have it learn via trial-and-error methods. |
|||
Consider our |
|||
[running example](ML-Agents-Overview.md#running-example-training-npc-behaviors) |
|||
of training a medic NPC. Instead of indirectly training a medic with the help |
|||
of a reward function, we can give the medic real world examples of observations |
|||
from the game and actions from a game controller to guide the medic's behavior. |
|||
Imitation Learning uses pairs of observations and actions from |
|||
a demonstration to learn a policy. |
|||
|
|||
Imitation learning can also be used to help reinforcement learning. Especially in |
|||
environments with sparse (i.e., infrequent or rare) rewards, the agent may never see |
|||
the reward and thus not learn from it. Curiosity (which is available in the toolkit) |
|||
helps the agent explore, but in some cases |
|||
it is easier to show the agent how to achieve the reward. In these cases, |
|||
imitation learning combined with reinforcement learning can dramatically |
|||
reduce the time the agent takes to solve the environment. |
|||
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids), |
|||
using 6 episodes of demonstrations can reduce training steps by more than 4 times. |
|||
See Behavioral Cloning + GAIL + Curiosity + RL below. |
|||
|
|||
<p align="center">
  <img src="images/mlagents-ImitationAndRL.png"
       alt="Using Demonstrations with Reinforcement Learning"
       width="700" border="0" />
</p>

The ML-Agents Toolkit provides two features that enable your agent to learn from demonstrations.
In most scenarios, you can combine these two features.

* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to
  reward your Agent for behaving similarly to a set of demonstrations. To use GAIL, you can add the
  [GAIL reward signal](Reward-Signals.md#gail-reward-signal). GAIL can be
  used with or without environment rewards, and works well when there are a limited
  number of demonstrations.
* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions
  shown in a set of demonstrations.
  The BC feature can be enabled on the [PPO](Training-PPO.md#optional-behavioral-cloning-using-demonstrations)
  or [SAC](Training-SAC.md#optional-behavioral-cloning-using-demonstrations) trainer. As BC cannot generalize
  past the examples shown in the demonstrations, BC tends to work best when there exist demonstrations
  for nearly all of the states that the agent can experience, or in conjunction with GAIL and/or an extrinsic reward.

### What to Use

If you want to help your agents learn (especially with environments that have sparse rewards)
using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning
at low strengths in addition to having an extrinsic reward.
An example of this is provided for the Pyramids example environment under
`PyramidsLearning` in `config/gail_config.yaml`.

If you want to train purely from demonstrations, GAIL and BC _without_ an
extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler
example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.

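As a rough sketch of the first approach (GAIL and BC at low strengths alongside the extrinsic reward), the relevant portion of a behavior's configuration might look like this; the demo path is a placeholder and the values are illustrative rather than the tuned ones in `config/gail_config.yaml`:

```yaml
MyBehavior:                  # hypothetical behavior name
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
    gail:
      strength: 0.01         # keep low when demonstrations are suboptimal
      gamma: 0.99
      demo_path: <path_to_your_demo_file>
  behavioral_cloning:
    demo_path: <path_to_your_demo_file>
    strength: 0.5
    steps: 150000
```
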
## Recording Demonstrations

Demonstrations of agent behavior can be recorded from the Unity Editor,
and saved as assets. These demonstrations contain information on the
observations, actions, and rewards for a given agent during the recording session.
They can be managed in the Editor, as well as used for training with BC and GAIL.

In order to record demonstrations from an agent, add the `Demonstration Recorder`
component to a GameObject in the scene which contains an `Agent` component.
Once added, it is possible to name the demonstration that will be recorded
from the agent.

<p align="center">
  <img src="images/demo_component.png"
       alt="Demonstration Recorder"
       width="375" border="10" />
</p>

When `Record` is checked, a demonstration will be created whenever the scene
is played from the Editor. Depending on the complexity of the task, anywhere
from a few minutes to a few hours of demonstration data may be necessary to
be useful for imitation learning. When you have recorded enough data, end
the Editor play session. A `.demo` file will be created in the
`Assets/Demonstrations` folder (by default). This file contains the demonstrations.
Clicking on the file will provide metadata about the demonstration in the
inspector.

<p align="center">
  <img src="images/demo_inspector.png"
       alt="Demonstration Inspector"
       width="375" border="10" />
</p>

You can then specify the path to this file as the `demo_path` in your `trainer_config.yaml` file
when using BC or GAIL. For instance, for BC:

```yaml
behavioral_cloning:
  demo_path: <path_to_your_demo_file>
  ...
```

And for GAIL:

```yaml
reward_signals:
  gail:
    demo_path: <path_to_your_demo_file>
    ...
```

|||
# Reward Signals |
|||
|
|||
In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy) |
|||
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds |
|||
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined |
|||
external of the learning algorithm. |
|||
|
|||
Rewards, however, can be defined outside of the environment as well, to encourage the agent to |
|||
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these |
|||
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can |
|||
be a mix of extrinsic and intrinsic reward signals. |
|||
|
|||
ML-Agents allows reward signals to be defined in a modular way, and we provide three reward |
|||
signals that can the mixed and matched to help shape your agent's behavior. The `extrinsic` Reward |
|||
Signal represents the rewards defined in your environment, and is enabled by default. |
|||
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse. |
|||
|
|||
## Enabling Reward Signals

Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml` and `config/gail_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the behavior name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward and a GAIL reward signal, you would define your `reward_signals` as follows:

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 256
  gail:
    strength: 0.01
    gamma: 0.99
    encoding_size: 128
    demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```

Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.

## Reward Signal Types

The toolkit provides three reward signal types, each configured through hyperparameters: Extrinsic, Curiosity, and GAIL.

### Extrinsic Reward Signal

The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.

#### Strength

`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.

Typical Range: `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

Typical Range: `0.8` - `0.995`

### Curiosity Reward Signal

The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:

* an inverse model, which takes the current and next observation of the agent, encodes them, and
  uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
  next encoded observation.

The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.

For more information, see

* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/

#### Strength |
|||
|
|||
In this case, `strength` corresponds to the magnitude of the curiosity reward generated |
|||
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough |
|||
to not be overwhelmed by extrinsic reward signals in the environment. |
|||
Likewise it should not be too large to overwhelm the extrinsic reward signal. |
|||
|
|||
Typical Range: `0.001` - `0.1` |
|||
|
|||
#### Gamma |
|||
|
|||
`gamma` corresponds to the discount factor for future rewards. |
|||
|
|||
Typical Range: `0.8` - `0.995` |
|||
|
|||
#### (Optional) Encoding Size |
|||
|
|||
`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model. |
|||
This value should be small enough to encourage the ICM to compress the original |
|||
observation, but also not too small to prevent it from learning to differentiate between |
|||
demonstrated and actual behavior. |
|||
|
|||
Default Value: `64` |
|||
|
|||
Typical Range: `64` - `256` |
|||
|
|||
#### (Optional) Learning Rate |
|||
|
|||
`learning_rate` is the learning rate used to update the intrinsic curiosity module. |
|||
This should typically be decreased if training is unstable, and the curiosity loss is unstable. |
|||
|
|||
Default Value: `3e-4` |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
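
Putting the above together, a sketch of enabling curiosity alongside the extrinsic signal might look like the following (all values are illustrative and should be tuned per environment):

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 256
    learning_rate: 3.0e-4
```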
|||
|
|||
### GAIL Reward Signal |
|||
|
|||
GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an |
|||
imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs |
|||
(Generative Adversarial Networks). In this framework, a second neural network, the |
|||
discriminator, is taught to distinguish whether an observation/action is from a demonstration or
produced by the agent. This discriminator can then examine a new observation/action and provide a
reward based on how close it believes this new observation/action is to the provided demonstrations.
|||
|
|||
At each training step, the agent tries to learn how to maximize this reward. Then, the |
|||
discriminator is trained to better distinguish between demonstrations and agent state/actions. |
|||
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
|||
|
|||
This approach learns a _policy_ that produces states and actions similar to the demonstrations, |
|||
requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely |
|||
from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide |
|||
the learning process. |
|||
|
|||
Using GAIL requires recorded demonstrations from your Unity environment. See the |
|||
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations. |
|||
|
|||
#### Strength |
|||
|
|||
`strength` is the factor by which to multiply the raw reward. Note that when using GAIL |
|||
with an Extrinsic Signal, this value should be set lower if your demonstrations are |
|||
suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic |
|||
rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. |
|||
|
|||
Typical Range: `0.01` - `1.0` |
|||
|
|||
#### Gamma |
|||
|
|||
`gamma` corresponds to the discount factor for future rewards. |
|||
|
|||
Typical Range: `0.8` - `0.9` |
|||
|
|||
#### Demo Path |
|||
|
|||
`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the [imitation learning guide](Training-Imitation-Learning.md). |
|||
|
|||
#### (Optional) Encoding Size |
|||
|
|||
`encoding_size` corresponds to the size of the hidden layer used by the discriminator.
This value should be small enough to encourage the discriminator to compress the original
observation, but not so small that it cannot learn to differentiate between demonstrated
and actual behavior. Dramatically increasing this size will also negatively affect
training times.
|||
|
|||
Default Value: `64` |
|||
|
|||
Typical Range: `64` - `256` |
|||
|
|||
#### (Optional) Learning Rate |
|||
|
|||
`learning_rate` is the learning rate used to update the discriminator.
This should typically be decreased if the GAIL loss is unstable during training.
|||
|
|||
Default Value: `3e-4` |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
|||
|
|||
#### (Optional) Use Actions |
|||
|
|||
`use_actions` determines whether the discriminator should discriminate based on both |
|||
observations and actions, or just observations. Set to `True` if you want the agent to |
|||
mimic the actions from the demonstrations, and `False` if you'd rather have the agent |
|||
visit the same states as in the demonstrations but with possibly different actions. |
|||
Setting to `False` is more likely to be stable, especially with imperfect demonstrations, |
|||
but may learn slower. |
|||
|
|||
Default Value: `false` |
|||
|
|||
#### (Optional) Variational Discriminator Bottleneck |
|||
|
|||
`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the |
|||
GAIL discriminator. This forces the discriminator to learn a more general representation |
|||
and reduces its tendency to be "too good" at discriminating, making learning more stable. |
|||
However, it does increase training time. Enable this if you notice your imitation learning is |
|||
unstable, or unable to learn the task at hand. |
|||
|
|||
Default Value: `false` |
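
As a sketch, a GAIL entry combining the settings above might look like the following (the demo path matches the example used in the Behavioral Cloning sections; all values are illustrative):

```yaml
reward_signals:
  gail:
    strength: 0.5
    gamma: 0.99
    demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
    use_actions: false
    use_vail: false
```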
|
|||
# Training with Self-Play |
|||
|
|||
ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with |
|||
[Self-Play](https://openai.com/blog/competitive-self-play/). |
|||
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games |
|||
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and |
|||
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games, |
|||
this is not the case. An example of an asymmetric game is our Strikers Vs Goalie example environment. Agents in these |
|||
types of games do not always have the same observation or action spaces and so sharing policy networks is not |
|||
necessarily ideal. |
|||
|
|||
With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent |
|||
(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared |
|||
to competing against the current, best opponent in every episode, which is constantly changing (because it's learning). |
|||
|
|||
Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critic (SAC)](Training-SAC.md).
|||
However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing. |
|||
This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend that users use PPO. For further reading on |
|||
this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf). |
|||
For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md). |
|||
For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md). |
|||
|
|||
Self-play is triggered by including the self-play hyperparameter hierarchy in the trainer configuration file. Detailed description of the self-play hyperparameters are contained below. Furthermore, to distinguish opposing agents, set the team ID to different integer values in the behavior parameters script on the agent prefab. |
|||
|
|||
![Team ID](images/team_id.png) |
|||
|
|||
***Team ID must be 0 or an integer greater than 0.*** |
|||
|
|||
In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their |
|||
Behavior Parameters Script. In asymmetric games, they should have a different Behavior Name in their Behavior Parameters script. |
|||
Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration |
|||
for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy! |
|||
|
|||
For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis, Soccer, and |
|||
Strikers Vs Goalie environments. |
|||
Tennis and Soccer provide examples of symmetric games and Strikers Vs Goalie provides an example of an asymmetric game. |
|||
|
|||
|
|||
## Best Practices Training with Self-Play |
|||
|
|||
Training with self-play adds additional confounding factors to the usual |
|||
issues faced by reinforcement learning. In general, the tradeoff is between |
|||
the skill level and generality of the final policy and the stability of learning. |
|||
Training against a set of slowly changing or unchanging adversaries with low diversity
results in a more stable learning process than training against a set of quickly
changing adversaries with high diversity. With this context, this guide discusses
|||
the exposed self-play hyperparameters and intuitions for tuning them. |
|||
|
|||
|
|||
## Hyperparameters |
|||
|
|||
### Reward Signals |
|||
|
|||
We make the assumption that the final reward in a trajectory corresponds to the outcome of an episode:
a final reward greater than 0 indicates winning, less than 0 indicates losing, and exactly 0 indicates a draw.
This final reward determines the result of the episode (win, loss, or draw) in the ELO calculation.
|||
|
|||
The reward signal should still be used as described in the documentation for the other trainers and [reward signals.](Reward-Signals.md) However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward. |
|||
|
|||
In problems that are too challenging to be solved by sparse rewards, it may be necessary to provide intermediate rewards to encourage useful instrumental behaviors. |
|||
For example, it may be difficult for a soccer agent to learn that kicking a ball into the net receives a reward because this sequence has a low probability |
|||
of occurring randomly. However, it will have a higher probability of occurring if the agent learns generally that kicking the ball has utility. So, we may be able |
|||
to speed up training by giving the agent intermediate reward for kicking the ball. However, we must be careful that the agent doesn't learn to undermine |
|||
its original objective of scoring goals e.g. if it scores a goal, the episode ends and it can no longer receive reward for kicking the ball. The behavior |
|||
that receives the most reward may be to keep the ball out of the net and to kick it indefinitely! To address this, we suggest |
|||
using a curriculum that allows the agents to learn the necessary intermediate behavior (i.e. colliding with a ball) and then |
|||
decays this reward signal to allow training on just the rewards of winning and losing. Please see our documentation on |
|||
how to use curriculum learning [here](./Training-Curriculum-Learning.md) and our SoccerTwos example environment. |
|||
|
|||
### Save Steps |
|||
|
|||
The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. |
|||
|
|||
A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. |
|||
|
|||
Recommended Range : 10000-100000 |
|||
|
|||
### Team Change |
|||
|
|||
The `team_change` parameter corresponds to the number of *trainer steps* between switching the learning team.
|||
This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team |
|||
becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar |
|||
performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents |
|||
per team switch. |
|||
|
|||
A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents
the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies
and so the agent may fail against the next batch of opponents.
|||
|
|||
The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we
recommend setting this value as a function of the `save_steps` parameter discussed previously.
|||
|
|||
Recommended Range : 4x-10x where x=`save_steps` |
|||
|
|||
|
|||
### Swap Steps |
|||
|
|||
The `swap_steps` parameter corresponds to the number of *ghost steps* (not trainer steps) between swapping the opponents policy with a different snapshot. |
|||
A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games, |
|||
we may have teams with an unequal number of agents e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects |
|||
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number |
|||
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if |
|||
a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` |
|||
agents during `team_change` total steps is:
|||
|
|||
``` |
|||
swap_steps = (num_agents / num_opponent_agents) * (team_change / x) |
|||
``` |
|||
|
|||
As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times during `team_change=200000` steps,
|||
the `swap_steps` for the team of one agent is: |
|||
|
|||
``` |
|||
swap_steps = (1 / 2) * (200000 / 4) = 25000 |
|||
``` |
|||
The `swap_steps` for the team of two agents is: |
|||
``` |
|||
swap_steps = (2 / 1) * (200000 / 4) = 100000 |
|||
``` |
|||
Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be calculated by just dividing the total steps by the desired number of swaps. |
|||
|
|||
A larger value of `swap_steps` means that an agent will play against the same fixed opponent for a longer number of training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting its behavior to this particular opponent. Thus, when a new opponent is swapped in, the agent may lose more often than expected.
|||
|
|||
Recommended Range : 10000-100000 |
|||
|
|||
### Play against latest model ratio |
|||
|
|||
The `play_against_latest_model_ratio` parameter corresponds to the probability |
|||
an agent will play against the latest opponent policy. With probability |
|||
1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its |
|||
opponent from a past iteration. |
|||
|
|||
A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curriculum](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations which may lead to a stronger final policy.
|||
|
|||
Range : 0.0 - 1.0 |
|||
|
|||
### Window |
|||
|
|||
The `window` parameter corresponds to the size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. |
|||
|
|||
A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. Like in the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. |
|||
|
|||
Recommended Range : 5 - 30 |
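
Putting the hyperparameters above together, a sketch of a self-play block inside a behavior's trainer configuration might look like the following (the `self_play` section key and all values are illustrative and should be tuned as discussed above):

```yaml
self_play:
  save_steps: 50000
  team_change: 200000
  swap_steps: 50000
  play_against_latest_model_ratio: 0.5
  window: 10
```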
|||
|
|||
## Training Statistics |
|||
|
|||
To view training statistics, use TensorBoard. For information on launching and |
|||
using TensorBoard, see |
|||
[here](./Getting-Started.md#observing-training-progress). |
|||
|
|||
### ELO |
|||
In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively. |
|||
|
|||
We provide an implementation of the ELO rating system, a method for calculating the relative skill level between two players from a given population in a zero-sum game. For more information on ELO, please see [the ELO wiki](https://en.wikipedia.org/wiki/Elo_rating_system). |
|||
In a proper training run, the ELO of the agent should steadily increase. The absolute value of the ELO is less important than the change in ELO over training iterations. |
|||
|
|||
Note, this implementation will support any number of teams but ELO is only applicable to games with two teams. It is ongoing work to implement
a reliable metric for measuring progress in scenarios with three or more teams. These scenarios can still train, though as of now, reward and qualitative observations
are the only metrics by which we can judge performance.
|
|||
# Training With Environment Parameter Randomization |
|||
|
|||
One of the challenges of training and testing agents on the same |
|||
environment is that the agents tend to overfit. The result is that the |
|||
agents are unable to generalize to any tweaks or variations in the environment. |
|||
This is analogous to a model being trained and tested on an identical dataset |
|||
in supervised learning. This becomes problematic in cases where environments |
|||
are instantiated with varying objects or properties. |
|||
|
|||
To help agents become robust and generalize better to changes in the environment, the agent
|||
can be trained over multiple variations of a given environment. We refer to this approach as **Environment Parameter Randomization**. For those familiar with Reinforcement Learning research, this approach is based on the concept of Domain Randomization (you can read more about it [here](https://arxiv.org/abs/1703.06907)). By using parameter randomization |
|||
during training, the agent can be better suited to adapt (with higher performance) |
|||
to future unseen variations of the environment. |
|||
|
|||
_Example of variations of the 3D Ball environment._ |
|||
|
|||
Ball scale of 0.5 | Ball scale of 4 |
|||
:-------------------------:|:-------------------------: |
|||
![](images/3dball_small.png) | ![](images/3dball_big.png) |
|||
|
|||
|
|||
To enable variations in the environments, we implemented `Environment Parameters`. |
|||
`Environment Parameters` are values in the `FloatPropertiesChannel` that can be read when setting |
|||
up the environment. We |
|||
also included different sampling methods and the ability to create new kinds of |
|||
sampling methods for each `Environment Parameter`. In the 3D ball environment example displayed |
|||
in the figure above, the environment parameters are `gravity`, `ball_mass` and `ball_scale`. |
|||
|
|||
|
|||
## How to Enable Environment Parameter Randomization |
|||
|
|||
We first need to provide a way to modify the environment by supplying a set of `Environment Parameters` |
|||
and vary them over time. This can be done either deterministically or randomly.
|||
|
|||
This is done by assigning each `Environment Parameter` a `sampler-type` (such as a uniform sampler),
which determines how to sample an `Environment Parameter`. If a `sampler-type` isn't provided for an
`Environment Parameter`, the parameter maintains its default value throughout the
training procedure, remaining unchanged. The samplers for all the `Environment Parameters`
|||
are handled by a **Sampler Manager**, which also handles the generation of new |
|||
values for the environment parameters when needed. |
|||
|
|||
To set up the Sampler Manager, we create a YAML file that specifies how we wish to
generate new samples for each `Environment Parameter`. In this file, we specify the samplers and the
|||
`resampling-interval` (the number of simulation steps after which environment parameters are |
|||
resampled). Below is an example of a sampler file for the 3D ball environment. |
|||
|
|||
```yaml |
|||
resampling-interval: 5000 |
|||
|
|||
mass: |
|||
sampler-type: "uniform" |
|||
min_value: 0.5 |
|||
max_value: 10 |
|||
|
|||
gravity: |
|||
sampler-type: "multirange_uniform" |
|||
intervals: [[7, 10], [15, 20]] |
|||
|
|||
scale: |
|||
sampler-type: "uniform" |
|||
min_value: 0.75 |
|||
max_value: 3 |
|||
|
|||
``` |
|||
|
|||
Below is the explanation of the fields in the above example. |
|||
|
|||
* `resampling-interval` - Specifies the number of steps for the agent to |
|||
train under a particular environment configuration before resetting the |
|||
environment with a new sample of `Environment Parameters`. |
|||
|
|||
* `Environment Parameter` - Name of the `Environment Parameter` like `mass`, `gravity` and `scale`. This should match the name
specified in the `FloatPropertiesChannel` of the environment being trained. If a parameter specified in the file doesn't exist in the
environment, then this parameter will be ignored. Within each `Environment Parameter`, the following fields are specified:
|||
|
|||
* `sampler-type` - Specify the sampler type to use for the `Environment Parameter`. |
|||
This is a string that should exist in the `Sampler Factory` (explained |
|||
below). |
|||
|
|||
* `sampler-type-sub-arguments` - Specify the sub-arguments depending on the `sampler-type`. |
|||
In the example above, this would correspond to the `intervals` |
|||
under the `sampler-type` `"multirange_uniform"` for the `Environment Parameter` called `gravity`. |
|||
The key name should match the name of the corresponding argument in the sampler definition. |
|||
(See below) |
|||
|
|||
The Sampler Manager allocates a sampler type for each `Environment Parameter` by using the *Sampler Factory*, |
|||
which maintains a dictionary mapping of string keys to sampler objects. The sampler types
available for each `Environment Parameter` are those registered in the Sampler Factory.
|||
|
|||
### Included Sampler Types |
|||
|
|||
Below is a list of the `sampler-type` options included as part of the toolkit.
|||
|
|||
* `uniform` - Uniform sampler |
|||
* Uniformly samples a single float value between defined endpoints. |
|||
The sub-arguments for this sampler to specify the interval |
|||
endpoints are as below. The sampling is done in the range of |
|||
[`min_value`, `max_value`). |
|||
|
|||
* **sub-arguments** - `min_value`, `max_value` |
|||
|
|||
* `gaussian` - Gaussian sampler |
|||
* Samples a single float value from the distribution characterized by |
|||
the mean and standard deviation. The sub-arguments to specify the |
|||
gaussian distribution to use are as below. |
|||
|
|||
* **sub-arguments** - `mean`, `st_dev` |
|||
|
|||
* `multirange_uniform` - Multirange uniform sampler |
|||
* Uniformly samples a single float value from the specified intervals.
Samples by first performing a weighted pick of an interval from the list
of intervals (weighted based on interval width) and then samples uniformly
|||
from the selected interval (half-closed interval, same as the uniform |
|||
sampler). This sampler can take an arbitrary number of intervals in a |
|||
list in the following format: |
|||
[[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...] |
|||
|
|||
* **sub-arguments** - `intervals` |
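
As an illustration of how the sub-arguments listed above map to YAML keys, a `gaussian` entry for the `gravity` parameter might look like the following sketch (values are illustrative):

```yaml
gravity:
  sampler-type: "gaussian"
  mean: 9.8
  st_dev: 0.5
```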
|||
|
|||
The implementation of the samplers can be found at `ml-agents-envs/mlagents_envs/sampler_class.py`. |
|||
|
|||
### Defining a New Sampler Type |
|||
|
|||
If you want to define your own sampler type, you must first inherit the *Sampler* |
|||
base class (included in the `sampler_class` file) and preserve its interface.
Once the new sampler class is defined, it must be registered in the Sampler Factory.
|||
|
|||
This is done by calling the *register_sampler* method of the SamplerFactory. The call
is as follows:
|||
|
|||
`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)` |
|||
|
|||
Once the Sampler Factory reflects the new registration, the new sampler type can be used to sample any
`Environment Parameter`. For example, let's say a new sampler type was implemented as below and we register
|||
the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory. |
|||
|
|||
```python
import numpy as np

# Sampler base class; as noted above, the built-in samplers live in
# ml-agents-envs/mlagents_envs/sampler_class.py
from mlagents_envs.sampler_class import Sampler


class CustomSampler(Sampler):

    def __init__(self, argA, argB, argC):
        # The set of values this sampler can return.
        self.possible_vals = [argA, argB, argC]

    def sample_all(self):
        # Pick one of the possible values uniformly at random.
        return np.random.choice(self.possible_vals)
```
|||
|
|||
Now we need to specify the new sampler type in the sampler YAML file. For example, we use this new |
|||
sampler type for the `Environment Parameter` *mass*. |
|||
|
|||
```yaml |
|||
mass: |
|||
sampler-type: "custom-sampler" |
|||
argB: 1 |
|||
argA: 2 |
|||
argC: 3 |
|||
``` |
|||
|
|||
### Training with Environment Parameter Randomization |
|||
|
|||
After the sampler YAML file is defined, we proceed by launching `mlagents-learn` and specify |
|||
our configured sampler file with the `--sampler` flag. For example, if we wanted to train the |
|||
3D ball agent with parameter randomization using `Environment Parameters` with `config/3dball_randomize.yaml` |
|||
sampling setup, we would run |
|||
|
|||
```sh |
|||
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_randomize.yaml --run-id=3D-Ball-randomize
|||
``` |
|||
|
|||
We can observe progress and metrics via TensorBoard.
|
|||
# Training with Proximal Policy Optimization |
|||
|
|||
ML-Agents provides an implementation of a reinforcement learning algorithm called |
|||
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/). |
|||
PPO uses a neural network to approximate the ideal function that maps an agent's |
|||
observations to the best action an agent can take in a given state. The |
|||
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate |
|||
Python process (communicating with the running Unity application over a socket). |
|||
|
|||
ML-Agents also provides an implementation of |
|||
[Soft Actor-Critic (SAC)](https://bair.berkeley.edu/blog/2018/12/14/sac/). SAC tends |
|||
to be more _sample-efficient_, i.e. require fewer environment steps, |
|||
than PPO, but may spend more time performing model updates. This can produce a large |
|||
speedup on heavy or slow environments. Check out how to train with |
|||
SAC [here](Training-SAC.md). |
|||
|
|||
To train an agent, you will need to provide the agent one or more reward signals which |
|||
the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md) |
|||
for the available reward signals and the corresponding hyperparameters. |
|||
|
|||
See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the |
|||
training program, `mlagents-learn`.
|||
|
|||
If you are using the recurrent neural network (RNN) to utilize memory, see |
|||
[Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training |
|||
details. |
|||
|
|||
If you are using curriculum training to pace the difficulty of the learning task |
|||
presented to an agent, see [Training with Curriculum |
|||
Learning](Training-Curriculum-Learning.md). |
|||
|
|||
For information about imitation learning from demonstrations, see |
|||
[Training with Imitation Learning](Training-Imitation-Learning.md). |
|||
|
|||
## Best Practices Training with PPO |
|||
|
|||
Successfully training a Reinforcement Learning model often involves tuning the |
|||
training hyperparameters. This guide contains some best practices for tuning the |
|||
training process when the default parameters don't seem to be giving the level |
|||
of performance you would like. |
|||
|
|||
## Hyperparameters |
|||
|
|||
### Reward Signals |
|||
|
|||
In reinforcement learning, the goal is to learn a Policy that maximizes reward. |
|||
At a base level, the reward is given by the environment. However, we could imagine |
|||
rewarding the agent for various different behaviors. For instance, we could reward |
|||
the agent for exploring new states, rather than just when an explicit reward is given. |
|||
Furthermore, we could mix reward signals to help the learning process. |
|||
|
|||
Using `reward_signals` allows you to define [reward signals.](Reward-Signals.md) |
|||
The ML-Agents Toolkit provides three reward signals by default, the Extrinsic (environment) |
|||
reward signal, the Curiosity reward signal, which can be used to encourage exploration in |
|||
sparse extrinsic reward environments, and the GAIL reward signal. Please see [Reward Signals](Reward-Signals.md) |
|||
for additional details. |
|||
|
|||
### Lambda |
|||
|
|||
`lambd` corresponds to the `lambda` parameter used when calculating the |
|||
Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This |
|||
can be thought of as how much the agent relies on its current value estimate |
|||
when calculating an updated value estimate. Low values correspond to relying |
|||
more on the current value estimate (which can be high bias), and high values |
|||
correspond to relying more on the actual rewards received in the environment |
|||
(which can be high variance). The parameter provides a trade-off between the |
|||
two, and the right value can lead to a more stable training process. |
|||
|
|||
Typical Range: `0.9` - `0.95` |
|||
|
|||
### Buffer Size |
|||
|
|||
`buffer_size` corresponds to how many experiences (agent observations, actions |
|||
and rewards obtained) should be collected before we do any learning or updating |
|||
of the model. **This should be a multiple of `batch_size`**. Typically a larger |
|||
`buffer_size` corresponds to more stable training updates. |
|||
|
|||
Typical Range: `2048` - `409600` |
|||
|
|||
### Batch Size |
|||
|
|||
`batch_size` is the number of experiences used for one iteration of a gradient |
|||
descent update. **This should always be a fraction of the `buffer_size`**. If |
|||
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
|||
|
|||
Typical Range (Continuous): `512` - `5120` |
|||
|
|||
Typical Range (Discrete): `32` - `512` |
|||
|
|||
### Number of Epochs |
|||
|
|||
`num_epoch` is the number of passes through the experience buffer during |
|||
gradient descent. The larger the `batch_size`, the larger it is acceptable to |
|||
make this. Decreasing this will ensure more stable updates, at the cost of |
|||
slower learning. |
|||
|
|||
Typical Range: `3` - `10` |
|||
|
|||
### Learning Rate |
|||
|
|||
`learning_rate` corresponds to the strength of each gradient descent update |
|||
step. This should typically be decreased if training is unstable, and the reward |
|||
does not consistently increase. |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
|||
|
|||
### (Optional) Learning Rate Schedule |
|||
|
|||
`learning_rate_schedule` corresponds to how the learning rate is changed over time. |
|||
For PPO, we recommend decaying learning rate until `max_steps` so learning converges |
|||
more stably. However, for some cases (e.g. training for an unknown amount of time) |
|||
this feature can be disabled. |
|||
|
|||
Options: |
|||
* `linear` (default): Decay `learning_rate` linearly, reaching 0 at `max_steps`. |
|||
* `constant`: Keep learning rate constant for the entire training run. |
|||
|
|||
Options: `linear`, `constant` |
|||
|
|||
### Time Horizon |
|||
|
|||
`time_horizon` corresponds to how many steps of experience to collect per-agent |
|||
before adding it to the experience buffer. When this limit is reached before the |
|||
end of an episode, a value estimate is used to predict the overall expected |
|||
reward from the agent's current state. As such, this parameter trades off |
|||
between a less biased, but higher variance estimate (long time horizon) and more |
|||
biased, but less varied estimate (short time horizon). In cases where there are |
|||
frequent rewards within an episode, or episodes are prohibitively large, a |
|||
smaller number can be more ideal. This number should be large enough to capture |
|||
all the important behavior within a sequence of an agent's actions. |
|||
|
|||
Typical Range: `32` - `2048` |
|||
|
|||
### Max Steps |
|||
|
|||
`max_steps` corresponds to how many steps of the simulation (multiplied by |
|||
frame-skip) are run during the training process. This value should be increased |
|||
for more complex problems. |
|||
|
|||
Typical Range: `5e5` - `1e7` |
|||
|
|||
### Beta |
|||
|
|||
`beta` corresponds to the strength of the entropy regularization, which makes |
|||
the policy "more random." This ensures that agents properly explore the action |
|||
space during training. Increasing this will ensure more random actions are |
|||
taken. This should be adjusted such that the entropy (measurable from |
|||
TensorBoard) slowly decreases alongside increases in reward. If entropy drops |
|||
too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`. |
|||
|
|||
Typical Range: `1e-4` - `1e-2` |
|||
|
|||
### Epsilon |
|||
|
|||
`epsilon` corresponds to the acceptable threshold of divergence between the old |
|||
and new policies during gradient descent updating. Setting this value small will |
|||
result in more stable updates, but will also slow the training process. |
|||
|
|||
Typical Range: `0.1` - `0.3` |
|||
|
|||
### Normalize |
|||
|
|||
`normalize` corresponds to whether normalization is applied to the vector |
|||
observation inputs. This normalization is based on the running average and |
|||
variance of the vector observation. Normalization can be helpful in cases with |
|||
complex continuous control problems, but may be harmful with simpler discrete |
|||
control problems. |
|||
|
|||
### Number of Layers |
|||
|
|||
`num_layers` corresponds to how many hidden layers are present after the |
|||
observation input, or after the CNN encoding of the visual observation. For |
|||
simple problems, fewer layers are likely to train faster and more efficiently. |
|||
More layers may be necessary for more complex control problems. |
|||
|
|||
Typical range: `1` - `3` |
|||
|
|||
### Hidden Units |
|||
|
|||
`hidden_units` correspond to how many units are in each fully connected layer of |
|||
the neural network. For simple problems where the correct action is a |
|||
straightforward combination of the observation inputs, this should be small. For |
|||
problems where the action is a very complex interaction between the observation |
|||
variables, this should be larger. |
|||
|
|||
Typical Range: `32` - `512` |
|||
|
|||
### (Optional) Visual Encoder Type |
|||
|
|||
`vis_encode_type` corresponds to the encoder type for encoding visual observations. |
|||
Valid options include: |
|||
* `simple` (default): a simple encoder which consists of two convolutional layers |
|||
* `nature_cnn`: [CNN implementation proposed by Mnih et al.](https://www.nature.com/articles/nature14236), |
|||
consisting of three convolutional layers |
|||
* `resnet`: [IMPALA Resnet implementation](https://arxiv.org/abs/1802.01561), |
|||
consisting of three stacked layers, each with two residual blocks, making a |
|||
much larger network than the other two. |
|||
|
|||
Options: `simple`, `nature_cnn`, `resnet` |
|||
|
|||
## (Optional) Recurrent Neural Network Hyperparameters |
|||
|
|||
The below hyperparameters are only used when `use_recurrent` is set to true. |
|||
|
|||
### Sequence Length |
|||
|
|||
`sequence_length` corresponds to the length of the sequences of experience |
|||
passed through the network during training. This should be long enough to |
|||
capture whatever information your agent might need to remember over time. For |
|||
example, if your agent needs to remember the velocity of objects, then this can |
|||
be a small value. If your agent needs to remember a piece of information given |
|||
only once at the beginning of an episode, then this should be a larger value. |
|||
|
|||
Typical Range: `4` - `128` |
|||
|
|||
### Memory Size |
|||
|
|||
`memory_size` corresponds to the size of the array of floating point numbers |
|||
used to store the hidden state of the recurrent neural network of the policy. This value must |
|||
be a multiple of 2, and should scale with the amount of information you expect |
|||
the agent will need to remember in order to successfully complete the task. |
|||
|
|||
Typical Range: `32` - `256` |
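
For reference, a sketch of a complete PPO entry in `trainer_config.yaml` that combines the hyperparameters described above might look like the following. The behavior name, the `trainer` key used to select the algorithm, and all values are illustrative assumptions, not recommendations:

```yaml
MyBehaviorName:
  trainer: ppo
  batch_size: 1024
  buffer_size: 10240
  beta: 5.0e-3
  epsilon: 0.2
  lambd: 0.95
  learning_rate: 3.0e-4
  learning_rate_schedule: linear
  num_epoch: 3
  time_horizon: 64
  max_steps: 5.0e5
  normalize: false
  num_layers: 2
  hidden_units: 128
  vis_encode_type: simple
  use_recurrent: false
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
```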
|||
|
|||
## (Optional) Behavioral Cloning Using Demonstrations |
|||
|
|||
In some cases, you might want to bootstrap the agent's policy using behavior recorded |
|||
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds |
|||
training operations that mimic a demonstration rather than attempting to maximize reward. |
|||
|
|||
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance: |
|||
|
|||
```yaml
|||
behavioral_cloning: |
|||
demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo |
|||
strength: 0.5 |
|||
steps: 10000 |
|||
``` |
|||
|
|||
Below are the available hyperparameters for BC. |
|||
|
|||
### Strength |
|||
|
|||
`strength` corresponds to the learning rate of the imitation relative to the learning |
|||
rate of PPO, and roughly corresponds to how strongly we allow BC |
|||
to influence the policy. |
|||
|
|||
Typical Range: `0.1` - `0.5` |
|||
|
|||
### Demo Path |
|||
|
|||
`demo_path` is the path to your `.demo` file or directory of `.demo` files. |
|||
See the [imitation learning guide](Training-Imitation-Learning.md) for more on `.demo` files. |
|||
|
|||
### Steps |
|||
|
|||
During BC, it is often desirable to stop using demonstrations after the agent has |
|||
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize |
|||
outside of the provided demonstrations. `steps` corresponds to the training steps over which |
|||
BC is active. The learning rate of BC will anneal over the steps. Set |
|||
the steps to 0 for constant imitation over the entire training run. |
|||
|
|||
### (Optional) Batch Size |
|||
|
|||
`batch_size` is the number of demonstration experiences used for one iteration of a gradient |
|||
descent update. If not specified, it will default to the `batch_size` defined for PPO. |
|||
|
|||
Typical Range (Continuous): `512` - `5120` |
|||
|
|||
Typical Range (Discrete): `32` - `512` |
|||
|
|||
### (Optional) Number of Epochs |
|||
|
|||
`num_epoch` is the number of passes through the experience buffer during |
|||
gradient descent. If not specified, it will default to the number of epochs set for PPO. |
|||
|
|||
Typical Range: `3` - `10` |
|||
|
|||
### (Optional) Samples Per Update |
|||
|
|||
`samples_per_update` is the maximum number of samples |
|||
to use during each imitation update. You may want to lower this if your demonstration |
|||
dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 |
|||
to train over all of the demonstrations at each update step. |
|||
|
|||
Default Value: `0` (all) |
|||
|
|||
Typical Range: Approximately equal to PPO's `buffer_size` |
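
Combining the optional settings above, a fuller `behavioral_cloning` section might look like this sketch (values are illustrative):

```yaml
behavioral_cloning:
  demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
  strength: 0.5
  steps: 10000
  batch_size: 512
  num_epoch: 3
  samples_per_update: 0
```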
|||
|
|||
### (Optional) Advanced: Initialize Model Path |
|||
|
|||
`init_path` can be specified to initialize your model from a previous run before starting. |
|||
Note that the prior run should have used the same trainer configurations as the current run, |
|||
and have been saved with the same version of ML-Agents. You should provide the full path |
|||
to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. |
|||
|
|||
This option is provided in case you want to initialize different behaviors from different runs; |
|||
in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize |
|||
all models from the same run. |
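
A sketch of how `init_path` might be placed, assuming it sits alongside the other trainer settings for that behavior (the behavior name and run id are hypothetical):

```yaml
MyBehaviorName:
  trainer: ppo
  init_path: ./models/previous-run-id/MyBehaviorName
```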
|||
|
|||
### (Optional) Advanced: Disable Threading |
|||
|
|||
By default, PPO model updates can happen while the environment is being stepped. This violates the |
|||
[on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) |
|||
assumption of PPO slightly in exchange for a 10-20% training speedup. To maintain the |
|||
strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. |
|||
|
|||
Default Value: `true` |
|||
|
|||
## Training Statistics |
|||
|
|||
To view training statistics, use TensorBoard. For information on launching and |
|||
using TensorBoard, see |
|||
[here](./Getting-Started.md#observing-training-progress). |
|||
|
|||
### Cumulative Reward |
|||
|
|||
The general trend in reward should consistently increase over time. Small ups |
|||
and downs are to be expected. Depending on the complexity of the task, a |
|||
significant increase in reward may not present itself until millions of steps |
|||
into the training process. |
|||
|
|||
### Entropy |
|||
|
|||
This corresponds to how random the decisions are. This should |
|||
consistently decrease during training. If it decreases too soon or not at all, |
|||
`beta` should be adjusted (when using discrete action space). |
|||
|
|||
### Learning Rate |
|||
|
|||
This will decrease over time on a linear schedule by default, unless `learning_rate_schedule` |
|||
is set to `constant`. |
|||
|
|||
### Policy Loss |
|||
|
|||
These values will oscillate during training. Generally they should be less than |
|||
1.0. |
|||
|
|||
### Value Estimate |
|||
|
|||
These values should increase as the cumulative reward increases. They correspond |
|||
to how much future reward the agent predicts itself receiving at any given |
|||
point. |
|||
|
|||
### Value Loss |
|||
|
|||
These values will increase as the reward increases, and then should decrease |
|||
once reward becomes stable. |
|
|||
# Training with Soft Actor-Critic
|||
|
|||
In addition to [Proximal Policy Optimization (PPO)](Training-PPO.md), ML-Agents also provides |
|||
[Soft Actor-Critic](http://bair.berkeley.edu/blog/2018/12/14/sac/) to perform |
|||
reinforcement learning. |
|||
|
|||
In contrast with PPO, SAC is _off-policy_, which means it can learn from experiences collected |
|||
at any time during the past. As experiences are collected, they are placed in an |
|||
experience replay buffer and randomly drawn during training. This makes SAC |
|||
significantly more sample-efficient, often requiring 5-10 times fewer samples to learn
|||
the same task as PPO. However, SAC tends to require more model updates. SAC is a |
|||
good choice for heavier or slower environments (about 0.1 seconds per step or more). |
|||
|
|||
SAC is also a "maximum entropy" algorithm, and enables exploration in an intrinsic way. |
|||
Read more about maximum entropy RL [here](https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/). |
|||
|
|||
To train an agent, you will need to provide the agent one or more reward signals which |
|||
the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md) |
|||
for the available reward signals and the corresponding hyperparameters. |
|||
|
|||
## Best Practices when training with SAC |
|||
|
|||
Successfully training a reinforcement learning model often involves tuning |
|||
hyperparameters. This guide contains some best practices for training |
|||
when the default parameters don't seem to be giving the level of performance |
|||
you would like. |
|||
|
|||
## Hyperparameters |
|||
|
|||
### Reward Signals |
|||
|
|||
In reinforcement learning, the goal is to learn a Policy that maximizes reward. |
|||
In the most basic case, the reward is given by the environment. However, we could imagine |
|||
rewarding the agent for various different behaviors. For instance, we could reward |
|||
the agent for exploring new states, rather than only when an explicit reward is given.
|||
Furthermore, we could mix reward signals to help the learning process. |
|||
|
|||
`reward_signals` provides a section to define [reward signals.](Reward-Signals.md) |
|||
ML-Agents provides three reward signals by default: the Extrinsic (environment) reward signal,
the Curiosity reward signal, which can be used to encourage exploration in sparse extrinsic reward
environments, and the GAIL reward signal.
|||
|
|||
#### Steps Per Update for Reward Signal (Optional) |
|||
|
|||
`reward_signal_steps_per_update` for the reward signals corresponds to the number of steps per mini batch sampled |
|||
and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. |
|||
However, to imitate the training procedure in certain imitation learning papers (e.g. |
|||
[Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), |
|||
we may want to update the reward signal (GAIL) M times for every update of the policy. |
|||
We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` |
|||
under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to |
|||
`steps_per_update`. |
|||
|
|||
Typical Range: `steps_per_update` |
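
As a sketch of the N / M relationship described above, suppose we want the GAIL reward signal updated twice for every policy update while the policy updates once every 8 agent steps. Following the nesting described in this section (the exact layout and values are assumptions for illustration):

```yaml
steps_per_update: 8                    # N: one policy update per 8 agent steps, on average
reward_signals:
  reward_signal_steps_per_update: 4    # N / M, with M = 2 reward signal updates per policy update
  gail:
    strength: 0.5
    gamma: 0.99
    demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```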
|||
|
|||
### Buffer Size |
|||
|
|||
`buffer_size` corresponds to the maximum number of experiences (agent observations, actions
|||
and rewards obtained) that can be stored in the experience replay buffer. This value should be |
|||
large, on the order of thousands of times longer than your episodes, so that SAC |
|||
can learn from old as well as new experiences. It should also be much larger than |
|||
`batch_size`. |
|||
|
|||
Typical Range: `50000` - `1000000` |
|||
|
|||
### Buffer Init Steps |
|||
|
|||
`buffer_init_steps` is the number of experiences to prefill the buffer with before attempting training. |
|||
As the untrained policy is fairly random, prefilling the buffer with random actions is |
|||
useful for exploration. Typically, at least several episodes of experiences should be |
|||
prefilled. |
|||
|
|||
Typical Range: `1000` - `10000` |
|||
|
|||
### Batch Size |
|||
|
|||
`batch_size` is the number of experiences used for one iteration of a gradient |
|||
descent update. If |
|||
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
|||
|
|||
Typical Range (Continuous): `128` - `1024` |
|||
|
|||
Typical Range (Discrete): `32` - `512` |
|||
|
|||
### Initial Entropy Coefficient |
|||
|
|||
`init_entcoef` refers to the initial entropy coefficient set at the beginning of training. In |
|||
SAC, the agent is incentivized to make its actions entropic to facilitate better exploration. |
|||
The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy |
|||
coefficient is [automatically adjusted](https://arxiv.org/abs/1812.05905) to a preset target |
|||
entropy, so the `init_entcoef` only corresponds to the starting value of the entropy bonus. |
|||
Increase `init_entcoef` to explore more in the beginning, decrease to converge to a solution faster. |
|||
|
|||
Typical Range (Continuous): `0.5` - `1.0` |
|||
|
|||
Typical Range (Discrete): `0.05` - `0.5` |
|||
|
|||
### Train Interval |
|||
|
|||
`train_interval` is the number of steps taken between each agent training event. Typically, |
|||
we can train after every step, but if your environment's steps are very small and very frequent, |
|||
there may not be any new interesting information between steps, and `train_interval` can be increased. |
|||
|
|||
Typical Range: `1` - `5` |
|||
|
|||
### Steps Per Update |
|||
|
|||
`steps_per_update` corresponds to the average ratio of agent steps (actions) taken to updates made of the agent's |
|||
policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience |
|||
replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after |
|||
exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. |
|||
|
|||
Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will |
|||
improve sample efficiency (reduce the number of steps required to train) |
|||
but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example |
|||
environments) `steps_per_update` equal to the number of agents in the scene is a good balance. |
|||
For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. |
|||
We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will |
|||
usually result in a slowdown unless the environment is very slow. |
|||
|
|||
Typical Range: `1` - `20` |
|||
|
|||
### Tau |
|||
|
|||
`tau` corresponds to the magnitude of the target Q update during the SAC model update. |
|||
In SAC, there are two neural networks: the target and the policy. The target network is |
|||
used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed |
|||
while the policy is being updated. This target is then slowly updated according to `tau`. |
|||
Typically, this value should be left at `0.005`. For simple problems, increasing |
|||
`tau` to `0.01` might reduce the time it takes to learn, at the cost of stability. |
|||
|
|||
Typical Range: `0.005` - `0.01` |
|||
|
|||
### Learning Rate |
|||
|
|||
`learning_rate` corresponds to the strength of each gradient descent update |
|||
step. This should typically be decreased if training is unstable, and the reward |
|||
does not consistently increase. |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
|||
|
|||
### (Optional) Learning Rate Schedule |
|||
|
|||
`learning_rate_schedule` corresponds to how the learning rate is changed over time. |
|||
For SAC, we recommend holding learning rate constant so that the agent can continue to |
|||
learn until its Q function converges naturally. |
|||
|
|||
Options: |
|||
* `linear`: Decay `learning_rate` linearly, reaching 0 at `max_steps`. |
|||
* `constant` (default): Keep learning rate constant for the entire training run. |
|||
|
|||
Options: `linear`, `constant` |
|||
|
|||
### Time Horizon |
|||
|
|||
`time_horizon` corresponds to how many steps of experience to collect per-agent |
|||
before adding it to the experience buffer. This parameter is a lot less critical |
|||
to SAC than PPO, and can typically be set to approximately your episode length. |
|||
|
|||
Typical Range: `32` - `2048` |
|||
|
|||
### Max Steps |
|||
|
|||
`max_steps` corresponds to how many steps of the simulation (multiplied by |
|||
frame-skip) are run during the training process. This value should be increased |
|||
for more complex problems. |
|||
|
|||
Typical Range: `5e5` - `1e7` |
|||
|
|||
### Normalize |
|||
|
|||
`normalize` corresponds to whether normalization is applied to the vector |
|||
observation inputs. This normalization is based on the running average and |
|||
variance of the vector observation. Normalization can be helpful in cases with |
|||
complex continuous control problems, but may be harmful with simpler discrete |
|||
control problems. |
|||
|
|||
### Number of Layers |
|||
|
|||
`num_layers` corresponds to how many hidden layers are present after the |
|||
observation input, or after the CNN encoding of the visual observation. For |
|||
simple problems, fewer layers are likely to train faster and more efficiently. |
|||
More layers may be necessary for more complex control problems. |
|||
|
|||
Typical range: `1` - `3` |
|||
|
|||
### Hidden Units |
|||
|
|||
`hidden_units` correspond to how many units are in each fully connected layer of |
|||
the neural network. For simple problems where the correct action is a |
|||
straightforward combination of the observation inputs, this should be small. For |
|||
problems where the action is a very complex interaction between the observation |
|||
variables, this should be larger. |
|||
|
|||
Typical Range: `32` - `512` |
|||
|
|||
### (Optional) Visual Encoder Type |
|||
|
|||
`vis_encode_type` corresponds to the encoder type for encoding visual observations. |
|||
Valid options include: |
|||
* `simple` (default): a simple encoder which consists of two convolutional layers |
|||
* `nature_cnn`: [CNN implementation proposed by Mnih et al.](https://www.nature.com/articles/nature14236), |
|||
consisting of three convolutional layers |
|||
* `resnet`: [IMPALA Resnet implementation](https://arxiv.org/abs/1802.01561), |
|||
consisting of three stacked layers, each with two residual blocks, making a |
|||
much larger network than the other two. |
|||
|
|||
Options: `simple`, `nature_cnn`, `resnet` |
|||
|
|||
## (Optional) Recurrent Neural Network Hyperparameters |
|||
|
|||
The below hyperparameters are only used when `use_recurrent` is set to true. |
|||
|
|||
### Sequence Length |
|||
|
|||
`sequence_length` corresponds to the length of the sequences of experience |
|||
passed through the network during training. This should be long enough to |
|||
capture whatever information your agent might need to remember over time. For |
|||
example, if your agent needs to remember the velocity of objects, then this can |
|||
be a small value. If your agent needs to remember a piece of information given |
|||
only once at the beginning of an episode, then this should be a larger value. |
|||
|
|||
Typical Range: `4` - `128` |
|||
|
|||
### Memory Size |
|||
|
|||
`memory_size` corresponds to the size of the array of floating point numbers |
|||
used to store the hidden state of the recurrent neural network in the policy. |
|||
This value must be a multiple of 2, and should scale with the amount of information you expect |
|||
the agent will need to remember in order to successfully complete the task. |
|||
|
|||
Typical Range: `32` - `256` |
|||
|
|||
### (Optional) Save Replay Buffer |
|||
|
|||
`save_replay_buffer` enables you to save and load the experience replay buffer as well as |
|||
the model when quitting and re-starting training. This may help resumes go more smoothly, |
|||
as the experiences collected won't be wiped. Note that replay buffers can be very large, and |
|||
will take up a considerable amount of disk space. For that reason, we disable this feature by |
|||
default. |
|||
|
|||
Default: `False` |
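
For reference, a sketch of a complete SAC entry in `trainer_config.yaml` combining the hyperparameters above might look like the following. The behavior name, the `trainer` key used to select the algorithm, and all values are illustrative assumptions, not recommendations:

```yaml
MyBehaviorName:
  trainer: sac
  batch_size: 128
  buffer_size: 50000
  buffer_init_steps: 1000
  init_entcoef: 1.0
  tau: 0.005
  train_interval: 1
  steps_per_update: 1
  learning_rate: 3.0e-4
  learning_rate_schedule: constant
  time_horizon: 64
  max_steps: 5.0e5
  normalize: false
  num_layers: 2
  hidden_units: 128
  save_replay_buffer: false
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
```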
|||
|
|||
## (Optional) Behavioral Cloning Using Demonstrations |
|||
|
|||
In some cases, you might want to bootstrap the agent's policy using behavior recorded |
|||
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds |
|||
training operations that mimic a demonstration rather than attempting to maximize reward. |
|||
|
|||
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance: |
|||
|
|||
```yaml
|||
behavioral_cloning: |
|||
demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo |
|||
strength: 0.5 |
|||
steps: 10000 |
|||
``` |
|||
|
|||
Below are the available hyperparameters for BC. |
|||
|
|||
### Strength |
|||
|
|||
`strength` corresponds to the learning rate of the imitation relative to the learning |
|||
rate of SAC, and roughly corresponds to how strongly we allow BC |
|||
to influence the policy. |
|||
|
|||
Typical Range: `0.1` - `0.5` |
|||
|
|||
### Demo Path

`demo_path` is the path to your `.demo` file or directory of `.demo` files. See
the [imitation learning guide](Training-Imitation-Learning.md) for more on
`.demo` files.

### Steps

During BC, it is often desirable to stop using demonstrations after the agent
has "seen" rewards, and allow it to optimize past the available demonstrations
and/or generalize outside of the provided demonstrations. `steps` corresponds to
the training steps over which BC is active. The learning rate of BC will anneal
over the steps. Set the steps to 0 for constant imitation over the entire
training run.

### (Optional) Batch Size

`batch_size` is the number of demonstration experiences used for one iteration
of a gradient descent update. If not specified, it will default to the
`batch_size` defined for SAC.

Typical Range (Continuous): `512` - `5120`

Typical Range (Discrete): `32` - `512`

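Putting the optional settings together, a `behavioral_cloning` section that
anneals over a fixed number of steps and overrides the batch size might look
like this (the step count and batch size are only illustrative):

```yaml
behavioral_cloning:
  demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
  strength: 0.5
  steps: 150000    # anneal the BC learning rate over these steps; 0 keeps it constant
  batch_size: 512  # optional; defaults to the trainer's batch_size if omitted
```
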
### (Optional) Advanced: Initialize Model Path

`init_path` can be specified to initialize your model from a previous run before
starting. Note that the prior run should have used the same trainer
configurations as the current run, and have been saved with the same version of
ML-Agents. You should provide the full path to the folder where the checkpoints
were saved, e.g. `./models/{run-id}/{behavior_name}`.

This option is provided in case you want to initialize different behaviors from
different runs; in most cases, it is sufficient to use the `--initialize-from`
CLI parameter to initialize all models from the same run.

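As a sketch, initializing a single behavior from the checkpoints of an earlier
run might look like this (the run id and behavior name are hypothetical):

```yaml
StrikerBehavior:   # hypothetical behavior name
  trainer: ppo
  init_path: ./models/previous-run-id/StrikerBehavior   # hypothetical previous run
```
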
## Training Statistics

To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started.md#observing-training-progress).

### Cumulative Reward

The general trend in reward should consistently increase over time. Small ups
and downs are to be expected. Depending on the complexity of the task, a
significant increase in reward may not present itself until millions of steps
into the training process.

### Entropy Coefficient

SAC is a "maximum entropy" reinforcement learning algorithm, and agents trained
using SAC are incentivized to behave randomly while also solving the problem.
The entropy coefficient balances the incentive to behave randomly vs. maximizing
the reward. This value is adjusted automatically so that the agent retains some
amount of randomness during training. It should steadily decrease in the
beginning of training, and reach some small value where it will level off. If it
decreases too soon or takes too long to decrease, `init_entcoef` should be
adjusted.

### Entropy

This corresponds to how random the agent's decisions are. This should initially
increase during training, reach a peak, and should decline along with the
Entropy Coefficient. This is because in the beginning, the agent is incentivized
to be more random for exploration due to a high entropy coefficient. If it
decreases too soon or takes too long to decrease, `init_entcoef` should be
adjusted.

### Learning Rate

This will stay a constant value by default, unless `learning_rate_schedule` is
set to `linear`.

### Policy Loss

These values may increase as the agent explores, but should decrease long-term
as the agent learns how to solve the task.

### Value Estimate

These values should increase as the cumulative reward increases. They correspond
to how much future reward the agent predicts itself receiving at any given
point. They may also increase at the beginning as the agent is rewarded for
being random (see: Entropy and Entropy Coefficient), but should decline as
Entropy Coefficient decreases.

### Value Loss

These values will increase as the reward increases, and then should decrease
once reward becomes stable.

# Training with Curriculum Learning

Curriculum learning is a feature of ML-Agents which allows for the properties of
environments to be changed during the training process to aid in learning.

## An Instructional Example

*[**Note**: The example provided below is for instructional purposes, and was
based on an early version of the
[Wall Jump example environment](Learning-Environment-Examples.md). As such, it
is not possible to directly replicate the results here using that environment.]*

Imagine a task in which an agent needs to scale a wall to arrive at a goal. The
starting point when training an agent to accomplish this task will be a random
policy. That starting policy will have the agent running in circles, and will
likely never, or only very rarely, scale the wall properly to achieve the
reward. If we start with a simpler task, such as moving toward an unobstructed
goal, then the agent can easily learn to accomplish the task. From there, we can
slowly add to the difficulty of the task by increasing the size of the wall
until the agent can complete the initially near-impossible task of scaling the
wall.

![Wall](images/curriculum.png)

_Demonstration of a hypothetical curriculum training scenario in which a
progressively taller wall obstructs the path to the goal._

## How-To

Each group of Agents under the same `Behavior Name` in an environment can have a
corresponding curriculum. These curricula are held in what we call a
"metacurriculum". A metacurriculum allows different groups of Agents to follow
different curricula within the same environment.

### Specifying Curricula

In order to define the curricula, the first step is to decide which parameters
of the environment will vary. In the case of the Wall Jump environment, the
height of the wall is what varies. We define this as an `Environment Parameter`
that can be accessed in `Academy.Instance.EnvironmentParameters`, and by doing
so it becomes adjustable via the Python API. Rather than adjusting it by hand,
we will create a YAML file which describes the structure of the curricula.
Within it, we can specify at which points in the training process the wall
height will change, based either on the percentage of training steps that have
taken place or on the average reward the agent has received in the recent past.
Below is an example config for the curricula for the Wall Jump environment.

```yaml
BigWallJump:
  measure: progress
  thresholds: [0.1, 0.3, 0.5]
  min_lesson_length: 100
  signal_smoothing: true
  parameters:
    big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
    big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
  measure: progress
  thresholds: [0.1, 0.3, 0.5]
  min_lesson_length: 100
  signal_smoothing: true
  parameters:
    small_wall_height: [1.5, 2.0, 2.5, 4.0]
```

At the top level of the config is the behavior name. Note that this must be the
same as the Behavior Name in the
[Agent's Behavior Parameters](Learning-Environment-Design-Agents.md#agent-properties).
The curriculum for each behavior has the following parameters:

* `measure` - What to measure learning progress and advancement in lessons by.
  * `reward` - Uses a measure of received reward.
  * `progress` - Uses the ratio of steps/max_steps.
* `thresholds` (float array) - Points in the value of `measure` where the lesson
  should be increased.
* `min_lesson_length` (int) - The minimum number of episodes that should be
  completed before the lesson can change. If `measure` is set to `reward`, the
  average cumulative reward of the last `min_lesson_length` episodes will be
  used to determine if the lesson should change. Must be nonnegative.

  __Important__: the average reward that is compared to the thresholds is
  different than the mean reward that is logged to the console. For example, if
  `min_lesson_length` is `100`, the lesson will increment after the average
  cumulative reward of the last `100` episodes exceeds the current threshold.
  The mean reward logged to the console is dictated by the `summary_freq`
  parameter in the
  [trainer configuration file](Training-ML-Agents.md#training-config-file).
* `signal_smoothing` (true/false) - Whether to weight the current progress
  measure by previous values.
  * If `true`, weighting will be 0.75 (new) 0.25 (old).
* `parameters` (dictionary of key:string, value:float array) - Corresponds to
  the Environment Parameters to control. Length of each array should be one
  greater than the number of thresholds.

Once our curriculum is defined, we have to use the environment parameters we
defined and modify the environment from the Agent's `OnEpisodeBegin()` function.
See
[WallJumpAgent.cs](../Project/Assets/ML-Agents/Examples/WallJump/Scripts/WallJumpAgent.cs)
for an example.

### Training with a Curriculum

Once we have specified our metacurriculum and curricula, we can launch
`mlagents-learn` using the `--curriculum` flag to point to the config file for
our curricula and PPO will train using Curriculum Learning. For example, to
train agents in the Wall Jump environment with curriculum learning, you can run:

```sh
mlagents-learn config/trainer_config.yaml --curriculum=config/curricula/wall_jump.yaml --run-id=wall-jump-curriculum
```

You can then keep track of the current lessons and progress via TensorBoard.

__Note__: If you are resuming a training session that uses curriculum, please
pass the number of the last-reached lesson using the `--lesson` flag when
running `mlagents-learn`.