Consolidate Feature descriptions into ML-Agents-Overview page
Merged the "Overview" sections of a few pages into their respective sections in ML-Agents-Overview:
- Training-Using-Concurrent-Unity-Instances.md
- Training-Self-Play.md
- Training-SAC.md
- Training-PPO.md
- Training-Imitation-Learning.md
- Training-Environment-Parameter-Randomization.md
- Training-Curriculum-Learning.md
- Reward-Signals.md
- Feature-Monitor.md
- Feature-Memory.md
Organized ML-Agents-Overview into Training Methods and Training Options sections.
Follow-up action items (part of a separate PR):
- Smooth over the documentation in ML-Agents-Overview (right now, the text is largely pasted verbatim from other pages). If we align on the new structure for this page, we can iterate on it.
- Update “Key Components” section with new graph and discuss side channels and revise use of Academy.
- Consolidate “Training-*” docs into Training-ML-Agents to offer a single guide for all hyperparameter selection
The default algorithm is PPO, a method that has been shown to be more general-purpose
and stable than many other RL algorithms.
In contrast with PPO, SAC is _off-policy_, which means it can learn from experiences collected
at any time during the past. As experiences are collected, they are placed in an
experience replay buffer and randomly drawn during training. This makes SAC
significantly more sample-efficient, often requiring 5-10 times fewer samples than PPO
to learn the same task. However, SAC tends to require more model updates. SAC is a
good choice for heavier or slower environments (about 0.1 seconds per step or more).
SAC is also a "maximum entropy" algorithm, meaning it encourages exploration intrinsically.
Read more about maximum entropy RL [here](https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/).
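Switching between the two trainers is a one-line change in the trainer configuration
file. Below is a minimal sketch of what such a configuration might look like; the
behavior name and hyperparameter values are illustrative placeholders, and the full set
of options is described in [Training-PPO](Training-PPO.md) and [Training-SAC](Training-SAC.md).

```yaml
# Sketch of a trainer configuration (illustrative values only).
# "MyBehavior" is a placeholder for your agent's behavior name.
MyBehavior:
  trainer: sac        # switch to "ppo" to use the default trainer
  buffer_size: 50000  # for SAC, the experience replay buffer sampled during training
  batch_size: 128
  max_steps: 5.0e5
```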
#### Curiosity for Sparse-reward Environments
In environments where the agent receives rare or infrequent rewards (i.e. sparse-reward), an
agent may never receive a reward signal on which to bootstrap its training process. This is a
scenario where an intrinsic reward signal can be valuable. Curiosity is one such
signal which can help the agent explore when extrinsic rewards are sparse.
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in
[Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/abs/1705.05363).
It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
next encoded observation.
The loss of the forward model (the difference between the predicted and actual encoded observations)
is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.
For more information, see our dedicated [blog post on the Curiosity module](https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/).
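Concretely, curiosity is enabled by adding a `curiosity` entry under `reward_signals`
in the trainer configuration. The snippet below is a sketch with illustrative values;
consult [Reward Signals](Reward-Signals.md) for the exact set of supported settings.

```yaml
# Sketch: enabling the curiosity reward signal alongside the extrinsic reward.
# Values are illustrative, not recommendations.
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02   # scale of the intrinsic reward relative to the extrinsic one
    gamma: 0.99
```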
### Imitation Learning
It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.
For example, instead of indirectly training a medic with the help
of a reward function, we can give the medic real world examples of observations
from the game and actions from a game controller to guide the medic's behavior.
Imitation Learning uses pairs of observations and actions from
a demonstration to learn a policy. See this
[video demo](https://youtu.be/kpb8ZkMBFYs) of imitation learning.
Imitation learning can either be used alone or in conjunction with reinforcement learning.
If used alone it can provide a mechanism for learning a specific type of behavior
(i.e. a specific style of solving the task). If used in conjunction with reinforcement
learning it can dramatically reduce the time the agent takes to solve the environment.
This can be especially pronounced in sparse-reward environments.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
using 6 episodes of demonstrations can reduce the number of training steps required by more than a factor of 4.
See Behavioral Cloning + GAIL + Curiosity + RL below.
<p align="center">
  <img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="700" border="0" />
</p>
The ML-Agents Toolkit provides a way to learn directly from demonstrations, as well as use them
to help speed up reward-based training (RL). We include two algorithms called
Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL).
In most scenarios, you can combine these two features.
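As a rough illustration, combining the two might look like the trainer-configuration
sketch below. The exact key names (including how the behavioral cloning section is
spelled) vary by release, so treat this as an outline and consult
[Training Imitation Learning](Training-Imitation-Learning.md) for the authoritative settings.

```yaml
# Sketch: GAIL as a reward signal plus behavioral cloning, both driven by
# the same demonstration file. The demo path and values are placeholders.
reward_signals:
  gail:
    strength: 0.5
    gamma: 0.99
    demo_path: demos/ExpertPyramid.demo
behavioral_cloning:
  demo_path: demos/ExpertPyramid.demo
  strength: 0.5     # influence of the cloning loss relative to the RL loss
  steps: 150000     # anneal the cloning influence over this many steps
```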
- **Training with Environment Parameter Randomization** - If an agent is exposed
  to several variations of an environment, it will be more robust (i.e. generalize
  better) to unseen variations of the environment. Similar to Curriculum Learning,
where environments become more difficult as the agent learns, the toolkit provides
a way to randomly sample parameters of the environment during training. See
[Training With Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md)
to learn more about this feature.
- **Concurrent Unity Instances** - We enable developers to run concurrent, parallel
  instances of the Unity executable during training. For certain scenarios, this
  should speed up training.
### Training With Environment Parameter Randomization
One of the challenges of training and testing agents on the same
environment is that the agents tend to overfit. The result is that the
agents are unable to generalize to any tweaks or variations in the environment.
This is analogous to a model being trained and tested on an identical dataset
in supervised learning. This becomes problematic in cases where environments
are instantiated with varying objects or properties.
To help agents become more robust and generalize better to changes in the environment,
the agent can be trained over multiple variations of a given environment. We refer to
this approach as **Environment Parameter Randomization**. For those familiar with
Reinforcement Learning research, this approach is based on the concept of Domain
Randomization (you can read more about it [here](https://arxiv.org/abs/1703.06907)).
By using parameter randomization during training, the agent can be better suited to
adapt (with higher performance) to future unseen variations of the environment.
_Example of variations of the 3D Ball environment._
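As a sketch, randomizing a physical property such as the ball's mass might look like
the YAML below. The `mass` parameter and the sampler settings are illustrative; the
supported sampler types and keys are documented in
[Training With Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md).

```yaml
# Sketch of a parameter-randomization config (illustrative keys and values).
resampling-interval: 5000   # simulation steps between re-sampling the parameters
mass:                       # an Environment Parameter exposed by the scene
  sampler-type: "uniform"
  min_value: 0.5
  max_value: 10
```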
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
this is not the case. An example of an asymmetric game is Hide and Seek. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not
necessarily ideal.
With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent
(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is in contrast
to competing against the current, best opponent in every episode, which is constantly changing (because it's learning).
Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critic (SAC)](Training-SAC.md).
However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing.
This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend using PPO. For further reading on
this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf).
For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md).
For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md).
Self-play is triggered by including the self-play hyperparameter hierarchy in the trainer configuration file. A detailed description of the self-play hyperparameters is provided in [Training-Self-Play](Training-Self-Play.md). Furthermore, to distinguish opposing agents, set the team ID to different integer values in the behavior parameters script on the agent prefab.
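For example, a self-play block nested under a behavior's trainer configuration might
look like the sketch below. The values shown are illustrative only, and key names may
differ slightly between releases; see [Training-Self-Play](Training-Self-Play.md) for
the full hyperparameter descriptions.

```yaml
# Sketch: self-play hyperparameters (illustrative values only).
self_play:
  window: 10                # number of past snapshots kept as potential opponents
  play_against_current_self_ratio: 0.5
  save_steps: 50000         # trainer steps between opponent snapshots
  swap_steps: 50000         # trainer steps between swapping the opponent
```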
As part of release v0.8, we enabled developers to run concurrent, parallel instances of the Unity executable during training. For certain scenarios, this should speed up training.
### How to Run Concurrent Unity Instances During Training
Please refer to the general instructions on [Training ML-Agents](Training-ML-Agents.md). In order to run concurrent Unity instances during training, set the number of environment instances using the command line option `--num-envs=<n>` when you invoke `mlagents-learn`. Optionally, you can also set the `--base-port`, which is the starting port used for the concurrent Unity instances.
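For example, `mlagents-learn config/trainer_config.yaml --env=3DBall --run-id=parallel-run --num-envs=8 --base-port=5005` would launch eight instances of the environment executable, each communicating over its own port starting at 5005. Here `3DBall` and `parallel-run` are placeholder names for your executable and run ID, and `config/trainer_config.yaml` is assumed to be your trainer configuration file.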