Consolidate Feature descriptions into ML-Agents-Overview page
Merged the "Overview" sections of a few pages into their respective sections in ML-Agents-Overview:
- Training-Using-Concurrent-Unity-Instances.md
- Training-Self-Play.md
- Training-SAC.md
- Training-PPO.md
- Training-Imitation-Learning.md
- Training-Environment-Parameter-Randomization.md
- Training-Curriculum-Learning.md
- Reward-Signals.md
- Feature-Monitor.md
- Feature-Memory.md
Organized ML-Agents-Overview into Training Methods and Training Options sections.
Follow-up action items (part of a separate PR):
- Smooth over the documentation in ML-Agents-Overview (right now, the text is largely pasted verbatim from other pages). If we align on the new structure for this page, we can iterate on it.
- Update “Key Components” section with new graph and discuss side channels and revise use of Academy.
- Consolidate “Training-*” docs into Training-ML-Agents to offer a single guide for all hyperparameter selection
The default algorithm is PPO, a method that has been shown to be more general-purpose
and stable than many other RL algorithms.
In contrast with PPO, SAC is _off-policy_, which means it can learn from experiences collected
at any time during the past. As experiences are collected, they are placed in an
experience replay buffer and randomly drawn during training. This makes SAC
significantly more sample-efficient, often requiring 5-10 times fewer samples than PPO
to learn the same task. However, SAC tends to require more model updates. SAC is a
good choice for heavier or slower environments (about 0.1 seconds per step or more).
SAC is also a "maximum entropy" algorithm, meaning it encourages exploration intrinsically.
Read more about maximum entropy RL [here](https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/).
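Switching between the two trainers is a one-line change in the trainer configuration
file. Below is a minimal sketch of what such a configuration might look like; the
behavior name and hyperparameter values are illustrative placeholders, and the full set
of options is described in [Training-PPO](Training-PPO.md) and [Training-SAC](Training-SAC.md).

```yaml
# Sketch of a trainer configuration (illustrative values only).
# "MyBehavior" is a placeholder for your agent's behavior name.
MyBehavior:
  trainer: sac        # switch to "ppo" to use the default trainer
  buffer_size: 50000  # for SAC, the experience replay buffer sampled during training
  batch_size: 128
  max_steps: 5.0e5
```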
#### Curiosity for Sparse-reward Environments
In environments where the agent receives rare or infrequent rewards (i.e. sparse-reward), an
agent may never receive a reward signal on which to bootstrap its training process. This is a
scenario where an intrinsic reward signal can be valuable. Curiosity is one such
signal which can help the agent explore when extrinsic rewards are sparse.
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in
[Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/abs/1705.05363).
It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
next encoded observation.
The loss of the forward model (the difference between the predicted and actual encoded observations)
is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.
For more information, see our dedicated [blog post on the Curiosity module](https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/).
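Concretely, curiosity is enabled by adding a `curiosity` entry under `reward_signals`
in the trainer configuration. The snippet below is a sketch with illustrative values;
consult [Reward Signals](Reward-Signals.md) for the exact set of supported settings.

```yaml
# Sketch: enabling the curiosity reward signal alongside the extrinsic reward.
# Values are illustrative, not recommendations.
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02   # scale of the intrinsic reward relative to the extrinsic one
    gamma: 0.99
```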
### Imitation Learning
It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.
For example, instead of indirectly training a medic with the help
of a reward function, we can give the medic real world examples of observations
from the game and actions from a game controller to guide the medic's behavior.
Imitation Learning uses pairs of observations and actions from
a demonstration to learn a policy. See this
[video demo](https://youtu.be/kpb8ZkMBFYs) of imitation learning.
Imitation learning can either be used alone or in conjunction with reinforcement learning.
If used alone it can provide a mechanism for learning a specific type of behavior
(i.e. a specific style of solving the task). If used in conjunction with reinforcement
learning it can dramatically reduce the time the agent takes to solve the environment.
This can be especially pronounced in sparse-reward environments.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
using 6 episodes of demonstrations can reduce the number of training steps required by more than a factor of 4.
See Behavioral Cloning + GAIL + Curiosity + RL below.
<p align="center">
  <img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="700" border="0" />
</p>
The ML-Agents Toolkit provides a way to learn directly from demonstrations, as well as use them
to help speed up reward-based training (RL). We include two algorithms called
Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL).
In most scenarios, you can combine these two features.
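As a rough illustration, combining the two might look like the trainer-configuration
sketch below. The exact key names (including how the behavioral cloning section is
spelled) vary by release, so treat this as an outline and consult
[Training Imitation Learning](Training-Imitation-Learning.md) for the authoritative settings.

```yaml
# Sketch: GAIL as a reward signal plus behavioral cloning, both driven by
# the same demonstration file. The demo path and values are placeholders.
reward_signals:
  gail:
    strength: 0.5
    gamma: 0.99
    demo_path: demos/ExpertPyramid.demo
behavioral_cloning:
  demo_path: demos/ExpertPyramid.demo
  strength: 0.5     # influence of the cloning loss relative to the RL loss
  steps: 150000     # anneal the cloning influence over this many steps
```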
- **Training with Environment Parameter Randomization** - If an agent is exposed
  to several variations of an environment, it will be more robust (i.e. generalize
  better) to unseen variations of the environment. Similar to Curriculum Learning,
where environments become more difficult as the agent learns, the toolkit provides
a way to randomly sample parameters of the environment during training. See
[Training With Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md)
to learn more about this feature.
- **Concurrent Unity Instances** - We enable developers to run concurrent, parallel
  instances of the Unity executable during training. For certain scenarios, this
  should speed up training.
### Training With Environment Parameter Randomization
One of the challenges of training and testing agents on the same
environment is that the agents tend to overfit. The result is that the
agents are unable to generalize to any tweaks or variations in the environment.
This is analogous to a model being trained and tested on an identical dataset
in supervised learning. This becomes problematic in cases where environments
are instantiated with varying objects or properties.
To help agents become more robust and generalize better to changes in the environment,
the agent can be trained over multiple variations of a given environment. We refer to
this approach as **Environment Parameter Randomization**. For those familiar with
Reinforcement Learning research, this approach is based on the concept of Domain
Randomization (you can read more about it [here](https://arxiv.org/abs/1703.06907)).
By using parameter randomization during training, the agent can be better suited to
adapt (with higher performance) to future unseen variations of the environment.
_Example of variations of the 3D Ball environment._
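As a sketch, randomizing a physical property such as the ball's mass might look like
the YAML below. The `mass` parameter and the sampler settings are illustrative; the
supported sampler types and keys are documented in
[Training With Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md).

```yaml
# Sketch of a parameter-randomization config (illustrative keys and values).
resampling-interval: 5000   # simulation steps between re-sampling the parameters
mass:                       # an Environment Parameter exposed by the scene
  sampler-type: "uniform"
  min_value: 0.5
  max_value: 10
```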
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
this is not the case. An example of an asymmetric game is Hide and Seek. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not
necessarily ideal.
With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent
(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is in contrast
to competing against the current, best opponent in every episode, which is constantly changing (because it's learning).
Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critic (SAC)](Training-SAC.md).
However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing.
This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend using PPO. For further reading on
this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf).
For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md).
For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md).
Self-play is triggered by including the self-play hyperparameter hierarchy in the trainer configuration file. A detailed description of the self-play hyperparameters is provided in [Training-Self-Play](Training-Self-Play.md). Furthermore, to distinguish opposing agents, set the team ID to different integer values in the behavior parameters script on the agent prefab.
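For example, a self-play block nested under a behavior's trainer configuration might
look like the sketch below. The values shown are illustrative only, and key names may
differ slightly between releases; see [Training-Self-Play](Training-Self-Play.md) for
the full hyperparameter descriptions.

```yaml
# Sketch: self-play hyperparameters (illustrative values only).
self_play:
  window: 10                # number of past snapshots kept as potential opponents
  play_against_current_self_ratio: 0.5
  save_steps: 50000         # trainer steps between opponent snapshots
  swap_steps: 50000         # trainer steps between swapping the opponent
```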
As part of release v0.8, we enabled developers to run concurrent, parallel instances of the Unity executable during training. For certain scenarios, this should speed up training.
### How to Run Concurrent Unity Instances During Training
Please refer to the general instructions on [Training ML-Agents](Training-ML-Agents.md). In order to run concurrent Unity instances during training, set the number of environment instances using the command line option `--num-envs=<n>` when you invoke `mlagents-learn`. Optionally, you can also set the `--base-port`, which is the starting port used for the concurrent Unity instances.
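For example, `mlagents-learn config/trainer_config.yaml --env=3DBall --run-id=parallel-run --num-envs=8 --base-port=5005` would launch eight instances of the environment executable, each communicating over its own port starting at 5005. Here `3DBall` and `parallel-run` are placeholder names for your executable and run ID, and `config/trainer_config.yaml` is assumed to be your trainer configuration file.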