
Merge branch 'master' into self-play-mutex

/asymm-envs
Andrew Cohen, 5 years ago
Current commit
eefc4811
32 files changed, with 781 insertions and 752 deletions
  1. .yamato/com.unity.ml-agents-test.yml (5 changes)
  2. com.unity.ml-agents/Runtime/Sensors/StackingSensor.cs (2 changes)
  3. config/gail_config.yaml (2 changes)
  4. config/trainer_config.yaml (2 changes)
  5. docs/Background-Jupyter.md (2 changes)
  6. docs/Installation.md (2 changes)
  7. docs/Learning-Environment-Create-New.md (20 changes)
  8. docs/Learning-Environment-Design-Agents.md (189 changes)
  9. docs/Learning-Environment-Examples.md (22 changes)
  10. docs/ML-Agents-Overview.md (5 changes)
  11. docs/Python-API.md (2 changes)
  12. docs/Readme.md (4 changes)
  13. docs/Training-PPO.md (2 changes)
  14. docs/Training-SAC.md (2 changes)
  15. docs/Training-Self-Play.md (2 changes)
  16. docs/Using-Docker.md (2 changes)
  17. ml-agents/mlagents/trainers/agent_processor.py (4 changes)
  18. ml-agents/mlagents/trainers/ghost/trainer.py (42 changes)
  19. ml-agents/mlagents/trainers/learn.py (9 changes)
  20. ml-agents/mlagents/trainers/ppo/trainer.py (11 changes)
  21. ml-agents/mlagents/trainers/sac/trainer.py (11 changes)
  22. ml-agents/mlagents/trainers/stats.py (27 changes)
  23. ml-agents/mlagents/trainers/tests/test_learn.py (3 changes)
  24. ml-agents/mlagents/trainers/tests/test_rl_trainer.py (7 changes)
  25. ml-agents/mlagents/trainers/tests/test_stats.py (27 changes)
  26. ml-agents/mlagents/trainers/trainer/rl_trainer.py (102 changes)
  27. ml-agents/mlagents/trainers/trainer/trainer.py (106 changes)
  28. docs/Getting-Started.md (363 changes)
  29. ml-agents/tests/yamato/check_coverage_percent.py (61 changes)
  30. docs/Basic-Guide.md (202 changes)
  31. docs/Getting-Started-with-Balance-Ball.md (232 changes)
  32. docs/Learning-Environment-Best-Practices.md (61 changes)

.yamato/com.unity.ml-agents-test.yml (5 changes)


- version: 2018.4
# 2018.4 doesn't support code-coverage
coverageOptions:
minCoveragePct: 0
minCoveragePct: 72
minCoveragePct: 72
test_platforms:
- name: win
type: Unity::VM

commands:
- npm install upm-ci-utils@stable -g --registry https://api.bintray.com/npm/unity/unity-npm
- upm-ci package test -u {{ editor.version }} --package-path com.unity.ml-agents {{ editor.coverageOptions }}
- python ml-agents/tests/yamato/check_coverage_percent.py upm-ci~/test-results/ {{ editor.minCoveragePct }}
artifacts:
logs:
paths:

changes:
only:
- "com.unity.ml-agents/**"
- "ml-agents/tests/yamato/**"
- ".yamato/com.unity.ml-agents-test.yml"
{% endfor %}

com.unity.ml-agents/Runtime/Sensors/StackingSensor.cs (2 changes)


public int Write(WriteAdapter adapter)
{
// First, call the wrapped sensor's write method. Make sure to use our own adapater, not the passed one.
// First, call the wrapped sensor's write method. Make sure to use our own adapter, not the passed one.
var wrappedShape = m_WrappedSensor.GetObservationShape();
m_LocalAdapter.SetTarget(m_StackedObservations[m_CurrentIndex], wrappedShape, 0);
m_WrappedSensor.Write(m_LocalAdapter);

config/gail_config.yaml (2 changes)


hidden_units: 128
lambd: 0.95
learning_rate: 3.0e-4
max_steps: 5.0e4
max_steps: 5.0e5
memory_size: 256
normalize: false
num_epoch: 3

config/trainer_config.yaml (2 changes)


buffer_size: 12000
summary_freq: 12000
time_horizon: 1000
max_steps: 5.0e5
max_steps: 5.0e6
beta: 0.001
reward_signals:
extrinsic:

docs/Background-Jupyter.md (2 changes)


embedded visualizations. We provide one such notebook,
`notebooks/getting-started.ipynb`, for testing the Python control interface to a
Unity build. This notebook is introduced in the
[Getting Started with the 3D Balance Ball Environment](Getting-Started-with-Balance-Ball.md)
[Getting Started Guide](Getting-Started.md)
tutorial, but can be used for testing the connection to any Unity build.
For a walkthrough of how to use Jupyter, see

docs/Installation.md (2 changes)


## Next Steps
The [Basic Guide](Basic-Guide.md) page contains several short tutorials on
The [Getting Started](Getting-Started.md) guide contains several short tutorials on
setting up the ML-Agents Toolkit within Unity, running a pre-trained model, in
addition to building and training environments.

docs/Learning-Environment-Create-New.md (20 changes)


# Making a New Learning Environment
This tutorial walks through the process of creating a Unity Environment. A Unity
Environment is an application built using the Unity Engine which can be used to
train Reinforcement Learning Agents.
This tutorial walks through the process of creating a Unity Environment from scratch. We recommend first reading the [Getting Started](Getting-Started.md) guide to understand the concepts presented here in an already-built environment.
In this example, we will train a ball to roll to a randomly placed cube. The
ball also learns to avoid falling off the platform.
In this example, we will create an agent capable of controlling a ball on a platform. We will then train the agent to roll the ball toward the cube while avoiding falling off the platform.
## Overview

This is only one way to achieve this objective. Refer to the
[example environments](Learning-Environment-Examples.md) for other ways we can achieve relative positioning.
## Review: Scene Layout
This section briefly reviews how to organize your scene when using Agents in
your Unity environment.
There are two kinds of game objects you need to include in your scene in order
to use Unity ML-Agents: an Academy and one or more Agents.
Keep in mind:
* If you are using multiple training areas, make sure all the Agents have the same `Behavior Name`
and `Behavior Parameters`

docs/Learning-Environment-Design-Agents.md (189 changes)


# Agents
An agent is an actor that can observe its environment and decide on the best
course of action using those observations. Create Agents in Unity by extending
the Agent class. The most important aspects of creating agents that can
successfully learn are the observations the agent collects for
reinforcement learning and the reward you assign to estimate the value of the
An agent is an entity that can observe its environment, decide on the best
course of action using those observations, and execute those actions within
its environment. Agents can be created in Unity by extending
the `Agent` class. The most important aspects of creating agents that can
successfully learn are the observations the agent collects,
and the reward you assign to estimate the value of the
An Agent passes its observations to its Policy. The Policy, then, makes a decision
An Agent passes its observations to its Policy. The Policy then makes a decision
and passes the chosen action back to the agent. Your agent code must execute the
action, for example, move the agent in one direction or another. In order to
[train an agent using reinforcement learning](Learning-Environment-Design.md),

The Policy class abstracts out the decision making logic from the Agent itself so
The `Policy` class abstracts out the decision making logic from the Agent itself so
decisions depends on the kind of Policy it is. You can change the Policy of an
Agent by changing its `Behavior Parameters`. If you set `Behavior Type` to
`Heuristic Only`, the Agent will use its `Heuristic()` method to make decisions
which can allow you to control the Agent manually or write your own Policy. If
the Agent has a `Model` file, it Policy will use the neural network `Model` to
take decisions.
decisions depends on the `Behavior Parameters` associated with the agent. If you
set `Behavior Type` to `Heuristic Only`, the Agent will use its `Heuristic()`
method to make decisions which can allow you to control the Agent manually or
write your own Policy. If the Agent has a `Model` file, its Policy will use
the neural network `Model` to take decisions.
## Decisions

the Agent to request decisions on its own at regular intervals, add a
`Decision Requester` component to the Agent's Game Object. Making decisions at regular step
`Decision Requester` component to the Agent's GameObject. Making decisions at regular step
occur, should call `Agent.RequestDecision()` manually.
occur, such as in a turn-based game, should call `Agent.RequestDecision()` manually.
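For the turn-based case, a minimal sketch inside the Agent subclass might look like the following (the `isMyTurn` flag is hypothetical and not part of the ML-Agents API):

```csharp
bool isMyTurn;   // hypothetical game-state flag maintained elsewhere

void FixedUpdate()
{
    // Only ask the Policy for a new decision when it is this agent's turn,
    // instead of attaching a Decision Requester component.
    if (isMyTurn)
    {
        RequestDecision();
    }
}
```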
## Observations
## Observations and Sensors
To make decisions, an agent must observe its environment in order to infer the
state of the world. A state observation can take the following forms:
To make informed decisions, an agent must first make observations of the state of
the environment. The observations are collected by Sensors attached to the agent
GameObject. By default, agents come with a `VectorSensor` which allows them to
collect floating-point observations into a single array. There are additional
sensor components which can be attached to the agent GameObject which collect their own
observations, or modify other observations. These are:
* **Vector Observation** — a feature vector consisting of an array of floating
point numbers.
* **Visual Observations** — one or more camera images and/or render textures.
* `CameraSensorComponent` - Allows image from `Camera` to be used as observation.
* `RenderTextureSensorComponent` - Allows content of `RenderTexture` to be used as observation.
* `RayPerceptionSensorComponent` - Allows information from set of ray-casts to be used as observation.
When you use vector observations for an Agent, implement the
`Agent.CollectObservations(VectorSensor sensor)` method to create the feature vector. When you use
**Visual Observations**, you only need to identify which Unity Camera objects
or RenderTextures will provide images and the base Agent class handles the rest.
You do not need to implement the `CollectObservations(VectorSensor sensor)` method when your Agent
uses visual observations (unless it also uses vector observations).
### Vector Observations
### Vector Observation Space: Feature Vectors
Vector observations are best used for aspects of the environment which are numerical
and non-visual. The Policy class calls the `CollectObservations(VectorSensor sensor)`
method of each Agent. Your implementation of this function must call
`VectorSensor.AddObservation` to add vector observations.
For agents using a continuous state space, you create a feature vector to
represent the agent's observation at each step of the simulation. The Policy
class calls the `CollectObservations(VectorSensor sensor)` method of each Agent. Your
implementation of this function must call `VectorSensor.AddObservation` to add vector
observations.
The observation must include all the information an agents needs to accomplish
its task. Without sufficient and relevant information, an agent may learn poorly
In order for an agent to learn, the observations should include all the
information an agent needs to accomplish its task. Without sufficient and relevant
information, an agent may learn poorly
solution to the problem.
solution to the problem, or what you would expect a human to be able to use to solve the problem.
For examples of various state observation functions, you can look at the
[example environments](Learning-Environment-Examples.md) included in the

every enemy agent in an environment, you could only observe the closest five.
When you set up an Agent's `Behavior Parameters` in the Unity Editor, set the following
properties to use a continuous vector observation:
properties to use a vector observation:
* **Space Size** — The state size must match the length of your feature vector.

of data to your observation vector. You can add Integers and booleans directly to
the observation vector, as well as some common Unity data types such as `Vector2`,
`Vector3`, and `Quaternion`.
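As a rough, hedged sketch of how these calls compose (the `target`, `isGrounded`, `currentItemType`, and `maxX` members are hypothetical, and the namespace is `Unity.MLAgents` in later releases), a `CollectObservations(VectorSensor sensor)` override might look like:

```csharp
using MLAgents;           // Agent base class
using MLAgents.Sensors;   // VectorSensor
using UnityEngine;

public class ExampleAgent : Agent
{
    public Transform target;     // hypothetical object the agent cares about
    bool isGrounded;             // hypothetical state flag
    int currentItemType;         // hypothetical category index in [0, 2]
    const float maxX = 10f;      // hypothetical maximum x position, for normalization

    public override void CollectObservations(VectorSensor sensor)
    {
        // Common Unity types are flattened into floats automatically.
        sensor.AddObservation(target.position - transform.position); // Vector3 -> 3 floats
        sensor.AddObservation(transform.rotation);                    // Quaternion -> 4 floats
        sensor.AddObservation(isGrounded);                            // bool -> 1 float
        // Normalize scalar observations where a sensible maximum exists.
        sensor.AddObservation(transform.position.x / maxX);           // 1 float
        // One-hot encode categorical values (here, 3 possible item types).
        sensor.AddOneHotObservation(currentItemType, 3);              // 3 floats
        // Space Size for this agent would therefore be 12.
    }
}
```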
#### One-hot encoding categorical information
Type enumerations should be encoded in the _one-hot_ style. That is, add an
element to the feature vector for each element of enumeration, setting the

}
```
`VectorSensor.AddObservation` also provides a two-argument version as a shortcut for _one-hot_
`VectorSensor` also provides a two-argument function `AddOneHotObservation()` as a shortcut for _one-hot_
style observations. The following example is identical to the previous one.
```csharp

angle, or, if the number of turns is significant, increase the maximum value
used in your normalization formula.
### Multiple Visual Observations
#### Vector Observation Summary & Best Practices
* Vector Observations should include all variables relevant for allowing the
agent to take the optimally informed decision, and ideally no extraneous information.
* In cases where Vector Observations need to be remembered or compared over
time, either an LSTM (see [here](Feature-Memory.md)) should be used in the model, or the
`Stacked Vectors` value in the agent GameObject's `Behavior Parameters` should be changed.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done automatically using the
`AddOneHotObservation()` method of the `VectorSensor`.
* In general, all inputs should be normalized to be in
the range 0 to +1 (or -1 to 1). For example, the `x` position information of
an agent where the maximum possible value is `maxValue` should be recorded as
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than
`VectorSensor.AddObservation(transform.position.x);`.
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
### Visual Observations
Visual observations use rendered textures directly or from one or more
cameras in a scene. The Policy vectorizes the textures into a 3D Tensor which
can be fed into a convolutional neural network (CNN). For more information on
CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). You
can use visual observations along side vector observations.
Visual observations are generally provided to agent via either a `CameraSensor` or `RenderTextureSensor`.
These collect image information and transform it into a 3D Tensor which
can be fed into the convolutional neural network (CNN) of the agent policy. For more information on
CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). This allows agents
to learn from spatial regularities in the observation images. It is possible to
use visual and vector observations with the same agent.
succeed at all.
succeed at all as compared to vector observations. As such, they should only be
used when it is not possible to properly define the problem using vector or ray-cast observations.
Visual observations can be derived from Cameras or RenderTextures within your scene.
To add a visual observation to an Agent, add either a Camera Sensor Component

![Agent RenderTexture Debug](images/gridworld.png)
#### Visual Observation Summary & Best Practices
* To collect visual observations, attach `CameraSensor` or `RenderTextureSensor`
components to the agent GameObject.
* Visual observations should generally not be used unless vector observations are not sufficient.
* Image size should be kept as small as possible, without the loss of
needed details for decision making.
* Images should be made greyscale in situations where color information is
not needed for making informed decisions.
Raycasts are an alternative system for the Agent to provide observations based on
the physical environment. This can be easily implemented by adding a
RayPerceptionSensorComponent3D (or RayPerceptionSensorComponent2D) to the Agent.
Raycasts are another possible method for providing observations to an agent.
This can be easily implemented by adding a
`RayPerceptionSensorComponent3D` (or `RayPerceptionSensorComponent2D`) to the Agent GameObject.
During observations, several rays (or spheres, depending on settings) are cast into
the physics world, and the objects that are hit determine the observation vector that

* _Start Vertical Offset_ (3D only) The vertical offset of the ray start point.
* _End Vertical Offset_ (3D only) The vertical offset of the ray end point.
In the example image above, the Agent has two RayPerceptionSensorComponent3Ds.
In the example image above, the Agent has two `RayPerceptionSensorComponent3D`s.
Both use 3 Rays Per Direction and 90 Max Ray Degrees. One of the components
had a vertical offset, so the Agent can tell whether it's clear to jump over
the wall.

`Behavior Parameters`, so you don't need to worry about the formula above when
setting the State Size.
## Vector Actions
#### RayCast Observation Summary & Best Practices
* Attach `RayPerceptionSensorComponent3D` or `RayPerceptionSensorComponent2D` to use.
* This observation type is best used when there is relevant spatial information
for the agent that doesn't require a fully rendered image to convey.
* Use as few rays and tags as necessary to solve the problem in order to improve learning stability and agent performance.
## Actions
agent's `OnActionReceived()` function. When you specify that the vector action space
agent's `OnActionReceived()` function. Actions for an agent can take one of two forms, either **Continuous** or **Discrete**.
When you specify that the vector action space
control signals with length equal to the `Vector Action Space Size` property.
floating point numbers with length equal to the `Vector Action Space Size` property.
When you specify a **Discrete** vector action space type, the action parameter
is an array containing integers. Each integer is an index into a list or table
of commands. In the **Discrete** vector action space type, the action parameter

array of integers, each value corresponds to the number of possibilities for
each branch.
For example, if we wanted an Agent that can move in an plane and jump, we could
For example, if we wanted an Agent that can move in a plane and jump, we could
define two branches (one for motion and one for jumping) because we want our
agent to be able to move __and__ jump concurrently. We define the first branch to
have 5 possible actions (don't move, go left, go right, go backward, go forward)

neural network, the Agent will be unable to perform the specified action. Note
that when the Agent is controlled by its Heuristic, the Agent will
still be able to decide to perform the masked action. In order to mask an
action, override the `Agent.CollectDiscreteActionMasks()` virtual method, and call `DiscreteActionMasker.SetMask()` in it:
action, override the `Agent.CollectDiscreteActionMasks()` virtual method,
and call `DiscreteActionMasker.SetMask()` in it:
```csharp
public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker){

* You cannot mask all the actions of a branch.
* You cannot mask actions in continuous control.
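For illustration only (the branch layout and the `isAgainstWall` flag are hypothetical, not taken from the docs above), a masking override following those rules might look like:

```csharp
using MLAgents;   // `Unity.MLAgents` in later releases

public class MaskingAgent : Agent
{
    bool isAgainstWall;   // hypothetical state flag

    public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
    {
        // Hypothetical rule: when the agent is against a wall, forbid action
        // indices 1 and 2 (say, "go left" and "go right") on branch 0.
        // A masked action cannot be selected by the trained policy, but the
        // Heuristic can still pick it, and a whole branch cannot be masked.
        if (isAgainstWall)
        {
            actionMasker.SetMask(0, new int[] { 1, 2 });
        }
    }
}
```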
### Actions Summary & Best Practices
* Actions can either use `Discrete` or `Continuous` spaces.
* When using `Discrete` it is possible to assign multiple action branches, and to mask certain actions.
* In general, smaller action spaces will make for easier learning.
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.
* When using continuous control, action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
between -1 and 1, but third party training systems may not do so.
## Rewards
In reinforcement learning, the reward is a signal that the agent has done

Perhaps the best advice is to start simple and only add complexity as needed. In
general, you should reward results rather than actions you think will lead to
the desired results. To help develop your rewards, you can use the Monitor class
to display the cumulative reward received by an Agent. You can even use the
the desired results. You can even use the
Allocate rewards to an Agent by calling the `AddReward()` method in the
`OnActionReceived()` function. The reward assigned between each decision
Allocate rewards to an Agent by calling the `AddReward()` or `SetReward()` methods on the agent.
The reward assigned between each decision
decision was. There is a method called `SetReward()` that will override all
decision was. The `SetReward()` method will override all
previous rewards given to an agent since the previous decision.
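As a hedged sketch of that pattern (the `goal` reference, distance check, and reward values are made up for illustration), rewards are typically assigned from `OnActionReceived()`:

```csharp
using MLAgents;     // `Unity.MLAgents` in later releases
using UnityEngine;

public class RewardingAgent : Agent
{
    public Transform goal;   // hypothetical goal object assigned in the Inspector

    public override void OnActionReceived(float[] vectorAction)
    {
        // ... apply vectorAction to move the agent here ...

        // Small per-step penalty, accumulated with AddReward(), to encourage speed.
        AddReward(-0.0005f);

        if (Vector3.Distance(transform.position, goal.position) < 1.5f)
        {
            // SetReward() overrides anything accumulated since the last decision.
            SetReward(1.0f);
            EndEpisode();    // end the episode once the goal is reached
        }
    }
}
```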
### Examples

Note that all of these environments make use of the `EndEpisode()` method, which manually
terminates an episode when a termination condition is reached. This can be
called independently of the `Max Step` property.
### Rewards Summary & Best Practices
* Use `AddReward()` to accumulate rewards between decisions. Use `SetReward()`
to overwrite any previous rewards accumulated between decisions.
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards. Excessive negative rewards can result in the agent
failing to learn any meaningful behavior.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to provide
a small penalty every step (-0.05) that the agent does not complete the task.
In this case completion of the task should also coincide with the end of the
episode by calling `EndEpisode()` on the agent when it has accomplished its goal.
## Agent Properties

docs/Learning-Environment-Examples.md (22 changes)


* Float Properties: None
* Benchmark Mean Reward: 0.93
## [3DBall: 3D Balance Ball](https://youtu.be/dheeCO29-EI)
## 3DBall: 3D Balance Ball
![3D Balance Ball](images/balance.png)

* Recommended Maximum: 20
* Benchmark Mean Reward: 100
## [GridWorld](https://youtu.be/gu8HE9WKEVI)
## GridWorld
![GridWorld](images/gridworld.png)

number of goals.
* Benchmark Mean Reward: 0.8
## [Tennis](https://youtu.be/RDaIh7JX6RI)
## Tennis
![Tennis](images/tennis.png)

* Recommended Minimum: 0.2
* Recommended Maximum: 5
## [Push Block](https://youtu.be/jKdw216ZgoE)
## Push Block
![Push](images/push.png)

* Recommended Maximum: 2000
* Benchmark Mean Reward: 4.5
## [Wall Jump](https://youtu.be/NITLug2DIWQ)
## Wall Jump
![Wall](images/wall.png)

* Float Properties: Four
* Benchmark Mean Reward (Big & Small Wall): 0.8
## [Reacher](https://youtu.be/2N9EoF6pQyE)
## Reacher
![Reacher](images/reacher.png)

* Recommended Maximum: 3
* Benchmark Mean Reward: 30
## [Crawler](https://youtu.be/ftLliaeooYI)
## Crawler
![Crawler](images/crawler.png)

* Benchmark Mean Reward for `CrawlerStaticTarget`: 2000
* Benchmark Mean Reward for `CrawlerDynamicTarget`: 400
## [Food Collector](https://youtu.be/heVMs3t9qSk)
## Food Collector
![Collector](images/foodCollector.png)

* Recommended Maximum: 5
* Benchmark Mean Reward: 10
## [Hallway](https://youtu.be/53GyfpPQRUQ)
## Hallway
![Hallway](images/hallway.png)

* Benchmark Mean Reward: 0.7
* To speed up training, you can enable curiosity by adding the `curiosity` reward signal in `config/trainer_config.yaml`
## [Bouncer](https://youtu.be/Tkv-c-b1b2I)
## Bouncer
![Bouncer](images/bouncer.png)

* Recommended Maximum: 250
* Benchmark Mean Reward: 10
## [Soccer Twos](https://youtu.be/Hg3nmYD3DjQ)
## Soccer Twos
![SoccerTwos](images/soccer.png)

docs/ML-Agents-Overview.md (5 changes)


training the Python API uses the observations it receives to learn a TensorFlow
model. This model is then embedded within the Agent during inference.
The
[Getting Started with the 3D Balance Ball Example](Getting-Started-with-Balance-Ball.md)
The [Getting Started Guide](Getting-Started.md)
tutorial covers this training mode with the **3D Balance Ball** sample environment.
### Custom Training and Inference

To help you use ML-Agents, we've created several in-depth tutorials for
[installing ML-Agents](Installation.md),
[getting started](Getting-Started-with-Balance-Ball.md) with the 3D Balance Ball
[getting started](Getting-Started.md) with the 3D Balance Ball
environment (one of our many
[sample environments](Learning-Environment-Examples.md)) and
[making your own environment](Learning-Environment-Create-New.md).

docs/Python-API.md (2 changes)


- `worker_id` indicates which port to use for communication with the
environment. For use in parallel training regimes such as A3C.
- `seed` indicates the seed to use when generating random numbers during the
training process. In environments which do not involve physics calculations,
training process. In environments which are deterministic,
setting the seed enables reproducible experimentation by ensuring that the
environment and trainers utilize the same random seed.
- `side_channels` provides a way to exchange data with the Unity simulation that

docs/Readme.md (4 changes)


* [Installation](Installation.md)
* [Background: Jupyter Notebooks](Background-Jupyter.md)
* [Using Virtual Environment](Using-Virtual-Environment.md)
* [Basic Guide](Basic-Guide.md)
* [Getting Started Guide](Getting-Started.md)
* [Getting Started with the 3D Balance Ball Environment](Getting-Started-with-Balance-Ball.md)
* [Example Environments](Learning-Environment-Examples.md)
## Creating Learning Environments

* [Designing Agents](Learning-Environment-Design-Agents.md)
* [Learning Environment Best Practices](Learning-Environment-Best-Practices.md)
### Advanced Usage
* [Using the Monitor](Feature-Monitor.md)

docs/Training-PPO.md (2 changes)


To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
[here](./Getting-Started.md#observing-training-progress).
### Cumulative Reward

docs/Training-SAC.md (2 changes)


To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
[here](./Getting-Started.md#observing-training-progress).
### Cumulative Reward

docs/Training-Self-Play.md (2 changes)


To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
[here](./Getting-Started.md#observing-training-progress).
### ELO
In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively.

docs/Using-Docker.md (2 changes)


with specific flags, building a Docker container and, finally, running the
container. If you are not familiar with building a Unity environment for
ML-Agents, please read through our [Getting Started with the 3D Balance Ball
Example](Getting-Started-with-Balance-Ball.md) guide first.
Example](Getting-Started.md) guide first.
### Build the Environment (Optional)

ml-agents/mlagents/trainers/agent_processor.py (4 changes)


self.experience_buffers[global_id] = []
if curr_agent_step.done:
self.stats_reporter.add_stat(
"Environment/Cumulative Reward",
self.episode_rewards.get(global_id, 0),
)
self.stats_reporter.add_stat(
"Environment/Episode Length",
self.episode_steps.get(global_id, 0),
)

ml-agents/mlagents/trainers/ghost/trainer.py (42 changes)


from mlagents.trainers.trainer import Trainer
from mlagents.trainers.trajectory import Trajectory
from mlagents.trainers.agent_processor import AgentManagerQueue
from mlagents.trainers.stats import StatsPropertyType
from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers
logger = logging.getLogger("mlagents.trainers")

self.learning_policy_queues: Dict[str, AgentManagerQueue[Policy]] = {}
# assign ghost's stats collection to wrapped trainer's
self.stats_reporter = self.trainer.stats_reporter
self._stats_reporter = self.trainer.stats_reporter
# Set the logging to print ELO in the console
self._stats_reporter.add_property(StatsPropertyType.SELF_PLAY, True)
self_play_parameters = trainer_parameters["self_play"]
self.window = self_play_parameters.get("window", 10)

"""
return self.trainer.reward_buffer
def _write_summary(self, step: int) -> None:
"""
Saves training statistics to Tensorboard.
"""
opponents = np.array(self.policy_elos, dtype=np.float32)
logger.info(
" Learning brain {} ELO: {:0.3f}\n"
"Mean Opponent ELO: {:0.3f}"
" Std Opponent ELO: {:0.3f}".format(
self.learning_behavior_name,
self.current_elo,
opponents.mean(),
opponents.std(),
)
)
self.stats_reporter.add_stat("ELO", self.current_elo)
def _process_trajectory(self, trajectory: Trajectory) -> None:
if trajectory.done_reached and not trajectory.max_step_reached:
# Assumption is that final reward is 1/.5/0 for win/draw/loss

)
self.current_elo += change
self.policy_elos[self.current_opponent] -= change
def _is_ready_update(self) -> bool:
return False
def _update_policy(self) -> None:
pass
opponents = np.array(self.policy_elos, dtype=np.float32)
self._stats_reporter.add_stat("Self-play/ELO", self.current_elo)
self._stats_reporter.add_stat(
"Self-play/Mean Opponent ELO", opponents.mean()
)
self._stats_reporter.add_stat("Self-play/Std Opponent ELO", opponents.std())
def advance(self) -> None:
"""

pass
self.next_summary_step = self.trainer.next_summary_step
self._maybe_write_summary(self.get_step)
for internal_q in self.internal_policy_queues:
# Get policies that correspond to the policy queue in question

self.trainer.add_policy(name_behavior_id, policy)
self._save_snapshot(policy) # Need to save after trainer initializes policy
self.learning_behavior_name = name_behavior_id
behavior_id_parsed = BehaviorIdentifiers.from_name_behavior_id(
self.learning_behavior_name
)
team_id = behavior_id_parsed.behavior_ids["team"]
self._stats_reporter.add_property(StatsPropertyType.SELF_PLAY_TEAM, team_id)
else:
# for saving/swapping snapshots
policy.init_load_weights()

ml-agents/mlagents/trainers/learn.py (9 changes)


env_path: Optional[str],
docker_target_name: Optional[str],
no_graphics: bool,
seed: Optional[int],
seed: int,
start_port: int,
env_args: Optional[List[str]],
) -> Callable[[int, List[SideChannel]], BaseEnv]:

# container.
# Navigate in docker path and find env_path and copy it.
env_path = prepare_for_docker_run(docker_target_name, env_path)
seed_count = 10000
seed_pool = [np.random.randint(0, seed_count) for _ in range(seed_count)]
env_seed = seed
if not env_seed:
env_seed = seed_pool[worker_id % len(seed_pool)]
# Make sure that each environment gets a different seed
env_seed = seed + worker_id
return UnityEnvironment(
file_name=env_path,
worker_id=worker_id,

ml-agents/mlagents/trainers/ppo/trainer.py (11 changes)


super()._process_trajectory(trajectory)
agent_id = trajectory.agent_id # All the agents should have the same ID
# Add to episode_steps
self.episode_steps[agent_id] += len(trajectory.steps)
agent_buffer_trajectory = trajectory.to_agentbuffer()
# Update the normalization
if self.is_training:

)
for name, v in value_estimates.items():
agent_buffer_trajectory["{}_value_estimates".format(name)].extend(v)
self.stats_reporter.add_stat(
self._stats_reporter.add_stat(
self.optimizer.reward_signals[name].value_name, np.mean(v)
)

batch_update_stats[stat_name].append(value)
for stat, stat_list in batch_update_stats.items():
self.stats_reporter.add_stat(stat, np.mean(stat_list))
self._stats_reporter.add_stat(stat, np.mean(stat_list))
self.stats_reporter.add_stat(stat, val)
self.clear_update_buffer()
self._stats_reporter.add_stat(stat, val)
self._clear_update_buffer()
def create_policy(self, brain_parameters: BrainParameters) -> TFPolicy:
"""

ml-agents/mlagents/trainers/sac/trainer.py (11 changes)


last_step = trajectory.steps[-1]
agent_id = trajectory.agent_id # All the agents should have the same ID
# Add to episode_steps
self.episode_steps[agent_id] += len(trajectory.steps)
agent_buffer_trajectory = trajectory.to_agentbuffer()
# Update the normalization

agent_buffer_trajectory, trajectory.next_obs, trajectory.done_reached
)
for name, v in value_estimates.items():
self.stats_reporter.add_stat(
self._stats_reporter.add_stat(
self.optimizer.reward_signals[name].value_name, np.mean(v)
)

)
for stat, stat_list in batch_update_stats.items():
self.stats_reporter.add_stat(stat, np.mean(stat_list))
self._stats_reporter.add_stat(stat, np.mean(stat_list))
self.stats_reporter.add_stat(stat, val)
self._stats_reporter.add_stat(stat, val)
def update_reward_signals(self) -> None:
"""

for stat_name, value in update_stats.items():
batch_update_stats[stat_name].append(value)
for stat, stat_list in batch_update_stats.items():
self.stats_reporter.add_stat(stat, np.mean(stat_list))
self._stats_reporter.add_stat(stat, np.mean(stat_list))
def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None:
"""

ml-agents/mlagents/trainers/stats.py (27 changes)


class StatsPropertyType(Enum):
HYPERPARAMETERS = "hyperparameters"
SELF_PLAY = "selfplay"
SELF_PLAY_TEAM = "selfplayteam"
class StatsWriter(abc.ABC):

class ConsoleWriter(StatsWriter):
def __init__(self):
self.training_start_time = time.time()
# If self-play, we want to print ELO as well as reward
self.self_play = False
self.self_play_team = -1
def write_stats(
self, category: str, values: Dict[str, StatsSummary], step: int

stats_summary = values["Is Training"]
if stats_summary.mean > 0.0:
is_training = "Training."
if "Environment/Cumulative Reward" in values:
stats_summary = values["Environment/Cumulative Reward"]
logger.info(

is_training,
)
)
if self.self_play and "Self-play/ELO" in values:
elo_stats = values["Self-play/ELO"]
mean_opponent_elo = values["Self-play/Mean Opponent ELO"]
std_opponent_elo = values["Self-play/Std Opponent ELO"]
logger.info(
"{} Team {}: ELO: {:0.3f}. "
"Mean Opponent ELO: {:0.3f}. "
"Std Opponent ELO: {:0.3f}. ".format(
category,
self.self_play_team,
elo_stats.mean,
mean_opponent_elo.mean,
std_opponent_elo.mean,
)
)
else:
logger.info(
"{}: Step: {}. No episode was completed since last summary. {}".format(

category, self._dict_to_str(value, 0)
)
)
elif property_type == StatsPropertyType.SELF_PLAY:
assert isinstance(value, bool)
self.self_play = value
elif property_type == StatsPropertyType.SELF_PLAY_TEAM:
assert isinstance(value, int)
self.self_play_team = value
def _dict_to_str(self, param_dict: Dict[str, Any], num_tabs: int) -> str:
"""

ml-agents/mlagents/trainers/tests/test_learn.py (3 changes)


from mlagents.trainers.trainer_controller import TrainerController
from mlagents.trainers.learn import parse_command_line
from mlagents_envs.exception import UnityEnvironmentException
from mlagents.trainers.stats import StatsReporter
def basic_options(extra_args=None):

sampler_manager_mock.return_value,
None,
)
StatsReporter.writers.clear() # make sure there aren't any writers as added by learn.py
@patch("mlagents.trainers.learn.SamplerManager")

mock_init.assert_called_once()
assert mock_init.call_args[0][1] == "/dockertarget/models/ppo"
assert mock_init.call_args[0][2] == "/dockertarget/summaries"
StatsReporter.writers.clear() # make sure there aren't any writers as added by learn.py
def test_bad_env_path():

ml-agents/mlagents/trainers/tests/test_rl_trainer.py (7 changes)


def test_rl_trainer():
trainer = create_rl_trainer()
agent_id = "0"
trainer.episode_steps[agent_id] = 3
for agent_id in trainer.episode_steps:
assert trainer.episode_steps[agent_id] == 0
for rewards in trainer.collected_rewards.values():
for agent_id in rewards:
assert rewards[agent_id] == 0

trainer = create_rl_trainer()
trainer.update_buffer = construct_fake_buffer(0)
trainer.clear_update_buffer()
trainer._clear_update_buffer()
@mock.patch("mlagents.trainers.trainer.rl_trainer.RLTrainer.clear_update_buffer")
@mock.patch("mlagents.trainers.trainer.rl_trainer.RLTrainer._clear_update_buffer")
def test_advance(mocked_clear_update_buffer):
trainer = create_rl_trainer()
trajectory_queue = AgentManagerQueue("testbrain")

ml-agents/mlagents/trainers/tests/test_stats.py (27 changes)


self.assertIn("Hyperparameters for behavior name", cm.output[2])
self.assertIn("example:\t1.0", cm.output[2])
def test_selfplay_console_writer(self):
with self.assertLogs("mlagents.trainers", level="INFO") as cm:
category = "category1"
console_writer = ConsoleWriter()
console_writer.add_property(category, StatsPropertyType.SELF_PLAY, True)
console_writer.add_property(category, StatsPropertyType.SELF_PLAY_TEAM, 1)
statssummary1 = StatsSummary(mean=1.0, std=1.0, num=1)
console_writer.write_stats(
category,
{
"Environment/Cumulative Reward": statssummary1,
"Is Training": statssummary1,
"Self-play/ELO": statssummary1,
"Self-play/Mean Opponent ELO": statssummary1,
"Self-play/Std Opponent ELO": statssummary1,
},
10,
)
self.assertIn(
"Mean Reward: 1.000. Std of Reward: 1.000. Training.", cm.output[0]
)
self.assertIn(
"category1 Team 1: ELO: 1.000. Mean Opponent ELO: 1.000. Std Opponent ELO: 1.000.",
cm.output[1],
)

ml-agents/mlagents/trainers/trainer/rl_trainer.py (102 changes)


# # Unity ML-Agents Toolkit
from typing import Dict
from typing import Dict, List
import abc
from mlagents.trainers.optimizer.tf_optimizer import TFOptimizer
from mlagents.trainers.buffer import AgentBuffer

from mlagents_envs.timers import hierarchical_timer
from mlagents.trainers.agent_processor import AgentManagerQueue
from mlagents.trainers.trajectory import Trajectory
from mlagents.trainers.stats import StatsPropertyType
RewardSignalResults = Dict[str, RewardSignalResult]

# collected_rewards is a dictionary from name of reward signal to a dictionary of agent_id to cumulative reward
# used for reporting only. We always want to report the environment reward to Tensorboard, regardless
# of what reward signals are actually present.
self.cumulative_returns_since_policy_update: List[float] = []
self.episode_steps: Dict[str, int] = defaultdict(lambda: 0)
self._stats_reporter.add_property(
StatsPropertyType.HYPERPARAMETERS, self.trainer_parameters
)
def end_episode(self) -> None:
"""

for agent_id in self.episode_steps:
self.episode_steps[agent_id] = 0
self.episode_steps[agent_id] = 0
self.stats_reporter.add_stat(
"Environment/Cumulative Reward", rewards.get(agent_id, 0)
)
self.cumulative_returns_since_policy_update.append(
rewards.get(agent_id, 0)
)

)
rewards[agent_id] = 0
def clear_update_buffer(self) -> None:
def _clear_update_buffer(self) -> None:
@abc.abstractmethod
def _is_ready_update(self):
"""
Returns whether or not the trainer has enough elements to run update model
:return: A boolean corresponding to whether or not update_model() can be run
"""
return False
@abc.abstractmethod
def _update_policy(self):
"""
Uses demonstration_buffer to update model.
"""
pass
def _increment_step(self, n_steps: int, name_behavior_id: str) -> None:
"""
Increment the step count of the trainer
:param n_steps: number of steps to increment the step count by
"""
self.step += n_steps
self.next_summary_step = self._get_next_summary_step()
p = self.get_policy(name_behavior_id)
if p:
p.increment_step(n_steps)
def _get_next_summary_step(self) -> int:
"""
Get the next step count that should result in a summary write.
"""
return self.step + (self.summary_freq - self.step % self.summary_freq)
def _write_summary(self, step: int) -> None:
"""
Saves training statistics to Tensorboard.
"""
self.stats_reporter.add_stat("Is Training", float(self.should_still_train))
self.stats_reporter.write_stats(int(step))
@abc.abstractmethod
def _process_trajectory(self, trajectory: Trajectory) -> None:
"""
Takes a trajectory and processes it, putting it into the update buffer.
:param trajectory: The Trajectory tuple containing the steps to be processed.
"""
self._maybe_write_summary(self.get_step + len(trajectory.steps))
self._increment_step(len(trajectory.steps), trajectory.behavior_id)
def _maybe_write_summary(self, step_after_process: int) -> None:
"""
If processing the trajectory will make the step exceed the next summary write,
write the summary. This logic ensures summaries are written on the update step and not in between.
:param step_after_process: the step count after processing the next trajectory.
"""
if step_after_process >= self.next_summary_step and self.get_step != 0:
self._write_summary(self.next_summary_step)
Steps the trainer, taking in trajectories and updates if ready
Steps the trainer, taking in trajectories and updates if ready.
super().advance()
if not self.should_still_train:
self.clear_update_buffer()
with hierarchical_timer("process_trajectory"):
for traj_queue in self.trajectory_queues:
# We grab at most the maximum length of the queue.
# This ensures that even if the queue is being filled faster than it is
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):
try:
t = traj_queue.get_nowait()
self._process_trajectory(t)
except AgentManagerQueue.Empty:
break
if self.should_still_train:
if self._is_ready_update():
with hierarchical_timer("_update_policy"):
self._update_policy()
for q in self.policy_queues:
# Get policies that correspond to the policy queue in question
q.put(self.get_policy(q.behavior_id))
else:
self._clear_update_buffer()

ml-agents/mlagents/trainers/trainer/trainer.py (106 changes)


# # Unity ML-Agents Toolkit
import logging
from typing import Dict, List, Deque, Any
import time
import abc
from collections import deque

from mlagents.trainers.stats import StatsReporter, StatsPropertyType
from mlagents.trainers.stats import StatsReporter
from mlagents_envs.timers import hierarchical_timer
logger = logging.getLogger("mlagents.trainers")

self.run_id = run_id
self.trainer_parameters = trainer_parameters
self.summary_path = trainer_parameters["summary_path"]
self.stats_reporter = StatsReporter(self.summary_path)
self.cumulative_returns_since_policy_update: List[float] = []
self._stats_reporter = StatsReporter(self.summary_path)
self.training_start_time = time.time()
self.stats_reporter.add_property(
StatsPropertyType.HYPERPARAMETERS, self.trainer_parameters
)
@property
def stats_reporter(self):
"""
Returns the stats reporter associated with this Trainer.
"""
return self._stats_reporter
def _check_param_keys(self):
for k in self.param_keys:

"""
return self._reward_buffer
def _increment_step(self, n_steps: int, name_behavior_id: str) -> None:
"""
Increment the step count of the trainer
:param n_steps: number of steps to increment the step count by
"""
self.step += n_steps
self.next_summary_step = self._get_next_summary_step()
p = self.get_policy(name_behavior_id)
if p:
p.increment_step(n_steps)
def _get_next_summary_step(self) -> int:
"""
Get the next step count that should result in a summary write.
"""
return self.step + (self.summary_freq - self.step % self.summary_freq)
def save_model(self, name_behavior_id: str) -> None:
"""
Saves the model

settings = SerializationSettings(policy.model_path, policy.brain.brain_name)
export_policy_model(settings, policy.graph, policy.sess)
def _write_summary(self, step: int) -> None:
"""
Saves training statistics to Tensorboard.
"""
self.stats_reporter.add_stat("Is Training", float(self.should_still_train))
self.stats_reporter.write_stats(int(step))
@abc.abstractmethod
def _process_trajectory(self, trajectory: Trajectory) -> None:
"""
Takes a trajectory and processes it, putting it into the update buffer.
:param trajectory: The Trajectory tuple containing the steps to be processed.
"""
self._maybe_write_summary(self.get_step + len(trajectory.steps))
self._increment_step(len(trajectory.steps), trajectory.behavior_id)
def _maybe_write_summary(self, step_after_process: int) -> None:
"""
If processing the trajectory will make the step exceed the next summary write,
write the summary. This logic ensures summaries are written on the update step and not in between.
:param step_after_process: the step count after processing the next trajectory.
"""
if step_after_process >= self.next_summary_step and self.get_step != 0:
self._write_summary(self.next_summary_step)
@abc.abstractmethod
def end_episode(self):
"""

@abc.abstractmethod
def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None:
"""
Adds policy to trainer
Adds policy to trainer.
"""
pass

Gets policy from trainer
Gets policy from trainer.
def _is_ready_update(self):
def advance(self) -> None:
Returns whether or not the trainer has enough elements to run update model
:return: A boolean corresponding to whether or not update_model() can be run
"""
return False
@abc.abstractmethod
def _update_policy(self):
"""
Uses demonstration_buffer to update model.
Advances the trainer. Typically, this means grabbing trajectories
from all subscribed trajectory queues (self.trajectory_queues), and updating
a policy using the steps in them, and if needed pushing a new policy onto the right
policy queues (self.policy_queues).
def advance(self) -> None:
"""
Steps the trainer, taking in trajectories and updates if ready.
"""
with hierarchical_timer("process_trajectory"):
for traj_queue in self.trajectory_queues:
# We grab at most the maximum length of the queue.
# This ensures that even if the queue is being filled faster than it is
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):
try:
t = traj_queue.get_nowait()
self._process_trajectory(t)
except AgentManagerQueue.Empty:
break
if self.should_still_train:
if self._is_ready_update():
with hierarchical_timer("_update_policy"):
self._update_policy()
for q in self.policy_queues:
# Get policies that correspond to the policy queue in question
q.put(self.get_policy(q.behavior_id))
:param queue: Policy queue to publish to.
:param policy_queue: Policy queue to publish to.
"""
self.policy_queues.append(policy_queue)

"""
Adds a trajectory queue to the list of queues for the trainer to ingest Trajectories from.
:param queue: Trajectory queue to publish to.
:param trajectory_queue: Trajectory queue to read from.
"""
self.trajectory_queues.append(trajectory_queue)

docs/Getting-Started.md (363 changes)


# Getting Started Guide
This guide walks through the end-to-end process of opening an ML-Agents
toolkit example environment in Unity, building the Unity executable, training an
Agent in it, and finally embedding the trained model into the Unity environment.
The ML-Agents toolkit includes a number of [example
environments](Learning-Environment-Examples.md) which you can examine to help
understand the different ways in which the ML-Agents toolkit can be used. These
environments can also serve as templates for new environments or as ways to test
new ML algorithms. After reading this tutorial, you should be able to explore
and train the example environments.
If you are not familiar with the [Unity Engine](https://unity3d.com/unity), we
highly recommend the [Roll-a-ball
tutorial](https://unity3d.com/learn/tutorials/s/roll-ball-tutorial) to learn all
the basic concepts first.
![3D Balance Ball](images/balance.png)
This guide uses the **3D Balance Ball** environment to teach the basic concepts and
usage patterns of ML-Agents. 3D Balance Ball
contains a number of agent cubes and balls (which are all copies of each other).
Each agent cube tries to keep its ball from falling by rotating either
horizontally or vertically. In this environment, an agent cube is an **Agent** that
receives a reward for every step that it balances the ball. An agent is also
penalized with a negative reward for dropping the ball. The goal of the training
process is to have the agents learn to balance the ball on their head.
Let's get started!
## Installation
In order to install and set up the ML-Agents toolkit, the Python dependencies
and Unity, see the [installation instructions](Installation.md).
Depending on your version of Unity, it may be necessary to change the **Scripting Runtime Version** of your project. This can be done as follows:
1. Launch Unity
2. On the Projects dialog, choose the **Open** option at the top of the window.
3. Using the file dialog that opens, locate the `Project` folder
within the ML-Agents toolkit project and click **Open**.
4. Go to **Edit** > **Project Settings** > **Player**
5. For **each** of the platforms you target (**PC, Mac and Linux Standalone**,
**iOS** or **Android**):
1. Expand the **Other Settings** section.
2. Set **Scripting Runtime Version** to **Experimental (.NET 4.6
Equivalent or .NET 4.x Equivalent)**
6. Go to **File** > **Save Project**
## Understanding a Unity Environment
An agent is an autonomous actor that observes and interacts with an
_environment_. In the context of Unity, an environment is a scene containing
one or more Agent objects, and, of course, the other
entities that an agent interacts with.
![Unity Editor](images/mlagents-3DBallHierarchy.png)
**Note:** In Unity, the base object of everything in a scene is the
_GameObject_. The GameObject is essentially a container for everything else,
including behaviors, graphics, physics, etc. To see the components that make up
a GameObject, select the GameObject in the Scene window, and open the Inspector
window. The Inspector shows every component on a GameObject.
The first thing you may notice after opening the 3D Balance Ball scene is that
it contains not one, but several agent cubes. Each agent cube in the scene is an
independent agent, but they all share the same Behavior. 3D Balance Ball does this
to speed up training since all twelve agents contribute to training in parallel.
### Agent
The Agent is the actor that observes and takes actions in the environment. In
the 3D Balance Ball environment, the Agent components are placed on the twelve
"Agent" GameObjects. The base Agent object has a few properties that affect its
behavior:
* **Behavior Parameters** — Every Agent must have a Behavior. The Behavior
determines how an Agent makes decisions. More on Behavior Parameters in
the next section.
* **Max Step** — Defines how many simulation steps can occur before the Agent's
episode ends. In 3D Balance Ball, an Agent restarts after 5000 steps.
When you create an Agent, you must extend the base Agent class.
The Ball3DAgent subclass defines the following methods:
* `Agent.OnEpisodeBegin()` — Called at the beginning of an Agent's episode, including at the beginning
of the simulation. The Ball3DAgent class uses this function to reset the
agent cube and ball to their starting positions. The function randomizes the reset values so that the
training generalizes to more than a specific starting position and agent cube
attitude.
* `Agent.CollectObservations(VectorSensor sensor)` — Called every simulation step. Responsible for
collecting the Agent's observations of the environment. Since the Behavior
Parameters of the Agent are set with vector observation
space with a state size of 8, the `CollectObservations(VectorSensor sensor)` method must call
`VectorSensor.AddObservation()` such that vector size adds up to 8.
* `Agent.OnActionReceived()` — Called every time the Agent receives an action to take. Receives the action chosen
by the Agent. The vector action spaces result in a
small change in the agent cube's rotation at each step. The `OnActionReceived()` method
assigns a reward to the Agent; in this example, an Agent receives a small
positive reward for each step it keeps the ball on the agent cube's head and a larger,
negative reward for dropping the ball. An Agent's episode is also ended when it
drops the ball so that it will reset with a new ball for the next simulation
step.
* `Agent.Heuristic()` - When the `Behavior Type` is set to `Heuristic Only` in the Behavior
Parameters of the Agent, the Agent will use the `Heuristic()` method to generate
the actions of the Agent. As such, the `Heuristic()` method returns an array of
floats. In the case of the Ball 3D Agent, the `Heuristic()` method converts the
keyboard inputs into actions.
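A minimal sketch of what such a `Heuristic()` override could look like inside the agent class (the exact key mapping and signs used by Ball3DAgent may differ):

```csharp
public override float[] Heuristic()
{
    var action = new float[2];
    // Map keyboard axes to the two continuous rotation actions.
    action[0] = -Input.GetAxis("Horizontal");
    action[1] = Input.GetAxis("Vertical");
    return action;
}
```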
#### Behavior Parameters : Vector Observation Space
Before making a decision, an agent collects its observation about its state in
the world. The vector observation is a vector of floating point numbers which
contain relevant information for the agent to make decisions.
The Behavior Parameters of the 3D Balance Ball example uses a **Space Size** of 8.
This means that the feature
vector containing the Agent's observations contains eight elements: the `x` and
`z` components of the agent cube's rotation and the `x`, `y`, and `z` components
of the ball's relative position and velocity. (The observation values are
defined in the Agent's `CollectObservations(VectorSensor sensor)` method.)
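A rough approximation of such a method is shown below (this is not the actual Ball3DAgent source; the `ball` and `ballRigidbody` fields are assumed for illustration):

```csharp
public override void CollectObservations(VectorSensor sensor)
{
    // 2 floats: the z and x components of the agent cube's rotation
    sensor.AddObservation(gameObject.transform.rotation.z);
    sensor.AddObservation(gameObject.transform.rotation.x);
    // 3 floats: the ball's position relative to the cube
    sensor.AddObservation(ball.transform.position - gameObject.transform.position);
    // 3 floats: the ball's velocity
    sensor.AddObservation(ballRigidbody.velocity);
    // Total: 8 values, matching the Space Size of 8
}
```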
#### Behavior Parameters : Vector Action Space
An Agent is given instructions in the form of a float array of *actions*.
ML-Agents toolkit classifies actions into two types: the **Continuous** vector
action space is a vector of numbers that can vary continuously. What each
element of the vector means is defined by the Agent logic (the training
process just learns what values are better given particular state observations
based on the rewards received when it tries different values). For example, an
element might represent a force or torque applied to a `Rigidbody` in the Agent.
The **Discrete** action vector space defines its actions as tables. An action
given to the Agent is an array of indices into tables.
The 3D Balance Ball example is programmed to use continuous action
space with `Space Size` of 2.
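As an illustrative sketch (the rotation scale and reward values are arbitrary, not the shipped Ball3DAgent code), the two continuous actions could be applied like this:

```csharp
public override void OnActionReceived(float[] vectorAction)
{
    // Two continuous actions, each roughly in [-1, 1], drive small rotations.
    var actionZ = 2f * Mathf.Clamp(vectorAction[0], -1f, 1f);
    var actionX = 2f * Mathf.Clamp(vectorAction[1], -1f, 1f);
    gameObject.transform.Rotate(new Vector3(0, 0, 1), actionZ);
    gameObject.transform.Rotate(new Vector3(1, 0, 0), actionX);

    // Small positive reward for keeping the ball balanced; dropping the ball
    // would assign a negative reward and call EndEpisode() (omitted here).
    AddReward(0.1f);
}
```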
## Running a pre-trained model
We include pre-trained models for our agents (`.nn` files) and we use the
[Unity Inference Engine](Unity-Inference-Engine.md) to run these models
inside Unity. In this section, we will use the pre-trained model for the
3D Ball example.
1. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Scenes` folder
and open the `3DBall` scene file.
2. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Prefabs` folder.
Expand `3DBall` and click on the `Agent` prefab. You should see the `Agent` prefab in the **Inspector** window.
**Note**: The platforms in the `3DBall` scene were created using the `3DBall` prefab. Instead of updating all 12 platforms individually, you can update the `3DBall` prefab instead.
![Platform Prefab](images/platform_prefab.png)
3. In the **Project** window, drag the **3DBall** Model located in
`Assets/ML-Agents/Examples/3DBall/TFModels` into the `Model` property under `Behavior Parameters (Script)` component in the Agent GameObject **Inspector** window.
![3dball learning brain](images/3dball_learning_brain.png)
4. You should notice that each `Agent` under each `3DBall` in the **Hierarchy** windows now contains **3DBall** as `Model` on the `Behavior Parameters`. __Note__ : You can modify multiple game objects in a scene by selecting them all at
once using the search bar in the Scene Hierarchy.
5. Select the **InferenceDevice** to use for this model (CPU or GPU) on the Agent.
_Note: CPU is faster for the majority of ML-Agents toolkit generated models_
6. Click the **Play** button and you will see the platforms balance the balls
using the pre-trained model.
## Training a new model with Reinforcement Learning
While we provide pre-trained `.nn` files for the agents in this environment, any environment you make yourself will require training agents from scratch to generate a new model file. We can do this using reinforcement learning.
In order to train an agent to correctly balance the ball, we provide two
deep reinforcement learning algorithms.
The default algorithm is Proximal Policy Optimization (PPO). This
is a method that has been shown to be more general purpose and stable
than many other RL algorithms. For more information on PPO, OpenAI
has a [blog post](https://blog.openai.com/openai-baselines-ppo/)
explaining it, and [our page](Training-PPO.md) for how to use it in training.
We also provide Soft-Actor Critic, an off-policy algorithm that
has been shown to be both stable and sample-efficient.
For more information on SAC, see UC Berkeley's
[blog post](https://bair.berkeley.edu/blog/2018/12/14/sac/) and
[our page](Training-SAC.md) for more guidance on when to use SAC vs. PPO. To
use SAC to train Balance Ball, replace all references to `config/trainer_config.yaml`
with `config/sac_trainer_config.yaml` below.
To train the agents within the Balance Ball environment, we will be using the
ML-Agents Python package. We have provided a convenient command called `mlagents-learn`
which accepts arguments used to configure both training and inference phases.
### Training the environment
1. Open a command or terminal window.
2. Navigate to the folder where you cloned the ML-Agents toolkit repository.
**Note**: If you followed the default [installation](Installation.md), then
you should be able to run `mlagents-learn` from any directory.
3. Run `mlagents-learn <trainer-config-path> --run-id=<run-identifier> --train`
where:
- `<trainer-config-path>` is the relative or absolute filepath of the
trainer configuration. The defaults used by example environments included
in `MLAgentsSDK` can be found in `config/trainer_config.yaml`.
- `<run-identifier>` is a string used to separate the results of different
training runs
- `--train` tells `mlagents-learn` to run a training session (rather
than inference)
4. If you cloned the ML-Agents repo, then you can simply run
```sh
mlagents-learn config/trainer_config.yaml --run-id=firstRun --train
```
5. When the message _"Start training by pressing the Play button in the Unity
Editor"_ is displayed on the screen, you can press the :arrow_forward: button
in Unity to start training in the Editor.
**Note**: If you're using Anaconda, don't forget to activate the ml-agents
environment first.
The `--train` flag tells the ML-Agents toolkit to run in training mode.
You can optionally add `--time-scale=100` to the command above to raise Unity's `Time.timeScale` and speed up the simulation during training.
**Note**: You can train using an executable rather than the Editor. To do so,
follow the instructions in
[Using an Executable](Learning-Environment-Executable.md).
**Note**: Re-running this command will start training from scratch again. To resume
a previous training run, append the `--load` flag and give the same `--run-id` as the
run you want to resume.
If `mlagents-learn` runs correctly and starts training, you should see something
like this:
```console
INFO:mlagents_envs:
'Ball3DAcademy' started successfully!
Unity Academy name: Ball3DAcademy
INFO:mlagents_envs:Connected new brain:
Unity brain name: 3DBallLearning
Number of Visual Observations (per agent): 0
Vector Observation space size (per agent): 8
Number of stacked Vector Observation: 1
Vector Action space type: continuous
Vector Action space size (per agent): [2]
Vector Action descriptions: ,
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain 3DBallLearning:
batch_size: 64
beta: 0.001
buffer_size: 12000
epsilon: 0.2
gamma: 0.995
hidden_units: 128
lambd: 0.99
learning_rate: 0.0003
max_steps: 5.0e4
normalize: True
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 1000
use_recurrent: False
summary_path: ./summaries/first-run-0
memory_size: 256
use_curiosity: False
curiosity_strength: 0.01
curiosity_enc_size: 128
model_path: ./models/first-run-0/3DBallLearning
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 1000. Mean Reward: 1.242. Std of Reward: 0.746. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 2000. Mean Reward: 1.319. Std of Reward: 0.693. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 3000. Mean Reward: 1.804. Std of Reward: 1.056. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 4000. Mean Reward: 2.151. Std of Reward: 1.432. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 5000. Mean Reward: 3.175. Std of Reward: 2.250. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 6000. Mean Reward: 4.898. Std of Reward: 4.019. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 7000. Mean Reward: 6.716. Std of Reward: 5.125. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 8000. Mean Reward: 12.124. Std of Reward: 11.929. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 9000. Mean Reward: 18.151. Std of Reward: 16.871. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 10000. Mean Reward: 27.284. Std of Reward: 28.667. Training.
```
### Observing Training Progress
Once you start training using `mlagents-learn` in the way described in the
previous section, the `ml-agents` directory will contain a `summaries`
directory. In order to observe the training process in more detail, you can use
TensorBoard. From the command line run:
```sh
tensorboard --logdir=summaries
```
Then navigate to `localhost:6006` in your browser.
From TensorBoard, you will see the summary statistics:
* **Lesson** - only interesting when performing [curriculum
training](Training-Curriculum-Learning.md). This is not used in the 3D Balance
Ball environment.
* **Cumulative Reward** - The mean cumulative episode reward over all agents. Should
increase during a successful training session.
* **Entropy** - How random the decisions of the model are. Should slowly decrease
during a successful training process. If it decreases too quickly, the `beta`
hyperparameter should be increased.
* **Episode Length** - The mean length of each episode in the environment for all
agents.
* **Learning Rate** - How large a step the training algorithm takes as it searches
for the optimal policy. Should decrease over time.
* **Policy Loss** - The mean loss of the policy function update. Correlates to how
much the policy (process for deciding actions) is changing. The magnitude of
this should decrease during a successful training session.
* **Value Estimate** - The mean value estimate for all states visited by the agent.
Should increase during a successful training session.
* **Value Loss** - The mean loss of the value function update. Correlates to how
well the model is able to predict the value of each state. This should
decrease during a successful training session.
![Example TensorBoard Run](images/mlagents-TensorBoard.png)
## Embedding the model into the Unity Environment
Once the training process completes and saves the model
(denoted by the `Saved Model` message), you can add it to the Unity project and
use it with compatible Agents (the Agents that generated the model).
__Note:__ Do not just close the Unity Window once the `Saved Model` message appears.
Either wait for the training process to close the window or press Ctrl+C at the
command-line prompt. If you close the window manually, the `.nn` file
containing the trained model is not exported into the ml-agents folder.
You can press Ctrl+C to stop the training, and your trained model will be at
`models/<run-identifier>/<behavior_name>.nn` where
`<behavior_name>` is the name of the `Behavior Name` of the agents corresponding to the model.
(**Note:** There is a known bug on Windows that can cause model saving to
fail if you terminate training early, so it is recommended to wait until Step
has reached the `max_steps` value you set in `trainer_config.yaml`.) This file
corresponds to your model's latest checkpoint. You can now embed this trained
model into your Agents by following the steps below, which are similar to
the steps described [above](#running-a-pre-trained-model).
1. Move your model file into
`Project/Assets/ML-Agents/Examples/3DBall/TFModels/`.
2. Open the Unity Editor, and select the **3DBall** scene as described above.
3. Select the **3DBall** prefab Agent object.
4. Drag the `<behavior_name>.nn` file from the Project window of
the Editor to the **Model** placeholder in the **Ball3DAgent**
inspector window.
5. Press the :arrow_forward: button at the top of the Editor.
## Next Steps
- For more information on the ML-Agents toolkit, in addition to helpful
background, check out the [ML-Agents Toolkit Overview](ML-Agents-Overview.md)
page.
- For a "Hello World" introduction to creating your own Learning Environment,
check out the [Making a New Learning
Environment](Learning-Environment-Create-New.md) page.
- For a series of YouTube video tutorials, check out the
[Machine Learning Agents PlayList](https://www.youtube.com/playlist?list=PLX2vGYjWbI0R08eWQkO7nQkGiicHAX7IX)
page.

61
ml-agents/tests/yamato/check_coverage_percent.py


from __future__ import print_function
import sys
import os
SUMMARY_XML_FILENAME = "Summary.xml"
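
# Usage: python check_coverage_percent.py <root-results-dir> <min-coverage-percent>
# Walks <root-results-dir> for a Summary.xml produced by the coverage run and exits
# with status 1 if the reported line coverage is below <min-coverage-percent>.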
# Note that this is python2 compatible, since that's currently what's installed on most CI images.
def check_coverage(root_dir, min_percentage):
    # Walk the root directory looking for the summary file that
    # is output by the code coverage checks. It's possible that
    # we'll need to refine this later in case there are multiple
    # such files.
    summary_xml = None
    for dirpath, _, filenames in os.walk(root_dir):
        if SUMMARY_XML_FILENAME in filenames:
            summary_xml = os.path.join(dirpath, SUMMARY_XML_FILENAME)
            break
    if not summary_xml:
        print("Couldn't find {} in root directory".format(SUMMARY_XML_FILENAME))
        sys.exit(1)

    with open(summary_xml) as f:
        # Rather than try to parse the XML, just look for a line of the form
        # <Linecoverage>73.9</Linecoverage>
        lines = f.readlines()
        for line in lines:
            if "Linecoverage" in line:
                pct = line.replace("<Linecoverage>", "").replace("</Linecoverage>", "")
                pct = float(pct)
                if pct < min_percentage:
                    print(
                        "Coverage {} is below the min percentage of {}.".format(
                            pct, min_percentage
                        )
                    )
                    sys.exit(1)
                else:
                    print(
                        "Coverage {} is above the min percentage of {}.".format(
                            pct, min_percentage
                        )
                    )
                    sys.exit(0)

    # Couldn't find the results in the file.
    print("Couldn't find Linecoverage in summary file")
    sys.exit(1)


def main():
    root_dir = sys.argv[1]
    min_percent = float(sys.argv[2])
    if min_percent > 0:
        # This allows us to set 0% coverage on 2018.4
        check_coverage(root_dir, min_percent)


if __name__ == "__main__":
    main()

202
docs/Basic-Guide.md


# Basic Guide
This guide will show you how to use a pre-trained model in an example Unity
environment (3D Ball) and show you how to train the model yourself.
If you are not familiar with the [Unity Engine](https://unity3d.com/unity), we
highly recommend the [Roll-a-ball
tutorial](https://unity3d.com/learn/tutorials/s/roll-ball-tutorial) to learn all
the basic concepts of Unity.
## Setting up the ML-Agents Toolkit within Unity
In order to use the ML-Agents toolkit within Unity, you first need to change a few
Unity settings.
1. Launch Unity
2. On the Projects dialog, choose the **Open** option at the top of the window.
3. Using the file dialog that opens, locate the `Project` folder
within the ML-Agents toolkit project and click **Open**.
4. Go to **Edit** > **Project Settings** > **Player**
5. For **each** of the platforms you target (**PC, Mac and Linux Standalone**,
**iOS** or **Android**):
1. Expand the **Other Settings** section.
   2. Set **Scripting Runtime Version** to **Experimental (.NET 4.6 Equivalent)**
      or **.NET 4.x Equivalent**
6. Go to **File** > **Save Project**
## Running a Pre-trained Model
We include pre-trained models for our agents (`.nn` files) and we use the
[Unity Inference Engine](Unity-Inference-Engine.md) to run these models
inside Unity. In this section, we will use the pre-trained model for the
3D Ball example.
1. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Scenes` folder
and open the `3DBall` scene file.
2. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Prefabs` folder.
Expand `3DBall` and click on the `Agent` prefab. You should see the `Agent` prefab in the **Inspector** window.
**Note**: The platforms in the `3DBall` scene were created using the `3DBall` prefab. Instead of updating all 12 platforms individually, you can simply update the `3DBall` prefab.
![Platform Prefab](images/platform_prefab.png)
3. In the **Project** window, drag the **3DBall** Model located in
`Assets/ML-Agents/Examples/3DBall/TFModels` into the `Model` property under `Behavior Parameters (Script)` component in the Agent GameObject **Inspector** window.
![3dball learning brain](images/3dball_learning_brain.png)
4. You should notice that each `Agent` under each `3DBall` in the **Hierarchy** window now contains **3DBall** as `Model` on the `Behavior Parameters`. __Note__: You can modify multiple game objects in a scene by selecting them all at
   once using the search bar in the Scene Hierarchy.
5. Select the **Inference Device** to use for this model (CPU or GPU) on the Agent.
   _Note: CPU is faster for the majority of ML-Agents toolkit generated models._
6. Click the **Play** button and you will see the platforms balance the balls
   using the pre-trained model.
![Running a pre-trained model](images/balance.png)
## Using the Basics Jupyter Notebook
The `notebooks/getting-started.ipynb` [Jupyter notebook](Background-Jupyter.md)
contains a simple walk-through of the functionality of the Python API. It can
also serve as a simple test that your environment is configured correctly.
Within the notebook, be sure to set `env_name` to the name of the Unity executable
if you want to [use an executable](Learning-Environment-Executable.md) or to
`None` if you want to interact with the current scene in the Unity Editor.
More information and documentation is provided in the
[Python API](Python-API.md) page.
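As a point of reference, the minimal connection flow the notebook performs looks roughly like the sketch below. This is only a sketch: `env_name` mirrors the variable set in the notebook, and the exact calls for querying behaviors and stepping the simulation are documented on the [Python API](Python-API.md) page and can differ between releases.

```python
from mlagents_envs.environment import UnityEnvironment

# None connects to the scene currently open in the Unity Editor;
# set this to the path of a built executable to use a build instead.
env_name = None

env = UnityEnvironment(file_name=env_name)
env.reset()   # press the Play button in the Editor when prompted
# ... query behaviors and send actions via the low-level API (see Python-API.md) ...
env.close()   # closing the environment lets the Editor leave Play mode
```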
## Training the Model with Reinforcement Learning
### Setting up the environment for training
In order to set up the Agents for training, you will need to edit the
`Behavior Name` under `Behavior Parameters` in the Agent Inspector window.
The `Behavior Name` is used to group agents by behavior. Note that Agents
sharing the same `Behavior Name` must be agents of the same type using the
same `Behavior Parameters`. You can make sure all your agents have the same
`Behavior Parameters` using Prefabs.
The `Behavior Name` corresponds to the name of the model that will be
generated by the training process and is used to select the hyperparameters
from the training configuration file.
### Training the environment
1. Open a command or terminal window.
2. Navigate to the folder where you cloned the ML-Agents toolkit repository.
**Note**: If you followed the default [installation](Installation.md), then
you should be able to run `mlagents-learn` from any directory.
3. Run `mlagents-learn <trainer-config-path> --run-id=<run-identifier> --train`
where:
- `<trainer-config-path>` is the relative or absolute filepath of the
trainer configuration. The defaults used by example environments included
in `MLAgentsSDK` can be found in `config/trainer_config.yaml`.
- `<run-identifier>` is a string used to separate the results of different
training runs
- `--train` tells `mlagents-learn` to run a training session (rather
than inference)
4. If you cloned the ML-Agents repo, then you can simply run
```sh
mlagents-learn config/trainer_config.yaml --run-id=firstRun --train
```
5. When the message _"Start training by pressing the Play button in the Unity
Editor"_ is displayed on the screen, you can press the :arrow_forward: button
in Unity to start training in the Editor.
**Note**: Alternatively, you can use an executable rather than the Editor to
perform training. Please refer to [this
page](Learning-Environment-Executable.md) for instructions on how to build and
use an executable.
**Note**: If you're using Anaconda, don't forget to activate the ml-agents
environment first.
If `mlagents-learn` runs correctly and starts training, you should see something
like this:
```console
INFO:mlagents_envs:
'Ball3DAcademy' started successfully!
Unity Academy name: Ball3DAcademy
INFO:mlagents_envs:Connected new brain:
Unity brain name: 3DBallLearning
Number of Visual Observations (per agent): 0
Vector Observation space size (per agent): 8
Number of stacked Vector Observation: 1
Vector Action space type: continuous
Vector Action space size (per agent): [2]
Vector Action descriptions: ,
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain 3DBallLearning:
batch_size: 64
beta: 0.001
buffer_size: 12000
epsilon: 0.2
gamma: 0.995
hidden_units: 128
lambd: 0.99
learning_rate: 0.0003
max_steps: 5.0e4
normalize: True
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 1000
use_recurrent: False
summary_path: ./summaries/first-run-0
memory_size: 256
use_curiosity: False
curiosity_strength: 0.01
curiosity_enc_size: 128
model_path: ./models/first-run-0/3DBallLearning
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 1000. Mean Reward: 1.242. Std of Reward: 0.746. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 2000. Mean Reward: 1.319. Std of Reward: 0.693. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 3000. Mean Reward: 1.804. Std of Reward: 1.056. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 4000. Mean Reward: 2.151. Std of Reward: 1.432. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 5000. Mean Reward: 3.175. Std of Reward: 2.250. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 6000. Mean Reward: 4.898. Std of Reward: 4.019. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 7000. Mean Reward: 6.716. Std of Reward: 5.125. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 8000. Mean Reward: 12.124. Std of Reward: 11.929. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 9000. Mean Reward: 18.151. Std of Reward: 16.871. Training.
INFO:mlagents.trainers: first-run-0: 3DBallLearning: Step: 10000. Mean Reward: 27.284. Std of Reward: 28.667. Training.
```
### After training
You can press Ctrl+C to stop the training, and your trained model will be at
`models/<run-identifier>/<behavior_name>.nn` where
`<behavior_name>` is the name of the `Behavior Name` of the agents corresponding to the model.
(**Note:** There is a known bug on Windows that can cause model saving to
fail if you terminate training early, so it is recommended to wait until Step
has reached the `max_steps` value you set in `trainer_config.yaml`.) This file
corresponds to your model's latest checkpoint. You can now embed this trained
model into your Agents by following the steps below, which are similar to
the steps described [above](#running-a-pre-trained-model).
1. Move your model file into
`Project/Assets/ML-Agents/Examples/3DBall/TFModels/`.
2. Open the Unity Editor, and select the **3DBall** scene as described above.
3. Select the **3DBall** prefab Agent object.
4. Drag the `<behavior_name>.nn` file from the Project window of
the Editor to the **Model** placeholder in the **Ball3DAgent**
inspector window.
5. Press the :arrow_forward: button at the top of the Editor.
## Next Steps
- For more information on the ML-Agents toolkit, in addition to helpful
background, check out the [ML-Agents Toolkit Overview](ML-Agents-Overview.md)
page.
- For a more detailed walk-through of our 3D Balance Ball environment, check out
the [Getting Started](Getting-Started-with-Balance-Ball.md) page.
- For a "Hello World" introduction to creating your own Learning Environment,
check out the [Making a New Learning
Environment](Learning-Environment-Create-New.md) page.
- For a series of YouTube video tutorials, check out the
[Machine Learning Agents PlayList](https://www.youtube.com/playlist?list=PLX2vGYjWbI0R08eWQkO7nQkGiicHAX7IX)
page.

232
docs/Getting-Started-with-Balance-Ball.md


# Getting Started with the 3D Balance Ball Environment
This tutorial walks through the end-to-end process of opening an ML-Agents
toolkit example environment in Unity, building the Unity executable, training an
Agent in it, and finally embedding the trained model into the Unity environment.
The ML-Agents toolkit includes a number of [example
environments](Learning-Environment-Examples.md) which you can examine to help
understand the different ways in which the ML-Agents toolkit can be used. These
environments can also serve as templates for new environments or as ways to test
new ML algorithms. After reading this tutorial, you should be able to explore
and build the example environments.
![3D Balance Ball](images/balance.png)
This walk-through uses the **3D Balance Ball** environment. 3D Balance Ball
contains a number of agent cubes and balls (which are all copies of each other).
Each agent cube tries to keep its ball from falling by rotating either
horizontally or vertically. In this environment, an agent cube is an **Agent** that
receives a reward for every step that it balances the ball. An agent is also
penalized with a negative reward for dropping the ball. The goal of the training
process is to have the agents learn to balance the ball on their head.
Let's get started!
## Installation
In order to install and set up the ML-Agents toolkit, the Python dependencies
and Unity, see the [installation instructions](Installation.md).
## Understanding the Unity Environment (3D Balance Ball)
An agent is an autonomous actor that observes and interacts with an
_environment_. In the context of Unity, an environment is a scene containing an
Academy and one or more Agent objects, and, of course, the other
entities that an agent interacts with.
![Unity Editor](images/mlagents-3DBallHierarchy.png)
**Note:** In Unity, the base object of everything in a scene is the
_GameObject_. The GameObject is essentially a container for everything else,
including behaviors, graphics, physics, etc. To see the components that make up
a GameObject, select the GameObject in the Scene window, and open the Inspector
window. The Inspector shows every component on a GameObject.
The first thing you may notice after opening the 3D Balance Ball scene is that
it contains not one, but several agent cubes. Each agent cube in the scene is an
independent agent, but they all share the same Behavior. 3D Balance Ball does this
to speed up training since all twelve agents contribute to training in parallel.
### Agent
The Agent is the actor that observes and takes actions in the environment. In
the 3D Balance Ball environment, the Agent components are placed on the twelve
"Agent" GameObjects. The base Agent object has a few properties that affect its
behavior:
* **Behavior Parameters** — Every Agent must have a Behavior. The Behavior
determines how an Agent makes decisions. More on Behavior Parameters in
the next section.
* **Max Step** — Defines how many simulation steps can occur before the Agent's
episode ends. In 3D Balance Ball, an Agent restarts after 5000 steps.
When you create an Agent, you must extend the base Agent class.
The Ball3DAgent subclass defines the following methods:
* `Agent.OnEpisodeBegin()` — Called when the Agent resets, including at the beginning
of the simulation. The Ball3DAgent class uses the reset function to reset the
agent cube and ball. The function randomizes the reset values so that the
training generalizes to more than a specific starting position and agent cube
attitude.
* `Agent.CollectObservations(VectorSensor sensor)` — Called every simulation step. Responsible for
collecting the Agent's observations of the environment. Since the Behavior
Parameters of the Agent specify a vector observation space with a size of 8,
`CollectObservations(VectorSensor sensor)` must call `VectorSensor.AddObservation()`
enough times that the resulting vector contains exactly 8 values.
* `Agent.OnActionReceived()` — Called every time the Agent receives an action to take.
The received vector action results in a small change in the agent cube's rotation at
each step. The `OnActionReceived()` method also assigns a reward to the Agent; in this
example, an Agent receives a small positive reward for each step it keeps the ball on
the agent cube's head and a larger negative reward for dropping the ball. An Agent's
episode is also ended when it drops the ball, so that it will reset with a new ball
for the next simulation step.
* `Agent.Heuristic()` - When the `Behavior Type` is set to `Heuristic Only` in the Behavior
Parameters of the Agent, the Agent will use the `Heuristic()` method to generate
the actions of the Agent. As such, the `Heuristic()` method returns an array of
floats. In the case of the Ball 3D Agent, the `Heuristic()` method converts the
keyboard inputs into actions.
#### Behavior Parameters : Vector Observation Space
Before making a decision, an agent collects its observation about its state in
the world. The vector observation is a vector of floating point numbers which
contain relevant information for the agent to make decisions.
The Behavior Parameters component of the 3D Balance Ball example uses a **Space Size** of 8.
This means that the feature
vector containing the Agent's observations contains eight elements: the `x` and
`z` components of the agent cube's rotation and the `x`, `y`, and `z` components
of the ball's relative position and velocity. (The observation values are
defined in the Agent's `CollectObservations(VectorSensor sensor)` method.)
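As a purely illustrative sketch (the real observation is collected in C# via `VectorSensor.AddObservation()`, and the names below are hypothetical), the eight elements can be thought of as a flat vector assembled like this:

```python
def build_observation(cube_rot_x, cube_rot_z, ball_rel_pos, ball_rel_vel):
    """Hypothetical sketch: pack the quantities described above into an 8-element vector.

    ball_rel_pos and ball_rel_vel are (x, y, z) tuples relative to the agent cube.
    """
    obs = [cube_rot_x, cube_rot_z, *ball_rel_pos, *ball_rel_vel]
    assert len(obs) == 8  # must match the Space Size set in Behavior Parameters
    return obs


print(build_observation(0.05, -0.02, (0.0, 0.5, 0.0), (0.0, -0.1, 0.0)))
```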
#### Behavior Parameters : Vector Action Space
An Agent is given instructions in the form of a float array of *actions*.
ML-Agents toolkit classifies actions into two types: the **Continuous** vector
action space is a vector of numbers that can vary continuously. What each
element of the vector means is defined by the Agent logic (the training
process just learns what values are better given particular state observations
based on the rewards received when it tries different values). For example, an
element might represent a force or torque applied to a `Rigidbody` in the Agent.
The **Discrete** action vector space defines its actions as tables. An action
given to the Agent is an array of indices into tables.
The 3D Balance Ball example is programmed to use a continuous action
space with a `Space Size` of 2.
## Training with Reinforcement Learning
Now that we have an environment, we can perform the training.
### Training with Deep Reinforcement Learning
In order to train an agent to correctly balance the ball, we provide two
deep reinforcement learning algorithms.
The default algorithm is Proximal Policy Optimization (PPO). This
is a method that has been shown to be more general purpose and stable
than many other RL algorithms. For more information on PPO, OpenAI
has a [blog post](https://blog.openai.com/openai-baselines-ppo/)
explaining it, and [our page](Training-PPO.md) for how to use it in training.
We also provide Soft Actor-Critic, an off-policy algorithm that
has been shown to be both stable and sample-efficient.
For more information on SAC, see UC Berkeley's
[blog post](https://bair.berkeley.edu/blog/2018/12/14/sac/) and
[our page](Training-SAC.md) for more guidance on when to use SAC vs. PPO. To
use SAC to train Balance Ball, replace all references to `config/trainer_config.yaml`
with `config/sac_trainer_config.yaml` below.
To train the agents within the Balance Ball environment, we will be using the
ML-Agents Python package. We have provided a convenient command called `mlagents-learn`
which accepts arguments used to configure both training and inference phases.
We can use `run_id` to identify the experiment and create a folder where the
model and summary statistics are stored. When using TensorBoard to observe the
training statistics, it helps to set this to a sequential value for each
training run. In other words, "BalanceBall1" for the first run, "BalanceBall2"
for the second, and so on. If you don't, the summaries for every training run are
saved to the same directory and will all be included on the same graph.
To summarize, go to your command line, enter the `ml-agents` directory and type:
```sh
mlagents-learn config/trainer_config.yaml --run-id=<run-identifier> --train --time-scale=100
```
When the message _"Start training by pressing the Play button in the Unity
Editor"_ is displayed on the screen, you can press the :arrow_forward: button in
Unity to start training in the Editor.
**Note**: If you're using Anaconda, don't forget to activate the ml-agents
environment first.
The `--train` flag tells the ML-Agents toolkit to run in training mode.
The `--time-scale=100` option sets Unity's `Time.timeScale` so the simulation runs faster during training.
**Note**: You can train using an executable rather than the Editor. To do so,
follow the instructions in
[Using an Executable](Learning-Environment-Executable.md).
**Note**: Re-running this command will start training from scratch again. To resume
a previous training run, append the `--load` flag and give the same `--run-id` as the
run you want to resume.
### Observing Training Progress
Once you start training using `mlagents-learn` in the way described in the
previous section, the `ml-agents` directory will contain a `summaries`
directory. In order to observe the training process in more detail, you can use
TensorBoard. From the command line run:
```sh
tensorboard --logdir=summaries
```
Then navigate to `localhost:6006` in your browser.
From TensorBoard, you will see the summary statistics:
* Lesson - only interesting when performing [curriculum
training](Training-Curriculum-Learning.md). This is not used in the 3D Balance
Ball environment.
* Cumulative Reward - The mean cumulative episode reward over all agents. Should
increase during a successful training session.
* Entropy - How random the decisions of the model are. Should slowly decrease
during a successful training process. If it decreases too quickly, the `beta`
hyperparameter should be increased.
* Episode Length - The mean length of each episode in the environment for all
agents.
* Learning Rate - How large a step the training algorithm takes as it searches
for the optimal policy. Should decrease over time.
* Policy Loss - The mean loss of the policy function update. Correlates to how
much the policy (process for deciding actions) is changing. The magnitude of
this should decrease during a successful training session.
* Value Estimate - The mean value estimate for all states visited by the agent.
Should increase during a successful training session.
* Value Loss - The mean loss of the value function update. Correlates to how
well the model is able to predict the value of each state. This should
decrease during a successful training session.
![Example TensorBoard Run](images/mlagents-TensorBoard.png)
## Embedding the Model into the Unity Environment
Once the training process completes and saves the model
(denoted by the `Saved Model` message), you can add it to the Unity project and
use it with compatible Agents (the Agents that generated the model).
__Note:__ Do not just close the Unity Window once the `Saved Model` message appears.
Either wait for the training process to close the window or press Ctrl+C at the
command-line prompt. If you close the window manually, the `.nn` file
containing the trained model is not exported into the ml-agents folder.
### Embedding the trained model into Unity
To embed the trained model into Unity, follow the later part of [Training the
Model with Reinforcement
Learning](Basic-Guide.md#training-the-model-with-reinforcement-learning) section
of the Basic Guide page.

61
docs/Learning-Environment-Best-Practices.md


# Environment Design Best Practices
## General
* It is often helpful to start with the simplest version of the problem, to
ensure the agent can learn it. From there, increase complexity over time. This
can either be done manually, or via Curriculum Learning, where a set of
lessons which progressively increase in difficulty are presented to the agent
([learn more here](Training-Curriculum-Learning.md)).
* When possible, it is often helpful to ensure that you can complete the task by
using a heuristic to control the agent. To do so, set the `Behavior Type`
to `Heuristic Only` on the Agent's Behavior Parameters, and implement the
`Heuristic()` method on the Agent.
* It is often helpful to make many copies of the agent, and give them the same
`Behavior Name`. In this way the learning process can get more feedback
information from all of these agents, which helps it train faster.
## Rewards
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to provide
a small penalty every step (-0.05) that the agent does not complete the task.
In this case completion of the task should also coincide with the end of the
episode.
* Overly-large negative rewards can cause undesirable behavior where an agent
learns to avoid any behavior which might produce the negative reward, even if
it is also behavior which can eventually lead to a positive reward.
## Vector Observations
* Vector Observations should include all variables relevant to allowing the
  agent to make an optimally informed decision.
* In cases where Vector Observations need to be remembered or compared over
time, increase the `Stacked Vectors` value to allow the agent to keep track of
multiple observations into the past.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
  encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`).
* Besides encoding non-numeric values, all inputs should be normalized to be in
the range 0 to +1 (or -1 to 1). For example, the `x` position information of
an agent where the maximum possible value is `maxValue` should be recorded as
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than
`VectorSensor.AddObservation(transform.position.x);`. See the equation below (and the short
sketch after it) for one approach to normalization.
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
![normalization](images/normalization.png)
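The equation referenced above is the standard min-max normalization; here is the same idea as a minimal Python sketch for reference:

```python
def normalize(value, min_value, max_value):
    """Min-max normalization: maps a value in [min_value, max_value] to [0, 1]."""
    return (value - min_value) / (max_value - min_value)


# e.g. an x position of 7.5 on a platform spanning 0..10 becomes 0.75
print(normalize(7.5, 0.0, 10.0))
```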
## Vector Actions
* When using continuous control, action values should be clipped to an
  appropriate range. The provided PPO model automatically clips these values
  between -1 and 1, but third-party training systems may not do so (see the
  sketch after this list).
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.
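For example, if you generate continuous actions from your own Python code (such as a third-party training system talking to the low-level Python API), you can clip them before sending them to the environment. A minimal sketch, assuming NumPy is installed:

```python
import numpy as np


def clip_actions(actions):
    """Clip continuous actions to the [-1, 1] range used by the example environments."""
    return np.clip(np.asarray(actions, dtype=np.float32), -1.0, 1.0)


print(clip_actions([2.5, -0.3]))  # values outside [-1, 1] are clamped
```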