
Merge branch 'master' into self-play-mutex

Andrew Cohen 5 years ago
32 files changed, with 781 insertions and 752 deletions


- version: 2018.4
# 2018.4 doesn't support code-coverage
minCoveragePct: 0
minCoveragePct: 72
minCoveragePct: 72
- name: win
type: Unity::VM

- npm install upm-ci-utils@stable -g --registry https://api.bintray.com/npm/unity/unity-npm
- upm-ci package test -u {{ editor.version }} --package-path com.unity.ml-agents {{ editor.coverageOptions }}
- python ml-agents/tests/yamato/check_coverage_percent.py upm-ci~/test-results/ {{ editor.minCoveragePct }}

- "com.unity.ml-agents/**"
- "ml-agents/tests/yamato/**"
- ".yamato/com.unity.ml-agents-test.yml"
{% endfor %}


public int Write(WriteAdapter adapter)
// First, call the wrapped sensor's write method. Make sure to use our own adapater, not the passed one.
// First, call the wrapped sensor's write method. Make sure to use our own adapter, not the passed one.
var wrappedShape = m_WrappedSensor.GetObservationShape();
m_LocalAdapter.SetTarget(m_StackedObservations[m_CurrentIndex], wrappedShape, 0);


hidden_units: 128
lambd: 0.95
learning_rate: 3.0e-4
max_steps: 5.0e4
max_steps: 5.0e5
memory_size: 256
normalize: false
num_epoch: 3


buffer_size: 12000
summary_freq: 12000
time_horizon: 1000
max_steps: 5.0e5
max_steps: 5.0e6
beta: 0.001


embedded visualizations. We provide one such notebook,
`notebooks/getting-started.ipynb`, for testing the Python control interface to a
Unity build. This notebook is introduced in the
[Getting Started with the 3D Balance Ball Environment](Getting-Started-with-Balance-Ball.md)
[Getting Started Guide](Getting-Started.md)
tutorial, but can be used for testing the connection to any Unity build.
For a walkthrough of how to use Jupyter, see


## Next Steps
The [Basic Guide](Basic-Guide.md) page contains several short tutorials on
The [Getting Started](Getting-Started.md) guide contains several short tutorials on
setting up the ML-Agents Toolkit within Unity and running a pre-trained model, in
addition to building and training environments.


# Making a New Learning Environment
This tutorial walks through the process of creating a Unity Environment. A Unity
Environment is an application built using the Unity Engine which can be used to
train Reinforcement Learning Agents.
This tutorial walks through the process of creating a Unity Environment from scratch. We recommend first reading the [Getting Started](Getting-Started.md) guide to understand the concepts presented here in an already-built environment.
In this example, we will train a ball to roll to a randomly placed cube. The
ball also learns to avoid falling off the platform.
In this example, we will create an agent capable of controlling a ball on a platform. We will then train the agent to roll the ball toward the cube while avoiding falling off the platform.
## Overview

This is only one way to achieve this objective. Refer to the
[example environments](Learning-Environment-Examples.md) for other ways we can achieve relative positioning.
## Review: Scene Layout
This section briefly reviews how to organize your scene when using Agents in
your Unity environment.
There are two kinds of game objects you need to include in your scene in order
to use Unity ML-Agents: an Academy and one or more Agents.
Keep in mind:
* If you are using multiple training areas, make sure all the Agents have the same `Behavior Name`
and `Behavior Parameters`


# Agents
An agent is an actor that can observe its environment and decide on the best
course of action using those observations. Create Agents in Unity by extending
the Agent class. The most important aspects of creating agents that can
successfully learn are the observations the agent collects for
reinforcement learning and the reward you assign to estimate the value of the
An agent is an entity that can observe its environment, decide on the best
course of action using those observations, and execute those actions within
its environment. Agents can be created in Unity by extending
the `Agent` class. The most important aspects of creating agents that can
successfully learn are the observations the agent collects,
and the reward you assign to estimate the value of the
An Agent passes its observations to its Policy. The Policy, then, makes a decision
An Agent passes its observations to its Policy. The Policy then makes a decision
and passes the chosen action back to the agent. Your agent code must execute the
action, for example, move the agent in one direction or another. In order to
[train an agent using reinforcement learning](Learning-Environment-Design.md),

The Policy class abstracts out the decision making logic from the Agent itself so
The `Policy` class abstracts out the decision making logic from the Agent itself so
decisions depends on the kind of Policy it is. You can change the Policy of an
Agent by changing its `Behavior Parameters`. If you set `Behavior Type` to
`Heuristic Only`, the Agent will use its `Heuristic()` method to make decisions
which can allow you to control the Agent manually or write your own Policy. If
the Agent has a `Model` file, it Policy will use the neural network `Model` to
take decisions.
decisions depends on the `Behavior Parameters` associated with the agent. If you
set `Behavior Type` to `Heuristic Only`, the Agent will use its `Heuristic()`
method to make decisions which can allow you to control the Agent manually or
write your own Policy. If the Agent has a `Model` file, its Policy will use
the neural network `Model` to take decisions.
## Decisions

the Agent to request decisions on its own at regular intervals, add a
`Decision Requester` component to the Agent's Game Object. Making decisions at regular step
`Decision Requester` component to the Agent's GameObject. Making decisions at regular step
occur, should call `Agent.RequestDecision()` manually.
occur, such as in a turn-based game, should call `Agent.RequestDecision()` manually.
## Observations
## Observations and Sensors
To make decisions, an agent must observe its environment in order to infer the
state of the world. A state observation can take the following forms:
To make informed decisions, an agent must first make observations of the state of
the environment. The observations are collected by Sensors attached to the agent
GameObject. By default, agents come with a `VectorSensor` which allows them to
collect floating-point observations into a single array. There are additional
sensor components which can be attached to the agent GameObject which collect their own
observations, or modify other observations. These are:
* **Vector Observation** — a feature vector consisting of an array of floating
point numbers.
* **Visual Observations** — one or more camera images and/or render textures.
* `CameraSensorComponent` - Allows image from `Camera` to be used as observation.
* `RenderTextureSensorComponent` - Allows content of `RenderTexture` to be used as observation.
* `RayPerceptionSensorComponent` - Allows information from set of ray-casts to be used as observation.
When you use vector observations for an Agent, implement the
`Agent.CollectObservations(VectorSensor sensor)` method to create the feature vector. When you use
**Visual Observations**, you only need to identify which Unity Camera objects
or RenderTextures will provide images and the base Agent class handles the rest.
You do not need to implement the `CollectObservations(VectorSensor sensor)` method when your Agent
uses visual observations (unless it also uses vector observations).
### Vector Observations
### Vector Observation Space: Feature Vectors
Vector observations are best used for aspects of the environment which are numerical
and non-visual. The Policy class calls the `CollectObservations(VectorSensor sensor)`
method of each Agent. Your implementation of this function must call
`VectorSensor.AddObservation` to add vector observations.
For agents using a continuous state space, you create a feature vector to
represent the agent's observation at each step of the simulation. The Policy
class calls the `CollectObservations(VectorSensor sensor)` method of each Agent. Your
implementation of this function must call `VectorSensor.AddObservation` to add vector
The observation must include all the information an agents needs to accomplish
its task. Without sufficient and relevant information, an agent may learn poorly
In order for an agent to learn, the observations should include all the
information an agent needs to accomplish its task. Without sufficient and relevant
information, an agent may learn poorly
solution to the problem.
solution to the problem, or what you would expect a human to be able to use to solve the problem.
For examples of various state observation functions, you can look at the
[example environments](Learning-Environment-Examples.md) included in the

every enemy agent in an environment, you could only observe the closest five.
When you set up an Agent's `Behavior Parameters` in the Unity Editor, set the following
properties to use a continuous vector observation:
properties to use a vector observation:
* **Space Size** — The state size must match the length of your feature vector.

of data to your observation vector. You can add Integers and booleans directly to
the observation vector, as well as some common Unity data types such as `Vector2`,
`Vector3`, and `Quaternion`.
#### One-hot encoding categorical information
Type enumerations should be encoded in the _one-hot_ style. That is, add an
element to the feature vector for each element of enumeration, setting the

`VectorSensor.AddObservation` also provides a two-argument version as a shortcut for _one-hot_
`VectorSensor` also provides a two-argument function `AddOneHotObservation()` as a shortcut for _one-hot_
style observations. The following example is identical to the previous one.

angle, or, if the number of turns is significant, increase the maximum value
used in your normalization formula.
### Multiple Visual Observations
#### Vector Observation Summary & Best Practices
* Vector Observations should include all variables relevant for allowing the
agent to take the optimally informed decision, and ideally no extraneous information.
* In cases where Vector Observations need to be remembered or compared over
time, either an LSTM (see [here](Feature-Memory.md)) should be used in the model, or the
`Stacked Vectors` value in the agent GameObject's `Behavior Parameters` should be changed.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done automatically using the
`AddOneHotObservation()` method of the `VectorSensor`.
* In general, all inputs should be normalized to be in
the range 0 to +1 (or -1 to 1). For example, the `x` position information of
an agent where the maximum possible value is `maxValue` should be recorded as
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
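
The encoding practices above (one-hot categories, normalized values, relative positions) can be sketched outside Unity. The following Python snippet is only an illustration of the arithmetic, not the `VectorSensor` API; the helper names `one_hot`, `normalize`, and `relative_position` are hypothetical, and `one_hot` uses a zero-based index, whereas the `3` -> `0, 0, 1` example above counts from one.

```python
from typing import List, Sequence, Tuple


def one_hot(index: int, num_categories: int) -> List[float]:
    """Return a list with 1.0 at the zero-based `index` and 0.0 elsewhere."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    return [1.0 if i == index else 0.0 for i in range(num_categories)]


def normalize(value: float, min_value: float, max_value: float) -> float:
    """Linearly rescale `value` from [min_value, max_value] to [0, 1]."""
    return (value - min_value) / (max_value - min_value)


def relative_position(
    target: Sequence[float], agent: Sequence[float]
) -> Tuple[float, ...]:
    """Express the target position relative to the agent position."""
    return tuple(t - a for t, a in zip(target, agent))


# Third of three item types (Sword, Shield, Bow), zero-based index 2.
item_observation = one_hot(2, 3)
# An x position of 5 on a platform spanning 0 to 10.
x_observation = normalize(5.0, 0.0, 10.0)
# Target location as seen from the agent.
offset_observation = relative_position((3.0, 1.0, 4.0), (1.0, 1.0, 1.0))
```

In an actual agent these values would be passed to `VectorSensor.AddObservation` inside `CollectObservations(VectorSensor sensor)`.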
### Visual Observations
Visual observations use rendered textures directly or from one or more
cameras in a scene. The Policy vectorizes the textures into a 3D Tensor which
can be fed into a convolutional neural network (CNN). For more information on
CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). You
can use visual observations along side vector observations.
Visual observations are generally provided to an agent via either a `CameraSensor` or `RenderTextureSensor`.
These collect image information and transform it into a 3D Tensor which
can be fed into the convolutional neural network (CNN) of the agent policy. For more information on
CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). This allows agents
to learn from spatial regularities in the observation images. It is possible to
use visual and vector observations with the same agent.
succeed at all.
succeed at all as compared to vector observations. As such, they should only be
used when it is not possible to properly define the problem using vector or ray-cast observations.
Visual observations can be derived from Cameras or RenderTextures within your scene.
To add a visual observation to an Agent, add either a Camera Sensor Component

![Agent RenderTexture Debug](images/gridworld.png)
#### Visual Observation Summary & Best Practices
* To collect visual observations, attach `CameraSensor` or `RenderTextureSensor`
components to the agent GameObject.
* Visual observations should generally be used only when vector observations are not sufficient.
* Image size should be kept as small as possible, without the loss of
needed details for decision making.
* Images should be made greyscale in situations where color information is
not needed for making informed decisions.
Raycasts are an alternative system for the Agent to provide observations based on
the physical environment. This can be easily implemented by adding a
RayPerceptionSensorComponent3D (or RayPerceptionSensorComponent2D) to the Agent.
Raycasts are another possible method for providing observations to an agent.
This can be easily implemented by adding a
`RayPerceptionSensorComponent3D` (or `RayPerceptionSensorComponent2D`) to the Agent GameObject.
During observations, several rays (or spheres, depending on settings) are cast into
the physics world, and the objects that are hit determine the observation vector that

* _Start Vertical Offset_ (3D only) The vertical offset of the ray start point.
* _End Vertical Offset_ (3D only) The vertical offset of the ray end point.
In the example image above, the Agent has two RayPerceptionSensorComponent3Ds.
In the example image above, the Agent has two `RayPerceptionSensorComponent3D`s.
Both use 3 Rays Per Direction and 90 Max Ray Degrees. One of the components
had a vertical offset, so the Agent can tell whether it's clear to jump over
the wall.

`Behavior Parameters`, so you don't need to worry about the formula above when
setting the State Size.
## Vector Actions
#### RayCast Observation Summary & Best Practices
* Attach `RayPerceptionSensorComponent3D` or `RayPerceptionSensorComponent2D` to use.
* This observation type is best used when there is relevant spatial information
for the agent that doesn't require a fully rendered image to convey.
* Use as few rays and tags as necessary to solve the problem in order to improve learning stability and agent performance.
## Actions
agent's `OnActionReceived()` function. When you specify that the vector action space
agent's `OnActionReceived()` function. Actions for an agent can take one of two forms, either **Continuous** or **Discrete**.
When you specify that the vector action space
control signals with length equal to the `Vector Action Space Size` property.
floating point numbers with length equal to the `Vector Action Space Size` property.
When you specify a **Discrete** vector action space type, the action parameter
is an array containing integers. Each integer is an index into a list or table
of commands. In the **Discrete** vector action space type, the action parameter

array of integers, each value corresponds to the number of possibilities for
each branch.
For example, if we wanted an Agent that can move in an plane and jump, we could
For example, if we wanted an Agent that can move in a plane and jump, we could
define two branches (one for motion and one for jumping) because we want our
agent to be able to move __and__ jump concurrently. We define the first branch to
have 5 possible actions (don't move, go left, go right, go backward, go forward)

neural network, the Agent will be unable to perform the specified action. Note
that when the Agent is controlled by its Heuristic, the Agent will
still be able to decide to perform the masked action. In order to mask an
action, override the `Agent.CollectDiscreteActionMasks()` virtual method, and call `DiscreteActionMasker.SetMask()` in it:
action, override the `Agent.CollectDiscreteActionMasks()` virtual method,
and call `DiscreteActionMasker.SetMask()` in it:
public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker){

* You cannot mask all the actions of a branch.
* You cannot mask actions in continuous control.
### Actions Summary & Best Practices
* Actions can either use `Discrete` or `Continuous` spaces.
* When using `Discrete` it is possible to assign multiple action branches, and to mask certain actions.
* In general, smaller action spaces will make for easier learning.
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.
* When using continuous control, action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
between -1 and 1, but third party training systems may not do so.
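
As a rough illustration of two of the points above, the Python sketch below clips continuous action values to a fixed range and lists the actions left available in one discrete branch after masking. It does not use the ML-Agents API; `clip_actions` and `allowed_actions` are hypothetical helpers, and the error raised when every action is masked simply mirrors the rule that you cannot mask all the actions of a branch.

```python
from typing import Iterable, List, Sequence


def clip_actions(
    actions: Sequence[float], low: float = -1.0, high: float = 1.0
) -> List[float]:
    """Clamp each continuous action value into the range [low, high]."""
    return [max(low, min(high, a)) for a in actions]


def allowed_actions(branch_size: int, masked: Iterable[int]) -> List[int]:
    """Return the indices of a discrete branch that are not masked."""
    masked_set = set(masked)
    allowed = [i for i in range(branch_size) if i not in masked_set]
    if not allowed:
        raise ValueError("cannot mask all the actions of a branch")
    return allowed


# A trainer that does not clip its outputs might emit values like these.
clipped = clip_actions([-2.5, 0.3, 1.7])
# A movement branch with 5 actions where "go left" (1) and "go right" (2) are masked.
remaining = allowed_actions(5, [1, 2])
```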
## Rewards
In reinforcement learning, the reward is a signal that the agent has done

Perhaps the best advice is to start simple and only add complexity as needed. In
general, you should reward results rather than actions you think will lead to
the desired results. To help develop your rewards, you can use the Monitor class
to display the cumulative reward received by an Agent. You can even use the
the desired results. You can even use the
Allocate rewards to an Agent by calling the `AddReward()` method in the
`OnActionReceived()` function. The reward assigned between each decision
Allocate rewards to an Agent by calling the `AddReward()` or `SetReward()` methods on the agent.
The reward assigned between each decision
decision was. There is a method called `SetReward()` that will override all
decision was. The `SetReward()` method will override all
previous rewards given to an agent since the previous decision.
### Examples

Note that all of these environments make use of the `EndEpisode()` method, which manually
terminates an episode when a termination condition is reached. This can be
called independently of the `Max Step` property.
### Rewards Summary & Best Practices
* Use `AddReward()` to accumulate rewards between decisions. Use `SetReward()`
to overwrite any previous rewards accumulated between decisions.
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards. Excessive negative rewards can result in the agent
failing to learn any meaningful behavior.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to provide
a small penalty every step (-0.05) that the agent does not complete the task.
In this case completion of the task should also coincide with the end of the
episode by calling `EndEpisode()` on the agent when it has accomplished its goal.
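
The difference between accumulating and overwriting rewards can be shown with a toy model. The Python class below is a hypothetical stand-in for the bookkeeping described above, not the Unity `Agent` class; it assumes only that `AddReward()` adds to the reward pending since the last decision and that `SetReward()` replaces it.

```python
class RewardTracker:
    """Toy model of the reward pending between two decisions."""

    def __init__(self) -> None:
        self.reward = 0.0

    def add_reward(self, increment: float) -> None:
        # Mirrors AddReward(): accumulate on top of the pending reward.
        self.reward += increment

    def set_reward(self, value: float) -> None:
        # Mirrors SetReward(): discard the pending reward and use `value`.
        self.reward = value


tracker = RewardTracker()

# Three steps with a small time penalty each accumulate to roughly -0.15.
for _ in range(3):
    tracker.add_reward(-0.05)
accumulated = tracker.reward

# Reaching the goal overwrites the accumulated penalty.
tracker.set_reward(1.0)
final = tracker.reward
```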
## Agent Properties


* Float Properties: None
* Benchmark Mean Reward: 0.93
## [3DBall: 3D Balance Ball](https://youtu.be/dheeCO29-EI)
## 3DBall: 3D Balance Ball
![3D Balance Ball](images/balance.png)

* Recommended Maximum: 20
* Benchmark Mean Reward: 100
## [GridWorld](https://youtu.be/gu8HE9WKEVI)
## GridWorld

number of goals.
* Benchmark Mean Reward: 0.8
## [Tennis](https://youtu.be/RDaIh7JX6RI)
## Tennis

* Recommended Minimum: 0.2
* Recommended Maximum: 5
## [Push Block](https://youtu.be/jKdw216ZgoE)
## Push Block

* Recommended Maximum: 2000
* Benchmark Mean Reward: 4.5
## [Wall Jump](https://youtu.be/NITLug2DIWQ)
## Wall Jump

* Float Properties: Four
* Benchmark Mean Reward (Big & Small Wall): 0.8
## [Reacher](https://youtu.be/2N9EoF6pQyE)
## Reacher

* Recommended Maximum: 3
* Benchmark Mean Reward: 30
## [Crawler](https://youtu.be/ftLliaeooYI)
## Crawler

* Benchmark Mean Reward for `CrawlerStaticTarget`: 2000
* Benchmark Mean Reward for `CrawlerDynamicTarget`: 400
## [Food Collector](https://youtu.be/heVMs3t9qSk)
## Food Collector

* Recommended Maximum: 5
* Benchmark Mean Reward: 10
## [Hallway](https://youtu.be/53GyfpPQRUQ)
## Hallway

* Benchmark Mean Reward: 0.7
* To speed up training, you can enable curiosity by adding the `curiosity` reward signal in `config/trainer_config.yaml`
## [Bouncer](https://youtu.be/Tkv-c-b1b2I)
## Bouncer

* Recommended Maximum: 250
* Benchmark Mean Reward: 10
## [Soccer Twos](https://youtu.be/Hg3nmYD3DjQ)
## Soccer Twos


training the Python API uses the observations it receives to learn a TensorFlow
model. This model is then embedded within the Agent during inference.
[Getting Started with the 3D Balance Ball Example](Getting-Started-with-Balance-Ball.md)
The [Getting Started Guide](Getting-Started.md)
tutorial covers this training mode with the **3D Balance Ball** sample environment.
### Custom Training and Inference

To help you use ML-Agents, we've created several in-depth tutorials for
[installing ML-Agents](Installation.md),
[getting started](Getting-Started-with-Balance-Ball.md) with the 3D Balance Ball
[getting started](Getting-Started.md) with the 3D Balance Ball
environment (one of our many
[sample environments](Learning-Environment-Examples.md)) and
[making your own environment](Learning-Environment-Create-New.md).


- `worker_id` indicates which port to use for communication with the
environment. For use in parallel training regimes such as A3C.
- `seed` indicates the seed to use when generating random numbers during the
training process. In environments which do not involve physics calculations,
training process. In environments which are deterministic,
setting the seed enables reproducible experimentation by ensuring that the
environment and trainers utilize the same random seed.
- `side_channels` provides a way to exchange data with the Unity simulation that


* [Installation](Installation.md)
* [Background: Jupyter Notebooks](Background-Jupyter.md)
* [Using Virtual Environment](Using-Virtual-Environment.md)
* [Basic Guide](Basic-Guide.md)
* [Getting Started Guide](Getting-Started.md)
* [Getting Started with the 3D Balance Ball Environment](Getting-Started-with-Balance-Ball.md)
* [Example Environments](Learning-Environment-Examples.md)
## Creating Learning Environments

* [Designing Agents](Learning-Environment-Design-Agents.md)
* [Learning Environment Best Practices](Learning-Environment-Best-Practices.md)
### Advanced Usage
* [Using the Monitor](Feature-Monitor.md)


To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
### Cumulative Reward


To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
### Cumulative Reward


To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
### ELO
In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively.
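
To make the rating idea concrete, here is a minimal Python sketch of a standard Elo update. It is not taken from the trainer code: the scale of 400 and the factor `k=16.0` are conventional values chosen for illustration, and the function names are hypothetical. The result is encoded as 1, 0.5, or 0 for a win, draw, or loss, and the update is zero-sum: whatever the learner gains, the opponent loses.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Probability-like expected result for `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def elo_change(
    rating: float, opponent_rating: float, result: float, k: float = 16.0
) -> float:
    """Rating change for a result of 1.0 (win), 0.5 (draw), or 0.0 (loss)."""
    return k * (result - expected_score(rating, opponent_rating))


learner_elo = 1200.0
opponent_elo = 1200.0

# The learner beats an equally rated opponent.
change = elo_change(learner_elo, opponent_elo, result=1.0)
learner_elo += change
opponent_elo -= change
```

Because the change depends on the opponent's rating, beating a stronger opponent raises the rating more than beating a weaker one, which is what makes it a more stable progress signal than raw reward in self-play.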


with specific flags, building a Docker container and, finally, running the
container. If you are not familiar with building a Unity environment for
ML-Agents, please read through our [Getting Started with the 3D Balance Ball
Example](Getting-Started-with-Balance-Ball.md) guide first.
Example](Getting-Started.md) guide first.
### Build the Environment (Optional)


self.experience_buffers[global_id] = []
if curr_agent_step.done:
"Environment/Cumulative Reward",
self.episode_rewards.get(global_id, 0),
"Environment/Episode Length",
self.episode_steps.get(global_id, 0),


from mlagents.trainers.trainer import Trainer
from mlagents.trainers.trajectory import Trajectory
from mlagents.trainers.agent_processor import AgentManagerQueue
from mlagents.trainers.stats import StatsPropertyType
from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers
logger = logging.getLogger("mlagents.trainers")

self.learning_policy_queues: Dict[str, AgentManagerQueue[Policy]] = {}
# assign ghost's stats collection to wrapped trainer's
self.stats_reporter = self.trainer.stats_reporter
self._stats_reporter = self.trainer.stats_reporter
# Set the logging to print ELO in the console
self._stats_reporter.add_property(StatsPropertyType.SELF_PLAY, True)
self_play_parameters = trainer_parameters["self_play"]
self.window = self_play_parameters.get("window", 10)

return self.trainer.reward_buffer
def _write_summary(self, step: int) -> None:
Saves training statistics to Tensorboard.
opponents = np.array(self.policy_elos, dtype=np.float32)
" Learning brain {} ELO: {:0.3f}\n"
"Mean Opponent ELO: {:0.3f}"
" Std Opponent ELO: {:0.3f}".format(
self.stats_reporter.add_stat("ELO", self.current_elo)
def _process_trajectory(self, trajectory: Trajectory) -> None:
if trajectory.done_reached and not trajectory.max_step_reached:
# Assumption is that final reward is 1/.5/0 for win/draw/loss

self.current_elo += change
self.policy_elos[self.current_opponent] -= change
def _is_ready_update(self) -> bool:
return False
def _update_policy(self) -> None:
opponents = np.array(self.policy_elos, dtype=np.float32)
self._stats_reporter.add_stat("Self-play/ELO", self.current_elo)
"Self-play/Mean Opponent ELO", opponents.mean()
self._stats_reporter.add_stat("Self-play/Std Opponent ELO", opponents.std())
def advance(self) -> None:

self.next_summary_step = self.trainer.next_summary_step
for internal_q in self.internal_policy_queues:
# Get policies that correspond to the policy queue in question

self.trainer.add_policy(name_behavior_id, policy)
self._save_snapshot(policy) # Need to save after trainer initializes policy
self.learning_behavior_name = name_behavior_id
behavior_id_parsed = BehaviorIdentifiers.from_name_behavior_id(
team_id = behavior_id_parsed.behavior_ids["team"]
self._stats_reporter.add_property(StatsPropertyType.SELF_PLAY_TEAM, team_id)
# for saving/swapping snapshots


env_path: Optional[str],
docker_target_name: Optional[str],
no_graphics: bool,
seed: Optional[int],
seed: int,
start_port: int,
env_args: Optional[List[str]],
) -> Callable[[int, List[SideChannel]], BaseEnv]:

# container.
# Navigate in docker path and find env_path and copy it.
env_path = prepare_for_docker_run(docker_target_name, env_path)
seed_count = 10000
seed_pool = [np.random.randint(0, seed_count) for _ in range(seed_count)]
env_seed = seed
if not env_seed:
env_seed = seed_pool[worker_id % len(seed_pool)]
# Make sure that each environment gets a different seed
env_seed = seed + worker_id
return UnityEnvironment(


agent_id = trajectory.agent_id # All the agents should have the same ID
# Add to episode_steps
self.episode_steps[agent_id] += len(trajectory.steps)
agent_buffer_trajectory = trajectory.to_agentbuffer()
# Update the normalization
if self.is_training:

for name, v in value_estimates.items():
self.optimizer.reward_signals[name].value_name, np.mean(v)

for stat, stat_list in batch_update_stats.items():
self.stats_reporter.add_stat(stat, np.mean(stat_list))
self._stats_reporter.add_stat(stat, np.mean(stat_list))
self.stats_reporter.add_stat(stat, val)
self._stats_reporter.add_stat(stat, val)
def create_policy(self, brain_parameters: BrainParameters) -> TFPolicy:


last_step = trajectory.steps[-1]
agent_id = trajectory.agent_id # All the agents should have the same ID
# Add to episode_steps
self.episode_steps[agent_id] += len(trajectory.steps)
agent_buffer_trajectory = trajectory.to_agentbuffer()
# Update the normalization

agent_buffer_trajectory, trajectory.next_obs, trajectory.done_reached
for name, v in value_estimates.items():
self.optimizer.reward_signals[name].value_name, np.mean(v)

for stat, stat_list in batch_update_stats.items():
self.stats_reporter.add_stat(stat, np.mean(stat_list))
self._stats_reporter.add_stat(stat, np.mean(stat_list))
self.stats_reporter.add_stat(stat, val)
self._stats_reporter.add_stat(stat, val)
def update_reward_signals(self) -> None:

for stat_name, value in update_stats.items():
for stat, stat_list in batch_update_stats.items():
self.stats_reporter.add_stat(stat, np.mean(stat_list))
self._stats_reporter.add_stat(stat, np.mean(stat_list))
def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None:


class StatsPropertyType(Enum):
HYPERPARAMETERS = "hyperparameters"
SELF_PLAY = "selfplay"
SELF_PLAY_TEAM = "selfplayteam"
class StatsWriter(abc.ABC):

class ConsoleWriter(StatsWriter):
def __init__(self):
self.training_start_time = time.time()
# If self-play, we want to print ELO as well as reward
self.self_play = False
self.self_play_team = -1
def write_stats(
self, category: str, values: Dict[str, StatsSummary], step: int

stats_summary = values["Is Training"]
if stats_summary.mean > 0.0:
is_training = "Training."
if "Environment/Cumulative Reward" in values:
stats_summary = values["Environment/Cumulative Reward"]

if self.self_play and "Self-play/ELO" in values:
elo_stats = values["Self-play/ELO"]
mean_opponent_elo = values["Self-play/Mean Opponent ELO"]
std_opponent_elo = values["Self-play/Std Opponent ELO"]
"{} Team {}: ELO: {:0.3f}. "
"Mean Opponent ELO: {:0.3f}. "
"Std Opponent ELO: {:0.3f}. ".format(
"{}: Step: {}. No episode was completed since last summary. {}".format(

category, self._dict_to_str(value, 0)
elif property_type == StatsPropertyType.SELF_PLAY:
assert isinstance(value, bool)
self.self_play = value
elif property_type == StatsPropertyType.SELF_PLAY_TEAM:
assert isinstance(value, int)
self.self_play_team = value
def _dict_to_str(self, param_dict: Dict[str, Any], num_tabs: int) -> str:


from mlagents.trainers.trainer_controller import TrainerController
from mlagents.trainers.learn import parse_command_line
from mlagents_envs.exception import UnityEnvironmentException
from mlagents.trainers.stats import StatsReporter
def basic_options(extra_args=None):

StatsReporter.writers.clear() # make sure there aren't any writers as added by learn.py

assert mock_init.call_args[0][1] == "/dockertarget/models/ppo"
assert mock_init.call_args[0][2] == "/dockertarget/summaries"
StatsReporter.writers.clear() # make sure there aren't any writers as added by learn.py
def test_bad_env_path():


def test_rl_trainer():
trainer = create_rl_trainer()
agent_id = "0"
trainer.episode_steps[agent_id] = 3
for agent_id in trainer.episode_steps:
assert trainer.episode_steps[agent_id] == 0
for rewards in trainer.collected_rewards.values():
for agent_id in rewards:
assert rewards[agent_id] == 0

trainer = create_rl_trainer()
trainer.update_buffer = construct_fake_buffer(0)
def test_advance(mocked_clear_update_buffer):
trainer = create_rl_trainer()
trajectory_queue = AgentManagerQueue("testbrain")


self.assertIn("Hyperparameters for behavior name", cm.output[2])
self.assertIn("example:\t1.0", cm.output[2])
def test_selfplay_console_writer(self):
with self.assertLogs("mlagents.trainers", level="INFO") as cm:
category = "category1"
console_writer = ConsoleWriter()
console_writer.add_property(category, StatsPropertyType.SELF_PLAY, True)
console_writer.add_property(category, StatsPropertyType.SELF_PLAY_TEAM, 1)
statssummary1 = StatsSummary(mean=1.0, std=1.0, num=1)
"Environment/Cumulative Reward": statssummary1,
"Is Training": statssummary1,
"Self-play/ELO": statssummary1,
"Self-play/Mean Opponent ELO": statssummary1,
"Self-play/Std Opponent ELO": statssummary1,
"Mean Reward: 1.000. Std of Reward: 1.000. Training.", cm.output[0]
"category1 Team 1: ELO: 1.000. Mean Opponent ELO: 1.000. Std Opponent ELO: 1.000.",


# # Unity ML-Agents Toolkit
from typing import Dict
from typing import Dict, List
import abc
from mlagents.trainers.optimizer.tf_optimizer import TFOptimizer
from mlagents.trainers.buffer import AgentBuffer

from mlagents_envs.timers import hierarchical_timer
from mlagents.trainers.agent_processor import AgentManagerQueue
from mlagents.trainers.trajectory import Trajectory
from mlagents.trainers.stats import StatsPropertyType
RewardSignalResults = Dict[str, RewardSignalResult]

# collected_rewards is a dictionary from name of reward signal to a dictionary of agent_id to cumulative reward
# used for reporting only. We always want to report the environment reward to Tensorboard, regardless
# of what reward signals are actually present.
self.cumulative_returns_since_policy_update: List[float] = []
self.episode_steps: Dict[str, int] = defaultdict(lambda: 0)
StatsPropertyType.HYPERPARAMETERS, self.trainer_parameters
def end_episode(self) -> None:

for agent_id in self.episode_steps:
self.episode_steps[agent_id] = 0
self.episode_steps[agent_id] = 0
"Environment/Cumulative Reward", rewards.get(agent_id, 0)
rewards.get(agent_id, 0)

rewards[agent_id] = 0
def clear_update_buffer(self) -> None:
def _clear_update_buffer(self) -> None:
def _is_ready_update(self):
Returns whether or not the trainer has enough elements to run update model
:return: A boolean corresponding to whether or not update_model() can be run
return False
def _update_policy(self):
Uses demonstration_buffer to update model.
def _increment_step(self, n_steps: int, name_behavior_id: str) -> None:
Increment the step count of the trainer
:param n_steps: number of steps to increment the step count by
self.step += n_steps
self.next_summary_step = self._get_next_summary_step()
p = self.get_policy(name_behavior_id)
if p:
def _get_next_summary_step(self) -> int:
Get the next step count that should result in a summary write.
return self.step + (self.summary_freq - self.step % self.summary_freq)
def _write_summary(self, step: int) -> None:
Saves training statistics to Tensorboard.
self.stats_reporter.add_stat("Is Training", float(self.should_still_train))
def _process_trajectory(self, trajectory: Trajectory) -> None:
Takes a trajectory and processes it, putting it into the update buffer.
:param trajectory: The Trajectory tuple containing the steps to be processed.
self._maybe_write_summary(self.get_step + len(trajectory.steps))
self._increment_step(len(trajectory.steps), trajectory.behavior_id)
def _maybe_write_summary(self, step_after_process: int) -> None:
If processing the trajectory will make the step exceed the next summary write,
write the summary. This logic ensures summaries are written on the update step and not in between.
:param step_after_process: the step count after processing the next trajectory.
if step_after_process >= self.next_summary_step and self.get_step != 0:
Steps the trainer, taking in trajectories and updates if ready
Steps the trainer, taking in trajectories and updates if ready.
if not self.should_still_train:
with hierarchical_timer("process_trajectory"):
for traj_queue in self.trajectory_queues:
# We grab at most the maximum length of the queue.
# This ensures that even if the queue is being filled faster than it is
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):
t = traj_queue.get_nowait()
except AgentManagerQueue.Empty:
if self.should_still_train:
if self._is_ready_update():
with hierarchical_timer("_update_policy"):
for q in self.policy_queues:
# Get policies that correspond to the policy queue in question
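
The summary scheduling in `_get_next_summary_step` and `_maybe_write_summary` can be checked in isolation. The sketch below restates both expressions as free functions; the function names are hypothetical and not part of the trainer classes, and `12000` is only an example value for `summary_freq`.

```python
def next_summary_step(step: int, summary_freq: int) -> int:
    """Smallest multiple of summary_freq that is strictly greater than step."""
    return step + (summary_freq - step % summary_freq)


def should_write_summary(step_now: int, trajectory_len: int, next_step: int) -> bool:
    """Same condition as _maybe_write_summary, written as a free function."""
    step_after_process = step_now + trajectory_len
    # No summary is written while the trainer is still at step 0.
    return step_after_process >= next_step and step_now != 0


# A trainer at step 12500 with summary_freq=12000 next summarizes at 24000.
upcoming = next_summary_step(12500, 12000)
```

Because the result is always strictly greater than `step`, a trainer sitting exactly on a multiple of `summary_freq` schedules the following multiple rather than the current one.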


# # Unity ML-Agents Toolkit
import logging
from typing import Dict, List, Deque, Any
import time
import abc
from collections import deque

from mlagents.trainers.stats import StatsReporter, StatsPropertyType
from mlagents.trainers.stats import StatsReporter
from mlagents_envs.timers import hierarchical_timer
logger = logging.getLogger("mlagents.trainers")

self.run_id = run_id
self.trainer_parameters = trainer_parameters
self.summary_path = trainer_parameters["summary_path"]
self.stats_reporter = StatsReporter(self.summary_path)
self.cumulative_returns_since_policy_update: List[float] = []
self._stats_reporter = StatsReporter(self.summary_path)
self.training_start_time = time.time()
StatsPropertyType.HYPERPARAMETERS, self.trainer_parameters
def stats_reporter(self):
Returns the stats reporter associated with this Trainer.
return self._stats_reporter
def _check_param_keys(self):
for k in self.param_keys:

return self._reward_buffer
def _increment_step(self, n_steps: int, name_behavior_id: str) -> None:
Increment the step count of the trainer
:param n_steps: number of steps to increment the step count by
self.step += n_steps
self.next_summary_step = self._get_next_summary_step()
p = self.get_policy(name_behavior_id)
if p:
def _get_next_summary_step(self) -> int:
Get the next step count that should result in a summary write.
return self.step + (self.summary_freq - self.step % self.summary_freq)
def save_model(self, name_behavior_id: str) -> None:
Saves the model

settings = SerializationSettings(policy.model_path, policy.brain.brain_name)
export_policy_model(settings, policy.graph, policy.sess)
def _write_summary(self, step: int) -> None:
Saves training statistics to Tensorboard.
self.stats_reporter.add_stat("Is Training", float(self.should_still_train))
def _process_trajectory(self, trajectory: Trajectory) -> None:
Takes a trajectory and processes it, putting it into the update buffer.
:param trajectory: The Trajectory tuple containing the steps to be processed.
self._maybe_write_summary(self.get_step + len(trajectory.steps))
self._increment_step(len(trajectory.steps), trajectory.behavior_id)
def _maybe_write_summary(self, step_after_process: int) -> None:
If processing the trajectory will make the step exceed the next summary write,
write the summary. This logic ensures summaries are written on the update step and not in between.
:param step_after_process: the step count after processing the next trajectory.
if step_after_process >= self.next_summary_step and self.get_step != 0:
def end_episode(self):

def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None:
Adds policy to trainer
Adds policy to trainer.

Gets policy from trainer
Gets policy from trainer.
def _is_ready_update(self):
def advance(self) -> None:
Returns whether or not the trainer has enough elements to run update model
:return: A boolean corresponding to whether or not update_model() can be run
return False
def _update_policy(self):
Uses demonstration_buffer to update model.
Advances the trainer. Typically, this means grabbing trajectories
from all subscribed trajectory queues (self.trajectory_queues), and updating
a policy using the steps in them, and if needed pushing a new policy onto the right
policy queues (self.policy_queues).
def advance(self) -> None:
Steps the trainer, taking in trajectories and updates if ready.
with hierarchical_timer("process_trajectory"):
for traj_queue in self.trajectory_queues:
# We grab at most the maximum length of the queue.
# This ensures that even if the queue is being filled faster than it is
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):
t = traj_queue.get_nowait()
except AgentManagerQueue.Empty:
if self.should_still_train:
if self._is_ready_update():
with hierarchical_timer("_update_policy"):
for q in self.policy_queues:
# Get policies that correspond to the policy queue in question
:param queue: Policy queue to publish to.
:param policy_queue: Policy queue to publish to.

Adds a trajectory queue to the list of queues for the trainer to ingest Trajectories from.
:param queue: Trajectory queue to publish to.
:param trajectory_queue: Trajectory queue to read from.
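
The bounded read in `advance()` takes at most `maxlen` items per call, so the loop terminates even if a producer keeps filling the queue. The sketch below shows the same pattern with the standard-library `queue.Queue` standing in for `AgentManagerQueue`; the `try`/`break` lines are not visible in the hunk, so that part is an assumption, and `drain_bounded` is a hypothetical name.

```python
import queue
from typing import List


def drain_bounded(q: queue.Queue, max_items: int) -> List[int]:
    """Take at most max_items from q without blocking; stop early once it is empty."""
    taken: List[int] = []
    for _ in range(max_items):
        try:
            taken.append(q.get_nowait())
        except queue.Empty:
            break
    return taken


# Stand-in for a trajectory queue with a maximum length of 3.
trajectories: queue.Queue = queue.Queue(maxsize=3)
for item in (1, 2):
    trajectories.put(item)

drained = drain_bounded(trajectories, trajectories.maxsize)
```

Capping the loop at the queue's capacity bounds the work done per call regardless of how fast new items arrive.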


# Getting Started Guide
This guide walks through the end-to-end process of opening an ML-Agents
toolkit example environment in Unity, building the Unity executable, training an
Agent in it, and finally embedding the trained model into the Unity environment.
The ML-Agents toolkit includes a number of [example
environments](Learning-Environment-Examples.md) which you can examine to help
understand the different ways in which the ML-Agents toolkit can be used. These
environments can also serve as templates for new environments or as ways to test
new ML algorithms. After reading this tutorial, you should be able to explore
and train the example environments.
If you are not familiar with the [Unity Engine](https://unity3d.com/unity), we
highly recommend the [Roll-a-ball
tutorial](https://unity3d.com/learn/tutorials/s/roll-ball-tutorial) to learn all
the basic concepts first.
![3D Balance Ball](images/balance.png)
This guide uses the **3D Balance Ball** environment to teach the basic concepts and
usage patterns of ML-Agents. 3D Balance Ball
contains a number of agent cubes and balls (which are all copies of each other).
Each agent cube tries to keep its ball from falling by rotating either
horizontally or vertically. In this environment, an agent cube is an **Agent** that
receives a reward for every step that it balances the ball. An agent is also
penalized with a negative reward for dropping the ball. The goal of the training
process is to have the agents learn to balance the ball on their head.
Let's get started!
## Installation
To install and set up the ML-Agents toolkit, its Python dependencies,
and Unity, see the [installation instructions](Installation.md).
Depending on your version of Unity, it may be necessary to change the **Scripting Runtime Version** of your project. This can be done as follows:
1. Launch Unity
2. On the Projects dialog, choose the **Open** option at the top of the window.
3. Using the file dialog that opens, locate the `Project` folder
within the ML-Agents toolkit project and click **Open**.
4. Go to **Edit** > **Project Settings** > **Player**
5. For **each** of the platforms you target (**PC, Mac and Linux Standalone**,
**iOS** or **Android**):
1. Expand the **Other Settings** section.
2. Set **Scripting Runtime Version** to **Experimental (.NET 4.6
Equivalent or .NET 4.x Equivalent)**
6. Go to **File** > **Save Project**
## Understanding a Unity Environment
An agent is an autonomous actor that observes and interacts with an
_environment_. In the context of Unity, an environment is a scene containing
one or more Agent objects, and, of course, the other
entities that an agent interacts with.
![Unity Editor](images/mlagents-3DBallHierarchy.png)
**Note:** In Unity, the base object of everything in a scene is the
_GameObject_. The GameObject is essentially a container for everything else,
including behaviors, graphics, physics, etc. To see the components that make up
a GameObject, select the GameObject in the Scene window, and open the Inspector
window. The Inspector shows every component on a GameObject.
The first thing you may notice after opening the 3D Balance Ball scene is that
it contains not one, but several agent cubes. Each agent cube in the scene is an
independent agent, but they all share the same Behavior. 3D Balance Ball does this
to speed up training since all twelve agents contribute to training in parallel.
### Agent
The Agent is the actor that observes and takes actions in the environment. In
the 3D Balance Ball environment, the Agent components are placed on the twelve
"Agent" GameObjects. The base Agent object has a few properties that affect its
behavior:
* **Behavior Parameters** — Every Agent must have a Behavior. The Behavior
determines how an Agent makes decisions. More on Behavior Parameters in
the next section.
* **Max Step** — Defines how many simulation steps can occur before the Agent's
episode ends. In 3D Balance Ball, an Agent restarts after 5000 steps.
When you create an Agent, you must extend the base Agent class.
The Ball3DAgent subclass defines the following methods:
* `Agent.OnEpisodeBegin()` — Called at the beginning of an Agent's episode, including at the beginning
of the simulation. The Ball3DAgent class uses this function to reset the
agent cube and ball to their starting positions. The function randomizes the reset values so that the
training generalizes to more than a specific starting position and agent cube
* `Agent.CollectObservations(VectorSensor sensor)` — Called every simulation step. Responsible for
collecting the Agent's observations of the environment. Since the Behavior
Parameters of the Agent are set with vector observation
space with a state size of 8, the `CollectObservations(VectorSensor sensor)` method must call
`VectorSensor.AddObservation()` such that the vector size adds up to 8.
* `Agent.OnActionReceived()` — Called every time the Agent receives an action to take. Receives the action chosen
by the Agent. The vector action spaces result in a
small change in the agent cube's rotation at each step. The `OnActionReceived()` method
assigns a reward to the Agent; in this example, an Agent receives a small
positive reward for each step it keeps the ball on the agent cube's head and a larger,
negative reward for dropping the ball. An Agent's episode is also ended when it
drops the ball so that it will reset with a new ball for the next simulation
step.
* `Agent.Heuristic()` - When the `Behavior Type` is set to `Heuristic Only` in the Behavior
Parameters of the Agent, the Agent will use the `Heuristic()` method to generate
the actions of the Agent. As such, the `Heuristic()` method returns an array of
floats. In the case of the Ball 3D Agent, the `Heuristic()` method converts the
keyboard inputs into actions.
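
The reward scheme described for `OnActionReceived()` can be sketched in a few lines. This is an illustration in Python rather than the C# used by the Agent class, and the reward magnitudes (`0.1` and `-1.0`) are assumed values chosen only to show "small positive" versus "larger negative"; the guide does not state the actual numbers.

```python
from typing import Tuple


def step_reward(
    ball_dropped: bool,
    step_bonus: float = 0.1,     # assumed value, not taken from the guide
    drop_penalty: float = -1.0,  # assumed value, not taken from the guide
) -> Tuple[float, bool]:
    """Return (reward, episode_done) for one simulation step."""
    if ball_dropped:
        # Dropping the ball is penalized and ends the episode.
        return drop_penalty, True
    # Otherwise the agent earns a small bonus and the episode continues.
    return step_bonus, False


reward, done = step_reward(ball_dropped=False)
```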
#### Behavior Parameters : Vector Observation Space
Before making a decision, an agent collects its observation about its state in
the world. The vector observation is a vector of floating point numbers that
contains relevant information for the agent to make decisions.
The Behavior Parameters of the 3D Balance Ball example use a **Space Size** of 8.
This means that the feature
vector containing the Agent's observations contains eight elements: the `x` and
`z` components of the agent cube's rotation and the `x`, `y`, and `z` components
of the ball's relative position and velocity. (The observation values are
defined in the Agent's `CollectObservations(VectorSensor sensor)` method.)
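
As a rough illustration of how the eight elements add up, the sketch below assembles the observation in Python. The real agent does this in C# through `VectorSensor.AddObservation()`; the function and parameter names here are hypothetical.

```python
from typing import List, Sequence


def collect_observations(
    cube_rotation_xz: Sequence[float],        # 2 values: x and z rotation
    ball_relative_position: Sequence[float],  # 3 values: x, y, z
    ball_velocity: Sequence[float],           # 3 values: x, y, z
) -> List[float]:
    """Concatenate the parts into one flat observation of size 8."""
    observation = [*cube_rotation_xz, *ball_relative_position, *ball_velocity]
    if len(observation) != 8:
        raise ValueError(f"expected 8 observation values, got {len(observation)}")
    return observation


obs = collect_observations([0.0, 0.1], [0.0, 0.5, 0.0], [0.0, -0.2, 0.0])
```

The length check plays the role of the **Space Size** setting: the number of values added must match the size declared in the Behavior Parameters.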
#### Behavior Parameters : Vector Action Space
An Agent is given instructions in the form of a float array of *actions*.
The ML-Agents toolkit classifies actions into two types: continuous and discrete. The **Continuous** vector
action space is a vector of numbers that can vary continuously. What each
element of the vector means is defined by the Agent logic (the training
process just learns what values are better given particular state observations
based on the rewards received when it tries different values). For example, an
element might represent a force or torque applied to a `Rigidbody` in the Agent.
The **Discrete** vector action space defines its actions as tables. An action
given to the Agent is an array of indices into tables.
The 3D Balance Ball example is programmed to use a continuous action
space with a `Space Size` of 2.
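
The difference between the two action types can be sketched as follows. Both functions are hypothetical: the clamping to `[-1, 1]`, the `max_delta` scale, and the example tables are assumptions made for illustration, not values taken from the 3D Balance Ball agent.

```python
from typing import List, Sequence


def apply_continuous(action: Sequence[float], max_delta: float = 2.0) -> List[float]:
    """Treat a size-2 continuous action as small rotation changes on two axes."""
    if len(action) != 2:
        raise ValueError("this sketch expects a continuous action of size 2")
    # Clamp each element to [-1, 1], then scale it to a rotation delta.
    return [max_delta * max(-1.0, min(1.0, a)) for a in action]


def apply_discrete(action: Sequence[int], tables: Sequence[Sequence[str]]) -> List[str]:
    """Treat a discrete action as one index into each table."""
    return [table[index] for table, index in zip(tables, action)]


rotation_deltas = apply_continuous([0.5, -2.0])
choices = apply_discrete([1, 0], [["stay", "left", "right"], ["no-jump", "jump"]])
```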
## Running a pre-trained model
We include pre-trained models for our agents (`.nn` files) and we use the
[Unity Inference Engine](Unity-Inference-Engine.md) to run these models
inside Unity. In this section, we will use the pre-trained model for the
3D Ball example.
1. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Scenes` folder
and open the `3DBall` scene file.
2. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Prefabs` folder.
Expand `3DBall` and click on the `Agent` prefab. You should see the `Agent` prefab in the **Inspector** window.
**Note**: The platforms in the `3DBall` scene were created using the `3DBall` prefab. Instead of updating all 12 platforms individually, you can update the `3DBall` prefab instead.
![Platform Prefab](images/platform_prefab.png)
3. In the **Project** window, drag the **3DBall** Model located in
`Assets/ML-Agents/Examples/3DBall/TFModels` into the `Model` property under `Behavior Parameters (Script)` component in the Agent GameObject **Inspector** window.
![3dball learning brain](images/3dball_learning_brain.png)
4. You should notice that each `Agent` under each `3DBall` in the **Hierarchy** window now contains **3DBall** as `Model` on the `Behavior Parameters`. __Note__ : You can modify multiple game objects in a scene by selecting them all at
once using the search bar in the Scene Hierarchy.
5. Select the **InferenceDevice** to use for this model (CPU or GPU) on the Agent.
_Note: CPU is faster for the majority of ML-Agents toolkit generated models_
6. Click the **Play** button and you will see the platforms balance the balls
using the pre-trained model.
## Training a new model with Reinforcement Learning
While we provide pre-trained `.nn` files for the agents in this environment, any environment you make yourself will require training agents from scratch to generate a new model file. We can do this using reinforcement learning.
In order to train an agent to correctly balance the ball, we provide two
deep reinforcement learning algorithms.
The default algorithm is Proximal Policy Optimization (PPO). This
is a method that has been shown to be more general purpose and stable
than many other RL algorithms. For more information on PPO, OpenAI
has a [blog post](https://blog.openai.com/openai-baselines-ppo/)
explaining it, and [our page](Training-PPO.md) for how to use it in training.
We also provide Soft Actor-Critic (SAC), an off-policy algorithm that
has been shown to be both stable and sample-efficient.
For more information on SAC, see UC Berkeley's
[blog post](https://bair.berkeley.edu/blog/2018/12/14/sac/) and
[our page](Training-SAC.md) for more guidance on when to use SAC vs. PPO. To
use SAC to train Balance Ball, replace all references to `config/trainer_config.yaml`
with `config/sac_trainer_config.yaml` below.
To train the agents within the Balance Ball environment, we will be using the
ML-Agents Python package. We have provided a convenient command called `mlagents-learn`
which accepts arguments used to configure both training and inference phases.
### Training the environment
1. Open a command or terminal window.
2. Navigate to the folder where you cloned the ML-Agents toolkit repository.
**Note**: If you followed the default [installation](Installation.md), then
you should be able to run `mlagents-learn` from any directory.
3. Run `mlagents-learn <trainer-config-path> --run-id=<run-identifier> --train`
- `<trainer-config-path>` is the relative or absolute filepath of the
trainer configuration. The defaults used by example environments included
in `MLAgentsSDK` can be found in `config/trainer_config.yaml`.
- `<run-identifier>` is a string used to separate the results of different
training runs
- `--train` tells `mlagents-learn` to run a training session (rather
than inference)
4. If you cloned the ML-Agents repo, then you can simply run
`mlagents-learn config/trainer_config.yaml --run-id=firstRun --train`
5. When the message _"Start training by pressing the Play button in the Unity
Editor"_ is displayed on the screen, you can press the :arrow_forward: button
in Unity to start training in the Editor.
**Note**: If you're using Anaconda, don't forget to activate the ml-agents
environment first.
The `--train` flag tells the ML-Agents toolkit to run in training mode.
The `--time-scale=100` option sets the `Time.timeScale` value in Unity.
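
If you launch training from a script, the command line from step 3 can be assembled programmatically. The helper below is a hypothetical convenience, not part of the ML-Agents package; it only builds the argument list using the flags shown in this guide and does not run anything.

```python
import shlex
from typing import List


def build_learn_command(trainer_config_path: str, run_id: str, train: bool = True) -> List[str]:
    """Build the argv list for `mlagents-learn` from the arguments in this guide."""
    command = ["mlagents-learn", trainer_config_path, f"--run-id={run_id}"]
    if train:
        command.append("--train")
    return command


command = build_learn_command("config/trainer_config.yaml", "firstRun")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` once the `mlagents-learn` executable is on your path. To train with SAC, pass `config/sac_trainer_config.yaml` as the first argument.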
**Note**: You can train using an executable rather than the Editor. To do so,
follow the instructions in
[Using an Executable](Learning-Environment-Executable.md).