
Cleaning up documentation.

/develop-generalizationTraining-TrainerController
Deric Pang, 6 years ago
Current commit: 40f4eb3e
37 files changed, with 2,535 insertions and 1,296 deletions
1. docs/API-Reference.md (10 lines changed)
2. docs/Background-Jupyter.md (10 lines changed)
3. docs/Background-Machine-Learning.md (301 lines changed)
4. docs/Background-TensorFlow.md (74 lines changed)
5. docs/Background-Unity.md (12 lines changed)
6. docs/Basic-Guide.md (11 lines changed)
7. docs/FAQ.md (12 lines changed)
8. docs/Feature-Memory.md (55 lines changed)
9. docs/Feature-Monitor.md (42 lines changed)
10. docs/Getting-Started-with-Balance-Ball.md (16 lines changed)
11. docs/Installation-Windows.md (248 lines changed)
12. docs/Installation.md (28 lines changed)
13. docs/Learning-Environment-Best-Practices.md (64 lines changed)
14. docs/Learning-Environment-Create-New.md (330 lines changed)
15. docs/Learning-Environment-Design-Academy.md (55 lines changed)
16. docs/Learning-Environment-Design-Agents.md (429 lines changed)
17. docs/Learning-Environment-Design-Brains.md (106 lines changed)
18. docs/Learning-Environment-Design-External-Internal-Brains.md (118 lines changed)
19. docs/Learning-Environment-Design-Heuristic-Brains.md (34 lines changed)
20. docs/Learning-Environment-Design-Player-Brains.md (47 lines changed)
21. docs/Learning-Environment-Design.md (189 lines changed)
22. docs/Learning-Environment-Examples.md (343 lines changed)
23. docs/Learning-Environment-Executable.md (123 lines changed)
24. docs/Limitations.md (25 lines changed)
25. docs/ML-Agents-Overview.md (26 lines changed)
26. docs/Migrating.md (75 lines changed)
27. docs/Python-API.md (10 lines changed)
28. docs/Training-Curriculum-Learning.md (9 lines changed)
29. docs/Training-Imitation-Learning.md (76 lines changed)
30. docs/Training-ML-Agents.md (167 lines changed)
31. docs/Training-PPO.md (218 lines changed)
32. docs/Training-on-Amazon-Web-Service.md (104 lines changed)
33. docs/Training-on-Microsoft-Azure-Custom-Instance.md (112 lines changed)
34. docs/Training-on-Microsoft-Azure.md (102 lines changed)
35. docs/Using-Docker.md (8 lines changed)
36. docs/Using-TensorFlow-Sharp-in-Unity.md (171 lines changed)
37. docs/Using-Tensorboard.md (71 lines changed)

docs/API-Reference.md (10 lines changed)


[Doxygen](http://www.stack.nl/~dimitri/doxygen/) for auto-generating HTML
documentation.
To generate the API reference,
[download Doxygen](http://www.stack.nl/~dimitri/doxygen/download.html)
and run the following command within the `docs/` directory:
```sh
doxygen dox-ml-agents.conf
```
`dox-ml-agents.conf` is a Doxygen configuration file for the ML-Agents toolkit
that includes the classes that have been properly formatted. The generated HTML

docs/Background-Jupyter.md (10 lines changed)


[Jupyter](https://jupyter.org) is a fantastic tool for writing code with
embedded visualizations. We provide one such notebook,
`notebooks/getting-started.ipynb`, for testing the Python control interface to a
Unity build. This notebook is introduced in the
[Getting Started with the 3D Balance Ball Environment](Getting-Started-with-Balance-Ball.md)
tutorial, but can be used for testing the connection to any Unity build.
For a walkthrough of how to use Jupyter, see

```sh
jupyter notebook
```
Then navigate to `localhost:8888` to access your notebooks.

docs/Background-Machine-Learning.md (301 lines changed)


# Background: Machine Learning
Given that a number of users of the ML-Agents toolkit might not have a formal
machine learning background, this page provides an overview to facilitate the
understanding of the ML-Agents toolkit. However, we will not attempt to provide
a thorough treatment of machine learning as there are fantastic resources
online.
Machine learning, a branch of artificial intelligence, focuses on learning
include: unsupervised learning, supervised learning and reinforcement learning.
Each class of algorithm learns from a different type of data. The following
paragraphs provide an overview for each of these classes of machine learning, as
well as introductory examples.
The goal of
[unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning) is
to group or cluster similar items in a data set. For example, consider the
players of a game. We may want to group the players depending on how engaged
they are with the game. This would enable us to target different groups (e.g.
for highly-engaged players we might invite them to be beta testers for new
features, while for unengaged players we might email them helpful tutorials).
Say that we wish to split our players into two groups. We would first define
basic attributes of the players, such as the number of hours played, total money
spent on in-app purchases and number of levels completed. We can then feed this
data set (three attributes for every player) to an unsupervised learning
algorithm where we specify the number of groups to be two. The algorithm would
then split the data set of players into two groups where the players within each
group would be similar to each other. Given the attributes we used to describe
each player, in this case, the output would be a split of all the players into
two groups, where one group would semantically represent the engaged players and
the second group would semantically represent the unengaged players.
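To make this concrete, here is a minimal, purely illustrative sketch of the
two-group clustering described above. It is not part of the ML-Agents toolkit;
it assumes scikit-learn and NumPy are installed, and the player data is made up.

```python
# Illustrative only: cluster players into two groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

# One row per player: [hours played, money spent, levels completed]
players = np.array([
    [120.0, 45.0, 30],   # plays a lot, spends, progresses
    [2.5,   0.0,  1],    # barely plays
    [80.0,  10.0, 22],
    [1.0,   0.0,  2],
])

# We only specify the number of groups; the algorithm finds the split itself.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(players)
print(kmeans.labels_)  # one group label (0 or 1) per player, e.g. [0 1 0 1]
```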
defined the appropriate attributes and relied on the algorithm to uncover the
two groups on its own. This type of data set is typically called an unlabeled
data set as it is lacking these direct labels. Consequently, unsupervised
learning can be helpful in situations where these labels can be expensive or
hard to produce. In the next paragraph, we overview supervised learning
algorithms which accept input labels in addition to attributes.
In [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning), we
do not want to just group similar items but directly learn a mapping from each
item to the group (or class) that it belongs to. Returning to our earlier
example of clustering players, let's say we now wish to predict which of our
players are about to churn (that is, stop playing the game for the next 30 days).
We can look into our historical records and create a data set that contains
attributes of our players in addition to a label indicating whether they have
churned or not. Note that the player attributes we use for this churn prediction
task may be different from the ones we used for our earlier clustering task. We
can then feed this data set (attributes **and** label for each player) into a
supervised learning algorithm which would learn a mapping from the player
attributes to a label indicating whether that player will churn or not. The
intuition is that the supervised learning algorithm will learn which values of
these attributes typically correspond to players who have churned and not
churned (for example, it may learn that players who spend very little and play
for very short periods will most likely churn). Now, given this learned model, we
can provide it the attributes of a new player (one that recently started playing
the game) and it would output a _predicted_ label for that player. This
prediction is the algorithm's expectation of whether the player will churn or
not. We can now use these predictions to target the players who are expected to
churn and entice them to continue playing the game.
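A minimal, hypothetical sketch of this churn prediction, again assuming
scikit-learn and NumPy with made-up data (not part of the ML-Agents toolkit):

```python
# Illustrative only: learn a mapping from player attributes to a churn label.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Attributes per player: [hours played, money spent, levels completed]
attributes = np.array([
    [120.0, 45.0, 30],
    [2.5,   0.0,  1],
    [80.0,  10.0, 22],
    [1.0,   0.0,  2],
])
# Labels from historical records: 1 = churned, 0 = did not churn
labels = np.array([0, 1, 0, 1])

# Training phase: fit the model to the labeled data.
model = LogisticRegression().fit(attributes, labels)

# Inference phase: predict whether a brand-new player is likely to churn.
new_player = np.array([[3.0, 0.0, 2]])
print(model.predict(new_player))  # e.g. [1] -> expected to churn
```

Note how the `fit` call corresponds to the training phase and `predict` to the
inference phase discussed later on this page.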
player. Model selection, on the other hand, pertains to selecting the algorithm
(and its parameters) that performs the task well. Both of these tasks are active
areas of machine learning research and, in practice, require several iterations
to achieve good performance.
We now switch to reinforcement learning, the third class of machine learning
algorithms, and arguably the one most relevant for the ML-Agents toolkit.
can be viewed as a form of learning for sequential decision making that is
commonly associated with controlling robots (but is, in fact, much more
general). Consider an autonomous firefighting robot that is tasked with
navigating into an area, finding the fire and neutralizing it. At any given
moment, the robot perceives the environment through its sensors (e.g. camera,
heat, touch), processes this information and produces an action (e.g. move to
the left, rotate the water hose, turn on the water). In other words, it is
continuously making decisions about how to interact in this environment given
its view of the world (i.e. sensor input) and objective (i.e. neutralizing the
fire). Teaching a robot to be a successful firefighting machine is precisely
what reinforcement learning is designed to do.
More specifically, the goal of reinforcement learning is to learn a **policy**,
which is essentially a mapping from **observations** to **actions**. An
observation is what the robot can measure from its **environment** (in this
to the configuration of the robot (e.g. position of its base, position of its
water hose and whether the hose is on or off).
The last remaining piece of the reinforcement learning task is the **reward
signal**. When training a robot to be a mean firefighting machine, we provide it
with rewards (positive and negative) indicating how well it is doing on
completing the task. Note that the robot does not _know_ how to put out fires
before it is trained. It learns the objective because it receives a large
positive reward when it puts out the fire and a small negative reward for every
passing second. The fact that rewards are sparse (i.e. may not be provided at
every step, but only when a robot arrives at a success or failure situation), is
a defining characteristic of reinforcement learning and precisely why learning
good policies can be difficult (and/or time-consuming) for complex environments.
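The observation-action-reward cycle described above can be summarized in a
short, hypothetical sketch. The `env` and `policy` objects below are
illustrative stand-ins, not the ML-Agents API:

```python
# Illustrative sketch of one reinforcement learning episode, not ML-Agents code.
def run_episode(env, policy, train=True):
    observation = env.reset()       # e.g. camera, heat and touch readings
    total_reward = 0.0
    done = False
    while not done:
        action = policy.act(observation)             # e.g. move left, aim hose
        observation, reward, done = env.step(action)
        # Rewards are sparse: mostly small per-second penalties,
        # plus a large bonus once the fire is out.
        total_reward += reward
        if train:
            policy.update(observation, action, reward)
    return total_reward
```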
<p align="center">
<img src="images/rl_cycle.png" alt="The reinforcement learning cycle."/>

usually requires many trials and iterative policy updates. More specifically,
the robot is placed in several fire situations and over time learns an optimal
policy which allows it to put out fires more effectively. Obviously, we cannot
expect to train a robot repeatedly in the real world, particularly when fires
are involved. This is precisely why the use of
serves as the perfect training grounds for learning such behaviors. While our
discussion of reinforcement learning has centered around robots, there are
strong parallels between robots and characters in a game. In fact, in many ways,
one can view a non-playable character (NPC) as a virtual robot, with its own
observations about the environment, its own set of actions and a specific
objective. Thus it is natural to explore how we can train behaviors within Unity
using reinforcement learning. This is precisely what the ML-Agents toolkit
offers. The video linked below includes a reinforcement learning demo showcasing
training character behaviors using the ML-Agents toolkit.
<a href="http://www.youtube.com/watch?feature=player_embedded&v=fiQsmdwEGT8" target="_blank">
<img src="http://img.youtube.com/vi/fiQsmdwEGT8/0.jpg" alt="RL Demo" width="400" border="10" />
</a>
also involves two tasks: attribute selection and model selection. Attribute
selection is defining the set of observations for the robot that best help it
complete its objective, while model selection is defining the form of the policy
(mapping from observations to actions) and its parameters. In practice, training
behaviors is an iterative process that may require changing the attribute and
model choices.
One common aspect of all three branches of machine learning is that they all
involve a **training phase** and an **inference phase**. While the details of
the training and inference phases are different for each of the three, at a
high-level, the training phase involves building a model using the provided
data, while the inference phase involves applying this model to new, previously
unseen, data. More specifically:

* For our unsupervised learning example, the training phase learns the optimal
  two clusters based on the data describing existing players, while the
  inference phase assigns a new player to one of these two clusters.
* For our supervised learning example, the training phase learns the mapping
  from player attributes to player label (whether they churned or not), and the
  inference phase predicts whether a new player will churn or not based on that
  learned mapping.
* For our reinforcement learning example, the training phase learns the optimal
  policy through guided trials, and in the inference phase, the agent observes
  and takes actions in the wild using its learned policy.

To briefly summarize: all three classes of algorithms involve training and
inference phases in addition to attribute and model selections. What ultimately
separates them is the type of data available to learn from. In unsupervised
learning our data set was a collection of attributes, in supervised learning our
data set was a collection of attribute-label pairs, and, lastly, in
reinforcement learning our data set was a collection of
[Deep learning](https://en.wikipedia.org/wiki/Deep_learning) is a family of
algorithms that can be used to address any of the problems introduced above.
More specifically, they can be used to solve both attribute and model selection
tasks. Deep learning has gained popularity in recent years due to its
outstanding performance on several challenging machine learning tasks. One
example is [AlphaGo](https://en.wikipedia.org/wiki/AlphaGo), a
[computer Go](https://en.wikipedia.org/wiki/Computer_Go) program that leverages
deep learning and was able to beat Lee Sedol (a Go world champion).
complex functions from large amounts of training data. This makes them a natural
choice for reinforcement learning tasks when a large amount of data can be
generated, say through the use of a simulator or engine such as Unity. By
generating hundreds of thousands of simulations of the environment within Unity,
we can learn policies for very complex environments (a complex environment is
one where the number of observations an agent perceives and the number of
actions they can take are large). Many of the algorithms we provide in ML-Agents
use some form of deep learning, built on top of the open-source library,
[TensorFlow](Background-TensorFlow.md).

docs/Background-TensorFlow.md (74 lines changed)


# Background: TensorFlow
As discussed in our
[machine learning background page](Background-Machine-Learning.md),
many of the algorithms we provide in the ML-Agents toolkit leverage some form of
deep learning. More specifically, our implementations are built on top of the
open-source library [TensorFlow](https://www.tensorflow.org/). This means that
the models produced by the ML-Agents toolkit are (currently) in a format only
understood by
TensorFlow. In this page we provide a brief overview of TensorFlow, in addition
to TensorFlow-related tools that we leverage within the ML-Agents toolkit.

performing computations using data flow graphs, the underlying representation of
deep learning models. It facilitates training and inference on CPUs and GPUs in
a desktop, server, or mobile device. Within the ML-Agents toolkit, when you
train the behavior of an Agent, the output is a TensorFlow model (.bytes) file
that you can then embed within an Internal Brain. Unless you implement a new
algorithm, the use of TensorFlow is mostly abstracted away and behind the
scenes.
One component of training models with TensorFlow is setting the values of
certain model attributes (called _hyperparameters_). Finding the right values of
these hyperparameters can require a few iterations. Consequently, we leverage a
visualization tool within TensorFlow called
[TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard).
It allows the visualization of certain agent attributes (e.g. reward) throughout
training which can be helpful in both building intuitions for the different
hyperparameters and setting the optimal values for your Unity environment. We
provide more details on setting the hyperparameters in later parts of the
documentation, but, in the meantime, if you are unfamiliar with TensorBoard we
recommend this
One of the drawbacks of TensorFlow is that it does not provide a native C# API.
This means that the Internal Brain is not natively supported since Unity scripts
are written in C#. Consequently, to enable the Internal Brain, we leverage a
third-party library
[TensorFlowSharp](https://github.com/migueldeicaza/TensorFlowSharp) which
provides .NET bindings to TensorFlow. Thus, when a Unity environment that
contains an Internal Brain is built, inference is performed via TensorFlowSharp.
We provide an additional in-depth overview of how to leverage
[TensorFlowSharp within Unity](Using-TensorFlow-Sharp-in-Unity.md)
which will become more relevant once you install and start training behaviors
within the ML-Agents toolkit. Given the reliance on TensorFlowSharp, the
Internal Brain is currently marked as experimental.

docs/Background-Unity.md (12 lines changed)


# Background: Unity
If you are not familiar with the [Unity Engine](https://unity3d.com/unity), we
highly recommend the [Unity Manual](https://docs.unity3d.com/Manual/index.html)
and [Tutorials page](https://unity3d.com/learn/tutorials). The
with the ML-Agents toolkit:
* [Editor](https://docs.unity3d.com/Manual/UsingTheEditor.html)
* [Interface](https://docs.unity3d.com/Manual/LearningtheInterface.html)
* [Scene](https://docs.unity3d.com/Manual/CreatingScenes.html)

* [Scripting](https://docs.unity3d.com/Manual/ScriptingSection.html)
* [Physics](https://docs.unity3d.com/Manual/PhysicsSection.html)
* [Ordering of event functions](https://docs.unity3d.com/Manual/ExecutionOrder.html)
  (e.g. FixedUpdate, Update)

docs/Basic-Guide.md (11 lines changed)


**Plugins** > **Computer**.
**Note**: If you don't see anything under **Assets**, drag the
`MLAgentsSDK/Assets/ML-Agents` folder under **Assets** within the Project window.
![Imported TensorFlowsharp](images/imported-tensorflowsharp.png)

if you want to [use an executable](Learning-Environment-Executable.md) or to
`None` if you want to interact with the current scene in the Unity Editor.
More information and documentation is provided in the
[Python API](Python-API.md) page.
## Training the Brain with Reinforcement Learning

training runs
- And the `--train` tells `mlagents-learn` to run a training session (rather
than inference)
4. When the message _"Start training by pressing the Play button in the Unity
Editor"_ is displayed on the screen, you can press the :arrow_forward: button
in Unity to start training in the Editor.

- For a "Hello World" introduction to creating your own learning environment,
check out the [Making a New Learning
Environment](Learning-Environment-Create-New.md) page.
- For a series of YouTube video tutorials, check out the
  [Machine Learning Agents PlayList](https://www.youtube.com/playlist?list=PLX2vGYjWbI0R08eWQkO7nQkGiicHAX7IX)
page.

docs/FAQ.md (12 lines changed)


If you haven't switched your scripting runtime version from .NET 3.5 to .NET 4.6
or .NET 4.x, you will see an error message such as:
```console
error CS1061: Type `System.Text.StringBuilder' does not contain a definition for `Clear' and no extension method `Clear' of type `System.Text.StringBuilder' could be found. Are you missing an assembly reference?
```

ENABLE_TENSORFLOW flag for your scripting define symbols, you will see the
following error message:
```console
You need to install and enable the TensorFlowSharp plugin in order to use the internal brain.
```

If you have a graph placeholder set in the internal Brain inspector that is not
present in the TensorFlow graph, you will see an error like this:
```console
UnityAgentsException: One of the Tensorflow placeholder could not be found. In brain <some_brain_name>, there are no FloatingPoint placeholder named <some_placeholder_name>.
```

Similarly, if you have a graph scope set in the internal Brain inspector that is
not correctly set, you will see an error like this:
```console
UnityAgentsException: The node <Wrong_Graph_Scope>/action could not be found. Please make sure the graphScope <Wrong_Graph_Scope>/ is correct
```

If you receive such a permission error on macOS, run:
```sh
chmod -R 755 *.x86_64
```

docs/Feature-Memory.md (55 lines changed)


# Memory-enhanced Agents using Recurrent Neural Networks
## What are memories for?

Have you ever entered a room to get something and immediately forgot what you
were looking for? Don't let that happen to your agents.

It is now possible to give memories to your agents. When training, the agents
will be able to store a vector of floats to be used next time they need to make
a decision.
Deciding what the agents should remember in order to solve a task is not easy to
do by hand, but our training algorithms can learn to keep track of what is
important to remember with
[LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).
When configuring the trainer parameters in the `config/trainer_config.yaml`
file, add the following parameters to the Brain you want to use.
```yaml
use_recurrent: true
sequence_length: 64   # illustrative value
memory_size: 256      # illustrative value; must be divisible by 4
```
* `use_recurrent` is a flag that notifies the trainer that you want to use a
  Recurrent Neural Network.
* `sequence_length` defines how long the sequences of experiences must be while
  training. In order to use an LSTM, training requires a sequence of experiences
  instead of single experiences.
* `memory_size` corresponds to the size of the memory the agent must keep. Note
  that if this number is too small, the agent will not be able to remember a lot
  of things. If this number is too large, the neural network will take longer to
  train.
* LSTM does not work well with continuous vector action space. Please use
  discrete vector action space for better results.
* Since the memories must be sent back and forth between Python and Unity, using
  too large a `memory_size` will slow down training.
* Adding a recurrent layer increases the complexity of the neural network; it is
  recommended to decrease `num_layers` when using recurrent layers.
* It is required that `memory_size` be divisible by 4.

docs/Feature-Monitor.md (42 lines changed)


![Monitor](images/monitor.png)
The monitor allows visualizing information related to the agents or training
process within a Unity scene.
You can track many different things both related and unrelated to the agents
themselves. By default, the Monitor is only active in the *inference* phase, so
not during training. To change this behaviour, you can activate or deactivate it
by calling `SetActive(boolean)`. For example to also show the monitor during
training, you can call it in the `InitializeAcademy()` method of your `Academy`:
public class YourAcademy : Academy {
    public override void InitializeAcademy()
    {
        Monitor.SetActive(true);
    }
}

To add values to monitor, call the `Log` function anywhere in your code:
* `key` is the name of the information you want to display.
* `value` is the information you want to display. `value` can have different
  types:
  * `string` - The Monitor will display the string next to the key. It can be
    useful for displaying error messages.
  * `float` - The Monitor will display a slider. Note that the values must be
    between -1 and 1. If the value is positive, the slider will be green, if the
    value is negative, the slider will be red.
  * `float[]` - The Monitor Log call can take an additional argument called
    `displayType` that can be either `INDEPENDENT` (default) or `PROPORTION`:
    * `INDEPENDENT` is used to display multiple independent floats as a
      histogram. The histogram will be a sequence of vertical sliders.
    * `PROPORTION` is used to see the proportions between numbers. For each
      float in values, a rectangle whose width is that value divided by the sum
      of all values will be shown. It is best for visualizing values that sum
      to 1.
* `target` is the transform to which you want to attach information. If the
  transform is `null` the information will be attached to the global monitor.
  * **NB:** When adding a target transform that is not the global monitor, make
    sure you have your main camera object tagged as `MainCamera` via the
    inspector. This is needed to properly display the text onto the screen.

docs/Getting-Started-with-Balance-Ball.md (16 lines changed)


based on the rewards received when it tries different values). For example, an
element might represent a force or torque applied to a `RigidBody` in the agent.
The **Discrete** action vector space defines its actions as tables. An action
given to the agent is an array of indices into tables.
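As a purely illustrative sketch (these names are not the ML-Agents API), the two
kinds of vector actions described above might look like this:

```python
# Hypothetical examples of the two vector action spaces described above.
continuous_action = [0.12, -0.85]  # one float per element, e.g. forces/torques
discrete_action = [2, 0]           # one index per action table
```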
The 3D Balance Ball example is programmed to use both types of vector action
space. You can try training with both settings to observe whether there is a

## Training the Brain with Reinforcement Learning
Now that we have an environment, we can perform the training.
### Training with PPO

To summarize, go to your command line, enter the `ml-agents` directory and type:
```sh
mlagents-learn config/trainer_config.yaml --run-id=<run-identifier> --train
```

### Observing Training Progress
Once you start training using `mlagents-learn` in the way described in the
previous section, the `ml-agents` directory will contain a `summaries`
directory. In order to observe the training process in more detail, you can use
TensorBoard. From the command line run:
```sh
tensorboard --logdir=summaries
```

docs/Installation-Windows.md (248 lines changed)


# Installing ML-Agents Toolkit for Windows
The ML-Agents toolkit supports Windows 10. While it might be possible to run the
ML-Agents toolkit using other versions of Windows, it has not been tested on
other versions. Furthermore, the ML-Agents toolkit has not been tested on a
Windows VM such as Bootcamp or Parallels.
To use the ML-Agents toolkit, you install Python and the required Python
packages as outlined below. This guide also covers how to set up GPU-based
training (for advanced users). GPU-based training is not required for the v0.4
release of the ML-Agents toolkit. However, training on a GPU might be required
by future versions and features.
[Download](https://www.anaconda.com/download/#windows) and install Anaconda for
Windows. By using Anaconda, you can manage separate environments for different
distributions of Python. Python 3.5 or 3.6 is required as we no longer support
Python 2. In this guide, we are using Python version 3.6 and Anaconda version
5.1
([64-bit](https://repo.continuum.io/archive/Anaconda3-5.1.0-Windows-x86_64.exe)
or [32-bit](https://repo.continuum.io/archive/Anaconda3-5.1.0-Windows-x86.exe)
direct links).
<img src="images/anaconda_install.PNG"
alt="Anaconda Install"
width="500" border="10" />
<img src="images/anaconda_install.PNG"
alt="Anaconda Install"
width="500" border="10" />
We recommend the default _advanced installation options_. However, select the
options appropriate for your specific situation.
<img src="images/anaconda_default.PNG"
alt="Anaconda Install"
width="500" border="10" />
<img src="images/anaconda_default.PNG" alt="Anaconda Install" width="500" border="10" />
After installation, you must open __Anaconda Navigator__ to finish the setup.
From the Windows search bar, type _anaconda navigator_. You can close Anaconda
Navigator after it opens.
If environment variables were not created, you will see the error "conda is not
recognized as internal or external command" when you type `conda` into the
command line. To solve this you will need to set the environment variable
correctly.
Type `environment variables` in the search bar (this can be reached by hitting
the Windows key or the bottom left Windows button). You should see an option
called __Edit the system environment variables__.
<img src="images/edit_env_var.png"
alt="edit env variables"
width="250" border="10" />
<img src="images/edit_env_var.png"
alt="edit env variables"
width="250" border="10" />
From here, click the __Environment Variables__ button. Double click "Path" under
__System variables__ to edit the "Path" variable, then click __New__ to add the
following new paths.
```console
%UserProfile%\Anaconda3\Scripts
%UserProfile%\Anaconda3\Scripts\conda.exe
%UserProfile%\Anaconda3

You will create a new [Conda environment](https://conda.io/docs/) to be used
with the ML-Agents toolkit. This means that all the packages that you install
are localized to just this environment. It will not affect any other
installation of Python or other environments. Whenever you want to run
ML-Agents, you will need to activate this Conda environment.
To create a new Conda environment, open a new Anaconda Prompt (_Anaconda Prompt_
in the search bar) and type in the following command:
```sh
# create the "ml-agents" environment with Python 3.6, as described below
conda create -n ml-agents python=3.6
```
You may be asked to install new packages. Type `y` and press enter _(make sure
you are connected to the internet)_. You must install these required packages.
The new Conda environment is called ml-agents and uses Python version 3.6.
<img src="images/conda_new.PNG"
alt="Anaconda Install"
width="500" border="10" />
<img src="images/conda_new.PNG" alt="Anaconda Install" width="500" border="10" />
To use this environment, you must activate it. _(To use this environment in the
future, you can run the same command)_. In the same Anaconda Prompt, type in the
following command:
```sh
activate ml-agents
```
Next, install `tensorflow`. Install this package using `pip` - which is a
package management system used to install Python packages. The latest versions
of TensorFlow won't work, so you will need to make sure that you install version
1.7.1. In the same Anaconda Prompt, type in the following command _(make sure
you are connected to the internet)_:
```sh
pip install tensorflow==1.7.1
```
The ML-Agents toolkit depends on a number of Python packages. Use `pip` to
install these Python dependencies.
If you haven't already, clone the ML-Agents Toolkit Github repository to your
local computer. You can do this using Git
([download here](https://git-scm.com/download/win)) and running the following
commands in an Anaconda Prompt _(if you open a new prompt, be sure to activate
the ml-agents Conda environment by typing `activate ml-agents`)_:
```sh
# clone the repository linked below
git clone https://github.com/Unity-Technologies/ml-agents.git
```
If you don't want to use Git, you can always directly download all the files
[here](https://github.com/Unity-Technologies/ml-agents/archive/master.zip).
In our example, the files are located in `C:\Downloads`. After you have either
cloned or downloaded the files, from the Anaconda Prompt, change to the python
directory inside the ml-agents directory:
```console
Make sure you are connected to the internet and then type in the Anaconda
Prompt:
```sh
This will complete the installation of all the required Python packages to run
the ML-Agents toolkit.
A GPU is not required for the ML-Agents toolkit and won't speed up PPO training
by much (though future features may benefit from a GPU). This is a guide for
advanced users who want to train using GPUs. Additionally, you will need to
check if your GPU is CUDA compatible. Please check Nvidia's page
[here](https://developer.nvidia.com/cuda-gpus).
[Download](https://developer.nvidia.com/cuda-toolkit-archive) and install the
CUDA toolkit 9.0 from Nvidia's archive. The toolkit includes GPU-accelerated
libraries, debugging and optimization tools, a C/C++ (Step Visual Studio 2017)
compiler and a runtime library and is needed to run the ML-Agents toolkit. In
this guide, we are using version
[9.0.176](https://developer.nvidia.com/compute/cuda/9.0/Prod/network_installers/cuda_9.0.176_win10_network-exe).
Before installing, please make sure you __close any running instances of Unity
or Visual Studio__.
Run the installer and select the Express option. Note the directory where you
installed the CUDA toolkit. In this guide, we installed in the directory
`C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0`.
[Download](https://developer.nvidia.com/cudnn) and install the cuDNN library
from Nvidia. cuDNN is a GPU-accelerated library of primitives for deep neural
networks. Before you can download, you will need to sign up for free to the
Nvidia Developer Program.
<img src="images/cuDNN_membership_required.png"
alt="cuDNN membership required"
width="500" border="10" />
<img src="images/cuDNN_membership_required.png"
alt="cuDNN membership required"
width="500" border="10" />
Once you've signed up, go back to the cuDNN
[downloads page](https://developer.nvidia.com/cudnn). You may or may not be
asked to fill out a short survey. When you get to the list of cuDNN releases,
__make sure you are downloading the right version for the CUDA toolkit you
installed in Step 1.__ In this guide, we are using version 7.0.5 for CUDA
toolkit version 9.0
([direct link](https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.0_20171129/cudnn-9.0-windows10-x64-v7)).
After you have downloaded the cuDNN files, you will need to extract the files
into the CUDA toolkit directory. In the cuDNN zip file, there are three folders
called `bin`, `include`, and `lib`.
<img src="images/cudnn_zip_files.PNG"
alt="cuDNN zip files"
width="500" border="10" />
<img src="images/cudnn_zip_files.PNG"
alt="cuDNN zip files"
width="500" border="10" />
Copy these three folders into the CUDA toolkit directory. The CUDA toolkit
directory is located at
`C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0`
<img src="images/cuda_toolkit_directory.PNG"
alt="cuda toolkit directory"
width="500" border="10" />
<img src="images/cuda_toolkit_directory.PNG"
alt="cuda toolkit directory"
width="500" border="10" />
</p>
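
If you prefer the command line over File Explorer, the copy described above can
also be done from an administrator Command Prompt. This is only an illustration
and assumes the cuDNN zip was extracted to a folder named `cuda` in your
Downloads directory, so adjust the paths to match your system:

```sh
xcopy /E /I /Y "%USERPROFILE%\Downloads\cuda\bin" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin"
xcopy /E /I /Y "%USERPROFILE%\Downloads\cuda\include" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\include"
xcopy /E /I /Y "%USERPROFILE%\Downloads\cuda\lib" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib"
```
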
### Set Environment Variables

To set these environment variables, type `environment variables` in the search
bar (this can be reached by hitting the Windows key or the bottom left Windows
button). You should see an option called __Edit the system environment
variables__.
<img src="images/edit_env_var.png"
alt="edit env variables"
width="250" border="10" />
<img src="images/edit_env_var.png"
alt="edit env variables"
width="250" border="10" />
From here, click the __Environment Variables__ button. Click __New__ to add a
new system variable _(make sure you do this under __System variables__ and not
User variables)_.
<img src="images/new_system_variable.PNG"
alt="new system variable"
width="500" border="10" />
<img src="images/new_system_variable.PNG"
alt="new system variable"
width="500" border="10" />
For __Variable Name__, enter `CUDA_HOME`. For the variable value, put the
directory location for the CUDA toolkit. In this guide, the directory location
is `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0`. Press __OK__ once.
<img src="images/system_variable_name_value.PNG"
alt="system variable names and values"
width="500" border="10" />
<img src="images/system_variable_name_value.PNG"
alt="system variable names and values"
width="500" border="10" />
To set the two path variables, inside the same __Environment Variables__ window
and under the second box called __System Variables__, find a variable called
`Path` and click __Edit__. You will add two directories to the list. For this
guide, the two entries would look like:
```console
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\extras\CUPTI\libx64
```
Make sure to replace the relevant directory location with the one you have
installed. _Please note that case sensitivity matters_.
<img src="images/path_variables.PNG"
alt="Path variables"
<img src="images/path_variables.PNG"
alt="Path variables"
Next, install `tensorflow-gpu` using `pip`. You'll need version 1.7.1. In an
Anaconda Prompt with the Conda environment ml-agents activated, type in the
following commands to uninstall TensorFlow for CPU and install TensorFlow for
GPU _(make sure you are connected to the internet)_:

```sh
pip uninstall tensorflow
pip install tensorflow-gpu==1.7.1
```
Lastly, you should test to see if everything installed properly and that
TensorFlow can identify your GPU. In the same Anaconda Prompt, type in the
following command:
```python
import tensorflow as tf

# Creating a session with log_device_placement=True makes TensorFlow print
# the device (CPU or GPU) that each operation is assigned to.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
```
You should see console output that lists the devices TensorFlow found,
including your GPU.
We would like to thank
[Jason Weimann](https://unity3d.college/2017/10/25/machine-learning-in-unity3d-setting-up-the-environment-tensorflow-for-agentml-on-windows-10/)
and
[Nitish S. Mutha](http://blog.nitishmutha.com/tensorflow/2017/01/22/TensorFlow-with-gpu-for-windows.html)
for writing the original articles which were used to create this guide.

28
docs/Installation.md


# Installation
To install and use ML-Agents, you need to install Unity, clone this repository,
and install Python with additional dependencies. Each of the subsections below
overviews each step, in addition to a Docker set-up.
like to use our Docker set-up (introduced later), make sure to select the _Linux
Build Support_ component when installing Unity.
<img src="images/unity_linux_build_support.png"
alt="Linux Build Support"
width="500" border="10" />
<img src="images/unity_linux_build_support.png"
alt="Linux Build Support"
width="500" border="10" />
</p>
## Clone the ML-Agents Repository

```sh
git clone https://github.com/Unity-Technologies/ml-agents.git
```
The `MLAgentsSDK` directory in this repository contains the Unity Assets to add
to your projects. The `python` directory contains Python packages which provide
trainers, a Python API to interface with Unity, and a package to interface with
OpenAI Gym.
In order to use the ML-Agents toolkit, you need Python 3.6 along with the
dependencies listed in the [requirements file](../ml-agents/requirements.txt).
Some of the primary dependencies include:
- [TensorFlow](Background-TensorFlow.md)

64
docs/Learning-Environment-Best-Practices.md


# Environment Design Best Practices
## General
* It is often helpful to start with the simplest version of the problem, to
ensure the agent can learn it. From there increase complexity over time. This
can either be done manually, or via Curriculum Learning, where a set of
lessons which progressively increase in difficulty are presented to the agent
([learn more here](Training-Curriculum-Learning.md)).
* When possible, it is often helpful to ensure that you can complete the task by
using a Player Brain to control the agent.
* It is often helpful to make many copies of the agent, and attach the brain to
be trained to all of these agents. In this way the brain can get more feedback
information from all of these agents, which helps it train faster.
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful for shaping the desired behavior of
  an agent than negative rewards.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to
  provide a small penalty every step (-0.05) that the agent does not complete
  the task. In this case completion of the task should also coincide with the
  end of the episode. (These magnitudes are illustrated in the sketch after
  this list.)
* Overly-large negative rewards can cause undesirable behavior where an agent
learns to avoid any behavior which might produce the negative reward, even if
it is also behavior which can eventually lead to a positive reward.
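
To make these magnitudes concrete, below is a minimal, hypothetical sketch of
how such rewards might be assigned. The class, the `rBody` field, and the
`TaskCompleted()` helper are illustrative placeholders, not part of any example
environment.

```csharp
using UnityEngine;
using MLAgents;

// Illustrative only: reward magnitudes along the lines suggested above.
public class ExampleRewardAgent : Agent
{
    Rigidbody rBody;

    void Start()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Small positive reward for forward velocity (normalized so this
        // shaping term stays within roughly [-0.1, 0.1]).
        AddReward(0.1f * Vector3.Dot(rBody.velocity.normalized, transform.forward));

        // Small per-step penalty so the agent finishes the task quickly.
        AddReward(-0.05f);

        // Bounded success reward; end the episode when the task is complete.
        if (TaskCompleted())
        {
            AddReward(1.0f);
            Done();
        }
    }

    bool TaskCompleted()
    {
        // Placeholder: replace with your own success condition.
        return false;
    }
}
```

Note how the success reward stays at 1.0 while the shaping terms are an order
of magnitude smaller.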
* Vector Observations should include all variables relevant to allowing the
agent to take the optimally informed decision.
* In cases where Vector Observations need to be remembered or compared over
time, increase the `Stacked Vectors` value to allow the agent to keep track of
multiple observations into the past.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
  encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`).
* Besides encoding non-numeric values, all inputs should be normalized to be in
  the range 0 to +1 (or -1 to 1). For example, the `x` position information of
  an agent where the maximum possible value is `maxValue` should be recorded as
  `AddVectorObs(transform.position.x / maxValue);` rather than
  `AddVectorObs(transform.position.x);`. See the equation below for one approach
  to normalization, and the code sketch after this list.
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
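
As a rough illustration of the normalization and one-hot encoding points above
(the `maxValue` field, the `ItemType` enum, and the `target` reference are
hypothetical, not from a specific example environment):

```csharp
using UnityEngine;
using MLAgents;

public enum ItemType { Sword, Shield, Bow }

// Illustrative only: normalized, relative, and one-hot encoded observations.
public class ExampleObservationAgent : Agent
{
    public Transform target;       // some relevant GameObject
    public float maxValue = 10f;   // largest possible coordinate value
    public ItemType heldItem;

    public override void CollectObservations()
    {
        // Positions in relative coordinates, normalized to roughly [-1, 1].
        Vector3 relative = target.position - transform.position;
        AddVectorObs(relative.x / maxValue);
        AddVectorObs(relative.z / maxValue);

        // One-hot encoding of a categorical variable with three values.
        for (int i = 0; i < 3; i++)
        {
            AddVectorObs(i == (int)heldItem ? 1f : 0f);
        }
    }
}
```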
* When using continuous control, action values should be clipped to an
  appropriate range. The provided PPO model automatically clips these values
  between -1 and 1, but third-party training systems may not do so (see the
  sketch after this list).
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.
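
A minimal sketch of clipping continuous actions before applying them; the
`speed` scaling factor and the agent class itself are hypothetical:

```csharp
using UnityEngine;
using MLAgents;

// Illustrative only: defensively clipping continuous action values.
public class ExampleActionAgent : Agent
{
    public float speed = 10f;   // hypothetical force scaling factor
    Rigidbody rBody;

    void Start()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Clip each action value to [-1, 1] in case the decision source does
        // not already do so (the provided PPO model does).
        float moveX = Mathf.Clamp(vectorAction[0], -1f, 1f);
        float moveZ = Mathf.Clamp(vectorAction[1], -1f, 1f);
        rBody.AddForce(new Vector3(moveX, 0f, moveZ) * speed);
    }
}
```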

330
docs/Learning-Environment-Create-New.md


# Making a New Learning Environment
This tutorial walks through the process of creating a Unity Environment. A Unity
Environment is an application built using the Unity Engine which can be used to
train Reinforcement Learning agents.
In this example, we will train a ball to roll to a randomly placed cube. The
ball also learns to avoid falling off the platform.
Using the ML-Agents toolkit in a Unity project involves the following basic
steps:
1. Create an environment for your agents to live in. An environment can range
from a simple physical simulation containing a few objects to an entire game
or ecosystem.
2. Implement an Academy subclass and add it to a GameObject in the Unity scene
containing the environment. This GameObject will serve as the parent for any
Brain objects in the scene. Your Academy class can implement a few optional
methods to update the scene independently of any agents. For example, you can
add, move, or delete agents and other entities in the environment.
4. Implement your Agent subclasses. An Agent subclass defines the code an agent
uses to observe its environment, to carry out assigned actions, and to
calculate the rewards used for reinforcement training. You can also implement
optional methods to reset the agent when it has finished or failed its task.
5. Add your Agent subclasses to appropriate GameObjects, typically, the object
in the scene that represents the agent in the simulation. Each Agent object
must be assigned a Brain object.
6. If training, set the Brain type to External and
[run the training process](Training-ML-Agents.md).
**Note:** If you are unfamiliar with Unity, refer to
[Learning the interface](https://docs.unity3d.com/Manual/LearningtheInterface.html)
in the Unity Manual if an Editor task isn't explained sufficiently in this
tutorial.
The first task to accomplish is simply creating a new Unity project and
importing the ML-Agents assets into it:
2. In a file system window, navigate to the folder containing your cloned
ML-Agents repository.
3. Drag the `ML-Agents` folder from `MLAgentsSDK/Assets` to the Unity Editor
Project window.
## Create the Environment
Next, we will create a very simple scene to act as our ML-Agents environment.
The "physical" components of the environment include a Plane to act as the floor
for the agent to move around on, a Cube to act as the goal or target for the
agent to seek, and a Sphere to represent the agent itself.
### Create the floor plane
5. On the Plane's Mesh Renderer, expand the Materials property and change the
default-material to *floor*.
(To set a new material, click the small circle icon next to the current material
name. This opens the **Object Picker** dialog so that you can choose a
different material from the list of all materials currently in the project.)
### Add the Target Cube
5. On the Cube's Mesh Renderer, expand the Materials property and change the
default-material to *Block*.
### Add the Agent Sphere
5. On the Sphere's Mesh Renderer, expand the Materials property and change the
default-material to *checker 1*.
7. Add the Physics/Rigidbody component to the Sphere. (A Rigidbody is required
   so that the agent can later be moved by applying physics forces.)
Note that we will create an Agent subclass to add to this GameObject as a
component later in the tutorial.
### Add Empty GameObjects to Hold the Academy and Brain
1. Right-click in the Hierarchy window and select Create Empty.
2. Name the GameObject "Academy".

![The scene hierarchy](images/mlagents-NewTutHierarchy.png)
You can adjust the camera angles to give a better view of the scene at runtime.
The next steps will be to create and add the ML-Agent components.
The Academy object coordinates the ML-Agents in the scene and drives the
decision-making portion of the simulation loop. Every ML-Agent scene needs one
Academy instance. Since the base Academy class is abstract, you must make your
own subclass even if you don't need to use any of the methods for a particular
environment.
First, add a New Script component to the Academy GameObject created earlier:
1. Select the Academy GameObject to view it in the Inspector window.
2. Click **Add Component**.

Next, edit the new `RollerAcademy` script:
1. In the Unity Project window, double-click the `RollerAcademy` script to open
it in your code editor. (By default new scripts are placed directly in the
**Assets** folder.)
In such a basic scene, we don't need the Academy to initialize, reset, or
otherwise control any objects in the environment so we have the simplest
possible Academy implementation:
```csharp
using MLAgents;

public class RollerAcademy : Academy { }
```
The default settings for the Academy properties are also fine for this
environment, so we don't need to change anything for the RollerAcademy component
in the Inspector window.
The Brain object encapsulates the decision making process. An Agent sends its
observations to its Brain and expects a decision in return. The Brain Type
setting determines how the Brain makes decisions. Unlike the Academy and Agent
classes, you don't make your own Brain subclasses.
1. Select the Brain GameObject created earlier to show its properties in the
Inspector window.
We will come back to the Brain properties later, but leave the Brain Type as
**Player** for now.
![The Brain default properties](images/mlagents-NewTutBrain.png)

Then, edit the new `RollerAgent` script:
1. In the Unity Project window, double-click the `RollerAgent` script to open it
in your code editor.
3. Delete the `Update()` method, but we will use the `Start()` function, so
leave it alone for now.
So far, these are the basic steps that you would use to add ML-Agents to any
Unity project. Next, we will add the logic that will let our agent learn to roll
to the cube using reinforcement learning.
In this simple scenario, we don't use the Academy object to control the
environment. If we wanted to change the environment, for example change the size
of the floor or add or remove agents or other objects before or during the
simulation, we could implement the appropriate methods in the Academy. Instead,
we will have the Agent do all the work of resetting itself and the target when
it succeeds or falls trying.
### Initialization and Resetting the Agent
When the agent reaches its target, it marks itself done and its agent reset
function moves the target to a random location. In addition, if the agent rolls
off the platform, the reset function puts it back onto the floor.
To move the target GameObject, we need a reference to its Transform (which
stores a GameObject's position, orientation and scale in the 3D world). To get
this reference, add a public field of type `Transform` to the RollerAgent class.
Public fields of a component in Unity get displayed in the Inspector window,
allowing you to choose which GameObject to use as the target in the Unity
Editor. To reset the agent's velocity (and later to apply force to move the
agent) we need a reference to the Rigidbody component. A
[Rigidbody](https://docs.unity3d.com/ScriptReference/Rigidbody.html) is Unity's
primary element for physics simulation. (See
[Physics](https://docs.unity3d.com/Manual/PhysicsSection.html) for full
documentation of Unity physics.) Since the Rigidbody component is on the same
GameObject as our Agent script, the best way to get this reference is using
`GameObject.GetComponent<T>()`, which we can call in our script's `Start()`
method.
So far, our RollerAgent script looks like:
```csharp
using System.Collections.Generic;
using UnityEngine;
using MLAgents;

public class RollerAgent : Agent
{
    Rigidbody rBody;
    void Start () {
        rBody = GetComponent<Rigidbody>();
    }

    public Transform Target;
    public override void AgentReset()
    {
        if (this.transform.position.y < -1.0)
        {
            // The agent fell off the platform; put it back onto the floor.
            this.transform.position = Vector3.zero;
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
        }
        else
        {
            // Move the target to a new spot
            Target.position = new Vector3(Random.value * 8 - 4,
                                          0.5f,
                                          Random.value * 8 - 4);
        }
    }
}
```
Next, let's implement the Agent.CollectObservations() function.
### Observing the Environment
The Agent sends the information we collect to the Brain, which uses it to make a
decision. When you train the agent (or use a trained model), the data is fed
into a neural network as a feature vector. For an agent to successfully learn a
task, we need to provide the correct information. A good rule of thumb for
deciding what information to collect is to consider what you would need to
calculate an analytical solution to the problem.
* Position of the target. In general, it is better to use the relative position
of other objects rather than the absolute position for more generalizable
training. Note that the agent only collects the x and z coordinates since the
floor is aligned with the x-z plane and the y component of the target's
position never changes.
```csharp
// Calculate relative position
Vector3 relativePosition = Target.position - this.transform.position;

// Relative position
AddVectorObs(relativePosition.x / 5);
AddVectorObs(relativePosition.z / 5);
```
* Position of the agent itself within the confines of the floor. This data is
collected as the agent's distance from each edge of the floor.
```csharp
// Distance to edges of platform
AddVectorObs((this.transform.position.x + 5) / 5);
AddVectorObs((this.transform.position.x - 5) / 5);
AddVectorObs((this.transform.position.z + 5) / 5);
AddVectorObs((this.transform.position.z - 5) / 5);
```
* The velocity of the agent. This helps the agent learn to control its speed so
it doesn't overshoot the target and roll off the platform.
```csharp
// Agent velocity
AddVectorObs(rBody.velocity.x / 5);
AddVectorObs(rBody.velocity.z / 5);
```
All the values are divided by 5 to normalize the inputs to the neural network to
the range [-1,1]. (The number five is used because the platform is 10 units
across.)
In total, the state observation contains 8 values and we need to use the
continuous state space when we get around to setting the Brain properties:
```csharp
public override void CollectObservations()
{
    // Calculate relative position
    Vector3 relativePosition = Target.position - this.transform.position;

    // Relative position
    AddVectorObs(relativePosition.x / 5);
    AddVectorObs(relativePosition.z / 5);

    // Distance to edges of platform
    AddVectorObs((this.transform.position.x + 5) / 5);
    AddVectorObs((this.transform.position.x - 5) / 5);
    AddVectorObs((this.transform.position.z + 5) / 5);
    AddVectorObs((this.transform.position.z - 5) / 5);

    // Agent velocity
    AddVectorObs(rBody.velocity.x / 5);
    AddVectorObs(rBody.velocity.z / 5);
}
```
The final part of the Agent code is the Agent.AgentAction() function, which
receives the decision from the Brain.
### Actions
The decision of the Brain comes in the form of an action array passed to the
`AgentAction()` function. The number of elements in this array is determined by
the `Vector Action Space Type` and `Vector Action Space Size` settings of the
agent's Brain. The RollerAgent uses the continuous vector action space and needs
two continuous control signals from the brain. Thus, we will set the Brain
`Vector Action Size` to 2. The first element, `action[0]`, determines the force
applied along the x axis; `action[1]` determines the force applied along the z
axis. (If we allowed the agent to move in three dimensions, then we would need
to set `Vector Action Size` to 3.) Each of the values returned by the network is
between `-1` and `1`. Note that the Brain really has no idea what the values in
the action array mean. The training process just adjusts the action values in
response to the observation input and then sees what kind of rewards it gets as
a result.
The RollerAgent applies the values from the action[] array to its Rigidbody
component, `rBody`, using the `Rigidbody.AddForce` function:
```csharp
Vector3 controlSignal = Vector3.zero;
controlSignal.x = action[0];
controlSignal.z = action[1];
rBody.AddForce(controlSignal * speed);
```
### Rewards
Reinforcement learning requires rewards. Assign rewards in the `AgentAction()`
function. The learning algorithm uses the rewards assigned to the agent at each
step in the simulation and learning process to determine whether it is giving
the agent the optimal actions. You want to reward an agent for completing the
assigned task (reaching the Target cube, in this case) and punish the agent if
it irrevocably fails (falls off the platform). You can sometimes speed up
training with sub-rewards that encourage behavior that helps the agent complete
the task. For example, the RollerAgent reward system provides a small reward if
the agent moves closer to the target in a step and a small negative reward at
each step which encourages the agent to complete its task quickly.
The RollerAgent calculates the distance to detect when it reaches the target.
When it does, the code increments the Agent.reward variable by 1.0 and marks the
agent as finished by setting the agent to done.
```csharp
float distanceToTarget = Vector3.Distance(this.transform.position,
                                          Target.position);

// Reached target
if (distanceToTarget < 1.42f)
{
    AddReward(1.0f);
    Done();
}
```
**Note:** When you mark an agent as done, it stops its activity until it is
reset. You can have the agent reset immediately, by setting the
Agent.ResetOnDone property to true in the inspector or you can wait for the
Academy to reset the environment. This RollerBall environment relies on the
`ResetOnDone` mechanism and doesn't set a `Max Steps` limit for the Academy (so
it never resets the environment).
Assigning a small negative reward at each step can also encourage an agent to
finish its task more quickly:
```csharp
// Time penalty
AddReward(-0.05f);
```
Finally, to punish the agent for falling off the platform, assign a large
negative reward and, of course, set the agent to done so that it resets itself
in the next step:
```csharp
// Fell off platform
if (this.transform.position.y < -1.0)
{
    AddReward(-1.0f);
    Done();
}
```
### AgentAction()
With the action and reward logic outlined above, the final version of the
`AgentAction()` function looks like:
```csharp
public float speed = 10;
private float previousDistance = float.MaxValue;

public override void AgentAction(float[] action, string textAction)
{
    // Rewards
    float distanceToTarget = Vector3.Distance(this.transform.position,
                                              Target.position);

    // Reached target
    if (distanceToTarget < 1.42f)
    {
        AddReward(1.0f);
        Done();
    }

    // Getting closer
    if (distanceToTarget < previousDistance)
    {
        AddReward(0.1f);
    }

    // Time penalty
    AddReward(-0.05f);

    // Fell off platform
    if (this.transform.position.y < -1.0)
    {
        AddReward(-1.0f);
        Done();
    }

    previousDistance = distanceToTarget;

    // Actions, size = 2
    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = action[0];
    controlSignal.z = action[1];
    rBody.AddForce(controlSignal * speed);
}
```
Note the `speed` and `previousDistance` class variables defined before the
function. Since `speed` is public, you can set the value from the Inspector
window.
Now that all the GameObjects and ML-Agent components are in place, it is time
to connect everything together in the Unity Editor. This involves assigning the
Brain object to the Agent, changing some of the Agent component's properties,
and setting the Brain properties so that they are compatible with our agent
code.
1. Expand the Academy GameObject in the Hierarchy window, so that the Brain
object is visible.
2. Select the RollerAgent GameObject to show its properties in the Inspector
window.
3. Drag the Brain object from the Hierarchy window to the RollerAgent Brain
field.
Also, drag the Target GameObject from the Hierarchy window to the RollerAgent
Target field.
Finally, select the Brain GameObject so that you can see its properties in the
Inspector window. Set the following properties:
* `Vector Observation Space Type` = **Continuous**
* `Vector Observation Space Size` = 8

## Testing the Environment
It is always a good idea to test your environment manually before embarking on
an extended training run. The reason we have left the Brain set to the
**Player** type is so that we can control the agent using direct keyboard
control. But first, you need to define the keyboard to action mapping. Although
the RollerAgent only has an `Action Size` of two, we will use one key to specify
positive values and one to specify negative values for each action, for a total
of four keys.
3. Expand the **Continuous Player Actions** dictionary (only visible when using
   the **Player** brain).
4. Set **Size** to 4.
5. Set the following mappings:

| Element   | Key | Index | Value |
| :-------- | :-- | :---- | :---- |
| Element 0 | D   | 0     | 1     |
| Element 1 | A   | 0     | -1    |
| Element 2 | W   | 1     | 1     |
| Element 3 | S   | 1     | -1    |
The **Index** value corresponds to the index of the action array passed to the
`AgentAction()` function. **Value** is assigned to action[Index] when **Key** is
pressed.
Press **Play** to run the scene and use the WASD keys to move the agent around
the platform. Make sure that there are no errors displayed in the Unity editor
Console window and that the agent resets when it reaches its target or falls
from the platform. Note that for more involved debugging, the ML-Agents SDK
includes a convenient Monitor class that you can use to easily display agent
status information in the Game window.
One additional test you can perform is to first ensure that your environment
and the Python API work as expected using the `notebooks/getting-started.ipynb`
Jupyter notebook.

Now you can train the Agent. To get ready for training, you must first change
the **Brain Type** from **Player** to **External**. From there, the process is
the same as described in [Training ML-Agents](Training-ML-Agents.md).
This section briefly reviews how to organize your scene when using Agents in
your Unity environment.
There are three kinds of game objects you need to include in your scene in order
to use Unity ML-Agents:
* Academy
* Brain
* Agents
* There can only be one Academy game object in a scene.
* You can have multiple Brain game objects but they must be children of the
  Academy game object.
Here is an example of what your scene hierarchy should look like:

55
docs/Learning-Environment-Design-Academy.md


# Creating an Academy
An Academy orchestrates all the Agent and Brain objects in a Unity scene. Every
scene containing agents must contain a single Academy. To use an Academy, you
must create your own subclass. However, all the methods you can override are
optional.
Use the Academy methods to:

See [Reinforcement Learning in Unity](Learning-Environment-Design.md) for a
description of the timing of these method calls during a simulation.
Initialization is performed once in an Academy object's lifecycle. Use the
`InitializeAcademy()` method for any logic you would normally perform in the
standard Unity `Start()` or `Awake()` methods.
**Note:** Because the base Academy implements an `Awake()` function, you must
not implement your own. Because of the way the Unity MonoBehaviour class is
defined, implementing your own `Awake()` function hides the base class version
and Unity will call yours instead. Likewise, do not implement a `FixedUpdate()`
function in your Academy subclass.
Implement an `AcademyReset()` function to alter the environment at the start of
each episode. For example, you might want to reset an agent to its starting
position or move a goal to a random position. An environment resets when the
Academy `Max Steps` count is reached.
When you reset an environment, consider the factors that should change so that
training is generalizable to different conditions. For example, if you were
training a maze-solving agent, you would probably want to change the maze itself
for each training episode. Otherwise, the agent would probably only learn to
solve one particular maze, not mazes in general.
The `AcademyStep()` function is called at every step in the simulation before
any agents are updated. Use this function to update objects in the environment
at every step or during the episode between environment resets. For example, if
you want to add elements to the environment at random intervals, you can put the
logic for creating them in the `AcademyStep()` function.
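
Putting these methods together, a minimal hypothetical Academy subclass might
look like the sketch below; the `goal` field and the reset logic are
placeholders, not taken from a specific example environment.

```csharp
using UnityEngine;
using MLAgents;

public class ExampleAcademy : Academy
{
    public GameObject goal;   // hypothetical object randomized on reset

    public override void InitializeAcademy()
    {
        // One-time setup, in place of Start() or Awake().
        Debug.Log("Academy initialized");
    }

    public override void AcademyReset()
    {
        // Vary the environment at the start of each episode so that training
        // generalizes (here, move a goal object to a random position).
        goal.transform.position =
            new Vector3(Random.Range(-4f, 4f), 0.5f, Random.Range(-4f, 4f));
    }

    public override void AcademyStep()
    {
        // Per-step environment logic, run before the agents are updated.
    }
}
```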
* `Max Steps` - Total number of steps per-episode. `0` corresponds to episodes
without a maximum number of steps. Once the step counter reaches maximum, the
environment will reset.
* `Configuration` - The engine-level settings which correspond to rendering
quality and engine speed.
* `Width` - Width of the environment window in pixels.
* `Height` - Height of the environment window in pixels.
* `Quality Level` - Rendering quality of environment. (Higher is better)
* `Time Scale` - Speed at which environment is run. (Higher is faster)
* `Target Frame Rate` - FPS engine attempts to maintain.
* `Reset Parameters` - List of custom parameters that can be changed in the
environment on reset.

429
docs/Learning-Environment-Design-Agents.md


# Agents
An agent is an actor that can observe its environment and decide on the best
course of action using those observations. Create agents in Unity by extending
the Agent class. The most important aspects of creating agents that can
successfully learn are the observations the agent collects and, for
reinforcement learning, the reward you assign to estimate the value of the
agent's current state toward accomplishing its tasks.
An agent passes its observations to its brain. The brain, then, makes a decision
and passes the chosen action back to the agent. Your agent code must execute the
action, for example, move the agent in one direction or another. In order to
[train an agent using reinforcement learning](Learning-Environment-Design.md),
your agent must calculate a reward value at each action. The reward is used to
discover the optimal decision-making policy. (A reward is not used by already
trained agents or for imitation learning.)
The Brain class abstracts out the decision making logic from the agent itself so
that you can use the same brain in multiple agents. How a brain makes its
decisions depends on the type of brain it is. An **External** brain simply
passes the observations from its agents to an external process and then passes
the decisions made externally back to the agents. An **Internal** brain uses the
trained policy parameters to make decisions (and no longer adjusts the
parameters in search of a better decision). The other types of brains do not
directly involve training, but you might find them useful as part of a training
project. See [Brains](Learning-Environment-Design-Brains.md).
The observation-decision-action-reward cycle repeats after a configurable number
of simulation steps (the frequency defaults to once-per-step). You can also set
up an agent to request decisions on demand. Making decisions at regular step
intervals is generally most appropriate for physics-based simulations. Making
decisions on demand is generally appropriate for situations where agents only
respond to specific events or take actions of variable duration. For example, an
agent in a robotic simulator that must provide fine-control of joint torques
should make its decisions every step of the simulation. On the other hand, an
agent that only needs to make decisions when certain game or simulation events
occur, should use on-demand decision making.
To control the frequency of step-based decision making, set the **Decision
Frequency** value for the Agent object in the Unity Inspector window. Agents
using the same Brain instance can use a different frequency. During simulation
steps in which no decision is requested, the agent receives the same action
chosen by the previous decision.
On demand decision making allows agents to request decisions from their brains
only when needed instead of receiving decisions at a fixed frequency. This is
useful when the agents commit to an action for a variable number of steps or
when the agents cannot make decisions at the same time. This is typically the
case for turn-based games, games where agents must react to events, or games
where agents can take actions of variable duration.
When you turn on **On Demand Decisions** for an agent, your agent code must call
the `Agent.RequestDecision()` function. This function call starts one iteration
of the observation-decision-action-reward cycle. The Brain invokes the agent's
`CollectObservations()` method, makes a decision and returns it by calling the
`AgentAction()` method. The Brain waits for the agent to request the next
decision before starting another iteration.
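
As a rough sketch, an agent with **On Demand Decisions** enabled might request
a decision only when a game event occurs. The `OnTurnStarted()` hook below is a
hypothetical entry point that your own game logic would call; it is not part of
the ML-Agents API.

```csharp
using MLAgents;

// Illustrative only: requesting a decision when an event occurs.
public class TurnBasedAgent : Agent
{
    // Called by your own game logic at the start of this agent's turn.
    public void OnTurnStarted()
    {
        // Triggers one observation-decision-action-reward cycle.
        RequestDecision();
    }

    public override void CollectObservations()
    {
        // Observe whatever describes the current game state.
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Carry out the chosen action for this turn.
    }
}
```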
To make decisions, an agent must observe its environment in order to infer the
state of the world. A state observation can take the following forms:
* **Vector Observation** — a feature vector consisting of an array of floating
point numbers.
When you use vector observations for an agent, implement the
`Agent.CollectObservations()` method to create the feature vector. When you use
**Visual Observations**, you only need to identify which Unity Camera objects
will provide images and the base Agent class handles the rest. You do not need
to implement the `CollectObservations()` method when your agent uses visual
observations (unless it also uses vector observations).
For agents using a continuous state space, you create a feature vector to
represent the agent's observation at each step of the simulation. The Brain
class calls the `CollectObservations()` method of each of its agents. Your
implementation of this function must call `AddVectorObs` to add vector
observations.
The observation must include all the information an agent needs to accomplish
its task. Without sufficient and relevant information, an agent may learn poorly
or may not learn at all. A reasonable approach for determining what information
should be included is to consider what you would need to calculate an analytical
solution to the problem.
For examples of various state observation functions, you can look at the
[example environments](Learning-Environment-Examples.md) included in the
ML-Agents SDK. For instance, the 3DBall example uses the rotation of the
platform, the relative position of the ball, and the velocity of the ball as its
state observation. As an experiment, you can remove the velocity components from
the observation and retrain the 3DBall agent. While it will learn to balance the
ball reasonably well, the performance of the agent without using velocity is
noticeably worse.
```csharp
public GameObject ball;

public override void CollectObservations()
{
    AddVectorObs(gameObject.transform.rotation.z);
    AddVectorObs(gameObject.transform.rotation.x);
    AddVectorObs(ball.transform.position - gameObject.transform.position);
    AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity);
}
```
The feature vector must always contain the same number of elements and
observations must always be in the same position within the list. If the number
of observed entities in an environment can vary you can pad the feature vector
with zeros for any missing entities in a specific observation or you can limit
an agent's observations to a fixed subset. For example, instead of observing
every enemy agent in an environment, you could only observe the closest five.
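
As an illustration of the fixed-subset approach, here is a sketch that always
observes exactly five enemies; the `visibleEnemies` list (sorted by distance)
is a hypothetical field, not part of the example environments:

```csharp
// Always emit five enemy positions, padding with zeros when fewer are visible.
const int maxEnemies = 5;
for (int i = 0; i < maxEnemies; i++)
{
    if (i < visibleEnemies.Count)
    {
        AddVectorObs(visibleEnemies[i].transform.position - transform.position);
    }
    else
    {
        AddVectorObs(Vector3.zero);
    }
}
```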
When you set up an Agent's brain in the Unity Editor, set the following
properties to use a continuous vector observation:
* **Space Size** — The state size must match the length of your feature vector.
* **Brain Type** — Set to **External** during training; set to **Internal** to
use the trained model.
The observation feature vector is a list of floating point numbers, which means
you must convert any other data types to a float or a list of floats.
Integers can be added directly to the observation vector. You must explicitly
convert Boolean values to a number.
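
For instance, a minimal sketch assuming a hypothetical `hasKey` Boolean field
on the agent:

```csharp
// Hypothetical Boolean field converted to a float observation.
AddVectorObs(hasKey ? 1.0f : 0.0f);
```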
For entities like positions and rotations, you can add their components to the
feature list individually. For example:
```csharp
Vector3 speed = ball.transform.GetComponent<Rigidbody>().velocity;

// Add each component of the velocity individually.
AddVectorObs(speed.x);
AddVectorObs(speed.y);
AddVectorObs(speed.z);
```
Type enumerations should be encoded in the _one-hot_ style. That is, add an
element to the feature vector for each element of the enumeration, setting the
element representing the observed member to one and the rest to zero. For
example, if your enumeration contains \[Sword, Shield, Bow\] and the agent
observes that the current item is a Bow, you would add the elements 0, 0, 1 to
the feature vector. The following code example illustrates how to add such a
one-hot observation:
```csharp
enum CarriedItems { Sword, Shield, Bow, LastItem }

public override void CollectObservations()
{
    // One element per enum member: 1.0 for the observed item, 0.0 otherwise.
    for (int ci = 0; ci < (int)CarriedItems.LastItem; ci++)
    {
        AddVectorObs((int)currentItem == ci ? 1.0f : 0.0f);
    }
}
```

For the best results when training, you should normalize the components of your
feature vector to the range [-1, +1] or [0, 1]. When you normalize the values,
the PPO neural network can often converge to a solution faster. Note that it
isn't always necessary to normalize to these recommended ranges, but it is
considered a best practice when using neural networks. The greater the variation
in ranges between the components of your observation, the more likely that
training will be affected.
To normalize a value to [0, 1], you can use the following formula:
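
```csharp
// Standard min-max normalization. minValue and maxValue stand for the
// smallest and largest values the component can take.
float normalizedValue = (currentValue - minValue) / (maxValue - minValue);
```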

Rotations and angles should also be normalized. For angles between 0 and 360
degrees, you can use the following formulas:
```csharp
Quaternion rotation = transform.rotation;
// Map the Euler angles from [0, 360] degrees to [-1, 1].
Vector3 normalized = rotation.eulerAngles / 180.0f - Vector3.one;
```
For angles that can be outside the range [0,360], you can either reduce the
angle, or, if the number of turns is significant, increase the maximum value
used in your normalization formula.
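
For example, one option is to wrap the angle back into the [0, 360) range with
`Mathf.Repeat` before normalizing (a sketch, assuming a float `angle` in
degrees):

```csharp
// Reduce an arbitrary angle to [0, 360), then map it to [-1, 1].
float reducedAngle = Mathf.Repeat(angle, 360f);
AddVectorObs(reducedAngle / 180.0f - 1.0f);
```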
Camera observations use rendered textures from one or more cameras in a scene.
The brain vectorizes the textures into a 3D Tensor which can be fed into a
convolutional neural network (CNN). For more information on CNNs, see
[this guide](http://cs231n.github.io/convolutional-networks/). You can use
camera observations alongside vector observations.

Agents using camera images can capture state of arbitrary complexity and are
useful when the state is difficult to describe numerically. However, they are
also typically less efficient and slower to train, and sometimes don't succeed
at all.
To add a visual observation to an agent, click on the `Add Camera` button in the
Agent inspector. Then drag the camera you want to add to the `Camera` field. You
can have more than one camera attached to an agent.
![Agent Camera](images/visual-observation.png)
In addition, make sure that the Agent's Brain expects a visual observation. In
the Brain inspector, under **Brain Parameters** > **Visual Observations**,
specify the number of Cameras the agent is using for its visual observations.
For each visual observation, set the width and height of the image (in pixels)
and whether or not the observation is color or grayscale (when `Black And White`
is checked).
An action is an instruction from the brain that the agent carries out. The
action is passed to the agent as a parameter when the Academy invokes the
agent's `AgentAction()` function. When you specify that the vector action space
is **Continuous**, the action parameter passed to the agent is an array of
control signals with length equal to the `Vector Action Space Size` property.
When you specify a **Discrete** vector action space type, the action parameter
is an array of indices; each index points into a list or table of commands. The
number of indices in the array is determined by the number of branches defined
in the `Branches Size` property. Each branch corresponds to an action table;
you can specify the size of each table by modifying the `Branches` property.
Set the `Vector Action Space Size` and `Vector Action Space Type` properties on
the Brain object assigned to the agent (using the Unity Editor Inspector
window).
Neither the Brain nor the training algorithm knows anything about what the
action values themselves mean. The training algorithm simply tries different
values for the action list and observes the effect on the accumulated rewards
over time and over many training episodes. Thus, the only place actions are
defined for an agent is in the `AgentAction()` function. You simply specify the
type of vector action space, and, for the continuous vector action space, the
number of values, and then apply the received values appropriately (and
consistently) in `AgentAction()`.
For example, if you designed an agent to move in two dimensions, you could use
either continuous or discrete vector actions. In the continuous case, you would
set the vector action size to two (one for each dimension), and the agent's
brain would create an action with two floating point values. In the discrete
case, you would use one branch with a size of four (one for each direction),
and the brain would create an action array containing a single element with a
value ranging from zero to three. Alternatively, you could create two branches
of size two (one for horizontal movement and one for vertical movement), and
the brain would create an action array containing two elements with values
ranging from zero to one.
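
To make the continuous two-dimensional case above concrete, here is a minimal
sketch; the `moveSpeed` field and the use of a `Rigidbody` are illustrative
assumptions, not part of the example environments:

```csharp
public float moveSpeed = 10f;  // Hypothetical tuning parameter.

public override void AgentAction(float[] act)
{
    // One control value per dimension, clamped to the expected [-1, 1] range.
    float moveX = Mathf.Clamp(act[0], -1f, 1f);
    float moveZ = Mathf.Clamp(act[1], -1f, 1f);
    GetComponent<Rigidbody>().AddForce(new Vector3(moveX, 0f, moveZ) * moveSpeed);
}
```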
Note that when you are programming actions for an agent, it is often helpful to
test your action logic using a **Player** brain, which lets you map keyboard
commands to actions. See [Brains](Learning-Environment-Design-Brains.md).
The [3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) and
[Area](Learning-Environment-Examples.md#push-block) example environments are set
up to use either the continuous or the discrete vector action spaces.
When an agent uses a brain set to the **Continuous** vector action space, the
action parameter passed to the agent's `AgentAction()` function is an array
with length equal to the Brain object's `Vector Action Space Size` property
value. The individual values in the array have whatever meanings that you
ascribe to them. If you assign an element in the array as the speed of an
agent, for example, the training process learns to control the speed of the
agent through this parameter.
The [Reacher example](Learning-Environment-Examples.md#reacher) defines a
continuous action space with four control values.
![reacher](images/reacher.png)
These control values are applied as torques to the bodies making up the arm:
```csharp
public override void AgentAction(float[] act)
{
    // Sketch: clamp each control value and apply it as a torque to the two
    // arm bodies (rbA and rbB are Rigidbody references on the arm segments).
    rbA.AddTorque(new Vector3(Mathf.Clamp(act[0], -1f, 1f), 0f, Mathf.Clamp(act[1], -1f, 1f)) * 100f);
    rbB.AddTorque(new Vector3(Mathf.Clamp(act[2], -1f, 1f), 0f, Mathf.Clamp(act[3], -1f, 1f)) * 100f);
}
```
By default the output from our provided PPO algorithm pre-clamps the values of
`vectorAction` into the [-1, 1] range. It is a best practice to manually clip
these as well, if you plan to use a 3rd party algorithm with your environment.
As shown above, you can scale the control values as needed after clamping them.
When an agent uses a brain set to the **Discrete** vector action space, the
action parameter passed to the agent's `AgentAction()` function is an array
containing indices. With the discrete vector action space, `Branches` is an
array of integers; each value corresponds to the number of possibilities for
each branch.
For example, if we wanted an agent that can move in a plane and jump, we could
define two branches (one for motion and one for jumping) because we want our
agent to be able to move __and__ jump concurrently. We define the first branch
to have 5 possible actions (don't move, go left, go right, go backward, go
forward) and the second one to have 2 possible actions (don't jump, jump). The
`AgentAction()` method would look something like:
```csharp
int movement = Mathf.FloorToInt(act[0]);
int jump = Mathf.FloorToInt(act[1]);

// Look up the index in the movement action list:
if (movement == 1) { directionX = -1; }
if (movement == 2) { directionX = 1; }
if (movement == 3) { directionZ = -1; }
if (movement == 4) { directionZ = 1; }
// Look up the index in the jump action list:
if (jump == 1) { directionY = 1; }

// Apply the action results to move the agent.
gameObject.GetComponent<Rigidbody>().AddForce(new Vector3(
    directionX * 40f, directionY * 300f, directionZ * 40f));
```
Note that the above code example is a simplified extract from the AreaAgent
class, which provides alternate implementations for both the discrete and the
continuous action spaces.
When using Discrete Actions, it is possible to specify that some actions are
impossible for the next decision. When the agent is controlled by an External
or Internal Brain, the agent will be unable to perform the specified action.
Note that when the agent is controlled by a Player or Heuristic Brain, the
agent will still be able to decide to perform the masked action. In order to
mask an action, call the method `SetActionMask` within the
`CollectObservations()` method.
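
In its general form the call looks like the following sketch, where `branch`
and `actionIndices` stand for the values described below:

```csharp
// `branch` and `actionIndices` are placeholders explained below.
SetActionMask(branch, actionIndices);
```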
Where:
* `branch` is the index (starting at 0) of the branch on which you want to
  mask the action.
* `actionIndices` is a list of `int` or a single `int` corresponding to the
  index of the action that the agent cannot perform.
For example, if you have an agent with 2 branches and on the first branch
(branch 0) there are 4 possible actions: _"do nothing"_, _"jump"_, _"shoot"_
and _"change weapon"_. Then with the code below, the agent will either _"do
nothing"_ or _"change weapon"_ for its next decision (since action indices 1
and 2 are masked).
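
A sketch of that call, assuming the two-argument overload that takes a branch
index and a collection of action indices:

```csharp
// Mask "jump" (index 1) and "shoot" (index 2) on branch 0.
SetActionMask(0, new int[2] { 1, 2 });
```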
Notes:
* You can call `SetActionMask` multiple times if you want to put masks on
multiple branches.
* You cannot mask all the actions of a branch.
* You cannot mask actions in continuous control.
In reinforcement learning, the reward is a signal that the agent has done
something right. The PPO reinforcement learning algorithm works by optimizing
the choices an agent makes such that the agent earns the highest cumulative
reward over time. The better your reward mechanism, the better your agent will
learn.
**Note:** Rewards are not used during inference by a brain using an already
trained policy, nor are they used during imitation learning.
Perhaps the best advice is to start simple and only add complexity as needed. In
general, you should reward results rather than actions you think will lead to
the desired results. To help develop your rewards, you can use the Monitor class
to display the cumulative reward received by an agent. You can even use a Player
brain to control the agent while watching how it accumulates rewards.
Allocate rewards to an agent by calling the `AddReward()` method in the
`AgentAction()` function. The reward assigned in any step should be in the range
[-1,1]. Values outside this range can lead to unstable training. The `reward`
value is reset to zero at every step.
### Examples
You can examine the `AgentAction()` functions defined in the [example
environments](Learning-Environment-Examples.md) to see how those projects
allocate rewards.
The `GridAgent` class in the [GridWorld
example](Learning-Environment-Examples.md#gridworld) uses a very simple reward
system:
```csharp
Collider[] hitObjects = Physics.OverlapBox(trueAgent.transform.position,
                                           new Vector3(0.3f, 0.3f, 0.3f));
if (hitObjects.Where(col => col.gameObject.tag == "goal").ToArray().Length == 1)
{
    AddReward(1.0f);
    Done();
}
if (hitObjects.Where(col => col.gameObject.tag == "pit").ToArray().Length == 1)
{
    AddReward(-1.0f);
    Done();
}
```
The agent receives a positive reward when it reaches the goal and a negative
reward when it falls into the pit. Otherwise, it gets no rewards. This is an
example of a _sparse_ reward system. The agent must explore a lot to find the
infrequent reward.
In contrast, the `AreaAgent` in the [Area
example](Learning-Environment-Examples.md#push-block) gets a small negative
reward every step. In order to get the maximum reward, the agent must finish its
task of reaching the goal square as quickly as possible:
```csharp
// A larger penalty when the agent falls off or leaves the playing surface:
if (gameObject.transform.position.y < 0.0f ||
    Mathf.Abs(gameObject.transform.position.x - area.transform.position.x) > 8f ||
    Mathf.Abs(gameObject.transform.position.z + 5 - area.transform.position.z) > 8)
{
    Done();
    AddReward(-1f);
}
```

The agent also gets a larger negative penalty if it falls off the playing
surface.
The `Ball3DAgent` in the
[3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) example takes
a similar approach, but allocates a small positive reward as long as the agent
balances the ball. The agent can maximize its rewards by keeping the ball on
the platform:
```csharp
if (IsDone() == false)
{
    // A small positive reward for every step the ball stays on the platform.
    SetReward(0.1f);
}
```
The `Ball3DAgent` also assigns a negative penalty when the ball falls off the
platform.
* `Brain` - The brain to register this agent to. Can be dragged into the
inspector using the Editor.
* `Visual Observations` - A list of `Cameras` which will be used to generate
observations.
* `Max Step` - The per-agent maximum number of steps. Once this number is
reached, the agent will be reset if `Reset On Done` is checked.
* `Reset On Done` - Whether the agent's `AgentReset()` function should be called
when the agent reaches its `Max Step` count or is marked as done in code.
* `On Demand Decision` - Whether the agent requests decisions at a fixed step
interval or explicitly requests decisions by calling `RequestDecision()`.
* If not checked, the Agent will request a new decision every `Decision
Frequency` steps and perform an action every step. In the example above,
`CollectObservations()` will be called every 5 steps and `AgentAction()`
will be called at every step. This means that the Agent will reuse the
decision the Brain has given it.
* If checked, the Agent controls when to receive decisions, and take actions.
To do so, the Agent may leverage one or two methods:
* `RequestDecision()` Signals that the Agent is requesting a decision. This
causes the Agent to collect its observations and ask the Brain for a
decision at the next step of the simulation. Note that when an Agent
      requests a decision, it also requests an action. This is to ensure that
all decisions lead to an action during training.
* `RequestAction()` Signals that the Agent is requesting an action. The
action provided to the Agent in this case is the same action that was
provided the last time it requested a decision.
* `Decision Frequency` - The number of steps between decision requests. Not
  used if `On Demand Decision` is true.
We created a helpful `Monitor` class that enables visualizing variables within a
Unity environment. While this was built for monitoring an Agent's value function
throughout the training process, we imagine it can be more broadly useful. You
can learn more [here](Feature-Monitor.md).
To add an Agent to an environment at runtime, use the Unity
`GameObject.Instantiate()` function. It is typically easiest to instantiate an
agent from a [Prefab](https://docs.unity3d.com/Manual/Prefabs.html) (otherwise,
you have to instantiate every GameObject and Component that make up your agent
individually). In addition, you must assign a Brain instance to the new Agent
and initialize it by calling its `AgentReset()` method. For example, the
following function creates a new agent given a Prefab, Brain instance, location,
and orientation:
```csharp
private void CreateAgent(GameObject agentPrefab, Brain brain, Vector3 position, Quaternion orientation)
{
    // Sketch: instantiate the Prefab, assign the Brain, then reset the Agent.
    Agent agent = Instantiate(agentPrefab, position, orientation).GetComponent<Agent>();
    agent.GiveBrain(brain);
    agent.AgentReset();
}
```

## Destroying an Agent
Before destroying an Agent GameObject, you must mark it as done (and wait for
the next step in the simulation) so that the Brain knows that this agent is no
longer active. Thus, the best place to destroy an agent is in the
`Agent.AgentOnDone()` function:
```csharp
public override void AgentOnDone()
{
    Destroy(gameObject);
}
```
Note that in order for `AgentOnDone()` to be called, the agent's `ResetOnDone`
property must be false. You can set `ResetOnDone` on the agent's Inspector or in
code.

106
docs/Learning-Environment-Design-Brains.md


# Brains
The Brain encapsulates the decision making process. Brain objects must be
children of the Academy in the Unity scene hierarchy. Every Agent must be
assigned a Brain, but you can use the same Brain with more than one Agent. You
can also create several Brains and attach each of them to one or more Agents.
Use the Brain class directly, rather than a subclass. Brain behavior is
determined by the **Brain Type**. The ML-Agents toolkit defines four Brain
Types:
* [External](Learning-Environment-Design-External-Internal-Brains.md) — The
**External** and **Internal** types typically work together; set **External**
when training your agents. You can also use the **External** brain to
communicate with a Python script via the Python `UnityEnvironment` class
included in the Python portion of the ML-Agents SDK.
* [Internal](Learning-Environment-Design-External-Internal-Brains.md) – Set
**Internal** to make use of a trained model.
* [Heuristic](Learning-Environment-Design-Heuristic-Brains.md) – Set
**Heuristic** to hand-code the agent's logic by extending the Decision class.
* [Player](Learning-Environment-Design-Player-Brains.md) – Set **Player** to map
keyboard keys to agent actions, which can be useful to test your agent code.
During training, set your agent's brain type to **External**. To use the trained
model, import the model file into the Unity project and change the brain type to
**Internal**.
The Brain class has several important properties that you can set using the
Inspector window. These properties must be appropriate for the agents using the
brain. For example, the `Vector Observation Space Size` property must match the
length of the feature vector created by an agent exactly. See
[Agents](Learning-Environment-Design-Agents.md) for information about creating
agents and setting up a Brain instance correctly.
The Brain Inspector window in the Unity Editor displays the properties assigned
to a Brain component:
* `Brain Parameters` - Define vector observations, visual observation, and
vector actions for the Brain.
* `Vector Observation`
* `Space Size` - Length of vector observation for brain.
* `Stacked Vectors` - The number of previous vector observations that will
be stacked and used collectively for decision making. This results in the
effective size of the vector observation being passed to the brain being:
_Space Size_ x _Stacked Vectors_.
* `Visual Observations` - Describes height, width, and whether to grayscale
visual observations for the Brain.
* `Vector Action`
* `Space Type` - Corresponds to whether action vector contains a single
integer (Discrete) or a series of real-valued floats (Continuous).
* `Space Size` (Continuous) - Length of action vector for brain.
    * `Branches` (Discrete) - An array of integers that defines multiple
      concurrent discrete actions. The values in the `Branches` array
      correspond to the number of possible discrete values for each action
      branch.
* `Action Descriptions` - A list of strings used to name the available
actions for the Brain.
* `External` - Actions are decided by an external process, such as the PPO
training process.
* `Internal` - Actions are decided using internal TensorFlowSharp model.
* `Player` - Actions are decided using keyboard input mappings.
* `Heuristic` - Actions are decided using a custom `Decision` script, which
must be attached to the Brain game object.
The Player, Heuristic and Internal brains have been updated to support
broadcast. The broadcast feature allows you to collect data from your agents
using a Python program without controlling them.
### How to use: Unity

### How to use: Python
When you launch your Unity Environment from a Python program, you can see what
the agents connected to non-external brains are doing. When calling `step` or
`reset` on your environment, you retrieve a dictionary mapping brain names to
`BrainInfo` objects. The dictionary contains a `BrainInfo` object for each
non-external brain set to broadcast as well as for any external brains.
Just like with an external brain, the `BrainInfo` object contains the fields
for `visual_observations`, `vector_observations`, `text_observations`,
`memories`, `rewards`, `local_done`, `max_reached`, `agents` and
`previous_actions`. Note that `previous_actions` corresponds to the actions
that were taken by the agents at the previous step, not the current one.
Note that when you do a `step` on the environment, you cannot provide actions
for non-external brains. If there are no external brains in the scene, simply
call `step()` with no arguments.
You can use the broadcast feature to collect data generated by Player,
Heuristics or Internal brains game sessions. You can then use this data to train
an agent in a supervised context.

118
docs/Learning-Environment-Design-External-Internal-Brains.md


# External and Internal Brains
The **External** and **Internal** types of Brains work in different phases of
training. When training your agents, set their brain types to **External**; when
using the trained models, set their brain types to **Internal**.
When [running an ML-Agents training algorithm](Training-ML-Agents.md), at least
one Brain object in a scene must be set to **External**. This allows the
training process to collect the observations of agents using that brain and give
the agents their actions.
In addition to using an External brain for training using the ML-Agents learning
algorithms, you can use an External brain to control agents in a Unity
environment using an external Python program. See [Python API](Python-API.md)
for more information.
Unlike the other types, the External Brain has no properties to set in the Unity
Inspector window.
The Internal Brain type uses a
[TensorFlow model](https://www.tensorflow.org/get_started/get_started_for_beginners#models_and_training)
to make decisions. The Proximal Policy Optimization (PPO) and Behavioral Cloning
algorithms included with the ML-Agents SDK produce trained TensorFlow models
that you can use with the Internal Brain type.
A __model__ is a mathematical relationship mapping an agent's observations to
its actions. TensorFlow is a software library for performing numerical
computation through data flow graphs. A TensorFlow model, then, defines the
mathematical relationship between your agent's observations and its actions
using a TensorFlow data flow graph.
The training algorithms included in the ML-Agents SDK produce TensorFlow graph
models as the end result of the training process. See
[Training ML-Agents](Training-ML-Agents.md) for instructions on how to train a
model.
1. Select the Brain GameObject in the **Hierarchy** window of the Unity Editor.
(The Brain GameObject must be a child of the Academy GameObject and must have
a Brain component.)
**Note:** In order to see the **Internal** Brain Type option, you must
[enable TensorFlowSharp](Using-TensorFlow-Sharp-in-Unity.md).
3. Import the `environment_run-id.bytes` file produced by the PPO training
program. (Where `environment_run-id` is the name of the model file, which is
constructed from the name of your Unity environment executable and the run-id
value you assigned when running the training process.)
You can
[import assets into Unity](https://docs.unity3d.com/Manual/ImportingAssets.html)
in various ways. The easiest way is to simply drag the file into the
**Project** window and drop it into an appropriate folder.
4. Once the `environment.bytes` file is imported, drag it from the **Project**
window to the **Graph Model** field of the Brain component.
If you are using a model produced by the ML-Agents `mlagents-learn` command, use
the default values for the other Internal Brain parameters.
The default values of the TensorFlow graph parameters work with the model
produced by the PPO and BC training code in the ML-Agents SDK. To use a default
ML-Agents model, the only parameter that you need to set is the `Graph Model`,
which must be set to the .bytes file containing the trained model itself.
* `Graph Model` : This must be the `bytes` file corresponding to the pre-trained
TensorFlow graph. (You must first drag this file into your Resources folder
and then from the Resources folder into the inspector)
Only change the following Internal Brain properties if you have created your own
TensorFlow model and are not using an ML-Agents model:
* `Graph Scope` : If you set a scope while training your TensorFlow model, all
  your placeholder names will have a prefix. You must specify that prefix here.
  Note that if more than one Brain were set to external during training, you
  must give a `Graph Scope` to the internal Brain corresponding to the name of
  the Brain GameObject.
* `Batch Size Node Name` : If the batch size is one of the inputs of your
  graph, you must specify the name of the placeholder here. The brain will make
  the batch size equal to the number of agents connected to the brain
  automatically.
* `State Node Name` : If your graph uses the state as an input, you must specify
the name of the placeholder here.
* `Recurrent Input Node Name` : If your graph uses a recurrent input / memory
  as input and outputs new recurrent input / memory, you must specify the name
  of the input placeholder here.
* `Recurrent Output Node Name` : If your graph uses a recurrent input / memory
  as input and outputs new recurrent input / memory, you must specify the name
  of the output placeholder here.
* `Observation Placeholder Name` : If your graph uses observations as input, you
must specify it here. Note that the number of observations is equal to the
length of `Camera Resolutions` in the brain parameters.
* `Action Node Name` : Specify the name of the placeholder corresponding to
  the actions of the brain in your graph. If the action space type is
  continuous, the output must be a one-dimensional tensor of floats of length
  `Action Space Size`; if the action space type is discrete, the output must be
  a one-dimensional tensor of ints with the same length as the `Branches`
  array.
* `Graph Placeholder` : If your graph takes additional inputs that are fixed
(example: noise level) you can specify them here. Note that in your graph,
these must correspond to one dimensional tensors of int or float of size 1.
* `Name` : Corresponds to the name of the placeholder.
* `Value Type` : Either Integer or Floating Point.
* `Min Value` and `Max Value` : Specify the range of the value here. The value
will be sampled from the uniform distribution ranging from `Min Value` to
`Max Value` inclusive.

34
docs/Learning-Environment-Design-Heuristic-Brains.md


# Heuristic Brain
The **Heuristic** brain type allows you to hand code an agent's decision making
process. A Heuristic brain requires an implementation of the Decision interface
to which it delegates the decision making process.
When you set the **Brain Type** property of a Brain to **Heuristic**, you must
add a component implementing the Decision interface to the same GameObject as
the Brain.
When creating your Decision class, extend MonoBehaviour (so you can use the
class as a Unity component) and extend the Decision interface.
```csharp
public class HeuristicLogic : MonoBehaviour, Decision
```
The Decision interface defines two methods, `Decide()` and `MakeMemory()`.
The `Decide()` method receives an agent's current state, consisting of the
agent's observations, reward, memory and other aspects of the agent's state,
and must return an array containing the action that the agent should take. The
format of the returned action array depends on the **Vector Action Space
Type**. When using a **Continuous** action space, the action array is just a
float array with a length equal to the **Vector Action Space Size** setting.
When using a **Discrete** action space, the action array is an integer array
with the same size as the `Branches` array. In the discrete action space, the
values of the **Branches** array define the number of discrete values that
your `Decide()` function can return for each branch, which don't need to be
consecutive integers.
The `MakeMemory()` function allows you to pass data forward to the next
iteration of an agent's decision making process. The array you return from
`MakeMemory()` is passed to the `Decide()` function in the next iteration. You
can use the memory to allow the agent's decision process to take past actions
and observations into account when making the current decision. If your
heuristic logic does not require memory, just return an empty array.
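
For illustration, a minimal implementation might look like the sketch below. The
decision logic is invented for this example, and the exact parameter lists
should match the `Decision` interface shipped with your version of the ML-Agents
toolkit:

```csharp
using System.Collections.Generic;
using UnityEngine;
using MLAgents;

// A minimal heuristic sketch: choose a discrete action from the first
// observation and carry no memory between decisions.
public class HeuristicLogic : MonoBehaviour, Decision
{
    public float[] Decide(List<float> vectorObs, List<Texture2D> visualObs,
                          float reward, bool done, List<float> memory)
    {
        // Single-branch discrete action: return 1 if the first observation is
        // positive, otherwise 0.
        return new float[] { vectorObs[0] > 0f ? 1f : 0f };
    }

    public List<float> MakeMemory(List<float> vectorObs, List<Texture2D> visualObs,
                                  float reward, bool done, List<float> memory)
    {
        // This heuristic does not need memory, so return an empty list.
        return new List<float>();
    }
}
```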

47
docs/Learning-Environment-Design-Player-Brains.md


# Player Brain
The **Player** brain type allows you to control an agent using keyboard
commands. You can use Player brains to control a "teacher" agent that trains
other agents during [imitation learning](Training-Imitation-Learning.md). You
can also use Player brains to test your agents and environment before changing
their brain types to **External** and running the training process.
The **Player** brain properties allow you to assign one or more keyboard keys to
each action and a unique value to send when a key is pressed.
Note the differences between the discrete and continuous action spaces. When a
brain uses the discrete action space, you can send one integer value as the
action per step. In contrast, when a brain uses the continuous action space you
can send any number of floating point values (up to the **Vector Action Space
Size** setting).
|**Continuous Player Actions**|| The mapping for the continuous vector action space. Shown when the action space is **Continuous**.|
|| **Size** | The number of key commands defined. You can assign more than one command to the same action index in order to send different values for that action. (If you press both keys at the same time, deterministic results are not guaranteed.)|
|| **Index** | The element of the agent's action vector to set when this key is pressed. The index value cannot exceed the size of the Action Space (minus 1, since it is an array index).|
|| **Value** | The value to send to the agent as its action for the specified index when the mapped key is pressed. All other members of the action vector are set to 0.|
|**Discrete Player Actions**|| The mapping for the discrete vector action space. Shown when the action space is **Discrete**.|
|| **Branch Index** | The element of the agent's action vector to set when this key is pressed. The index value cannot exceed the size of the Action Space (minus 1, since it is an array index).|
|| **Value** | The value to send to the agent as its action when the mapped key is pressed. Cannot exceed the max value for the associated branch (minus 1, since it is an array index).|
For more information about the Unity input system, see
[Input](https://docs.unity3d.com/ScriptReference/Input.html).

189
docs/Learning-Environment-Design.md


# Reinforcement Learning in Unity
Reinforcement learning is an artificial intelligence technique that trains
_agents_ to perform tasks by rewarding desirable behavior. During reinforcement
learning, an agent explores its environment, observes the state of things, and,
based on those observations, takes an action. If the action leads to a better
state, the agent receives a positive reward. If it leads to a less desirable
state, then the agent receives no reward or a negative reward (punishment). As
the agent learns during training, it optimizes its decision making so that it
receives the maximum reward over time.
The ML-Agents toolkit uses a reinforcement learning technique called
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
PPO uses a neural network to approximate the ideal function that maps an agent's
observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).
**Note:** if you aren't studying machine and reinforcement learning as a subject
and just want to train agents to accomplish tasks, you can treat PPO training as
a _black box_. There are a few training-related parameters to adjust inside
Unity as well as on the Python training side, but you do not need in-depth
knowledge of the algorithm itself to successfully create and train agents.
Step-by-step procedures for running the training process are provided in the
[Training section](Training-ML-Agents.md).
Training and simulation proceed in steps orchestrated by the ML-Agents Academy
class. The Academy works with Agent and Brain objects in the scene to step
through the simulation. When either the Academy has reached its maximum number
of steps or all agents in the scene are _done_, one training episode is
finished.
During training, the external Python training process communicates with the
Academy to run a series of episodes while it collects data and optimizes its
neural network model. The type of Brain assigned to an agent determines whether
it participates in training or not. The **External** brain communicates with the
external process to train the TensorFlow model. When training is completed
successfully, you can add the trained model file to your Unity project for use
with an **Internal** brain.
The ML-Agents Academy class orchestrates the agent simulation loop as follows:

4. Uses each agent's Brain class to decide on the agent's next action.
6. Calls the `AgentAction()` function for each agent in the scene, passing in
the action chosen by the agent's brain. (This function is not called if the
agent is done.)
7. Calls the agent's `AgentOnDone()` function if the agent has reached its `Max
Step` count or has otherwise marked itself as `done`. Optionally, you can set
an agent to restart if it finishes before the end of an episode. In this
case, the Academy calls the `AgentReset()` function.
8. When the Academy reaches its own `Max Step` count, it starts the next episode
again by calling your Academy subclass's `AcademyReset()` function.
To create a training environment, extend the Academy and Agent classes to
implement the above methods. The `Agent.CollectObservations()` and
`Agent.AgentAction()` functions are required; the other methods are optional —
whether you need to implement them or not depends on your specific scenario.
**Note:** The API used by the Python PPO training process to communicate with
and control the Academy during training can be used for other purposes as well.
For example, you could use the API to use Unity as the simulation engine for
your own machine learning algorithms. See [Python API](Python-API.md) for more
information.
To train and use the ML-Agents toolkit in a Unity scene, the scene must contain
a single Academy subclass along with as many Brain objects and Agent subclasses
as you need. Any Brain instances in the scene must be attached to GameObjects
that are children of the Academy in the Unity Scene Hierarchy. Agent instances
should be attached to the GameObject representing that agent.
You must assign a brain to every agent, but you can share brains between
multiple agents. Each agent will make its own observations and act
independently, but will use the same decision-making logic and, for **Internal**
brains, the same trained TensorFlow model.
The Academy object orchestrates agents and their decision making processes. Only
place a single Academy object in a scene.
You must create a subclass of the Academy class (since the base class is
abstract). When you create your Academy subclass, you can implement the
following methods (all are optional):
* `AcademyReset()` — Prepare the environment and agents for the next training
episode. Use this function to place and initialize entities in the scene as
necessary.
* `AcademyStep()` — Prepare the environment for the next simulation step. The
base Academy class calls this function before calling any `AgentAction()`
methods for the current step. You can use this function to update other
objects in the scene before the agents take their actions. Note that the
agents have already collected their observations and chosen an action before
the Academy invokes this method.
The base Academy class also defines several important properties that you can
set in the Unity Editor Inspector. For training, the most important of these
properties is `Max Steps`, which determines how long each training episode
lasts. Once the Academy's step counter reaches this value, it calls the
`AcademyReset()` function to start the next episode.
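
As a rough sketch, an Academy subclass might look like the following (the class
name, field, and reset logic are illustrative, not part of the toolkit):

```csharp
using UnityEngine;
using MLAgents;

// Illustrative Academy subclass: repositions a target at the start of each
// episode and performs per-step scene updates before agents act.
public class ExampleAcademy : Academy
{
    public Transform target;  // assigned in the Inspector

    public override void AcademyReset()
    {
        // Prepare the scene for the next training episode.
        target.position = new Vector3(Random.Range(-4f, 4f), 0.5f,
                                      Random.Range(-4f, 4f));
    }

    public override void AcademyStep()
    {
        // Update non-agent objects here; agents have already collected their
        // observations and chosen an action for this step.
    }
}
```
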
See [Academy](Learning-Environment-Design-Academy.md) for a complete list of
the Academy properties and their uses.
The Brain encapsulates the decision making process. Brain objects must be
children of the Academy in the Unity scene hierarchy. Every Agent must be
assigned a Brain, but you can use the same Brain with more than one Agent.
Use the Brain class directly, rather than a subclass. Brain behavior is
determined by the brain type. During training, set your agent's brain type to
**External**. To use the trained model, import the model file into the Unity
project and change the brain type to **Internal**. See
[Brains](Learning-Environment-Design-Brains.md) for details on using the
different types of brains. You can extend the CoreBrain class to create
different brain types if the four built-in types don't do what you need.
The Brain class has several important properties that you can set using the
Inspector window. These properties must be appropriate for the agents using the
brain. For example, the `Vector Observation Space Size` property must match the
length of the feature vector created by an agent exactly. See
[Agents](Learning-Environment-Design-Agents.md) for information about creating
agents and setting up a Brain instance correctly.
See [Brains](Learning-Environment-Design-Brains.md) for a complete list of the
Brain properties.
The Agent class represents an actor in the scene that collects observations and
carries out actions. The Agent class is typically attached to the GameObject in
the scene that otherwise represents the actor — for example, to a player object
in a football game or a car object in a vehicle simulation. Every Agent must be
assigned a Brain.
To create an agent, extend the Agent class and implement the essential
`CollectObservations()` and `AgentAction()` methods:
* `AgentAction()` — Carries out the action chosen by the agent's brain and
assigns a reward to the current state.
Your implementations of these functions determine how the properties of the
Brain assigned to this agent must be set.
You must also determine how an Agent finishes its task or times out. You can
manually set an agent to done in your `AgentAction()` function when the agent
has finished (or irrevocably failed) its task. You can also set the agent's `Max
Steps` property to a positive value and the agent will consider itself done
after it has taken that many steps. When the Academy reaches its own `Max Steps`
count, it starts the next episode. If you set an agent's `ResetOnDone` property
to true, then the agent can attempt its task several times in one episode. (Use
the `Agent.AgentReset()` function to prepare the agent to start again.)
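
A bare-bones Agent subclass might therefore look like the sketch below (the
observation layout, reward values, and names are illustrative; the number of
floats added in `CollectObservations()` must match the Brain's `Vector
Observation Space Size`):

```csharp
using UnityEngine;
using MLAgents;

// Illustrative Agent subclass: observes its own and a target's position, moves
// on the X/Z plane, and is rewarded for reaching the target.
public class ExampleAgent : Agent
{
    public Transform target;  // assigned in the Inspector
    public float speed = 2f;

    public override void CollectObservations()
    {
        // 3 + 3 = 6 floats, so the Brain's Vector Observation Space Size is 6.
        AddVectorObs(transform.position);
        AddVectorObs(target.position);
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Continuous action space of size 2: movement along X and Z.
        Vector3 move = new Vector3(vectorAction[0], 0f, vectorAction[1]);
        transform.position += move * speed * Time.fixedDeltaTime;

        if (Vector3.Distance(transform.position, target.position) < 1f)
        {
            SetReward(1f);   // reached the target
            Done();          // end this agent's episode
        }
        else
        {
            AddReward(-0.001f);  // small per-step penalty to encourage speed
        }
    }

    public override void AgentReset()
    {
        // Return the agent to a valid starting state for its next attempt.
        transform.position = Vector3.zero;
    }
}
```
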
See [Agents](Learning-Environment-Design-Agents.md) for detailed information
about programming your own agents.
An _environment_ in the ML-Agents toolkit can be any scene built in Unity. The
Unity scene provides the environment in which agents observe, act, and learn.
How you set up the Unity scene to serve as a learning environment really depends
on your goal. You may be trying to solve a specific reinforcement learning
problem of limited scope, in which case you can use the same scene for both
training and for testing trained agents. Or, you may be training agents to
operate in a complex game or simulation. In this case, it might be more
efficient and practical to create a purpose-built training scene.
Both training and testing (or normal game) scenes must contain an Academy object
to control the agent decision making process. The Academy defines several
properties that can be set differently for a training scene versus a regular
scene. The Academy's **Configuration** properties control rendering and time
scale. You can set the **Training Configuration** to minimize the time Unity
spends rendering graphics in order to speed up training. You may need to adjust
the other functional Academy settings as well. For example, `Max Steps` should
be as short as possible for training — just long enough for the agent to
accomplish its task, with some extra time for "wandering" while it learns. In
regular scenes, you often do not want the Academy to reset the scene at all; if
so, `Max Steps` should be set to zero.
When you create a training environment in Unity, you must set up the scene so
that it can be controlled by the external training process. Considerations
include:
* The training scene must start automatically when your Unity application is
launched by the training process.
* The Academy must reset the scene to a valid starting point for each episode of
training.
* A training episode must have a definite end — either using `Max Steps` or by
each agent setting itself to `done`.

343
docs/Learning-Environment-Examples.md


# Example Learning Environments
The Unity ML-Agents toolkit contains an expanding set of example environments
which demonstrate various features of the platform. Environments are located in
`MLAgentsSDK/Assets/ML-Agents/Examples` and summarized below. Additionally, our
[first ML Challenge](https://connect.unity.com/challenges/ml-agents-1) contains
environments created by the community.
This page only overviews the example environments we provide. To learn more on
how to design and build your own environments see our [Making a New Learning
Environment](Learning-Environment-Create-New.md) page.
Note: Environment scenes marked as _optional_ do not have accompanying
pre-trained model files, and are designed to serve as challenges for
researchers.
If you would like to contribute environments, please see our
[contribution guidelines](../CONTRIBUTING.md) page.
* Set-up: A linear movement task where the agent must move left or right to
rewarding states.
* Agent Reward Function:
* +0.1 for arriving at suboptimal state.
* +1.0 for arriving at optimal state.
* Vector Observation space: One variable corresponding to current state.
* Vector Action space: (Discrete) Two possible actions (Move left, move
right).
* Visual Observations: None.
* Reset Parameters: None
* Benchmark Mean Reward: 0.94

* Set-up: A balance-ball task, where the agent controls the platform.
* Goal: The agent must balance the platform in order to keep the ball on it for
as long as possible.
* Agents: The environment contains 12 agents of the same kind, all linked to a
single brain.
* Agent Reward Function:
* +0.1 for every step the ball remains on the platform.
* -1.0 if the ball falls from the platform.
* Vector Observation space: 8 variables corresponding to rotation of platform,
and position, rotation, and velocity of ball.
* Vector Observation space (Hard Version): 5 variables corresponding to
rotation of platform and position and rotation of ball.
* Vector Action space: (Continuous) Size of 2, with one value corresponding to
X-rotation, and the other to Z-rotation.
* Visual Observations: None.
* Reset Parameters: None
* Benchmark Mean Reward: 100

* Set-up: A version of the classic grid-world task. Scene contains agent, goal,
and obstacles.
* Goal: The agent must navigate the grid to the goal while avoiding the
obstacles.
* Agent Reward Function:
* -0.01 for every step.
* +1.0 if the agent navigates to the goal position of the grid (episode ends).
* -1.0 if the agent navigates to an obstacle (episode ends).
* Vector Observation space: None
* Vector Action space: (Discrete) Size of 4, corresponding to movement in
cardinal directions.
* Visual Observations: One corresponding to top-down view of GridWorld.
* Reset Parameters: Three, corresponding to grid size, number of obstacles, and
number of goals.
* Benchmark Mean Reward: 0.8
## [Tennis](https://youtu.be/RDaIh7JX6RI)

* Set-up: Two-player game where agents control rackets to bounce ball over a
net.
* Goal: The agents must bounce ball between one another while not dropping or
sending ball out of bounds.
* Agents: The environment contains two agents linked to a single brain named
  TennisBrain. After training you can attach another brain named MyBrain to one
  of the agents to play against your trained model.
* Agent Reward Function (independent):
* +0.1 To agent when hitting ball over net.
* -0.1 To agent who let ball hit their ground, or hit ball out of bounds.
* Vector Observation space: 8 variables corresponding to position and velocity
of ball and racket.
* Vector Action space: (Continuous) Size of 2, corresponding to movement
toward net or away from net, and jumping.
* Visual Observations: None.
* Reset Parameters: One, corresponding to size of ball.
* Benchmark Mean Reward: 2.5
* Optional Imitation Learning scene: `TennisIL`.

* Set-up: A platforming environment where the agent can push a block around.
* Goal: The agent must push the block to the goal.
* Agents: The environment contains one agent linked to a single brain.
* Agent Reward Function:
* -0.0025 for every step.
* +1.0 if the block touches the goal.
* Vector Observation space: (Continuous) 70 variables corresponding to 14
ray-casts each detecting one of three possible objects (wall, goal, or
block).
* Vector Action space: (Continuous) Size of 2, corresponding to movement in X
and Z directions.
* Visual Observations (Optional): One first-person camera. Use
`VisualPushBlock` scene.
* Reset Parameters: None.
* Benchmark Mean Reward: 4.5
* Optional Imitation Learning scene: `PushBlockIL`.

* Set-up: A platforming environment where the agent can jump over a wall.
* Goal: The agent must use the block to scale the wall and reach the goal.
* Agents: The environment contains one agent linked to two different brains. The
brain the agent is linked to changes depending on the height of the wall.
* Agent Reward Function:
* -0.0005 for every step.
* +1.0 if the agent touches the goal.
* -1.0 if the agent falls off the platform.
* Vector Observation space: Size of 74, corresponding to 14 raycasts each
  detecting 4 possible objects, plus the global position of the agent and
  whether or not the agent is grounded.
* Vector Action space: (Discrete) 4 Branches:
  * Forward Motion (3 possible actions: Forward, Backwards, No Action)
  * Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
  * Side Motion (3 possible actions: Left, Right, No Action)
  * Jump (2 possible actions: Jump, No Action)
* Visual Observations: None.
* Reset Parameters: 4, corresponding to the height of the possible walls.
* Benchmark Mean Reward (Big & Small Wall Brain): 0.8

* Set-up: Double-jointed arm which can move to target locations.
* Goal: The agents must move their hands to the goal location, and keep them
  there.
* Agents: The environment contains 10 agents linked to a single brain.
* Agent Reward Function (independent):
* +0.1 Each step agent's hand is in goal location.
* Vector Observation space: 26 variables corresponding to position, rotation,
velocity, and angular velocities of the two arm Rigidbodies.
* Vector Action space: (Continuous) Size of 4, corresponding to torque
applicable to two joints.
* Visual Observations: None.
* Reset Parameters: Two, corresponding to goal size, and goal movement speed.
* Benchmark Mean Reward: 30

* Set-up: A creature with 4 arms and 4 forearms.
* Goal: The agents must move their bodies toward the goal direction without
  falling.
* `CrawlerStaticTarget` - Goal direction is always forward.
* `CrawlerDynamicTarget` - Goal direction is randomized.
* Agent Reward Function (independent):
* +0.03 times body velocity in the goal direction.
* +0.01 times body direction alignment with goal direction.
* Vector Observation space: 117 variables corresponding to position, rotation,
velocity, and angular velocities of each limb plus the acceleration and
angular acceleration of the body.
* Vector Action space: (Continuous) Size of 20, corresponding to target
rotations for joints.
* Visual Observations: None.
* Reset Parameters: None
* Benchmark Mean Reward: 2000

* Set-up: A multi-agent environment where agents compete to collect bananas.
* Goal: The agents must learn to move to as many yellow bananas as possible
while avoiding blue bananas.
* Agent Reward Function (independent):
* +1 for interaction with yellow banana
* -1 for interaction with blue banana.
* Vector Observation space: 53 corresponding to velocity of agent (2), whether
agent is frozen and/or shot its laser (2), plus ray-based perception of
objects around agent's forward direction (49; 7 raycast angles with 7
measurements for each).
* Vector Action space: (Discrete) 4 Branches:
  * Forward Motion (3 possible actions: Forward, Backwards, No Action)
  * Side Motion (3 possible actions: Left, Right, No Action)
  * Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
  * Laser (2 possible actions: Laser, No Action)
* Visual Observations (Optional): First-person camera per-agent. Use
`VisualBanana` scene.
* Optional Imitation Learning scene: `BananaIL`.
* Set-up: Environment where the agent needs to find information in a room,
remember it, and use it to move to the correct goal.
* Goal: Move to the goal which corresponds to the color of the block in the
room.
* +1 For moving to correct goal.
* -0.1 For moving to incorrect goal.
* -0.0003 Existential penalty.
* Vector Observation space: 30 corresponding to local ray-casts detecting
objects, goals, and walls.
* Vector Action space: (Discrete) 1 Branch, 4 actions corresponding to agent
rotation and forward/backward movement.
* Visual Observations (Optional): First-person view for the agent. Use
`VisualHallway` scene.
* Reset Parameters: None.
* Benchmark Mean Reward: 0.7
* Optional Imitation Learning scene: `HallwayIL`.

![Bouncer](images/bouncer.png)
* Set-up: Environment where the agent needs on-demand decision making. The agent
  must decide how to perform its next bounce only when it touches the ground.
* +1 For catching the banana.
* -1 For bouncing out of bounds.
* -0.05 Times the action squared. Energy expenditure penalty.
* Vector Observation space: 6 corresponding to local position of agent and
banana.
* Vector Action space: (Continuous) 3 corresponding to agent force applied for
the jump.
* Visual Observations: None.
* Reset Parameters: None.
* Benchmark Mean Reward: 2.5

* Set-up: Environment where four agents compete in a 2 vs 2 toy soccer game.
* Striker: Get the ball into the opponent's goal.
* Goalie: Prevent the ball from entering its own goal.
* Agents: The environment contains four agents, with two linked to one brain
(strikers) and two linked to another (goalies).
  * Striker:
    * +1 When ball enters opponent's goal.
    * -0.1 When ball enters own team's goal.
    * -0.001 Existential penalty.
  * Goalie:
    * -1 When ball enters team's goal.
    * +0.1 When ball enters opponent's goal.
    * +0.001 Existential bonus.
* Vector Observation space: 112 corresponding to local 14 ray casts, each
detecting 7 possible object types, along with the object's distance.
Perception is in 180 degree view from front of agent.
* Vector Action space: (Discrete) One Branch
* Striker: 6 actions corresponding to forward, backward, sideways movement,
as well as rotation.
* Goalie: 4 actions corresponding to forward, backward, sideways movement.
* Visual Observations: None.
* Benchmark Mean Reward (Striker & Goalie Brain): 0 (the means will be inverses
  of each other and criss-cross during training)
* Set-up: Physics-based humanoid agents with 26 degrees of freedom. These DOFs
  correspond to articulation of the following body parts: hips, chest, spine,
  head, thighs, shins, feet, arms, forearms and hands.
* Goal: The agents must move their bodies toward the goal direction as quickly
  as possible without falling.
* Agents: The environment contains 11 independent agents linked to a single
  brain.
* Agent Reward Function (independent):
* +0.03 times body velocity in the goal direction.
* +0.01 times head y position.
* +0.01 times body direction alignment with goal direction.
* -0.01 times head velocity difference from body velocity.
* Vector Observation space: 215 variables corresponding to position, rotation,
velocity, and angular velocities of each limb, along with goal direction.
* Vector Action space: (Continuous) Size of 39, corresponding to target
rotations applicable to the joints.
* Visual Observations: None.
* Reset Parameters: None.
* Benchmark Mean Reward: 1000

* Set-up: Environment where the agent needs to press a button to spawn a
pyramid, then navigate to the pyramid, knock it over, and move to the gold
brick at the top.
* +2 For moving to golden brick (minus 0.001 per step).
* Vector Observation space: 148 corresponding to local ray-casts detecting
switch, bricks, golden brick, and walls, plus variable indicating switch
state.
* Vector Action space: (Discrete) 4 corresponding to agent rotation and
forward/backward movement.
* Visual Observations (Optional): First-person camera per-agent. Use
`VisualPyramids` scene.
* Reset Parameters: None.
* Optional Imitation Learning scene: `PyramidsIL`.
* Benchmark Mean Reward: 1.75

123
docs/Learning-Environment-Executable.md


# Using an Environment Executable
This section will help you create and use built environments rather than the
Editor to interact with an environment. Using an executable has some advantages
over using the Editor:
* You can exchange the executable with other people without having to share
  your entire repository.
* You can put your executable on a remote machine for faster training.
* You can use `Headless` mode for faster training.
* You can keep using the Unity Editor for other tasks while the agents are
training.
## Building the 3DBall environment

1. Launch Unity.
2. On the Projects dialog, choose the **Open** option at the top of the window.
3. Using the file dialog that opens, locate the `MLAgentsSDK` folder within the
ML-Agents project and click **Open**.
4. In the **Project** window, navigate to the folder
`Assets/ML-Agents/Examples/3DBall/`.
5. Double-click the `3DBall` file to load the scene containing the Balance Ball
environment.
Make sure the Brains in the scene have the right type. For example, if you want
to be able to control your agents from Python, you will need to set the
corresponding brain to **External**.
1. In the **Scene** window, click the triangle icon next to the Ball3DAcademy
object.
Next, we want the set-up scene to play correctly when the training process
* The environment application runs in the background.
* No dialogs require interaction.
* The correct scene loads automatically.
* Ensure that **Run in Background** is Checked.
* Ensure that **Display Resolution Dialog** is set to Disabled.
* (optional) Select “Development Build” to [log debug
messages](https://docs.unity3d.com/Manual/LogFiles.html).
5. If any scenes are shown in the **Scenes in Build** list, make sure that the
3DBall Scene is the only one checked. (If the list is empty, then only the
current scene is included in the build).
- In the File dialog, navigate to your ML-Agents directory.
- Assign a file name and click **Save**.
- (For Windows)With Unity 2018.1, it will ask you to select a folder instead of a file name. Create a subfolder within the ML-Agents folder and select that folder to build. In the following steps you will refer to this subfolder's name as `env_name`.
* In the File dialog, navigate to your ML-Agents directory.
* Assign a file name and click **Save**.
* (For Windows) With Unity 2018.1, it will ask you to select a folder instead
of a file name. Create a subfolder within the ML-Agents folder and select
that folder to build. In the following steps you will refer to this
subfolder's name as `env_name`.
Now that we have a Unity executable containing the simulation environment, we
If you want to use the [Python API](Python-API.md) to interact with your
executable, you can pass the name of the executable with the argument
'file_name' of the `UnityEnvironment`. For instance:
```python
from mlagents.envs import UnityEnvironment
env = UnityEnvironment(file_name="env_name")  # your executable, without the extension
```

## Training the Environment
1. Open a command or terminal window.
2. Navigate to the folder where you installed ML-Agents.
3. Change to the python directory.
4. Run
`mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier> --train`
Where:
* `<trainer-config-file>` is the filepath of the trainer configuration yaml.
* `<env_name>` is the name and path to the executable you exported from Unity
(without extension)
* `<run-identifier>` is a string used to separate the results of different
training runs
* And the `--train` tells `mlagents-learn` to run a training session (rather
than inference)
For example, if you are training with a 3DBall executable you exported to the
ml-agents/python directory, run:
```sh
**Note**: If you're using Anaconda, don't forget to activate the ml-agents
environment first.
If `mlagents-learn` runs correctly and starts training, you should see something
like this:
You can press Ctrl+C to stop the training, and your trained model will be at
`models/<run-identifier>/<env_name>_<run-identifier>.bytes`, which corresponds
to your model's latest checkpoint. You can now embed this trained model into
your internal brain by following the steps below:
1. Move your model file into
`MLAgentsSDK/Assets/ML-Agents/Examples/3DBall/TFModels/`.
5. Drag the `<env_name>_<run-identifier>.bytes` file from the Project window of
the Editor to the **Graph Model** placeholder in the **Ball3DBrain**
inspector window.
6. Press the Play button at the top of the editor.

25
docs/Limitations.md


# Limitations
# Limitations
If you enable Headless mode, you will not be able to collect visual
observations from your agents.
If you enable Headless mode, you will not be able to collect visual observations
from your agents.
Currently the speed of the game physics can only be increased to 100x
real-time. The Academy also moves in time with FixedUpdate() rather than
Update(), so game behavior implemented in Update() may be out of sync with the Agent decision making. See [Execution Order of Event Functions](https://docs.unity3d.com/Manual/ExecutionOrder.html) for more information.
Currently the speed of the game physics can only be increased to 100x real-time.
The Academy also moves in time with FixedUpdate() rather than Update(), so game
behavior implemented in Update() may be out of sync with the Agent decision
making. See
[Execution Order of Event Functions](https://docs.unity3d.com/Manual/ExecutionOrder.html)
for more information.
As of version 0.3, we no longer support Python 2.
As of version 0.3, we no longer support Python 2.
Currently the ML-Agents toolkit uses TensorFlow 1.7.1 due to the version of the TensorFlowSharp plugin we are using.
Currently the ML-Agents toolkit uses TensorFlow 1.7.1 due to the version of the
TensorFlowSharp plugin we are using.

26
docs/ML-Agents-Overview.md


Python API. It lives within the Learning Environment.
<p align="center">
<img src="images/learning_environment_basic.png"
alt="Simplified ML-Agents Scene Block Diagram"
width="700" border="10" />
<img src="images/learning_environment_basic.png"
alt="Simplified ML-Agents Scene Block Diagram"
width="700" border="10" />
</p>
_Simplified block diagram of ML-Agents._

medics (medics and drivers have different actions).
<p align="center">
<img src="images/learning_environment_example.png"
alt="Example ML-Agents Scene Block Diagram"
border="10" />
<img src="images/learning_environment_example.png"
alt="Example ML-Agents Scene Block Diagram"
border="10" />
</p>
_Example block diagram of ML-Agents toolkit for our sample game._

enables additional training modes.
<p align="center">
<img src="images/learning_environment.png"
alt="ML-Agents Scene Block Diagram"
border="10" />
<img src="images/learning_environment.png"
alt="ML-Agents Scene Block Diagram"
border="10" />
</p>
_An example of how a scene containing multiple Agents and Brains might be

future.
<p align="center">
<img src="images/math.png"
alt="Example Math Curriculum"
width="700"
border="10" />
<img src="images/math.png"
alt="Example Math Curriculum"
width="700"
border="10" />
</p>
_Example of a mathematics curriculum. Lessons progress from simpler topics to

75
docs/Migrating.md


# Migrating from ML-Agents toolkit v0.3 to v0.4
# Migrating
## Migrating from ML-Agents toolkit v0.3 to v0.4
## Unity API
* `using MLAgents;` needs to be added in all of the C# scripts that use ML-Agents.
### Unity API
* `using MLAgents;` needs to be added in all of the C# scripts that use
ML-Agents.
### Python API
## Python API
* We've changed some of the python packages dependencies in requirement.txt file. Make sure to run `pip install .` within your `ml-agents/python` folder to update your python packages.
* We've changed some of the python packages dependencies in requirement.txt
file. Make sure to run `pip install .` within your `ml-agents/python` folder
to update your python packages.
# Migrating from ML-Agents toolkit v0.2 to v0.3
## Migrating from ML-Agents toolkit v0.2 to v0.3
There are a large number of new features and improvements in the ML-Agents toolkit v0.3 which change both the training process and Unity API in ways which will cause incompatibilities with environments made using older versions. This page is designed to highlight those changes for users familiar with v0.1 or v0.2 in order to ensure a smooth transition.
There are a large number of new features and improvements in the ML-Agents
toolkit v0.3 which change both the training process and Unity API in ways which
will cause incompatibilities with environments made using older versions. This
page is designed to highlight those changes for users familiar with v0.1 or v0.2
in order to ensure a smooth transition.
### Important
* The ML-Agents toolkit is no longer compatible with Python 2.
### Python Training
* The training script `ppo.py` and `PPO.ipynb` Python notebook have been
replaced with a single `learn.py` script as the launching point for training
with ML-Agents. For more information on using `learn.py`, see
[here](Training-ML-Agents.md).
* Hyperparameters for training brains are now stored in the
`trainer_config.yaml` file. For more information on using this file, see
[here](Training-ML-Agents.md).
## Important
* The ML-Agents toolkit is no longer compatible with Python 2.
### Unity API
## Python Training
* The training script `ppo.py` and `PPO.ipynb` Python notebook have been replaced with a single `learn.py` script as the launching point for training with ML-Agents. For more information on using `learn.py`, see [here]().
* Hyperparameters for training brains are now stored in the `trainer_config.yaml` file. For more information on using this file, see [here]().
* Modifications to an Agent's rewards must now be done using either
`AddReward()` or `SetReward()`.
* Setting an Agent to done now requires the use of the `Done()` method.
* `CollectStates()` has been replaced by `CollectObservations()`, which now no
longer returns a list of floats.
* To collect observations, call `AddVectorObs()` within `CollectObservations()`.
Note that you can call `AddVectorObs()` with floats, integers, lists and
arrays of floats, Vector3 and Quaternions.
* `AgentStep()` has been replaced by `AgentAction()`.
* `WaitTime()` has been removed.
* The `Frame Skip` field of the Academy is replaced by the Agent's `Decision
Frequency` field, enabling agents to make decisions at different frequencies.
* The names of the inputs in the Internal Brain have been changed. You must
replace `state` with `vector_observation` and `observation` with
`visual_observation`. In addition, you must remove the `epsilon` placeholder.
## Unity API
* Modifications to an Agent's rewards must now be done using either `AddReward()` or `SetReward()`.
* Setting an Agent to done now requires the use of the `Done()` method.
* `CollectStates()` has been replaced by `CollectObservations()`, which now no longer returns a list of floats.
* To collect observations, call `AddVectorObs()` within `CollectObservations()`. Note that you can call `AddVectorObs()` with floats, integers, lists and arrays of floats, Vector3 and Quaternions.
* `AgentStep()` has been replaced by `AgentAction()`.
* `WaitTime()` has been removed.
* The `Frame Skip` field of the Academy is replaced by the Agent's `Decision Frequency` field, enabling agent to make decisions at different frequencies.
* The names of the inputs in the Internal Brain have been changed. You must replace `state` with `vector_observation` and `observation` with `visual_observation`. In addition, you must remove the `epsilon` placeholder.
### Semantics
## Semantics
In order to more closely align with the terminology used in the Reinforcement Learning field, and to be more descriptive, we have changed the names of some of the concepts used in ML-Agents. The changes are highlighted in the table below.
In order to more closely align with the terminology used in the Reinforcement
Learning field, and to be more descriptive, we have changed the names of some of
the concepts used in ML-Agents. The changes are highlighted in the table below.
| Old - v0.2 and earlier | New - v0.3 and later |
| --- | --- |

10
docs/Python-API.md


The ML-Agents toolkit provides a Python API for controlling the agent simulation
loop of an environment or game built with Unity. This API is used by the ML-Agents
training algorithms (run with `mlagents-learn`), but you can also write your Python
programs using this API.
training algorithms (run with `mlagents-learn`), but you can also write your
Python programs using this API.
The key objects in the Python API include:

A BrainInfo object contains the following fields:
- **`visual_observations`** : A list of 4 dimensional numpy arrays. Matrix n of
the list corresponds to the n<sup>th</sup> observation of the brain.
the list corresponds to the n<sup>th</sup> observation of the brain.
- **`vector_observations`** : A two dimensional numpy array of dimension `(batch
size, vector observation size)`.
- **`text_observations`** : A list of strings corresponding to the agents' text

- **`rewards`** : A list as long as the number of agents using the brain
containing the rewards they each obtained at the previous step.
containing the rewards they each obtained at the previous step.
containing `done` flags (whether or not the agent is done).
containing `done` flags (whether or not the agent is done).
- **`max_reached`** : A list as long as the number of agents using the brain
containing true if the agents reached their max steps.
- **`agents`** : A list of the unique ids of the agents using the brain.
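A rough illustration of how these fields can be read from Python is shown below.
The executable name is a placeholder, and the printed shapes depend on your brain
configuration; treat this as a sketch rather than a complete program.

```python
from mlagents.envs import UnityEnvironment

env = UnityEnvironment(file_name="3DBall")      # placeholder executable name
brain_name = env.brain_names[0]

info = env.reset(train_mode=True)[brain_name]   # BrainInfo for this brain
print(info.vector_observations.shape)           # (number of agents, obs size)
print(info.rewards)                             # one reward per agent
print(info.agents)                              # unique agent ids
env.close()
```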

9
docs/Training-Curriculum-Learning.md


```
* `measure` - What to measure learning progress, and advancement in lessons by.
* `reward` - Uses a measure received reward.
* `progress` - Uses ratio of steps/max_steps.
* `reward` - Uses a measure of received reward.
* `progress` - Uses ratio of steps/max_steps.
* `thresholds` (float array) - Points in value of `measure` where lesson should
be increased.
* `min_lesson_length` (int) - How many times the progress measure should be

* If `true`, weighting will be 0.75 (new) 0.25 (old).
* If `true`, weighting will be 0.75 (new) 0.25 (old).
Once our curriculum is defined, we have to use the reset parameters we defined
and modify the environment from the agent's `AgentReset()` function. See

folder and PPO will train using Curriculum Learning. For example, to train
agents in the Wall Jump environment with curriculum learning, we can run
```shell
```sh
mlagents-learn config/trainer_config.yaml --curriculum=curricula/wall-jump/ --run-id=wall-jump-curriculum --train
```
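
For reference, a curriculum file built from the parameters described above might
look roughly like the sketch below (written as Python that emits JSON). The key
names, values and output path are illustrative; compare against the curricula
shipped with the toolkit for the exact schema used by your version.

```python
import json

curriculum = {
    "measure": "progress",               # or "reward"
    "thresholds": [0.1, 0.3, 0.5],       # advance a lesson when measure passes these
    "min_lesson_length": 2,              # measure reports required before advancing
    "signal_smoothing": True,            # weight 0.75 (new) / 0.25 (old)
    "parameters": {
        "wall_height": [1.5, 2.0, 2.5, 4.0]  # one value per lesson
    },
}

with open("curricula/wall-jump/WallJumpBrain.json", "w") as f:
    json.dump(curriculum, f, indent=2)
```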

76
docs/Training-Imitation-Learning.md


# Imitation Learning
It is often more intuitive to simply demonstrate the behavior we want an agent to perform, rather than attempting to have it learn via trial-and-error methods. Consider our [running example](ML-Agents-Overview.md#running-example-training-npc-behaviors) of training a medic NPC : instead of indirectly training a medic with the help of a reward function, we can give the medic real world examples of observations from the game and actions from a game controller to guide the medic's behavior. More specifically, in this mode, the Brain type during training is set to Player and all the actions performed with the controller (in addition to the agent observations) will be recorded and sent to the Python API. The imitation learning algorithm will then use these pairs of observations and actions from the human player to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs).
It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.
Consider our
[running example](ML-Agents-Overview.md#running-example-training-npc-behaviors)
of training a medic NPC: instead of indirectly training a medic with the help
of a reward function, we can give the medic real world examples of observations
from the game and actions from a game controller to guide the medic's behavior.
More specifically, in this mode, the Brain type during training is set to Player
and all the actions performed with the controller (in addition to the agent
observations) will be recorded and sent to the Python API. The imitation
learning algorithm will then use these pairs of observations and actions from
the human player to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs).
There are a variety of possible imitation learning algorithms which can be used, the simplest one of them is Behavioral Cloning. It works by collecting training data from a teacher, and then simply uses it to directly learn a policy, in the same way the supervised learning for image classification or other traditional Machine Learning tasks work.
There are a variety of possible imitation learning algorithms which can be used,
the simplest one of them is Behavioral Cloning. It works by collecting training
data from a teacher, and then simply uses it to directly learn a policy, in the
same way the supervised learning for image classification or other traditional
Machine Learning tasks work.
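
As a toy illustration of the idea (this is not the ML-Agents trainer), behavioral
cloning boils down to fitting a function from recorded observations to recorded
actions. The data and the linear "policy" below are entirely made up:

```python
import numpy as np

# Fit a policy to (observation, action) pairs recorded from a teacher.
rng = np.random.default_rng(0)
teacher_obs = rng.normal(size=(1000, 8))   # recorded observations
true_map = rng.normal(size=(8, 2))
teacher_act = teacher_obs @ true_map       # recorded teacher actions

# "Training" here is ordinary least squares; a real setup would fit a network.
weights, *_ = np.linalg.lstsq(teacher_obs, teacher_act, rcond=None)

def student_policy(obs):
    """Imitate the teacher by applying the fitted weights."""
    return obs @ weights

print(np.allclose(student_policy(teacher_obs), teacher_act, atol=1e-6))
```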
1. In order to use imitation learning in a scene, the first thing you will need is to create two Brains, one which will be the "Teacher," and the other which will be the "Student." We will assume that the names of the brain `GameObject`s are "Teacher" and "Student" respectively.
2. Set the "Teacher" brain to Player mode, and properly configure the inputs to map to the corresponding actions. **Ensure that "Broadcast" is checked within the Brain inspector window.**
1. In order to use imitation learning in a scene, the first thing you will need
is to create two Brains, one which will be the "Teacher," and the other which
will be the "Student." We will assume that the names of the brain
`GameObject`s are "Teacher" and "Student" respectively.
2. Set the "Teacher" brain to Player mode, and properly configure the inputs to
map to the corresponding actions. **Ensure that "Broadcast" is checked within
the Brain inspector window.**
4. Link the brains to the desired agents (one agent as the teacher and at least one agent as a student).
5. In `config/trainer_config.yaml`, add an entry for the "Student" brain. Set the `trainer` parameter of this entry to `imitation`, and the `brain_to_imitate` parameter to the name of the teacher brain: "Teacher". Additionally, set `batches_per_epoch`, which controls how much training to do each moment. Increase the `max_steps` option if you'd like to keep training the agents for a longer period of time.
6. Launch the training process with `mlagents-learn config/trainer_config.yaml --train --slow`, and press the :arrow_forward: button in Unity when the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen
7. From the Unity window, control the agent with the Teacher brain by providing "teacher demonstrations" of the behavior you would like to see.
8. Watch as the agent(s) with the student brain attached begin to behave similarly to the demonstrations.
9. Once the Student agents are exhibiting the desired behavior, end the training process with `CTL+C` from the command line.
10. Move the resulting `*.bytes` file into the `TFModels` subdirectory of the Assets folder (or a subdirectory within Assets of your choosing) , and use with `Internal` brain.
4. Link the brains to the desired agents (one agent as the teacher and at least
one agent as a student).
5. In `config/trainer_config.yaml`, add an entry for the "Student" brain. Set
the `trainer` parameter of this entry to `imitation`, and the
`brain_to_imitate` parameter to the name of the teacher brain: "Teacher".
Additionally, set `batches_per_epoch`, which controls how much training to do
each moment. Increase the `max_steps` option if you'd like to keep training
the agents for a longer period of time.
6. Launch the training process with `mlagents-learn config/trainer_config.yaml
--train --slow`, and press the :arrow_forward: button in Unity when the
message _"Start training by pressing the Play button in the Unity Editor"_ is
displayed on the screen.
7. From the Unity window, control the agent with the Teacher brain by providing
"teacher demonstrations" of the behavior you would like to see.
8. Watch as the agent(s) with the student brain attached begin to behave
similarly to the demonstrations.
9. Once the Student agents are exhibiting the desired behavior, end the training
process with `Ctrl+C` from the command line.
10. Move the resulting `*.bytes` file into the `TFModels` subdirectory of the
Assets folder (or a subdirectory within Assets of your choosing), and use
with `Internal` brain.
We provide a convenience utility, `BC Teacher Helper` component that you can add to the Teacher Agent.
We provide a convenience utility, `BC Teacher Helper` component that you can add
to the Teacher Agent.
<img src="images/bc_teacher_helper.png"
alt="BC Teacher Helper"
width="375" border="10" />
<img src="images/bc_teacher_helper.png"
alt="BC Teacher Helper"
width="375" border="10" />
1. To start and stop recording experiences. This is useful in case you'd like to interact with the game _but not have the agents learn from these interactions_. The default command to toggle this is to press `R` on the keyboard.
1. To start and stop recording experiences. This is useful in case you'd like to
interact with the game _but not have the agents learn from these
interactions_. The default command to toggle this is to press `R` on the
keyboard.
2. Reset the training buffer. This enables you to instruct the agents to forget their buffer of recent experiences. This is useful if you'd like to get them to quickly learn a new behavior. The default command to reset the buffer is to press `C` on the keyboard.
2. Reset the training buffer. This enables you to instruct the agents to forget
their buffer of recent experiences. This is useful if you'd like to get them
to quickly learn a new behavior. The default command to reset the buffer is
to press `C` on the keyboard.

167
docs/Training-ML-Agents.md


# Training ML-Agents
The ML-Agents toolkit conducts training using an external Python training process. During training, this external process communicates with the Academy object in the Unity scene to generate a block of agent experiences. These experiences become the training set for a neural network used to optimize the agent's policy (which is essentially a mathematical function mapping observations to actions). In reinforcement learning, the neural network optimizes the policy by maximizing the expected rewards. In imitation learning, the neural network optimizes the policy to achieve the smallest difference between the actions chosen by the agent trainee and the actions chosen by the expert in the same situation.
The ML-Agents toolkit conducts training using an external Python training
process. During training, this external process communicates with the Academy
object in the Unity scene to generate a block of agent experiences. These
experiences become the training set for a neural network used to optimize the
agent's policy (which is essentially a mathematical function mapping
observations to actions). In reinforcement learning, the neural network
optimizes the policy by maximizing the expected rewards. In imitation learning,
the neural network optimizes the policy to achieve the smallest difference
between the actions chosen by the agent trainee and the actions chosen by the
expert in the same situation.
The output of the training process is a model file containing the optimized policy. This model file is a TensorFlow data graph containing the mathematical operations and the optimized weights selected during the training process. You can use the generated model file with the Internal Brain type in your Unity project to decide the best course of action for an agent.
The output of the training process is a model file containing the optimized
policy. This model file is a TensorFlow data graph containing the mathematical
operations and the optimized weights selected during the training process. You
can use the generated model file with the Internal Brain type in your Unity
project to decide the best course of action for an agent.
Use the command `mlagents-learn` to train your agents. This command is installed with the `mlagents` package
and its implementation can be found at `ml-agents/learn.py`. The [configuration file](#training-config-file), `config/trainer_config.yaml` specifies the hyperparameters used during training. You can edit this file with a text editor to add a specific configuration for each brain.
Use the command `mlagents-learn` to train your agents. This command is installed
with the `mlagents` package and its implementation can be found at
`ml-agents/learn.py`. The [configuration file](#training-config-file),
`config/trainer_config.yaml`, specifies the hyperparameters used during training.
You can edit this file with a text editor to add a specific configuration for
each brain.
For a broader overview of reinforcement learning, imitation learning and the ML-Agents training process, see [ML-Agents Toolkit Overview](ML-Agents-Overview.md).
For a broader overview of reinforcement learning, imitation learning and the
ML-Agents training process, see [ML-Agents Toolkit
Overview](ML-Agents-Overview.md).
Use the `mlagents-learn` command to train agents. `mlagents-learn` supports training with [reinforcement learning](Background-Machine-Learning.md#reinforcement-learning), [curriculum learning](Training-Curriculum-Learning.md), and [behavioral cloning imitation learning](Training-Imitation-Learning.md).
Use the `mlagents-learn` command to train agents. `mlagents-learn` supports
training with
[reinforcement learning](Background-Machine-Learning.md#reinforcement-learning),
[curriculum learning](Training-Curriculum-Learning.md),
and [behavioral cloning imitation learning](Training-Imitation-Learning.md).
Run `mlagents-learn` from the command line to launch the training process. Use the command line patterns and the `config/trainer_config.yaml` file to control training options.
Run `mlagents-learn` from the command line to launch the training process. Use
the command line patterns and the `config/trainer_config.yaml` file to control
training options.
```shell
```sh
* `<trainer-config-file>` is the filepath of the trainer configuration yaml.
* `<env_name>`__(Optional)__ is the name (including path) of your Unity executable containing the agents to be trained. If `<env_name>` is not passed, the training will happen in the Editor. Press the :arrow_forward: button in Unity when the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen.
* `<run-identifier>` is an optional identifier you can use to identify the results of individual training runs.
* `<trainer-config-file>` is the filepath of the trainer configuration yaml.
* `<env_name>`__(Optional)__ is the name (including path) of your Unity
executable containing the agents to be trained. If `<env_name>` is not passed,
the training will happen in the Editor. Press the :arrow_forward: button in
Unity when the message _"Start training by pressing the Play button in the
Unity Editor"_ is displayed on the screen.
* `<run-identifier>` is an optional identifier you can use to identify the
results of individual training runs.
For example, suppose you have a project in Unity named "CatsOnBicycles" which contains agents ready to train. To perform the training:
For example, suppose you have a project in Unity named "CatsOnBicycles" which
contains agents ready to train. To perform the training:
1. [Build the project](Learning-Environment-Executable.md), making sure that you only include the training scene.
1. [Build the project](Learning-Environment-Executable.md), making sure that you
only include the training scene.
4. Run the following to launch the training process using the path to the Unity environment you built in step 1:
4. Run the following to launch the training process using the path to the Unity
environment you built in step 1:
mlagents-learn config/trainer_config.yaml --env=../../projects/Cats/CatsOnBicycles.app --run-id=cob_1 --train
```sh
mlagents-learn config/trainer_config.yaml --env=../../projects/Cats/CatsOnBicycles.app --run-id=cob_1 --train
```
During a training session, the training program prints out and saves updates at regular intervals (specified by the `summary_freq` option). The saved statistics are grouped by the `run-id` value so you should assign a unique id to each training run if you plan to view the statistics. You can view these statistics using TensorBoard during or after training by running the following command (from the ML-Agents python directory):
During a training session, the training program prints out and saves updates at
regular intervals (specified by the `summary_freq` option). The saved statistics
are grouped by the `run-id` value so you should assign a unique id to each
training run if you plan to view the statistics. You can view these statistics
using TensorBoard during or after training by running the following command
(from the ML-Agents python directory):
tensorboard --logdir=summaries
```sh
tensorboard --logdir=summaries
```
While this example used the default training hyperparameters, you can edit the [training_config.yaml file](#training-config-file) with a text editor to set different values.
While this example used the default training hyperparameters, you can edit the
[trainer_config.yaml file](#training-config-file) with a text editor to set
different values.
In addition to passing the path of the Unity executable containing your training environment, you can set the following command line options when invoking `mlagents-learn`:
In addition to passing the path of the Unity executable containing your training
environment, you can set the following command line options when invoking
`mlagents-learn`:
* `--curriculum=<file>` – Specify a curriculum JSON file for defining the lessons for curriculum training. See [Curriculum Training](Training-Curriculum-Learning.md) for more information.
* `--keep-checkpoints=<n>` – Specify the maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the `save-freq` option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint. Defaults to 5.
* `--lesson=<n>` – Specify which lesson to start with when performing curriculum training. Defaults to 0.
* `--load` – If set, the training code loads an already trained model to initialize the neural network before training. The learning code looks for the model in `models/<run-id>/` (which is also where it saves models at the end of training). When not set (the default), the neural network weights are randomly initialized and an existing model is not loaded.
* `--num-runs=<n>` - Sets the number of concurrent training sessions to perform. Default is set to 1. Set to higher values when benchmarking performance and multiple training sessions is desired. Training sessions are independent, and do not improve learning performance.
* `--run-id=<path>` – Specifies an identifier for each training run. This identifier is used to name the subdirectories in which the trained model and summary statistics are saved as well as the saved model itself. The default id is "ppo". If you use TensorBoard to view the training statistics, always set a unique run-id for each training run. (The statistics for all runs with the same id are combined as if they were produced by a the same session.)
* `--save-freq=<n>` Specifies how often (in steps) to save the model during training. Defaults to 50000.
* `--seed=<n>` – Specifies a number to use as a seed for the random number generator used by the training code.
* `--slow` – Specify this option to run the Unity environment at normal, game speed. The `--slow` mode uses the **Time Scale** and **Target Frame Rate** specified in the Academy's **Inference Configuration**. By default, training runs using the speeds specified in your Academy's **Training Configuration**. See [Academy Properties](Learning-Environment-Design-Academy.md#academy-properties).
* `--train` – Specifies whether to train model or only run in inference mode. When training, **always** use the `--train` option.
* `--worker-id=<n>` – When you are running more than one training environment at the same time, assign each a unique worker-id number. The worker-id is added to the communication port opened between the current instance of `mlagents-learn` and the ExternalCommunicator object in the Unity environment. Defaults to 0.
* `--docker-target-name=<dt>` – The Docker Volume on which to store curriculum, executable and model files. See [Using Docker](Using-Docker.md).
* `--no-graphics` - Specify this option to run the Unity executable in `-batchmode` and doesn't initialize the graphics driver. Use this only if your training doesn't involve visual observations (reading from Pixels). See [here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more details.
* `--curriculum=<file>` – Specify a curriculum JSON file for defining the
lessons for curriculum training. See [Curriculum
Training](Training-Curriculum-Learning.md) for more information.
* `--keep-checkpoints=<n>` – Specify the maximum number of model checkpoints to
keep. Checkpoints are saved after the number of steps specified by the
`save-freq` option. Once the maximum number of checkpoints has been reached,
the oldest checkpoint is deleted when saving a new checkpoint. Defaults to 5.
* `--lesson=<n>` – Specify which lesson to start with when performing curriculum
training. Defaults to 0.
* `--load` – If set, the training code loads an already trained model to
initialize the neural network before training. The learning code looks for the
model in `models/<run-id>/` (which is also where it saves models at the end of
training). When not set (the default), the neural network weights are randomly
initialized and an existing model is not loaded.
* `--num-runs=<n>` - Sets the number of concurrent training sessions to perform.
Default is set to 1. Set to higher values when benchmarking performance and
multiple training sessions are desired. Training sessions are independent, and
do not improve learning performance.
* `--run-id=<path>` – Specifies an identifier for each training run. This
identifier is used to name the subdirectories in which the trained model and
summary statistics are saved as well as the saved model itself. The default id
is "ppo". If you use TensorBoard to view the training statistics, always set a
unique run-id for each training run. (The statistics for all runs with the
same id are combined as if they were produced by the same session.)
* `--save-freq=<n>` – Specifies how often (in steps) to save the model during
training. Defaults to 50000.
* `--seed=<n>` – Specifies a number to use as a seed for the random number
generator used by the training code.
* `--slow` – Specify this option to run the Unity environment at normal, game
speed. The `--slow` mode uses the **Time Scale** and **Target Frame Rate**
specified in the Academy's **Inference Configuration**. By default, training
runs using the speeds specified in your Academy's **Training Configuration**.
See
[Academy Properties](Learning-Environment-Design-Academy.md#academy-properties).
* `--train` – Specifies whether to train the model or only run in inference mode.
When training, **always** use the `--train` option.
* `--worker-id=<n>` – When you are running more than one training environment at
the same time, assign each a unique worker-id number. The worker-id is added
to the communication port opened between the current instance of
`mlagents-learn` and the ExternalCommunicator object in the Unity environment.
Defaults to 0.
* `--docker-target-name=<dt>` – The Docker Volume on which to store curriculum,
executable and model files. See [Using Docker](Using-Docker.md).
* `--no-graphics` - Specify this option to run the Unity executable in
`-batchmode` without initializing the graphics driver. Use this only if your
training doesn't involve visual observations (reading from Pixels). See
[here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more
details.
The training config file, `config/trainer_config.yaml` specifies the training method, the hyperparameters, and a few additional values to use during training. The file is divided into sections. The **default** section defines the default values for all the available settings. You can also add new sections to override these defaults to train specific Brains. Name each of these override sections after the GameObject containing the Brain component that should use these settings. (This GameObject will be a child of the Academy in your scene.) Sections for the example environments are included in the provided config file.
The training config file, `config/trainer_config.yaml`, specifies the training
method, the hyperparameters, and a few additional values to use during training.
The file is divided into sections. The **default** section defines the default
values for all the available settings. You can also add new sections to override
these defaults to train specific Brains. Name each of these override sections
after the GameObject containing the Brain component that should use these
settings. (This GameObject will be a child of the Academy in your scene.)
Sections for the example environments are included in the provided config file.
| :-- | :-- | :-- |
| :-- | :-- | :-- |
| batch_size | The number of experiences in each iteration of gradient descent.| PPO, BC |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model.| BC |
| beta | The strength of entropy regularization.| PPO, BC |

| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md).| PPO, BC |
|| PPO = Proximal Policy Optimization, BC = Behavioral Cloning (Imitation) ||
For specific advice on setting hyperparameters based on the type of training you are conducting, see:
For specific advice on setting hyperparameters based on the type of training you
are conducting, see:
* [Training with PPO](Training-PPO.md)
* [Using Recurrent Neural Networks](Feature-Memory.md)

You can also compare the [example environments](Learning-Environment-Examples.md) to the corresponding sections of the `config/trainer_config.yaml` file for each example to see how the hyperparameters and other configuration variables have been changed from the defaults.
You can also compare the
[example environments](Learning-Environment-Examples.md)
to the corresponding sections of the `config/trainer_config.yaml` file for each
example to see how the hyperparameters and other configuration variables have
been changed from the defaults.
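
As a rough sketch of the layout described above (assuming PyYAML is installed;
`Ball3DBrain` is just an example section name), you can inspect how a
brain-specific section overrides the defaults:

```python
import yaml  # PyYAML

with open("config/trainer_config.yaml") as f:
    config = yaml.safe_load(f)

defaults = config["default"]                 # the **default** section
overrides = config.get("Ball3DBrain", {})    # a brain-specific section, if any
effective = {**defaults, **overrides}        # per-brain settings win over defaults
print(effective.get("batch_size"), effective.get("buffer_size"))
```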

218
docs/Training-PPO.md


# Training with Proximal Policy Optimization
ML-Agents uses a reinforcement learning technique called [Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/). PPO uses a neural network to approximate the ideal function that maps an agent's observations to the best action an agent can take in a given state. The ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate Python process (communicating with the running Unity application over a socket).
ML-Agents uses a reinforcement learning technique called
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
PPO uses a neural network to approximate the ideal function that maps an agent's
observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).
See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the training program, `learn.py`.
See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the
training program, `learn.py`.
If you are using the recurrent neural network (RNN) to utilize memory, see [Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training details.
If you are using the recurrent neural network (RNN) to utilize memory, see
[Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training
details.
If you are using curriculum training to pace the difficulty of the learning task presented to an agent, see [Training with Curriculum Learning](Training-Curriculum-Learning.md).
If you are using curriculum training to pace the difficulty of the learning task
presented to an agent, see [Training with Curriculum
Learning](Training-Curriculum-Learning.md).
For information about imitation learning, which uses a different training algorithm, see [Training with Imitation Learning](Training-Imitation-Learning.md).
For information about imitation learning, which uses a different training
algorithm, see
[Training with Imitation Learning](Training-Imitation-Learning.md).
Successfully training a Reinforcement Learning model often involves tuning the training hyperparameters. This guide contains some best practices for tuning the training process when the default parameters don't seem to be giving the level of performance you would like.
Successfully training a Reinforcement Learning model often involves tuning the
training hyperparameters. This guide contains some best practices for tuning the
training process when the default parameters don't seem to be giving the level
of performance you would like.
#### Gamma
### Gamma
`gamma` corresponds to the discount factor for future rewards. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller.
`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.
#### Lambda
### Lambda
`lambd` corresponds to the `lambda` parameter used when calculating the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process.
`lambd` corresponds to the `lambda` parameter used when calculating the
Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This
can be thought of as how much the agent relies on its current value estimate
when calculating an updated value estimate. Low values correspond to relying
more on the current value estimate (which can be high bias), and high values
correspond to relying more on the actual rewards received in the environment
(which can be high variance). The parameter provides a trade-off between the
two, and the right value can lead to a more stable training process.
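As a toy numerical illustration of where `gamma` and `lambd` enter, the snippet
below runs a generic GAE calculation (this is not the toolkit's own code; the
rewards and value estimates are made up):

```python
import numpy as np

gamma, lambd = 0.99, 0.95
rewards = np.array([0.0, 0.0, 1.0])        # rewards at steps t, t+1, t+2
values  = np.array([0.5, 0.6, 0.7, 0.0])   # value estimates, last entry bootstraps

advantage = 0.0
advantages = np.zeros_like(rewards)
for t in reversed(range(len(rewards))):
    delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
    advantage = delta + gamma * lambd * advantage            # GAE recursion
    advantages[t] = advantage
print(advantages)
```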
#### Buffer Size
### Buffer Size
`buffer_size` corresponds to how many experiences (agent observations, actions and rewards obtained) should be collected before we do any
learning or updating of the model. **This should be a multiple of `batch_size`**. Typically larger `buffer_size` correspond to more stable training updates.
`buffer_size` corresponds to how many experiences (agent observations, actions
and rewards obtained) should be collected before we do any learning or updating
of the model. **This should be a multiple of `batch_size`**. Typically a larger
`buffer_size` corresponds to more stable training updates.
#### Batch Size
### Batch Size
`batch_size` is the number of experiences used for one iteration of a gradient descent update. **This should always be a fraction of the
`buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value
should be smaller (on the order of 10s).
`batch_size` is the number of experiences used for one iteration of a gradient
descent update. **This should always be a fraction of the `buffer_size`**. If
you are using a continuous action space, this value should be large (in the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
### Number of Epochs
#### Number of Epochs
`num_epoch` is the number of passes through the experience buffer during gradient descent. The larger the `batch_size`, the
larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning.
`num_epoch` is the number of passes through the experience buffer during
gradient descent. The larger the `batch_size`, the larger it is acceptable to
make this. Decreasing this will ensure more stable updates, at the cost of
slower learning.
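The sketch below (illustrative only, with arbitrary numbers) shows how
`buffer_size`, `batch_size` and `num_epoch` relate: the buffer is filled with
experiences, then swept `num_epoch` times in minibatches of `batch_size`:

```python
import numpy as np

buffer_size, batch_size, num_epoch = 2048, 64, 3
buffer = np.arange(buffer_size)            # stand-in for collected experiences

for epoch in range(num_epoch):
    np.random.shuffle(buffer)
    for start in range(0, buffer_size, batch_size):
        minibatch = buffer[start:start + batch_size]
        # ...one gradient descent step on this minibatch would happen here...
print("updates per buffer:", num_epoch * buffer_size // batch_size)
```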
### Learning Rate
#### Learning Rate
`learning_rate` corresponds to the strength of each gradient descent update step. This should typically be decreased if
training is unstable, and the reward does not consistently increase.
`learning_rate` corresponds to the strength of each gradient descent update
step. This should typically be decreased if training is unstable, and the reward
does not consistently increase.
#### Time Horizon
### Time Horizon
`time_horizon` corresponds to how many steps of experience to collect per-agent before adding it to the experience buffer.
When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state.
As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and more biased, but less varied estimate (short time horizon).
In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal.
This number should be large enough to capture all the important behavior within a sequence of an agent's actions.
`time_horizon` corresponds to how many steps of experience to collect per-agent
before adding it to the experience buffer. When this limit is reached before the
end of an episode, a value estimate is used to predict the overall expected
reward from the agent's current state. As such, this parameter trades off
between a less biased, but higher variance estimate (long time horizon) and more
biased, but less varied estimate (short time horizon). In cases where there are
frequent rewards within an episode, or episodes are prohibitively large, a
smaller number can be more ideal. This number should be large enough to capture
all the important behavior within a sequence of an agent's actions.
#### Max Steps
### Max Steps
`max_steps` corresponds to how many steps of the simulation (multiplied by frame-skip) are run during the training process. This value should be increased for more complex problems.
`max_steps` corresponds to how many steps of the simulation (multiplied by
frame-skip) are run during the training process. This value should be increased
for more complex problems.
#### Beta
### Beta
`beta` corresponds to the strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.
`beta` corresponds to the strength of the entropy regularization, which makes
the policy "more random." This ensures that agents properly explore the action
space during training. Increasing this will ensure more random actions are
taken. This should be adjusted such that the entropy (measurable from
TensorBoard) slowly decreases alongside increases in reward. If entropy drops
too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.
#### Epsilon
### Epsilon
`epsilon` corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process.
`epsilon` corresponds to the acceptable threshold of divergence between the old
and new policies during gradient descent updating. Setting this value small will
result in more stable updates, but will also slow the training process.
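The clipping idea behind `epsilon` can be illustrated in isolation (this is not
the actual loss code): the ratio between the new and old policy probabilities is
clipped to `[1 - epsilon, 1 + epsilon]`, which bounds how far a single update can
move the policy.

```python
import numpy as np

epsilon = 0.2
ratio = np.array([0.5, 0.95, 1.3])                     # made-up probability ratios
clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
print(clipped)                                         # -> [0.8, 0.95, 1.2]
```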
#### Normalize
### Normalize
`normalize` corresponds to whether normalization is applied to the vector observation inputs. This normalization is based on the running average and variance of the vector observation.
Normalization can be helpful in cases with complex continuous control problems, but may be harmful with simpler discrete control problems.
`normalize` corresponds to whether normalization is applied to the vector
observation inputs. This normalization is based on the running average and
variance of the vector observation. Normalization can be helpful in cases with
complex continuous control problems, but may be harmful with simpler discrete
control problems.
#### Number of Layers
### Number of Layers
`num_layers` corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems,
fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems.
`num_layers` corresponds to how many hidden layers are present after the
observation input, or after the CNN encoding of the visual observation. For
simple problems, fewer layers are likely to train faster and more efficiently.
More layers may be necessary for more complex control problems.
#### Hidden Units
### Hidden Units
`hidden_units` correspond to how many units are in each fully connected layer of the neural network. For simple problems
where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where
the action is a very complex interaction between the observation variables, this should be larger.
`hidden_units` corresponds to how many units are in each fully connected layer of
the neural network. For simple problems where the correct action is a
straightforward combination of the observation inputs, this should be small. For
problems where the action is a very complex interaction between the observation
variables, this should be larger.
### (Optional) Recurrent Neural Network Hyperparameters
## (Optional) Recurrent Neural Network Hyperparameters
#### Sequence Length
### Sequence Length
`sequence_length` corresponds to the length of the sequences of experience passed through the network during training. This should be long enough to capture whatever information your agent might need to remember over time. For example, if your agent needs to remember the velocity of objects, then this can be a small value. If your agent needs to remember a piece of information given only once at the beginning of an episode, then this should be a larger value.
`sequence_length` corresponds to the length of the sequences of experience
passed through the network during training. This should be long enough to
capture whatever information your agent might need to remember over time. For
example, if your agent needs to remember the velocity of objects, then this can
be a small value. If your agent needs to remember a piece of information given
only once at the beginning of an episode, then this should be a larger value.
#### Memory Size
### Memory Size
`memory_size` corresponds to the size of the array of floating point numbers used to store the hidden state of the recurrent neural network. This value must be a multiple of 4, and should scale with the amount of information you expect the agent will need to remember in order to successfully complete the task.
`memory_size` corresponds to the size of the array of floating point numbers
used to store the hidden state of the recurrent neural network. This value must
be a multiple of 4, and should scale with the amount of information you expect
the agent will need to remember in order to successfully complete the task.
### (Optional) Intrinsic Curiosity Module Hyperparameters
## (Optional) Intrinsic Curiosity Module Hyperparameters
#### Curiosity Encoding Size
### Curiosity Encoding Size
`curiosity_enc_size` corresponds to the size of the hidden layer used to encode the observations within the intrinsic curiosity module. This value should be small enough to encourage the curiosity module to compress the original observation, but also not too small to prevent it from learning the dynamics of the environment.
`curiosity_enc_size` corresponds to the size of the hidden layer used to encode
the observations within the intrinsic curiosity module. This value should be
small enough to encourage the curiosity module to compress the original
observation, but also not too small to prevent it from learning the dynamics of
the environment.
#### Curiosity Strength
### Curiosity Strength
`curiosity_strength` corresponds to the magnitude of the intrinsic reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrnisic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal.
`curiosity_strength` corresponds to the magnitude of the intrinsic reward
generated by the intrinsic curiosity module. This should be scaled in order to
ensure it is large enough to not be overwhelmed by extrinsic reward signals in
the environment. Likewise it should not be too large to overwhelm the extrinsic
reward signal.
To view training statistics, use TensorBoard. For information on launching and using TensorBoard, see [here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
#### Cumulative Reward
### Cumulative Reward
The general trend in reward should consistently increase over time. Small ups and downs are to be expected. Depending on the complexity of the task, a significant increase in reward may not present itself until millions of steps into the training process.
The general trend in reward should consistently increase over time. Small ups
and downs are to be expected. Depending on the complexity of the task, a
significant increase in reward may not present itself until millions of steps
into the training process.
#### Entropy
### Entropy
This corresponds to how random the decisions of a brain are. This should consistently decrease during training. If it decreases too soon or not at all, `beta` should be adjusted (when using discrete action space).
This corresponds to how random the decisions of a brain are. This should
consistently decrease during training. If it decreases too soon or not at all,
`beta` should be adjusted (when using discrete action space).
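For a discrete action space, the entropy reported here is the Shannon entropy of
the action probabilities; a small standalone illustration (not toolkit code):

```python
import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    return float(-(probs * np.log(probs + 1e-10)).sum())

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.39, a nearly random policy
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17, a nearly deterministic policy
```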
#### Learning Rate
### Learning Rate
#### Policy Loss
### Policy Loss
These values will oscillate during training. Generally they should be less than 1.0.
These values will oscillate during training. Generally they should be less than
1.0.
#### Value Estimate
### Value Estimate
These values should increase as the cumulative reward increases. They correspond to how much future reward the agent predicts itself receiving at any given point.
These values should increase as the cumulative reward increases. They correspond
to how much future reward the agent predicts itself receiving at any given
point.
#### Value Loss
### Value Loss
These values will increase as the reward increases, and then should decrease once reward becomes stable.
These values will increase as the reward increases, and then should decrease
once reward becomes stable.

104
docs/Training-on-Amazon-Web-Service.md


# Training on Amazon Web Service
This page contains instructions for setting up an EC2 instance on Amazon Web Service for training ML-Agents environments.
This page contains instructions for setting up an EC2 instance on Amazon Web
Service for training ML-Agents environments.
We've prepared an preconfigured AMI for you with the ID: `ami-18642967` in the `us-east-1` region. It was created as a modification of [Deep Learning AMI (Ubuntu)](https://aws.amazon.com/marketplace/pp/B077GCH38C). If you want to do training without the headless mode, you need to enable X Server on it. After launching your EC2 instance using the ami and ssh into it, run the following commands to enable it:
We've prepared a preconfigured AMI for you with the ID: `ami-18642967` in the
`us-east-1` region. It was created as a modification of [Deep Learning AMI
(Ubuntu)](https://aws.amazon.com/marketplace/pp/B077GCH38C). If you want to do
training without the headless mode, you need to enable X Server on it. After
launching your EC2 instance using the AMI and SSHing into it, run the following
commands to enable it:
```
```console
sudo /usr/bin/X :0 &
$ sudo /usr/bin/X :0 &
nvidia-smi
$ nvidia-smi
/*
* Thu Jun 14 20:27:26 2018
* +-----------------------------------------------------------------------------+

*/
//Make the ubuntu use X Server for display
export DISPLAY=:0
$ export DISPLAY=:0
You could also choose to configure your own instance. To begin with, you will need an EC2 instance which contains the latest Nvidia drivers, CUDA9, and cuDNN. In this tutorial we used the [Deep Learning AMI (Ubuntu)](https://aws.amazon.com/marketplace/pp/B077GCH38C) listed under AWS Marketplace with a p2.xlarge instance.
You could also choose to configure your own instance. To begin with, you will
need an EC2 instance which contains the latest Nvidia drivers, CUDA9, and cuDNN.
In this tutorial we used the
[Deep Learning AMI (Ubuntu)](https://aws.amazon.com/marketplace/pp/B077GCH38C)
listed under AWS Marketplace with a p2.xlarge instance.
### Installing the ML-Agents toolkit on the instance

```
```sh
```
```sh
git clone https://github.com/Unity-Technologies/ml-agents.git
cd ml-agents/python
pip3 install .

X Server setup is only necessary if you want to do training that requires visual observation input. _Instructions here are adapted from this [Medium post](https://medium.com/towards-data-science/how-to-run-unity-on-amazon-cloud-or-without-monitor-3c10ce022639) on running general Unity applications in the cloud._
X Server setup is only necessary if you want to do training that requires visual
observation input. _Instructions here are adapted from this
[Medium post](https://medium.com/towards-data-science/how-to-run-unity-on-amazon-cloud-or-without-monitor-3c10ce022639)
on running general Unity applications in the cloud._
Current limitations of the Unity Engine require that a screen be available to render to when using visual observations. In order to make this possible when training on a remote server, a virtual screen is required. We can do this by installing Xorg and creating a virtual screen. Once installed and created, we can display the Unity environment in the virtual environment, and train as we would on a local machine. Ensure that `headless` mode is disabled when building linux executables which use visual observations.
Current limitations of the Unity Engine require that a screen be available to
render to when using visual observations. In order to make this possible when
training on a remote server, a virtual screen is required. We can do this by
installing Xorg and creating a virtual screen. Once installed and created, we
can display the Unity environment in the virtual environment, and train as we
would on a local machine. Ensure that `headless` mode is disabled when building
linux executables which use visual observations.
```console
$ sudo apt-get update
$ sudo apt-get install -y xserver-xorg mesa-utils
$ sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024
$ nvidia-xconfig --query-gpu-info
$ sudo sed -i 's/ BoardName "Tesla K80"/ BoardName "Tesla K80"\n BusID "0:30:0"/g' /etc/X11/xorg.conf
// Remove the two lines that contain Section "Files" and EndSection
$ sudo vim /etc/X11/xorg.conf
```

```console
$ wget http://download.nvidia.com/XFree86/Linux-x86_64/390.67/NVIDIA-Linux-x86_64-390.67.run
$ sudo /bin/bash ./NVIDIA-Linux-x86_64-390.67.run --accept-license --no-questions --ui=none
$ sudo echo 'blacklist nouveau' | sudo tee -a /etc/modprobe.d/blacklist.conf
$ sudo echo 'options nouveau modeset=0' | sudo tee -a /etc/modprobe.d/blacklist.conf
$ sudo echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
$ sudo update-initramfs -u
```

3. Restart the EC2 instance:

```console
$ sudo reboot
```

4. Make sure there are no Xorg processes running:

```console
$ sudo killall Xorg
$ nvidia-smi
/*
 * Thu Jun 14 20:21:11 2018
 * +-----------------------------------------------------------------------------+
 * ...
 */
```

5. Start X Server and make Ubuntu use X Server for display:

```console
$ sudo /usr/bin/X :0 &
$ nvidia-smi
$ export DISPLAY=:0
```

6. Ensure that Xorg is correctly configured:

```console
// For more information on glxgears, see ftp://www.x.org/pub/X11R6.8.1/doc/glxgears.1.html.
$ glxgears
// If Xorg is configured correctly, you should see the following message:
/*
 * Running synchronized to the vertical refresh. The framerate should be
 * ...
 */
```

## Training on EC2 instance

1. In the Unity Editor, load a project containing an ML-Agents environment (you
   can use one of the example environments if you have not created your own).
2. Open the Build Settings window (menu: File > Build Settings).
3. Select Linux as the Target Platform, and x86_64 as the target architecture.
4. Check Headless Mode (if you haven't set up the X Server).

You should receive a message confirming that the environment was loaded
successfully; a minimal interactive check is sketched after this list.
8. Train the executable:

```console
// cd into your ml-agents/python folder
$ chmod +x <your_env>.x86_64
$ python learn.py <your_env> --train
```
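
To confirm that the executable and the Python interface can communicate (the
check referenced in the list above), a quick interactive session can be used
first. This is a minimal sketch, assuming the Python package from
`ml-agents/python` is installed and `<your_env>` is replaced with the path to
your build; the exact import path depends on the release you installed (see
[Python-API](Python-API.md)).

```python
# Minimal connectivity check for the Linux build on the EC2 instance.
# Adjust the import to match your installed package version.
from unityagents import UnityEnvironment  # newer releases expose this as mlagents.envs

env = UnityEnvironment(file_name="<your_env>")  # prints a message once the environment loads successfully
print(env.brains)                               # the brains defined in your environment
env.reset(train_mode=True)                      # one reset step to confirm the connection works
env.close()
```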

112
docs/Training-on-Microsoft-Azure-Custom-Instance.md


This page contains instructions for setting up a custom Virtual Machine on
Microsoft Azure so you can run ML-Agents training in the cloud.

1. Start by
   [deploying an Azure VM](https://docs.microsoft.com/azure/virtual-machines/linux/quick-create-portal)
   with Ubuntu Linux (tests were done with 16.04 LTS). To use GPU support, use
   an N-Series VM.
2. SSH into your VM.
3. Start with the following commands to install the Nvidia driver:

```sh
wget http://us.download.nvidia.com/tesla/375.66/nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
sudo dpkg -i nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda-drivers
sudo reboot
```

4. After a minute you should be able to reconnect to your VM and install the
   CUDA toolkit:

```sh
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda-8-0
```

5. You'll next need to download cuDNN from the Nvidia developer site. This
   requires a registered account.
6. Navigate to [http://developer.nvidia.com](http://developer.nvidia.com) and
   create an account and verify it.
7. Download (to your own computer) cuDNN from [this url](https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v6/prod/8.0_20170307/Ubuntu16_04_x64/libcudnn6_6.0.20-1+cuda8.0_amd64-deb).
8. Copy the deb package to your VM:

```sh
scp libcudnn6_6.0.21-1+cuda8.0_amd64.deb <VMUserName>@<VMIPAddress>:libcudnn6_6.0.21-1+cuda8.0_amd64.deb
```

9. SSH back to your VM and execute the following:

```console
sudo dpkg -i libcudnn6_6.0.21-1+cuda8.0_amd64.deb
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
. ~/.profile
sudo reboot
```

10. After a minute, you should be able to SSH back into your VM. After doing
    so, run the following:

```sh
sudo apt install python-pip
sudo apt install python3-pip
```

11. At this point, you need to install TensorFlow. The version you install
    depends on whether you will use the GPU to train (a quick way to verify the
    GPU install is sketched after this list):

```sh
pip3 install tensorflow-gpu==1.4.0 keras==2.0.6
```

Or the CPU to train:

```sh
pip3 install tensorflow==1.4.0 keras==2.0.6
```

12. You'll then need to install additional dependencies:

```sh
pip3 install pillow
pip3 install numpy
pip3 install docopt
```

13. You can now return to the
    [main Azure instruction page](Training-on-Microsoft-Azure.md).
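
As referenced in step 11, a quick way to confirm that the GPU build of
TensorFlow is working is to list the devices it can see. This is a minimal
sketch, assuming you installed `tensorflow-gpu==1.4.0` as above; a CPU-only
install will simply not show a GPU entry.

```python
# List the devices TensorFlow can use; with working drivers on an N-Series VM,
# a GPU device should appear alongside the CPU.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)
```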

102
docs/Training-on-Microsoft-Azure.md


# Training on Microsoft Azure (works with ML-Agents toolkit v0.3)

This page contains instructions for setting up training on Microsoft Azure
through either
[Azure Container Instances](https://azure.microsoft.com/services/container-instances/)
or Virtual Machines. Non "headless" training has not yet been tested to verify
support.

A pre-configured virtual machine image is available in the Azure Marketplace and
is nearly completely ready for training. You can start by deploying the
[Data Science Virtual Machine for Linux (Ubuntu)](https://azuremarketplace.microsoft.com/marketplace/apps/microsoft-ads.linux-data-science-vm-ubuntu)
into your Azure subscription. Once your VM is deployed, SSH into it and run the
following command to complete dependency installation:

Note that, if you choose to deploy the image to an
[N-Series GPU optimized VM](https://docs.microsoft.com/azure/virtual-machines/linux/sizes-gpu),
training will, by default, run on the GPU. If you choose any other type of VM,
training will run on the CPU.

Setting up your own instance requires a number of package installations. Please
view the documentation for doing so
[here](Training-on-Microsoft-Azure-Custom-Instance.md).

1. [Move](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/copy-files-to-linux-vm-using-scp)
   the `ml-agents` sub-folder of this ml-agents repo to the remote Azure
   instance, and set it as the working directory.
2. Install the required packages with `pip3 install .`.

## Testing

1. In the Unity Editor, load a project containing an ML-Agents environment (you
   can use one of the example environments if you have not created your own).
2. Open the Build Settings window (menu: File > Build Settings).
3. Select Linux as the Target Platform, and x86_64 as the target architecture.
4. Check Headless Mode.

```python
env = UnityEnvironment(<your_env>)
```

You should receive a message confirming that the environment was loaded
successfully.

1. [Move](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/copy-files-to-linux-vm-using-scp)
   your built Unity application to your Virtual Machine.
2. Set the `ml-agents` sub-folder of the ml-agents repo to your working
   directory.
3. Run the following command:

Where `<your_app>` is the path to your app (e.g.
`~/unity-volume/3DBallHeadless`) and `<run_id>` is an identifier you would like
to use to identify your training run.

If you've selected to run on an N-Series VM with GPU support, you can verify
that the GPU is being used by running `nvidia-smi` from the command line.

Once you have started training, you can
[use Tensorboard to observe the training](Using-Tensorboard.md).

1. Start by [opening the appropriate port for web traffic to connect to your VM](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/nsg-quickstart-portal).
   * Note that you don't need to generate a new `Network Security Group` but
     instead, go to the **Networking** tab under **Settings** for your VM.
   * As an example, you could use the following settings to open the Port with
     the following Inbound Rule settings:
     * Source: Any
     * Source Port Ranges: *
     * Destination: Any
     * Destination Port Ranges: 6006
     * Protocol: Any
     * Action: Allow
     * Priority: (Leave as default)
2. Unless you started the training as a background process, connect to your VM
   from another terminal instance.
3. Set the `python` folder in ml-agents to your current working directory.
4. From that folder, run `tensorboard --logdir=summaries --host 0.0.0.0` (the
   full sequence is sketched after this list).
5. You should now be able to open a browser and navigate to
   `<Your_VM_IP_Address>:6006` to view the TensorBoard report.
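
Putting steps 3 and 4 together, the commands on the VM would look roughly like
this (a sketch; it assumes the `summaries` folder is written inside the `python`
directory, as described above):

```sh
# Serve the training summaries so the TensorBoard report is reachable from outside the VM.
cd ml-agents/python
tensorboard --logdir=summaries --host 0.0.0.0
```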

[Azure Container Instances](https://azure.microsoft.com/services/container-instances/)
allow you to spin up a container, on demand, that will run your training and
then be shut down. This ensures you aren't leaving a billable VM running when
it isn't needed. You can read more about
[The ML-Agents toolkit support for Docker containers here](Using-Docker.md).
Using ACI enables you to offload training of your models without needing to
install Python and Tensorflow on your own computer. You can find instructions,
including a pre-deployed image in DockerHub for you to use, available
[here](https://github.com/druttka/unity-ml-on-azure).

8
docs/Using-Docker.md


Docker container by calling the following command at the top-level of the
repository:
```sh
docker build -t <image-name> .
```

Run the Docker container by calling the following command at the top-level of
the repository:
```sh
docker run --name <container-name> \
--mount type=bind,source="$(pwd)"/unity-volume,target=/unity-volume \
-p 5005:5005 \

To train with a `3DBall` environment executable, the command would be:
```sh
docker run --name 3DBallContainer.first.trial \
--mount type=bind,source="$(pwd)"/unity-volume,target=/unity-volume \
-p 5005:5005 \

container while saving state by either using `Ctrl+C` or `⌘+C` (Mac) or by using
the following command:
```sh
docker kill --signal=SIGINT <container-name>
```

171
docs/Using-TensorFlow-Sharp-in-Unity.md


# Using TensorFlowSharp in Unity (Experimental)

The ML-Agents toolkit allows you to use pre-trained
[TensorFlow graphs](https://www.tensorflow.org/programmers_guide/graphs)
inside your Unity games. This support is possible thanks to the
[TensorFlowSharp project](https://github.com/migueldeicaza/TensorFlowSharp).
The primary purpose for this support is to use the TensorFlow models produced by
the ML-Agents toolkit's own training programs, but a side benefit is that you
can use any TensorFlow model.

_Notice: This feature is still experimental. While it is possible to embed
trained models into Unity games, Unity Technologies does not officially support
this use-case for production games at this time. As such, no guarantees are
provided regarding the quality of experience. If you encounter issues regarding
battery life, or general performance (especially on mobile), please let us
know._

## Supported devices

* Linux 64 bits
* Mac OS X 64 bits
* Windows 64 bits
* iOS (Requires additional steps)
* Android

## Requirements

## Using TensorFlowSharp with ML-Agents

Go to `Edit` -> `Player Settings` and add `ENABLE_TENSORFLOW` to the `Scripting
Define Symbols` for each type of device you want to use (**`PC, Mac and Linux
Standalone`**, **`iOS`** or **`Android`**).

Set the Brain you used for training to `Internal`. Drag `your_name_graph.bytes`
into Unity and then drag it into the `Graph Model` field in the Brain.

The TensorFlow data graphs produced by the ML-Agents training programs work
without any additional settings.

In order to use a TensorFlow data graph in Unity, make sure the nodes of your
graph have appropriate names. You can assign names to nodes in TensorFlow:

```python
variable = tf.identity(variable, name="variable_name")
```

* Name the batch size input placeholder `batch_size`
* Name the input vector observation placeholder `state`
* Name the output node `action`
* Name the recurrent vector (memory) input placeholder `recurrent_in` (if any)
* Name the recurrent vector (memory) output node `recurrent_out` (if any)
* Name the visual observation input placeholders `visual_observation_i`, where
  `i` is the index of the observation (starting at 0)

You can have additional placeholders for floats or integers, but they must be
placed in placeholders of dimension 1 and size 1. (Be sure to name them.)

It is important that the inputs and outputs of the graph are exactly the ones
you receive and return when training your model with an `External` brain. This
means you cannot have any operations such as reshaping outside of the graph. The
object you get by calling `step` or `reset` has fields `vector_observations`,
`visual_observations` and `memories` which must correspond to the placeholders
of your graph. Similarly, the arguments `action` and `memory` you pass to `step`
must correspond to the output nodes of your graph.

While training your Agent using the Python API, you can save your graph at any
point of the training. Note that the argument `output_node_names` must be the
name of the tensor your graph outputs (separated by a comma if using multiple
outputs). In this case, it will be either `action` or `action,recurrent_out` if
you have recurrent outputs.

```python
from tensorflow.python.tools import freeze_graph

restore_op_name = "save/restore_all", filename_tensor_name = "save/Const:0")
```

Your model will be saved with the name `your_name_graph.bytes` and will contain
both the graph and associated weights. Note that you must save your graph as a
.bytes file so Unity can load it.

In the Unity Editor, you must specify the names of the nodes used by your graph
in the **Internal** brain Inspector window. If you used a scope when defining
your graph, specify it in the `Graph Scope` field.

See
[Internal Brain](Learning-Environment-Design-External-Internal-Brains.md#internal-brain)
for more information about using Internal Brains.

If you followed these instructions correctly, the agents in your environment
that use this brain will use your fully trained network to make decisions.

## iOS additional instructions for building

* Once you build the project for iOS in the editor, open the .xcodeproj file
  within the project folder using Xcode.
* Set up your iOS account following the
  [iOS Account setup page](https://docs.unity3d.com/Manual/iphone-accountsetup.html).
* Drag the library `libtensorflow-core.a` from the **Project Navigator** on
  the left under `Libraries/ML-Agents/Plugins/iOS` into the flag list, after
  `-force_load`.

## Using TensorFlowSharp without ML-Agents

Beyond controlling an in-game agent, you can also use TensorFlowSharp for more
general computation. The following instructions describe how to generally embed
TensorFlow models without using the ML-Agents framework.

You must have a TensorFlow graph, such as `your_name_graph.bytes`, made using
TensorFlow's `freeze_graph.py`. The process to create such a graph is explained
in the [Using your own trained graphs](#using-your-own-trained-graphs) section.

## Inside of Unity

2. At the top of your C# script, add the line:

```csharp
using TensorFlow;
```

3. If you will be building for Android, you must add this block at the start of
   your code:

```csharp
#if UNITY_ANDROID
TensorFlowSharp.Android.NativeBinding.Init();
#endif
```

```csharp
TextAsset graphModel = Resources.Load (your_name_graph) as TextAsset;
```

```csharp
graph = new TFGraph ();
graph.Import (graphModel.bytes);
session = new TFSession (graph);
```

6. Assign the input tensors for the graph. For example, the following code
   assigns a one dimensional input tensor of size 2:

```csharp
var runner = session.GetRunner ();
runner.AddInput (graph ["input_placeholder_name"] [0], new float[]{ placeholder_value1, placeholder_value2 });
```

You must provide all required inputs to the graph. Supply one input per
TensorFlow placeholder.

```csharp
runner.Fetch (graph["output_placeholder_name"][0]);
float[,] recurrent_tensor = runner.Run () [0].GetValue () as float[,];
```

Note that this example assumes the output array is a two-dimensional tensor of
floats. Cast to a long array if your outputs are integers.
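
For instance, if the fetched node produces integer values, the same pattern
applies with a `long` cast (a sketch reusing the `runner` from the previous
step; `int_output_name` is a placeholder name used only for illustration):

```csharp
runner.Fetch (graph ["int_output_name"] [0]);
long[,] int_tensor = runner.Run () [0].GetValue () as long[,];
```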

71
docs/Using-Tensorboard.md


# Using TensorBoard to Observe Training

The ML-Agents toolkit saves statistics during learning sessions that you can
view with a TensorFlow utility named
[TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard).

The `mlagents-learn` command saves training statistics to a folder named
`summaries`, organized by the `run-id` value you assign to a training session.

In order to observe the training process, either during training or afterward,
start TensorBoard:
1. Open a terminal or console window:

4. Open a browser window and navigate to [localhost:6006](http://localhost:6006).
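
A typical invocation, assuming your current working directory contains the
`summaries` folder written by `mlagents-learn`, looks like this:

```sh
# Launch TensorBoard against the saved training summaries,
# then open http://localhost:6006 in a browser.
tensorboard --logdir=summaries
```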

**Note:** If you don't assign a `run-id` identifier, `mlagents-learn` uses the
default string, "ppo". All the statistics will be saved to the same sub-folder
and displayed as one session in TensorBoard. After a few runs, the displays can
become difficult to interpret in this situation. You can delete the folders
under the `summaries` directory to clear out old statistics.

On the left side of the TensorBoard window, you can select which of the training
runs you want to display. You can select multiple run-ids to compare statistics.
The TensorBoard window also provides options for how to display and smooth
graphs.

When you run the training program, `mlagents-learn`, you can use the
`--save-freq` option to specify how frequently to save the statistics.

## The ML-Agents toolkit training statistics

* Lesson - Plots the progress from lesson to lesson. Only interesting when
  performing [curriculum training](Training-Curriculum-Learning.md).
* Cumulative Reward - The mean cumulative episode reward over all agents. Should
  increase during a successful training session.
* Entropy - How random the decisions of the model are. Should slowly decrease
  during a successful training process. If it decreases too quickly, the `beta`
  hyperparameter should be increased.
* Episode Length - The mean length of each episode in the environment for all
  agents.
* Learning Rate - How large a step the training algorithm takes as it searches
  for the optimal policy. Should decrease over time.
  much the policy (process for deciding actions) is changing. The magnitude of
  this should decrease during a successful training session.
* Value Estimate - The mean value estimate for all states visited by the agent.
  Should increase during a successful training session.
  well the model is able to predict the value of each state. This should
  increase while the agent is learning, and then decrease once the reward
  stabilizes.
* _(Curiosity-Specific)_ Intrinsic Reward - This corresponds to the mean
  cumulative intrinsic reward generated per-episode.
* _(Curiosity-Specific)_ Forward Loss - The mean magnitude of the forward model
  loss function. Corresponds to how well the model is able to predict the new
  observation encoding.
* _(Curiosity-Specific)_ Inverse Loss - The mean magnitude of the inverse model
  loss function. Corresponds to how well the model is able to predict the action
  taken between two observations.