Deric Pang
6 years ago
Current commit 40f4eb3e
37 files changed, with 2535 insertions and 1296 deletions
Changed files (lines changed):

- docs/API-Reference.md (10)
- docs/Background-Jupyter.md (10)
- docs/Background-Machine-Learning.md (301)
- docs/Background-TensorFlow.md (74)
- docs/Background-Unity.md (12)
- docs/Basic-Guide.md (11)
- docs/FAQ.md (12)
- docs/Feature-Memory.md (55)
- docs/Feature-Monitor.md (42)
- docs/Getting-Started-with-Balance-Ball.md (16)
- docs/Installation-Windows.md (248)
- docs/Installation.md (28)
- docs/Learning-Environment-Best-Practices.md (64)
- docs/Learning-Environment-Create-New.md (330)
- docs/Learning-Environment-Design-Academy.md (55)
- docs/Learning-Environment-Design-Agents.md (429)
- docs/Learning-Environment-Design-Brains.md (106)
- docs/Learning-Environment-Design-External-Internal-Brains.md (118)
- docs/Learning-Environment-Design-Heuristic-Brains.md (34)
- docs/Learning-Environment-Design-Player-Brains.md (47)
- docs/Learning-Environment-Design.md (189)
- docs/Learning-Environment-Examples.md (343)
- docs/Learning-Environment-Executable.md (123)
- docs/Limitations.md (25)
- docs/ML-Agents-Overview.md (26)
- docs/Migrating.md (75)
- docs/Python-API.md (10)
- docs/Training-Curriculum-Learning.md (9)
- docs/Training-Imitation-Learning.md (76)
- docs/Training-ML-Agents.md (167)
- docs/Training-PPO.md (218)
- docs/Training-on-Amazon-Web-Service.md (104)
- docs/Training-on-Microsoft-Azure-Custom-Instance.md (112)
- docs/Training-on-Microsoft-Azure.md (102)
- docs/Using-Docker.md (8)
- docs/Using-TensorFlow-Sharp-in-Unity.md (171)
- docs/Using-Tensorboard.md (71)
# Environment Design Best Practices

## General

* It is often helpful to start with the simplest version of the problem, to
  ensure the agent can learn it. From there, increase complexity over time. This
  can either be done manually, or via Curriculum Learning, where a set of
  lessons which progressively increase in difficulty are presented to the agent
  ([learn more here](Training-Curriculum-Learning.md)).
* When possible, it is often helpful to ensure that you can complete the task by
  using a Player Brain to control the agent.
* It is often helpful to make many copies of the agent, and attach the brain to
  be trained to all of these agents. In this way the brain can get more feedback
  from all of these agents, which helps it train faster.
## Rewards

* The magnitude of any given reward should typically not be greater than 1.0 in
  order to ensure a more stable learning process.
* Positive rewards are often more helpful in shaping the desired behavior of an
  agent than negative rewards.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
  typically used (see the sketch after this list).
* If you want the agent to finish a task quickly, it is often helpful to provide
  a small penalty every step (-0.05) that the agent does not complete the task.
  In this case completion of the task should also coincide with the end of the
  episode.
* Overly-large negative rewards can cause undesirable behavior where an agent
  learns to avoid any behavior which might produce the negative reward, even if
  it is also behavior which can eventually lead to a positive reward.
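To make the reward shaping above concrete, here is a minimal sketch of how these
values might be assigned inside an agent's `AgentAction()` override. The
`AgentAction`, `AddReward`, `SetReward`, and `Done` calls are the standard Agent
API of this toolkit generation, but `ReachedGoal()`, `agentRigidbody`, and the
exact reward values are illustrative placeholders; depending on your toolkit
version you may also need a `using MLAgents;` directive.

```csharp
using UnityEngine;

// Rough sketch only: reward values mirror the guidelines above, while
// ReachedGoal() and agentRigidbody are placeholders for your own game logic.
public class RollerAgent : Agent
{
    public Rigidbody agentRigidbody;   // assigned in the Inspector

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Small positive reward for forward velocity (locomotion-style task).
        AddReward(0.1f * Vector3.Dot(agentRigidbody.velocity, transform.forward));

        // Small per-step penalty so the agent learns to finish quickly.
        AddReward(-0.05f);

        if (ReachedGoal())
        {
            // Task completion: keep the magnitude at or below 1.0, end the episode.
            SetReward(1.0f);
            Done();
        }
    }

    bool ReachedGoal()
    {
        // Placeholder for your own success condition.
        return false;
    }
}
```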
## Vector Observations

* Vector Observations should include all variables relevant to allowing the
  agent to make an optimally informed decision.
* In cases where Vector Observations need to be remembered or compared over
  time, increase the `Stacked Vectors` value to allow the agent to keep track of
  multiple observations into the past.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
  encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`).
* Besides encoding non-numeric values, all inputs should be normalized to be in
  the range 0 to +1 (or -1 to 1). For example, the `x` position of an agent
  where the maximum possible value is `maxValue` should be recorded as
  `AddVectorObs(transform.position.x / maxValue);` rather than
  `AddVectorObs(transform.position.x);`. See the equation below for one approach
  to normalization.
* Positional information of relevant GameObjects should be encoded in relative
  coordinates wherever possible. This is often relative to the agent's position.
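One common approach to the normalization mentioned above is to rescale a value
by its known minimum and maximum:

```
normalizedValue = (currentValue - minValue) / (maxValue - minValue)
```

Below is a minimal sketch of how these guidelines might look in an agent's
`CollectObservations()` override. `CollectObservations` and `AddVectorObs` are
the standard observation API of this toolkit generation; `maxValue`, `heldItem`,
and `target` are illustrative placeholders, and a `using MLAgents;` directive
may be needed depending on your version.

```csharp
using UnityEngine;

// Rough sketch only: maxValue, heldItem, and target are placeholders.
public class ExampleAgent : Agent
{
    public float maxValue = 10f;   // known maximum of the x position
    public Transform target;
    int heldItem;                  // 0 = Sword, 1 = Shield, 2 = Bow

    public override void CollectObservations()
    {
        // Normalize numeric inputs to roughly [-1, 1].
        AddVectorObs(transform.position.x / maxValue);

        // One-hot encode a categorical variable with three possible values.
        for (int i = 0; i < 3; i++)
        {
            AddVectorObs(heldItem == i ? 1.0f : 0.0f);
        }

        // Encode positions relative to the agent rather than in world space.
        AddVectorObs((target.position - transform.position) / maxValue);
    }
}
```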
## Vector Actions

* When using continuous control, action values should be clipped to an
  appropriate range (see the sketch after this list). The provided PPO model
  automatically clips these values between -1 and 1, but third-party training
  systems may not do so.
* Be sure to set the Vector Action's Space Size to the number of used Vector
  Actions, and not greater, as doing the latter can interfere with the
  efficiency of the training process.
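As a rough illustration of the clipping advice, assuming a continuous action
space of size 2 (`Mathf.Clamp` is standard Unity API; the movement logic is a
placeholder):

```csharp
using UnityEngine;

// Sketch of defensive action clipping inside an Agent subclass.
public class ClampedAgent : Agent
{
    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // The provided PPO model already emits values in [-1, 1], but actions
        // coming from other training systems or heuristics may not.
        float moveX = Mathf.Clamp(vectorAction[0], -1f, 1f);
        float moveZ = Mathf.Clamp(vectorAction[1], -1f, 1f);

        transform.Translate(new Vector3(moveX, 0f, moveZ) * Time.fixedDeltaTime);
    }
}
```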
# External and Internal Brains

The **External** and **Internal** types of Brains work in different phases of
training. When training your agents, set their Brain types to **External**; when
using the trained models, set their Brain types to **Internal**.

## External Brain

When [running an ML-Agents training algorithm](Training-ML-Agents.md), at least
one Brain object in a scene must be set to **External**. This allows the
training process to collect the observations of agents using that Brain and give
the agents their actions.

In addition to using an External Brain for training with the ML-Agents learning
algorithms, you can use an External Brain to control agents in a Unity
environment from an external Python program. See [Python API](Python-API.md)
for more information.

Unlike the other types, the External Brain has no properties to set in the Unity
Inspector window.
## Internal Brain

The Internal Brain type uses a
[TensorFlow model](https://www.tensorflow.org/get_started/get_started_for_beginners#models_and_training)
to make decisions. The Proximal Policy Optimization (PPO) and Behavioral Cloning
algorithms included with the ML-Agents SDK produce trained TensorFlow models
that you can use with the Internal Brain type.

A __model__ is a mathematical relationship mapping an agent's observations to
its actions. TensorFlow is a software library for performing numerical
computation through data flow graphs. A TensorFlow model, then, defines the
mathematical relationship between your agent's observations and its actions
using a TensorFlow data flow graph.

The training algorithms included in the ML-Agents SDK produce TensorFlow graph
models as the end result of the training process. See
[Training ML-Agents](Training-ML-Agents.md) for instructions on how to train a
model.
To use a trained graph model with an Internal Brain:

1. Select the Brain GameObject in the **Hierarchy** window of the Unity Editor.
   (The Brain GameObject must be a child of the Academy GameObject and must have
   a Brain component.)
2. Set the **Brain Type** to **Internal**.
   **Note:** In order to see the **Internal** Brain Type option, you must
   [enable TensorFlowSharp](Using-TensorFlow-Sharp-in-Unity.md).
3. Import the `environment_run-id.bytes` file produced by the PPO training
   program. (Where `environment_run-id` is the name of the model file, which is
   constructed from the name of your Unity environment executable and the run-id
   value you assigned when running the training process.)

   You can
   [import assets into Unity](https://docs.unity3d.com/Manual/ImportingAssets.html)
   in various ways. The easiest way is to simply drag the file into the
   **Project** window and drop it into an appropriate folder.
4. Once the `environment.bytes` file is imported, drag it from the **Project**
   window to the **Graph Model** field of the Brain component.
If you are using a model produced by the ML-Agents `mlagents-learn` command, use
the default values for the other Internal Brain parameters.

The default values of the TensorFlow graph parameters work with the model
produced by the PPO and BC training code in the ML-Agents SDK. To use a default
ML-Agents model, the only parameter that you need to set is the `Graph Model`,
which must be set to the `.bytes` file containing the trained model itself.

* `Graph Model`: This must be the `bytes` file corresponding to the pre-trained
  TensorFlow graph. (You must first drag this file into your Resources folder
  and then from the Resources folder into the inspector.)

Only change the following Internal Brain properties if you have created your own
TensorFlow model and are not using an ML-Agents model:
* `Graph Scope`: If you set a scope while training your TensorFlow model, all
  your placeholder names will have a prefix. You must specify that prefix here.
  Note that if more than one Brain was set to External during training, you
  must give a `Graph Scope` to the Internal Brain corresponding to the name of
  the Brain GameObject.
* `Batch Size Node Name`: If the batch size is one of the inputs of your graph,
  you must specify the name of the placeholder here. The Brain will
  automatically make the batch size equal to the number of agents connected to
  the Brain.
* `State Node Name`: If your graph uses the state as an input, you must specify
  the name of the placeholder here.
* `Recurrent Input Node Name`: If your graph uses a recurrent input / memory as
  input and outputs new recurrent input / memory, you must specify the name of
  the input placeholder here.
* `Recurrent Output Node Name`: If your graph uses a recurrent input / memory as
  input and outputs new recurrent input / memory, you must specify the name of
  the output placeholder here.
* `Observation Placeholder Name`: If your graph uses observations as input, you
  must specify their names here. Note that the number of observations is equal
  to the length of `Camera Resolutions` in the Brain parameters.
* `Action Node Name`: Specify the name of the placeholder corresponding to the
  actions of the Brain in your graph. If the action space type is continuous,
  the output must be a one-dimensional tensor of floats of length `Action Space
  Size`; if the action space type is discrete, the output must be a
  one-dimensional tensor of ints of the same length as the `Branches` array.
* `Graph Placeholder`: If your graph takes additional inputs that are fixed
  (for example, a noise level), you can specify them here. Note that in your
  graph, these must correspond to one-dimensional tensors of int or float of
  size 1.
  * `Name`: Corresponds to the name of the placeholder.
  * `Value Type`: Either Integer or Floating Point.
  * `Min Value` and `Max Value`: Specify the range of the value here. The value
    will be sampled from the uniform distribution ranging from `Min Value` to
    `Max Value` inclusive.
# Heuristic Brain

The **Heuristic** Brain type allows you to hand-code an agent's decision making
process. A Heuristic Brain requires an implementation of the Decision interface
to which it delegates the decision making process.

When you set the **Brain Type** property of a Brain to **Heuristic**, you must
add a component implementing the Decision interface to the same GameObject as
the Brain.

When creating your Decision class, extend MonoBehaviour (so you can use the
class as a Unity component) and implement the Decision interface:

`public class HeuristicLogic : MonoBehaviour, Decision`

The Decision interface defines two methods, `Decide()` and `MakeMemory()`.

The `Decide()` method receives an agent's current state, consisting of the
agent's observations, reward, memory and other aspects of the agent's state, and
must return an array containing the action that the agent should take. The
format of the returned action array depends on the **Vector Action Space Type**.
When using a **Continuous** action space, the action array is just a float array
with a length equal to the **Vector Action Space Size** setting. When using a
**Discrete** action space, the action array is an integer array with the same
size as the `Branches` array. In the discrete action space, the values of the
**Branches** array define the number of discrete values that your `Decide()`
function can return for each branch, which don't need to be consecutive
integers.

The `MakeMemory()` function allows you to pass data forward to the next
iteration of an agent's decision making process. The array you return from
`MakeMemory()` is passed to the `Decide()` function in the next iteration. You
can use the memory to allow the agent's decision process to take past actions
and observations into account when making the current decision. If your
heuristic logic does not require memory, just return an empty array.
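For concreteness, here is a minimal sketch of a Decision implementation for a
continuous action space of size 2. The exact parameter lists of `Decide()` and
`MakeMemory()` (vector observations, visual observations, reward, done flag, and
memory) are assumed here; check the `Decision` interface source in your
ML-Agents version for the authoritative signature. The meaning of the
observations depends on your agent's `CollectObservations()`.

```csharp
using System.Collections.Generic;
using UnityEngine;

// A minimal hand-coded policy. The Decide()/MakeMemory() parameter lists below
// are assumed; verify them against Decision.cs in your ML-Agents version.
public class HeuristicLogic : MonoBehaviour, Decision
{
    public float[] Decide(List<float> vectorObs, List<Texture2D> visualObs,
                          float reward, bool done, List<float> memory)
    {
        // Continuous action space of size 2: steer toward the origin using the
        // first two vector observations (assumed to be the agent's x and z).
        var action = new float[2];
        action[0] = -Mathf.Clamp(vectorObs[0], -1f, 1f);
        action[1] = -Mathf.Clamp(vectorObs[1], -1f, 1f);
        return action;
    }

    public List<float> MakeMemory(List<float> vectorObs, List<Texture2D> visualObs,
                                  float reward, bool done, List<float> memory)
    {
        // No memory needed for this stateless heuristic.
        return new List<float>();
    }
}
```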
# Player Brain

The **Player** Brain type allows you to control an agent using keyboard
commands. You can use Player Brains to control a "teacher" agent that trains
other agents during [imitation learning](Training-Imitation-Learning.md). You
can also use Player Brains to test your agents and environment before changing
their Brain types to **External** and running the training process.

The **Player** Brain properties allow you to assign one or more keyboard keys to
each action and a unique value to send when a key is pressed.

Note the differences between the discrete and continuous action spaces. When a
Brain uses the discrete action space, you can send one integer value as the
action per step. In contrast, when a Brain uses the continuous action space you
can send any number of floating point values (up to the **Vector Action Space
Size** setting).
| **Property** | | **Description** |
| :----------- | :------------- | :-------------- |
| **Continuous Player Actions** | | The mapping for the continuous vector action space. Shown when the action space is **Continuous**. |
| | **Size** | The number of key commands defined. You can assign more than one command to the same action index in order to send different values for that action. (If you press both keys at the same time, deterministic results are not guaranteed.) |
| | **Index** | The element of the agent's action vector to set when this key is pressed. The index value cannot exceed the size of the Action Space (minus 1, since it is an array index). |
| | **Value** | The value to send to the agent as its action for the specified index when the mapped key is pressed. All other members of the action vector are set to 0. |
| **Discrete Player Actions** | | The mapping for the discrete vector action space. Shown when the action space is **Discrete**. |
| | **Branch Index** | The element of the agent's action vector to set when this key is pressed. The index value cannot exceed the size of the Action Space (minus 1, since it is an array index). |
| | **Value** | The value to send to the agent as its action when the mapped key is pressed. Cannot exceed the max value for the associated branch (minus 1, since it is an array index). |

For more information about the Unity input system, see
[Input](https://docs.unity3d.com/ScriptReference/Input.html).
# Limitations

If you enable Headless mode, you will not be able to collect visual observations
from your agents.

Currently the speed of the game physics can only be increased to 100x real-time.
The Academy also moves in time with FixedUpdate() rather than Update(), so game
behavior implemented in Update() may be out of sync with the Agent decision
making. See
[Execution Order of Event Functions](https://docs.unity3d.com/Manual/ExecutionOrder.html)
for more information.

As of version 0.3, we no longer support Python 2.

Currently the ML-Agents toolkit uses TensorFlow 1.7.1 due to the version of the
TensorFlowSharp plugin we are using.
# Imitation Learning

It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.
Consider our
[running example](ML-Agents-Overview.md#running-example-training-npc-behaviors)
of training a medic NPC: instead of indirectly training a medic with the help
of a reward function, we can give the medic real-world examples of observations
from the game and actions from a game controller to guide the medic's behavior.
More specifically, in this mode, the Brain type during training is set to Player
and all the actions performed with the controller (in addition to the agent
observations) will be recorded and sent to the Python API. The imitation
learning algorithm will then use these pairs of observations and actions from
the human player to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs).

There are a variety of imitation learning algorithms that can be used; the
simplest of them is Behavioral Cloning. It works by collecting training data
from a teacher and then simply using it to directly learn a policy, in the same
way that supervised learning works for image classification or other traditional
machine learning tasks.
1. In order to use imitation learning in a scene, the first thing you will need
   is to create two Brains, one which will be the "Teacher," and the other which
   will be the "Student." We will assume that the names of the Brain
   `GameObject`s are "Teacher" and "Student" respectively.
2. Set the "Teacher" Brain to Player mode, and properly configure the inputs to
   map to the corresponding actions. **Ensure that "Broadcast" is checked within
   the Brain inspector window.**
3. Set the "Student" Brain to External mode.
4. Link the Brains to the desired agents (one agent as the teacher and at least
   one agent as a student).
5. In `config/trainer_config.yaml`, add an entry for the "Student" Brain. Set
   the `trainer` parameter of this entry to `imitation`, and the
   `brain_to_imitate` parameter to the name of the teacher Brain: "Teacher".
   Additionally, set `batches_per_epoch`, which controls how much training to do
   each moment. Increase the `max_steps` option if you'd like to keep training
   the agents for a longer period of time (a sample entry is shown after this
   list).
6. Launch the training process with `mlagents-learn config/trainer_config.yaml
   --train --slow`, and press the :arrow_forward: button in Unity when the
   message _"Start training by pressing the Play button in the Unity Editor"_ is
   displayed on the screen.
7. From the Unity window, control the agent with the Teacher Brain by providing
   "teacher demonstrations" of the behavior you would like to see.
8. Watch as the agent(s) with the Student Brain attached begin to behave
   similarly to the demonstrations.
9. Once the Student agents are exhibiting the desired behavior, end the training
   process with `CTRL+C` from the command line.
10. Move the resulting `*.bytes` file into the `TFModels` subdirectory of the
    Assets folder (or a subdirectory of Assets of your choosing), and use it
    with an `Internal` Brain.
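For reference, the "Student" entry described in step 5 might look roughly like
the following in `config/trainer_config.yaml`. Only the parameters mentioned
above are shown, and the values are placeholders to adjust for your environment,
not recommendations.

```yaml
# Illustrative entry for the brain named "Student"; values are placeholders.
Student:
    trainer: imitation
    brain_to_imitate: Teacher
    batches_per_epoch: 5
    max_steps: 10000
```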
We provide a convenience utility, the `BC Teacher Helper` component, that you
can add to the Teacher Agent.

<img src="images/bc_teacher_helper.png"
     alt="BC Teacher Helper"
     width="375" border="10" />

This utility enables you to use keyboard shortcuts to do the following:

1. To start and stop recording experiences. This is useful in case you'd like to
   interact with the game _but not have the agents learn from these
   interactions_. The default command to toggle this is to press `R` on the
   keyboard.
2. Reset the training buffer. This enables you to instruct the agents to forget
   their buffer of recent experiences. This is useful if you'd like to get them
   to quickly learn a new behavior. The default command to reset the buffer is
   to press `C` on the keyboard.
# Training with Proximal Policy Optimization

ML-Agents uses a reinforcement learning technique called
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
PPO uses a neural network to approximate the ideal function that maps an agent's
observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the
training program, `learn.py`.

If you are using a recurrent neural network (RNN) to utilize memory, see
[Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training
details.

If you are using curriculum training to pace the difficulty of the learning task
presented to an agent, see
[Training with Curriculum Learning](Training-Curriculum-Learning.md).

For information about imitation learning, which uses a different training
algorithm, see
[Training with Imitation Learning](Training-Imitation-Learning.md).

Successfully training a Reinforcement Learning model often involves tuning the
training hyperparameters. This guide contains some best practices for tuning the
training process when the default parameters don't seem to be giving the level
of performance you would like.
### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.
### Lambda

`lambd` corresponds to the `lambda` parameter used when calculating the
Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This
can be thought of as how much the agent relies on its current value estimate
when calculating an updated value estimate. Low values correspond to relying
more on the current value estimate (which can be high bias), and high values
correspond to relying more on the actual rewards received in the environment
(which can be high variance). The parameter provides a trade-off between the
two, and the right value can lead to a more stable training process.
### Buffer Size

`buffer_size` corresponds to how many experiences (agent observations, actions
and rewards obtained) should be collected before we do any learning or updating
of the model. **This should be a multiple of `batch_size`**. Typically a larger
`buffer_size` corresponds to more stable training updates.
### Batch Size

`batch_size` is the number of experiences used for one iteration of a gradient
descent update. **This should always be a fraction of the `buffer_size`**. If
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
### Number of Epochs

`num_epoch` is the number of passes through the experience buffer during
gradient descent. The larger the `batch_size`, the larger it is acceptable to
make this. Decreasing this will ensure more stable updates, at the cost of
slower learning.
### Learning Rate

`learning_rate` corresponds to the strength of each gradient descent update
step. This should typically be decreased if training is unstable and the reward
does not consistently increase.
### Time Horizon

`time_horizon` corresponds to how many steps of experience to collect per-agent
before adding it to the experience buffer. When this limit is reached before the
end of an episode, a value estimate is used to predict the overall expected
reward from the agent's current state. As such, this parameter trades off
between a less biased, but higher variance estimate (long time horizon) and a
more biased, but less varied estimate (short time horizon). In cases where there
are frequent rewards within an episode, or episodes are prohibitively large, a
smaller number can be more ideal. This number should be large enough to capture
all the important behavior within a sequence of an agent's actions.
### Max Steps

`max_steps` corresponds to how many steps of the simulation (multiplied by
frame-skip) are run during the training process. This value should be increased
for more complex problems.
### Beta

`beta` corresponds to the strength of the entropy regularization, which makes
the policy "more random." This ensures that agents properly explore the action
space during training. Increasing this will ensure more random actions are
taken. This should be adjusted such that the entropy (measurable from
TensorBoard) slowly decreases alongside increases in reward. If entropy drops
too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.
### Epsilon

`epsilon` corresponds to the acceptable threshold of divergence between the old
and new policies during gradient descent updating. Setting this value small will
result in more stable updates, but will also slow the training process.
### Normalize

`normalize` corresponds to whether normalization is applied to the vector
observation inputs. This normalization is based on the running average and
variance of the vector observation. Normalization can be helpful in cases with
complex continuous control problems, but may be harmful with simpler discrete
control problems.
### Number of Layers

`num_layers` corresponds to how many hidden layers are present after the
observation input, or after the CNN encoding of the visual observation. For
simple problems, fewer layers are likely to train faster and more efficiently.
More layers may be necessary for more complex control problems.
### Hidden Units

`hidden_units` corresponds to how many units are in each fully connected layer
of the neural network. For simple problems where the correct action is a
straightforward combination of the observation inputs, this should be small. For
problems where the action is a very complex interaction between the observation
variables, this should be larger.
### (Optional) Recurrent Neural Network Hyperparameters

#### Sequence Length

`sequence_length` corresponds to the length of the sequences of experience
passed through the network during training. This should be long enough to
capture whatever information your agent might need to remember over time. For
example, if your agent needs to remember the velocity of objects, then this can
be a small value. If your agent needs to remember a piece of information given
only once at the beginning of an episode, then this should be a larger value.
#### Memory Size

`memory_size` corresponds to the size of the array of floating point numbers
used to store the hidden state of the recurrent neural network. This value must
be a multiple of 4, and should scale with the amount of information you expect
the agent will need to remember in order to successfully complete the task.
### (Optional) Intrinsic Curiosity Module Hyperparameters

#### Curiosity Encoding Size

`curiosity_enc_size` corresponds to the size of the hidden layer used to encode
the observations within the intrinsic curiosity module. This value should be
small enough to encourage the curiosity module to compress the original
observation, but not so small that it cannot capture the dynamics of the
environment.
#### Curiosity Strength

`curiosity_strength` corresponds to the magnitude of the intrinsic reward
generated by the intrinsic curiosity module. This should be scaled in order to
ensure it is large enough to not be overwhelmed by extrinsic reward signals in
the environment. Likewise it should not be so large that it overwhelms the
extrinsic reward signal.
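Pulling the above together, a PPO entry in `config/trainer_config.yaml` might
look roughly like the sketch below. The values are illustrative placeholders
rather than recommendations, and the `use_recurrent`/`use_curiosity` toggles are
assumptions about how the optional sections are enabled in your version of the
toolkit.

```yaml
# Illustrative PPO entry for a brain named "MyBrain"; values are placeholders.
MyBrain:
    trainer: ppo
    gamma: 0.99
    lambd: 0.95
    buffer_size: 10240
    batch_size: 1024        # a fraction of buffer_size; large for continuous actions
    num_epoch: 3
    learning_rate: 3.0e-4
    time_horizon: 64
    max_steps: 5.0e5
    beta: 5.0e-3
    epsilon: 0.2
    normalize: true
    num_layers: 2
    hidden_units: 128
    # Optional recurrent settings (only used when use_recurrent is true):
    use_recurrent: false
    sequence_length: 64
    memory_size: 256        # must be a multiple of 4
    # Optional curiosity settings (only used when use_curiosity is true):
    use_curiosity: false
    curiosity_enc_size: 128
    curiosity_strength: 0.01
```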
## Training Statistics

To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
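In a default setup this usually amounts to running `tensorboard
--logdir=summaries` from the `ml-agents` directory and opening `localhost:6006`
in a browser, though the linked guide is the authoritative reference.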
### Cumulative Reward

The general trend in reward should consistently increase over time. Small ups
and downs are to be expected. Depending on the complexity of the task, a
significant increase in reward may not present itself until millions of steps
into the training process.
### Entropy

This corresponds to how random the decisions of a Brain are. This should
consistently decrease during training. If it decreases too soon or not at all,
`beta` should be adjusted (when using discrete action space).
### Learning Rate

### Policy Loss

These values will oscillate during training. Generally they should be less than
1.0.
### Value Estimate

These values should increase as the cumulative reward increases. They correspond
to how much future reward the agent predicts itself receiving at any given
point.
### Value Loss

These values will increase as the reward increases, and then should decrease
once reward becomes stable.