
Improvements to Getting Started guide (#3774)

* Improvements to Getting Started guide

- Changed the ordered list to use "1."
- Trimmed down text
- Removed references to Agent APIs

* Incorporating feedback

* Prettier formatting
Committed by GitHub 4 years ago (commit a09850fa)

2 changed files with 404 additions and 450 deletions:
1. docs/Getting-Started.md (346 changes)
2. docs/Learning-Environment-Design-Agents.md (508 changes)

docs/Getting-Started.md


# Getting Started Guide
This guide walks through the end-to-end process of opening one of our
[example environments](Learning-Environment-Examples.md) in Unity, training an
Agent in it, and embedding the trained model into the Unity environment. After
reading this tutorial, you should be able to train any of the example
environments. If you are not familiar with the
[Unity Engine](https://unity3d.com/unity), view our
[Background: Unity](Background-Unity.md) page for helpful pointers.
Additionally, if you're not familiar with machine learning, view our
[Background: Machine Learning](Background-Machine-Learning.md) page for a brief
overview and helpful pointers.
For this guide, we'll use the **3D Balance Ball** environment which contains a
number of agent cubes and balls (which are all copies of each other). Each agent
cube tries to keep its ball from falling by rotating either horizontally or
vertically. In this environment, an agent cube is an **Agent** that receives a
reward for every step that it balances the ball. An agent is also penalized with
a negative reward for dropping the ball. The goal of the training process is to
have the agents learn to balance the ball on their head.
If you haven't already, follow the [installation instructions](Installation.md).
Afterwards, open the Unity Project that contains all the example environments:

1. On the Projects dialog, choose the **Add** option at the top of the window.
1. Using the file dialog that opens, locate the `Project` folder within the
   ML-Agents Toolkit and click **Open**.
1. In the **Project** window, go to the
   `Assets/ML-Agents/Examples/3DBall/Scenes` folder and open the `3DBall` scene
   file.

Depending on your version of Unity, it may also be necessary to change the
**Scripting Runtime Version** of your project:

1. Go to **Edit** > **Project Settings** > **Player**.
1. For **each** of the platforms you target (**PC, Mac and Linux Standalone**,
   **iOS** or **Android**), expand the **Other Settings** section and set
   **Scripting Runtime Version** to **Experimental (.NET 4.6 Equivalent or
   .NET 4.x Equivalent)**.
1. Go to **File** > **Save Project**.
_environment_. In the context of Unity, an environment is a scene containing one
or more Agent objects, and, of course, the other entities that an agent
interacts with.
![Unity Editor](images/mlagents-3DBallHierarchy.png)

window. The Inspector shows every component on a GameObject.
The first thing you may notice after opening the 3D Balance Ball scene is that
it contains not one, but several agent cubes. Each agent cube in the scene is an
independent agent, but they all share the same Behavior. 3D Balance Ball does
this to speed up training since all twelve agents contribute to training in
parallel.
### Agent

behavior:
- **Behavior Parameters** — Every Agent must have a Behavior. The Behavior
determines how an Agent makes decisions.
- **Max Step** — Defines how many simulation steps can occur before the Agent's
#### Behavior Parameters: Vector Observation Space
Before making a decision, an agent collects its observation about its state in

The Behavior Parameters of the 3D Balance Ball example uses a `Space Size` of 8.
This means that the feature vector containing the Agent's observations contains
eight elements: the `x` and `z` components of the agent cube's rotation and the
`x`, `y`, and `z` components of the ball's relative position and velocity.
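As a rough sketch only (not the shipped example code), the agent's
`CollectObservations()` callback could build such an eight-element observation
like this, assuming hypothetical `ball` and `ballRb` fields that hold the ball's
Transform and Rigidbody:

```csharp
public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(gameObject.transform.rotation.z);              // 1 value
    sensor.AddObservation(gameObject.transform.rotation.x);              // 1 value
    sensor.AddObservation(ball.transform.position - transform.position); // 3 values
    sensor.AddObservation(ballRb.velocity);                              // 3 values, 8 total
}
```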
An Agent is given instructions in the form of a float array of _actions_.
ML-Agents Toolkit classifies actions into two types: continuous and discrete.
The 3D Balance Ball example is programmed to use continuous action space which
is a vector of numbers that can vary continuously. More specifically, it uses
a `Space Size` of 2 to control the amount of `x` and `z` rotations to apply to
itself to keep the ball balanced on its head.
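As a rough sketch only (the shipped Ball3DAgent code may differ, and the
rotation scale below is made up), a two-element continuous action could be
applied in `OnActionReceived()` like this:

```csharp
public override void OnActionReceived(float[] vectorAction)
{
    // Each action value is clamped and applied as a small rotation.
    var actionZ = 2f * Mathf.Clamp(vectorAction[0], -1f, 1f);
    var actionX = 2f * Mathf.Clamp(vectorAction[1], -1f, 1f);
    transform.Rotate(new Vector3(0f, 0f, 1f), actionZ);
    transform.Rotate(new Vector3(1f, 0f, 0f), actionX);
}
```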
[Unity Inference Engine](Unity-Inference-Engine.md) to run these models inside
Unity. In this section, we will use the pre-trained model for the 3D Ball
example.
1. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Scenes` folder
and open the `3DBall` scene file.
1. In the **Project** window, go to the
`Assets/ML-Agents/Examples/3DBall/Prefabs` folder. Expand `3DBall` and click
on the `Agent` prefab. You should see the `Agent` prefab in the **Inspector**
window.
**Note**: The platforms in the `3DBall` scene were created using the `3DBall`
prefab. Instead of updating all 12 platforms individually, you can update the
`3DBall` prefab instead.
1. In the **Project** window, drag the **3DBall** Model located in
`Assets/ML-Agents/Examples/3DBall/TFModels` into the `Model` property under
`Behavior Parameters (Script)` component in the Agent GameObject
**Inspector** window.
1. You should notice that each `Agent` under each `3DBall` in the **Hierarchy**
   window now contains **3DBall** as `Model` on the `Behavior Parameters`.
   **Note**: You can modify multiple game objects in a scene by selecting them
   all at once using the search bar in the Scene Hierarchy.
1. Set the **Inference Device** for this model to `CPU`. (CPU is faster for the
   majority of ML-Agents-generated models.)
1. Click the :arrow_forward: button in the Unity Editor and you will see the
platforms balance the balls using the pre-trained model.
While we provide pre-trained `.nn` files for the agents in this environment, any
environment you make yourself will require training agents from scratch to
generate a new model file. In this section we will demonstrate how to use the
reinforcement learning algorithms that are part of the ML-Agents Python package
to accomplish this. We have provided a convenient command `mlagents-learn` which
accepts arguments used to configure both training and inference phases.

The default training algorithm is Proximal Policy Optimization (PPO), a method
that has been shown to be more general purpose and stable than many other RL
algorithms. For more information on PPO, OpenAI has a
[blog post](https://blog.openai.com/openai-baselines-ppo/) explaining it, and
see [our page](Training-PPO.md) for how to use it in training. We also provide
Soft Actor-Critic (SAC), an off-policy algorithm that has been shown to be both
stable and sample-efficient. For more information on SAC, see UC Berkeley's
[blog post](https://bair.berkeley.edu/blog/2018/12/14/sac/) and
[our page](Training-SAC.md) for guidance on when to use SAC vs. PPO. To use SAC
to train Balance Ball, replace all references to `config/trainer_config.yaml`
with `config/sac_trainer_config.yaml` below.
1. Navigate to the folder where you cloned the `ml-agents` repository. **Note**:
If you followed the default [installation](Installation.md), then you should
be able to run `mlagents-learn` from any directory.
1. Run `mlagents-learn config/trainer_config.yaml --run-id=first3DBallRun`.
- `config/trainer_config.yaml` is the path to a default training
configuration file that we provide. It includes training configurations for
all our example environments, including 3DBall.
- `run-id` is a unique name for this training session.
1. When the message _"Start training by pressing the Play button in the Unity
**Note**: If you're using Anaconda, don't forget to activate the ml-agents
environment first.
The `--time-scale=100` sets the `Time.TimeScale` value in Unity.
**Note**: You can train using an executable rather than the Editor. To do so,
follow the instructions in
[Using an Executable](Learning-Environment-Executable.md).
**Note**: Re-running this command will start training from scratch again. To resume
a previous training run, append the `--load` flag and give the same `--run-id` as the
run you want to resume.
If `mlagents-learn` runs correctly and starts training, you should see something
like this:

sequence_length: 64
summary_freq: 1000
use_recurrent: False
summary_path: ./summaries/first3DBallRun
model_path: ./models/first3DBallRun/3DBallLearning
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 1000. Mean Reward: 1.242. Std of Reward: 0.746. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 2000. Mean Reward: 1.319. Std of Reward: 0.693. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 3000. Mean Reward: 1.804. Std of Reward: 1.056. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 4000. Mean Reward: 2.151. Std of Reward: 1.432. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 5000. Mean Reward: 3.175. Std of Reward: 2.250. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 6000. Mean Reward: 4.898. Std of Reward: 4.019. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 7000. Mean Reward: 6.716. Std of Reward: 5.125. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 8000. Mean Reward: 12.124. Std of Reward: 11.929. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 9000. Mean Reward: 18.151. Std of Reward: 16.871. Training.
INFO:mlagents.trainers: first3DBallRun: 3DBallLearning: Step: 10000. Mean Reward: 27.284. Std of Reward: 28.667. Training.
Note how the `Mean Reward` value printed to the screen increases as training
progresses. This is a positive sign that training is succeeding.
### Observing Training Progress

```sh
tensorboard --logdir=summaries
```
Then navigate to `localhost:6006` in your browser to view the TensorBoard
summary statistics as shown below. For the purposes of this section, the most
important statistic is `Environment/Cumulative Reward`, which should increase
throughout training, eventually converging close to `100`, the maximum reward
the agent can accumulate.

From TensorBoard, you will see the following summary statistics:

- **Lesson** - Only interesting when performing
  [curriculum training](Training-Curriculum-Learning.md). This is not used in
  the 3D Balance Ball environment.
- **Cumulative Reward** - The mean cumulative episode reward over all agents.
  Should increase during a successful training session.
- **Entropy** - How random the decisions of the model are. Should slowly
  decrease during a successful training process. If it decreases too quickly,
  the `beta` hyperparameter should be increased.
- **Episode Length** - The mean length of each episode in the environment for
  all agents.
- **Learning Rate** - How large a step the training algorithm takes as it
  searches for the optimal policy. Should decrease over time.
- **Policy Loss** - The mean loss of the policy function update. Correlates to
  how much the policy (process for deciding actions) is changing. The magnitude
  of this should decrease during a successful training session.
- **Value Estimate** - The mean value estimate for all states visited by the
  agent. Should increase during a successful training session.
- **Value Loss** - The mean loss of the value function update. Correlates to how
  well the model is able to predict the value of each state. This should
  decrease during a successful training session.
![Example TensorBoard Run](images/mlagents-TensorBoard.png)

(denoted by the `Saved Model` message) you can add it to the Unity project and
use it with compatible Agents (the Agents that generated the model). **Note:**
Do not just close the Unity Window once the `Saved Model` message appears.
command-line prompt. If you close the window manually, the `.nn` file containing
the trained model is not exported into the ml-agents folder.
If you've quit the training early using Ctrl+C and want to resume training, run
the same command again, appending the `--resume` flag:
mlagents-learn config/trainer_config.yaml --run-id=first3DBallRun --resume
`<behavior_name>` is the name of the `Behavior Name` of the agents corresponding
to the model. This file corresponds to your model's latest checkpoint. You can
now embed this trained model into your Agents by following the steps below,
which is similar to the steps described [above](#running-a-pre-trained-model).
1. Open the Unity Editor, and select the **3DBall** scene as described above.
1. Select the **3DBall** prefab Agent object.
1. Drag the `<behavior_name>.nn` file from the Project window of the Editor to
the **Model** placeholder in the **Ball3DAgent** inspector window.
1. Press the :arrow_forward: button at the top of the Editor.
- For more information on the ML-Agents Toolkit, in addition to helpful
check out the
[Making a New Learning Environment](Learning-Environment-Create-New.md) page.
- For an overview on the more complex example environments that are provided in
this toolkit, check out the
[Example Environments](Learning-Environment-Examples.md) page.
- For more information on the various training options available, check out the
[Training ML-Agents](Training-ML-Agents.md) page.

docs/Learning-Environment-Design-Agents.md


# Agents
An agent is an entity that can observe its environment, decide on the best
course of action using those observations, and execute those actions within its
environment. Agents can be created in Unity by extending the `Agent` class. The
most important aspects of creating agents that can successfully learn are the
observations the agent collects, and the reward you assign to estimate the value
of the agent's current state toward accomplishing its tasks.
An Agent passes its observations to its Policy. The Policy then makes a decision
and passes the chosen action back to the agent. Your agent code must execute the

discover the optimal decision-making policy.
The `Policy` class abstracts out the decision making logic from the Agent itself
so that you can use the same Policy in multiple Agents. How a Policy makes its
write your own Policy. If the Agent has a `Model` file, its Policy will use the
neural network `Model` to take decisions.
When you create an Agent, you must extend the base Agent class. This includes
implementing the following methods:
- `Agent.OnEpisodeBegin()` — Called at the beginning of an Agent's episode,
including at the beginning of the simulation. The Ball3DAgent class uses this
function to reset the agent cube and ball to their starting positions. The
function randomizes the reset values so that the training generalizes to more
than a specific starting position and agent cube attitude.
- `Agent.CollectObservations(VectorSensor sensor)` — Called every simulation
step. Responsible for collecting the Agent's observations of the environment.
Since the Behavior Parameters of the Agent are set with vector observation
space with a state size of 8, the `CollectObservations(VectorSensor sensor)`
must call `VectorSensor.AddObservation()` such that vector size adds up to 8.
- `Agent.OnActionReceived()` — Called every time the Agent receives an action to
take. Receives the action chosen by the Agent. The vector action spaces result
in a small change in the agent cube's rotation at each step. The
`OnActionReceived()` method assigns a reward to the Agent; in this example, an
Agent receives a small positive reward for each step it keeps the ball on the
agent cube's head and a larger, negative reward for dropping the ball. An
Agent's episode is also ended when it drops the ball so that it will reset
with a new ball for the next simulation step.
- `Agent.Heuristic()` - When the `Behavior Type` is set to `Heuristic Only` in
the Behavior Parameters of the Agent, the Agent will use the `Heuristic()`
method to generate the actions of the Agent. As such, the `Heuristic()` method
returns an array of floats. In the case of the Ball 3D Agent, the
`Heuristic()` method converts the keyboard inputs into actions.
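To make the shape of these overrides concrete, here is a minimal, hypothetical
Agent subclass (not taken from the examples; namespaces and exact signatures can
vary between ML-Agents releases). The matching Behavior Parameters would use a
vector observation `Space Size` of 4 and a single continuous action:

```csharp
using MLAgents;
using MLAgents.Sensors;
using UnityEngine;

public class MinimalAgent : Agent
{
    Rigidbody m_Body;

    void Start()
    {
        m_Body = GetComponent<Rigidbody>();
    }

    public override void OnEpisodeBegin()
    {
        // Reset to a randomized starting state so training generalizes.
        transform.localPosition = new Vector3(Random.Range(-1f, 1f), 0.5f, 0f);
        m_Body.velocity = Vector3.zero;
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(transform.localPosition); // 3 values
        sensor.AddObservation(m_Body.velocity.x);       // 1 value
    }

    public override void OnActionReceived(float[] vectorAction)
    {
        m_Body.AddForce(Vector3.right * vectorAction[0]);
        AddReward(0.01f); // small reward for every step the agent survives
    }

    public override float[] Heuristic()
    {
        // Map keyboard input to the single continuous action.
        return new[] { Input.GetAxis("Horizontal") };
    }
}
```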
a decision. Agents will request a decision when `Agent.RequestDecision()` is
called. If you need the Agent to request decisions on its own at regular
intervals, add a `Decision Requester` component to the Agent's GameObject.
Making decisions at regular step intervals is generally most appropriate for
physics-based simulations. For example, an agent in a robotic simulator that
must provide fine-control of joint torques should make its decisions every step
of the simulation. On the other hand, an agent that only needs to make decisions
when certain game or simulation events occur, such as in a turn-based game,
should call `Agent.RequestDecision()` manually.
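As a sketch of the turn-based case (the class and method names here are
hypothetical), the game code can simply call `RequestDecision()` whenever the
agent should act:

```csharp
using MLAgents;

public class TurnBasedAgent : Agent
{
    // Called by the game's turn manager when it is this agent's turn.
    public void OnTurnStarted()
    {
        RequestDecision(); // OnActionReceived() will be invoked with the chosen action
    }
}
```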
To make informed decisions, an agent must first make observations of the state
of the environment. The observations are collected by Sensors attached to the
agent GameObject. By default, agents come with a `VectorSensor` which allows
them to collect floating-point observations into a single array. There are
additional sensor components which can be attached to the agent GameObject which
collect their own observations, or modify other observations. These are:
- `CameraSensorComponent` - Allows image from `Camera` to be used as
observation.
- `RenderTextureSensorComponent` - Allows content of `RenderTexture` to be used
as observation.
- `RayPerceptionSensorComponent` - Allows information from set of ray-casts to
be used as observation.
Vector observations are best used for aspects of the environment which are
numerical and non-visual. The Policy class calls the
`CollectObservations(VectorSensor sensor)` method of each Agent. Your
implementation of this function must call `VectorSensor.AddObservation` to add
vector observations.
information an agent needs to accomplish its task. Without sufficient and
relevant information, an agent may learn poorly or may not learn at all. A
reasonable approach for determining what information should be included is to
consider what you would need to calculate an analytical solution to the problem,
or what you would expect a human to be able to use to solve the problem.
ML-Agents SDK. For instance, the 3DBall example uses the rotation of the
platform, the relative position of the ball, and the velocity of the ball as its
state observation. As an experiment, you can remove the velocity components from
the observation and retrain the 3DBall agent. While it will learn to balance the

an agent's observations to a fixed subset. For example, instead of observing
every enemy agent in an environment, you could only observe the closest five.
When you set up an Agent's `Behavior Parameters` in the Unity Editor, set the
following properties to use a vector observation:
- **Space Size** — The state size must match the length of your feature vector.
The `VectorSensor.AddObservation` method provides a number of overloads for
adding common types of data to your observation vector. You can add Integers and
booleans directly to the observation vector, as well as some common Unity data
types such as `Vector2`, `Vector3`, and `Quaternion`.
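For illustration, a `CollectObservations()` implementation mixing several of
these overloads might look like the following (the `target`, `hasKey`, and
`ammoCount` fields are made-up examples):

```csharp
public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(transform.rotation);                    // Quaternion: 4 values
    sensor.AddObservation(target.position - transform.position);  // Vector3: 3 values
    sensor.AddObservation(hasKey);                                 // bool: 1 value
    sensor.AddObservation(ammoCount);                              // int: 1 value
}
```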
#### One-hot encoding categorical information

}
```
`VectorSensor` also provides a two-argument function `AddOneHotObservation()` as
a shortcut for _one-hot_ style observations. The following example is identical
to the previous one.
```csharp
enum CarriedItems { Sword, Shield, Bow, LastItem }

```csharp
normalizedValue = (currentValue - minValue)/(maxValue - minValue)
```
:warning: For vectors, you should apply the above formula to each component (x,
y, and z). Note that this is _not_ the same as using the `Vector3.normalized`
property or `Vector3.Normalize()` method in Unity (and similar for `Vector2`).
Rotations and angles should also be normalized. For angles between 0 and 360
degrees, you can use the following formulas:
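One way to normalize such an angle in a Unity script, as a sketch assuming it is
read from the agent's `transform.eulerAngles`:

```csharp
float angle = transform.eulerAngles.y;            // heading in [0, 360) degrees
float normalized01 = angle / 360.0f;              // mapped to [0, 1]
float normalizedSigned = (angle / 180.0f) - 1.0f; // mapped to [-1, 1]
```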

#### Vector Observation Summary & Best Practices
- Vector Observations should include all variables relevant for allowing the
agent to take the optimally informed decision, and ideally no extraneous
information.
- In cases where Vector Observations need to be remembered or compared over
time, either an LSTM (see [here](Feature-Memory.md)) should be used in the
model, or the `Stacked Vectors` value in the agent GameObject's
`Behavior Parameters` should be changed.
- Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done
automatically using the `AddOneHotObservation()` method of the `VectorSensor`.
- In general, all inputs should be normalized to be in the range 0 to +1 (or -1
to 1). For example, the `x` position information of an agent where the maximum
possible value is `maxValue` should be recorded as
- Positional information of relevant GameObjects should be encoded in relative
Visual observations are generally provided to an agent via either a
`CameraSensor` or a `RenderTextureSensor`. These collect image information and transform it into
a 3D Tensor which can be fed into the convolutional neural network (CNN) of the
agent policy. For more information on CNNs, see
[this guide](http://cs231n.github.io/convolutional-networks/). This allows
agents to learn from spatial regularities in the observation images. It is
possible to use visual and vector observations with the same agent.
used when it is not possible to properly define the problem using vector or
ray-cast observations.
Visual observations can be derived from Cameras or RenderTextures within your
scene. To add a visual observation to an Agent, add either a Camera Sensor
Component or RenderTextures Sensor Component to the Agent. Then drag the camera
or render texture you want to add to the `Camera` or `RenderTexture` field. You
can have more than one camera or render texture and even use a combination of
both attached to an Agent. For each visual observation, set the width and height
of the image (in pixels) and whether or not the observation is color or
grayscale.
![Agent Camera](images/visual-observation.png)

Each Agent that uses the same Policy must have the same number of visual
observations, and they must all have the same resolutions (including whether or
not they are grayscale). Additionally, each Sensor Component on an Agent must
have a unique name so that they can be sorted deterministically (the name must
be unique for that Agent, but multiple Agents can have a Sensor Component with
the same name).
adding a `Canvas`, then adding a `Raw Image` with its texture set to the
Agent's `RenderTexture`. This will render the agent observation on the game
screen.
The [GridWorld environment](Learning-Environment-Examples.md#gridworld) is an
example on how to use a RenderTexture for both debugging and observation. Note
that in this example, a Camera is rendered to a RenderTexture, which is then
used for observations and debugging. To update the RenderTexture, the Camera
must be asked to render every time a decision is requested within the game code.
When using Cameras as observations directly, this is done automatically by the
Agent.
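A minimal sketch of forcing such a camera to render (the `renderCamera` field
and the call site are assumptions; GridWorld's actual code is organized
differently):

```csharp
using UnityEngine;

public class RenderBeforeDecision : MonoBehaviour
{
    public Camera renderCamera; // camera that targets the observation RenderTexture

    // Call this from game code just before the agent requests a decision.
    public void RenderObservation()
    {
        if (renderCamera != null)
        {
            renderCamera.Render();
        }
    }
}
```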
- To collect visual observations, attach `CameraSensor` or `RenderTextureSensor`
- Visual observations should generally only be used when vector or ray-cast
  observations are not sufficient.
- Image size should be kept as small as possible, without the loss of needed
details for decision making.
- Images should be made greyscale in situations where color information is not
needed for making informed decisions.
This can be easily implemented by adding a `RayPerceptionSensorComponent3D` (or
`RayPerceptionSensorComponent2D`) to the Agent GameObject.
During observations, several rays (or spheres, depending on settings) are cast
into the physics world, and the objects that are hit determine the observation
vector that is produced.
- _Detectable Tags_ A list of strings corresponding to the types of objects that
the Agent should be able to distinguish between. For example, in the WallJump
example, we use "wall", "goal", and "block" as the list of objects to detect.
- _Rays Per Direction_ Determines the number of rays that are cast. One ray is
- _Max Ray Degrees_ The angle (in degrees) for the outermost rays. 90 degrees
- _Sphere Cast Radius_ The size of the sphere used for sphere casting. If set to
0, rays will be used instead of spheres. Rays may be more efficient,
- _Ray Length_ The length of the casts
- _Observation Stacks_ The number of previous results to "stack" with the cast
results. Note that this can be independent of the "Stacked Vectors" setting in
`Behavior Parameters`.
- _Start Vertical Offset_ (3D only) The vertical offset of the ray start point.
- _End Vertical Offset_ (3D only) The vertical offset of the ray end point.
Both use 3 Rays Per Direction and 90 Max Ray Degrees. One of the components had
a vertical offset, so the Agent can tell whether it's clear to jump over the
wall.
so the number of rays and tags should be kept as small as possible to reduce the
amount of data used. Note that this is separate from the State Size defined in
`Behavior Parameters`, so you don't need to worry about the formula above when

- Attach `RayPerceptionSensorComponent3D` or `RayPerceptionSensorComponent2D` to
use.
- This observation type is best used when there is relevant spatial information
- Use as few rays and tags as necessary to solve the problem in order to improve
learning stability and agent performance.
agent's `OnActionReceived()` function. Actions for an agent can take one of two
forms, either **Continuous** or **Discrete**.
When you specify that the vector action space is **Continuous**, the action
parameter passed to the Agent is an array of floating point numbers with length
equal to the `Vector Action Space Size` property. When you specify a
**Discrete** vector action space type, the action parameter is an array
containing integers. Each integer is an index into a list or table of commands.
In the **Discrete** vector action space type, the action parameter is an array
of indices. The number of indices in the array is determined by the number of
branches defined in the `Branches Size` property. Each branch corresponds to an
action table; you can specify the size of each table by modifying the `Branches`
property.
Neither the Policy nor the training algorithm know anything about what the
action values themselves mean. The training algorithm simply tries different
values for the action list and observes the effect on the accumulated rewards
over time and many training episodes. Thus, the only place actions are defined
for an Agent is in the `OnActionReceived()` function.
For example, if you designed an agent to move in two dimensions, you could use
either continuous or the discrete vector actions. In the continuous case, you

with values ranging from zero to one.
Note that when you are programming actions for an agent, it is often helpful to
test your action logic using the `Heuristic()` method of the Agent, which lets
you map keyboard commands to actions.
The [3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) and
[Area](Learning-Environment-Examples.md#push-block) example environments are set

When an Agent uses a Policy set to the **Continuous** vector action space, the
action parameter passed to the Agent's `OnActionReceived()` function is an array
with length equal to the `Vector Action Space Size` property value. The
individual values in the array have whatever meanings that you ascribe to them.
If you assign an element in the array as the speed of an Agent, for example, the
training process learns to control the speed of the Agent through this
parameter.
The [Reacher example](Learning-Environment-Examples.md#reacher) defines a
continuous action space with four control values.

### Discrete Action Space
When an Agent uses a **Discrete** vector action space, the action parameter
passed to the Agent's `OnActionReceived()` function is an array containing
indices. With the discrete vector action space, `Branches` is an array of
integers; each value corresponds to the number of possibilities for each branch.
agent be able to move **and** jump concurrently. We define the first branch to
have 5 possible actions (don't move, go left, go right, go backward, go forward)
and the second one to have 2 possible actions (don't jump, jump). The
`OnActionReceived()` method would look something like:
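A sketch of what such a handler could look like (the movement distance and jump
force below are made-up values, not the example's actual code):

```csharp
public override void OnActionReceived(float[] act)
{
    // Branch 0: 5 movement actions; branch 1: 2 jump actions.
    int movement = Mathf.FloorToInt(act[0]);
    int jump = Mathf.FloorToInt(act[1]);

    Vector3 direction = Vector3.zero;
    if (movement == 1) direction = Vector3.left;
    if (movement == 2) direction = Vector3.right;
    if (movement == 3) direction = Vector3.back;
    if (movement == 4) direction = Vector3.forward;
    transform.position += direction * 0.1f;

    if (jump == 1)
    {
        GetComponent<Rigidbody>().AddForce(Vector3.up * 5f, ForceMode.Impulse);
    }
}
```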

#### Masking Discrete Actions
When using Discrete Actions, it is possible to specify that some actions are
impossible for the next decision. When the Agent is controlled by a neural
network, the Agent will be unable to perform the specified action. Note that
when the Agent is controlled by its Heuristic, the Agent will still be able to
decide to perform the masked action. In order to mask an action, override the
`Agent.CollectDiscreteActionMasks()` virtual method, and call
`DiscreteActionMasker.SetMask()` in it:
```csharp
public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker){

Where:
- `branch` is the index (starting at 0) of the branch on which you want to mask
- `actionIndices` is a list of `int` corresponding to the indices of the actions
that the Agent cannot perform.
For example, if you have an Agent with 2 branches and on the first branch
(branch 0) there are 4 possible actions : _"do nothing"_, _"jump"_, _"shoot"_

Notes:
- You can call `SetMask` multiple times if you want to put masks on multiple
branches.
- You cannot mask all the actions of a branch.
- You cannot mask actions in continuous control.
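Putting these rules together, a sketch of masking a "jump" action (index 1 on
branch 0) while the agent is airborne could look like this (`isGrounded` is a
made-up field):

```csharp
public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
{
    if (!isGrounded)
    {
        // The policy will never select action 1 of branch 0 for this decision.
        actionMasker.SetMask(0, new int[] { 1 });
    }
}
```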
### Actions Summary & Best Practices
- Actions can either use `Discrete` or `Continuous` spaces.
- When using `Discrete` it is possible to assign multiple action branches, and
to mask certain actions.
- In general, smaller action spaces will make for easier learning.
- Be sure to set the Vector Action's Space Size to the number of used Vector
- When using continuous control, action values should be clipped to an
## Rewards

reward over time. The better your reward mechanism, the better your agent will
learn.
**Note:** Rewards are not used during inference by an Agent using a trained
model, and are also not used during imitation learning.
the desired results. You can even use the Agent's Heuristic to control the Agent
while watching how it accumulates rewards.
Allocate rewards to an Agent by calling the `AddReward()` or `SetReward()`
methods on the agent. The reward assigned between each decision should be in the
range [-1,1]. Values outside this range can lead to unstable training. The
`reward` value is reset to zero when the agent receives a new decision. If there
are multiple calls to `AddReward()` for a single agent decision, the rewards
will be summed together to evaluate how good the previous decision was. The
`SetReward()` will override all previous rewards given to an agent since the
previous decision.
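As an illustrative sketch (the thresholds and reward values are made up), these
calls typically live in `OnActionReceived()` alongside the action logic:

```csharp
public override void OnActionReceived(float[] vectorAction)
{
    // ... apply the action to the agent ...

    AddReward(-0.001f); // small time penalty for every decision

    if (reachedGoal) // hypothetical flag set elsewhere in the agent
    {
        SetReward(1.0f); // overrides any reward accumulated this decision
        EndEpisode();
    }
    else if (transform.localPosition.y < -1.0f)
    {
        AddReward(-1.0f); // fell off the platform
        EndEpisode();
    }
}
```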
You can examine the `OnActionReceived()` functions defined in the
[example environments](Learning-Environment-Examples.md) to see how those
projects allocate rewards.
The `GridAgent` class in the
[GridWorld example](Learning-Environment-Examples.md#gridworld) uses a very
simple reward system:
```csharp
Collider[] hitObjects = Physics.OverlapBox(trueAgent.transform.position,

example of a _sparse_ reward system. The agent must explore a lot to find the
infrequent reward.
In contrast, the `AreaAgent` in the
[Area example](Learning-Environment-Examples.md#push-block) gets a small
negative reward every step. In order to get the maximum reward, the agent must
finish its task of reaching the goal square as quickly as possible:
```csharp
AddReward( -0.005f);

The `Ball3DAgent` also assigns a negative penalty when the ball falls off the
platform.
Note that all of these environments make use of the `EndEpisode()` method, which
manually terminates an episode when a termination condition is reached. This can
be called independently of the `Max Step` property.
- Use `AddReward()` to accumulate rewards between decisions. Use `SetReward()`
- The magnitude of any given reward should typically not be greater than 1.0 in
- Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards. Excessive negative rewards can result in the
agent failing to learn any meaningful behavior.
- For locomotion tasks, a small positive reward (+0.1) for forward velocity is
- If you want the agent to finish a task quickly, it is often helpful to provide
episode by calling `EndEpisode()` on the agent when it has accomplished its
goal.
- `Behavior Parameters` - The parameters dictating what Policy the Agent will
receive.
- `Behavior Name` - The identifier for the behavior. Agents with the same
behavior name will learn the same policy. If you're using
[curriculum learning](Training-Curriculum-Learning.md), this is used as the
top-level key in the config.
- `Vector Observation`
- `Space Size` - Length of vector observation for the Agent.
- `Stacked Vectors` - The number of previous vector observations that will
- `Vector Action`
- `Space Type` - Corresponds to whether action vector contains a single
- `Space Size` (Continuous) - Length of action vector.
- `Branches` (Discrete) - An array of integers, defines multiple concurrent
- `Model` - The neural network model used for inference (obtained after
training)
- `Inference Device` - Whether to use CPU or GPU to run the model during
inference
- `Behavior Type` - Determines whether the Agent will do training, inference,
or use its Heuristic() method:
- `Default` - the Agent will train if they connect to a python trainer,
otherwise they will perform inference.
- `Heuristic Only` - the Agent will always use the `Heuristic()` method.
- `Inference Only` - the Agent will always perform inference.
- `Team ID` - Used to define the team for [self-play](Training-Self-Play.md)
- `Use Child Sensors` - Whether to use all Sensor components attached to child
GameObjects of this Agent.
- `Max Step` - The per-agent maximum number of steps. Once this number is
reached, the Agent will be reset.
## Monitoring Agents

## Destroying an Agent
You can destroy an Agent GameObject during the simulation. Make sure that there
is always at least one Agent training at all times by either spawning a new
Agent every time one is destroyed or by re-spawning new Agents when the whole
environment resets.
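A sketch of one way to do this from an environment-manager script (the
`agentPrefab` and `spawnPoint` fields are assumptions):

```csharp
using MLAgents;
using UnityEngine;

public class AgentSpawner : MonoBehaviour
{
    public GameObject agentPrefab;
    public Transform spawnPoint;

    // Destroy a finished agent and immediately spawn a replacement so that
    // at least one agent keeps training.
    public void ReplaceAgent(Agent finished)
    {
        Destroy(finished.gameObject);
        Instantiate(agentPrefab, spawnPoint.position, Quaternion.identity);
    }
}
```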