
Combine "Best Practices" and "Agents" documentation (#3643)

* Merge agent & best practices doc. Plus other fixes

* Fix overly long lines

* Address typos and comments

* Address feedback
/bug-failed-api-check
GitHub · 4 years ago
Current commit a6ade9b2
6 files changed, 153 insertions and 162 deletions
  1. docs/Getting-Started-with-Balance-Ball.md (22 changes)
  2. docs/Learning-Environment-Create-New.md (20 changes)
  3. docs/Learning-Environment-Design-Agents.md (189 changes)
  4. docs/Learning-Environment-Examples.md (22 changes)
  5. docs/Readme.md (1 change)
  6. docs/Learning-Environment-Best-Practices.md (61 changes)

docs/Getting-Started-with-Balance-Ball.md (22 changes)


When you create an Agent, you must extend the base Agent class.
The Ball3DAgent subclass defines the following methods:
* `Agent.OnEpisodeBegin()` — Called at the beginning of an Agent's episode, including at the beginning
of the simulation. The Ball3DAgent class uses this function to reset the
agent cube and ball to their starting positions. The function randomizes the reset values so that the
training generalizes to more than a specific starting position and agent cube
attitude.
* `Agent.CollectObservations(VectorSensor sensor)` — Called every simulation step. Responsible for
collecting the Agent's observations of the environment (see the sketch below).
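
The snippet below is a simplified sketch of how these two methods fit together; it is
not the shipped Ball3DAgent source, and it assumes the `Unity.MLAgents` namespaces
used by recent ML-Agents releases, with the `ball` field assigned in the Inspector.

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class SimpleBalanceAgent : Agent
{
    public GameObject ball;   // assigned in the Inspector
    Rigidbody ballRigidbody;

    // Called once when the Agent is set up (name may differ in older releases).
    public override void Initialize()
    {
        ballRigidbody = ball.GetComponent<Rigidbody>();
    }

    // Reset the agent cube and ball, randomizing the starting attitude.
    public override void OnEpisodeBegin()
    {
        transform.rotation = Quaternion.Euler(
            Random.Range(-10f, 10f), 0f, Random.Range(-10f, 10f));
        ballRigidbody.velocity = Vector3.zero;
        ball.transform.position = transform.position + new Vector3(0f, 4f, 0f);
    }

    // Collect the observations the Policy uses to make decisions.
    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(transform.rotation.z);
        sensor.AddObservation(transform.rotation.x);
        sensor.AddObservation(ball.transform.position - transform.position);
        sensor.AddObservation(ballRigidbody.velocity);
    }
}
```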

From TensorBoard, you will see the summary statistics:
* **Lesson** - only interesting when performing [curriculum
training](Training-Curriculum-Learning.md).
* **Cumulative Reward** - The mean cumulative episode reward over all agents. Should
increase during a successful training session.
* **Entropy** - How random the decisions of the model are. Should slowly decrease
during a successful training process.
* **Episode Length** - The mean length of each episode in the environment for all
agents.
* **Learning Rate** - How large a step the training algorithm takes as it searches
for the optimal policy. Should decrease over time.
* **Policy Loss** - The mean loss of the policy function update. Correlates to how
much the policy (process for deciding actions) is changing. This should decrease
during a successful training session.
* **Value Estimate** - The mean value estimate for all states visited by the agent.
Should increase during a successful training session.
* **Value Loss** - The mean loss of the value function update. Correlates to how
well the model is able to predict the value of each state. This should
decrease during a successful training session.

docs/Learning-Environment-Create-New.md (20 changes)


# Making a New Learning Environment
This tutorial walks through the process of creating a Unity Environment from scratch.
We recommend first reading the [Getting Started](Getting-Started-with-Balance-Ball.md)
guide to understand the concepts presented here in an already-built environment.

In this example, we will create an agent capable of controlling a ball on a platform.
We will then train the agent to roll the ball toward the cube while avoiding falling
off the platform.

## Overview

This is only one way to achieve this objective. Refer to the
[example environments](Learning-Environment-Examples.md) for other ways we can achieve relative positioning.
## Review: Scene Layout
This section briefly reviews how to organize your scene when using Agents in
your Unity environment.
There are two kinds of game objects you need to include in your scene in order
to use Unity ML-Agents: an Academy and one or more Agents.
Keep in mind:
* If you are using multiple training areas, make sure all the Agents have the same `Behavior Name`
and `Behavior Parameters`

docs/Learning-Environment-Design-Agents.md (189 changes)


# Agents
An agent is an entity that can observe its environment, decide on the best
course of action using those observations, and execute those actions within
its environment. Agents can be created in Unity by extending
the `Agent` class. The most important aspects of creating agents that can
successfully learn are the observations the agent collects,
and the reward you assign to estimate the value of the
agent's current state toward accomplishing its tasks.

An Agent passes its observations to its Policy. The Policy then makes a decision
and passes the chosen action back to the agent. Your agent code must execute the
action, for example, move the agent in one direction or another. In order to
[train an agent using reinforcement learning](Learning-Environment-Design.md),

The `Policy` class abstracts out the decision making logic from the Agent itself, so
how an Agent makes its decisions depends on the `Behavior Parameters` associated
with the agent. If you set `Behavior Type` to `Heuristic Only`, the Agent will use
its `Heuristic()` method to make decisions, which allows you to control the Agent
manually or write your own Policy. If the Agent has a `Model` file, its Policy will
use the neural network `Model` to make decisions.
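
As a rough illustration of the `Heuristic Only` mode described above, the sketch below
writes keyboard input into the action array. The exact `Heuristic()` signature varies
between ML-Agents versions, so treat the snippet as an assumption rather than a
definitive API reference.

```csharp
using Unity.MLAgents;
using UnityEngine;

public class ManuallyControlledAgent : Agent
{
    // Used when Behavior Type is set to Heuristic Only: writes keyboard input
    // into the action array that would otherwise come from the neural network.
    public override void Heuristic(float[] actionsOut)
    {
        actionsOut[0] = Input.GetAxis("Horizontal"); // continuous action 0
        actionsOut[1] = Input.GetAxis("Vertical");   // continuous action 1
    }
}
```
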
## Decisions

the Agent to request decisions on its own at regular intervals, add a
`Decision Requester` component to the Agent's GameObject. Making decisions at regular step
intervals is generally most appropriate for physics-based simulations. Agents that only need
to make decisions when certain game or simulation events
occur, such as in a turn-based game, should call `Agent.RequestDecision()` manually.
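
For the event-driven case, a minimal sketch might look like the following. `OnTurnStarted()`
is a hypothetical hook called by your own turn manager; it is not part of the ML-Agents API.

```csharp
using Unity.MLAgents;

public class TurnBasedAgent : Agent
{
    // Hypothetical hook invoked by a game-specific turn manager when it
    // becomes this agent's turn (not an ML-Agents callback).
    public void OnTurnStarted()
    {
        // Request a decision only when the game event occurs, instead of
        // attaching a Decision Requester component.
        RequestDecision();
    }
}
```
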
## Observations and Sensors

To make informed decisions, an agent must first make observations of the state of
the environment. The observations are collected by Sensors attached to the agent
GameObject. By default, agents come with a `VectorSensor` which allows them to
collect floating-point observations into a single array. Additional sensor
components can be attached to the agent GameObject to collect their own
observations, or to modify other observations. These are:

* `CameraSensorComponent` - Allows the image from a `Camera` to be used as an observation.
* `RenderTextureSensorComponent` - Allows the content of a `RenderTexture` to be used as an observation.
* `RayPerceptionSensorComponent` - Allows information from a set of ray-casts to be used as an observation.

When you use vector observations for an Agent, implement the
`Agent.CollectObservations(VectorSensor sensor)` method to create the feature vector. When you use
**Visual Observations**, you only need to identify which Unity Camera objects
or RenderTextures will provide images and the base Agent class handles the rest.
You do not need to implement the `CollectObservations(VectorSensor sensor)` method when your Agent
uses visual observations (unless it also uses vector observations).

### Vector Observation Space: Feature Vectors

Vector observations are best used for aspects of the environment which are numerical
and non-visual. For agents using a continuous state space, you create a feature vector to
represent the agent's observation at each step of the simulation. The Policy
class calls the `CollectObservations(VectorSensor sensor)` method of each Agent. Your
implementation of this function must call `VectorSensor.AddObservation` to add vector
observations.
In order for an agent to learn, the observations should include all the
information an agent needs to accomplish its task. Without sufficient and relevant
information, an agent may learn poorly, or may not learn at all. A good approach
is to consider what information you would need to calculate an analytical
solution to the problem, or what you would expect a human to be able to use to solve the problem.
For examples of various state observation functions, you can look at the
[example environments](Learning-Environment-Examples.md) included in the

every enemy agent in an environment, you could only observe the closest five.
When you set up an Agent's `Behavior Parameters` in the Unity Editor, set the following
properties to use a vector observation:
* **Space Size** — The state size must match the length of your feature vector.

of data to your observation vector. You can add Integers and booleans directly to
the observation vector, as well as some common Unity data types such as `Vector2`,
`Vector3`, and `Quaternion`.
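
As a hedged sketch of these overloads (the `hitPoints` and `hasKey` fields are
illustrative, not taken from the example environments):

```csharp
public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(hitPoints);                // int, added as a single float
    sensor.AddObservation(hasKey);                   // bool, added as 0 or 1
    sensor.AddObservation(transform.localPosition);  // Vector3, added as 3 floats
    sensor.AddObservation(transform.localRotation);  // Quaternion, added as 4 floats
}
```
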
#### One-hot encoding categorical information

Type enumerations should be encoded in the _one-hot_ style. That is, add an
element to the feature vector for each element of enumeration, setting the
element corresponding to the observed member to one and the rest to zero.

}
```
`VectorSensor` also provides a two-argument function `AddOneHotObservation()` as a shortcut for _one-hot_
style observations. The following example is identical to the previous one.
```csharp
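// Sketch of the shortcut; assumes an ItemType enum and a currentItem field
// as in the preceding one-hot example.
enum ItemType { Sword, Shield, Bow, LastItem }

public override void CollectObservations(VectorSensor sensor)
{
    // Adds a one-hot vector of length LastItem with a 1 at index currentItem.
    sensor.AddOneHotObservation((int)currentItem, (int)ItemType.LastItem);
}
```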

angle, or, if the number of turns is significant, increase the maximum value
used in your normalization formula.
#### Vector Observation Summary & Best Practices
* Vector Observations should include all variables relevant for allowing the
agent to take the optimally informed decision, and ideally no extraneous information.
* In cases where Vector Observations need to be remembered or compared over
time, either an LSTM (see [here](Feature-Memory.md)) should be used in the model, or the
`Stacked Vectors` value in the agent GameObject's `Behavior Parameters` should be changed.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done automatically using the
`AddOneHotObservation()` method of the `VectorSensor`.
* In general, all inputs should be normalized to be in
the range 0 to +1 (or -1 to 1). For example, the `x` position information of
an agent where the maximum possible value is `maxValue` should be recorded as
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than
`VectorSensor.AddObservation(transform.position.x);`. See the sketch after this list.
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
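
A minimal sketch of the normalization advice above; the `Normalize` helper and the
range fields (`platformHalfWidth`, `maxValue`) are illustrative, not part of the
ML-Agents API.

```csharp
// General form of normalization: (value - min) / (max - min) maps a value
// from [minValue, maxValue] into [0, 1].
static float Normalize(float value, float minValue, float maxValue)
{
    return (value - minValue) / (maxValue - minValue);
}

public override void CollectObservations(VectorSensor sensor)
{
    // Illustrative ranges; use the actual bounds of your environment.
    sensor.AddObservation(Normalize(transform.position.x, -platformHalfWidth, platformHalfWidth));
    sensor.AddObservation(transform.position.y / maxValue); // simpler form when the minimum is 0
}
```
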
### Visual Observations
Visual observations are generally provided to an agent via either a `CameraSensor` or `RenderTextureSensor`.
These collect image information and transform it into a 3D Tensor which
can be fed into the convolutional neural network (CNN) of the agent policy. For more information on
CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). This allows agents
to learn from spatial regularities in the observation images. It is possible to
use visual and vector observations with the same agent.

Visual observations are typically less efficient and slower to train, and sometimes don't
succeed at all as compared to vector observations. As such, they should only be
used when it is not possible to properly define the problem using vector or ray-cast observations.
Visual observations can be derived from Cameras or RenderTextures within your scene.
To add a visual observation to an Agent, add either a Camera Sensor Component
or a Render Texture Sensor Component to the Agent GameObject.

![Agent RenderTexture Debug](images/gridworld.png)
#### Visual Observation Summary & Best Practices
* To collect visual observations, attach `CameraSensor` or `RenderTextureSensor`
components to the agent GameObject.
* Visual observations should generally only be used when vector observations are not sufficient.
* Image size should be kept as small as possible, without the loss of
needed details for decision making.
* Images should be made greyscale in situations where color information is
not needed for making informed decisions.
Raycasts are another possible method for providing observations to an agent.
This can be easily implemented by adding a
`RayPerceptionSensorComponent3D` (or `RayPerceptionSensorComponent2D`) to the Agent GameObject.
During observations, several rays (or spheres, depending on settings) are cast into
the physics world, and the objects that are hit determine the observation vector that
is produced.

* _Start Vertical Offset_ (3D only) The vertical offset of the ray start point.
* _End Vertical Offset_ (3D only) The vertical offset of the ray end point.
In the example image above, the Agent has two `RayPerceptionSensorComponent3D`s.
Both use 3 Rays Per Direction and 90 Max Ray Degrees. One of the components
had a vertical offset, so the Agent can tell whether it's clear to jump over
the wall.

`Behavior Parameters`, so you don't need to worry about the formula above when
setting the State Size.
#### RayCast Observation Summary & Best Practices
* Attach `RayPerceptionSensorComponent3D` or `RayPerceptionSensorComponent2D` to use.
* This observation type is best used when there is relevant spatial information
for the agent that doesn't require a fully rendered image to convey.
* Use as few rays and tags as necessary to solve the problem in order to improve learning stability and agent performance.
## Actions
agent's `OnActionReceived()` function. Actions for an agent can take one of two forms, either **Continuous** or **Discrete**.

When you specify that the vector action space is **Continuous**, the action parameter
passed to the Agent is an array of
floating point numbers with length equal to the `Vector Action Space Size` property.
In the **Discrete** vector action space type, the action parameter
is an array of integers. Each integer is an index into a list or table of commands.

When defining the discrete vector action space, `Branches` is an
array of integers, and each value corresponds to the number of possibilities for
each branch.
For example, if we wanted an Agent that can move in a plane and jump, we could
define two branches (one for motion and one for jumping) because we want our
agent to be able to move __and__ jump concurrently. We define the first branch to
have 5 possible actions (don't move, go left, go right, go backward, go forward)
and the second branch to have 2 possible actions (don't jump, jump).
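
A sketch of how `OnActionReceived()` might handle these two branches. It assumes the
float-array form of the callback used around this version of ML-Agents;
`MoveInDirection()` and `Jump()` are illustrative helpers, not ML-Agents APIs.

```csharp
public override void OnActionReceived(float[] vectorAction)
{
    int moveAction = (int)vectorAction[0]; // branch 0: 5 options
    int jumpAction = (int)vectorAction[1]; // branch 1: 2 options

    switch (moveAction)
    {
        case 1: MoveInDirection(Vector3.left); break;
        case 2: MoveInDirection(Vector3.right); break;
        case 3: MoveInDirection(Vector3.back); break;
        case 4: MoveInDirection(Vector3.forward); break;
        // case 0: don't move
    }

    if (jumpAction == 1)
    {
        Jump();
    }
}
```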

neural network, the Agent will be unable to perform the specified action. Note
that when the Agent is controlled by its Heuristic, the Agent will
still be able to decide to perform the masked action. In order to mask an
action, override the `Agent.CollectDiscreteActionMasks()` virtual method,
and call `DiscreteActionMasker.SetMask()` in it:
```csharp
public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker){
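    // Sketch of masking; the branch index and action indices are illustrative.
    // This prevents the Policy from choosing actions 1 and 2 of branch 0 at
    // the next decision.
    actionMasker.SetMask(0, new int[2]{ 1, 2 });
}
```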

* You cannot mask all the actions of a branch.
* You cannot mask actions in continuous control.
### Actions Summary & Best Practices
* Actions can either use `Discrete` or `Continuous` spaces.
* When using `Discrete` it is possible to assign multiple action branches, and to mask certain actions.
* In general, smaller action spaces will make for easier learning.
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.
* When using continuous control, action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
between -1 and 1, but third party training systems may not do so. See the sketch below.
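
A minimal sketch of clipping continuous actions before applying them; the
`agentRigidbody` and `moveSpeed` fields and the action layout are illustrative.

```csharp
public override void OnActionReceived(float[] vectorAction)
{
    // Clip each continuous action to [-1, 1] before using it.
    float forceX = Mathf.Clamp(vectorAction[0], -1f, 1f);
    float forceZ = Mathf.Clamp(vectorAction[1], -1f, 1f);
    agentRigidbody.AddForce(new Vector3(forceX, 0f, forceZ) * moveSpeed);
}
```
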
## Rewards
In reinforcement learning, the reward is a signal that the agent has done

Perhaps the best advice is to start simple and only add complexity as needed. In
general, you should reward results rather than actions you think will lead to
the desired results. You can even use the
Agent's `Heuristic()` method to control the Agent while you watch how it accumulates rewards.
Allocate rewards to an Agent by calling the `AddReward()` or `SetReward()` methods on the agent.
The reward assigned between each decision should be in proportion to how good you
think that decision was. `SetReward()` will override all
previous rewards given to an agent since the previous decision.
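
For example, a collision handler might assign rewards like this (the tags and
reward values are illustrative):

```csharp
void OnCollisionEnter(Collision collision)
{
    if (collision.gameObject.CompareTag("goal"))
    {
        SetReward(1.0f);  // overrides any reward accumulated since the last decision
        EndEpisode();     // completing the task also ends the episode
    }
    else if (collision.gameObject.CompareTag("obstacle"))
    {
        AddReward(-0.1f); // small penalty, accumulated with other rewards
    }
}
```
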
### Examples

Note that all of these environments make use of the `EndEpisode()` method, which manually
terminates an episode when a termination condition is reached. This can be
called independently of the `Max Step` property.
### Rewards Summary & Best Practices
* Use `AddReward()` to accumulate rewards between decisions. Use `SetReward()`
to overwrite any previous rewards accumulated between decisions.
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards. Excessive negative rewards can result in the agent
failing to learn any meaningful behavior.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to provide
a small penalty every step (-0.05) that the agent does not complete the task.
In this case completion of the task should also coincide with the end of the
episode by calling `EndEpisode()` on the agent when it has accomplished its goal.
## Agent Properties

docs/Learning-Environment-Examples.md (22 changes)


* Float Properties: None
* Benchmark Mean Reward: 0.93
## 3DBall: 3D Balance Ball
![3D Balance Ball](images/balance.png)

* Recommended Maximum: 20
* Benchmark Mean Reward: 100
## GridWorld
![GridWorld](images/gridworld.png)

number of goals.
* Benchmark Mean Reward: 0.8
## Tennis
![Tennis](images/tennis.png)

* Recommended Minimum: 0.2
* Recommended Maximum: 5
## Push Block
![Push](images/push.png)

* Recommended Maximum: 2000
* Benchmark Mean Reward: 4.5
## Wall Jump
![Wall](images/wall.png)

* Float Properties: Four
* Benchmark Mean Reward (Big & Small Wall): 0.8
## Reacher
![Reacher](images/reacher.png)

* Recommended Maximum: 3
* Benchmark Mean Reward: 30
## Crawler
![Crawler](images/crawler.png)

* Benchmark Mean Reward for `CrawlerStaticTarget`: 2000
* Benchmark Mean Reward for `CrawlerDynamicTarget`: 400
## Food Collector
![Collector](images/foodCollector.png)

* Recommended Maximum: 5
* Benchmark Mean Reward: 10
## Hallway
![Hallway](images/hallway.png)

* Benchmark Mean Reward: 0.7
* To speed up training, you can enable curiosity by adding the `curiosity` reward signal in `config/trainer_config.yaml`
## Bouncer
![Bouncer](images/bouncer.png)

* Recommended Maximum: 250
* Benchmark Mean Reward: 10
## Soccer Twos
![SoccerTwos](images/soccer.png)

docs/Readme.md (1 change)


* [Making a New Learning Environment](Learning-Environment-Create-New.md)
* [Designing a Learning Environment](Learning-Environment-Design.md)
* [Designing Agents](Learning-Environment-Design-Agents.md)
* [Learning Environment Best Practices](Learning-Environment-Best-Practices.md)
### Advanced Usage
* [Using the Monitor](Feature-Monitor.md)

docs/Learning-Environment-Best-Practices.md (61 changes)


# Environment Design Best Practices
## General
* It is often helpful to start with the simplest version of the problem, to
ensure the agent can learn it. From there, increase complexity over time. This
can either be done manually, or via Curriculum Learning, where a set of
lessons which progressively increase in difficulty are presented to the agent
([learn more here](Training-Curriculum-Learning.md)).
* When possible, it is often helpful to ensure that you can complete the task by
using a heuristic to control the agent. To do so, set the `Behavior Type`
to `Heuristic Only` on the Agent's Behavior Parameters, and implement the
`Heuristic()` method on the Agent.
* It is often helpful to make many copies of the agent, and give them the same
`Behavior Name`. In this way the learning process can get more feedback
information from all of these agents, which helps it train faster.
## Rewards
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to provide
a small penalty every step (-0.05) that the agent does not complete the task.
In this case completion of the task should also coincide with the end of the
episode.
* Overly-large negative rewards can cause undesirable behavior where an agent
learns to avoid any behavior which might produce the negative reward, even if
it is also behavior which can eventually lead to a positive reward.
## Vector Observations
* Vector Observations should include all variables relevant to allowing the
agent to take the optimally informed decision.
* In cases where Vector Observations need to be remembered or compared over
time, increase the `Stacked Vectors` value to allow the agent to keep track of
multiple observations into the past.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`).
* Besides encoding non-numeric values, all inputs should be normalized to be in
the range 0 to +1 (or -1 to 1). For example, the `x` position information of
an agent where the maximum possible value is `maxValue` should be recorded as
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than
`VectorSensor.AddObservation(transform.position.x);`. See the equation below for one approach
of normalization.
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
![normalization](images/normalization.png)
## Vector Actions
* When using continuous control, action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
between -1 and 1, but third party training systems may not do so.
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.