
Combine "Best Practices" and "Agents" documentation (#3643)

* Merge agent & best practices doc. Plus other fixes

* Fix overly long lines

* Address typos and comments

* Address feedback
/bug-failed-api-check
GitHub · 4 years ago
Current commit a6ade9b2
6 files changed, 153 insertions and 162 deletions
  1. docs/Getting-Started-with-Balance-Ball.md (22 changes)
  2. docs/Learning-Environment-Create-New.md (20 changes)
  3. docs/Learning-Environment-Design-Agents.md (189 changes)
  4. docs/Learning-Environment-Examples.md (22 changes)
  5. docs/Readme.md (1 change)
  6. docs/Learning-Environment-Best-Practices.md (61 changes)

docs/Getting-Started-with-Balance-Ball.md (22 changes)


When you create an Agent, you must extend the base Agent class.
The Ball3DAgent subclass defines the following methods:
* `Agent.OnEpisodeBegin()` — Called at the beginning of an Agent's episode, including at the beginning
of the simulation. The Ball3DAgent class uses this function to reset the
agent cube and ball to their starting positions. The function randomizes the reset values so that the
training generalizes to more than a specific starting position and agent cube
attitude.
* `Agent.CollectObservations(VectorSensor sensor)` — Called every simulation step. Responsible for
collecting the Agent's observations of the environment (see the sketch below).
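
The snippet below is a simplified sketch of how these two methods fit together; it is
not the shipped Ball3DAgent source, and it assumes the `Unity.MLAgents` namespaces
used by recent ML-Agents releases, with the `ball` field assigned in the Inspector.

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class SimpleBalanceAgent : Agent
{
    public GameObject ball;   // assigned in the Inspector
    Rigidbody ballRigidbody;

    // Called once when the Agent is set up (name may differ in older releases).
    public override void Initialize()
    {
        ballRigidbody = ball.GetComponent<Rigidbody>();
    }

    // Reset the agent cube and ball, randomizing the starting attitude.
    public override void OnEpisodeBegin()
    {
        transform.rotation = Quaternion.Euler(
            Random.Range(-10f, 10f), 0f, Random.Range(-10f, 10f));
        ballRigidbody.velocity = Vector3.zero;
        ball.transform.position = transform.position + new Vector3(0f, 4f, 0f);
    }

    // Collect the observations the Policy uses to make decisions.
    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(transform.rotation.z);
        sensor.AddObservation(transform.rotation.x);
        sensor.AddObservation(ball.transform.position - transform.position);
        sensor.AddObservation(ballRigidbody.velocity);
    }
}
```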

From TensorBoard, you will see the summary statistics:
* **Lesson** - only interesting when performing [curriculum
training](Training-Curriculum-Learning.md).
* **Cumulative Reward** - The mean cumulative episode reward over all agents. Should
increase during a successful training session.
* **Entropy** - How random the decisions of the model are. Should slowly decrease
during a successful training process.
* **Episode Length** - The mean length of each episode in the environment for all
agents.
* **Learning Rate** - How large a step the training algorithm takes as it searches
for the optimal policy. Should decrease over time.
* **Policy Loss** - The mean loss of the policy function update. Correlates to how
much the policy (process for deciding actions) is changing. This should decrease
during a successful training session.
* **Value Estimate** - The mean value estimate for all states visited by the agent.
Should increase during a successful training session.
* **Value Loss** - The mean loss of the value function update. Correlates to how
well the model is able to predict the value of each state. This should
decrease during a successful training session.

docs/Learning-Environment-Create-New.md (20 changes)


# Making a New Learning Environment
This tutorial walks through the process of creating a Unity Environment from scratch.
We recommend first reading the [Getting Started](Getting-Started-with-Balance-Ball.md)
guide to understand the concepts presented here in an already-built environment.

In this example, we will create an agent capable of controlling a ball on a platform.
We will then train the agent to roll the ball toward the cube while avoiding falling
off the platform.

## Overview

This is only one way to achieve this objective. Refer to the
[example environments](Learning-Environment-Examples.md) for other ways we can achieve relative positioning.
## Review: Scene Layout
This section briefly reviews how to organize your scene when using Agents in
your Unity environment.
There are two kinds of game objects you need to include in your scene in order
to use Unity ML-Agents: an Academy and one or more Agents.
Keep in mind:
* If you are using multiple training areas, make sure all the Agents have the same `Behavior Name`
and `Behavior Parameters`

docs/Learning-Environment-Design-Agents.md (189 changes)


# Agents
An agent is an entity that can observe its environment, decide on the best
course of action using those observations, and execute those actions within
its environment. Agents can be created in Unity by extending
the `Agent` class. The most important aspects of creating agents that can
successfully learn are the observations the agent collects,
and the reward you assign to estimate the value of the
agent's current state toward accomplishing its tasks.

An Agent passes its observations to its Policy. The Policy then makes a decision
and passes the chosen action back to the agent. Your agent code must execute the
action, for example, move the agent in one direction or another. In order to
[train an agent using reinforcement learning](Learning-Environment-Design.md),

The `Policy` class abstracts out the decision making logic from the Agent itself, so
how an Agent makes its decisions depends on the `Behavior Parameters` associated
with the agent. If you set `Behavior Type` to `Heuristic Only`, the Agent will use
its `Heuristic()` method to make decisions, which allows you to control the Agent
manually or write your own Policy. If the Agent has a `Model` file, its Policy will
use the neural network `Model` to make decisions.
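
As a rough illustration of the `Heuristic Only` mode described above, the sketch below
writes keyboard input into the action array. The exact `Heuristic()` signature varies
between ML-Agents versions, so treat the snippet as an assumption rather than a
definitive API reference.

```csharp
using Unity.MLAgents;
using UnityEngine;

public class ManuallyControlledAgent : Agent
{
    // Used when Behavior Type is set to Heuristic Only: writes keyboard input
    // into the action array that would otherwise come from the neural network.
    public override void Heuristic(float[] actionsOut)
    {
        actionsOut[0] = Input.GetAxis("Horizontal"); // continuous action 0
        actionsOut[1] = Input.GetAxis("Vertical");   // continuous action 1
    }
}
```
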
## Decisions

the Agent to request decisions on its own at regular intervals, add a
`Decision Requester` component to the Agent's GameObject. Making decisions at regular step
intervals is generally most appropriate for physics-based simulations. Agents that only need
to make decisions when certain game or simulation events
occur, such as in a turn-based game, should call `Agent.RequestDecision()` manually.
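
For the event-driven case, a minimal sketch might look like the following. `OnTurnStarted()`
is a hypothetical hook called by your own turn manager; it is not part of the ML-Agents API.

```csharp
using Unity.MLAgents;

public class TurnBasedAgent : Agent
{
    // Hypothetical hook invoked by a game-specific turn manager when it
    // becomes this agent's turn (not an ML-Agents callback).
    public void OnTurnStarted()
    {
        // Request a decision only when the game event occurs, instead of
        // attaching a Decision Requester component.
        RequestDecision();
    }
}
```
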
## Observations and Sensors

To make informed decisions, an agent must first make observations of the state of
the environment. The observations are collected by Sensors attached to the agent
GameObject. By default, agents come with a `VectorSensor` which allows them to
collect floating-point observations into a single array. Additional sensor
components can be attached to the agent GameObject to collect their own
observations, or to modify other observations. These are:

* `CameraSensorComponent` - Allows the image from a `Camera` to be used as an observation.
* `RenderTextureSensorComponent` - Allows the content of a `RenderTexture` to be used as an observation.
* `RayPerceptionSensorComponent` - Allows information from a set of ray-casts to be used as an observation.

When you use vector observations for an Agent, implement the
`Agent.CollectObservations(VectorSensor sensor)` method to create the feature vector. When you use
**Visual Observations**, you only need to identify which Unity Camera objects
or RenderTextures will provide images and the base Agent class handles the rest.
You do not need to implement the `CollectObservations(VectorSensor sensor)` method when your Agent
uses visual observations (unless it also uses vector observations).

### Vector Observation Space: Feature Vectors

Vector observations are best used for aspects of the environment which are numerical
and non-visual. For agents using a continuous state space, you create a feature vector to
represent the agent's observation at each step of the simulation. The Policy
class calls the `CollectObservations(VectorSensor sensor)` method of each Agent. Your
implementation of this function must call `VectorSensor.AddObservation` to add vector
observations.
In order for an agent to learn, the observations should include all the
information an agent needs to accomplish its task. Without sufficient and relevant
information, an agent may learn poorly, or may not learn at all. A good approach
is to consider what information you would need to calculate an analytical
solution to the problem, or what you would expect a human to be able to use to solve the problem.
For examples of various state observation functions, you can look at the
[example environments](Learning-Environment-Examples.md) included in the

every enemy agent in an environment, you could only observe the closest five.
When you set up an Agent's `Behavior Parameters` in the Unity Editor, set the following
properties to use a vector observation:
* **Space Size** — The state size must match the length of your feature vector.

of data to your observation vector. You can add Integers and booleans directly to
the observation vector, as well as some common Unity data types such as `Vector2`,
`Vector3`, and `Quaternion`.
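
As a hedged sketch of these overloads (the `hitPoints` and `hasKey` fields are
illustrative, not taken from the example environments):

```csharp
public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(hitPoints);                // int, added as a single float
    sensor.AddObservation(hasKey);                   // bool, added as 0 or 1
    sensor.AddObservation(transform.localPosition);  // Vector3, added as 3 floats
    sensor.AddObservation(transform.localRotation);  // Quaternion, added as 4 floats
}
```
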
#### One-hot encoding categorical information

Type enumerations should be encoded in the _one-hot_ style. That is, add an
element to the feature vector for each element of enumeration, setting the
element corresponding to the observed member to one and the rest to zero.

}
```
`VectorSensor` also provides a two-argument function `AddOneHotObservation()` as a shortcut for _one-hot_
style observations. The following example is identical to the previous one.
```csharp
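// Sketch of the shortcut; assumes an ItemType enum and a currentItem field
// as in the preceding one-hot example.
enum ItemType { Sword, Shield, Bow, LastItem }

public override void CollectObservations(VectorSensor sensor)
{
    // Adds a one-hot vector of length LastItem with a 1 at index currentItem.
    sensor.AddOneHotObservation((int)currentItem, (int)ItemType.LastItem);
}
```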

angle, or, if the number of turns is significant, increase the maximum value
used in your normalization formula.
#### Vector Observation Summary & Best Practices
* Vector Observations should include all variables relevant for allowing the
agent to take the optimally informed decision, and ideally no extraneous information.
* In cases where Vector Observations need to be remembered or compared over
time, either an LSTM (see [here](Feature-Memory.md)) should be used in the model, or the
`Stacked Vectors` value in the agent GameObject's `Behavior Parameters` should be changed.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`). This can be done automatically using the
`AddOneHotObservation()` method of the `VectorSensor`.
* In general, all inputs should be normalized to be in
the range 0 to +1 (or -1 to 1). For example, the `x` position information of
an agent where the maximum possible value is `maxValue` should be recorded as
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than
`VectorSensor.AddObservation(transform.position.x);`. See the sketch after this list.
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
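
A minimal sketch of the normalization advice above; the `Normalize` helper and the
range fields (`platformHalfWidth`, `maxValue`) are illustrative, not part of the
ML-Agents API.

```csharp
// General form of normalization: (value - min) / (max - min) maps a value
// from [minValue, maxValue] into [0, 1].
static float Normalize(float value, float minValue, float maxValue)
{
    return (value - minValue) / (maxValue - minValue);
}

public override void CollectObservations(VectorSensor sensor)
{
    // Illustrative ranges; use the actual bounds of your environment.
    sensor.AddObservation(Normalize(transform.position.x, -platformHalfWidth, platformHalfWidth));
    sensor.AddObservation(transform.position.y / maxValue); // simpler form when the minimum is 0
}
```
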
### Visual Observations
Visual observations are generally provided to an agent via either a `CameraSensor` or `RenderTextureSensor`.
These collect image information and transform it into a 3D Tensor which
can be fed into the convolutional neural network (CNN) of the agent policy. For more information on
CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). This allows agents
to learn from spatial regularities in the observation images. It is possible to
use visual and vector observations with the same agent.

Visual observations are typically less efficient and slower to train, and sometimes don't
succeed at all as compared to vector observations. As such, they should only be
used when it is not possible to properly define the problem using vector or ray-cast observations.
Visual observations can be derived from Cameras or RenderTextures within your scene.
To add a visual observation to an Agent, add either a Camera Sensor Component
or a Render Texture Sensor Component to the Agent GameObject.

![Agent RenderTexture Debug](images/gridworld.png)
#### Visual Observation Summary & Best Practices
* To collect visual observations, attach `CameraSensor` or `RenderTextureSensor`
components to the agent GameObject.
* Visual observations should generally only be used when vector observations are not sufficient.
* Image size should be kept as small as possible, without the loss of
needed details for decision making.
* Images should be made greyscale in situations where color information is
not needed for making informed decisions.
Raycasts are another possible method for providing observations to an agent.
This can be easily implemented by adding a
`RayPerceptionSensorComponent3D` (or `RayPerceptionSensorComponent2D`) to the Agent GameObject.
During observations, several rays (or spheres, depending on settings) are cast into
the physics world, and the objects that are hit determine the observation vector that
is produced.

* _Start Vertical Offset_ (3D only) The vertical offset of the ray start point.
* _End Vertical Offset_ (3D only) The vertical offset of the ray end point.
In the example image above, the Agent has two `RayPerceptionSensorComponent3D`s.
Both use 3 Rays Per Direction and 90 Max Ray Degrees. One of the components
had a vertical offset, so the Agent can tell whether it's clear to jump over
the wall.

`Behavior Parameters`, so you don't need to worry about the formula above when
setting the State Size.
#### RayCast Observation Summary & Best Practices
* Attach `RayPerceptionSensorComponent3D` or `RayPerceptionSensorComponent2D` to use.
* This observation type is best used when there is relevant spatial information
for the agent that doesn't require a fully rendered image to convey.
* Use as few rays and tags as necessary to solve the problem in order to improve learning stability and agent performance.
## Actions
agent's `OnActionReceived()` function. Actions for an agent can take one of two forms, either **Continuous** or **Discrete**.

When you specify that the vector action space is **Continuous**, the action parameter
passed to the Agent is an array of
floating point numbers with length equal to the `Vector Action Space Size` property.
In the **Discrete** vector action space type, the action parameter
is an array of integers. Each integer is an index into a list or table of commands.

When defining the discrete vector action space, `Branches` is an
array of integers, and each value corresponds to the number of possibilities for
each branch.
For example, if we wanted an Agent that can move in a plane and jump, we could
define two branches (one for motion and one for jumping) because we want our
agent to be able to move __and__ jump concurrently. We define the first branch to
have 5 possible actions (don't move, go left, go right, go backward, go forward)
and the second branch to have 2 possible actions (don't jump, jump).
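
A sketch of how `OnActionReceived()` might handle these two branches. It assumes the
float-array form of the callback used around this version of ML-Agents;
`MoveInDirection()` and `Jump()` are illustrative helpers, not ML-Agents APIs.

```csharp
public override void OnActionReceived(float[] vectorAction)
{
    int moveAction = (int)vectorAction[0]; // branch 0: 5 options
    int jumpAction = (int)vectorAction[1]; // branch 1: 2 options

    switch (moveAction)
    {
        case 1: MoveInDirection(Vector3.left); break;
        case 2: MoveInDirection(Vector3.right); break;
        case 3: MoveInDirection(Vector3.back); break;
        case 4: MoveInDirection(Vector3.forward); break;
        // case 0: don't move
    }

    if (jumpAction == 1)
    {
        Jump();
    }
}
```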

neural network, the Agent will be unable to perform the specified action. Note
that when the Agent is controlled by its Heuristic, the Agent will
still be able to decide to perform the masked action. In order to mask an
action, override the `Agent.CollectDiscreteActionMasks()` virtual method,
and call `DiscreteActionMasker.SetMask()` in it:
```csharp
public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker){
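    // Sketch of masking; the branch index and action indices are illustrative.
    // This prevents the Policy from choosing actions 1 and 2 of branch 0 at
    // the next decision.
    actionMasker.SetMask(0, new int[2]{ 1, 2 });
}
```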

* You cannot mask all the actions of a branch.
* You cannot mask actions in continuous control.
### Actions Summary & Best Practices
* Actions can either use `Discrete` or `Continuous` spaces.
* When using `Discrete` it is possible to assign multiple action branches, and to mask certain actions.
* In general, smaller action spaces will make for easier learning.
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.
* When using continuous control, action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
between -1 and 1, but third party training systems may not do so. See the sketch below.
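
A minimal sketch of clipping continuous actions before applying them; the
`agentRigidbody` and `moveSpeed` fields and the action layout are illustrative.

```csharp
public override void OnActionReceived(float[] vectorAction)
{
    // Clip each continuous action to [-1, 1] before using it.
    float forceX = Mathf.Clamp(vectorAction[0], -1f, 1f);
    float forceZ = Mathf.Clamp(vectorAction[1], -1f, 1f);
    agentRigidbody.AddForce(new Vector3(forceX, 0f, forceZ) * moveSpeed);
}
```
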
## Rewards
In reinforcement learning, the reward is a signal that the agent has done

Perhaps the best advice is to start simple and only add complexity as needed. In
general, you should reward results rather than actions you think will lead to
the desired results. You can even use the
Agent's `Heuristic()` method to control the Agent while you watch how it accumulates rewards.
Allocate rewards to an Agent by calling the `AddReward()` or `SetReward()` methods on the agent.
The reward assigned between each decision should be in proportion to how good you
think that decision was. `SetReward()` will override all
previous rewards given to an agent since the previous decision.
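
For example, a collision handler might assign rewards like this (the tags and
reward values are illustrative):

```csharp
void OnCollisionEnter(Collision collision)
{
    if (collision.gameObject.CompareTag("goal"))
    {
        SetReward(1.0f);  // overrides any reward accumulated since the last decision
        EndEpisode();     // completing the task also ends the episode
    }
    else if (collision.gameObject.CompareTag("obstacle"))
    {
        AddReward(-0.1f); // small penalty, accumulated with other rewards
    }
}
```
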
### Examples

Note that all of these environments make use of the `EndEpisode()` method, which manually
terminates an episode when a termination condition is reached. This can be
called independently of the `Max Step` property.
### Rewards Summary & Best Practices
* Use `AddReward()` to accumulate rewards between decisions. Use `SetReward()`
to overwrite any previous rewards accumulated between decisions.
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards. Excessive negative rewards can result in the agent
failing to learn any meaningful behavior.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to provide
a small penalty every step (-0.05) that the agent does not complete the task.
In this case completion of the task should also coincide with the end of the
episode by calling `EndEpisode()` on the agent when it has accomplished its goal.
## Agent Properties

docs/Learning-Environment-Examples.md (22 changes)


* Float Properties: None
* Benchmark Mean Reward: 0.93
## 3DBall: 3D Balance Ball
![3D Balance Ball](images/balance.png)

* Recommended Maximum: 20
* Benchmark Mean Reward: 100
## GridWorld
![GridWorld](images/gridworld.png)

number of goals.
* Benchmark Mean Reward: 0.8
## Tennis
![Tennis](images/tennis.png)

* Recommended Minimum: 0.2
* Recommended Maximum: 5
## Push Block
![Push](images/push.png)

* Recommended Maximum: 2000
* Benchmark Mean Reward: 4.5
## Wall Jump
![Wall](images/wall.png)

* Float Properties: Four
* Benchmark Mean Reward (Big & Small Wall): 0.8
## Reacher
![Reacher](images/reacher.png)

* Recommended Maximum: 3
* Benchmark Mean Reward: 30
## Crawler
![Crawler](images/crawler.png)

* Benchmark Mean Reward for `CrawlerStaticTarget`: 2000
* Benchmark Mean Reward for `CrawlerDynamicTarget`: 400
## Food Collector
![Collector](images/foodCollector.png)

* Recommended Maximum: 5
* Benchmark Mean Reward: 10
## Hallway
![Hallway](images/hallway.png)

* Benchmark Mean Reward: 0.7
* To speed up training, you can enable curiosity by adding the `curiosity` reward signal in `config/trainer_config.yaml`
## Bouncer
![Bouncer](images/bouncer.png)

* Recommended Maximum: 250
* Benchmark Mean Reward: 10
## Soccer Twos
![SoccerTwos](images/soccer.png)

docs/Readme.md (1 change)


* [Making a New Learning Environment](Learning-Environment-Create-New.md)
* [Designing a Learning Environment](Learning-Environment-Design.md)
* [Designing Agents](Learning-Environment-Design-Agents.md)
* [Learning Environment Best Practices](Learning-Environment-Best-Practices.md)
### Advanced Usage
* [Using the Monitor](Feature-Monitor.md)

docs/Learning-Environment-Best-Practices.md (61 changes)


# Environment Design Best Practices
## General
* It is often helpful to start with the simplest version of the problem, to
ensure the agent can learn it. From there, increase complexity over time. This
can either be done manually, or via Curriculum Learning, where a set of
lessons which progressively increase in difficulty are presented to the agent
([learn more here](Training-Curriculum-Learning.md)).
* When possible, it is often helpful to ensure that you can complete the task by
using a heuristic to control the agent. To do so, set the `Behavior Type`
to `Heuristic Only` on the Agent's Behavior Parameters, and implement the
`Heuristic()` method on the Agent.
* It is often helpful to make many copies of the agent, and give them the same
`Behavior Name`. In this way the learning process can get more feedback
information from all of these agents, which helps it train faster.
## Rewards
* The magnitude of any given reward should typically not be greater than 1.0 in
order to ensure a more stable learning process.
* Positive rewards are often more helpful to shaping the desired behavior of an
agent than negative rewards.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
typically used.
* If you want the agent to finish a task quickly, it is often helpful to provide
a small penalty every step (-0.05) that the agent does not complete the task.
In this case completion of the task should also coincide with the end of the
episode.
* Overly-large negative rewards can cause undesirable behavior where an agent
learns to avoid any behavior which might produce the negative reward, even if
it is also behavior which can eventually lead to a positive reward.
## Vector Observations
* Vector Observations should include all variables relevant to allowing the
agent to take the optimally informed decision.
* In cases where Vector Observations need to be remembered or compared over
time, increase the `Stacked Vectors` value to allow the agent to keep track of
multiple observations into the past.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`).
* Besides encoding non-numeric values, all inputs should be normalized to be in
the range 0 to +1 (or -1 to 1). For example, the `x` position information of
an agent where the maximum possible value is `maxValue` should be recorded as
`VectorSensor.AddObservation(transform.position.x / maxValue);` rather than
`VectorSensor.AddObservation(transform.position.x);`. See the equation below for one approach
of normalization.
* Positional information of relevant GameObjects should be encoded in relative
coordinates wherever possible. This is often relative to the agent position.
![normalization](images/normalization.png)
## Vector Actions
* When using continuous control, action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
between -1 and 1, but third party training systems may not do so.
* Be sure to set the Vector Action's Space Size to the number of used Vector
Actions, and not greater, as doing the latter can interfere with the
efficiency of the training process.