Deric Pang
6 years ago
Current commit 40f4eb3e
37 files changed, with 2535 insertions and 1296 deletions
Changed files (lines changed):

- docs/API-Reference.md (10)
- docs/Background-Jupyter.md (10)
- docs/Background-Machine-Learning.md (301)
- docs/Background-TensorFlow.md (74)
- docs/Background-Unity.md (12)
- docs/Basic-Guide.md (11)
- docs/FAQ.md (12)
- docs/Feature-Memory.md (55)
- docs/Feature-Monitor.md (42)
- docs/Getting-Started-with-Balance-Ball.md (16)
- docs/Installation-Windows.md (248)
- docs/Installation.md (28)
- docs/Learning-Environment-Best-Practices.md (64)
- docs/Learning-Environment-Create-New.md (330)
- docs/Learning-Environment-Design-Academy.md (55)
- docs/Learning-Environment-Design-Agents.md (429)
- docs/Learning-Environment-Design-Brains.md (106)
- docs/Learning-Environment-Design-External-Internal-Brains.md (118)
- docs/Learning-Environment-Design-Heuristic-Brains.md (34)
- docs/Learning-Environment-Design-Player-Brains.md (47)
- docs/Learning-Environment-Design.md (189)
- docs/Learning-Environment-Examples.md (343)
- docs/Learning-Environment-Executable.md (123)
- docs/Limitations.md (25)
- docs/ML-Agents-Overview.md (26)
- docs/Migrating.md (75)
- docs/Python-API.md (10)
- docs/Training-Curriculum-Learning.md (9)
- docs/Training-Imitation-Learning.md (76)
- docs/Training-ML-Agents.md (167)
- docs/Training-PPO.md (218)
- docs/Training-on-Amazon-Web-Service.md (104)
- docs/Training-on-Microsoft-Azure-Custom-Instance.md (112)
- docs/Training-on-Microsoft-Azure.md (102)
- docs/Using-Docker.md (8)
- docs/Using-TensorFlow-Sharp-in-Unity.md (171)
- docs/Using-Tensorboard.md (71)
# Environment Design Best Practices

## General

* It is often helpful to start with the simplest version of the problem, to
  ensure the agent can learn it. From there, increase complexity over time. This
  can either be done manually, or via Curriculum Learning, where a set of
  lessons which progressively increase in difficulty are presented to the agent
  ([learn more here](Training-Curriculum-Learning.md)).
* When possible, it is often helpful to ensure that you can complete the task by
  using a Player Brain to control the agent.
* It is often helpful to make many copies of the agent, and attach the brain to
  be trained to all of these agents. In this way the brain can get more feedback
  from all of these agents, which helps it train faster.
## Rewards

* The magnitude of any given reward should typically not be greater than 1.0 in
  order to ensure a more stable learning process.
* Positive rewards are often more helpful in shaping the desired behavior of an
  agent than negative rewards.
* For locomotion tasks, a small positive reward (+0.1) for forward velocity is
  typically used (see the sketch after this list).
* If you want the agent to finish a task quickly, it is often helpful to provide
  a small penalty every step (-0.05) that the agent does not complete the task.
  In this case completion of the task should also coincide with the end of the
  episode.
* Overly-large negative rewards can cause undesirable behavior where an agent
  learns to avoid any behavior which might produce the negative reward, even if
  it is also behavior which can eventually lead to a positive reward.
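To make the reward shaping above concrete, here is a minimal sketch of how these
values might be assigned inside an agent's `AgentAction()` override. The
`AgentAction`, `AddReward`, `SetReward`, and `Done` calls are the standard Agent
API of this toolkit generation, but `ReachedGoal()`, `agentRigidbody`, and the
exact reward values are illustrative placeholders; depending on your toolkit
version you may also need a `using MLAgents;` directive.

```csharp
using UnityEngine;

// Rough sketch only: reward values mirror the guidelines above, while
// ReachedGoal() and agentRigidbody are placeholders for your own game logic.
public class RollerAgent : Agent
{
    public Rigidbody agentRigidbody;   // assigned in the Inspector

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Small positive reward for forward velocity (locomotion-style task).
        AddReward(0.1f * Vector3.Dot(agentRigidbody.velocity, transform.forward));

        // Small per-step penalty so the agent learns to finish quickly.
        AddReward(-0.05f);

        if (ReachedGoal())
        {
            // Task completion: keep the magnitude at or below 1.0, end the episode.
            SetReward(1.0f);
            Done();
        }
    }

    bool ReachedGoal()
    {
        // Placeholder for your own success condition.
        return false;
    }
}
```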
## Vector Observations

* Vector Observations should include all variables relevant to allowing the
  agent to make an optimally informed decision.
* In cases where Vector Observations need to be remembered or compared over
  time, increase the `Stacked Vectors` value to allow the agent to keep track of
  multiple observations into the past.
* Categorical variables such as type of object (Sword, Shield, Bow) should be
  encoded in one-hot fashion (i.e. `3` -> `0, 0, 1`).
* Besides encoding non-numeric values, all inputs should be normalized to be in
  the range 0 to +1 (or -1 to 1). For example, the `x` position of an agent
  where the maximum possible value is `maxValue` should be recorded as
  `AddVectorObs(transform.position.x / maxValue);` rather than
  `AddVectorObs(transform.position.x);`. See the equation below for one approach
  to normalization.
* Positional information of relevant GameObjects should be encoded in relative
  coordinates wherever possible. This is often relative to the agent's position.
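One common approach to the normalization mentioned above is to rescale a value
by its known minimum and maximum:

```
normalizedValue = (currentValue - minValue) / (maxValue - minValue)
```

Below is a minimal sketch of how these guidelines might look in an agent's
`CollectObservations()` override. `CollectObservations` and `AddVectorObs` are
the standard observation API of this toolkit generation; `maxValue`, `heldItem`,
and `target` are illustrative placeholders, and a `using MLAgents;` directive
may be needed depending on your version.

```csharp
using UnityEngine;

// Rough sketch only: maxValue, heldItem, and target are placeholders.
public class ExampleAgent : Agent
{
    public float maxValue = 10f;   // known maximum of the x position
    public Transform target;
    int heldItem;                  // 0 = Sword, 1 = Shield, 2 = Bow

    public override void CollectObservations()
    {
        // Normalize numeric inputs to roughly [-1, 1].
        AddVectorObs(transform.position.x / maxValue);

        // One-hot encode a categorical variable with three possible values.
        for (int i = 0; i < 3; i++)
        {
            AddVectorObs(heldItem == i ? 1.0f : 0.0f);
        }

        // Encode positions relative to the agent rather than in world space.
        AddVectorObs((target.position - transform.position) / maxValue);
    }
}
```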
## Vector Actions

* When using continuous control, action values should be clipped to an
  appropriate range (see the sketch after this list). The provided PPO model
  automatically clips these values between -1 and 1, but third-party training
  systems may not do so.
* Be sure to set the Vector Action's Space Size to the number of used Vector
  Actions, and not greater, as doing the latter can interfere with the
  efficiency of the training process.
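As a rough illustration of the clipping advice, assuming a continuous action
space of size 2 (`Mathf.Clamp` is standard Unity API; the movement logic is a
placeholder):

```csharp
using UnityEngine;

// Sketch of defensive action clipping inside an Agent subclass.
public class ClampedAgent : Agent
{
    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // The provided PPO model already emits values in [-1, 1], but actions
        // coming from other training systems or heuristics may not.
        float moveX = Mathf.Clamp(vectorAction[0], -1f, 1f);
        float moveZ = Mathf.Clamp(vectorAction[1], -1f, 1f);

        transform.Translate(new Vector3(moveX, 0f, moveZ) * Time.fixedDeltaTime);
    }
}
```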
# External and Internal Brains

The **External** and **Internal** types of Brains work in different phases of
training. When training your agents, set their Brain types to **External**; when
using the trained models, set their Brain types to **Internal**.

## External Brain

When [running an ML-Agents training algorithm](Training-ML-Agents.md), at least
one Brain object in a scene must be set to **External**. This allows the
training process to collect the observations of agents using that Brain and give
the agents their actions.

In addition to using an External Brain for training with the ML-Agents learning
algorithms, you can use an External Brain to control agents in a Unity
environment from an external Python program. See [Python API](Python-API.md)
for more information.

Unlike the other types, the External Brain has no properties to set in the Unity
Inspector window.
## Internal Brain

The Internal Brain type uses a
[TensorFlow model](https://www.tensorflow.org/get_started/get_started_for_beginners#models_and_training)
to make decisions. The Proximal Policy Optimization (PPO) and Behavioral Cloning
algorithms included with the ML-Agents SDK produce trained TensorFlow models
that you can use with the Internal Brain type.

A __model__ is a mathematical relationship mapping an agent's observations to
its actions. TensorFlow is a software library for performing numerical
computation through data flow graphs. A TensorFlow model, then, defines the
mathematical relationship between your agent's observations and its actions
using a TensorFlow data flow graph.

The training algorithms included in the ML-Agents SDK produce TensorFlow graph
models as the end result of the training process. See
[Training ML-Agents](Training-ML-Agents.md) for instructions on how to train a
model.
To use a trained graph model with an Internal Brain:

1. Select the Brain GameObject in the **Hierarchy** window of the Unity Editor.
   (The Brain GameObject must be a child of the Academy GameObject and must have
   a Brain component.)
2. Set the **Brain Type** to **Internal**.
   **Note:** In order to see the **Internal** Brain Type option, you must
   [enable TensorFlowSharp](Using-TensorFlow-Sharp-in-Unity.md).
3. Import the `environment_run-id.bytes` file produced by the PPO training
   program. (Where `environment_run-id` is the name of the model file, which is
   constructed from the name of your Unity environment executable and the run-id
   value you assigned when running the training process.)

   You can
   [import assets into Unity](https://docs.unity3d.com/Manual/ImportingAssets.html)
   in various ways. The easiest way is to simply drag the file into the
   **Project** window and drop it into an appropriate folder.
4. Once the `environment.bytes` file is imported, drag it from the **Project**
   window to the **Graph Model** field of the Brain component.
If you are using a model produced by the ML-Agents `mlagents-learn` command, use
the default values for the other Internal Brain parameters.

The default values of the TensorFlow graph parameters work with the model
produced by the PPO and BC training code in the ML-Agents SDK. To use a default
ML-Agents model, the only parameter that you need to set is the `Graph Model`,
which must be set to the `.bytes` file containing the trained model itself.

* `Graph Model`: This must be the `bytes` file corresponding to the pre-trained
  TensorFlow graph. (You must first drag this file into your Resources folder
  and then from the Resources folder into the inspector.)

Only change the following Internal Brain properties if you have created your own
TensorFlow model and are not using an ML-Agents model:
* `Graph Scope`: If you set a scope while training your TensorFlow model, all
  your placeholder names will have a prefix. You must specify that prefix here.
  Note that if more than one Brain was set to External during training, you
  must give a `Graph Scope` to the Internal Brain corresponding to the name of
  the Brain GameObject.
* `Batch Size Node Name`: If the batch size is one of the inputs of your graph,
  you must specify the name of the placeholder here. The Brain will
  automatically make the batch size equal to the number of agents connected to
  the Brain.
* `State Node Name`: If your graph uses the state as an input, you must specify
  the name of the placeholder here.
* `Recurrent Input Node Name`: If your graph uses a recurrent input / memory as
  input and outputs new recurrent input / memory, you must specify the name of
  the input placeholder here.
* `Recurrent Output Node Name`: If your graph uses a recurrent input / memory as
  input and outputs new recurrent input / memory, you must specify the name of
  the output placeholder here.
* `Observation Placeholder Name`: If your graph uses observations as input, you
  must specify their names here. Note that the number of observations is equal
  to the length of `Camera Resolutions` in the Brain parameters.
* `Action Node Name`: Specify the name of the placeholder corresponding to the
  actions of the Brain in your graph. If the action space type is continuous,
  the output must be a one-dimensional tensor of floats of length `Action Space
  Size`; if the action space type is discrete, the output must be a
  one-dimensional tensor of ints of the same length as the `Branches` array.
* `Graph Placeholder`: If your graph takes additional inputs that are fixed
  (for example, a noise level), you can specify them here. Note that in your
  graph, these must correspond to one-dimensional tensors of int or float of
  size 1.
  * `Name`: Corresponds to the name of the placeholder.
  * `Value Type`: Either Integer or Floating Point.
  * `Min Value` and `Max Value`: Specify the range of the value here. The value
    will be sampled from the uniform distribution ranging from `Min Value` to
    `Max Value` inclusive.
# Heuristic Brain

The **Heuristic** Brain type allows you to hand-code an agent's decision making
process. A Heuristic Brain requires an implementation of the Decision interface
to which it delegates the decision making process.

When you set the **Brain Type** property of a Brain to **Heuristic**, you must
add a component implementing the Decision interface to the same GameObject as
the Brain.

When creating your Decision class, extend MonoBehaviour (so you can use the
class as a Unity component) and implement the Decision interface:

`public class HeuristicLogic : MonoBehaviour, Decision`

The Decision interface defines two methods, `Decide()` and `MakeMemory()`.

The `Decide()` method receives an agent's current state, consisting of the
agent's observations, reward, memory and other aspects of the agent's state, and
must return an array containing the action that the agent should take. The
format of the returned action array depends on the **Vector Action Space Type**.
When using a **Continuous** action space, the action array is just a float array
with a length equal to the **Vector Action Space Size** setting. When using a
**Discrete** action space, the action array is an integer array with the same
size as the `Branches` array. In the discrete action space, the values of the
**Branches** array define the number of discrete values that your `Decide()`
function can return for each branch, which don't need to be consecutive
integers.

The `MakeMemory()` function allows you to pass data forward to the next
iteration of an agent's decision making process. The array you return from
`MakeMemory()` is passed to the `Decide()` function in the next iteration. You
can use the memory to allow the agent's decision process to take past actions
and observations into account when making the current decision. If your
heuristic logic does not require memory, just return an empty array.
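For concreteness, here is a minimal sketch of a Decision implementation for a
continuous action space of size 2. The exact parameter lists of `Decide()` and
`MakeMemory()` (vector observations, visual observations, reward, done flag, and
memory) are assumed here; check the `Decision` interface source in your
ML-Agents version for the authoritative signature. The meaning of the
observations depends on your agent's `CollectObservations()`.

```csharp
using System.Collections.Generic;
using UnityEngine;

// A minimal hand-coded policy. The Decide()/MakeMemory() parameter lists below
// are assumed; verify them against Decision.cs in your ML-Agents version.
public class HeuristicLogic : MonoBehaviour, Decision
{
    public float[] Decide(List<float> vectorObs, List<Texture2D> visualObs,
                          float reward, bool done, List<float> memory)
    {
        // Continuous action space of size 2: steer toward the origin using the
        // first two vector observations (assumed to be the agent's x and z).
        var action = new float[2];
        action[0] = -Mathf.Clamp(vectorObs[0], -1f, 1f);
        action[1] = -Mathf.Clamp(vectorObs[1], -1f, 1f);
        return action;
    }

    public List<float> MakeMemory(List<float> vectorObs, List<Texture2D> visualObs,
                                  float reward, bool done, List<float> memory)
    {
        // No memory needed for this stateless heuristic.
        return new List<float>();
    }
}
```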
# Player Brain

The **Player** Brain type allows you to control an agent using keyboard
commands. You can use Player Brains to control a "teacher" agent that trains
other agents during [imitation learning](Training-Imitation-Learning.md). You
can also use Player Brains to test your agents and environment before changing
their Brain types to **External** and running the training process.

The **Player** Brain properties allow you to assign one or more keyboard keys to
each action and a unique value to send when a key is pressed.

Note the differences between the discrete and continuous action spaces. When a
Brain uses the discrete action space, you can send one integer value as the
action per step. In contrast, when a Brain uses the continuous action space you
can send any number of floating point values (up to the **Vector Action Space
Size** setting).
| **Property** | | **Description** |
| :----------- | :------------- | :-------------- |
| **Continuous Player Actions** | | The mapping for the continuous vector action space. Shown when the action space is **Continuous**. |
| | **Size** | The number of key commands defined. You can assign more than one command to the same action index in order to send different values for that action. (If you press both keys at the same time, deterministic results are not guaranteed.) |
| | **Index** | The element of the agent's action vector to set when this key is pressed. The index value cannot exceed the size of the Action Space (minus 1, since it is an array index). |
| | **Value** | The value to send to the agent as its action for the specified index when the mapped key is pressed. All other members of the action vector are set to 0. |
| **Discrete Player Actions** | | The mapping for the discrete vector action space. Shown when the action space is **Discrete**. |
| | **Branch Index** | The element of the agent's action vector to set when this key is pressed. The index value cannot exceed the size of the Action Space (minus 1, since it is an array index). |
| | **Value** | The value to send to the agent as its action when the mapped key is pressed. Cannot exceed the max value for the associated branch (minus 1, since it is an array index). |

For more information about the Unity input system, see
[Input](https://docs.unity3d.com/ScriptReference/Input.html).
# Limitations

If you enable Headless mode, you will not be able to collect visual observations
from your agents.

Currently the speed of the game physics can only be increased to 100x real-time.
The Academy also moves in time with FixedUpdate() rather than Update(), so game
behavior implemented in Update() may be out of sync with the Agent decision
making. See
[Execution Order of Event Functions](https://docs.unity3d.com/Manual/ExecutionOrder.html)
for more information.

As of version 0.3, we no longer support Python 2.

Currently the ML-Agents toolkit uses TensorFlow 1.7.1 due to the version of the
TensorFlowSharp plugin we are using.
# Imitation Learning

It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.
Consider our
[running example](ML-Agents-Overview.md#running-example-training-npc-behaviors)
of training a medic NPC: instead of indirectly training a medic with the help
of a reward function, we can give the medic real-world examples of observations
from the game and actions from a game controller to guide the medic's behavior.
More specifically, in this mode, the Brain type during training is set to Player
and all the actions performed with the controller (in addition to the agent
observations) will be recorded and sent to the Python API. The imitation
learning algorithm will then use these pairs of observations and actions from
the human player to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs).

There are a variety of imitation learning algorithms that can be used; the
simplest of them is Behavioral Cloning. It works by collecting training data
from a teacher and then simply using it to directly learn a policy, in the same
way that supervised learning works for image classification or other traditional
machine learning tasks.
1. In order to use imitation learning in a scene, the first thing you will need
   is to create two Brains, one which will be the "Teacher," and the other which
   will be the "Student." We will assume that the names of the Brain
   `GameObject`s are "Teacher" and "Student" respectively.
2. Set the "Teacher" Brain to Player mode, and properly configure the inputs to
   map to the corresponding actions. **Ensure that "Broadcast" is checked within
   the Brain inspector window.**
3. Set the "Student" Brain to External mode.
4. Link the Brains to the desired agents (one agent as the teacher and at least
   one agent as a student).
5. In `config/trainer_config.yaml`, add an entry for the "Student" Brain. Set
   the `trainer` parameter of this entry to `imitation`, and the
   `brain_to_imitate` parameter to the name of the teacher Brain: "Teacher".
   Additionally, set `batches_per_epoch`, which controls how much training to do
   each moment. Increase the `max_steps` option if you'd like to keep training
   the agents for a longer period of time (a sample entry is shown after this
   list).
6. Launch the training process with `mlagents-learn config/trainer_config.yaml
   --train --slow`, and press the :arrow_forward: button in Unity when the
   message _"Start training by pressing the Play button in the Unity Editor"_ is
   displayed on the screen.
7. From the Unity window, control the agent with the Teacher Brain by providing
   "teacher demonstrations" of the behavior you would like to see.
8. Watch as the agent(s) with the Student Brain attached begin to behave
   similarly to the demonstrations.
9. Once the Student agents are exhibiting the desired behavior, end the training
   process with `CTRL+C` from the command line.
10. Move the resulting `*.bytes` file into the `TFModels` subdirectory of the
    Assets folder (or a subdirectory of Assets of your choosing), and use it
    with an `Internal` Brain.
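For reference, the "Student" entry described in step 5 might look roughly like
the following in `config/trainer_config.yaml`. Only the parameters mentioned
above are shown, and the values are placeholders to adjust for your environment,
not recommendations.

```yaml
# Illustrative entry for the brain named "Student"; values are placeholders.
Student:
    trainer: imitation
    brain_to_imitate: Teacher
    batches_per_epoch: 5
    max_steps: 10000
```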
We provide a convenience utility, the `BC Teacher Helper` component, that you
can add to the Teacher Agent.

<img src="images/bc_teacher_helper.png"
     alt="BC Teacher Helper"
     width="375" border="10" />

This utility enables you to use keyboard shortcuts to do the following:

1. To start and stop recording experiences. This is useful in case you'd like to
   interact with the game _but not have the agents learn from these
   interactions_. The default command to toggle this is to press `R` on the
   keyboard.
2. Reset the training buffer. This enables you to instruct the agents to forget
   their buffer of recent experiences. This is useful if you'd like to get them
   to quickly learn a new behavior. The default command to reset the buffer is
   to press `C` on the keyboard.
# Training with Proximal Policy Optimization

ML-Agents uses a reinforcement learning technique called
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
PPO uses a neural network to approximate the ideal function that maps an agent's
observations to the best action an agent can take in a given state. The
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate
Python process (communicating with the running Unity application over a socket).

See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the
training program, `learn.py`.

If you are using a recurrent neural network (RNN) to utilize memory, see
[Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training
details.

If you are using curriculum training to pace the difficulty of the learning task
presented to an agent, see
[Training with Curriculum Learning](Training-Curriculum-Learning.md).

For information about imitation learning, which uses a different training
algorithm, see
[Training with Imitation Learning](Training-Imitation-Learning.md).

Successfully training a Reinforcement Learning model often involves tuning the
training hyperparameters. This guide contains some best practices for tuning the
training process when the default parameters don't seem to be giving the level
of performance you would like.
### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.
### Lambda

`lambd` corresponds to the `lambda` parameter used when calculating the
Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This
can be thought of as how much the agent relies on its current value estimate
when calculating an updated value estimate. Low values correspond to relying
more on the current value estimate (which can be high bias), and high values
correspond to relying more on the actual rewards received in the environment
(which can be high variance). The parameter provides a trade-off between the
two, and the right value can lead to a more stable training process.
### Buffer Size

`buffer_size` corresponds to how many experiences (agent observations, actions
and rewards obtained) should be collected before we do any learning or updating
of the model. **This should be a multiple of `batch_size`**. Typically a larger
`buffer_size` corresponds to more stable training updates.
### Batch Size

`batch_size` is the number of experiences used for one iteration of a gradient
descent update. **This should always be a fraction of the `buffer_size`**. If
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
### Number of Epochs

`num_epoch` is the number of passes through the experience buffer during
gradient descent. The larger the `batch_size`, the larger it is acceptable to
make this. Decreasing this will ensure more stable updates, at the cost of
slower learning.
### Learning Rate

`learning_rate` corresponds to the strength of each gradient descent update
step. This should typically be decreased if training is unstable and the reward
does not consistently increase.
### Time Horizon

`time_horizon` corresponds to how many steps of experience to collect per-agent
before adding it to the experience buffer. When this limit is reached before the
end of an episode, a value estimate is used to predict the overall expected
reward from the agent's current state. As such, this parameter trades off
between a less biased, but higher variance estimate (long time horizon) and a
more biased, but less varied estimate (short time horizon). In cases where there
are frequent rewards within an episode, or episodes are prohibitively large, a
smaller number can be more ideal. This number should be large enough to capture
all the important behavior within a sequence of an agent's actions.
### Max Steps

`max_steps` corresponds to how many steps of the simulation (multiplied by
frame-skip) are run during the training process. This value should be increased
for more complex problems.
### Beta

`beta` corresponds to the strength of the entropy regularization, which makes
the policy "more random." This ensures that agents properly explore the action
space during training. Increasing this will ensure more random actions are
taken. This should be adjusted such that the entropy (measurable from
TensorBoard) slowly decreases alongside increases in reward. If entropy drops
too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.
### Epsilon

`epsilon` corresponds to the acceptable threshold of divergence between the old
and new policies during gradient descent updating. Setting this value small will
result in more stable updates, but will also slow the training process.
### Normalize

`normalize` corresponds to whether normalization is applied to the vector
observation inputs. This normalization is based on the running average and
variance of the vector observation. Normalization can be helpful in cases with
complex continuous control problems, but may be harmful with simpler discrete
control problems.
### Number of Layers

`num_layers` corresponds to how many hidden layers are present after the
observation input, or after the CNN encoding of the visual observation. For
simple problems, fewer layers are likely to train faster and more efficiently.
More layers may be necessary for more complex control problems.
### Hidden Units

`hidden_units` corresponds to how many units are in each fully connected layer
of the neural network. For simple problems where the correct action is a
straightforward combination of the observation inputs, this should be small. For
problems where the action is a very complex interaction between the observation
variables, this should be larger.
### (Optional) Recurrent Neural Network Hyperparameters

#### Sequence Length

`sequence_length` corresponds to the length of the sequences of experience
passed through the network during training. This should be long enough to
capture whatever information your agent might need to remember over time. For
example, if your agent needs to remember the velocity of objects, then this can
be a small value. If your agent needs to remember a piece of information given
only once at the beginning of an episode, then this should be a larger value.
#### Memory Size

`memory_size` corresponds to the size of the array of floating point numbers
used to store the hidden state of the recurrent neural network. This value must
be a multiple of 4, and should scale with the amount of information you expect
the agent will need to remember in order to successfully complete the task.
### (Optional) Intrinsic Curiosity Module Hyperparameters

#### Curiosity Encoding Size

`curiosity_enc_size` corresponds to the size of the hidden layer used to encode
the observations within the intrinsic curiosity module. This value should be
small enough to encourage the curiosity module to compress the original
observation, but not so small that it cannot capture the dynamics of the
environment.
#### Curiosity Strength

`curiosity_strength` corresponds to the magnitude of the intrinsic reward
generated by the intrinsic curiosity module. This should be scaled in order to
ensure it is large enough to not be overwhelmed by extrinsic reward signals in
the environment. Likewise it should not be so large that it overwhelms the
extrinsic reward signal.
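Pulling the above together, a PPO entry in `config/trainer_config.yaml` might
look roughly like the sketch below. The values are illustrative placeholders
rather than recommendations, and the `use_recurrent`/`use_curiosity` toggles are
assumptions about how the optional sections are enabled in your version of the
toolkit.

```yaml
# Illustrative PPO entry for a brain named "MyBrain"; values are placeholders.
MyBrain:
    trainer: ppo
    gamma: 0.99
    lambd: 0.95
    buffer_size: 10240
    batch_size: 1024        # a fraction of buffer_size; large for continuous actions
    num_epoch: 3
    learning_rate: 3.0e-4
    time_horizon: 64
    max_steps: 5.0e5
    beta: 5.0e-3
    epsilon: 0.2
    normalize: true
    num_layers: 2
    hidden_units: 128
    # Optional recurrent settings (only used when use_recurrent is true):
    use_recurrent: false
    sequence_length: 64
    memory_size: 256        # must be a multiple of 4
    # Optional curiosity settings (only used when use_curiosity is true):
    use_curiosity: false
    curiosity_enc_size: 128
    curiosity_strength: 0.01
```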
## Training Statistics

To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
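In a default setup this usually amounts to running `tensorboard
--logdir=summaries` from the `ml-agents` directory and opening `localhost:6006`
in a browser, though the linked guide is the authoritative reference.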
### Cumulative Reward

The general trend in reward should consistently increase over time. Small ups
and downs are to be expected. Depending on the complexity of the task, a
significant increase in reward may not present itself until millions of steps
into the training process.
### Entropy

This corresponds to how random the decisions of a Brain are. This should
consistently decrease during training. If it decreases too soon or not at all,
`beta` should be adjusted (when using discrete action space).
### Learning Rate

### Policy Loss

These values will oscillate during training. Generally they should be less than
1.0.
### Value Estimate

These values should increase as the cumulative reward increases. They correspond
to how much future reward the agent predicts itself receiving at any given
point.
### Value Loss

These values will increase as the reward increases, and then should decrease
once reward becomes stable.