Release 0.9.0 docs checklist and cleanup - v2 (#2372)
* Included explicit version # for ZN
* added explicit version for KR docs
* minor fix in installation doc
* Consistency with numbers for reset parameters
* Removed extra verbiage. minor consistency
* minor consistency
* Cleaned up IL language
* moved parameter sampling above in list
* Cleaned up language in Env Parameter sampling
* Cleaned up migrating content
* updated consistency of Reset Parameter Sampling
* Rename Training-Generalization-Learning.md to Training-Generalization-Reinforcement-Learning-Agents.md
* Updated doc link for generalization
* Rename Training-Generalization-Reinforcement-Learning-Agents.md to Training-Generalized-Reinforcement-Learning-Agents.md
* Re-wrote the intro paragraph for generalization
* add titles, cleaned up language for reset params
* Update Training-Generalized-Reinforcement-Learning-Agents.md
* cleanup of generalization doc
* More cleanu.../develop-generalizationTraining-TrainerController
Jeffrey Shih
5 years ago
Current commit
728afebf
18 files changed, with 511 insertions and 465 deletions
- 2 docs/Installation.md
- 36 docs/Learning-Environment-Examples.md
- 26 docs/ML-Agents-Overview.md
- 10 docs/Migrating.md
- 1 docs/Readme.md
- 23 docs/Training-Imitation-Learning.md
- 56 docs/Training-ML-Agents.md
- 35 docs/Training-PPO.md
- 2 docs/localized/KR/README.md
- 2 docs/localized/zh-CN/README.md
- 7 docs/Profiling-Python.md
- 1 ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py
- 236 docs/Reward-Signals.md
- 171 docs/Training-Generalized-Reinforcement-Learning-Agents.md
- 211 docs/Training-RewardSignals.md
- 157 docs/Training-Generalization-Learning.md
- 0 /docs/Training-Behavioral-Cloning.md
- 0 /docs/Profiling-Python.md
**docs/Reward-Signals.md**

# Reward Signals

In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
external to the learning algorithm.

Rewards, however, can also be defined outside of the environment, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can
be a mix of extrinsic and intrinsic reward signals.

ML-Agents allows reward signals to be defined in a modular way, and we provide three reward
signals that can be mixed and matched to help shape your agent's behavior. The `extrinsic` Reward
Signal represents the rewards defined in your environment, and is enabled by default.
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse, and
the `gail` reward signal rewards behavior that resembles recorded demonstrations.

## Enabling Reward Signals

Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. Examples
are provided in `config/trainer_config.yaml` and `config/gail_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the brain name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward and a GAIL reward signal, you would define your `reward_signals` as follows:

```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
    gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: demos/ExpertPyramid.demo
```

Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.
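To make the role of `strength` concrete, here is a loose sketch (not the trainer's actual internals) of how the reward the agent optimizes can be thought of as a strength-weighted sum of the enabled signals:

```python
# Strengths taken from the example configuration above.
strengths = {"extrinsic": 1.0, "curiosity": 0.02, "gail": 0.01}

def combined_reward(raw_rewards):
    # raw_rewards maps each enabled signal name to its raw reward for this step.
    return sum(strengths[name] * r for name, r in raw_rewards.items())

print(combined_reward({"extrinsic": 1.0, "curiosity": 0.5, "gail": 0.3}))  # 1.013
```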
## Reward Signal Types

As part of the toolkit, we provide three reward signal types as hyperparameters: Extrinsic, Curiosity, and GAIL.

### Extrinsic Reward Signal

The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.

#### Strength

`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.

Typical Range: `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

Typical Range: `0.8` - `0.995`
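For intuition, a reward received `t` steps in the future contributes roughly `gamma ** t` of its face value to the return, so a larger `gamma` makes distant rewards matter more:

```python
# How much a future reward is worth today under different horizons.
gamma = 0.99
for t in (1, 10, 100, 500):
    print(t, round(gamma ** t, 3))  # 0.99, 0.904, 0.366, 0.007
```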
### Curiosity Reward Signal

The `curiosity` reward signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
  uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
  next encoded observation.

The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.
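As a rough sketch of that idea (not the toolkit's implementation), the intrinsic reward is the forward model's prediction error, scaled by `strength`:

```python
import numpy as np

def curiosity_reward(predicted_next_encoding, actual_next_encoding, strength=0.02):
    # The more "surprised" the forward model is, the larger the reward.
    prediction_error = np.mean((predicted_next_encoding - actual_next_encoding) ** 2)
    return strength * prediction_error
```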
For more information, see
* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/

#### Strength

In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise, it should not be so large that it overwhelms the extrinsic reward signal.

Typical Range: `0.001` - `0.1`

#### Gamma

`gamma` corresponds to the discount factor for future rewards.

Typical Range: `0.8` - `0.995`

#### (Optional) Encoding Size

`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but not so small that it prevents the model from learning to predict the
next observation.

Default Value: `64`

Typical Range: `64` - `256`

#### (Optional) Learning Rate

`learning_rate` is the learning rate used to update the intrinsic curiosity module.
It should typically be decreased if training is unstable and the curiosity loss is unstable.

Default Value: `3e-4`

Typical Range: `1e-5` - `1e-3`

#### (Optional) Num Epochs

`num_epoch` is the number of passes to make through the experience buffer when performing gradient
descent optimization for the ICM. This should typically be set to the same value used for PPO.

Default Value: `3`

Typical Range: `3` - `10`

### GAIL Reward Signal

GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an
imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs
(Generative Adversarial Networks). In this framework, a second neural network, the
discriminator, is taught to distinguish whether an observation/action is from a demonstration or
produced by the agent. This discriminator can then examine a new observation/action and provide it a
reward based on how close it believes this new observation/action is to the provided demonstrations.

At each training step, the agent tries to learn how to maximize this reward. Then, the
discriminator is trained to better distinguish between demonstrations and agent state/actions.
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
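As a hedged illustration (the trainer's exact formulation may differ), a discriminator output `d` near 1 for "looks like the demonstrations" can be turned into a per-step reward like this:

```python
import numpy as np

def gail_reward(discriminator_output, eps=1e-7):
    # Higher reward the more the discriminator believes the agent's
    # observation/action came from the demonstrations.
    d = np.clip(discriminator_output, eps, 1.0 - eps)
    return -np.log(1.0 - d)
```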
This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It
is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can
also be used independently to purely learn from demonstrations.

Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.

#### Strength

`strength` is the factor by which to multiply the raw reward. Note that when using GAIL
with an Extrinsic Signal, this value should be set lower if your demonstrations are
suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic
rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases.

Typical Range: `0.01` - `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards.

Typical Range: `0.8` - `0.9`

#### Demo Path

`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the
[imitation learning guide](Training-Imitation-Learning.md).

#### (Optional) Encoding Size

`encoding_size` corresponds to the size of the hidden layer used by the discriminator.
This value should be small enough to encourage the discriminator to compress the original
observation, but not so small that it prevents it from learning to differentiate between
demonstrated and actual behavior. Dramatically increasing this size will also negatively affect
training times.

Default Value: `64`

Typical Range: `64` - `256`

#### (Optional) Learning Rate

`learning_rate` is the learning rate used to update the discriminator.
It should typically be decreased if training is unstable and the GAIL loss is unstable.

Default Value: `3e-4`

Typical Range: `1e-5` - `1e-3`

#### (Optional) Use Actions

`use_actions` determines whether the discriminator should discriminate based on both
observations and actions, or just observations. Set to `True` if you want the agent to
mimic the actions from the demonstrations, and `False` if you'd rather have the agent
visit the same states as in the demonstrations but with possibly different actions.
Setting it to `False` is more likely to be stable, especially with imperfect demonstrations,
but may learn more slowly.

Default Value: `false`

#### (Optional) Variational Discriminator Bottleneck

`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the
GAIL discriminator. This forces the discriminator to learn a more general representation
and reduces its tendency to be "too good" at discriminating, making learning more stable.
However, it does increase training time. Enable this if you notice your imitation learning is
unstable, or unable to learn the task at hand.

Default Value: `false`

#### (Optional) Samples Per Update

`samples_per_update` is the maximum number of samples to use during each discriminator update. You may
want to lower this if your buffer size is very large to avoid overfitting the discriminator on current data.
If set to 0, we will use the minimum of the buffer size and the number of demonstration samples.

Default Value: `0`

Typical Range: Approximately equal to [`buffer_size`](Training-PPO.md)

#### (Optional) Num Epochs

`num_epoch` is the number of passes to make through the experience buffer when performing gradient
descent optimization for the discriminator. To avoid overfitting, this should typically be set to
the same value as, or less than, that used for PPO.

Default Value: `3`

Typical Range: `1` - `10`
**docs/Training-Generalized-Reinforcement-Learning-Agents.md**

# Training Generalized Reinforcement Learning Agents

One of the challenges of training and testing agents on the same
environment is that the agents tend to overfit. The result is that the
agents are unable to generalize to any tweaks or variations in the environment.
This is analogous to a model being trained and tested on an identical dataset
in supervised learning. This becomes problematic in cases where environments
are randomly instantiated with varying objects or properties.

To make agents robust and generalizable to different environments, the agent
should be trained over multiple variations of the environment. Using this approach
for training, the agent will be better suited to adapt (with higher performance)
to future unseen variations of the environment.

_Example of variations of the 3D Ball environment._

Ball scale of 0.5 | Ball scale of 4
:-------------------------:|:-------------------------:
![](images/3dball_small.png) | ![](images/3dball_big.png)

## Introducing Generalization Using Reset Parameters

To enable variations in the environments, we implemented `Reset Parameters`. We
also included different sampling methods and the ability to create new kinds of
sampling methods for each `Reset Parameter`. In the 3D ball environment example displayed
in the figure above, the reset parameters are `gravity`, `ball_mass` and `ball_scale`.

## How to Enable Generalization Using Reset Parameters

We first need to provide a way to modify the environment by supplying a set of `Reset Parameters`
and vary them over time. This provision can be done either deterministically or randomly.

This is done by assigning each `Reset Parameter` a `sampler-type` (such as a uniform sampler),
which determines how to sample a `Reset Parameter`. If a `sampler-type` isn't provided for a
`Reset Parameter`, the parameter maintains its default value throughout the
training procedure, remaining unchanged. The samplers for all the `Reset Parameters`
are handled by a **Sampler Manager**, which also handles the generation of new
values for the reset parameters when needed.
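Conceptually (a sketch only, not the trainer's code), the Sampler Manager's role during training looks something like the following, where `sample_all()` is the sampler interface shown later on this page and the `env.reset(config=...)` call accepting a dictionary of reset parameter values is an assumption:

```python
def maybe_resample(step, resampling_interval, samplers, env):
    # Every `resampling-interval` steps, draw a fresh value for each
    # Reset Parameter and reset the environment with the new configuration.
    if step % resampling_interval == 0:
        new_config = {name: sampler.sample_all() for name, sampler in samplers.items()}
        env.reset(config=new_config)  # assumed reset signature
```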
To set up the Sampler Manager, we create a YAML file that specifies how we wish to
generate new samples for each `Reset Parameter`. In this file, we specify the samplers and the
`resampling-interval` (the number of simulation steps after which reset parameters are
resampled). Below is an example of a sampler file for the 3D ball environment.

```yaml
resampling-interval: 5000

mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10

gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]

scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```

Below is an explanation of the fields in the above example.

* `resampling-interval` - Specifies the number of steps for the agent to
  train under a particular environment configuration before resetting the
  environment with a new sample of `Reset Parameters`.

* `Reset Parameter` - Name of the `Reset Parameter`, such as `mass`, `gravity` or `scale`. This should match the name
  specified in the academy of the intended environment for which the agent is
  being trained. If a parameter specified in the file doesn't exist in the
  environment, then this parameter will be ignored. Within each `Reset Parameter`, specify:

  * `sampler-type` - The sampler type to use for the `Reset Parameter`.
    This is a string that should exist in the `Sampler Factory` (explained
    below).

  * `sampler-type-sub-arguments` - The sub-arguments that depend on the `sampler-type`.
    In the example above, this would correspond to the `intervals`
    under the `sampler-type` `"multirange_uniform"` for the `Reset Parameter` called `gravity`.
    The key name should match the name of the corresponding argument in the sampler definition.
    (See below.)

The Sampler Manager allocates a sampler type for each `Reset Parameter` by using the *Sampler Factory*,
which maintains a dictionary mapping of string keys to sampler objects. The sampler types
that can be used for each `Reset Parameter` are those available in the Sampler Factory.

### Included Sampler Types

Below is a list of the `sampler-type` values included as part of the toolkit.

* `uniform` - Uniform sampler
  * Uniformly samples a single float value between defined endpoints.
    The sub-arguments for this sampler to specify the interval
    endpoints are listed below. The sampling is done in the range of
    [`min_value`, `max_value`).

  * **sub-arguments** - `min_value`, `max_value`

* `gaussian` - Gaussian sampler
  * Samples a single float value from the distribution characterized by
    the mean and standard deviation. The sub-arguments that specify the
    gaussian distribution to use are listed below.

  * **sub-arguments** - `mean`, `st_dev`

* `multirange_uniform` - Multirange uniform sampler
  * Uniformly samples a single float value between the specified intervals.
    Samples by first performing a weighted pick of an interval from the list
    of intervals (weighted based on interval width) and then sampling uniformly
    from the selected interval (half-closed interval, same as the uniform
    sampler); a sketch of this scheme appears after this list. This sampler can
    take an arbitrary number of intervals in a list in the following format:
    [[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...]

  * **sub-arguments** - `intervals`
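As a rough illustration of that interval weighting (not the toolkit's implementation), a multirange uniform draw could look like:

```python
import numpy as np

def multirange_uniform(intervals):
    # Pick an interval with probability proportional to its width,
    # then sample uniformly within the chosen interval.
    widths = np.array([high - low for low, high in intervals], dtype=float)
    index = np.random.choice(len(intervals), p=widths / widths.sum())
    low, high = intervals[index]
    return np.random.uniform(low, high)

print(multirange_uniform([[7, 10], [15, 20]]))  # e.g. the `gravity` intervals above
```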
The implementation of the samplers can be found at `ml-agents-envs/mlagents/envs/sampler_class.py`.

### Defining a New Sampler Type

If you want to define your own sampler type, you must first inherit the *Sampler*
base class (included in the `sampler_class` file) and preserve the interface.
Once the class for the required method is specified, it must be registered in the Sampler Factory.

This is done by calling the *register_sampler* method of the Sampler Factory. The command
is as follows:

`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)`

Once the Sampler Factory reflects the new registration, the new sampler type can be used to sample any
`Reset Parameter`. For example, let's say a new sampler type was implemented as below and we register
the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory.

```python
import numpy as np

from mlagents.envs.sampler_class import Sampler  # base class from the path given above


class CustomSampler(Sampler):

    def __init__(self, argA, argB, argC):
        self.possible_vals = [argA, argB, argC]

    def sample_all(self):
        return np.random.choice(self.possible_vals)
```

Now we need to specify the new sampler type in the sampler YAML file. For example, we use this new
sampler type for the `Reset Parameter` *mass*.

```yaml
mass:
    sampler-type: "custom-sampler"
    argB: 1
    argA: 2
    argC: 3
```
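Putting those two steps together, the registration call mirroring the command above might look like this; the exact import location of `SamplerFactory` (and whether the class itself or an instance is registered) is an assumption based on the path and signature given earlier:

```python
from mlagents.envs.sampler_class import SamplerFactory  # assumed location

# Register under the same string key used in the YAML file above.
SamplerFactory.register_sampler("custom-sampler", CustomSampler)
```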
### Training with Generalization Using Reset Parameters

After the sampler YAML file is defined, we proceed by launching `mlagents-learn` and specifying
our configured sampler file with the `--sampler` flag. For example, if we wanted to train the
3D ball agent with generalization using `Reset Parameters` with the `config/3dball_generalize.yaml`
sampling setup, we would run

```sh
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_generalize.yaml --run-id=3D-Ball-generalization --train
```

We can observe progress and metrics via Tensorboard.
**docs/Training-RewardSignals.md**

# Reward Signals

In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
external to the learning algorithm.

Rewards, however, can also be defined outside of the environment, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can
be a mix of extrinsic and intrinsic reward signals.

ML-Agents allows reward signals to be defined in a modular way, and we provide three reward
signals that can be mixed and matched to help shape your agent's behavior. The `extrinsic` Reward
Signal represents the rewards defined in your environment, and is enabled by default.
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse.

## Enabling Reward Signals

Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the brain name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward, you would define your `reward_signals` as follows:

```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
```

Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.

## Reward Signal Types

### The Extrinsic Reward Signal

The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.

#### Strength

`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.

Typical Range: `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

Typical Range: `0.8` - `0.995`

### The Curiosity Reward Signal

The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
  uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
  next encoded observation.

The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.

For more information, see
* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/

#### Strength

In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise, it should not be so large that it overwhelms the extrinsic reward signal.

Typical Range: `0.001` - `0.1`

#### Gamma

`gamma` corresponds to the discount factor for future rewards.

Typical Range: `0.8` - `0.995`

#### Encoding Size

`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but not so small that it prevents the model from learning to predict the
next observation.

Default Value: `64`

Typical Range: `64` - `256`

#### Learning Rate

`learning_rate` is the learning rate used to update the intrinsic curiosity module.
It should typically be decreased if training is unstable and the curiosity loss is unstable.

Default Value: `3e-4`

Typical Range: `1e-5` - `1e-3`

### The GAIL Reward Signal

GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an
imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs
(Generative Adversarial Networks). In this framework, a second neural network, the
discriminator, is taught to distinguish whether an observation/action is from a demonstration, or
produced by the agent. This discriminator can then examine a new observation/action and provide it a
reward based on how close it believes this new observation/action is to the provided demonstrations.

At each training step, the agent tries to learn how to maximize this reward. Then, the
discriminator is trained to better distinguish between demonstrations and agent state/actions.
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.

This approach, when compared to [Behavioral Cloning](Training-BehavioralCloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It
is also especially effective when combined with an Extrinsic signal, but can also be used
independently to purely learn from demonstrations.

Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.

#### Strength

`strength` is the factor by which to multiply the raw reward. Note that when using GAIL
with an Extrinsic Signal, this value should be set lower if your demonstrations are
suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic
rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases.

Typical Range: `0.01` - `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards.

Typical Range: `0.8` - `0.9`

#### Demo Path

`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the
[imitation learning guide](Training-Imitation-Learning.md).

#### Encoding Size

`encoding_size` corresponds to the size of the hidden layer used by the discriminator.
This value should be small enough to encourage the discriminator to compress the original
observation, but not so small that it prevents it from learning to differentiate between
demonstrated and actual behavior. Dramatically increasing this size will also negatively affect
training times.

Default Value: `64`

Typical Range: `64` - `256`

#### Learning Rate

`learning_rate` is the learning rate used to update the discriminator.
It should typically be decreased if training is unstable and the GAIL loss is unstable.

Default Value: `3e-4`

Typical Range: `1e-5` - `1e-3`

#### Use Actions

`use_actions` determines whether the discriminator should discriminate based on both
observations and actions, or just observations. Set to `True` if you want the agent to
mimic the actions from the demonstrations, and `False` if you'd rather have the agent
visit the same states as in the demonstrations but with possibly different actions.
Setting it to `False` is more likely to be stable, especially with imperfect demonstrations,
but may learn more slowly.

Default Value: `false`

#### (Optional) Samples Per Update

`samples_per_update` is the maximum number of samples to use during each discriminator update. You may
want to lower this if your buffer size is very large to avoid overfitting the discriminator on current data.
If set to 0, we will use the minimum of the buffer size and the number of demonstration samples.

Default Value: `0`

Typical Range: Approximately equal to [`buffer_size`](Training-PPO.md)

#### (Optional) Variational Discriminator Bottleneck

`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the
GAIL discriminator. This forces the discriminator to learn a more general representation
and reduces its tendency to be "too good" at discriminating, making learning more stable.
However, it does increase training time. Enable this if you notice your imitation learning is
unstable, or unable to learn the task at hand.

Default Value: `false`
**docs/Training-Generalization-Learning.md**

# Training Generalized Reinforcement Learning Agents

Reinforcement learning has a rather unique setup as opposed to supervised and
unsupervised learning. Agents here are trained and tested on the same exact
environment, which is analogous to a model being trained and tested on an
identical dataset in supervised learning! This setting results in overfitting;
the inability of the agent to generalize to slight tweaks or variations in the
environment. This is problematic in instances when environments are randomly
instantiated with varying properties. To make agents robust, one approach is to
train an agent over multiple variations of the environment. The agent is
trained in this approach with the intent that it learns to adapt its performance
to future unseen variations of the environment.

Ball scale of 0.5 | Ball scale of 4
:-------------------------:|:-------------------------:
![](images/3dball_small.png) | ![](images/3dball_big.png)

_Variations of the 3D Ball environment._

To vary environments, we first decide what parameters to vary in an
environment. We call these parameters `Reset Parameters`. In the 3D ball
environment example displayed in the figure above, the reset parameters are
`gravity`, `ball_mass` and `ball_scale`.

## How-to

For generalization training, we need to provide a way to modify the environment
by supplying a set of reset parameters, and vary them over time. This provision
can be done either deterministically or randomly.

This is done by assigning each reset parameter a sampler, which samples a reset
parameter value (such as a uniform sampler). If a sampler isn't provided for a
reset parameter, the parameter maintains its default value throughout the
training procedure, remaining unchanged. The samplers for all the reset parameters
are handled by a **Sampler Manager**, which also handles the generation of new
values for the reset parameters when needed.

To set up the Sampler Manager, we set up a YAML file that specifies how we wish to
generate new samples. In this file, we specify the samplers and the
`resampling-interval` (the number of simulation steps after which reset parameters are
resampled). Below is an example of a sampler file for the 3D ball environment.

```yaml
resampling-interval: 5000

mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10

gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]

scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```

* `resampling-interval` (int) - Specifies the number of steps for the agent to
  train under a particular environment configuration before resetting the
  environment with a new sample of reset parameters.

* `parameter_name` - Name of the reset parameter. This should match the name
  specified in the academy of the intended environment for which the agent is
  being trained. If a parameter specified in the file doesn't exist in the
  environment, then this specification will be ignored.

* `sampler-type` - Specifies the sampler type to use for the reset parameter.
  This is a string that should exist in the `Sampler Factory` (explained
  below).

* `sub-arguments` - Specify the characteristic parameters for the sampler.
  In the example sampler file above, this would correspond to the `intervals`
  key under the `multirange_uniform` sampler for the gravity reset parameter.
  The key name should match the name of the corresponding argument in the sampler
  definition (see "Defining a new sampler method" below).

The Sampler Manager allocates a sampler for a reset parameter by using the *Sampler Factory*, which maintains a dictionary mapping of string keys to sampler objects. The samplers available for reset parameter resampling are those registered in the Sampler Factory.

#### Possible Sampler Types

The currently implemented samplers that can be used with the `sampler-type` argument are:

* `uniform` - Uniform sampler
  * Uniformly samples a single float value between defined endpoints.
    The sub-arguments for this sampler to specify the interval
    endpoints are listed below. The sampling is done in the range of
    [`min_value`, `max_value`).

  * **sub-arguments** - `min_value`, `max_value`

* `gaussian` - Gaussian sampler
  * Samples a single float value from the distribution characterized by
    the mean and standard deviation. The sub-arguments that specify the
    gaussian distribution to use are listed below.

  * **sub-arguments** - `mean`, `st_dev`

* `multirange_uniform` - Multirange Uniform sampler
  * Uniformly samples a single float value between the specified intervals.
    Samples by first performing a weighted pick of an interval from the list
    of intervals (weighted based on interval width) and then sampling uniformly
    from the selected interval (half-closed interval, same as the uniform
    sampler). This sampler can take an arbitrary number of intervals in a
    list in the following format:
    [[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...]

  * **sub-arguments** - `intervals`

The implementation of the samplers can be found at `ml-agents-envs/mlagents/envs/sampler_class.py`.

### Defining a new sampler method

Custom sampling techniques must inherit from the *Sampler* base class (included in the `sampler_class` file) and preserve the interface. Once the class for the required method is specified, it must be registered in the Sampler Factory.

This is done by calling the *register_sampler* method of the Sampler Factory. The command is as follows:

`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)`

Once the Sampler Factory reflects the new registration, the custom sampler can be used for resampling a reset parameter. For demonstration, let's say our sampler was implemented as below, and we register the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory.

```python
class CustomSampler(Sampler):

    def __init__(self, argA, argB, argC):
        self.possible_vals = [argA, argB, argC]

    def sample_all(self):
        return np.random.choice(self.possible_vals)
```

Now we need to specify this sampler in the sampler file. Let's say we wish to use this sampler for the reset parameter *mass*; the sampler file would then specify the following for mass (any order of the sub-arguments is valid).

```yaml
mass:
    sampler-type: "custom-sampler"
    argB: 1
    argA: 2
    argC: 3
```

With the sampler file set up, we can proceed to train our agent as explained in the next section.

### Training with Generalization Learning

We first begin with setting up the sampler file. After the sampler file is defined and configured, we proceed by launching `mlagents-learn` and specifying our configured sampler file with the `--sampler` flag. To demonstrate, if we wanted to train a 3D ball agent with generalization using the `config/3dball_generalize.yaml` sampling setup, we can run

```sh
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_generalize.yaml --run-id=3D-Ball-generalization --train
```

We can observe progress and metrics via Tensorboard.