
Release 0.9.0 docs checklist and cleanup - v2 (#2372)

* Included explicit version # for ZN

* added explicit version for KR docs

* minor fix in installation doc

* Consistency with numbers for reset parameters

* Removed extra verbiage. minor consistency

* minor consistency

* Cleaned up IL language

* moved parameter sampling above in list

* Cleaned up language in Env Parameter sampling

* Cleaned up migrating content

* updated consistency of Reset Parameter Sampling

* Rename Training-Generalization-Learning.md to Training-Generalization-Reinforcement-Learning-Agents.md

* Updated doc link for generalization

* Rename Training-Generalization-Reinforcement-Learning-Agents.md to Training-Generalized-Reinforcement-Learning-Agents.md

* Re-wrote the intro paragraph for generalization

* add titles, cleaned up language for reset params

* Update Training-Generalized-Reinforcement-Learning-Agents.md

* cleanup of generalization doc

* More cleanu...
/develop-generalizationTraining-TrainerController
Jeffrey Shih, 5 years ago
Current commit: 728afebf
18 files changed, with 511 insertions and 465 deletions
  1. docs/Installation.md (2)
  2. docs/Learning-Environment-Examples.md (36)
  3. docs/ML-Agents-Overview.md (26)
  4. docs/Migrating.md (10)
  5. docs/Readme.md (1)
  6. docs/Training-Imitation-Learning.md (23)
  7. docs/Training-ML-Agents.md (56)
  8. docs/Training-PPO.md (35)
  9. docs/localized/KR/README.md (2)
  10. docs/localized/zh-CN/README.md (2)
  11. docs/Profiling-Python.md (7)
  12. ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (1)
  13. docs/Reward-Signals.md (236)
  14. docs/Training-Generalized-Reinforcement-Learning-Agents.md (171)
  15. docs/Training-RewardSignals.md (211)
  16. docs/Training-Generalization-Learning.md (157)
  17. /docs/Training-Behavioral-Cloning.md (0, renamed)
  18. /docs/Profiling-Python.md (0, renamed)

docs/Installation.md (2)


`mlagents-learn --help`, after which you will see the Unity logo and the command line
parameters you can use with `mlagents-learn`.
By installing the `mlagents` package, its dependencies listed in the [setup.py file](../ml-agents/setup.py) are also installed.
By installing the `mlagents` package, the dependencies listed in the [setup.py file](../ml-agents/setup.py) are also installed.
Some of the primary dependencies include:
- [TensorFlow](Background-TensorFlow.md) (Requires a CPU w/ AVX support)

docs/Learning-Environment-Examples.md (36)


* Vector Observation space: One variable corresponding to current state.
* Vector Action space: (Discrete) Two possible actions (Move left, move
right).
* Visual Observations: None.
* Visual Observations: None
* Reset Parameters: None
* Benchmark Mean Reward: 0.94

* Vector Action space: (Continuous) Size of 2, with one value corresponding to
X-rotation, and the other to Z-rotation.
* Visual Observations: None.
* Reset Parameters: Three, corresponding to the following:
* Reset Parameters: Three
* scale: Specifies the scale of the ball in the 3 dimensions (equal across the three dimensions)
* Default: 1
* Recommended Minimum: 0.2

of ball and racket.
* Vector Action space: (Continuous) Size of 2, corresponding to movement
toward net or away from net, and jumping.
* Visual Observations: None.
* Reset Parameters: Three, corresponding to the following:
* Visual Observations: None
* Reset Parameters: Three
* angle: Angle of the racket from the vertical (Y) axis.
* Default: 55
* Recommended Minimum: 35

`VisualPushBlock` scene. __The visual observation version of
this environment does not train with the provided default
training parameters.__
* Reset Parameters: Four, corresponding to the following:
* Reset Parameters: Four
* block_scale: Scale of the block along the x and z dimensions
* Default: 2
* Recommended Minimum: 0.5

* Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
* Side Motion (3 possible actions: Left, Right, No Action)
* Jump (2 possible actions: Jump, No Action)
* Visual Observations: None.
* Reset Parameters: 4, corresponding to the height of the possible walls.
* Visual Observations: None
* Reset Parameters: Four
* Benchmark Mean Reward (Big & Small Wall Brain): 0.8
## [Reacher](https://youtu.be/2N9EoF6pQyE)

* Vector Action space: (Continuous) Size of 4, corresponding to torque
applicable to two joints.
* Visual Observations: None.
* Reset Parameters: Five, corresponding to the following
* Reset Parameters: Five
* goal_size: radius of the goal zone
* Default: 5
* Recommended Minimum: 1

angular acceleration of the body.
* Vector Action space: (Continuous) Size of 20, corresponding to target
rotations for joints.
* Visual Observations: None.
* Visual Observations: None
* Reset Parameters: None
* Benchmark Mean Reward for `CrawlerStaticTarget`: 2000
* Benchmark Mean Reward for `CrawlerDynamicTarget`: 400

`VisualBanana` scene. __The visual observation version of
this environment does not train with the provided default
training parameters.__
* Reset Parameters: Two, corresponding to the following
* Reset Parameters: Two
* laser_length: Length of the laser used by the agent
* Default: 1
* Recommended Minimum: 0.2

`VisualHallway` scene. __The visual observation version of
this environment does not train with the provided default
training parameters.__
* Reset Parameters: None.
* Reset Parameters: None
* Benchmark Mean Reward: 0.7
* To speed up training, you can enable curiosity by adding `use_curiosity: true` in `config/trainer_config.yaml`
* Optional Imitation Learning scene: `HallwayIL`.

banana.
* Vector Action space: (Continuous) 3 corresponding to agent force applied for
the jump.
* Visual Observations: None.
* Reset Parameters: Two, corresponding to the following
* Visual Observations: None
* Reset Parameters: Two
* banana_scale: The scale of the banana in the 3 dimensions
* Default: 150
* Recommended Minimum: 50

* Striker: 6 actions corresponding to forward, backward, sideways movement,
as well as rotation.
* Goalie: 4 actions corresponding to forward, backward, sideways movement.
* Visual Observations: None.
* Reset Parameters: Two, corresponding to the following:
* Visual Observations: None
* Reset Parameters: Two
* ball_scale: Specifies the scale of the ball in the 3 dimensions (equal across the three dimensions)
* Default: 7.5
* Recommended minimum: 4

velocity, and angular velocities of each limb, along with goal direction.
* Vector Action space: (Continuous) Size of 39, corresponding to target
rotations applicable to the joints.
* Visual Observations: None.
* Reset Parameters: Four, corresponding to the following
* Visual Observations: None
* Reset Parameters: Four
* gravity: Magnitude of gravity
* Default: 9.81
* Recommended Minimum:

`VisualPyramids` scene. __The visual observation version of
this environment does not train with the provided default
training parameters.__
* Reset Parameters: None.
* Reset Parameters: None
* Optional Imitation Learning scene: `PyramidsIL`.
* Benchmark Mean Reward: 1.75

docs/ML-Agents-Overview.md (26)


actions from the human player to learn a policy. [Video
Link](https://youtu.be/kpb8ZkMBFYs).
ML-Agents provides ways to both learn directly from demonstrations as well as
use demonstrations to help speed up reward-based training, and two algorithms to do
so (Generative Adversarial Imitation Learning and Behavioral Cloning). The
[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
covers these features in more depth.
The toolkit provides a way to learn directly from demonstrations, as well as use them
to help speed up reward-based training (RL). We include two algorithms called
Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL). The
[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial covers these
features in more depth.
## Flexible Training Scenarios

learn more about adding visual observations to an agent
[here](Learning-Environment-Design-Agents.md#multiple-visual-observations).
- **Training with Reset Parameter Sampling** - To train agents to adapt
to changes in their environment (i.e., generalization), the agent should be exposed
to several variations of the environment. Similar to Curriculum Learning,
where environments become more difficult as the agent learns, the toolkit provides
a way to randomly sample Reset Parameters of the environment during training. See
[Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md)
to learn more about this feature.
- **Broadcasting** - As discussed earlier, a Learning Brain sends the
observations for all its Agents to the Python API when dragged into the
Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful

particularly when debugging agent behaviors. You can learn more about using
the broadcasting feature
[here](Learning-Environment-Design-Brains.md#using-the-broadcast-feature).
- **Training with Environment Parameter Sampling** - To train agents to be robust
to changes in their environment (i.e., generalization), the agent should be exposed
to a variety of environment variations. Similarly to Curriculum Learning, which
allows environments to get more difficult as the agent learns, we also provide
a way to randomly resample aspects of the environment during training. See
[Training with Environment Parameter Sampling](Training-Generalization-Learning.md)
to learn more about this feature.
- **Docker Set-up (Experimental)** - To facilitate setting up ML-Agents without
installing Python or TensorFlow directly, we provide a

docs/Migrating.md (10)


### Important Changes
* We have changed the way reward signals (including Curiosity) are defined in the
`trainer_config.yaml`.
* When using multiple environments, every "step" as recorded in TensorBoard and
printed in the command line now corresponds to a single step of a single environment.
* When using multiple environments, every "step" is recorded in TensorBoard.
* The steps in the command line console correspond to a single step of a single environment.
* `gamma` - Define a new `extrinsic` reward signal and set its `gamma` to your new gamma.
* `use_curiosity`, `curiosity_strength`, `curiosity_enc_size` - Define a `curiosity` reward signal
* `gamma`: Define a new `extrinsic` reward signal and set its `gamma` to your new gamma.
* `use_curiosity`, `curiosity_strength`, `curiosity_enc_size`: Define a `curiosity` reward signal
See [Reward Signals](Training-RewardSignals.md) for more information on defining reward signals.
See [Reward Signals](Reward-Signals.md) for more information on defining reward signals.
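The old and new layouts map onto each other roughly as sketched below (a minimal example; the values are illustrative rather than recommended defaults):

```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99        # formerly the top-level `gamma`
    curiosity:
        strength: 0.01     # formerly `curiosity_strength`
        gamma: 0.99
        encoding_size: 128 # formerly `curiosity_enc_size`
```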
* TensorBoards generated when running multiple environments in v0.8 are not comparable to those generated in
v0.9 in terms of step count. Multiply your v0.8 step count by `num_envs` for an approximate comparison.
You may need to change `max_steps` in your config as appropriate as well.

docs/Readme.md (1)


* [Training with Curriculum Learning](Training-Curriculum-Learning.md)
* [Training with Imitation Learning](Training-Imitation-Learning.md)
* [Training with LSTM](Feature-Memory.md)
* [Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md)
* [Training on the Cloud with Amazon Web Services](Training-on-Amazon-Web-Service.md)
* [Training on the Cloud with Microsoft Azure](Training-on-Microsoft-Azure.md)
* [Training Using Concurrent Unity Instances](Training-Using-Concurrent-Unity-Instances.md)

docs/Training-Imitation-Learning.md (23)


# Imitation Learning
# Training with Imitation Learning
It is often more intuitive to simply demonstrate the behavior we want an agent
to perform, rather than attempting to have it learn via trial-and-error methods.

Imitation learning can also be used to help reinforcement learning. Especially in
environments with sparse (i.e., infrequent or rare) rewards, the agent may never see
the reward and thus not learn from it. Curiosity helps the agent explore, but in some cases
it is easier to just show the agent how to achieve the reward. In these cases,
imitation learning can dramatically reduce the time it takes to solve the environment.
the reward and thus not learn from it. Curiosity (which is available in the toolkit)
helps the agent explore, but in some cases
it is easier to show the agent how to achieve the reward. In these cases,
imitation learning combined with reinforcement learning can dramatically
reduce the time the agent takes to solve the environment.
just 6 episodes of demonstrations can reduce training steps by more than 4 times.
using 6 episodes of demonstrations can reduce training steps by more than 4 times.
See PreTraining + GAIL + Curiosity + RL below.
width="350" border="0" />
width="700" border="0" />
ML-Agents provides several ways to learn from demonstrations.
The ML-Agents toolkit provides several ways to learn from demonstrations.
[GAIL reward signal](Training-RewardSignals.md#the-gail-reward-signal). GAIL can be
[GAIL reward signal](Reward-Signals.md#the-gail-reward-signal). GAIL can be
used with or without environment rewards, and works well when there are a limited
number of demonstrations.
* To help bootstrap reinforcement learning, you can enable

[Behavioral Cloning](Training-BehavioralCloning.md) trainer. Behavioral Cloning can be
[Behavioral Cloning](Training-Behavioral-Cloning.md) trainer. Behavioral Cloning can be
used offline and online (in-editor), and learns very quickly. However, it usually is ineffective
on more complex environments without a large number of demonstrations.

and save them as assets. These demonstrations contain information on the
observations, actions, and rewards for a given agent during the recording session.
They can be managed from the Editor, as well as used for training with Offline
Behavioral Cloning (see below).
Behavioral Cloning and GAIL.
In order to record demonstrations from an agent, add the `Demonstration Recorder`
component to a GameObject in the scene which contains an `Agent` component.

docs/Training-ML-Agents.md (56)


[training_config.yaml file](#training-config-file) with a text editor to set
different values.
### Command line training options
### Command Line Training Options
* `--env=<env>` - Specify an executable environment to train.
* `--curriculum=<file>` Specify a curriculum JSON file for defining the
* `--env=<env>`: Specify an executable environment to train.
* `--curriculum=<file>`: Specify a curriculum JSON file for defining the
* `--sampler=<file>` - Specify a sampler YAML file for defining the
* `--sampler=<file>`: Specify a sampler YAML file for defining the
Training](Training-Generalization-Learning.md) for more information.
* `--keep-checkpoints=<n>` Specify the maximum number of model checkpoints to
Training](Training-Generalized-Reinforcement-Learning-Agents.md) for more information.
* `--keep-checkpoints=<n>`: Specify the maximum number of model checkpoints to
* `--lesson=<n>` Specify which lesson to start with when performing curriculum
* `--lesson=<n>`: Specify which lesson to start with when performing curriculum
* `--load` If set, the training code loads an already trained model to
* `--load`: If set, the training code loads an already trained model to
* `--num-runs=<n>` - Sets the number of concurrent training sessions to perform.
* `--num-runs=<n>`: Sets the number of concurrent training sessions to perform.
* `--run-id=<path>` Specifies an identifier for each training run. This
* `--run-id=<path>`: Specifies an identifier for each training run. This
* `--save-freq=<n>` Specifies how often (in steps) to save the model during
* `--save-freq=<n>`: Specifies how often (in steps) to save the model during
* `--seed=<n>` Specifies a number to use as a seed for the random number
* `--seed=<n>`: Specifies a number to use as a seed for the random number
* `--slow` Specify this option to run the Unity environment at normal, game
* `--slow`: Specify this option to run the Unity environment at normal, game
* `--train` Specifies whether to train model or only run in inference mode.
* `--train`: Specifies whether to train model or only run in inference mode.
* `--num-envs=<n>` - Specifies the number of concurrent Unity environment instances to collect
* `--num-envs=<n>`: Specifies the number of concurrent Unity environment instances to collect
* `--base-port` - Specifies the starting port. Each concurrent Unity environment instance will get assigned a port sequentially, starting from the `base-port`. Each instance will use the port `(base_port + worker_id)`, where the `worker_id` is sequential IDs given to each instance from 0 to `num_envs - 1`. Default is 5005.
* `--docker-target-name=<dt>` The Docker Volume on which to store curriculum,
* `--base-port`: Specifies the starting port. Each concurrent Unity environment instance will get assigned a port sequentially, starting from the `base-port`. Each instance will use the port `(base_port + worker_id)`, where the `worker_id` is sequential IDs given to each instance from 0 to `num_envs - 1`. Default is 5005.
* `--docker-target-name=<dt>`: The Docker Volume on which to store curriculum,
* `--no-graphics` - Specify this option to run the Unity executable in
* `--no-graphics`: Specify this option to run the Unity executable in
* `--debug` - Specify this option to enable debug-level logging for some parts of the code.
* `--debug`: Specify this option to enable debug-level logging for some parts of the code.
### Training config file
### Training Config File
The training config files `config/trainer_config.yaml`,
`config/online_bc_config.yaml` and `config/offline_bc_config.yaml` specify the

| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
| num_layers | The number of hidden layers in the neural network. | PPO, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Training-RewardSignals.md) for configuration options. | PPO |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, BC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, (online)BC |
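For orientation, these options all live under a brain-name section of the trainer config; a minimal sketch (the brain name `PushBlockLearning` and all values here are purely illustrative) could look like:

```yaml
PushBlockLearning:        # hypothetical brain name
    num_layers: 2
    num_epoch: 3
    sequence_length: 64
    summary_freq: 1000
    time_horizon: 64
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
```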

* [Training with PPO](Training-PPO.md)
* [Using Recurrent Neural Networks](Feature-Memory.md)
* [Training with Curriculum Learning](Training-Curriculum-Learning.md)
* [Training with Environment Parameter Sampling](Training-Generalization-Learning.md)
* [Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md)
You can also compare the
[example environments](Learning-Environment-Examples.md)

### Output metrics
Trainer Metrics are logged to a CSV stored in the `summaries` directory. The metrics stored are:
### Debugging and Profiling
If you enable the `--debug` flag in the command line, the trainer metrics are logged to a CSV file
stored in the `summaries` directory. The metrics stored are:
* mean return
[Profiling](Profiling.md) information is also saved in the `summaries` directory.
Additionally, we have included basic [Profiling in Python](Profiling-Python.md) as part of the toolkit.
This information is also saved in the `summaries` directory.

docs/Training-PPO.md (35)


Python process (communicating with the running Unity application over a socket).
To train an agent, you will need to provide the agent one or more reward signals which
the agent should attempt to maximize. See [Reward Signals](Training-RewardSignals.md)
the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md)
for the available reward signals and the corresponding hyperparameters.
See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the

For information about imitation learning from demonstrations, see
[Training with Imitation Learning](Training-Imitation-Learning.md).
## Best Practices when training with PPO
## Best Practices Training with PPO
Successfully training a Reinforcement Learning model often involves tuning the
training hyperparameters. This guide contains some best practices for tuning the

the agent for exploring new states, rather than just when an explicit reward is given.
Furthermore, we could mix reward signals to help the learning process.
`reward_signals` provides a section to define [reward signals.](Training-RewardSignals.md)
ML-Agents provides two reward signals by default, the Extrinsic (environment) reward, and the
Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
environments.
Using `reward_signals` allows you to define [reward signals.](Reward-Signals.md)
The ML-Agents toolkit provides three reward signals by default, the Extrinsic (environment)
reward signal, the Curiosity reward signal, which can be used to encourage exploration in
sparse extrinsic reward environments, and the GAIL reward signal. Please see [Reward Signals](Reward-Signals.md)
for additional details.
### Lambda

`vis_encode_type` corresponds to the encoder type for encoding visual observations.
Valid options include:
* `simple` (default): a simple encoder which consists of two convolutional layers
* `nature_cnn`: CNN implementation proposed by Mnih et al.(https://www.nature.com/articles/nature14236),
consisting of three stacked layers, each with two residual blocks, making a
much larger network than the other two.
Options: `simple`, `nature_cnn`, `resnet`

## (Optional) Pretraining Using Demonstrations
In some cases, you might want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Pretraining adds
training operations that mimic a demonstration rather than attempting to maximize reward.
It is essentially equivalent to running [behavioral cloning](./Training-BehavioralCloning.md)
It is essentially equivalent to running [behavioral cloning](Training-Behavioral-Cloning.md)
in-line with PPO.
To use pretraining, add a `pretraining` section to the trainer_config. For instance:
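The example block itself is cut off in this hunk; based on the parameters described below, a minimal sketch (values illustrative) would be:

```yaml
pretraining:
    demo_path: ./demos/ExpertPyramid.demo  # path to your recorded demonstrations
    strength: 0.5                          # imitation influence relative to PPO (illustrative)
    steps: 10000                           # anneal the pretrainer over this many steps
```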

`strength` corresponds to the learning rate of the imitation relative to the learning
rate of PPO, and roughly corresponds to how strongly we allow the behavioral cloning
to influence the policy.
`demo_path` is the path to your `.demo` file or directory of `.demo` files.
During pretraining, it is often desirable to stop using demonstrations after the agent has
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.
### (Optional) Batch Size

`samples_per_update` is the maximum number of samples
to use during each imitation update. You may want to lower this if your demonstration
dataset is very large to avoid overfitting the policy on demonstrations. Set to 0
to train over all of the demonstrations at each update step.
Default Value: `0` (all)

docs/localized/KR/README.md (2)


<img src="docs/images/image-banner.png" align="middle" width="3000"/>
# Unity ML-Agents Toolkit (Beta)
# Unity ML-Agents Toolkit (Beta) v0.9
[![docs badge](https://img.shields.io/badge/docs-reference-blue.svg)](docs/Readme.md)
[![license badge](https://img.shields.io/badge/license-Apache--2.0-green.svg)](LICENSE)

docs/localized/zh-CN/README.md (2)


<img src="docs/images/unity-wide.png" align="middle" width="3000"/>
# Unity ML-Agents Toolkit (Beta)
# Unity ML-Agents Toolkit (Beta) v0.3.1
**Note:** This document is a partial translation of the v0.3 documentation and is not updated as the English documentation changes. For the most complete and up-to-date English documentation, see [here](https://github.com/Unity-Technologies/ml-agents).

docs/Profiling-Python.md (7)


# Profiling ML-Agents in Python
# Profiling in Python
ML-Agents provides a lightweight profiling system, in order to identify hotspots in the training process and help spot
regressions from changes.
As part of the ML-Agents toolkit, we provide a lightweight profiling system,
in order to identify hotspots in the training process and help spot regressions from changes.
Timers are hierarchical, meaning that the time tracked in a block of code can be further split into other blocks if
desired. This also means that a function that is called from multiple places in the code will appear in multiple

```
You can also use the `hierarchical_timer` context manager.
``` python
with hierarchical_timer("communicator.exchange"):

ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (1)


reward multiplied by the strength parameter
:param gamma: The time discounting factor used for this reward.
:param demo_path: The path to the demonstration file
:param num_epoch: The number of epochs to train over the training buffer for the discriminator.
:param encoding_size: The size of the hidden layers of the discriminator
:param learning_rate: The Learning Rate used during GAIL updates.
:param samples_per_update: The maximum number of samples to update during GAIL updates.

docs/Reward-Signals.md (236)


# Reward Signals
In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
outside of the learning algorithm.
Rewards, however, can be defined outside of the environment as well, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can
be a mix of extrinsic and intrinsic reward signals.
ML-Agents allows reward signals to be defined in a modular way, and we provide three reward
signals that can be mixed and matched to help shape your agent's behavior. The `extrinsic` Reward
Signal represents the rewards defined in your environment, and is enabled by default.
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse.
## Enabling Reward Signals
Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml` and `config/gail_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the brain name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward and a GAIL reward signal, you would define your `reward_signals` as follows:
```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
    gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: demos/ExpertPyramid.demo
```
Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.
## Reward Signal Types
As part of the toolkit, we provide three reward signal types that can be configured as hyperparameters: Extrinsic, Curiosity, and GAIL.
### Extrinsic Reward Signal
The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.
#### Strength
`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.
Typical Range: `1.0`
#### Gamma
`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.
Typical Range: `0.8` - `0.995`
### Curiosity Reward Signal
The `curiosity` reward signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
next encoded observation.
The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.
For more information, see
* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/
#### Strength
In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise it should not be too large to overwhelm the extrinsic reward signal.
Typical Range: `0.001` - `0.1`
#### Gamma
`gamma` corresponds to the discount factor for future rewards.
Typical Range: `0.8` - `0.995`
#### (Optional) Encoding Size
`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but also not too small to prevent it from learning to differentiate between
demonstrated and actual behavior.
Default Value: `64`
Typical Range: `64` - `256`
#### (Optional) Learning Rate
`learning_rate` is the learning rate used to update the intrinsic curiosity module.
This should typically be decreased if training is unstable, and the curiosity loss is unstable.
Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
#### (Optional) Num Epochs
`num_epoch` is the number of passes to make through the experience buffer when performing gradient
descent optimization for the ICM. This typically should be set to the same value used for PPO.
Default Value: `3`
Typical Range: `3` - `10`
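Putting the curiosity hyperparameters above together, an entry with the optional settings written out explicitly (using the documented defaults purely for illustration) could look like:

```yaml
curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 64      # default
    learning_rate: 3.0e-4  # documented default of 3e-4
    num_epoch: 3           # default
```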
### GAIL Reward Signal
GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an
imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs
(Generative Adversarial Networks). In this framework, a second neural network, the
discriminator, is taught to distinguish whether an observation/action is from a demonstration or
produced by the agent. This discriminator can then examine a new observation/action and provide it a
reward based on how close it believes this new observation/action is to the provided demonstrations.
At each training step, the agent tries to learn how to maximize this reward. Then, the
discriminator is trained to better distinguish between demonstrations and agent state/actions.
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It
is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can
also be used independently to purely learn from demonstrations.
Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.
#### Strength
`strength` is the factor by which to multiply the raw reward. Note that when using GAIL
with an Extrinsic Signal, this value should be set lower if your demonstrations are
suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic
rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases.
Typical Range: `0.01` - `1.0`
#### Gamma
`gamma` corresponds to the discount factor for future rewards.
Typical Range: `0.8` - `0.9`
#### Demo Path
`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the
[imitation learning guide](Training-Imitation-Learning.md).
#### (Optional) Encoding Size
`encoding_size` corresponds to the size of the hidden layer used by the discriminator.
This value should be small enough to encourage the discriminator to compress the original
observation, but also not too small to prevent it from learning to differentiate between
demonstrated and actual behavior. Dramatically increasing this size will also negatively affect
training times.
Default Value: `64`
Typical Range: `64` - `256`
#### (Optional) Learning Rate
`learning_rate` is the learning rate used to update the discriminator.
This should typically be decreased if training is unstable, and the GAIL loss is unstable.
Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
#### (Optional) Use Actions
`use_actions` determines whether the discriminator should discriminate based on both
observations and actions, or just observations. Set to `True` if you want the agent to
mimic the actions from the demonstrations, and `False` if you'd rather have the agent
visit the same states as in the demonstrations but with possibly different actions.
Setting to `False` is more likely to be stable, especially with imperfect demonstrations,
but may learn slower.
Default Value: `false`
#### (Optional) Variational Discriminator Bottleneck
`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the
GAIL discriminator. This forces the discriminator to learn a more general representation
and reduces its tendency to be "too good" at discriminating, making learning more stable.
However, it does increase training time. Enable this if you notice your imitation learning is
unstable, or unable to learn the task at hand.
Default Value: `false`
#### (Optional) Samples Per Update
`samples_per_update` is the maximum number of samples to use during each discriminator update. You may
want to lower this if your buffer size is very large to avoid overfitting the discriminator on current data.
If set to 0, we will use the minimum of buffer size and the number of demonstration samples.
Default Value: `0`
Typical Range: Approximately equal to [`buffer_size`](Training-PPO.md)
#### (Optional) Num Epochs
`num_epoch` is the number of passes to make through the experience buffer when performing gradient
descent optimization for the discriminator. To avoid overfitting, this typically should be set to
the same value as, or less than, the number used for PPO.
Default Value: `3`
Typical Range: `1` - `10`
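Likewise, a `gail` entry with its optional settings spelled out (defaults shown purely for illustration) might be:

```yaml
gail:
    strength: 0.01
    gamma: 0.99
    demo_path: demos/ExpertPyramid.demo
    encoding_size: 128
    learning_rate: 3.0e-4    # documented default of 3e-4
    use_actions: false       # default
    use_vail: false          # default
    samples_per_update: 0    # default: min of buffer size and demo samples
    num_epoch: 3             # default
```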

docs/Training-Generalized-Reinforcement-Learning-Agents.md (171)


# Training Generalized Reinforcement Learning Agents
One of the challenges of training and testing agents on the same
environment is that the agents tend to overfit. The result is that the
agents are unable to generalize to any tweaks or variations in the environment.
This is analogous to a model being trained and tested on an identical dataset
in supervised learning. This becomes problematic in cases where environments
are randomly instantiated with varying objects or properties.
To make agents robust and generalizable to different environments, the agent
should be trained over multiple variations of the environment. Using this approach
for training, the agent will be better suited to adapt (with higher performance)
to future unseen variations of the environment.
_Example of variations of the 3D Ball environment._
Ball scale of 0.5 | Ball scale of 4
:-------------------------:|:-------------------------:
![](images/3dball_small.png) | ![](images/3dball_big.png)
## Introducing Generalization Using Reset Parameters
To enable variations in the environments, we implemented `Reset Parameters`. We
also included different sampling methods and the ability to create new kinds of
sampling methods for each `Reset Parameter`. In the 3D ball environment example displayed
in the figure above, the reset parameters are `gravity`, `ball_mass` and `ball_scale`.
## How to Enable Generalization Using Reset Parameters
We first need to provide a way to modify the environment by supplying a set of `Reset Parameters`
and varying them over time. This provision can be done either deterministically or randomly.
This is done by assigning each `Reset Parameter` a `sampler-type` (such as a uniform sampler),
which determines how to sample a `Reset
Parameter`. If a `sampler-type` isn't provided for a
`Reset Parameter`, the parameter maintains the default value throughout the
training procedure, remaining unchanged. The samplers for all the `Reset Parameters`
are handled by a **Sampler Manager**, which also handles the generation of new
values for the reset parameters when needed.
To set up the Sampler Manager, we create a YAML file that specifies how we wish to
generate new samples for each `Reset Parameter`. In this file, we specify the samplers and the
`resampling-interval` (the number of simulation steps after which reset parameters are
resampled). Below is an example of a sampler file for the 3D ball environment.
```yaml
resampling-interval: 5000
mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10
gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]
scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```
Below is the explanation of the fields in the above example.
* `resampling-interval` - Specifies the number of steps for the agent to
train under a particular environment configuration before resetting the
environment with a new sample of `Reset Parameters`.
* `Reset Parameter` - Name of the `Reset Parameter` like `mass`, `gravity` and `scale`. This should match the name
specified in the academy of the intended environment for which the agent is
being trained. If a parameter specified in the file doesn't exist in the
environment, then this parameter will be ignored. Within each `Reset Parameter`:
* `sampler-type` - Specify the sampler type to use for the `Reset Parameter`.
This is a string that should exist in the `Sampler Factory` (explained
below).
* `sampler-type-sub-arguments` - Specify the sub-arguments depending on the `sampler-type`.
In the example above, this would correspond to the `intervals`
under the `sampler-type` `"multirange_uniform"` for the `Reset Parameter` called `gravity`.
The key name should match the name of the corresponding argument in the sampler definition.
(See below)
The Sampler Manager allocates a sampler type for each `Reset Parameter` by using the *Sampler Factory*,
which maintains a dictionary mapping of string keys to sampler objects. The available sampler types
to be used for each `Reset Parameter` are listed in the Sampler Factory.
### Included Sampler Types
Below is a list of the `sampler-type` values included as part of the toolkit.
* `uniform` - Uniform sampler
* Uniformly samples a single float value between defined endpoints.
The sub-arguments for this sampler to specify the interval
endpoints are as below. The sampling is done in the range of
[`min_value`, `max_value`).
* **sub-arguments** - `min_value`, `max_value`
* `gaussian` - Gaussian sampler
* Samples a single float value from the distribution characterized by
the mean and standard deviation. The sub-arguments to specify the
gaussian distribution to use are as below.
* **sub-arguments** - `mean`, `st_dev`
* `multirange_uniform` - Multirange uniform sampler
* Uniformly samples a single float value between the specified intervals.
Samples by first performing a weighted pick of an interval from the list
of intervals (weighted based on interval width) and samples uniformly
from the selected interval (half-closed interval, same as the uniform
sampler). This sampler can take an arbitrary number of intervals in a
list in the following format:
[[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...]
* **sub-arguments** - `intervals`
The implementation of the samplers can be found at `ml-agents-envs/mlagents/envs/sampler_class.py`.
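The example sampler file above only uses `uniform` and `multirange_uniform`; a `gaussian` entry, shown here for the `mass` Reset Parameter with illustrative values, follows the same pattern:

```yaml
mass:
    sampler-type: "gaussian"
    mean: 1.0
    st_dev: 0.3
```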
### Defining a New Sampler Type
If you want to define your own sampler type, you must first inherit the *Sampler*
base class (included in the `sampler_class` file) and preserve the interface.
Once the class for the required method is specified, it must be registered in the Sampler Factory.
This can be done by subscribing to the *register_sampler* method of the SamplerFactory. The command
is as follows:
`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)`
Once the Sampler Factory reflects the new registration, the new sampler type can be used for sampling any
`Reset Parameter`. For example, let's say a new sampler type was implemented as below and we register
the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory.
```python
class CustomSampler(Sampler):

    def __init__(self, argA, argB, argC):
        self.possible_vals = [argA, argB, argC]

    def sample_all(self):
        return np.random.choice(self.possible_vals)
```
Now we need to specify the new sampler type in the sampler YAML file. For example, we use this new
sampler type for the `Reset Parameter` *mass*.
```yaml
mass:
    sampler-type: "custom-sampler"
    argB: 1
    argA: 2
    argC: 3
```
### Training with Generalization Using Reset Parameters
After the sampler YAML file is defined, we proceed by launching `mlagents-learn` and specify
our configured sampler file with the `--sampler` flag. For example, if we wanted to train the
3D ball agent with generalization using `Reset Parameters` with `config/3dball_generalize.yaml`
sampling setup, we would run
```sh
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_generalize.yaml --run-id=3D-Ball-generalization --train
```
We can observe progress and metrics via Tensorboard.

docs/Training-RewardSignals.md (211)


# Reward Signals
In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy)
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined
external of the learning algorithm.
Rewards, however, can be defined outside of the enviroment as well, to encourage the agent to
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can
be a mix of extrinsic and intrinsic reward signals.
ML-Agents allows reward signals to be defined in a modular way, and we provide three reward
signals that can the mixed and matched to help shape your agent's behavior. The `extrinsic` Reward
Signal represents the rewards defined in your environment, and is enabled by default.
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse.
## Enabling Reward Signals
Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the brain name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward, you would define your `reward_signals` as follows:
```yaml
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    curiosity:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
```
Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.
## Reward Signal Types
### The Extrinsic Reward Signal
The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.
#### Strength
`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.
Typical Range: `1.0`
#### Gamma
`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.
Typical Range: `0.8` - `0.995`
### The Curiosity Reward Signal
The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:
* an inverse model, which takes the current and next observation of the agent, encodes them, and
uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
next encoded observation.
The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.
For more information, see
* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/
#### Strength
In this case, `strength` corresponds to the magnitude of the curiosity reward generated
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough
to not be overwhelmed by extrinsic reward signals in the environment.
Likewise it should not be too large to overwhelm the extrinsic reward signal.
Typical Range: `0.001` - `0.1`
#### Gamma
`gamma` corresponds to the discount factor for future rewards.
Typical Range: `0.8` - `0.995`
#### Encoding Size
`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model.
This value should be small enough to encourage the ICM to compress the original
observation, but also not too small to prevent it from learning to differentiate between
demonstrated and actual behavior.
Default Value: `64`
Typical Range: `64` - `256`
#### Learning Rate
`learning_rate` is the learning rate used to update the intrinsic curiosity module.
This should typically be decreased if training is unstable, and the curiosity loss is unstable.
Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
### The GAIL Reward Signal
GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an
imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs
(Generative Adversarial Networks). In this framework, a second neural network, the
discriminator, is taught to distinguish whether an observation/action is from a demonstration, or
produced by the agent. This discriminator can then examine a new observation/action and provide it a
reward based on how close it believes this new observation/action is to the provided demonstrations.
At each training step, the agent tries to learn how to maximize this reward. Then, the
discriminator is trained to better distinguish between demonstrations and agent state/actions.
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
This approach, when compared to [Behavioral Cloning](Training-BehavioralCloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstration, not directly copying the behavior of the demonstrations. It
is also especially effective when combined with an Extrinsic signal, but can also be used
independently to purely learn from demonstration.
Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.
#### Strength
`strength` is the factor by which to multiply the raw reward. Note that when using GAIL
with an Extrinsic Signal, this value should be set lower if your demonstrations are
suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic
rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases.
Typical Range: `0.01` - `1.0`
#### Gamma
`gamma` corresponds to the discount factor for future rewards.
Typical Range: `0.8` - `0.9`
#### Demo Path
`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the
[imitation learning guide](Training-Imitation-Learning.md).
#### Encoding Size
`encoding_size` corresponds to the size of the hidden layer used by the discriminator.
This value should be small enough to encourage the discriminator to compress the original
observation, but also not too small to prevent it from learning to differentiate between
demonstrated and actual behavior. Dramatically increasing this size will also negatively affect
training times.
Default Value: `64`
Typical Range: `64` - `256`
#### Learning Rate
`learning_rate` is the learning rate used to update the discriminator.
This should typically be decreased if training is unstable, and the GAIL loss is unstable.
Default Value: `3e-4`
Typical Range: `1e-5` - `1e-3`
#### Use Actions
`use_actions` determines whether the discriminator should discriminate based on both
observations and actions, or just observations. Set to `True` if you want the agent to
mimic the actions from the demonstrations, and `False` if you'd rather have the agent
visit the same states as in the demonstrations but with possibly different actions.
Setting to `False` is more likely to be stable, especially with imperfect demonstrations,
but may learn slower.
Default Value: `false`
#### (Optional) Samples Per Update
`samples_per_update` is the maximum number of samples to use during each discriminator update. You may
want to lower this if your buffer size is very large to avoid overfitting the discriminator on current data.
If set to 0, we will use the minimum of buffer size and the number of demonstration samples.
Default Value: `0`
Typical Range: Approximately equal to [`buffer_size`](Training-PPO.md)
#### (Optional) Variational Discriminator Bottleneck
`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the
GAIL discriminator. This forces the discriminator to learn a more general representation
and reduces its tendency to be "too good" at discriminating, making learning more stable.
However, it does increase training time. Enable this if you notice your imitation learning is
unstable, or unable to learn the task at hand.
Default Value: `false`

docs/Training-Generalization-Learning.md (157)


# Training Generalized Reinforcement Learning Agents
Reinforcement learning has a rather unique setup as opposed to supervised and
unsupervised learning. Agents here are trained and tested on the same exact
environment, which is analogous to a model being trained and tested on an
identical dataset in supervised learning! This setting results in overfitting;
the inability of the agent to generalize to slight tweaks or variations in the
environment. This is problematic in instances when environments are randomly
instantiated with varying properties. To make agents robust, one approach is to
train an agent over multiple variations of the environment. The agent is
trained in this approach with the intent that it learns to adapt its performance
to future unseen variations of the environment.
Ball scale of 0.5 | Ball scale of 4
:-------------------------:|:-------------------------:
![](images/3dball_small.png) | ![](images/3dball_big.png)
_Variations of the 3D Ball environment._
To vary environments, we first decide what parameters to vary in an
environment. We call these parameters `Reset Parameters`. In the 3D ball
environment example displayed in the figure above, the reset parameters are
`gravity`, `ball_mass` and `ball_scale`.
## How-to
For generalization training, we need to provide a way to modify the environment
by supplying a set of reset parameters, and vary them over time. This provision
can be done either deterministically or randomly.
This is done by assigning each reset parameter a sampler, which samples a reset
parameter value (such as a uniform sampler). If a sampler isn't provided for a
reset parameter, the parameter maintains the default value throughout the
training procedure, remaining unchanged. The samplers for all the reset parameters
are handled by a **Sampler Manager**, which also handles the generation of new
values for the reset parameters when needed.
To setup the Sampler Manager, we setup a YAML file that specifies how we wish to
generate new samples. In this file, we specify the samplers and the
`resampling-interval` (number of simulation steps after which reset parameters are
resampled). Below is an example of a sampler file for the 3D ball environment.
```yaml
resampling-interval: 5000
mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10
gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]
scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```
* `resampling-interval` (int) - Specifies the number of steps for agent to
train under a particular environment configuration before resetting the
environment with a new sample of reset parameters.
* `parameter_name` - Name of the reset parameter. This should match the name
specified in the academy of the intended environment for which the agent is
being trained. If a parameter specified in the file doesn't exist in the
environment, then this specification will be ignored.
* `sampler-type` - Specify the sampler type to use for the reset parameter.
This is a string that should exist in the `Sampler Factory` (explained
below).
* `sub-arguments` - Specify the characteristic parameters for the sampler.
In the example sampler file above, this would correspond to the `intervals`
key under the `multirange_uniform` sampler for the gravity reset parameter.
The key name should match the name of the corresponding argument in the sampler definition. (Look at defining a new sampler method)
The sampler manager allocates a sampler for a reset parameter by using the *Sampler Factory*, which maintains a dictionary mapping of string keys to sampler objects. The available samplers to be used for reset parameter resampling is as available in the Sampler Factory.
#### Possible Sampler Types
The currently implemented samplers that can be used with the `sampler-type` arguments are:
* `uniform` - Uniform sampler
* Uniformly samples a single float value between defined endpoints.
The sub-arguments for this sampler to specify the interval
endpoints are as below. The sampling is done in the range of
[`min_value`, `max_value`).
* **sub-arguments** - `min_value`, `max_value`
* `gaussian` - Gaussian sampler
* Samples a single float value from the distribution characterized by
the mean and standard deviation. The sub-arguments to specify the
gaussian distribution to use are as below.
* **sub-arguments** - `mean`, `st_dev`
* `multirange_uniform` - Multirange Uniform sampler
* Uniformly samples a single float value between the specified intervals.
Samples by first performing a weight pick of an interval from the list
of intervals (weighted based on interval width) and samples uniformly
from the selected interval (half-closed interval, same as the uniform
sampler). This sampler can take an arbitrary number of intervals in a
list in the following format:
[[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...]
* **sub-arguments** - `intervals`
The implementation of the samplers can be found at `ml-agents-envs/mlagents/envs/sampler_class.py`.
### Defining a new sampler method
Custom sampling techniques must inherit from the *Sampler* base class (included in the `sampler_class` file) and preserve the interface. Once the class for the required method is specified, it must be registered in the Sampler Factory.
This can be done by subscribing to the *register_sampler* method of the SamplerFactory. The command is as follows:
`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)`
Once the Sampler Factory reflects the new register, the custom sampler can be used for resampling reset parameter. For demonstration, lets say our sampler was implemented as below, and we register the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory.
```python
class CustomSampler(Sampler):

    def __init__(self, argA, argB, argC):
        self.possible_vals = [argA, argB, argC]

    def sample_all(self):
        return np.random.choice(self.possible_vals)
```
Now we need to specify this sampler in the sampler file. Lets say we wish to use this sampler for the reset parameter *mass*; the sampler file would specify the same for mass as the following (any order of the subarguments is valid).
```yaml
mass:
    sampler-type: "custom-sampler"
    argB: 1
    argA: 2
    argC: 3
```
With the sampler file setup, we can proceed to train our agent as explained in the next section.
### Training with Generalization Learning
We first begin with setting up the sampler file. After the sampler file is defined and configured, we proceed by launching `mlagents-learn` and specify our configured sampler file with the `--sampler` flag. To demonstrate, if we wanted to train a 3D ball agent with generalization using the `config/3dball_generalize.yaml` sampling setup, we can run
```sh
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_generalize.yaml --run-id=3D-Ball-generalization --train
```
We can observe progress and metrics via Tensorboard.

/docs/Training-BehavioralCloning.md → /docs/Training-Behavioral-Cloning.md

/docs/Profiling.md → /docs/Profiling-Python.md
