|
|
|
|
|
|
#### Observing Training |
|
|
|
|
|
|
|
Regardless of which training methods, configurations or hyperparameters you
provide, the training process will always generate three artifacts, all found
in the `results/<run-identifier>` folder:
|
|
|
1. Summaries: these are training metrics that are updated throughout the
   training process (see the TensorBoard example after this list).
1. Models: these contain the model checkpoints that are updated throughout
   training, along with the final trained model file.
1. Timers file (under `results/<run-identifier>/run_logs`): this contains
   aggregated metrics on your training process, including time spent on
   specific code blocks. See [Profiling in Python](Profiling-Python.md) for
   more information on the timers generated.
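
As a quick example of working with these artifacts (and assuming TensorBoard is
installed alongside the toolkit), the summaries for all runs can be browsed by
pointing TensorBoard at the results folder:

```sh
# Serve the training summaries under results/ (TensorBoard defaults to localhost:6006)
tensorboard --logdir results
```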
|
|
|
|
|
|
This section offers a detailed guide into how to manage the different training
set-ups within the toolkit.
|
|
|
|
|
|
|
More specifically, this section offers a detailed guide on the command-line
flags for `mlagents-learn` that control the training configurations:

- `<trainer-config-file>`: defines the training configurations for each
  Behavior in the scene, and the set-ups for Curriculum Learning and
  Environment Parameter Randomization
- `--num-envs`: number of concurrent Unity instances to use during training
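
For instance, assuming the 3D Balance Ball scene has been built into a
standalone executable (the `3DBall` value passed to `--env` below is a
placeholder for the path to that build), a run using four concurrent Unity
instances could be launched like this:

```sh
# Train with the sample 3DBall PPO config against a built executable,
# running four environment instances in parallel.
mlagents-learn config/ppo/3DBall.yaml --env=3DBall --run-id=3DBall-parallel --num-envs=4
```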
|
|
|
|
|
|
|
Reminder that a detailed description of all command-line options can be found
by running `mlagents-learn --help`.
|
|
|
|
|
|
This guide contains some best practices for tuning the training
process when the default parameters don't seem to be giving the level of
performance you would like. We provide sample configuration files for our
example environments in the [config/](../config/) directory. The
`config/ppo/3DBall.yaml` was used to train the 3D Balance Ball in the
[Getting Started](Getting-Started.md) guide. That configuration file uses the
PPO trainer, but we also have configuration files for SAC and GAIL.
|
|
|
|
|
|
|
|
|
|
Each functionality you add typically has its own training configurations. For
instance:
|
|
|
|
|
|
|
- Use PPO or SAC?
- Use Recurrent Neural Networks for adding memory to your agents?
- Use behavioral cloning or GAIL? (Assuming you have recorded
  demonstrations.)
- Use self-play? (Assuming your environment includes multiple agents.)
|
|
|
|
|
|
|
The trainer config file, `<trainer-config-file>`, determines the features you will
use during training, and the answers to the above questions will dictate its contents.
The rest of this guide breaks down the different sub-sections of the trainer config file
and explains the possible settings for each.
|
|
|
### Behavior Configurations
|
|
|
The primary section of the trainer config file is a
set of configurations for each Behavior in your scene. These are defined under
the sub-section `behaviors` in your trainer config file. Some of the
curriculum and environment parameter randomization settings are not part of the `behaviors`
configuration, but their settings live in different sections that we'll cover subsequently.
To illustrate, here is a sample `behaviors` configuration for a PPO trainer
with memory, behavioral cloning, curiosity, GAIL and self-play all enabled:
|
|
|
```yaml
behaviors:
  BehaviorPPO:
    trainer: ppo

    # Trainer configs common to PPO/SAC (excluding reward signals)
    batch_size: 1024
    buffer_size: 10240
    hidden_units: 128
    learning_rate: 3.0e-4
    learning_rate_schedule: linear
    max_steps: 5.0e5
    normalize: false
    num_layers: 2
    time_horizon: 64
    vis_encode_type: simple

    # PPO-specific configs
    beta: 5.0e-3
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
    threaded: true

    # memory
    use_recurrent: true
    sequence_length: 64
    memory_size: 256

    # behavior cloning
    behavioral_cloning:
      demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
      strength: 0.5
      steps: 150000
      batch_size: 512
      num_epoch: 3
      samples_per_update: 0

    init_path:

    reward_signals:
      # environment reward
      extrinsic:
        strength: 1.0
        gamma: 0.99

      # curiosity module
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
        learning_rate: 3e-4

      # GAIL
      gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
        learning_rate: 3e-4
        use_actions: false
        use_vail: false

    # self-play
    self_play:
      window: 10
      play_against_latest_model_ratio: 0.5
      save_steps: 50000
      swap_steps: 50000
      team_change: 100000
```
|
|
|
|
|
|
|
Here is an equivalent file if we use an SAC trainer instead. Notice that the
settings shared with PPO are unchanged and are marked `# same as PPO config`
below:
|
|
|
|
|
|
```yaml
behaviors:
  BehaviorSAC:
    trainer: sac

    # Trainer configs common to PPO/SAC (excluding reward signals)
    # same as PPO config

    # SAC-specific configs (replaces the "PPO-specific configs" section above)
    buffer_init_steps: 0
    tau: 0.005
    steps_per_update: 1
    train_interval: 1
    init_entcoef: 1.0
    save_replay_buffer: false

    # memory
    # same as PPO config

    # pre-training using behavior cloning
    behavioral_cloning:
      # same as PPO config

    reward_signals:
      reward_signal_num_update: 1 # only applies to SAC

      # environment reward
      extrinsic:
        # same as PPO config

      # curiosity module
      curiosity:
        # same as PPO config

      # GAIL
      gail:
        # same as PPO config

    # self-play
    self_play:
      # same as PPO config
```
|
|
|
|
|
|
|
We now break apart the components of the configuration file and describe what
each of these parameters means, and provide guidelines on how to set them.
|
|
|
|
|
|
|
|
|
|
### Curriculum Learning |
|
|
|
|
|
|
|
To enable curriculum learning, you need to add a sub-section to the corresponding
`behaviors` entry in the trainer config YAML file that defines the curriculum for that
behavior. Here is one example:
|
|
|
```yaml
behaviors:
  BehaviorY:
    # < Same as above >

    # Add this section
    curriculum:
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        wall_height: [1.5, 2.0, 2.5, 4.0]
```
|
|
|
|
|
|
|
Each group of Agents under the same `Behavior Name` in an environment can have
a corresponding curriculum.
|
|
|
|
|
|
In order to define the curricula, the first step is to decide which parameters
of the environment will vary. In the case of the Wall Jump environment, the
height of the wall is what varies. Rather than adjusting it by hand, we will
create a configuration which describes the structure of the curricula. Within
it, we can specify at which points in the training process our wall height will
change, either based on the percentage of training steps which have taken
place, or based on the average reward the agent has received in the recent
past. Below is an example configuration of the curricula for the Wall Jump
environment:
|
|
|
```yaml
behaviors:
  BigWallJump:
    # < Trainer parameters for BigWallJump >

    # Curriculum configuration
    curriculum:
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
        big_wall_max_height: [4.0, 7.0, 8.0, 8.0]

  SmallWallJump:
    # < Trainer parameters for SmallWallJump >

    # Curriculum configuration
    curriculum:
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        small_wall_height: [1.5, 2.0, 2.5, 4.0]
```
|
|
|
|
|
|
|
The curriculum for each Behavior is configured through the parameters shown in
the example above (`measure`, `thresholds`, `min_lesson_length`,
`signal_smoothing` and `parameters`). To train agents in the Wall Jump
environment with curriculum learning, we can run:
|
|
|
|
|
|
|
```sh |
|
|
|
mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum
|
|
|
``` |
|
|
|
|
|
|
|
We can then keep track of the current lessons and progress via TensorBoard.
|
|
|
|
|
|
|
|
|
|
### Environment Parameter Randomization |
|
|
|
|
|
|
|
To enable parameter randomization, you need to add a `parameter_randomization`
sub-section to your trainer config YAML file. Here is one example:
|
|
|
```yaml
behaviors:
  # < Same as above >

parameter_randomization:
  resampling-interval: 5000

  mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10

  gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]

  scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```
|
|
|
|
|
|
|
Note that `mass`, `gravity` and `scale` are the names of the environment
parameters that will be sampled. The `multirange_uniform` sampler draws a value
uniformly from one of a set of intervals, specified via its **sub-argument**
`intervals` with the structure [[`interval_1_min`, `interval_1_max`],
[`interval_2_min`, `interval_2_max`], ...].
|
|
|
|
|
|
|
The implementation of the samplers can be found in the
[sampler_class.py file](../ml-agents/mlagents/trainers/sampler_class.py).
|
|
|
|
|
|
|
#### Defining a New Sampler Type

New sampler types can be added by extending the sampler implementations in the
[sampler_class.py file](../ml-agents/mlagents/trainers/sampler_class.py)
referenced above.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
#### Training with Environment Parameter Randomization |
|
|
|
|
|
|
|
After the sampler configuration is defined, we proceed by launching
`mlagents-learn` with a trainer configuration that includes the
`parameter_randomization` section.
|
|
|
For example, to randomize the `Environment Parameters` of the 3D Balance Ball
environment with the sampling setup above, we would run:
|
|
|
```sh
mlagents-learn config/ppo/3DBall_randomize.yaml --run-id=3D-Ball-randomize
|
|
|
``` |
|
|
|
|
|
|
|
We can observe progress and metrics via TensorBoard.
|
|
|
|
|
|
|
|
|
|
- **Buffer Size** - If you are having trouble getting an agent to train, even
  with multiple concurrent Unity instances, you could increase `buffer_size` in
  the trainer config file. A common practice is to multiply `buffer_size` by
  `num-envs` (e.g. with `buffer_size: 10240` as in the sample config above and
  `--num-envs=4`, you would set `buffer_size: 40960`).
|
|
|
- **Resource Constraints** - Invoking concurrent Unity instances is constrained
  by the resources on the machine. Please use discretion when setting
  `--num-envs=<n>`.
|
|