Improvements to Training-ML-Agents (#3776)
* Improvements to Training-ML-Agents
  - Removed duplicate documentation
  - Moved CLI descriptions to learn.py
  - Reorganized "Training with mlagents-learn" into 5 sub-sections
* fixed formatting errors and incorporated minor feedback
* minor improvement
* Minor formatting.
* fixed run-id references
* Keeping link to use Inference consistent with master
  Will update the UIE page in a separate PR.
* Squashed commit of the following:

commit 9600d0fbe6684eca69fb5bab84ab0f6754fc8b0f
Author: Marwan Mattar <marwan@unity3d.com>
Date: Tue Apr 14 17:45:33 2020 -0700

Various doc improvements (#3775)

* Various doc improvements

For Using-Virtual-Environment.md:
- Made a note regarding updating setuptools and pip.
- Changed lists from "-" to "*"

For Using-Tensorboard.md:
- Changed the ordered list to use "1."

For Training-on-Microsoft-Azure-Custom-Instance.md:
- Deleted .../develop/gym-wrapper
5 years ago
2 files changed, with 208 insertions and 255 deletions
# Training ML-Agents |
The ML-Agents toolkit conducts training using an external Python training |
process. During training, this external process communicates with the Academy |
to generate a block of agent experiences. These |
experiences become the training set for a neural network used to optimize the |
agent's policy (which is essentially a mathematical function mapping |
observations to actions). In reinforcement learning, the neural network |
optimizes the policy by maximizing the expected rewards. In imitation learning, |
the neural network optimizes the policy to achieve the smallest difference |
between the actions chosen by the agent trainee and the actions chosen by the |
expert in the same situation. |
The output of the training process is a model file containing the optimized |
policy. This model file is a TensorFlow data graph containing the mathematical |
operations and the optimized weights selected during the training process. You |
can set the generated model file in the Behavior Parameters under your |
Agent in your Unity project to decide the best course of action for an agent. |
Use the command `mlagents-learn` to train your agents. This command is installed |
with the `mlagents` package and its implementation can be found at |
`ml-agents/mlagents/trainers/learn.py`. The [configuration file](#training-config-file), |
such as `config/trainer_config.yaml`, specifies the hyperparameters used during training. |
You can edit this file with a text editor to add a specific configuration for |
each Behavior. |
For a broad overview of reinforcement learning, imitation learning and all the |
training scenarios, methods and options within the ML-Agents Toolkit, see |
[ML-Agents Toolkit Overview](ML-Agents-Overview.md). |
For a broader overview of reinforcement learning, imitation learning and the |
ML-Agents training process, see [ML-Agents Toolkit |
Overview](ML-Agents-Overview.md). |
Once your learning environment has been created and is ready for training, the |
next step is to initiate a training run. Training in the ML-Agents Toolkit is |
powered by a dedicated Python package, `mlagents`. This package exposes a |
command `mlagents-learn` that is the single entry point for all training |
workflows (e.g. reinforcement learning, imitation learning, curriculum learning). |
Its implementation can be found at |
[ml-agents/mlagents/trainers/learn.py](../ml-agents/mlagents/trainers/learn.py). |
Use the `mlagents-learn` command to train agents. `mlagents-learn` supports |
training with |
[reinforcement learning](Background-Machine-Learning.md#reinforcement-learning), |
[curriculum learning](Training-Curriculum-Learning.md), |
and [behavioral cloning imitation learning](Training-Imitation-Learning.md). |
### Starting Training |
Run `mlagents-learn` from the command line to launch the training process. Use |
the command line patterns and the `config/trainer_config.yaml` file to control |
training options. |
`mlagents-learn` is the main training utility provided by the ML-Agents Toolkit. |
It accepts a number of CLI options in addition to a YAML configuration file that |
contains all the configurations and hyperparameters to be used during training. |
The set of configurations and hyperparameters to include in this file depend on |
the agents in your environment and the specific training method you wish to |
utilize. Keep in mind that the hyperparameter values can have a big impact on |
the training performance (i.e. your agent's ability to learn a policy that |
solves the task). In this page, we will review all the hyperparameters for all |
training methods and provide guidelines and advice on their values. |
The basic command for training is: |
To view a description of all the CLI options accepted by `mlagents-learn`, use |
the `--help` flag: |
mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier> |
mlagents-learn --help |
where |
* `<trainer-config-file>` is the file path of the trainer configuration yaml. |
* `<env_name>`__(Optional)__ is the name (including path) of your Unity |
executable containing the agents to be trained. If `<env_name>` is not passed, |
the training will happen in the Editor. Press the :arrow_forward: button in |
Unity when the message _"Start training by pressing the Play button in the |
Unity Editor"_ is displayed on the screen. |
* `<run-identifier>` is an optional identifier you can use to identify the |
results of individual training runs. |
For example, suppose you have a project in Unity named "CatsOnBicycles" which |
contains agents ready to train. To perform the training: |
1. [Build the project](Learning-Environment-Executable.md), making sure that you |
only include the training scene. |
2. Open a terminal or console window. |
3. Navigate to the directory where you installed the ML-Agents Toolkit. |
4. Run the following to launch the training process using the path to the Unity |
environment you built in step 1: |
The basic command for training is: |
mlagents-learn config/trainer_config.yaml --env=../../projects/Cats/CatsOnBicycles.app --run-id=cob_1 |
mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier> |
During a training session, the training program prints out and saves updates at |
regular intervals (specified by the `summary_freq` option). The saved statistics |
are grouped by the `run-id` value so you should assign a unique id to each |
training run if you plan to view the statistics. You can view these statistics |
using TensorBoard during or after training by running the following command: |
where |
```sh |
tensorboard --logdir=summaries --port 6006 |
``` |
- `<trainer-config-file>` is the file path of the trainer configuration yaml. |
This contains all the hyperparameter values. We offer a detailed guide on the |
structure of this file and the meaning of the hyperparameters (and advice on how |
to set them) in the dedicated [Training Config File](#training-config-file) |
section below. |
- `<env_name>`**(Optional)** is the name (including path) of your |
[Unity executable](Learning-Environment-Executable.md) containing the agents |
to be trained. If `<env_name>` is not passed, the training will happen in the |
Editor. Press the :arrow_forward: button in Unity when the message _"Start |
training by pressing the Play button in the Unity Editor"_ is displayed on |
the screen. |
- `<run-identifier>` is a unique name you can use to identify the results of |
your training runs. |
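As a rough illustration of how the three parts of the basic command fit together, the Python sketch below assembles the argument list. The helper name `build_learn_command` and the example run ID are hypothetical; only the positional config path and the `--env` and `--run-id` flags come from the command described above.

```python
import shlex
from typing import List, Optional


def build_learn_command(
    config_path: str,
    run_id: str,
    env_path: Optional[str] = None,
) -> List[str]:
    """Assemble an mlagents-learn invocation (hypothetical helper, not part of the toolkit)."""
    command = ["mlagents-learn", config_path]
    if env_path is not None:
        # When --env is omitted, training happens in the Unity Editor instead.
        command.append(f"--env={env_path}")
    command.append(f"--run-id={run_id}")
    return command


if __name__ == "__main__":
    # Editor training: no executable path is passed.
    cmd = build_learn_command("config/trainer_config.yaml", "my_first_run")
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where the `mlagents` package is installed.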
And then opening the URL: [localhost:6006](http://localhost:6006). |
See the |
[Getting Started Guide](Getting-Started.md#training-a-new-model-with-reinforcement-learning) |
for a sample execution of the `mlagents-learn` command. |
**Note:** The default port TensorBoard uses is 6006. If there is an existing session |
running on port 6006 a new session can be launched on an open port using the --port |
option. |
#### Observing Training |
When training is finished, you can find the saved model in the `models` folder |
under the assigned run-id — in the cats example, the path to the model would be |
`models/cob_1/CatsOnBicycles_cob_1.nn`. |
Regardless of which training methods, configurations or hyperparameters you |
provide, the training process will always generate three artifacts: |
While this example used the default training hyperparameters, you can edit the |
[trainer_config.yaml file](#training-config-file) with a text editor to set |
different values. |
1. Summaries (under the `summaries/` folder): these are training metrics that |
are updated throughout the training process. They are helpful to monitor your |
training performance and may help inform how to update your hyperparameter |
values. See [Using TensorBoard](Using-Tensorboard.md) for more details on how |
to visualize the training metrics. |
1. Models (under the `models/` folder): these contain the model checkpoints that |
are updated throughout training and the final model file (`.nn`). This final |
model file is generated once, either when training completes or when it is |
interrupted. |
1. Timers file (also under the `summaries/` folder): this contains aggregated |
metrics on your training process, including time spent on specific code |
blocks. See [Profiling in Python](Profiling-Python.md) for more information |
on the timers generated. |
To interrupt training and save the current progress, hit Ctrl+C once and wait for the |
model to be saved out. |
These artifacts (except the `.nn` file) are updated throughout the training |
process and finalized when training completes or is interrupted. |
### Loading an Existing Model |
#### Stopping and Resuming Training |
If you've quit training early using Ctrl+C, you can resume the training run by running |
`mlagents-learn` again, specifying the same `<run-identifier>` and appending the `--resume` flag |
to the command. |
To interrupt training and save the current progress, hit `Ctrl+C` once and wait |
for the model(s) to be saved out. |
You can also use this mode to run inference of an already-trained model in Python. |
Append both the `--resume` and `--inference` to do this. Note that if you want to run |
inference in Unity, you should use the |
[Unity Inference Engine](Getting-started.md#running-a-pre-trained-model). |
To resume a previously interrupted or completed training run, use the `--resume` |
flag and make sure to specify the previously used run ID. |
If you've already trained a model using the specified `<run-identifier>` and `--resume` is not |
specified, you will not be able to continue with training. Use `--force` to force ML-Agents to |
overwrite the existing data. |
If you would like to re-run a previously interrupted or completed training run |
and re-use the same run ID (in this case, overwriting the previously generated |
artifacts), then use the `--force` flag. |
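The interplay between a reused run ID, `--resume` and `--force` can be summarized as a small decision function. This is only an illustrative sketch, not the logic in `learn.py`: the name `resolve_run_mode` and its return labels are hypothetical, and two behaviors are assumptions made for the example (resuming a run with no saved artifacts is treated as an error, and `--resume` takes precedence when both flags are given).

```python
def resolve_run_mode(run_exists: bool, resume: bool, force: bool) -> str:
    """Describe what a launch would do for a given run ID (illustrative only)."""
    if resume:
        if not run_exists:
            # Assumption: there is nothing to resume without saved artifacts.
            raise ValueError("cannot resume: no previous run with this run ID")
        return "resume"
    if run_exists:
        if force:
            # Previously generated artifacts are overwritten.
            return "overwrite"
        raise ValueError("run ID already used: pass --resume or --force")
    return "fresh"


if __name__ == "__main__":
    print(resolve_run_mode(run_exists=True, resume=True, force=False))
    print(resolve_run_mode(run_exists=True, resume=False, force=True))
    print(resolve_run_mode(run_exists=False, resume=False, force=False))
```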
Alternatively, you might want to start a new training run but _initialize_ it using an already-trained |
model. You may want to do this, for instance, if your environment changed and you want |
a new model, but the old behavior is still better than random. You can do this by specifying `--initialize-from=<run-identifier>`, where `<run-identifier>` is the old run ID. |
#### Loading an Existing Model |
### Command Line Training Options |
You can also use this mode to run inference of an already-trained model in |
Python by using both the `--resume` and `--inference` flags. Note that if you |
want to run inference in Unity, you should use the |
[Unity Inference Engine](Getting-Started.md#running-a-pre-trained-model). |
In addition to passing the path of the Unity executable containing your training |
environment, you can set the following command line options when invoking |
`mlagents-learn`: |
Alternatively, you might want to start a new training run but _initialize_ it |
using an already-trained model. You may want to do this, for instance, if your |
environment changed and you want a new model, but the old behavior is still |
better than random. You can do this by specifying |
`--initialize-from=<run-identifier>`, where `<run-identifier>` is the old run |
ID. |
* `--env=<env>`: Specify an executable environment to train. |
* `--curriculum=<file>`: Specify a curriculum JSON file for defining the |
lessons for curriculum training. See [Curriculum |
Training](Training-Curriculum-Learning.md) for more information. |
* `--sampler=<file>`: Specify a sampler YAML file for defining the |
sampler for parameter randomization. See [Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md) for more information. |
* `--keep-checkpoints=<n>`: Specify the maximum number of model checkpoints to |
keep. Checkpoints are saved after the number of steps specified by the |
`save-freq` option. Once the maximum number of checkpoints has been reached, |
the oldest checkpoint is deleted when saving a new checkpoint. Defaults to 5. |
* `--lesson=<n>`: Specify which lesson to start with when performing curriculum |
training. Defaults to 0. |
* `--num-envs=<n>`: Specifies the number of concurrent Unity environment instances to |
collect experiences from when training. Defaults to 1. |
* `--run-id=<run-identifier>`: Specifies an identifier for each training run. This |
identifier is used to name the subdirectories in which the trained model and |
summary statistics are saved as well as the saved model itself. The default id |
is "ppo". If you use TensorBoard to view the training statistics, always set a |
unique run-id for each training run. (The statistics for all runs with the |
same id are combined as if they were produced by the same session.) |
* `--save-freq=<n>`: Specifies how often (in steps) to save the model during |
training. Defaults to 50000. |
* `--seed=<n>`: Specifies a number to use as a seed for the random number |
generator used by the training code. |
* `--env-args=<string>`: Specify arguments for the executable environment. Be aware that |
the standalone build will also process these as |
[Unity Command Line Arguments](https://docs.unity3d.com/Manual/CommandLineArguments.html). |
You should choose different argument names if you want to create environment-specific arguments. |
All arguments after this flag will be passed to the executable. For example, setting |
`mlagents-learn config/trainer_config.yaml --env-args --num-orcs 42` would result in |
` --num-orcs 42` passed to the executable. |
* `--base-port`: Specifies the starting port. Each concurrent Unity environment instance |
will get assigned a port sequentially, starting from the `base-port`. Each instance |
will use the port `(base_port + worker_id)`, where `worker_id` is a sequential ID |
assigned to each instance, from 0 to `num_envs - 1`. Default is 5005. __Note:__ When |
training using the Editor rather than an executable, the base port will be ignored. |
* `--inference`: Specifies whether to only run in inference mode. Omit to train the model. |
To load an existing model, specify a run-id and combine with `--resume`. |
* `--resume`: If set, the training code loads an already trained model to |
initialize the neural network before training. The learning code looks for the |
model in `models/<run-id>/` (which is also where it saves models at the end of |
training). This option only works when the models exist, and have the same behavior names |
as the current agents in your scene. |
* `--force`: Attempting to train a model with a run-id that has been used before will |
throw an error. Use `--force` to force-overwrite this run-id's summary and model data. |
* `--initialize-from=<run-identifier>`: Specify an old run-id here to initialize your model from |
a previously trained model. Note that the previously saved models _must_ have the same behavior |
parameters as your current environment. |
* `--no-graphics`: Specify this option to run the Unity executable in |
`-batchmode` without initializing the graphics driver. Use this only if your |
training doesn't involve visual observations (reading from Pixels). See |
[here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more |
details. |
* `--debug`: Specify this option to enable debug-level logging for some parts of the code. |
* `--cpu`: Forces training using CPU only. |
* Engine Configuration : |
* `--width` : The width of the executable window of the environment(s) in pixels |
(ignored for editor training) (Default 84) |
* `--height` : The height of the executable window of the environment(s) in pixels |
(ignored for editor training). (Default 84) |
* `--quality-level` : The quality level of the environment(s). Equivalent to |
calling `QualitySettings.SetQualityLevel` in Unity. (Default 5) |
* `--time-scale` : The time scale of the Unity environment(s). Equivalent to setting |
`Time.timeScale` in Unity. (Default 20.0, maximum 100.0) |
* `--target-frame-rate` : The target frame rate of the Unity environment(s). |
Equivalent to setting `Application.targetFrameRate` in Unity. (Default: -1) |
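To make the port arithmetic from the `--base-port` and `--num-envs` descriptions concrete, here is a minimal Python sketch of the `base_port + worker_id` rule. The function name `assigned_ports` is hypothetical; the defaults (5005 and 1) are the ones listed above.

```python
from typing import List


def assigned_ports(base_port: int = 5005, num_envs: int = 1) -> List[int]:
    """Return the port each concurrent environment instance would use."""
    # worker_id runs from 0 to num_envs - 1, one per instance.
    return [base_port + worker_id for worker_id in range(num_envs)]


if __name__ == "__main__":
    # Three concurrent instances starting from the default base port.
    print(assigned_ports(num_envs=3))
```

This is useful for checking ahead of time that the whole port range is free when running several instances at once.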
## Training Config File |
### Training Config File |
The Unity ML-Agents Toolkit provides a wide range of training scenarios, methods |
and options. As such, specific training runs may require different training |
configurations and may generate different artifacts and TensorBoard statistics. |
This section offers a detailed guide into how to manage the different training |
set-ups within the toolkit. |
The training config files `config/trainer_config.yaml`, `config/sac_trainer_config.yaml`, |
`config/gail_config.yaml` and `config/offline_bc_config.yaml` specify the training method, |
the hyperparameters, and a few additional values to use when training with Proximal Policy |
Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative Adversarial Imitation Learning) |
with PPO/SAC, and Behavioral Cloning (BC)/Imitation with PPO/SAC. These files are divided |
into sections. The **default** section defines the default values for all the available |
settings. You can |
also add new sections to override these defaults to train specific Behaviors. Name each of these |
override sections after the appropriate `Behavior Name`. Sections for the |
The training config files `config/trainer_config.yaml`, |
`config/sac_trainer_config.yaml`, `config/gail_config.yaml` and |
`config/offline_bc_config.yaml` specify the training method, the |
hyperparameters, and a few additional values to use when training with Proximal |
Policy Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative Adversarial |
Imitation Learning) with PPO/SAC, and Behavioral Cloning (BC)/Imitation with |
PPO/SAC. These files are divided into sections. The **default** section |
defines the default values for all the available settings. You can also add new |
sections to override these defaults to train specific Behaviors. Name each of |
these override sections after the appropriate `Behavior Name`. Sections for the |
| **Setting** | **Description** | **Applies To Trainer\*** | |
| :------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- | |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC | |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | | |
| beta | The strength of entropy regularization. | PPO | |
| buffer_size | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer. | PPO, SAC | |
| buffer_init_steps | The number of experiences to collect into the buffer before updating the policy model. | SAC | |
| epsilon | Influences how rapidly the policy can evolve during training. | PPO | |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC | |
| init_entcoef | How much the agent should explore in the beginning of training. | SAC | |
| lambd | The regularization parameter. | PPO | |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC | |
| learning_rate_schedule | Determines how learning rate changes over time. | PPO, SAC | |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC | |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | |
| normalize | Whether to automatically normalize observations. | PPO, SAC | |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO | |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC | |
| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC | |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC | |
| save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC | |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC | |
| tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC | |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC | |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC | |
| train_interval | How often to update the agent. | SAC | |
| num_update | Number of mini-batches to update the agent with during each update. | SAC | |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | |
| init_path | Initialize trainer from a previously saved model. | PPO, SAC | |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral |
Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
| **Setting** | **Description** | **Applies To Trainer\*** | |
| :--------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- | |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC | |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | | |
| beta | The strength of entropy regularization. | PPO | |
| buffer_size | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer. | PPO, SAC | |
| buffer_init_steps | The number of experiences to collect into the buffer before updating the policy model. | SAC | |
| epsilon | Influences how rapidly the policy can evolve during training. | PPO | |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC | |
| init_entcoef | How much the agent should explore in the beginning of training. | SAC | |
| lambd | The regularization parameter. | PPO | |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC | |
| learning_rate_schedule | Determines how learning rate changes over time. | PPO, SAC | |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC | |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | |
| normalize | Whether to automatically normalize observations. | PPO, SAC | |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO | |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC | |
| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC | |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC | |
| save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC | |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC | |
| tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC | |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC | |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC | |
| train_interval | How often to update the agent. | SAC | |
| num_update | Number of mini-batches to update the agent with during each update. | SAC | |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | |
| init_path | Initialize trainer from a previously saved model. | PPO, SAC | |
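The relationship between the **default** section and per-Behavior override sections can be pictured as a dictionary merge. The Python sketch below is an assumption about the effect, not the trainer's actual loading code: `resolve_behavior_config` is a hypothetical helper, `CatsOnBicycles` is a made-up Behavior Name, and the hyperparameter values are placeholders chosen only for illustration.

```python
from typing import Any, Dict


def resolve_behavior_config(
    config: Dict[str, Dict[str, Any]], behavior_name: str
) -> Dict[str, Any]:
    """Overlay a Behavior's section on top of the default section (illustrative)."""
    resolved = dict(config.get("default", {}))
    # Settings in the Behavior's own section win over the defaults.
    resolved.update(config.get(behavior_name, {}))
    return resolved


if __name__ == "__main__":
    # Placeholder values; setting names are taken from the table above.
    config = {
        "default": {"trainer": "ppo", "batch_size": 1024, "learning_rate": 3.0e-4},
        "CatsOnBicycles": {"batch_size": 64},
    }
    print(resolve_behavior_config(config, "CatsOnBicycles"))
    print(resolve_behavior_config(config, "SomeOtherBehavior"))
```

A Behavior without its own section simply receives the default values unchanged.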
* [Training with PPO](Training-PPO.md) |
* [Training with SAC](Training-SAC.md) |
* [Using Recurrent Neural Networks](Feature-Memory.md) |
* [Training with Curriculum Learning](Training-Curriculum-Learning.md) |
* [Training with Imitation Learning](Training-Imitation-Learning.md) |
* [Training with Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md) |
- [Training with PPO](Training-PPO.md) |
- [Training with SAC](Training-SAC.md) |
- [Training with Self-Play](Training-Self-Play.md) |
- [Using Recurrent Neural Networks](Feature-Memory.md) |
- [Training with Curriculum Learning](Training-Curriculum-Learning.md) |
- [Training with Imitation Learning](Training-Imitation-Learning.md) |
- [Training with Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md) |
[example environments](Learning-Environment-Examples.md) |
to the corresponding sections of the `config/trainer_config.yaml` file for each |
example to see how the hyperparameters and other configuration variables have |
been changed from the defaults. |
### Debugging and Profiling |
If you enable the `--debug` flag in the command line, the trainer metrics are logged to a CSV file |
stored in the `summaries` directory. The metrics stored are: |
* brain name |
* time to update policy |
* time since start of training |
* time for last experience collection |
* number of experiences used for training |
* mean return |
This option is not available currently for Behavioral Cloning. |
Additionally, we have included basic [Profiling in Python](Profiling-Python.md) as part of the toolkit. |
This information is also saved in the `summaries` directory. |
[example environments](Learning-Environment-Examples.md) to the corresponding |
sections of the `config/trainer_config.yaml` file for each example to see how |
the hyperparameters and other configuration variables have been changed from the |
defaults. |