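The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents.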

Training ML-Agents

For a broad overview of reinforcement learning, imitation learning and all the training scenarios, methods and options within the ML-Agents Toolkit, see ML-Agents Toolkit Overview.

Once your learning environment has been created and is ready for training, the next step is to initiate a training run. Training in the ML-Agents Toolkit is powered by a dedicated Python package, mlagents. This package exposes a command, mlagents-learn, that is the single entry point for all training workflows (e.g. reinforcement learning, imitation learning, curriculum learning). Its implementation can be found at ml-agents/mlagents/trainers/learn.py.

Training with mlagents-learn

Starting Training

mlagents-learn is the main training utility provided by the ML-Agents Toolkit. It accepts a number of CLI options in addition to a YAML configuration file that contains all the configurations and hyperparameters to be used during training. The set of configurations and hyperparameters to include in this file depends on the agents in your environment and the specific training method you wish to use. Keep in mind that hyperparameter values can have a big impact on training performance (i.e. your agent's ability to learn a policy that solves the task). This page reviews the hyperparameters for all training methods and provides guidelines and advice on their values.

To view a description of all the CLI options accepted by mlagents-learn, use the --help flag:

mlagents-learn --help

The basic command for training is:

mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier>

where

  • <trainer-config-file> is the file path of the trainer configuration YAML. This contains all the hyperparameter values. The Training Config File section below gives a detailed guide to the structure of this file, the meaning of the hyperparameters, and advice on how to set them.
  • <env_name> (optional) is the name (including path) of the Unity executable containing the agents to be trained. If <env_name> is not passed, training happens in the Editor: press the ▶️ button in Unity when the message "Start training by pressing the Play button in the Unity Editor" is displayed on the screen.
  • <run-identifier> is a unique name you can use to identify the results of your training runs.
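
As a minimal sketch, the command line described above can be assembled in Python before being handed to subprocess.run. The helper name build_learn_command, the config path and the run ID are hypothetical placeholders; only the mlagents-learn options named in the text are used.

```python
import subprocess
from typing import List, Optional


def build_learn_command(config_path: str, run_id: str,
                        env_path: Optional[str] = None) -> List[str]:
    """Assemble the argument list for mlagents-learn (hypothetical helper)."""
    cmd = ["mlagents-learn", config_path, f"--run-id={run_id}"]
    if env_path is not None:
        # Omitting --env means training happens in the Unity Editor.
        cmd.append(f"--env={env_path}")
    return cmd


if __name__ == "__main__":
    # Placeholder config path and run ID; substitute your own.
    command = build_learn_command("config/ppo/3DBall.yaml", "my-first-run")
    print(" ".join(command))
    # Uncomment to launch training (requires the mlagents package):
    # subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the executable path contains spaces.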

See the Getting Started Guide for a sample execution of the mlagents-learn command.

Observing Training

Regardless of which training methods, configurations or hyperparameters you provide, the training process will always generate three artifacts:

  1. Summaries (under the results/<run-identifier>/<behavior-name> folder): these are training metrics that are updated throughout the training process. They are helpful to monitor your training performance and may help inform how to update your hyperparameter values. See Using TensorBoard for more details on how to visualize the training metrics.
  2. Models (under the results/<run-identifier>/ folder): these contain the model checkpoints that are updated throughout training and the final model file (.nn). The final model file is generated once, when training either completes or is interrupted.
  3. Timers file (also under the results/<run-identifier> folder): this contains aggregated metrics on your training process, including time spent on specific code blocks. See Profiling in Python for more information on the timers generated.

These artifacts (except the .nn file) are updated throughout the training process and finalized when training completes or is interrupted.
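
To check what a run has produced so far, the results folder can be inspected from Python. The sketch below assumes the layout described above and, as a simplification, treats every subfolder of results/<run-identifier> as a behavior folder; the helper name list_run_artifacts is hypothetical and not part of the toolkit.

```python
from pathlib import Path
from typing import Dict, List


def list_run_artifacts(results_dir: str, run_id: str) -> Dict[str, List[str]]:
    """Group the contents of results/<run-identifier> (hypothetical helper)."""
    run_dir = Path(results_dir) / run_id
    if not run_dir.is_dir():
        raise FileNotFoundError(f"No results folder for run ID {run_id!r}")
    entries = sorted(run_dir.iterdir())
    return {
        # Assumption: each subfolder is a behavior name holding summaries.
        "behavior_folders": [p.name for p in entries if p.is_dir()],
        # Final model files carry the .nn extension.
        "final_models": [p.name for p in entries
                         if p.is_file() and p.suffix == ".nn"],
        # Everything else, such as the timers file.
        "other_files": [p.name for p in entries
                        if p.is_file() and p.suffix != ".nn"],
    }
```

An empty "final_models" list for a run that is still in progress is expected, since the .nn file is only written when training completes or is interrupted.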

Stopping and Resuming Training

To interrupt training and save the current progress, hit Ctrl+C once and wait for the model(s) to be saved out.

To resume a previously interrupted or completed training run, use the --resume flag and make sure to specify the previously used run ID.

If you would like to re-run a previously interrupted or completed training run and re-use the same run ID (in this case, overwriting the previously generated artifacts), then use the --force flag.

Loading an Existing Model

You can run inference on an already-trained model in Python by passing both the --resume and --inference flags. Note that if you want to run inference in Unity, you should use the Unity Inference Engine.

Alternatively, you might want to start a new training run but initialize it using an already-trained model. You may want to do this, for instance, if your environment changed and you want a new model, but the old behavior is still better than random. You can do this by specifying --initialize-from=<run-identifier>, where <run-identifier> is the old run ID.
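
The flags from this section and the previous one can be summarized in a small lookup. This is only a sketch: the mode labels ("new", "resume", "overwrite", "inference") and the helper build_run_command are hypothetical, and restricting --initialize-from to new runs is a simplification made here, not a rule stated by the toolkit.

```python
from typing import List

# Flags come from the text; the mode names on the left are hypothetical labels.
MODE_FLAGS = {
    "new": [],
    "resume": ["--resume"],
    "overwrite": ["--force"],
    "inference": ["--resume", "--inference"],
}


def build_run_command(config_path: str, run_id: str, mode: str = "new",
                      initialize_from: str = "") -> List[str]:
    """Build an mlagents-learn argument list for the given run mode."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["mlagents-learn", config_path, f"--run-id={run_id}"]
    cmd += MODE_FLAGS[mode]
    if initialize_from:
        if mode != "new":
            # Simplification of this sketch: only seed brand-new runs.
            raise ValueError("initialize_from is only used with mode='new'")
        # The value is the run ID of the previously trained model.
        cmd.append(f"--initialize-from={initialize_from}")
    return cmd
```

For example, build_run_command("config.yaml", "run-2", initialize_from="run-1") describes a new run named run-2 that starts from the model saved under run-1.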

Training Config File

The Unity ML-Agents Toolkit provides a wide range of training scenarios, methods and options. As such, specific training runs may require different training configurations and may generate different artifacts and TensorBoard statistics. This section offers a detailed guide to managing the different training set-ups within the toolkit.

For each training run, create a YAML file that contains the training method and the hyperparameters for each of the Behaviors found in your environment. Example files for Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) are provided in config/ppo/ and config/sac/, respectively. Examples for imitation learning through GAIL (Generative Adversarial Imitation Learning) and Behavioral Cloning (BC) can be found in config/imitation/.
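
As a rough illustration, the sketch below writes a small configuration file from Python. It assumes that settings sit directly under each Behavior Name inside a top-level behaviors section; "MyBehavior" is a hypothetical Behavior Name, write_config is a hypothetical helper, and the numeric values are placeholders rather than tuning advice. The example files shipped in config/ppo/ remain the authoritative reference for the exact layout.

```python
from pathlib import Path

# Placeholder values only, not tuning recommendations.
# "MyBehavior" is a hypothetical Behavior Name; the keys are taken from
# the settings table in this section.
EXAMPLE_CONFIG = """\
behaviors:
  MyBehavior:
    trainer: ppo
    batch_size: 64
    buffer_size: 12000
    hidden_units: 128
    num_layers: 2
    learning_rate: 3.0e-4
    max_steps: 500000
    summary_freq: 10000
"""


def write_config(path: str, text: str = EXAMPLE_CONFIG) -> Path:
    """Write the YAML text to disk, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

The resulting file path is what would be passed to mlagents-learn as <trainer-config-file>.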

Each file is divided into sections. The behaviors section defines the hyperparameters for each Behavior found in your environment. A section should be created for each Behavior Name. The available parameters for PPO and SAC are listed below. Alternatively, if there are many different Behaviors that all use similar hyperparameters, you can create a default behavior name that specifies all hyperparameters that are not specified in the Behavior-specific sections. To use Curriculum Learning for a particular Behavior, add a section under that Behavior Name called curriculum. See the Curriculum Learning page for more information.
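
The fallback from Behavior-specific values to the default section can be pictured as a dictionary merge. The sketch below only illustrates the rule described above and is not the toolkit's loading code; it assumes the fallback section is keyed default and that the merge is shallow, and the helper name effective_settings is hypothetical.

```python
from typing import Any, Dict


def effective_settings(behaviors: Dict[str, Dict[str, Any]],
                       behavior_name: str,
                       default_key: str = "default") -> Dict[str, Any]:
    """Combine the default section with one Behavior's own section."""
    # Start from the shared defaults (empty if no default section exists).
    merged = dict(behaviors.get(default_key, {}))
    # Behavior-specific values take precedence over the defaults.
    merged.update(behaviors.get(behavior_name, {}))
    return merged


# Hypothetical behaviors section, already parsed into dictionaries.
behaviors = {
    "default": {"trainer": "ppo", "batch_size": 64, "num_layers": 2},
    "Walker": {"batch_size": 2048},
}
walker = effective_settings(behaviors, "Walker")
```

Here walker keeps trainer and num_layers from the default section and overrides only batch_size.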

To use Parameter Randomization, add a parameter_randomization section in the configuration file. See the Parameter Randomization docs for more information.

*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning

Setting | Description | Applies To Trainer*
--- | --- | ---
batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC
batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. |
beta | The strength of entropy regularization. | PPO
buffer_size | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer. | PPO, SAC
buffer_init_steps | The number of experiences to collect into the buffer before updating the policy model. | SAC
epsilon | Influences how rapidly the policy can evolve during training. | PPO
hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC
init_entcoef | How much the agent should explore in the beginning of training. | SAC
lambd | The regularization parameter. | PPO
learning_rate | The initial learning rate for gradient descent. | PPO, SAC
learning_rate_schedule | Determines how learning rate changes over time. | PPO, SAC
max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC
memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See Using Recurrent Neural Networks. | PPO, SAC
normalize | Whether to automatically normalize observations. | PPO, SAC
num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO
num_layers | The number of hidden layers in the neural network. | PPO, SAC
behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See Pretraining Using Demonstrations. | PPO, SAC
reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See Reward Signals for configuration options. | PPO, SAC
save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC
sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See Using Recurrent Neural Networks. | PPO, SAC
summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC
tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC
time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC
trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC
steps_per_update | Ratio of agent steps per mini-batch update. | SAC
use_recurrent | Train using a recurrent neural network. See Using Recurrent Neural Networks. | PPO, SAC
init_path | Initialize trainer from a previously saved model. | PPO, SAC
threaded | Run the trainer in a parallel thread from the environment steps. (Default: true) | PPO, SAC
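
The "Applies To Trainer" column can double as a quick sanity check on a Behavior's settings. The sketch below encodes only the rows marked as PPO-only or SAC-only in the table above and reports settings that belong to the other trainer; the helper name inapplicable_settings is hypothetical, and the toolkit itself may handle such settings differently.

```python
from typing import Iterable, List

# Settings the table lists for exactly one trainer.
PPO_ONLY = {"beta", "epsilon", "lambd", "num_epoch"}
SAC_ONLY = {"buffer_init_steps", "init_entcoef", "save_replay_buffer",
            "tau", "steps_per_update"}


def inapplicable_settings(trainer: str, settings: Iterable[str]) -> List[str]:
    """Return the settings that the table assigns only to the other trainer."""
    if trainer == "ppo":
        blocked = SAC_ONLY
    elif trainer == "sac":
        blocked = PPO_ONLY
    else:
        raise ValueError(f"this sketch only covers 'ppo' and 'sac', got {trainer!r}")
    return sorted(name for name in settings if name in blocked)
```

Calling it with the keys of a Behavior section, for instance inapplicable_settings("ppo", config_keys) where config_keys is your own list of setting names, highlights entries that were probably copied from a configuration for the other trainer.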

For specific advice on setting hyperparameters based on the type of training you are conducting, see the documentation page dedicated to that training method.

You can also compare the example environments with their corresponding files in the config/ppo/ folder to see how the hyperparameters and other configuration variables change from environment to environment.