* `--train`: Specifies whether to train the model or only run it in inference
  mode. When training, **always** use the `--train` option.
* `--num-envs=<n>`: Specifies the number of concurrent Unity environment instances to collect
experiences from when training. Defaults to 1.
* `--base-port`: Specifies the starting port. Each concurrent Unity environment
  instance is assigned a port sequentially, starting from `base-port`: instance
  `worker_id` uses port `(base_port + worker_id)`, where `worker_id` is a
  sequential ID from 0 to `num_envs - 1`. For example, with `--num-envs=3` and
  the default `--base-port=5005`, the three instances use ports 5005, 5006,
  and 5007. Defaults to 5005.
* `--docker-target-name=<dt>`: The Docker Volume on which to store curriculum,
executable and model files. See [Using Docker](Using-Docker.md).
* `--load`: If set, the training code loads an already trained model to
  initialize the neural network before training. It looks for the model in
  `models/<run-id>/` (which is also where models are saved at the end of
  training). When not set (the default), the neural network weights are
  randomly initialized and no existing model is loaded.
* `--no-graphics`: Specify this option to run the Unity executable in
  `-batchmode` without initializing the graphics driver. Use this only if your
  training doesn't involve visual observations (reading from pixels).
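
For example, the options above can be combined into a single call. The sketch
below uses the flags documented in this list; the executable name `3DBall` and
the run ID `firstRun` are illustrative placeholders:

```sh
# Resume training from models/firstRun/ using 4 parallel environment
# instances on ports 5005-5008, with graphics disabled.
mlagents-learn config/trainer_config.yaml --env=3DBall --run-id=firstRun \
    --num-envs=4 --base-port=5005 --load --no-graphics --train
```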
The training config files `config/trainer_config.yaml`, `config/sac_trainer_config.yaml`,
`config/gail_config.yaml`, `config/online_bc_config.yaml` and `config/offline_bc_config.yaml`
specify the training method, the hyperparameters, and a few additional values to use when
training with Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative
Adversarial Imitation Learning) with PPO, and online and offline Behavioral Cloning
(BC)/Imitation. These files are divided into sections. The **default** section defines the
default values for all the available settings. You can also add new sections to override
these defaults to train specific Brains. Name each of these override sections after the
GameObject containing the Brain component that should use these settings. (This GameObject
will be a child of the Academy in your scene.)
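
For instance, an override section might look like the following sketch. The
setting values here are illustrative rather than recommendations, and the
section name `3DBallLearning` stands in for a GameObject in your own scene:

```yaml
default:
    trainer: ppo
    batch_size: 1024
    buffer_size: 10240
    hidden_units: 128
    learning_rate: 3.0e-4
    max_steps: 5.0e4
    num_layers: 2
    summary_freq: 1000
    time_horizon: 64
    use_recurrent: false

# Overrides a few defaults for the Brain on the 3DBallLearning GameObject;
# every other setting falls back to the default section above.
3DBallLearning:
    max_steps: 5.0e5
    summary_freq: 2000
```

The available settings are described in the table below.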
| **Setting** | **Description** | **Applies To Trainer\*** |
| :---------- | :-------------- | :----------------------- |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC, BC |
| buffer_size | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer. | PPO, SAC |
| buffer_init_steps | The number of experiences to collect into the buffer before updating the policy model. | SAC |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC, BC |
| init_entcoef | How much the agent should explore in the beginning of training. | SAC |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC, BC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC, BC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC, BC |
| save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC, BC |
| tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC, (online)BC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC, BC |
| train_interval | How often to update the agent. | SAC |
| num_update | Number of mini-batches to update the agent with during each update. | SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
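
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral
Cloning (Imitation)

As a concrete illustration of the SAC-only settings above, a section in
`config/sac_trainer_config.yaml` might combine them as in this sketch (the
values are illustrative, not tuned recommendations):

```yaml
default:
    trainer: sac
    buffer_size: 50000        # max size of the experience replay buffer
    buffer_init_steps: 1000   # experiences to collect before the first update
    init_entcoef: 0.5         # how much to explore early in training
    tau: 0.005                # how aggressively to update the target network
    train_interval: 1         # steps between agent updates
    num_update: 1             # mini-batches per agent update
    save_replay_buffer: true  # persist the buffer when exiting and resuming
```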