One of the first decisions you need to make regarding your training run is which
trainer to use: PPO, SAC, or POCA. There are some training configurations that are
| `trainer_type` | (default = `ppo`) The type of trainer to use: `ppo`, `sac`, or `poca`. |
| `summary_freq` | (default = `50000`) Number of experiences that need to be collected before generating and displaying training statistics. This determines the granularity of the graphs in TensorBoard. |
| `time_horizon` | (default = `64`) How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased but higher-variance estimate (long time horizon) and a more biased but lower-variance estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively long, a smaller number can be preferable. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `max_steps` | (default = `500000`) Total number of steps (i.e., observations collected and actions taken) that must be taken in the environment (or across all environments if using multiple in parallel) before ending the training process. If you have multiple agents with the same behavior name within your environment, all steps taken by those agents will contribute to the same `max_steps` count. <br><br>Typical range: `5e5` - `1e7` |
| `hyperparameters -> tau` | (default = `0.005`) How aggressively to update the target network used for bootstrapping value estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC model update. In SAC, there are two neural networks: the target and the policy. The target network is used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed while the policy is being updated. This target is then slowly updated according to tau. Typically, this value should be left at 0.005. For simple problems, increasing tau to 0.01 might reduce the time it takes to learn, at the cost of stability. <br><br>Typical range: `0.005` - `0.01` |
| `hyperparameters -> steps_per_update` | (default = `1`) Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow. <br><br>Typical range: `1` - `20` |
| `hyperparameters -> reward_signal_steps_per_update` | (default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et al.](http://arxiv.org/abs/1809.02925), [Blondé et al.](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. To accomplish this, we can change `steps_per_update` of SAC to N and `reward_signal_steps_per_update` to N / M. |
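
As a rough illustration of how the settings in the table above fit together, a SAC configuration might look like the sketch below. The top-level `behaviors` key and the behavior name `MyBehavior` are assumptions made for this example and are not defined in this section; the values shown are simply the defaults listed in the table.

```yaml
# Minimal sketch; `behaviors` and `MyBehavior` are assumed names for illustration.
behaviors:
  MyBehavior:
    trainer_type: sac        # one of: ppo, sac, poca
    summary_freq: 50000      # experiences collected between statistics summaries
    time_horizon: 64         # steps per agent before adding to the experience buffer
    max_steps: 500000        # total environment steps before training ends
    hyperparameters:
      tau: 0.005             # target network update magnitude (typical range 0.005 - 0.01)
      steps_per_update: 1    # average agent steps per policy update (typical range 1 - 20)
```

Starting from the defaults and changing one value at a time makes it easier to attribute any change in training behavior to a specific setting.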
### MA-POCA-specific Configurations
MA-POCA uses the same configurations as PPO, and there are no additional POCA-specific parameters.
**NOTE**: Reward signals other than Extrinsic Rewards have not been extensively tested with MA-POCA,
though they can still be added and used for training on a your-mileage-may-vary basis.
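
Because MA-POCA shares its configuration with PPO, switching trainers should in principle only require changing `trainer_type`. The sketch below assumes the same `behaviors` layout and hypothetical behavior name `MyBehavior` as a PPO configuration would use; the `reward_signals` block with `extrinsic`, `gamma`, and `strength` is likewise an assumption for this example and is not described in this section.

```yaml
# Minimal sketch; `behaviors`, `MyBehavior`, and the `reward_signals` layout are assumptions.
behaviors:
  MyBehavior:
    trainer_type: poca       # same settings as ppo, no additional POCA-specific parameters
    summary_freq: 50000
    time_horizon: 64
    max_steps: 500000
    reward_signals:
      extrinsic:             # the reward signal most thoroughly tested with MA-POCA
        gamma: 0.99          # illustrative value
        strength: 1.0        # illustrative value
```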