Release mm GitHub docs (#3864)
* Improvements to the Key Components section of ML-Agents Overview
  - Moved some documentation from Learning-Environment-Design.
  - Added the trainers vs. LL-API separation.
  - Made a note about gym-unity.
  - Some updates to the Agent/Behavior sections.
  - Updated diagrams to reflect new side channels. Made Behavior type a consistent color.
* Reorganizing the overview file and creating new (empty) sections
  This change defines the new structure for the overview doc. Subsequent commits will fill in the sections and rewrite existing sections.
* Reorganizing the main Training ML-Agents page
  Re-organizes into feature-specific sections that somewhat mirror the previous commit of reorganizing the overview doc. Subsequent commits will populate these empty sections.
* Adding Deep RL
  - Update ML-Agents-Overview with a description of the Deep RL training algorithms.
  - Describe the common and trainer-specific hyperparams in Training-ML-Agents.
  - Removed .../release_1_branch
Committed via GitHub, 5 years ago
Current commit: 0dff739b
30 files changed, with 1798 insertions and 2241 deletions.
- README.md (123 changes)
- com.unity.ml-agents/Runtime/Agent.cs (6 changes)
- com.unity.ml-agents/Runtime/Demonstrations/DemonstrationRecorder.cs (2 changes)
- docs/Getting-Started.md (10 changes)
- docs/Glossary.md (6 changes)
- docs/Learning-Environment-Create-New.md (19 changes)
- docs/Learning-Environment-Design-Agents.md (97 changes)
- docs/Learning-Environment-Design.md (135 changes)
- docs/Learning-Environment-Executable.md (2 changes)
- docs/ML-Agents-Overview.md (693 changes)
- docs/Migrating.md (55 changes)
- docs/Python-API.md (2 changes)
- docs/Readme.md (13 changes)
- docs/Training-ML-Agents.md (473 changes)
- docs/Using-Docker.md (2 changes)
- docs/Using-Tensorboard.md (65 changes)
- docs/images/learning_environment_basic.png (123 changes)
- docs/images/learning_environment_example.png (251 changes)
- docs/images/learning_environment_full.png (167 changes)
- docs/Training-Configuration-File.md (216 changes)
- docs/Feature-Memory.md (48 changes)
- docs/Feature-Monitor.md (50 changes)
- docs/Training-Using-Concurrent-Unity-Instances.md (25 changes)
- docs/Training-Imitation-Learning.md (104 changes)
- docs/Reward-Signals.md (205 changes)
- docs/Training-Self-Play.md (159 changes)
- docs/Training-Environment-Parameter-Randomization.md (171 changes)
- docs/Training-PPO.md (350 changes)
- docs/Training-SAC.md (356 changes)
- docs/Training-Curriculum-Learning.md (111 changes)
# Training Configuration File

**Table of Contents**

- [Common Trainer Configurations](#common-trainer-configurations)
- [Trainer-specific Configurations](#trainer-specific-configurations)
  - [PPO-specific Configurations](#ppo-specific-configurations)
  - [SAC-specific Configurations](#sac-specific-configurations)
- [Reward Signals](#reward-signals)
  - [Extrinsic Rewards](#extrinsic-rewards)
  - [Curiosity Intrinsic Reward](#curiosity-intrinsic-reward)
  - [GAIL Intrinsic Reward](#gail-intrinsic-reward)
  - [SAC-specific Reward Signal](#sac-specific-reward-signal)
- [Behavioral Cloning](#behavioral-cloning)
- [Memory-enhanced Agents using Recurrent Neural Networks](#memory-enhanced-agents-using-recurrent-neural-networks)
- [Self-Play](#self-play)
  - [Note on Reward Signals](#note-on-reward-signals)
  - [Note on Swap Steps](#note-on-swap-steps)

## Common Trainer Configurations

One of the first decisions you need to make regarding your training run is which
trainer to use: PPO or SAC. There are some training configurations that are
common to both trainers (which we review now) and others that depend on the
choice of the trainer (which we review in subsequent sections).

| **Setting** | **Description** |
| :---------- | :-------------- |
| `trainer` | The type of training to perform: `ppo` or `sac`. |
| `init_path` | Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `summary_freq` | Number of experiences that need to be collected before generating and displaying training statistics. This determines the granularity of the graphs in TensorBoard. |
| `batch_size` | Number of experiences in each iteration of gradient descent. **This should always be a fraction of the `buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value should be smaller (in the order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
| `buffer_size` | Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be a multiple of `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. In SAC, this is the max size of the experience buffer, and should be on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `hidden_units` | Number of units in the hidden layers of the neural network. Corresponds to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger. <br><br> Typical range: `32` - `512` |
| `learning_rate` | Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |
| `learning_rate_schedule` | Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until `max_steps` so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` (default) decays the `learning_rate` linearly, reaching 0 at `max_steps`, while `constant` keeps the learning rate constant for the entire training run. |
| `max_steps` | Total number of experience points that must be collected from the simulation before ending the training process. <br><br>Typical range: `5e5` - `1e7` |
| `normalize` | Whether normalization is applied to the vector observation inputs. This normalization is based on the running average and variance of the vector observation. Normalization can be helpful in cases with complex continuous control problems, but may be harmful with simpler discrete control problems. |
| `num_layers` | The number of hidden layers in the neural network. Corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems. <br><br> Typical range: `1` - `3` |
| `time_horizon` | How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and a more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be preferable. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `vis_encoder_type` | Encoder type for encoding visual observations. <br><br> `simple` (default) uses a simple encoder which consists of two convolutional layers, `nature_cnn` uses the CNN implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers, and `resnet` uses the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. |

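As an illustration, here is a minimal sketch of how these common settings might be laid out for a single behavior in a trainer configuration file such as `config/trainer_config.yaml`. The behavior name `MyBehavior` and the specific values are hypothetical placeholders, not tuned recommendations.

```yaml
MyBehavior:            # hypothetical behavior name
  trainer: ppo         # or sac
  max_steps: 5.0e5
  summary_freq: 10000
  batch_size: 1024
  buffer_size: 10240   # a multiple of batch_size
  learning_rate: 3.0e-4
  learning_rate_schedule: linear
  hidden_units: 128
  num_layers: 2
  normalize: false
  time_horizon: 64
  vis_encoder_type: simple
```
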
## Trainer-specific Configurations

Depending on your choice of a trainer, there are additional trainer-specific
configurations. We present them below in two separate tables, but keep in mind
that you only need to include the configurations for the trainer selected (i.e.
the `trainer` setting above).

### PPO-specific Configurations

| **Setting** | **Description** |
| :---------- | :-------------- |
| `beta` | Strength of the entropy regularization, which makes the policy "more random." This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`. <br><br>Typical range: `1e-4` - `1e-2` |
| `epsilon` | Influences how rapidly the policy can evolve during training. Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process. <br><br>Typical range: `0.1` - `0.3` |
| `lambd` | Regularization parameter (lambda) used when calculating the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process. <br><br>Typical range: `0.9` - `0.95` |
| `num_epoch` | Number of passes to make through the experience buffer when performing gradient descent optimization. The larger the `batch_size`, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning. <br><br>Typical range: `3` - `10` |
| `threaded` | (Optional, default = `true`) By default, PPO model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a 10-20% training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. |

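If PPO is selected, these settings sit alongside the common ones above. A sketch with hypothetical, illustrative values:

```yaml
MyBehavior:            # hypothetical behavior name
  trainer: ppo
  # PPO-specific settings
  beta: 5.0e-3
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 3
  threaded: true
```
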
### SAC-specific Configurations

| **Setting** | **Description** |
| :---------- | :-------------- |
| `buffer_init_steps` | Number of experiences to collect into the buffer before updating the policy model. As the untrained policy is fairly random, pre-filling the buffer with random actions is useful for exploration. Typically, at least several episodes of experiences should be pre-filled. <br><br>Typical range: `1000` - `10000` |
| `init_entcoef` | How much the agent should explore in the beginning of training. Corresponds to the initial entropy coefficient set at the beginning of training. In SAC, the agent is incentivized to make its actions entropic to facilitate better exploration. The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy coefficient is [automatically adjusted](https://arxiv.org/abs/1812.05905) to a preset target entropy, so the `init_entcoef` only corresponds to the starting value of the entropy bonus. Increase `init_entcoef` to explore more in the beginning, decrease to converge to a solution faster. <br><br>Typical range: (Continuous): `0.5` - `1.0`; (Discrete): `0.05` - `0.5` |
| `save_replay_buffer` | (Optional, default = `false`) Whether to save and load the experience replay buffer as well as the model when quitting and re-starting training. This may help resumes go more smoothly, as the experiences collected won't be wiped. Note that replay buffers can be very large, and will take up a considerable amount of disk space. For that reason, we disable this feature by default. |
| `tau` | How aggressively to update the target network used for bootstrapping value estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC model update. In SAC, there are two neural networks: the target and the policy. The target network is used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed while the policy is being updated. This target is then slowly updated according to `tau`. Typically, this value should be left at `0.005`. For simple problems, increasing `tau` to `0.01` might reduce the time it takes to learn, at the cost of stability. <br><br>Typical range: `0.005` - `0.01` |
| `steps_per_update` | Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. We can also change `steps_per_update` to a value lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow. <br><br>Typical range: `1` - `20` |
| `train_interval` | Number of steps taken between each agent training event. Typically, we can train after every step, but if your environment's steps are very small and very frequent, there may not be any new interesting information between steps, and `train_interval` can be increased. <br><br>Typical range: `1` - `5` |

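Similarly, a hypothetical SAC configuration might add the following on top of the common settings; the values are placeholders within the typical ranges above:

```yaml
MyBehavior:            # hypothetical behavior name
  trainer: sac
  # SAC-specific settings
  buffer_init_steps: 1000
  init_entcoef: 0.5
  save_replay_buffer: false
  tau: 0.005
  steps_per_update: 1
  train_interval: 1
```
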
## Reward Signals

The `reward_signals` section enables the specification of settings for both
extrinsic (i.e. environment-based) and intrinsic reward signals (e.g. curiosity
and GAIL). Each reward signal should define at least two parameters, `strength`
and `gamma`, in addition to any class-specific hyperparameters. Note that to
remove a reward signal, you should delete its entry entirely from
`reward_signals`. At least one reward signal should be left defined at all
times. Provide the following configurations to design the reward signal for your
training run.

### Extrinsic Rewards

Enable these settings to ensure that your training run incorporates your
environment-based reward signal:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `extrinsic > strength` | Factor by which to multiply the reward given by the environment. Typical ranges will vary depending on the reward signal. <br><br>Typical range: `1.00` |
| `extrinsic > gamma` | Discount factor for future rewards coming from the environment. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller. Must be strictly smaller than 1. <br><br>Typical range: `0.8` - `0.995` |

### Curiosity Intrinsic Reward

To enable curiosity, provide these settings:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `curiosity > strength` | Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.1` |
| `curiosity > gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
| `curiosity > encoding_size` | (Optional, default = `64`) Size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
| `curiosity > learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |

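For instance, pairing the extrinsic signal with a small curiosity reward might look like the following sketch (values are illustrative, not tuned recommendations):

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 64
    learning_rate: 3.0e-4
```
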
### GAIL Intrinsic Reward

To enable GAIL (assuming you have recorded demonstrations), provide these
settings:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `gail > strength` | Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br>Typical range: `0.01` - `1.0` |
| `gail > gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.9` |
| `gail > demo_path` | The path to your .demo file or directory of .demo files. |
| `gail > encoding_size` | (Optional, default = `64`) Size of the hidden layer used by the discriminator. This value should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br>Typical range: `64` - `256` |
| `gail > learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |
| `gail > use_actions` | (Optional, default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
| `gail > use_vail` | (Optional, default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |

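A GAIL entry is added under `reward_signals` in the same way; this is a sketch only, using the same `<path_to_your_demo_file>` placeholder convention as the recording example later on this page:

```yaml
reward_signals:
  gail:
    strength: 0.1
    gamma: 0.99
    encoding_size: 128
    demo_path: <path_to_your_demo_file>
    use_actions: false
    use_vail: false
```
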
### SAC-specific Reward Signal

All of the reward signal configurations described above apply to both PPO and
SAC. There is one configuration for reward signals that only applies to SAC.

| **Setting** | **Description** |
| :---------- | :-------------- |
| `reward_signals > reward_signal_num_update` | (Optional, default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`. |

## Behavioral Cloning

To enable Behavioral Cloning as a pre-training option (assuming you have
recorded demonstrations), provide the following configurations under the
`behavioral_cloning` section:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `demo_path` | The path to your .demo file or directory of .demo files. |
| `strength` | Learning rate of the imitation relative to the learning rate of PPO, and roughly corresponds to how strongly we allow BC to influence the policy. <br><br>Typical range: `0.1` - `0.5` |
| `steps` | During BC, it is often desirable to stop using demonstrations after the agent has "seen" rewards, and allow it to optimize past the available demonstrations and/or generalize outside of the provided demonstrations. `steps` corresponds to the training steps over which BC is active. The learning rate of BC will anneal over the steps. Set the steps to 0 for constant imitation over the entire training run. |
| `batch_size` | Number of demonstration experiences used for one iteration of a gradient descent update. If not specified, it will default to the `batch_size`. <br><br>Typical range: (Continuous): `512` - `5120`; (Discrete): `32` - `512` |
| `num_epoch` | Number of passes through the experience buffer during gradient descent. If not specified, it will default to the number of epochs set for PPO. <br><br>Typical range: `3` - `10` |
| `samples_per_update` | (Optional, default = `0`) Maximum number of samples to use during each imitation update. You may want to lower this if your demonstration dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 to train over all of the demonstrations at each update step. <br><br>Typical range: `buffer_size` |
| `init_path` | Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |

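For reference, a sketch of how these settings might appear, following the `behavioral_cloning` block shown in the demonstration-recording example later on this page; the values are illustrative only:

```yaml
behavioral_cloning:
  demo_path: <path_to_your_demo_file>
  strength: 0.5
  steps: 150000
  batch_size: 512
  num_epoch: 3
  samples_per_update: 0
```
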
## Memory-enhanced Agents using Recurrent Neural Networks

You can enable your agents to use memory by setting `use_recurrent` to `true`
and setting `memory_size` and `sequence_length`:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `use_recurrent` | Whether to enable this option or not. |
| `memory_size` | Size of the memory an agent must keep. In order to use an LSTM, training requires a sequence of experiences instead of single experiences. Corresponds to the size of the array of floating point numbers used to store the hidden state of the recurrent neural network of the policy. This value must be a multiple of 2, and should scale with the amount of information you expect the agent will need to remember in order to successfully complete the task. <br><br>Typical range: `32` - `256` |
| `sequence_length` | Defines how long the sequences of experiences must be while training. Note that if this number is too small, the agent will not be able to remember things over longer periods of time. If this number is too large, the neural network will take longer to train. <br><br>Typical range: `4` - `128` |

A few considerations when deciding to use memory:

- LSTM does not work well with continuous vector action space. Please use
  discrete vector action space for better results.
- Since the memories must be sent back and forth between Python and Unity, using
  too large a `memory_size` will slow down training.
- Adding a recurrent layer increases the complexity of the neural network; it is
  recommended to decrease `num_layers` when using recurrent.
- It is required that `memory_size` be divisible by 4.

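Putting the settings and the considerations above together, a sketch for a single behavior (hypothetical name and illustrative values):

```yaml
MyBehavior:            # hypothetical behavior name
  use_recurrent: true
  sequence_length: 64
  memory_size: 128     # divisible by 4
  num_layers: 1        # fewer layers, per the consideration above
```
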
## Self-Play

Training with self-play adds additional confounding factors to the usual issues
faced by reinforcement learning. In general, the tradeoff is between the skill
level and generality of the final policy and the stability of learning. Training
against a set of slowly changing or unchanging adversaries with low diversity
results in a more stable learning process than training against a set of quickly
changing adversaries with high diversity. With this context, this guide discusses
the exposed self-play hyperparameters and intuitions for tuning them.

If your environment contains multiple agents that are divided into teams, you
can leverage our self-play training option by providing these configurations for
each Behavior:

| **Setting** | **Description** |
| :---------- | :-------------- |
| `save_steps` | Number of _trainer steps_ between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. <br><br>A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. <br><br> Typical range: `10000` - `100000` |
| `team_change` | Number of _trainer steps_ between switching the learning team. This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents per team switch. <br><br>A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies and so the agent may fail against the next batch of opponents. <br><br> The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we recommend setting this value as a function of the `save_steps` parameter discussed previously. <br><br> Typical range: 4x-10x where x=`save_steps` |
| `swap_steps` | Number of _ghost steps_ (not trainer steps) between swapping the opponent's policy with a different snapshot. A 'ghost step' refers to a step taken by an agent _that is following a fixed policy and not learning_. The reason for this distinction is that in asymmetric games, we may have teams with an unequal number of agents e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` agents during `team_change` total steps is: `(num_agents / num_opponent_agents) * (team_change / x)` <br><br> Typical range: `10000` - `100000` |
| `play_against_latest_model_ratio` | Probability an agent will play against the latest opponent policy. With probability 1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its opponent from a past iteration. <br><br> A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curriculum](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations which may lead to a stronger final policy. <br><br> Typical range: `0.0` - `1.0` |
| `window` | Size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. As with the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to a more general and robust policy at the end of training. <br><br> Typical range: `5` - `30` |

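A sketch of a `self_play` block for a behavior, with illustrative values chosen from the typical ranges above (not a recommendation for any particular game):

```yaml
MyBehavior:                       # hypothetical behavior name
  self_play:
    save_steps: 50000
    team_change: 200000           # a multiple of save_steps
    swap_steps: 50000
    play_against_latest_model_ratio: 0.5
    window: 10
```
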
### Note on Reward Signals

We make the assumption that the final reward in a trajectory corresponds to the
outcome of an episode. A final reward of +1 indicates winning, -1 indicates
losing and 0 indicates a draw. The ELO calculation (discussed below) depends on
this final reward being either +1, 0, or -1.

The reward signal should still be used as described in the documentation for the
other trainers. However, we encourage users to be a bit more conservative when
shaping reward functions due to the instability and non-stationarity of learning
in adversarial games. Specifically, we encourage users to begin with the
simplest possible reward function (+1 winning, -1 losing) and to allow for more
iterations of training to compensate for the sparsity of reward.

### Note on Swap Steps

As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times during
`team_change=200000` steps, the `swap_steps` for the team of one agent is:

swap_steps = (1 / 2) \* (200000 / 4) = 25000

The `swap_steps` for the team of two agents is:

swap_steps = (2 / 1) \* (200000 / 4) = 100000

Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be
calculated by just dividing the total steps by the desired number of swaps.

A larger value of `swap_steps` means that an agent will play against the same
fixed opponent for a longer number of training iterations. This results in a
more stable training scenario, but leaves the agent open to the risk of
overfitting its behavior to this particular opponent. Thus, when a new
opponent is swapped in, the agent may lose more often than expected.

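The same arithmetic can be recorded directly in the configuration as comments. A sketch for the hypothetical 2v1 setup above (behavior names are placeholders):

```yaml
# Hypothetical 2v1 setup: x = 4 swaps during team_change = 200000 steps
TeamOfOne:                  # hypothetical behavior name
  self_play:
    team_change: 200000
    swap_steps: 25000       # (1 / 2) * (200000 / 4)
TeamOfTwo:                  # hypothetical behavior name
  self_play:
    team_change: 200000
    swap_steps: 100000      # (2 / 1) * (200000 / 4)
```
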
# Memory-enhanced agents using Recurrent Neural Networks

## What are memories used for?

Have you ever entered a room to get something and immediately forgot what you
were looking for? Don't let that happen to your agents.

It is now possible to give memories to your agents. When training, the agents
will be able to store a vector of floats to be used next time they need to make
a decision.

![Inspector](images/ml-agents-LSTM.png)

Deciding what the agents should remember in order to solve a task is not easy to
do by hand, but our training algorithms can learn to keep track of what is
important to remember with
[LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory).

## How to use

When configuring the trainer parameters in the `config/trainer_config.yaml`
file, add the following parameters to the Behavior you want to use.

```yaml
use_recurrent: true
sequence_length: 64
memory_size: 256
```

* `use_recurrent` is a flag that notifies the trainer that you want to use a
  Recurrent Neural Network.
* `sequence_length` defines how long the sequences of experiences must be while
  training. In order to use an LSTM, training requires a sequence of experiences
  instead of single experiences.
* `memory_size` corresponds to the size of the memory the agent must keep. Note
  that if this number is too small, the agent will not be able to remember a lot
  of things. If this number is too large, the neural network will take longer to
  train.

## Limitations

* LSTM does not work well with continuous vector action space. Please use
  discrete vector action space for better results.
* Since the memories must be sent back and forth between Python and Unity, using
  too large a `memory_size` will slow down training.
* Adding a recurrent layer increases the complexity of the neural network; it is
  recommended to decrease `num_layers` when using recurrent.
* It is required that `memory_size` be divisible by 4.

# Using the Monitor

![Monitor](images/monitor.png)

The monitor allows visualizing information related to the agents or training
process within a Unity scene.

You can track many different things both related and unrelated to the agents
themselves. By default, the Monitor is only active in the *inference* phase, so
not during training. To change this behavior, you can activate or deactivate it
by calling `SetActive(boolean)`. For example, to also show the monitor during
training, you can call it in the `Awake()` method of your `MonoBehaviour`:

```csharp
using Unity.MLAgents;
using UnityEngine;

public class MyBehaviour : MonoBehaviour {
    public void Awake()
    {
        // Keep the Monitor visible during training as well as inference.
        Monitor.SetActive(true);
    }
}
```

To add values to the monitor, call the `Log` function anywhere in your code:

```csharp
Monitor.Log(key, value, target)
```

* `key` is the name of the information you want to display.
* `value` is the information you want to display. *`value`* can have different
  types:
  * `string` - The Monitor will display the string next to the key. It can be
    useful for displaying error messages.
  * `float` - The Monitor will display a slider. Note that the values must be
    between -1 and 1. If the value is positive, the slider will be green; if the
    value is negative, the slider will be red.
  * `float[]` - The Monitor Log call can take an additional argument called
    `displayType` that can be either `INDEPENDENT` (default) or `PROPORTIONAL`:
    * `INDEPENDENT` is used to display multiple independent floats as a
      histogram. The histogram will be a sequence of vertical sliders.
    * `PROPORTIONAL` is used to see the proportions between numbers. For each
      float in values, a rectangle whose width is that value divided by the sum
      of all values will be shown. It is best for visualizing values that sum to 1.
* `target` is the transform to which you want to attach information. If the
  transform is `null` the information will be attached to the global monitor.
  * **NB:** When adding a target transform that is not the global monitor, make
    sure you have your main camera object tagged as `MainCamera` via the
    inspector. This is needed to properly display the text onto the screen.

# Training Using Concurrent Unity Instances

As part of release v0.8, we enabled developers to run concurrent, parallel instances of the Unity executable during training. For certain scenarios, this should speed up the training.

## How to Run Concurrent Unity Instances During Training

Please refer to the general instructions on [Training ML-Agents](Training-ML-Agents.md). In order to run concurrent Unity instances during training, set the number of environment instances using the command line option `--num-envs=<n>` when you invoke `mlagents-learn`. Optionally, you can also set the `--base-port`, which is the starting port used for the concurrent Unity instances.

## Considerations

### Buffer Size

If you are having trouble getting an agent to train, even with multiple concurrent Unity instances, you could increase `buffer_size` in the `config/trainer_config.yaml` file. A common practice is to multiply `buffer_size` by `num-envs`.

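As a rough sketch of that scaling, assuming a hypothetical base `buffer_size` of 2048 and eight concurrent instances (`--num-envs=8`):

```yaml
MyBehavior:              # hypothetical behavior name
  buffer_size: 16384     # 2048 * 8, scaled with --num-envs=8
```
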
### Resource Constraints

Invoking concurrent Unity instances is constrained by the resources on the machine. Please use discretion when setting `--num-envs=<n>`.

### Using num-runs and num-envs

If you set `--num-runs=<n>` greater than 1 and are also invoking concurrent Unity instances using `--num-envs=<n>`, then the number of concurrent Unity instances is equal to `num-runs` times `num-envs`.

### Result Variation Using Concurrent Unity Instances

If you keep all the hyperparameters the same, but change `--num-envs=<n>`, the results and model would likely change.

|||
# Training with Imitation Learning |
|||
|
|||
It is often more intuitive to simply demonstrate the behavior we want an agent |
|||
to perform, rather than attempting to have it learn via trial-and-error methods. |
|||
Consider our |
|||
[running example](ML-Agents-Overview.md#running-example-training-npc-behaviors) |
|||
of training a medic NPC. Instead of indirectly training a medic with the help |
|||
of a reward function, we can give the medic real world examples of observations |
|||
from the game and actions from a game controller to guide the medic's behavior. |
|||
Imitation Learning uses pairs of observations and actions from |
|||
a demonstration to learn a policy. |
|||
|
|||
Imitation learning can also be used to help reinforcement learning. Especially in |
|||
environments with sparse (i.e., infrequent or rare) rewards, the agent may never see |
|||
the reward and thus not learn from it. Curiosity (which is available in the toolkit) |
|||
helps the agent explore, but in some cases |
|||
it is easier to show the agent how to achieve the reward. In these cases, |
|||
imitation learning combined with reinforcement learning can dramatically |
|||
reduce the time the agent takes to solve the environment. |
|||
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids), |
|||
using 6 episodes of demonstrations can reduce training steps by more than 4 times. |
|||
See Behavioral Cloning + GAIL + Curiosity + RL below. |
|||
|
|||
<p align="center">
  <img src="images/mlagents-ImitationAndRL.png"
       alt="Using Demonstrations with Reinforcement Learning"
       width="700" border="0" />
</p>

The ML-Agents Toolkit provides two features that enable your agent to learn from demonstrations.
In most scenarios, you can combine these two features.

* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to
  reward your Agent for behaving similarly to a set of demonstrations. To use GAIL, you can add the
  [GAIL reward signal](Reward-Signals.md#gail-reward-signal). GAIL can be
  used with or without environment rewards, and works well when there are a limited
  number of demonstrations.
* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions
  shown in a set of demonstrations.
  The BC feature can be enabled on the [PPO](Training-PPO.md#optional-behavioral-cloning-using-demonstrations)
  or [SAC](Training-SAC.md#optional-behavioral-cloning-using-demonstrations) trainer. As BC cannot generalize
  past the examples shown in the demonstrations, BC tends to work best when there exist demonstrations
  for nearly all of the states that the agent can experience, or in conjunction with GAIL and/or an extrinsic reward.

### What to Use

If you want to help your agents learn (especially with environments that have sparse rewards)
using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning
at low strengths in addition to having an extrinsic reward.
An example of this is provided for the Pyramids example environment under
`PyramidsLearning` in `config/gail_config.yaml`.

If you want to train purely from demonstrations, GAIL and BC _without_ an
extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler
example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.

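As a rough sketch of the first approach (GAIL and BC at low strengths alongside the extrinsic reward), the relevant portion of a behavior's configuration might look like this; the demo path is a placeholder and the values are illustrative rather than the tuned ones in `config/gail_config.yaml`:

```yaml
MyBehavior:                  # hypothetical behavior name
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
    gail:
      strength: 0.01         # keep low when demonstrations are suboptimal
      gamma: 0.99
      demo_path: <path_to_your_demo_file>
  behavioral_cloning:
    demo_path: <path_to_your_demo_file>
    strength: 0.5
    steps: 150000
```
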
## Recording Demonstrations

Demonstrations of agent behavior can be recorded from the Unity Editor,
and saved as assets. These demonstrations contain information on the
observations, actions, and rewards for a given agent during the recording session.
They can be managed in the Editor, as well as used for training with BC and GAIL.

In order to record demonstrations from an agent, add the `Demonstration Recorder`
component to a GameObject in the scene which contains an `Agent` component.
Once added, it is possible to name the demonstration that will be recorded
from the agent.

<p align="center">
  <img src="images/demo_component.png"
       alt="Demonstration Recorder"
       width="375" border="10" />
</p>

When `Record` is checked, a demonstration will be created whenever the scene
is played from the Editor. Depending on the complexity of the task, anywhere
from a few minutes to a few hours of demonstration data may be necessary to
be useful for imitation learning. When you have recorded enough data, end
the Editor play session. A `.demo` file will be created in the
`Assets/Demonstrations` folder (by default). This file contains the demonstrations.
Clicking on the file will provide metadata about the demonstration in the
inspector.

<p align="center">
  <img src="images/demo_inspector.png"
       alt="Demonstration Inspector"
       width="375" border="10" />
</p>

You can then specify the path to this file as the `demo_path` in your `trainer_config.yaml` file
when using BC or GAIL. For instance, for BC:

```yaml
behavioral_cloning:
  demo_path: <path_to_your_demo_file>
  ...
```

And for GAIL:

```yaml
reward_signals:
  gail:
    demo_path: <path_to_your_demo_file>
    ...
```

|||
# Reward Signals |
|||
|
|||
In reinforcement learning, the end goal for the Agent is to discover a behavior (a Policy) |
|||
that maximizes a reward. Typically, a reward is defined by your environment, and corresponds |
|||
to reaching some goal. These are what we refer to as "extrinsic" rewards, as they are defined |
|||
external of the learning algorithm. |
|||
|
|||
Rewards, however, can be defined outside of the environment as well, to encourage the agent to |
|||
behave in certain ways, or to aid the learning of the true extrinsic reward. We refer to these |
|||
rewards as "intrinsic" reward signals. The total reward that the agent will learn to maximize can |
|||
be a mix of extrinsic and intrinsic reward signals. |
|||
|
|||
ML-Agents allows reward signals to be defined in a modular way, and we provide three reward |
|||
signals that can the mixed and matched to help shape your agent's behavior. The `extrinsic` Reward |
|||
Signal represents the rewards defined in your environment, and is enabled by default. |
|||
The `curiosity` reward signal helps your agent explore when extrinsic rewards are sparse. |
|||
|
|||
## Enabling Reward Signals

Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
example is provided in `config/trainer_config.yaml` and `config/gail_config.yaml`. To enable a reward signal, add it to the
`reward_signals:` section under the behavior name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward and a GAIL reward signal, you would define your `reward_signals` as follows:

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 256
  gail:
    strength: 0.01
    gamma: 0.99
    encoding_size: 128
    demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```

Each reward signal should define at least two parameters, `strength` and `gamma`, in addition
to any class-specific hyperparameters. Note that to remove a reward signal, you should delete
its entry entirely from `reward_signals`. At least one reward signal should be left defined
at all times.

## Reward Signal Types

The toolkit provides three reward signal types, each configured through hyperparameters: Extrinsic, Curiosity, and GAIL.

### Extrinsic Reward Signal

The `extrinsic` reward signal is simply the reward given by the
[environment](Learning-Environment-Design.md). Remove it to force the agent
to ignore the environment reward.

#### Strength

`strength` is the factor by which to multiply the raw
reward. Typical ranges will vary depending on the reward signal.

Typical Range: `1.0`

#### Gamma

`gamma` corresponds to the discount factor for future rewards. This can be
thought of as how far into the future the agent should care about possible
rewards. In situations when the agent should be acting in the present in order
to prepare for rewards in the distant future, this value should be large. In
cases when rewards are more immediate, it can be smaller.

Typical Range: `0.8` - `0.995`

### Curiosity Reward Signal

The `curiosity` Reward Signal enables the Intrinsic Curiosity Module. This is an implementation
of the approach described in "Curiosity-driven Exploration by Self-supervised Prediction"
by Pathak, et al. It trains two networks:

* an inverse model, which takes the current and next observation of the agent, encodes them, and
  uses the encoding to predict the action that was taken between the observations
* a forward model, which takes the encoded current observation and action, and predicts the
  next encoded observation.

The loss of the forward model (the difference between the predicted and actual encoded observations) is used as the intrinsic reward, so the more surprised the model is, the larger the reward will be.

For more information, see

* https://arxiv.org/abs/1705.05363
* https://pathak22.github.io/noreward-rl/
* https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/

#### Strength |
|||
|
|||
In this case, `strength` corresponds to the magnitude of the curiosity reward generated |
|||
by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough |
|||
to not be overwhelmed by extrinsic reward signals in the environment. |
|||
Likewise it should not be too large to overwhelm the extrinsic reward signal. |
|||
|
|||
Typical Range: `0.001` - `0.1` |
|||
|
|||
#### Gamma |
|||
|
|||
`gamma` corresponds to the discount factor for future rewards. |
|||
|
|||
Typical Range: `0.8` - `0.995` |
|||
|
|||
#### (Optional) Encoding Size |
|||
|
|||
`encoding_size` corresponds to the size of the encoding used by the intrinsic curiosity model. |
|||
This value should be small enough to encourage the ICM to compress the original |
|||
observation, but also not too small to prevent it from learning to differentiate between |
|||
demonstrated and actual behavior. |
|||
|
|||
Default Value: `64` |
|||
|
|||
Typical Range: `64` - `256` |
|||
|
|||
#### (Optional) Learning Rate |
|||
|
|||
`learning_rate` is the learning rate used to update the intrinsic curiosity module. |
|||
This should typically be decreased if training is unstable, and the curiosity loss is unstable. |
|||
|
|||
Default Value: `3e-4` |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
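
Putting the above together, a sketch of enabling curiosity alongside the extrinsic signal might look like the following (all values are illustrative and should be tuned per environment):

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 256
    learning_rate: 3.0e-4
```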
|||
|
|||
### GAIL Reward Signal |
|||
|
|||
GAIL, or [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476), is an |
|||
imitation learning algorithm that uses an adversarial approach, in a similar vein to GANs |
|||
(Generative Adversarial Networks). In this framework, a second neural network, the |
|||
discriminator, is taught to distinguish whether an observation/action is from a demonstration or
produced by the agent. This discriminator can then examine a new observation/action and provide a
reward based on how close it believes this new observation/action is to the provided demonstrations.
|||
|
|||
At each training step, the agent tries to learn how to maximize this reward. Then, the |
|||
discriminator is trained to better distinguish between demonstrations and agent state/actions. |
|||
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
|||
|
|||
This approach learns a _policy_ that produces states and actions similar to the demonstrations, |
|||
requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely |
|||
from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide |
|||
the learning process. |
|||
|
|||
Using GAIL requires recorded demonstrations from your Unity environment. See the |
|||
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations. |
|||
|
|||
#### Strength |
|||
|
|||
`strength` is the factor by which to multiply the raw reward. Note that when using GAIL |
|||
with an Extrinsic Signal, this value should be set lower if your demonstrations are |
|||
suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic |
|||
rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. |
|||
|
|||
Typical Range: `0.01` - `1.0` |
|||
|
|||
#### Gamma |
|||
|
|||
`gamma` corresponds to the discount factor for future rewards. |
|||
|
|||
Typical Range: `0.8` - `0.9` |
|||
|
|||
#### Demo Path |
|||
|
|||
`demo_path` is the path to your `.demo` file or directory of `.demo` files. See the [imitation learning guide](Training-Imitation-Learning.md). |
|||
|
|||
#### (Optional) Encoding Size |
|||
|
|||
`encoding_size` corresponds to the size of the hidden layer used by the discriminator.
This value should be small enough to encourage the discriminator to compress the original
observation, but not so small that it cannot learn to differentiate between demonstrated
and actual behavior. Dramatically increasing this size will also negatively affect
training times.
|||
|
|||
Default Value: `64` |
|||
|
|||
Typical Range: `64` - `256` |
|||
|
|||
#### (Optional) Learning Rate |
|||
|
|||
`learning_rate` is the learning rate used to update the discriminator.
This should typically be decreased if the GAIL loss is unstable during training.
|||
|
|||
Default Value: `3e-4` |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
|||
|
|||
#### (Optional) Use Actions |
|||
|
|||
`use_actions` determines whether the discriminator should discriminate based on both |
|||
observations and actions, or just observations. Set to `True` if you want the agent to |
|||
mimic the actions from the demonstrations, and `False` if you'd rather have the agent |
|||
visit the same states as in the demonstrations but with possibly different actions. |
|||
Setting to `False` is more likely to be stable, especially with imperfect demonstrations, |
|||
but may learn slower. |
|||
|
|||
Default Value: `false` |
|||
|
|||
#### (Optional) Variational Discriminator Bottleneck |
|||
|
|||
`use_vail` enables a [variational bottleneck](https://arxiv.org/abs/1810.00821) within the |
|||
GAIL discriminator. This forces the discriminator to learn a more general representation |
|||
and reduces its tendency to be "too good" at discriminating, making learning more stable. |
|||
However, it does increase training time. Enable this if you notice your imitation learning is |
|||
unstable, or unable to learn the task at hand. |
|||
|
|||
Default Value: `false` |
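
As a sketch, a GAIL entry combining the settings above might look like the following (the demo path matches the example used in the Behavioral Cloning sections; all values are illustrative):

```yaml
reward_signals:
  gail:
    strength: 0.5
    gamma: 0.99
    demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
    use_actions: false
    use_vail: false
```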
|
|||
# Training with Self-Play |
|||
|
|||
ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with |
|||
[Self-Play](https://openai.com/blog/competitive-self-play/). |
|||
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games |
|||
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and |
|||
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games, |
|||
this is not the case. An example of an asymmetric game is our Strikers Vs Goalie example environment. Agents in these |
|||
types of games do not always have the same observation or action spaces and so sharing policy networks is not |
|||
necessarily ideal. |
|||
|
|||
With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent |
|||
(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared |
|||
to competing against the current, best opponent in every episode, which is constantly changing (because it's learning). |
|||
|
|||
Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critic (SAC)](Training-SAC.md).
|||
However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing. |
|||
This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend that users use PPO. For further reading on |
|||
this issue in particular, see the paper [Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1702.08887.pdf). |
|||
For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md). |
|||
For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md). |
|||
|
|||
Self-play is triggered by including the self-play hyperparameter hierarchy in the trainer configuration file. Detailed description of the self-play hyperparameters are contained below. Furthermore, to distinguish opposing agents, set the team ID to different integer values in the behavior parameters script on the agent prefab. |
|||
|
|||
![Team ID](images/team_id.png) |
|||
|
|||
***Team ID must be 0 or an integer greater than 0.*** |
|||
|
|||
In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their |
|||
Behavior Parameters Script. In asymmetric games, they should have a different Behavior Name in their Behavior Parameters script. |
|||
Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration |
|||
for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy! |
|||
|
|||
For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis, Soccer, and |
|||
Strikers Vs Goalie environments. |
|||
Tennis and Soccer provide examples of symmetric games and Strikers Vs Goalie provides an example of an asymmetric game. |
|||
|
|||
|
|||
## Best Practices Training with Self-Play |
|||
|
|||
Training with self-play adds additional confounding factors to the usual |
|||
issues faced by reinforcement learning. In general, the tradeoff is between |
|||
the skill level and generality of the final policy and the stability of learning. |
|||
Training against a set of slowly changing or unchanging adversaries with low diversity
results in a more stable learning process than training against a set of quickly
changing adversaries with high diversity. With this context, this guide discusses
|||
the exposed self-play hyperparameters and intuitions for tuning them. |
|||
|
|||
|
|||
## Hyperparameters |
|||
|
|||
### Reward Signals |
|||
|
|||
We make the assumption that the final reward in a trajectory corresponds to the outcome of an episode:
a final reward greater than 0 indicates winning, less than 0 indicates losing, and exactly 0 indicates a draw.
This final reward determines the result of the episode (win, loss, or draw) in the ELO calculation.
|||
|
|||
The reward signal should still be used as described in the documentation for the other trainers and [reward signals.](Reward-Signals.md) However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward. |
|||
|
|||
In problems that are too challenging to be solved by sparse rewards, it may be necessary to provide intermediate rewards to encourage useful instrumental behaviors. |
|||
For example, it may be difficult for a soccer agent to learn that kicking a ball into the net receives a reward because this sequence has a low probability |
|||
of occurring randomly. However, it will have a higher probability of occurring if the agent learns generally that kicking the ball has utility. So, we may be able |
|||
to speed up training by giving the agent intermediate reward for kicking the ball. However, we must be careful that the agent doesn't learn to undermine |
|||
its original objective of scoring goals e.g. if it scores a goal, the episode ends and it can no longer receive reward for kicking the ball. The behavior |
|||
that receives the most reward may be to keep the ball out of the net and to kick it indefinitely! To address this, we suggest |
|||
using a curriculum that allows the agents to learn the necessary intermediate behavior (i.e. colliding with a ball) and then |
|||
decays this reward signal to allow training on just the rewards of winning and losing. Please see our documentation on |
|||
how to use curriculum learning [here](./Training-Curriculum-Learning.md) and our SoccerTwos example environment. |
|||
|
|||
### Save Steps |
|||
|
|||
The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. |
|||
|
|||
A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. |
|||
|
|||
Recommended Range : 10000-100000 |
|||
|
|||
### Team Change |
|||
|
|||
The `team_change` parameter corresponds to the number of *trainer steps* between switching the learning team.
|||
This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team |
|||
becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar |
|||
performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents |
|||
per team switch. |
|||
|
|||
A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents
the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies
and so the agent may fail against the next batch of opponents.
|||
|
|||
The value of `team_change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we
recommend setting this value as a function of the `save_steps` parameter discussed previously.
|||
|
|||
Recommended Range : 4x-10x where x=`save_steps` |
|||
|
|||
|
|||
### Swap Steps |
|||
|
|||
The `swap_steps` parameter corresponds to the number of *ghost steps* (not trainer steps) between swapping the opponents policy with a different snapshot. |
|||
A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games, |
|||
we may have teams with an unequal number of agents e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects |
|||
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number |
|||
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if |
|||
a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` |
|||
agents during `team_change` total steps is:
|||
|
|||
``` |
|||
swap_steps = (num_agents / num_opponent_agents) * (team_change / x) |
|||
``` |
|||
|
|||
As an example, in a 2v1 scenario, if we want the swap to occur `x=4` times during `team_change=200000` steps,
|||
the `swap_steps` for the team of one agent is: |
|||
|
|||
``` |
|||
swap_steps = (1 / 2) * (200000 / 4) = 25000 |
|||
``` |
|||
The `swap_steps` for the team of two agents is: |
|||
``` |
|||
swap_steps = (2 / 1) * (200000 / 4) = 100000 |
|||
``` |
|||
Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be calculated by just dividing the total steps by the desired number of swaps. |
|||
|
|||
A larger value of `swap_steps` means that an agent will play against the same fixed opponent for a longer number of training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting its behavior to this particular opponent. Thus, when a new opponent is swapped in, the agent may lose more often than expected.
|||
|
|||
Recommended Range : 10000-100000 |
|||
|
|||
### Play against latest model ratio |
|||
|
|||
The `play_against_latest_model_ratio` parameter corresponds to the probability |
|||
an agent will play against the latest opponent policy. With probability |
|||
1 - `play_against_latest_model_ratio`, the agent will play against a snapshot of its |
|||
opponent from a past iteration. |
|||
|
|||
A larger value of `play_against_latest_model_ratio` indicates that an agent will be playing against the current opponent more often. Since the agent is updating its policy, the opponent will be different from iteration to iteration. This can lead to an unstable learning environment, but poses the agent with an [auto-curriculum](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations which may lead to a stronger final policy.
|||
|
|||
Range : 0.0 - 1.0 |
|||
|
|||
### Window |
|||
|
|||
The `window` parameter corresponds to the size of the sliding window of past snapshots from which the agent's opponents are sampled. For example, a `window` size of 5 will save the last 5 snapshots taken. Each time a new snapshot is taken, the oldest is discarded. |
|||
|
|||
A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run. Like in the `save_steps` hyperparameter, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. |
|||
|
|||
Recommended Range : 5 - 30 |
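
Putting the hyperparameters above together, a sketch of a self-play block inside a behavior's trainer configuration might look like the following (the `self_play` section key and all values are illustrative and should be tuned as discussed above):

```yaml
self_play:
  save_steps: 50000
  team_change: 200000
  swap_steps: 50000
  play_against_latest_model_ratio: 0.5
  window: 10
```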
|||
|
|||
## Training Statistics |
|||
|
|||
To view training statistics, use TensorBoard. For information on launching and |
|||
using TensorBoard, see |
|||
[here](./Getting-Started.md#observing-training-progress). |
|||
|
|||
### ELO |
|||
In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively. |
|||
|
|||
We provide an implementation of the ELO rating system, a method for calculating the relative skill level between two players from a given population in a zero-sum game. For more information on ELO, please see [the ELO wiki](https://en.wikipedia.org/wiki/Elo_rating_system). |
|||
In a proper training run, the ELO of the agent should steadily increase. The absolute value of the ELO is less important than the change in ELO over training iterations. |
|||
|
|||
Note, this implementation will support any number of teams but ELO is only applicable to games with two teams. It is ongoing work to implement
a reliable metric for measuring progress in scenarios with three or more teams. These scenarios can still train, though as of now, reward and qualitative observations
are the only metrics by which we can judge performance.
|
|||
# Training With Environment Parameter Randomization |
|||
|
|||
One of the challenges of training and testing agents on the same |
|||
environment is that the agents tend to overfit. The result is that the |
|||
agents are unable to generalize to any tweaks or variations in the environment. |
|||
This is analogous to a model being trained and tested on an identical dataset |
|||
in supervised learning. This becomes problematic in cases where environments |
|||
are instantiated with varying objects or properties. |
|||
|
|||
To help agents become robust and generalize better to changes in the environment, the agent
|||
can be trained over multiple variations of a given environment. We refer to this approach as **Environment Parameter Randomization**. For those familiar with Reinforcement Learning research, this approach is based on the concept of Domain Randomization (you can read more about it [here](https://arxiv.org/abs/1703.06907)). By using parameter randomization |
|||
during training, the agent can be better suited to adapt (with higher performance) |
|||
to future unseen variations of the environment. |
|||
|
|||
_Example of variations of the 3D Ball environment._ |
|||
|
|||
Ball scale of 0.5 | Ball scale of 4 |
|||
:-------------------------:|:-------------------------: |
|||
![](images/3dball_small.png) | ![](images/3dball_big.png) |
|||
|
|||
|
|||
To enable variations in the environments, we implemented `Environment Parameters`. |
|||
`Environment Parameters` are values in the `FloatPropertiesChannel` that can be read when setting |
|||
up the environment. We |
|||
also included different sampling methods and the ability to create new kinds of |
|||
sampling methods for each `Environment Parameter`. In the 3D ball environment example displayed |
|||
in the figure above, the environment parameters are `gravity`, `ball_mass` and `ball_scale`. |
|||
|
|||
|
|||
## How to Enable Environment Parameter Randomization |
|||
|
|||
We first need to provide a way to modify the environment by supplying a set of `Environment Parameters` |
|||
and vary them over time. This can be done either deterministically or randomly.
|||
|
|||
This is done by assigning each `Environment Parameter` a `sampler-type` (such as a uniform sampler),
which determines how to sample an `Environment Parameter`. If a `sampler-type` isn't provided for an
`Environment Parameter`, the parameter maintains its default value throughout the
training procedure, remaining unchanged. The samplers for all the `Environment Parameters`
|||
are handled by a **Sampler Manager**, which also handles the generation of new |
|||
values for the environment parameters when needed. |
|||
|
|||
To set up the Sampler Manager, we create a YAML file that specifies how we wish to
generate new samples for each `Environment Parameter`. In this file, we specify the samplers and the
|||
`resampling-interval` (the number of simulation steps after which environment parameters are |
|||
resampled). Below is an example of a sampler file for the 3D ball environment. |
|||
|
|||
```yaml |
|||
resampling-interval: 5000 |
|||
|
|||
mass: |
|||
sampler-type: "uniform" |
|||
min_value: 0.5 |
|||
max_value: 10 |
|||
|
|||
gravity: |
|||
sampler-type: "multirange_uniform" |
|||
intervals: [[7, 10], [15, 20]] |
|||
|
|||
scale: |
|||
sampler-type: "uniform" |
|||
min_value: 0.75 |
|||
max_value: 3 |
|||
|
|||
``` |
|||
|
|||
Below is the explanation of the fields in the above example. |
|||
|
|||
* `resampling-interval` - Specifies the number of steps for the agent to |
|||
train under a particular environment configuration before resetting the |
|||
environment with a new sample of `Environment Parameters`. |
|||
|
|||
* `Environment Parameter` - Name of the `Environment Parameter` like `mass`, `gravity` and `scale`. This should match the name
specified in the `FloatPropertiesChannel` of the environment being trained. If a parameter specified in the file doesn't exist in the
environment, then this parameter will be ignored. Within each `Environment Parameter`, the following fields are specified:
|||
|
|||
* `sampler-type` - Specify the sampler type to use for the `Environment Parameter`. |
|||
This is a string that should exist in the `Sampler Factory` (explained |
|||
below). |
|||
|
|||
* `sampler-type-sub-arguments` - Specify the sub-arguments depending on the `sampler-type`. |
|||
In the example above, this would correspond to the `intervals` |
|||
under the `sampler-type` `"multirange_uniform"` for the `Environment Parameter` called `gravity`. |
|||
The key name should match the name of the corresponding argument in the sampler definition. |
|||
(See below) |
|||
|
|||
The Sampler Manager allocates a sampler type for each `Environment Parameter` by using the *Sampler Factory*, |
|||
which maintains a dictionary mapping of string keys to sampler objects. The sampler types
available for each `Environment Parameter` are those registered in the Sampler Factory.
|||
|
|||
### Included Sampler Types |
|||
|
|||
Below is a list of the `sampler-type` options included as part of the toolkit.
|||
|
|||
* `uniform` - Uniform sampler |
|||
* Uniformly samples a single float value between defined endpoints. |
|||
The sub-arguments for this sampler to specify the interval |
|||
endpoints are as below. The sampling is done in the range of |
|||
[`min_value`, `max_value`). |
|||
|
|||
* **sub-arguments** - `min_value`, `max_value` |
|||
|
|||
* `gaussian` - Gaussian sampler |
|||
* Samples a single float value from the distribution characterized by |
|||
the mean and standard deviation. The sub-arguments to specify the |
|||
gaussian distribution to use are as below. |
|||
|
|||
* **sub-arguments** - `mean`, `st_dev` |
|||
|
|||
* `multirange_uniform` - Multirange uniform sampler |
|||
* Uniformly samples a single float value from the specified intervals.
Samples by first performing a weighted pick of an interval from the list
of intervals (weighted based on interval width) and then samples uniformly
|||
from the selected interval (half-closed interval, same as the uniform |
|||
sampler). This sampler can take an arbitrary number of intervals in a |
|||
list in the following format: |
|||
[[`interval_1_min`, `interval_1_max`], [`interval_2_min`, `interval_2_max`], ...] |
|||
|
|||
* **sub-arguments** - `intervals` |
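
As an illustration of how the sub-arguments listed above map to YAML keys, a `gaussian` entry for the `gravity` parameter might look like the following sketch (values are illustrative):

```yaml
gravity:
  sampler-type: "gaussian"
  mean: 9.8
  st_dev: 0.5
```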
|||
|
|||
The implementation of the samplers can be found at `ml-agents-envs/mlagents_envs/sampler_class.py`. |
|||
|
|||
### Defining a New Sampler Type |
|||
|
|||
If you want to define your own sampler type, you must first inherit the *Sampler* |
|||
base class (included in the `sampler_class` file) and preserve its interface.
Once the new sampler class is defined, it must be registered in the Sampler Factory.
|||
|
|||
This is done by calling the *register_sampler* method of the SamplerFactory. The call
is as follows:
|||
|
|||
`SamplerFactory.register_sampler(*custom_sampler_string_key*, *custom_sampler_object*)` |
|||
|
|||
Once the Sampler Factory reflects the new registration, the new sampler type can be used to sample any
`Environment Parameter`. For example, let's say a new sampler type was implemented as below and we register
|||
the `CustomSampler` class with the string `custom-sampler` in the Sampler Factory. |
|||
|
|||
```python
import numpy as np

# Sampler base class; as noted above, the built-in samplers live in
# ml-agents-envs/mlagents_envs/sampler_class.py
from mlagents_envs.sampler_class import Sampler


class CustomSampler(Sampler):

    def __init__(self, argA, argB, argC):
        # The set of values this sampler can return.
        self.possible_vals = [argA, argB, argC]

    def sample_all(self):
        # Pick one of the possible values uniformly at random.
        return np.random.choice(self.possible_vals)
```
|||
|
|||
Now we need to specify the new sampler type in the sampler YAML file. For example, we use this new |
|||
sampler type for the `Environment Parameter` *mass*. |
|||
|
|||
```yaml |
|||
mass: |
|||
sampler-type: "custom-sampler" |
|||
argB: 1 |
|||
argA: 2 |
|||
argC: 3 |
|||
``` |
|||
|
|||
### Training with Environment Parameter Randomization |
|||
|
|||
After the sampler YAML file is defined, we proceed by launching `mlagents-learn` and specify |
|||
our configured sampler file with the `--sampler` flag. For example, if we wanted to train the |
|||
3D ball agent with parameter randomization using `Environment Parameters` with `config/3dball_randomize.yaml` |
|||
sampling setup, we would run |
|||
|
|||
```sh |
|||
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_randomize.yaml --run-id=3D-Ball-randomize
|||
``` |
|||
|
|||
We can observe progress and metrics via TensorBoard.
|
|||
# Training with Proximal Policy Optimization |
|||
|
|||
ML-Agents provides an implementation of a reinforcement learning algorithm called |
|||
[Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/). |
|||
PPO uses a neural network to approximate the ideal function that maps an agent's |
|||
observations to the best action an agent can take in a given state. The |
|||
ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate |
|||
Python process (communicating with the running Unity application over a socket). |
|||
|
|||
ML-Agents also provides an implementation of |
|||
[Soft Actor-Critic (SAC)](https://bair.berkeley.edu/blog/2018/12/14/sac/). SAC tends |
|||
to be more _sample-efficient_, i.e. require fewer environment steps, |
|||
than PPO, but may spend more time performing model updates. This can produce a large |
|||
speedup on heavy or slow environments. Check out how to train with |
|||
SAC [here](Training-SAC.md). |
|||
|
|||
To train an agent, you will need to provide the agent one or more reward signals which |
|||
the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md) |
|||
for the available reward signals and the corresponding hyperparameters. |
|||
|
|||
See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the |
|||
training program, `mlagents-learn`.
|||
|
|||
If you are using the recurrent neural network (RNN) to utilize memory, see |
|||
[Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training |
|||
details. |
|||
|
|||
If you are using curriculum training to pace the difficulty of the learning task |
|||
presented to an agent, see [Training with Curriculum |
|||
Learning](Training-Curriculum-Learning.md). |
|||
|
|||
For information about imitation learning from demonstrations, see |
|||
[Training with Imitation Learning](Training-Imitation-Learning.md). |
|||
|
|||
## Best Practices Training with PPO |
|||
|
|||
Successfully training a Reinforcement Learning model often involves tuning the |
|||
training hyperparameters. This guide contains some best practices for tuning the |
|||
training process when the default parameters don't seem to be giving the level |
|||
of performance you would like. |
|||
|
|||
## Hyperparameters |
|||
|
|||
### Reward Signals |
|||
|
|||
In reinforcement learning, the goal is to learn a Policy that maximizes reward. |
|||
At a base level, the reward is given by the environment. However, we could imagine |
|||
rewarding the agent for various different behaviors. For instance, we could reward |
|||
the agent for exploring new states, rather than just when an explicit reward is given. |
|||
Furthermore, we could mix reward signals to help the learning process. |
|||
|
|||
Using `reward_signals` allows you to define [reward signals.](Reward-Signals.md) |
|||
The ML-Agents Toolkit provides three reward signals by default, the Extrinsic (environment) |
|||
reward signal, the Curiosity reward signal, which can be used to encourage exploration in |
|||
sparse extrinsic reward environments, and the GAIL reward signal. Please see [Reward Signals](Reward-Signals.md) |
|||
for additional details. |
|||
|
|||
### Lambda |
|||
|
|||
`lambd` corresponds to the `lambda` parameter used when calculating the |
|||
Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). This |
|||
can be thought of as how much the agent relies on its current value estimate |
|||
when calculating an updated value estimate. Low values correspond to relying |
|||
more on the current value estimate (which can be high bias), and high values |
|||
correspond to relying more on the actual rewards received in the environment |
|||
(which can be high variance). The parameter provides a trade-off between the |
|||
two, and the right value can lead to a more stable training process. |
|||
|
|||
Typical Range: `0.9` - `0.95` |
|||
|
|||
### Buffer Size |
|||
|
|||
`buffer_size` corresponds to how many experiences (agent observations, actions |
|||
and rewards obtained) should be collected before we do any learning or updating |
|||
of the model. **This should be a multiple of `batch_size`**. Typically a larger |
|||
`buffer_size` corresponds to more stable training updates. |
|||
|
|||
Typical Range: `2048` - `409600` |
|||
|
|||
### Batch Size |
|||
|
|||
`batch_size` is the number of experiences used for one iteration of a gradient |
|||
descent update. **This should always be a fraction of the `buffer_size`**. If |
|||
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
|||
|
|||
Typical Range (Continuous): `512` - `5120` |
|||
|
|||
Typical Range (Discrete): `32` - `512` |
|||
|
|||
### Number of Epochs |
|||
|
|||
`num_epoch` is the number of passes through the experience buffer during |
|||
gradient descent. The larger the `batch_size`, the larger it is acceptable to |
|||
make this. Decreasing this will ensure more stable updates, at the cost of |
|||
slower learning. |
|||
|
|||
Typical Range: `3` - `10` |
|||
|
|||
### Learning Rate |
|||
|
|||
`learning_rate` corresponds to the strength of each gradient descent update |
|||
step. This should typically be decreased if training is unstable, and the reward |
|||
does not consistently increase. |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
|||
|
|||
### (Optional) Learning Rate Schedule |
|||
|
|||
`learning_rate_schedule` corresponds to how the learning rate is changed over time. |
|||
For PPO, we recommend decaying learning rate until `max_steps` so learning converges |
|||
more stably. However, for some cases (e.g. training for an unknown amount of time) |
|||
this feature can be disabled. |
|||
|
|||
Options: |
|||
* `linear` (default): Decay `learning_rate` linearly, reaching 0 at `max_steps`. |
|||
* `constant`: Keep learning rate constant for the entire training run. |
|||
|
|||
Options: `linear`, `constant` |
|||
|
|||
### Time Horizon |
|||
|
|||
`time_horizon` corresponds to how many steps of experience to collect per-agent |
|||
before adding it to the experience buffer. When this limit is reached before the |
|||
end of an episode, a value estimate is used to predict the overall expected |
|||
reward from the agent's current state. As such, this parameter trades off |
|||
between a less biased, but higher variance estimate (long time horizon) and more |
|||
biased, but less varied estimate (short time horizon). In cases where there are |
|||
frequent rewards within an episode, or episodes are prohibitively large, a |
|||
smaller number can be more ideal. This number should be large enough to capture |
|||
all the important behavior within a sequence of an agent's actions. |
|||
|
|||
Typical Range: `32` - `2048` |
|||
|
|||
### Max Steps |
|||
|
|||
`max_steps` corresponds to how many steps of the simulation (multiplied by |
|||
frame-skip) are run during the training process. This value should be increased |
|||
for more complex problems. |
|||
|
|||
Typical Range: `5e5` - `1e7` |
|||
|
|||
### Beta |
|||
|
|||
`beta` corresponds to the strength of the entropy regularization, which makes |
|||
the policy "more random." This ensures that agents properly explore the action |
|||
space during training. Increasing this will ensure more random actions are |
|||
taken. This should be adjusted such that the entropy (measurable from |
|||
TensorBoard) slowly decreases alongside increases in reward. If entropy drops |
|||
too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`. |
|||
|
|||
Typical Range: `1e-4` - `1e-2` |
|||
|
|||
### Epsilon |
|||
|
|||
`epsilon` corresponds to the acceptable threshold of divergence between the old |
|||
and new policies during gradient descent updating. Setting this value small will |
|||
result in more stable updates, but will also slow the training process. |
|||
|
|||
Typical Range: `0.1` - `0.3` |
|||
|
|||
### Normalize |
|||
|
|||
`normalize` corresponds to whether normalization is applied to the vector |
|||
observation inputs. This normalization is based on the running average and |
|||
variance of the vector observation. Normalization can be helpful in cases with |
|||
complex continuous control problems, but may be harmful with simpler discrete |
|||
control problems. |
|||
|
|||
### Number of Layers |
|||
|
|||
`num_layers` corresponds to how many hidden layers are present after the |
|||
observation input, or after the CNN encoding of the visual observation. For |
|||
simple problems, fewer layers are likely to train faster and more efficiently. |
|||
More layers may be necessary for more complex control problems. |
|||
|
|||
Typical range: `1` - `3` |
|||
|
|||
### Hidden Units |
|||
|
|||
`hidden_units` correspond to how many units are in each fully connected layer of |
|||
the neural network. For simple problems where the correct action is a |
|||
straightforward combination of the observation inputs, this should be small. For |
|||
problems where the action is a very complex interaction between the observation |
|||
variables, this should be larger. |
|||
|
|||
Typical Range: `32` - `512` |
|||
|
|||
### (Optional) Visual Encoder Type |
|||
|
|||
`vis_encode_type` corresponds to the encoder type for encoding visual observations. |
|||
Valid options include: |
|||
* `simple` (default): a simple encoder which consists of two convolutional layers |
|||
* `nature_cnn`: [CNN implementation proposed by Mnih et al.](https://www.nature.com/articles/nature14236), |
|||
consisting of three convolutional layers |
|||
* `resnet`: [IMPALA Resnet implementation](https://arxiv.org/abs/1802.01561), |
|||
consisting of three stacked layers, each with two residual blocks, making a |
|||
much larger network than the other two. |
|||
|
|||
Options: `simple`, `nature_cnn`, `resnet` |
|||
|
|||
## (Optional) Recurrent Neural Network Hyperparameters |
|||
|
|||
The below hyperparameters are only used when `use_recurrent` is set to true. |
|||
|
|||
### Sequence Length |
|||
|
|||
`sequence_length` corresponds to the length of the sequences of experience |
|||
passed through the network during training. This should be long enough to |
|||
capture whatever information your agent might need to remember over time. For |
|||
example, if your agent needs to remember the velocity of objects, then this can |
|||
be a small value. If your agent needs to remember a piece of information given |
|||
only once at the beginning of an episode, then this should be a larger value. |
|||
|
|||
Typical Range: `4` - `128` |
|||
|
|||
### Memory Size |
|||
|
|||
`memory_size` corresponds to the size of the array of floating point numbers |
|||
used to store the hidden state of the recurrent neural network of the policy. This value must |
|||
be a multiple of 2, and should scale with the amount of information you expect |
|||
the agent will need to remember in order to successfully complete the task. |
|||
|
|||
Typical Range: `32` - `256` |
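
For reference, a sketch of a complete PPO entry in `trainer_config.yaml` that combines the hyperparameters described above might look like the following. The behavior name, the `trainer` key used to select the algorithm, and all values are illustrative assumptions, not recommendations:

```yaml
MyBehaviorName:
  trainer: ppo
  batch_size: 1024
  buffer_size: 10240
  beta: 5.0e-3
  epsilon: 0.2
  lambd: 0.95
  learning_rate: 3.0e-4
  learning_rate_schedule: linear
  num_epoch: 3
  time_horizon: 64
  max_steps: 5.0e5
  normalize: false
  num_layers: 2
  hidden_units: 128
  vis_encode_type: simple
  use_recurrent: false
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
```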
|||
|
|||
## (Optional) Behavioral Cloning Using Demonstrations |
|||
|
|||
In some cases, you might want to bootstrap the agent's policy using behavior recorded |
|||
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds |
|||
training operations that mimic a demonstration rather than attempting to maximize reward. |
|||
|
|||
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance: |
|||
|
|||
```yaml
|||
behavioral_cloning: |
|||
demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo |
|||
strength: 0.5 |
|||
steps: 10000 |
|||
``` |
|||
|
|||
Below are the available hyperparameters for BC. |
|||
|
|||
### Strength |
|||
|
|||
`strength` corresponds to the learning rate of the imitation relative to the learning |
|||
rate of PPO, and roughly corresponds to how strongly we allow BC |
|||
to influence the policy. |
|||
|
|||
Typical Range: `0.1` - `0.5` |
|||
|
|||
### Demo Path |
|||
|
|||
`demo_path` is the path to your `.demo` file or directory of `.demo` files. |
|||
See the [imitation learning guide](Training-Imitation-Learning.md) for more on `.demo` files. |
|||
|
|||
### Steps |
|||
|
|||
During BC, it is often desirable to stop using demonstrations after the agent has |
|||
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize |
|||
outside of the provided demonstrations. `steps` corresponds to the training steps over which |
|||
BC is active. The learning rate of BC will anneal over the steps. Set |
|||
the steps to 0 for constant imitation over the entire training run. |
|||
|
|||
### (Optional) Batch Size |
|||
|
|||
`batch_size` is the number of demonstration experiences used for one iteration of a gradient |
|||
descent update. If not specified, it will default to the `batch_size` defined for PPO. |
|||
|
|||
Typical Range (Continuous): `512` - `5120` |
|||
|
|||
Typical Range (Discrete): `32` - `512` |
|||
|
|||
### (Optional) Number of Epochs |
|||
|
|||
`num_epoch` is the number of passes through the experience buffer during |
|||
gradient descent. If not specified, it will default to the number of epochs set for PPO. |
|||
|
|||
Typical Range: `3` - `10` |
|||
|
|||
### (Optional) Samples Per Update |
|||
|
|||
`samples_per_update` is the maximum number of samples |
|||
to use during each imitation update. You may want to lower this if your demonstration |
|||
dataset is very large to avoid overfitting the policy on demonstrations. Set to 0 |
|||
to train over all of the demonstrations at each update step. |
|||
|
|||
Default Value: `0` (all) |
|||
|
|||
Typical Range: Approximately equal to PPO's `buffer_size` |
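
Combining the optional settings above, a fuller `behavioral_cloning` section might look like this sketch (values are illustrative):

```yaml
behavioral_cloning:
  demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
  strength: 0.5
  steps: 10000
  batch_size: 512
  num_epoch: 3
  samples_per_update: 0
```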
|||
|
|||
### (Optional) Advanced: Initialize Model Path |
|||
|
|||
`init_path` can be specified to initialize your model from a previous run before starting. |
|||
Note that the prior run should have used the same trainer configurations as the current run, |
|||
and have been saved with the same version of ML-Agents. You should provide the full path |
|||
to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. |
|||
|
|||
This option is provided in case you want to initialize different behaviors from different runs; |
|||
in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize |
|||
all models from the same run. |
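
A sketch of how `init_path` might be placed, assuming it sits alongside the other trainer settings for that behavior (the behavior name and run id are hypothetical):

```yaml
MyBehaviorName:
  trainer: ppo
  init_path: ./models/previous-run-id/MyBehaviorName
```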
|||
|
|||
### (Optional) Advanced: Disable Threading |
|||
|
|||
By default, PPO model updates can happen while the environment is being stepped. This violates the |
|||
[on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) |
|||
assumption of PPO slightly in exchange for a 10-20% training speedup. To maintain the |
|||
strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. |
|||
|
|||
Default Value: `true` |
|||
|
|||
## Training Statistics |
|||
|
|||
To view training statistics, use TensorBoard. For information on launching and |
|||
using TensorBoard, see |
|||
[here](./Getting-Started.md#observing-training-progress). |
|||
|
|||
### Cumulative Reward |
|||
|
|||
The general trend in reward should consistently increase over time. Small ups |
|||
and downs are to be expected. Depending on the complexity of the task, a |
|||
significant increase in reward may not present itself until millions of steps |
|||
into the training process. |
|||
|
|||
### Entropy |
|||
|
|||
This corresponds to how random the decisions are. This should |
|||
consistently decrease during training. If it decreases too soon or not at all, |
|||
`beta` should be adjusted (when using discrete action space). |
|||
|
|||
### Learning Rate |
|||
|
|||
This will decrease over time on a linear schedule by default, unless `learning_rate_schedule` |
|||
is set to `constant`. |
|||
|
|||
### Policy Loss |
|||
|
|||
These values will oscillate during training. Generally they should be less than |
|||
1.0. |
|||
|
|||
### Value Estimate |
|||
|
|||
These values should increase as the cumulative reward increases. They correspond |
|||
to how much future reward the agent predicts itself receiving at any given |
|||
point. |
|||
|
|||
### Value Loss |
|||
|
|||
These values will increase as the reward increases, and then should decrease |
|||
once reward becomes stable. |
|
|||
# Training with Soft Actor-Critic
|||
|
|||
In addition to [Proximal Policy Optimization (PPO)](Training-PPO.md), ML-Agents also provides |
|||
[Soft Actor-Critic](http://bair.berkeley.edu/blog/2018/12/14/sac/) to perform |
|||
reinforcement learning. |
|||
|
|||
In contrast with PPO, SAC is _off-policy_, which means it can learn from experiences collected |
|||
at any time during the past. As experiences are collected, they are placed in an |
|||
experience replay buffer and randomly drawn during training. This makes SAC |
|||
significantly more sample-efficient, often requiring 5-10 times fewer samples to learn
|||
the same task as PPO. However, SAC tends to require more model updates. SAC is a |
|||
good choice for heavier or slower environments (about 0.1 seconds per step or more). |
|||
|
|||
SAC is also a "maximum entropy" algorithm, and enables exploration in an intrinsic way. |
|||
Read more about maximum entropy RL [here](https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/). |
|||
|
|||
To train an agent, you will need to provide the agent one or more reward signals which |
|||
the agent should attempt to maximize. See [Reward Signals](Reward-Signals.md) |
|||
for the available reward signals and the corresponding hyperparameters. |
|||
|
|||
## Best Practices when training with SAC |
|||
|
|||
Successfully training a reinforcement learning model often involves tuning |
|||
hyperparameters. This guide contains some best practices for training |
|||
when the default parameters don't seem to be giving the level of performance |
|||
you would like. |
|||
|
|||
## Hyperparameters |
|||
|
|||
### Reward Signals |
|||
|
|||
In reinforcement learning, the goal is to learn a Policy that maximizes reward. |
|||
In the most basic case, the reward is given by the environment. However, we could imagine |
|||
rewarding the agent for various different behaviors. For instance, we could reward |
|||
the agent for exploring new states, rather than only when an explicit reward is given.
|||
Furthermore, we could mix reward signals to help the learning process. |
|||
|
|||
`reward_signals` provides a section to define [reward signals.](Reward-Signals.md) |
|||
ML-Agents provides three reward signals by default: the Extrinsic (environment) reward signal,
the Curiosity reward signal, which can be used to encourage exploration in sparse extrinsic reward
environments, and the GAIL reward signal.
|||
|
|||
#### Steps Per Update for Reward Signal (Optional) |
|||
|
|||
`reward_signal_steps_per_update` for the reward signals corresponds to the number of steps per mini batch sampled |
|||
and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. |
|||
However, to imitate the training procedure in certain imitation learning papers (e.g. |
|||
[Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), |
|||
we may want to update the reward signal (GAIL) M times for every update of the policy. |
|||
We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` |
|||
under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to |
|||
`steps_per_update`. |
|||
|
|||
Typical Range: `steps_per_update` |
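
As a sketch of the N / M relationship described above, suppose we want the GAIL reward signal updated twice for every policy update while the policy updates once every 8 agent steps. Following the nesting described in this section (the exact layout and values are assumptions for illustration):

```yaml
steps_per_update: 8                    # N: one policy update per 8 agent steps, on average
reward_signals:
  reward_signal_steps_per_update: 4    # N / M, with M = 2 reward signal updates per policy update
  gail:
    strength: 0.5
    gamma: 0.99
    demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```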
|||
|
|||
### Buffer Size |
|||
|
|||
`buffer_size` corresponds to the maximum number of experiences (agent observations, actions
|||
and rewards obtained) that can be stored in the experience replay buffer. This value should be |
|||
large, on the order of thousands of times longer than your episodes, so that SAC |
|||
can learn from old as well as new experiences. It should also be much larger than |
|||
`batch_size`. |
|||
|
|||
Typical Range: `50000` - `1000000` |
|||
|
|||
### Buffer Init Steps |
|||
|
|||
`buffer_init_steps` is the number of experiences to prefill the buffer with before attempting training. |
|||
As the untrained policy is fairly random, prefilling the buffer with random actions is |
|||
useful for exploration. Typically, at least several episodes of experiences should be |
|||
prefilled. |
|||
|
|||
Typical Range: `1000` - `10000` |
|||
|
|||
### Batch Size |
|||
|
|||
`batch_size` is the number of experiences used for one iteration of a gradient |
|||
descent update. If |
|||
you are using a continuous action space, this value should be large (on the
order of 1000s). If you are using a discrete action space, this value should be
smaller (on the order of 10s).
|||
|
|||
Typical Range (Continuous): `128` - `1024` |
|||
|
|||
Typical Range (Discrete): `32` - `512` |
|||
|
|||
### Initial Entropy Coefficient |
|||
|
|||
`init_entcoef` refers to the initial entropy coefficient set at the beginning of training. In |
|||
SAC, the agent is incentivized to make its actions entropic to facilitate better exploration. |
|||
The entropy coefficient weighs the true reward with a bonus entropy reward. The entropy |
|||
coefficient is [automatically adjusted](https://arxiv.org/abs/1812.05905) to a preset target |
|||
entropy, so the `init_entcoef` only corresponds to the starting value of the entropy bonus. |
|||
Increase `init_entcoef` to explore more in the beginning, decrease to converge to a solution faster. |
|||
|
|||
Typical Range (Continuous): `0.5` - `1.0` |
|||
|
|||
Typical Range (Discrete): `0.05` - `0.5` |
|||
|
|||
### Train Interval |
|||
|
|||
`train_interval` is the number of steps taken between each agent training event. Typically, |
|||
we can train after every step, but if your environment's steps are very small and very frequent, |
|||
there may not be any new interesting information between steps, and `train_interval` can be increased. |
|||
|
|||
Typical Range: `1` - `5` |
|||
|
|||
### Steps Per Update |
|||
|
|||
`steps_per_update` corresponds to the average ratio of agent steps (actions) taken to updates made of the agent's |
|||
policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience |
|||
replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after |
|||
exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. |
|||
|
|||
Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will |
|||
improve sample efficiency (reduce the number of steps required to train) |
|||
but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example |
|||
environments) `steps_per_update` equal to the number of agents in the scene is a good balance. |
|||
For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. |
|||
We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will |
|||
usually result in a slowdown unless the environment is very slow. |
|||
|
|||
Typical Range: `1` - `20` |
|||
|
|||
### Tau |
|||
|
|||
`tau` corresponds to the magnitude of the target Q update during the SAC model update. |
|||
In SAC, there are two neural networks: the target and the policy. The target network is |
|||
used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed |
|||
while the policy is being updated. This target is then slowly updated according to `tau`. |
|||
Typically, this value should be left at `0.005`. For simple problems, increasing |
|||
`tau` to `0.01` might reduce the time it takes to learn, at the cost of stability. |
|||
|
|||
Typical Range: `0.005` - `0.01` |
|||
|
|||
### Learning Rate |
|||
|
|||
`learning_rate` corresponds to the strength of each gradient descent update |
|||
step. This should typically be decreased if training is unstable, and the reward |
|||
does not consistently increase. |
|||
|
|||
Typical Range: `1e-5` - `1e-3` |
|||
|
|||
### (Optional) Learning Rate Schedule |
|||
|
|||
`learning_rate_schedule` corresponds to how the learning rate is changed over time. |
|||
For SAC, we recommend holding learning rate constant so that the agent can continue to |
|||
learn until its Q function converges naturally. |
|||
|
|||
Options: |
|||
* `linear`: Decay `learning_rate` linearly, reaching 0 at `max_steps`. |
|||
* `constant` (default): Keep learning rate constant for the entire training run. |
|||
|
|||
Options: `linear`, `constant` |
|||
|
|||
### Time Horizon |
|||
|
|||
`time_horizon` corresponds to how many steps of experience to collect per-agent |
|||
before adding it to the experience buffer. This parameter is a lot less critical |
|||
to SAC than PPO, and can typically be set to approximately your episode length. |
|||
|
|||
Typical Range: `32` - `2048` |
|||
|
|||
### Max Steps |
|||
|
|||
`max_steps` corresponds to how many steps of the simulation (multiplied by |
|||
frame-skip) are run during the training process. This value should be increased |
|||
for more complex problems. |
|||
|
|||
Typical Range: `5e5` - `1e7` |
|||
|
|||
### Normalize |
|||
|
|||
`normalize` corresponds to whether normalization is applied to the vector |
|||
observation inputs. This normalization is based on the running average and |
|||
variance of the vector observation. Normalization can be helpful in cases with |
|||
complex continuous control problems, but may be harmful with simpler discrete |
|||
control problems. |
|||
|
|||
### Number of Layers |
|||
|
|||
`num_layers` corresponds to how many hidden layers are present after the |
|||
observation input, or after the CNN encoding of the visual observation. For |
|||
simple problems, fewer layers are likely to train faster and more efficiently. |
|||
More layers may be necessary for more complex control problems. |
|||
|
|||
Typical range: `1` - `3` |
|||
|
|||
### Hidden Units |
|||
|
|||
`hidden_units` correspond to how many units are in each fully connected layer of |
|||
the neural network. For simple problems where the correct action is a |
|||
straightforward combination of the observation inputs, this should be small. For |
|||
problems where the action is a very complex interaction between the observation |
|||
variables, this should be larger. |
|||
|
|||
Typical Range: `32` - `512` |
|||
|
|||
### (Optional) Visual Encoder Type |
|||
|
|||
`vis_encode_type` corresponds to the encoder type for encoding visual observations. |
|||
Valid options include: |
|||
* `simple` (default): a simple encoder which consists of two convolutional layers |
|||
* `nature_cnn`: [CNN implementation proposed by Mnih et al.](https://www.nature.com/articles/nature14236), |
|||
consisting of three convolutional layers |
|||
* `resnet`: [IMPALA Resnet implementation](https://arxiv.org/abs/1802.01561), |
|||
consisting of three stacked layers, each with two residual blocks, making a |
|||
much larger network than the other two. |
|||
|
|||
Options: `simple`, `nature_cnn`, `resnet` |
|||
|
|||
## (Optional) Recurrent Neural Network Hyperparameters |
|||
|
|||
The below hyperparameters are only used when `use_recurrent` is set to true. |
|||
|
|||
### Sequence Length |
|||
|
|||
`sequence_length` corresponds to the length of the sequences of experience |
|||
passed through the network during training. This should be long enough to |
|||
capture whatever information your agent might need to remember over time. For |
|||
example, if your agent needs to remember the velocity of objects, then this can |
|||
be a small value. If your agent needs to remember a piece of information given |
|||
only once at the beginning of an episode, then this should be a larger value. |
|||
|
|||
Typical Range: `4` - `128` |
|||
|
|||
### Memory Size |
|||
|
|||
`memory_size` corresponds to the size of the array of floating point numbers |
|||
used to store the hidden state of the recurrent neural network in the policy. |
|||
This value must be a multiple of 2, and should scale with the amount of information you expect |
|||
the agent will need to remember in order to successfully complete the task. |
|||
|
|||
Typical Range: `32` - `256` |
|||
|
|||
### (Optional) Save Replay Buffer |
|||
|
|||
`save_replay_buffer` enables you to save and load the experience replay buffer as well as |
|||
the model when quitting and re-starting training. This may help resumes go more smoothly, |
|||
as the experiences collected won't be wiped. Note that replay buffers can be very large, and |
|||
will take up a considerable amount of disk space. For that reason, we disable this feature by |
|||
default. |
|||
|
|||
Default: `False` |
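
For reference, a sketch of a complete SAC entry in `trainer_config.yaml` combining the hyperparameters above might look like the following. The behavior name, the `trainer` key used to select the algorithm, and all values are illustrative assumptions, not recommendations:

```yaml
MyBehaviorName:
  trainer: sac
  batch_size: 128
  buffer_size: 50000
  buffer_init_steps: 1000
  init_entcoef: 1.0
  tau: 0.005
  train_interval: 1
  steps_per_update: 1
  learning_rate: 3.0e-4
  learning_rate_schedule: constant
  time_horizon: 64
  max_steps: 5.0e5
  normalize: false
  num_layers: 2
  hidden_units: 128
  save_replay_buffer: false
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
```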
|||
|
|||
## (Optional) Behavioral Cloning Using Demonstrations |
|||
|
|||
In some cases, you might want to bootstrap the agent's policy using behavior recorded |
|||
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds |
|||
training operations that mimic a demonstration rather than attempting to maximize reward. |
|||
|
|||
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance: |
|||
|
|||
```yaml
|||
behavioral_cloning: |
|||
demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo |
|||
strength: 0.5 |
|||
steps: 10000 |
|||
``` |
|||
|
|||
Below are the available hyperparameters for BC. |
|||
|
|||
### Strength |
|||
|
|||
`strength` corresponds to the learning rate of the imitation relative to the learning |
|||
rate of SAC, and roughly corresponds to how strongly we allow BC |
|||
to influence the policy. |
|||
|
|||
Typical Range: `0.1` - `0.5` |
|||
|
|||
### Demo Path

`demo_path` is the path to your `.demo` file or directory of `.demo` files. See
the [imitation learning guide](Training-Imitation-Learning.md) for more on
`.demo` files.

### Steps

During BC, it is often desirable to stop using demonstrations after the agent
has "seen" rewards, and allow it to optimize past the available demonstrations
and/or generalize outside of the provided demonstrations. `steps` corresponds to
the training steps over which BC is active. The learning rate of BC will anneal
over the steps. Set the steps to 0 for constant imitation over the entire
training run.

### (Optional) Batch Size

`batch_size` is the number of demonstration experiences used for one iteration
of a gradient descent update. If not specified, it will default to the
`batch_size` defined for SAC.

Typical Range (Continuous): `512` - `5120`

Typical Range (Discrete): `32` - `512`

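Putting the optional settings together, a `behavioral_cloning` section that
anneals over a fixed number of steps and overrides the batch size might look
like this (the step count and batch size are only illustrative):

```yaml
behavioral_cloning:
  demo_path: ./Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
  strength: 0.5
  steps: 150000    # anneal the BC learning rate over these steps; 0 keeps it constant
  batch_size: 512  # optional; defaults to the trainer's batch_size if omitted
```
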
### (Optional) Advanced: Initialize Model Path

`init_path` can be specified to initialize your model from a previous run before
starting. Note that the prior run should have used the same trainer
configurations as the current run, and have been saved with the same version of
ML-Agents. You should provide the full path to the folder where the checkpoints
were saved, e.g. `./models/{run-id}/{behavior_name}`.

This option is provided in case you want to initialize different behaviors from
different runs; in most cases, it is sufficient to use the `--initialize-from`
CLI parameter to initialize all models from the same run.

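As a sketch, initializing a single behavior from the checkpoints of an earlier
run might look like this (the run id and behavior name are hypothetical):

```yaml
StrikerBehavior:   # hypothetical behavior name
  trainer: ppo
  init_path: ./models/previous-run-id/StrikerBehavior   # hypothetical previous run
```
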
## Training Statistics

To view training statistics, use TensorBoard. For information on launching and
using TensorBoard, see
[here](./Getting-Started.md#observing-training-progress).

### Cumulative Reward

The general trend in reward should consistently increase over time. Small ups
and downs are to be expected. Depending on the complexity of the task, a
significant increase in reward may not present itself until millions of steps
into the training process.

### Entropy Coefficient

SAC is a "maximum entropy" reinforcement learning algorithm, and agents trained
using SAC are incentivized to behave randomly while also solving the problem.
The entropy coefficient balances the incentive to behave randomly vs. maximizing
the reward. This value is adjusted automatically so that the agent retains some
amount of randomness during training. It should steadily decrease in the
beginning of training, and reach some small value where it will level off. If it
decreases too soon or takes too long to decrease, `init_entcoef` should be
adjusted.

### Entropy

This corresponds to how random the agent's decisions are. This should initially
increase during training, reach a peak, and should decline along with the
Entropy Coefficient. This is because in the beginning, the agent is incentivized
to be more random for exploration due to a high entropy coefficient. If it
decreases too soon or takes too long to decrease, `init_entcoef` should be
adjusted.

### Learning Rate

This will stay a constant value by default, unless `learning_rate_schedule` is
set to `linear`.

### Policy Loss

These values may increase as the agent explores, but should decrease long-term
as the agent learns how to solve the task.

### Value Estimate

These values should increase as the cumulative reward increases. They correspond
to how much future reward the agent predicts itself receiving at any given
point. They may also increase at the beginning as the agent is rewarded for
being random (see: Entropy and Entropy Coefficient), but should decline as
Entropy Coefficient decreases.

### Value Loss

These values will increase as the reward increases, and then should decrease
once reward becomes stable.

# Training with Curriculum Learning

Curriculum learning is a feature of ML-Agents which allows for the properties of
environments to be changed during the training process to aid in learning.

## An Instructional Example

*[**Note**: The example provided below is for instructional purposes, and was
based on an early version of the
[Wall Jump example environment](Learning-Environment-Examples.md). As such, it
is not possible to directly replicate the results here using that environment.]*

Imagine a task in which an agent needs to scale a wall to arrive at a goal. The
starting point when training an agent to accomplish this task will be a random
policy. That starting policy will have the agent running in circles, and will
likely never, or only very rarely, scale the wall properly to achieve the
reward. If we start with a simpler task, such as moving toward an unobstructed
goal, then the agent can easily learn to accomplish the task. From there, we can
slowly add to the difficulty of the task by increasing the size of the wall
until the agent can complete the initially near-impossible task of scaling the
wall.

![Wall](images/curriculum.png)

_Demonstration of a hypothetical curriculum training scenario in which a
progressively taller wall obstructs the path to the goal._

## How-To

Each group of Agents under the same `Behavior Name` in an environment can have a
corresponding curriculum. These curricula are held in what we call a
"metacurriculum". A metacurriculum allows different groups of Agents to follow
different curricula within the same environment.

### Specifying Curricula

In order to define the curricula, the first step is to decide which parameters
of the environment will vary. In the case of the Wall Jump environment, the
height of the wall is what varies. We define this as an `Environment Parameter`
that can be accessed in `Academy.Instance.EnvironmentParameters`, and by doing
so it becomes adjustable via the Python API. Rather than adjusting it by hand,
we will create a YAML file which describes the structure of the curricula.
Within it, we can specify at which points in the training process the wall
height will change, based either on the percentage of training steps that have
taken place or on the average reward the agent has received in the recent past.
Below is an example config for the curricula for the Wall Jump environment.

```yaml
BigWallJump:
  measure: progress
  thresholds: [0.1, 0.3, 0.5]
  min_lesson_length: 100
  signal_smoothing: true
  parameters:
    big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
    big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
  measure: progress
  thresholds: [0.1, 0.3, 0.5]
  min_lesson_length: 100
  signal_smoothing: true
  parameters:
    small_wall_height: [1.5, 2.0, 2.5, 4.0]
```

At the top level of the config is the behavior name. Note that this must be the
same as the Behavior Name in the
[Agent's Behavior Parameters](Learning-Environment-Design-Agents.md#agent-properties).
The curriculum for each behavior has the following parameters:

* `measure` - What to measure learning progress and advancement in lessons by.
  * `reward` - Uses a measure of received reward.
  * `progress` - Uses the ratio of steps/max_steps.
* `thresholds` (float array) - Points in the value of `measure` where the lesson
  should be increased.
* `min_lesson_length` (int) - The minimum number of episodes that should be
  completed before the lesson can change. If `measure` is set to `reward`, the
  average cumulative reward of the last `min_lesson_length` episodes will be
  used to determine if the lesson should change. Must be nonnegative.

  __Important__: the average reward that is compared to the thresholds is
  different than the mean reward that is logged to the console. For example, if
  `min_lesson_length` is `100`, the lesson will increment after the average
  cumulative reward of the last `100` episodes exceeds the current threshold.
  The mean reward logged to the console is dictated by the `summary_freq`
  parameter in the
  [trainer configuration file](Training-ML-Agents.md#training-config-file).
* `signal_smoothing` (true/false) - Whether to weight the current progress
  measure by previous values.
  * If `true`, weighting will be 0.75 (new) 0.25 (old).
* `parameters` (dictionary of key:string, value:float array) - Corresponds to
  the Environment Parameters to control. Length of each array should be one
  greater than the number of thresholds.

Once our curriculum is defined, we have to use the environment parameters we
defined and modify the environment from the Agent's `OnEpisodeBegin()` function.
See
[WallJumpAgent.cs](../Project/Assets/ML-Agents/Examples/WallJump/Scripts/WallJumpAgent.cs)
for an example.

### Training with a Curriculum

Once we have specified our metacurriculum and curricula, we can launch
`mlagents-learn` using the `--curriculum` flag to point to the config file for
our curricula and PPO will train using Curriculum Learning. For example, to
train agents in the Wall Jump environment with curriculum learning, you can run:

```sh
mlagents-learn config/trainer_config.yaml --curriculum=config/curricula/wall_jump.yaml --run-id=wall-jump-curriculum
```

You can then keep track of the current lessons and progress via TensorBoard.

__Note__: If you are resuming a training session that uses curriculum, please
pass the number of the last-reached lesson using the `--lesson` flag when
running `mlagents-learn`.