Your opinion matters a great deal to us. Only by hearing your thoughts on the Unity ML-Agents Toolkit can we continue to improve and grow. Please take a few minutes to let us know about it.
[Fill out the survey](https://goo.gl/forms/qFMYSYr5TlINvG6f1)
When contributing to the project, please make sure that your Pull Request (PR)
contains the following:
- Detailed description of the changes performed
- Corresponding changes to documentation, unit tests and sample environments (if
  applicable)
- Summary of the tests performed to validate your changes
- Issue numbers that the PR resolves (if any)
We are also actively open to adding community contributed environments as
examples, as long as they are small, simple, demonstrate a unique feature of the
platform, and provide a unique non-trivial challenge to modern machine learning
algorithms. Feel free to submit these environments with a PR explaining the
nature of the environment and task.
Several static checks are run on the codebase using the [pre-commit framework](https://pre-commit.com/) during CI. To execute the same checks locally, install `pre-commit` and run `pre-commit run --all-files`. Some hooks (for example, `black`) will output the corrected version of the code; others (like `mypy`) may require more effort to fix.
All python code should be formatted with [`black`](https://github.com/ambv/black). Style and formatting for C# may be enforced later.
We use [`mypy`](http://mypy-lang.org/) to perform static type checking on python code. Currently not all code is annotated but we will increase coverage over time. If you are adding or refactoring code, please
2. Make sure that code calling or called by the modified code also has type annotations.
The [type hint cheat sheet](https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html) provides a good introduction to adding type hints.
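As a quick illustration (a minimal sketch, not taken from the codebase), adding
annotations to a small helper function might look like this:

```python
from typing import Dict, List


def count_rewards(rewards: List[float], threshold: float = 0.0) -> Dict[str, int]:
    """Count how many rewards fall above and below a threshold."""
    above = sum(1 for r in rewards if r > threshold)
    return {"above": above, "below": len(rewards) - above}
```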
When you open a pull request, you will be asked to acknowledge our Contributor
License Agreement. We allow both individual contributions and contributions made
on behalf of companies. We use an open source tool called CLA assistant. If you
have any questions on our CLA, please
[submit an issue](https://github.com/Unity-Technologies/ml-agents/issues) or
email us at ml-agents@unity3d.com.
Sets the action for a specific Agent in an agent group. `agent_group` is the
name of the group the Agent belongs to and `agent_id` is the integer identifier
of the Agent. The action is a 1D numpy array: for the multi-discrete action
type, it has `dtype=np.int32` and a size equal to the number of discrete
actions; for the continuous action type, it has `dtype=np.float32` and a size
equal to the number of continuous actions.
**Note:** If no action is provided for an agent group between two calls to
`env.step()`, then the default action will be all zeros (in either discrete or
continuous action space).
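For illustration only, here is a minimal sketch of providing an action for a
single agent. The build name, the group name `"MyGroup"`, the agent id `0`, and
the action size of 3 are hypothetical, and the method name
`set_action_for_agent` is inferred from the description above:

```python
import numpy as np
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="MyEnvironmentBuild")  # hypothetical build
env.reset()

action = np.zeros((3,), dtype=np.float32)  # e.g. 3 continuous actions, all zero
env.set_action_for_agent("MyGroup", 0, action)
env.step()
```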
`DecisionSteps` (with `s`) contains information about a whole batch of Agents
while `DecisionStep` (no `s`) only contains information about a single Agent.
- `obs` is a list of numpy arrays containing the observations collected by the
  group of agents. The first dimension of each array corresponds to the batch
  size of the group (the number of agents requesting a decision since the last
  call to `env.step()`).
- `reward` is a float vector of length batch size. It corresponds to the rewards
  collected by each agent since the last simulation step.
- `agent_id` is an int vector of length batch size containing the unique
  identifier of the corresponding Agent. This is used to track Agents across
  simulation steps.
- `action_mask` is an optional list of two-dimensional arrays of booleans, only
  available for the multi-discrete action type. Each array corresponds to an
  action branch. The first dimension of each array is the batch size and the
  second contains a mask for each action of the branch. If true, the action is
  not available for the agent during this simulation step.
- `len(DecisionSteps)` Returns the number of agents requesting a decision since
the last call to `env.step()`.
- `DecisionSteps[agent_id]` Returns a `DecisionStep` for the Agent with the
`agent_id` unique identifier.
- `obs` is a list of numpy arrays containing the observations collected by the
  agent. (Each array has one less dimension than the arrays in `DecisionSteps`.)
- `reward` is a float. It corresponds to the rewards collected by the agent
  since the last simulation step.
- `agent_id` is an int and a unique identifier for the corresponding Agent.
- `action_mask` is an optional list of one-dimensional arrays of booleans, only
  available for the multi-discrete action type. Each array corresponds to an
  action branch and contains a mask for each action of the branch. If true, the
  action is not available for the agent during this simulation step.
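As a rough sketch of how these fields fit together, the helper below prints a
summary of a batch. It assumes `decision_steps` is a `DecisionSteps` instance
obtained from the environment; the function name is purely illustrative.

```python
import numpy as np


def summarize_decisions(decision_steps) -> None:
    """Print a short summary of a DecisionSteps batch (illustrative sketch)."""
    print(f"{len(decision_steps)} agent(s) requested a decision")
    for i, obs in enumerate(decision_steps.obs):
        # The first dimension of each observation array is the batch size.
        print(f"  observation {i} has shape {obs.shape}")
    if len(decision_steps) > 0:
        print(f"  mean reward since last step: {np.mean(decision_steps.reward)}")
        first_id = decision_steps.agent_id[0]
        single = decision_steps[first_id]  # a DecisionStep for one agent
        print(f"  agent {single.agent_id} received reward {single.reward}")
```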
Similarly to `DecisionSteps` and `DecisionStep`, `TerminalSteps` (with `s`)
contains information about a whole batch of Agents while `TerminalStep` (no `s`)
only contains information about a single Agent.
- `obs` is a list of numpy arrays containing the observations collected by the
  group of agents. The first dimension of each array corresponds to the batch
  size of the group (the number of agents whose episode ended since the last
  call to `env.step()`).
- `reward` is a float vector of length batch size. It corresponds to the rewards
  collected by each agent since the last simulation step.
- `agent_id` is an int vector of length batch size containing the unique
  identifier of the corresponding Agent. This is used to track Agents across
  simulation steps.
- `max_step` is an array of booleans of length batch size. It is true if the
  associated Agent reached its maximum number of steps during the last
  simulation step.
- `len(TerminalSteps)` Returns the number of agents whose episode ended since
  the last call to `env.step()`.
- `TerminalSteps[agent_id]` Returns a `TerminalStep` for the Agent with the
  `agent_id` unique identifier.
- `obs` is a list of numpy arrays containing the observations collected by the
  agent. (Each array has one less dimension than the arrays in `TerminalSteps`.)
- `reward` is a float. It corresponds to the rewards collected by the agent
  since the last simulation step.
- `agent_id` is an int and a unique identifier for the corresponding Agent.
- `max_step` is a bool. It is true if the Agent reached its maximum number of
  steps during the last simulation step.
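A similar sketch for terminal data, assuming `terminal_steps` is a
`TerminalSteps` instance obtained from the environment; the helper name is
illustrative only.

```python
def summarize_terminations(terminal_steps) -> None:
    """Print which episodes ended and why (illustrative sketch)."""
    print(f"{len(terminal_steps)} agent(s) ended an episode")
    for agent_id in terminal_steps.agent_id:
        step = terminal_steps[agent_id]  # a TerminalStep for one agent
        reason = "reached max_step" if step.max_step else "episode done"
        print(f"  agent {agent_id}: final reward {step.reward} ({reason})")
```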
An Agent behavior can have either discrete or continuous actions. To check which
type it is, use `spec.is_action_discrete()` or `spec.is_action_continuous()`. If
discrete, the action tensors are expected to be `np.int32`. If continuous, the
actions are expected to be `np.float32`.
- `observation_shapes` is a list of tuples of ints. Each tuple corresponds to an
  observation's dimensions (without the number-of-agents dimension). The shape
  tuples have the same ordering as the observations in `DecisionSteps`,
  `DecisionStep`, `TerminalSteps` and `TerminalStep`.
- `action_type` is the type of data of the action. It can be either discrete or
  continuous. If discrete, the action tensors are expected to be `np.int32`. If
  continuous, the actions are expected to be `np.float32`.
- `action_size` is an `int` corresponding to the expected dimension of the
  action array.
  - In the continuous action space it is the number of floats that constitute
    the action.
  - In the discrete action space (same as multi-discrete) it corresponds to the
    number of branches (the number of independent actions).
- `discrete_action_branches` is a tuple of ints, only present for the discrete
  action space. Each int corresponds to the number of different options for each
  branch of the action. For example, in a game with a direction input (no
  movement, left, right) and a jump input (no jump, jump), there will be two
  branches (direction and jump), the first with 3 options and the second with 2
  options (`action_size = 2` and `discrete_action_branches = (3, 2)`).
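As an illustrative sketch (assuming `spec` is the behavior spec described above
and `n_agents` is the current batch size), these fields are enough to build a
valid random action for a whole batch:

```python
import numpy as np


def random_action(spec, n_agents: int) -> np.ndarray:
    """Build a random action array matching the spec (illustrative sketch)."""
    if spec.is_action_continuous():
        # One float per continuous action, per agent.
        shape = (n_agents, spec.action_size)
        return np.random.uniform(-1.0, 1.0, size=shape).astype(np.float32)
    # Multi-discrete: one int per branch, per agent, bounded by the number of
    # options available in that branch.
    return np.column_stack(
        [
            np.random.randint(0, options, size=n_agents, dtype=np.int32)
            for options in spec.discrete_action_branches
        ]
    )
```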
In addition to the means of communicating between Unity and python described
above, we also provide methods for sharing agent-agnostic information. These
additional methods are referred to as side channels. ML-Agents includes two
ready-made side channels, described below. It is also possible to create custom
side channels to communicate any additional data between a Unity environment and
Python. Instructions for creating custom side channels can be found
[here](Custom-SideChannels.md).
Side channels exist as separate classes which are instantiated, and then passed
as a list to the `side_channels` argument of the constructor of the
`UnityEnvironment` class. For example:
```python
channel = MyChannel()  # any side channel instance, e.g. a custom channel
env = UnityEnvironment(side_channels=[channel])
```
**Note:** A side channel will only send/receive messages when `env.step()` or
`env.reset()` is called.
The `EngineConfiguration` side channel allows you to modify the time-scale, resolution, and graphics quality of the environment. This can be useful for adjusting the environment to perform better during training, or be more interpretable during inference.
- `set_configuration_parameters` which takes the following arguments:
  - `width`: Defines the width of the display. (Must be set alongside height.)
  - `height`: Defines the height of the display. (Must be set alongside width.)
  - `quality_level`: Defines the quality level of the simulation.
  - `time_scale`: Defines the multiplier for the deltatime in the simulation. If
    set to a higher value, time will pass faster in the simulation but the
    physics may perform unpredictably.
  - `target_frame_rate`: Instructs the simulation to try to render at a
    specified frame rate.
  - `capture_frame_rate`: Instructs the simulation to consider time between
    updates to always be constant, regardless of the actual frame rate.
- `set_configuration` with argument `config` which is an `EngineConfig`
  NamedTuple object.
For example, the following code would adjust the time-scale of the simulation to be 2x realtime.
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

channel = EngineConfigurationChannel()
env = UnityEnvironment(side_channels=[channel])
channel.set_configuration_parameters(time_scale=2.0)  # 2x realtime
```
#### EnvironmentParameters
The `EnvironmentParameters` side channel allows you to set pre-defined numerical values in the environment. This can be useful for adjusting environment-specific settings during training. You can call `set_float_parameter` on the side channel to write parameters, which can then be read on the C# side as shown below.
- `set_float_parameter` Sets a float parameter in the Unity Environment.
- key: The string identifier of the property.
- value: The float value of the property.
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.environment_parameters_channel import EnvironmentParametersChannel

channel = EnvironmentParametersChannel()
env = UnityEnvironment(side_channels=[channel])
channel.set_float_parameter("my_parameter", 2.0)  # "my_parameter" is an example key
...
```
Once a property has been modified in Python, you can access it in C# after the next call to `step` as follows:
```csharp
var envParameters = Academy.Instance.EnvironmentParameters;
// "my_parameter" matches the example key written from Python above.
float myParameter = envParameters.GetWithDefault("my_parameter", 0.0f);
```
#### Custom side channels
For information on how to make custom side channels for sending additional data types, see the documentation [here](Custom-SideChannels.md).
| `trainer` | The type of trainer to use: `ppo` or `sac` |
| `batch_size` | Number of experiences in each iteration of gradient descent. **This should always be multiple times smaller than `buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value should be smaller (in order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
| `buffer_size` | Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be multiple times larger than `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. In SAC, this is the maximum size of the experience buffer, and should be on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `learning_rate_schedule` | (Optional, default = `linear` for PPO and `constant` for SAC) Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` decays the learning_rate linearly, reaching 0 at max_steps, while `constant` keeps the learning rate constant for the entire training run. |
| `vis_encoder_type` | (Optional, default = `simple`) Encoder type for encoding visual observations. <br><br>`simple` (default) uses a simple encoder which consists of two convolutional layers, `nature_cnn` uses the CNN implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers, and `resnet` uses the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. |
| `init_path` | (Optional, default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `threaded` | (Optional, default = `true`) By default, model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. There is usually no reason to turn `threaded` off for SAC. |
## Trainer-specific Configurations
Enable these settings to ensure that your training run incorporates your provided reward signal:
| `extrinsic -> strength` | Factor by which to multiply the reward given by the environment. Typical ranges will vary depending on the reward signal. <br><br>Typical range: `1.00` |
| `extrinsic -> gamma` | Discount factor for future rewards coming from the environment. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller. Must be strictly smaller than 1. <br><br>Typical range: `0.8` - `0.995` |
| `curiosity -> strength` | Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.1` |
| `curiosity -> encoding_size` | (Optional, default = `64`) Size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
To enable GAIL (assuming you have recorded demonstrations), provide these settings:
| `gail -> strength` | Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br>Typical range: `0.01` - `1.0` |
| `reward_signals -> reward_signal_num_update` | (Optional, default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`. |