
Release 1 mm formatting (#3904)

* Formatting lines.

* Fix changelogs
/release_1_branch
GitHub · 4 years ago
Current commit: 0e4bd1c6
8 files changed, 324 insertions and 257 deletions
  1. SURVEY.md (6)
  2. com.unity.ml-agents/CHANGELOG.md (75)
  3. com.unity.ml-agents/CONTRIBUTING.md (76)
  4. docs/API-Reference.md (4)
  5. docs/Migrating.md (12)
  6. docs/Python-API.md (372)
  7. docs/Training-Configuration-File.md (30)
  8. docs/Using-Docker.md (6)

SURVEY.md (6)


# Unity ML-Agents Toolkit Survey
Your opinion matters a great deal to us. Only by hearing your thoughts on the Unity ML-Agents Toolkit can we continue to improve and grow. Please take a few minutes to let us know about it.
[Fill out the survey](https://goo.gl/forms/qFMYSYr5TlINvG6f1)

com.unity.ml-agents/CHANGELOG.md (75)


## [1.0.0-preview] - 2020-05-06
### Major Changes
- The `MLAgents` C# namespace was renamed to `Unity.MLAgents`, and other nested
namespaces were similarly renamed. (#3843)
- The offset logic was removed from DecisionRequester. (#3716)

- `AgentAction` and `AgentReset` have been removed. (#3770)
- The SideChannel API has changed (#3833, #3660):
  - Introduced the `SideChannelManager` to register, unregister and access side channels.
See the [Migration Guide](../docs/Migrating.md) for more details on upgrading.
- SideChannel IncomingMessages methods now take an optional default argument, which is used when trying to read more data than the message contains.
(and other python StatsWriters). To do this from your code, use `Academy.Instance.StatsRecorder.Add(key, value)`. (#3660)
- `CameraSensorComponent.m_Grayscale` and `RenderTextureSensorComponent.m_Grayscale` were changed from `public` to `private`. These can still be accessed via their corresponding properties. (#3808)
- `WriteAdapter` was renamed to `ObservationWriter`. If you have a custom `ISensor` implementation, you will need to change the signature of its `Write()` method. (#3834)
- Updated to Barracuda 0.7.0-preview, which has breaking namespace and assembly name changes. (#3875)
- The `--load` and `--train` command-line flags have been deprecated. Training now happens by default; use `--resume` to resume training instead. (#3705)
- The Jupyter notebooks have been removed from the repository. (#3704)

- The GhostTrainer has been extended to support asymmetric games and the
asymmetric example environment Strikers Vs. Goalie has been added. (#3653)
- The `UnityEnv` class from the `gym-unity` package was renamed `UnityToGymWrapper` and no longer creates the `UnityEnvironment`. Instead, the `UnityEnvironment` must be passed as input to the constructor of `UnityToGymWrapper`. (#3812)
random number generator in ModelRunner, and is incremented for each ModelRunner. (#3823)
- Added `Agent.GetObservations()`, which returns a read-only view of the observations added in `CollectObservations()`. (#3825)
- `UnityRLCapabilities` was added to help inform users when RL features are mismatched between C# and Python packages. (#3831)
- Renamed 'Generalization' feature to 'Environment Parameter Randomization'. (#3646)
- Timer files now contain a dictionary of metadata, including things like the
package version numbers. (#3758)
- The way that UnityEnvironment decides the port was changed. If no port is

- Running `mlagents-learn` with the same `--run-id` twice will no longer
overwrite the existing files. (#3705)
- Model updates can now happen asynchronously with environment steps for better performance. (#3690)
- `num_updates` and `train_interval` for SAC were replaced with `steps_per_update`. (#3690)
- The maximum compatible version of tensorflow was changed to allow tensorflow 2.1 and 2.2. This will allow use with python 3.8 using tensorflow 2.2.0rc3. (#3830)
- `mlagents-learn` will no longer set the width and height of the executable window to 84x84 when no width or height arguments are given. (#3867)
- Self-Play team changes will now trigger a full environment reset. This prevents trajectories in progress during a team change from getting into the buffer. (#3870)
## [0.15.1-preview] - 2020-03-30

com.unity.ml-agents/CONTRIBUTING.md (76)


## Communication
First, please read through our [code of conduct](https://github.com/Unity-Technologies/ml-agents/blob/master/CODE_OF_CONDUCT.md), as we expect all our contributors to follow it.
[Issues page](https://github.com/Unity-Technologies/ml-agents/issues) and briefly outlining the changes you plan to make. This will enable us to provide some context that may be helpful for you. This could range from advice and feedback on how best to make your changes to reasons for not doing them.
Lastly, if you're looking for input on what to contribute, feel free to reach out to us directly at ml-agents@unity3d.com and/or browse the GitHub issues with the `contributions welcome` label.
The master branch corresponds to the most recent version of the project. Note that this may be newer than the [latest release](https://github.com/Unity-Technologies/ml-agents/releases/tag/latest_release).
When contributing to the project, please make sure that your Pull Request (PR) contains the following:
- Detailed description of the changes performed
- Corresponding changes to documentation, unit tests and sample environments (if
- Summary of the tests performed to validate your changes
- Issue numbers that the PR resolves (if any)
examples, as long as they are small, simple, demonstrate a unique feature of the platform, and provide a unique non-trivial challenge to modern machine learning algorithms. Feel free to submit these environments with a PR explaining the nature of the environment and task.
Several static checks are run on the codebase using the [pre-commit framework](https://pre-commit.com/) during CI. To execute the same checks locally, install `pre-commit` and run `pre-commit run --all-files`. Some hooks (for example, `black`) will output the corrected version of the code; others (like `mypy`) may require more effort to fix.
All python code should be formatted with [`black`](https://github.com/ambv/black). Style and formatting for C# may be enforced later.
We use [`mypy`](http://mypy-lang.org/) to perform static type checking on python code. Currently not all code is annotated but we will increase coverage over time. If you are adding or refactoring code, please
2. Make sure that code calling or called by the modified code also has type annotations.
The [type hint cheat sheet](https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html) provides a good introduction to adding type hints.
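For instance, a small function annotated in the style the cheat sheet describes (illustrative only, not taken from the codebase):

```python
from typing import List


def mean_reward(rewards: List[float]) -> float:
    """Return the average of the collected rewards, or 0.0 if there are none."""
    return sum(rewards) / len(rewards) if rewards else 0.0
```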
When you open a pull request, you will be asked to acknowledge our Contributor License Agreement. We allow both individual contributions and contributions made on behalf of companies. We use an open source tool called CLA assistant. If you have any questions on our CLA, please [submit an issue](https://github.com/Unity-Technologies/ml-agents/issues) or email us at ml-agents@unity3d.com.

docs/API-Reference.md (4)


Our developer-facing C# classes have been documented to be compatible with
Doxygen for auto-generating HTML documentation.
To generate the API reference, download Doxygen and run the following command within the `docs/` directory:
```sh
doxygen dox-ml-agents.conf
```
docs/Migrating.md (12)


source of error where users would return arrays of the wrong size.
- The SideChannel API has changed (#3833, #3660):
  - Introduced the `SideChannelManager` to register, unregister and access side channels.
  - `EnvironmentParameters` replaces the default `FloatProperties`. You can access the `EnvironmentParameters` with `Academy.Instance.EnvironmentParameters` on C#. If you were previously creating a `UnityEnvironment` in python and passing it a `FloatPropertiesChannel`, create an `EnvironmentParametersChannel` instead.
  - `SideChannel.OnMessageReceived` is now a protected method (was public).
  - SideChannel IncomingMessages methods now take an optional default argument, which is used when trying to read more data than the message contains.

docs/Python-API.md (372)


imitation learning. This document describes how to use the `mlagents_envs` API.
For information on using `mlagents-learn`, see [here](Training-ML-Agents.md).
The Python Low Level API can be used to interact directly with your Unity learning environment. As such, it can serve as the basis for developing and evaluating new learning algorithms.
The ML-Agents Toolkit Low Level API is a Python API for controlling the simulation loop of an environment or game built with Unity. This API is used by the training algorithms inside the ML-Agents Toolkit, but you can also write your own Python programs using this API.
The key objects in the Python API include:

- **BehaviorSpec** — describes the shape of the observation data inside
DecisionSteps and TerminalSteps as well as the expected action shapes.
These classes are all defined in the [base_env](../ml-agents-envs/mlagents_envs/base_env.py) script.
An Agent "Behavior" is a group of Agents identified by a `BehaviorName` that share the same observations and action types (described in their `BehaviorSpec`). You can think about Agent Behavior as a group of agents that will share the same policy. All Agents with the same behavior have the same goal and reward signals.
To communicate with an Agent in a Unity environment from a Python program, the
Agent in the simulation must have `Behavior Parameters` set to communicate. You

## Loading a Unity Environment
Python-side communication happens through `UnityEnvironment` which is located in [`environment.py`](../ml-agents-envs/mlagents_envs/environment.py). To load a Unity environment from a built binary file, put the file in the same directory as `envs`. For example, if the filename of your Unity environment is `3DBall`, in python, run:
```python
from mlagents_envs.environment import UnityEnvironment
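
# The diff cuts this example off after the import. The call below is a sketch
# reconstructed from the constructor parameters described right after this
# block (file_name, worker_id, seed, side_channels); treat it as illustrative.
env = UnityEnvironment(file_name="3DBall", worker_id=0, seed=1, side_channels=[])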
```
- `worker_id` indicates which port to use for communication with the
environment. For use in parallel training regimes such as A3C.
- `seed` indicates the seed to use when generating random numbers during the training process. In environments which are deterministic, setting the seed enables reproducible experimentation by ensuring that the environment and trainers utilize the same random seed.
or properties. More on them in the [Modifying the environment from Python](Python-API.md#modifying-the-environment-from-python) section.
`file_name=None`, then press the **Play** button in the Editor when the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen.
### Interacting with a Unity Environment

- **Reset: `env.reset()`** Sends a signal to reset the environment. Returns None.
- **Step: `env.step()`** Sends a signal to step the environment. Returns None. Note that a "step" for Python does not correspond to either Unity `Update` or `FixedUpdate`. When `step()` or `reset()` is called, the Unity simulation will move forward until an Agent in the simulation needs input from Python to act.
- **Close: `env.close()`** Sends a shutdown signal to the environment and terminates the communication.
- **Get Behavior Names: `env.get_behavior_names()`** Returns a list of `BehaviorName`. Note that the number of groups can change over time in the simulation if new Agent behaviors are created in the simulation.
- **Get Behavior Spec: `env.get_behavior_spec(behavior_name: str)`** Returns the `BehaviorSpec` corresponding to the `behavior_name` given as input. A `BehaviorSpec` contains information such as the observation shapes, the action type (multi-discrete or continuous) and the action shape. Note that the `BehaviorSpec` for a specific group is fixed throughout the simulation.
- **Get Steps: `env.get_steps(behavior_name: str)`** Returns a tuple `DecisionSteps, TerminalSteps` corresponding to the `behavior_name` given as input. The `DecisionSteps` contains information about the state of the agents **that need an action this step** and have the behavior `behavior_name`. The `TerminalSteps` contains information about the state of the agents **whose episode ended** and have the behavior `behavior_name`. Both `DecisionSteps` and `TerminalSteps` contain information such as the observations, the rewards and the agent identifiers. `DecisionSteps` also contains action masks for the next action while `TerminalSteps` contains the reason for termination (whether the Agent reached its maximum step count and was interrupted). The data is in `np.array`, of which the first dimension is always the number of agents. Note that the number of agents is not guaranteed to remain constant during the simulation, and it is not unusual to have either `DecisionSteps` or `TerminalSteps` contain no Agents at all.
- **Set Actions: `env.set_actions(behavior_name: str, action: np.array)`** Sets the actions for a whole agent group. `action` is a 2D `np.array` of `dtype=np.int32` in the discrete action case and `dtype=np.float32` in the continuous action case. The first dimension of `action` is the number of agents that requested a decision since the last call to `env.step()`. The second dimension is the number of discrete actions in multi-discrete action type and the number of actions in continuous action type.
- **Set Action for Agent: `env.set_action_for_agent(agent_group: str, agent_id: int, action: np.array)`** Sets the action for a specific Agent in an agent group. `agent_group` is the name of the group the Agent belongs to and `agent_id` is the integer identifier of the Agent. `action` is a 1D array of type `dtype=np.int32` and size equal to the number of discrete actions in multi-discrete action type, and of type `dtype=np.float32` and size equal to the number of actions in continuous action type.

**Note:** If no action is provided for an agent group between two calls to `env.step()`, then the default action will be all zeros (in either discrete or continuous action space).
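Putting the calls above together, a minimal stepping loop might look like the sketch below. It assumes a single behavior with a continuous action space and a build named `3DBall` (both assumptions, not part of the original docs); the all-zero action matches the default described in the note above.

```python
import numpy as np
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="3DBall", side_channels=[])
env.reset()
behavior_name = env.get_behavior_names()[0]
spec = env.get_behavior_spec(behavior_name)

for _ in range(100):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    # One row per agent that requested a decision; all zeros is the same
    # default the runtime would use if no action were provided.
    action = np.zeros((len(decision_steps), spec.action_size), dtype=np.float32)
    env.set_actions(behavior_name, action)
    env.step()

env.close()
```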
`DecisionSteps` (with `s`) contains information about a whole batch of Agents while `DecisionStep` (no `s`) only contains information about a single Agent.
- `obs` is a list of numpy arrays of observations collected by the group of agents. The first dimension of each array corresponds to the batch size of the group (the number of agents requesting a decision since the last call to `env.step()`).
- `reward` is a float vector of length batch size. Corresponds to the rewards collected by each agent since the last simulation step.
- `agent_id` is an int vector of length batch size containing a unique identifier for the corresponding Agent. This is used to track Agents across simulation steps.
- `action_mask` is an optional list of two-dimensional arrays of booleans. Only available in multi-discrete action space type. Each array corresponds to an action branch. The first dimension of each array is the batch size and the second contains a mask for each action of the branch. If true, the action is not available for the agent during this simulation step.
- `len(DecisionSteps)` Returns the number of agents requesting a decision since
the last call to `env.step()`.
- `DecisionSteps[agent_id]` Returns a `DecisionStep` for the Agent with the
`agent_id` unique identifier.
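For example (a sketch, assuming `env` and `behavior_name` are set up as in the loop above):

```python
decision_steps, terminal_steps = env.get_steps(behavior_name)
print(len(decision_steps))           # number of agents that requested a decision
for agent_id in decision_steps.agent_id:
    step = decision_steps[agent_id]  # a single DecisionStep
    print(agent_id, step.reward, [o.shape for o in step.obs])
```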
- `obs` is a list of numpy arrays of observations collected by the agent. (Each array has one less dimension than the arrays in `DecisionSteps`.)
- `reward` is a float. Corresponds to the rewards collected by the agent since the last simulation step.
- `agent_id` is an int and a unique identifier for the corresponding Agent.
- `action_mask` is an optional list of one-dimensional arrays of booleans. Only available in multi-discrete action space type. Each array corresponds to an action branch. Each array contains a mask for each action of the branch. If true, the action is not available for the agent during this simulation step.
Similarly to `DecisionSteps` and `DecisionStep`, `TerminalSteps` (with `s`) contains information about a whole batch of Agents while `TerminalStep` (no `s`) only contains information about a single Agent.
- `obs` is a list of numpy arrays of observations collected by the group of agents. The first dimension of each array corresponds to the batch size of the group (the number of agents requesting a decision since the last call to `env.step()`).
- `reward` is a float vector of length batch size. Corresponds to the rewards collected by each agent since the last simulation step.
- `agent_id` is an int vector of length batch size containing a unique identifier for the corresponding Agent. This is used to track Agents across simulation steps.
- `max_step` is an array of booleans of length batch size. Is true if the associated Agent reached its maximum number of steps during the last simulation step.
- `len(TerminalSteps)` Returns the number of agents whose episode ended since the last call to `env.step()`.
- `TerminalSteps[agent_id]` Returns a `TerminalStep` for the Agent with the `agent_id` unique identifier.
- `obs` is a list of numpy arrays of observations collected by the agent. (Each array has one less dimension than the arrays in `TerminalSteps`.)
- `reward` is a float. Corresponds to the rewards collected by the agent since the last simulation step.
- `agent_id` is an int and a unique identifier for the corresponding Agent.
- `max_step` is a bool. Is true if the Agent reached its maximum number of steps during the last simulation step.
An Agent behavior can either have discrete or continuous actions. To check which type it is, use `spec.is_action_discrete()` or `spec.is_action_continuous()`. If discrete, the action tensors are expected to be `np.int32`. If continuous, the actions are expected to be `np.float32`.
- `observation_shapes` is a List of Tuples of int: each Tuple corresponds to an observation's dimensions (without the number of agents dimension). The shape tuples have the same ordering as the ordering of the DecisionSteps, DecisionStep, TerminalSteps and TerminalStep.
- `action_type` is the type of data of the action. It can be discrete or continuous. If discrete, the action tensors are expected to be `np.int32`. If continuous, the actions are expected to be `np.float32`.
- `action_size` is an `int` corresponding to the expected dimension of the action array.
  - In continuous action space it is the number of floats that constitute the action.
  - In discrete action space (same as multi-discrete) it corresponds to the number of branches (the number of independent actions).
- `discrete_action_branches` is a Tuple of int, only for discrete action space. Each int corresponds to the number of different options for each branch of the action. For example: in a game with a direction input (no movement, left, right) and a jump input (no jump, jump), there will be two branches (direction and jump), the first one with 3 options and the second with 2 options (`action_size = 2` and `discrete_action_branches = (3,2,)`). A sketch of how these fields map onto action arrays follows this list.
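The helper below is a sketch (not part of the original docs) of how the fields above can be used to build a valid action array for either action type; the `spec` argument is assumed to be the result of `env.get_behavior_spec(...)`.

```python
import numpy as np


def random_actions(spec, num_agents):
    """Build a random action array matching the BehaviorSpec fields above."""
    if spec.is_action_discrete():
        # One column per branch; each entry indexes into that branch's options.
        columns = [
            np.random.randint(0, branch, size=num_agents)
            for branch in spec.discrete_action_branches
        ]
        return np.column_stack(columns).astype(np.int32)
    # Continuous case: one float per action dimension.
    return np.random.uniform(-1.0, 1.0, size=(num_agents, spec.action_size)).astype(np.float32)
```

For the direction-and-jump example, `random_actions(spec, 4)` would return a `(4, 2)` array of `np.int32`, with the first column in `{0, 1, 2}` and the second in `{0, 1}`.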
In addition to the means of communicating between Unity and python described above, we also provide methods for sharing agent-agnostic information. These additional methods are referred to as side channels. ML-Agents includes two ready-made side channels, described below. It is also possible to create custom side channels to communicate any additional data between a Unity environment and Python. Instructions for creating custom side channels can be found [here](Custom-SideChannels.md).
Side channels exist as separate classes which are instantiated, and then passed as a list to the `side_channels` argument of the constructor of the `UnityEnvironment` class.
```python
channel = MyChannel()
env = UnityEnvironment(side_channels=[channel])  # pass the channel(s) via the side_channels list
```
**Note:** A side channel will only send/receive messages when `env.step()` or `env.reset()` is called.
The `EngineConfiguration` side channel allows you to modify the time-scale, resolution, and graphics quality of the environment. This can be useful for adjusting the environment to perform better during training, or be more interpretable during inference.
- `set_configuration_parameters` which takes the following arguments:
  - `width`: Defines the width of the display. (Must be set alongside height.)
  - `height`: Defines the height of the display. (Must be set alongside width.)
  - `quality_level`: Defines the quality level of the simulation.
  - `time_scale`: Defines the multiplier for the delta time in the simulation. If set to a higher value, time will pass faster in the simulation but the physics may perform unpredictably.
  - `target_frame_rate`: Instructs the simulation to try to render at a specified frame rate.
  - `capture_frame_rate`: Instructs the simulation to consider time between updates to always be constant, regardless of the actual frame rate.
- `set_configuration` with argument `config` which is an `EngineConfig` NamedTuple object.
For example, the following code would adjust the time-scale of the simulation to be 2x realtime.
```python
from mlagents_envs.environment import UnityEnvironment

```
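The diff truncates the snippet after the import; a fuller sketch, assuming `EngineConfigurationChannel` is importable from `mlagents_envs.side_channel.engine_configuration_channel` (check the module path against your installed version), might look like this:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Create the side channel, register it with the environment, then request 2x time scale.
channel = EngineConfigurationChannel()
env = UnityEnvironment(side_channels=[channel])
channel.set_configuration_parameters(time_scale=2.0)
env.reset()
```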
#### EnvironmentParameters
The `EnvironmentParameters` will allow you to get and set pre-defined numerical values in the environment. This can be useful for adjusting environment-specific settings, or for reading non-agent related information from the environment. You can call `get_property` and `set_property` on the side channel to read and write properties.
- `set_float_parameter` Sets a float parameter in the Unity Environment.
  - `key`: The string identifier of the property.
  - `value`: The float value of the property.
```python
from mlagents_envs.environment import UnityEnvironment

...
```
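The snippet above is cut off by the diff; a fuller sketch, assuming `EnvironmentParametersChannel` is importable from `mlagents_envs.side_channel.environment_parameters_channel` and that `"my_parameter"` is just a placeholder key, might be:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.environment_parameters_channel import EnvironmentParametersChannel

# Register the channel, then set a float parameter before resetting the environment.
channel = EnvironmentParametersChannel()
env = UnityEnvironment(side_channels=[channel])
channel.set_float_parameter("my_parameter", 1.0)
env.reset()
```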
Once a property has been modified in Python, you can access it in C# after the next call to `step` as follows:
```csharp
var envParameters = Academy.Instance.EnvironmentParameters;
var parameterValue = envParameters.GetWithDefault("my_parameter", 0.0f);  // "my_parameter" is a placeholder key set from Python
```
#### Custom side channels
For information on how to make custom side channels for sending additional data types, see the documentation [here](Custom-SideChannels.md).

docs/Training-Configuration-File.md (30)


| **Setting** | **Description** |
| :----------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `trainer` | The type of trainer to use: `ppo` or `sac` |
| `batch_size` | Number of experiences in each iteration of gradient descent. **This should always be multiple times smaller than `buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value should be smaller (in order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
| `buffer_size` | Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be multiple times larger than `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. In SAC, this is the max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `learning_rate_schedule` | (Optional, default = `linear` for PPO and `constant` for SAC) Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` decays the learning_rate linearly, reaching 0 at max_steps, while `constant` keeps the learning rate constant for the entire training run. |
| `vis_encoder_type` | (Optional, default = `simple`) Encoder type for encoding visual observations. <br><br> `simple` (default) uses a simple encoder which consists of two convolutional layers, `nature_cnn` uses the CNN implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers, and `resnet` uses the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. |
| `init_path` | (Optional, default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `threaded` | (Optional, default = `true`) By default, model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. There is usually no reason to turn `threaded` off for SAC. |
## Trainer-specific Configurations

Enable these settings to ensure that your training run incorporates your
environment-based reward signal:
| **Setting** | **Description** |
| :---------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `extrinsic -> strength` | Factor by which to multiply the reward given by the environment. Typical ranges will vary depending on the reward signal. <br><br>Typical range: `1.00` |
| `extrinsic -> gamma` | Discount factor for future rewards coming from the environment. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be large. In cases when rewards are more immediate, it can be smaller. Must be strictly smaller than 1. <br><br>Typical range: `0.8` - `0.995` |

| **Setting** | **Description** |
| :--------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `curiosity -> strength` | Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.1` |
| `curiosity -> gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
| `curiosity -> encoding_size` | (Optional, default = `64`) Size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |

To enable GAIL (assuming you have recorded demonstrations), provide these
settings:
| **Setting** | **Description** |
| :---------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `gail -> strength` | Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br>Typical range: `0.01` - `1.0` |
| `gail -> gamma` | Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.9` |
| `gail -> demo_path` | The path to your .demo file or directory of .demo files. |

All of the reward signals configurations described above apply to both PPO and
SAC. There is one configuration for all reward signals that only applies to SAC.
| **Setting** | **Description** |
| :------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `reward_signals -> reward_signal_num_update` | (Optional, default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`. |
## Behavioral Cloning

docs/Using-Docker.md (6)


- `<image-name>` references the image name used when building the container.
- `<environment-name>` **(Optional)**: If you are training with a Linux executable, this is the name of the executable. If you are training in the Editor, do not pass a `<environment-name>` argument and press the **Play** button in Unity when the message _"Start training by pressing the Play button in the Unity Editor"_ is displayed on the screen.
- `source`: Reference to the path in your host OS where you will store the Unity
executable.
- `target`: Tells Docker to mount the `source` path as a disk with this name.
