Browse code

fix trailing whitespace in markdown (#2786)

/develop-gpu-test
GitHub 5 years ago
Current commit
d009511a
48 files changed, 483 insertions and 480 deletions
  1. .pre-commit-config.yaml (3)
  2. SURVEY.md (4)
  3. UnitySDK/Assets/ML-Agents/Plugins/Barracuda.Core/Barracuda.md (2)
  4. UnitySDK/Assets/ML-Agents/Plugins/Barracuda.Core/LICENSE.md (2)
  5. UnitySDK/Assets/ML-Agents/Plugins/Barracuda.Core/ReleaseNotes.md (10)
  6. UnitySDK/README.md (4)
  7. docs/Basic-Guide.md (34)
  8. docs/Creating-Custom-Protobuf-Messages.md (16)
  9. docs/FAQ.md (2)
  10. docs/Getting-Started-with-Balance-Ball.md (8)
  11. docs/Glossary.md (2)
  12. docs/Installation.md (16)
  13. docs/Learning-Environment-Best-Practices.md (2)
  14. docs/Learning-Environment-Create-New.md (112)
  15. docs/Learning-Environment-Design-Agents.md (80)
  16. docs/Learning-Environment-Design.md (16)
  17. docs/Learning-Environment-Examples.md (8)
  18. docs/ML-Agents-Overview.md (10)
  19. docs/Migrating.md (2)
  20. docs/Readme.md (2)
  21. docs/Training-Behavioral-Cloning.md (34)
  22. docs/Training-Curriculum-Learning.md (4)
  23. docs/Training-Generalized-Reinforcement-Learning-Agents.md (70)
  24. docs/Training-ML-Agents.md (20)
  25. docs/Training-Using-Concurrent-Unity-Instances.md (2)
  26. docs/Training-on-Amazon-Web-Service.md (2)
  27. docs/Training-on-Microsoft-Azure-Custom-Instance.md (2)
  28. docs/Training-on-Microsoft-Azure.md (2)
  29. docs/Unity-Inference-Engine.md (22)
  30. docs/Using-Tensorboard.md (8)
  31. docs/Using-Virtual-Environment.md (30)
  32. docs/localized/KR/README.md (8)
  33. docs/localized/KR/docs/Installation-Windows.md (40)
  34. docs/localized/KR/docs/Installation.md (12)
  35. docs/localized/KR/docs/Training-Imitation-Learning.md (34)
  36. docs/localized/KR/docs/Training-PPO.md (48)
  37. docs/localized/KR/docs/Using-Docker.md (10)
  38. docs/localized/zh-CN/README.md (4)
  39. docs/localized/zh-CN/docs/Getting-Started-with-Balance-Ball.md (66)
  40. docs/localized/zh-CN/docs/Installation.md (8)
  41. docs/localized/zh-CN/docs/Learning-Environment-Create-New.md (24)
  42. docs/localized/zh-CN/docs/Learning-Environment-Design.md (8)
  43. docs/localized/zh-CN/docs/Learning-Environment-Examples.md (2)
  44. docs/localized/zh-CN/docs/ML-Agents-Overview.md (60)
  45. docs/localized/zh-CN/docs/Readme.md (4)
  46. gym-unity/README.md (92)
  47. ml-agents-envs/README.md (6)
  48. protobuf-definitions/README.md (6)

3
.pre-commit-config.yaml


.*_pb2_grpc.py
)$
additional_dependencies: [flake8-comprehensions]
- id: trailing-whitespace
name: trailing-whitespace-markdown
types: [markdown]
- repo: https://github.com/pre-commit/pygrep-hooks
rev: v1.4.1 # Use the ref you want to point at

4
SURVEY.md


# Unity ML-Agents Toolkit Survey
Your opinion matters a great deal to us. Only by hearing your thoughts on the Unity ML-Agents Toolkit can we continue to improve and grow. Please take a few minutes to let us know about it.
[Fill out the survey](https://goo.gl/forms/qFMYSYr5TlINvG6f1)

2
UnitySDK/Assets/ML-Agents/Plugins/Barracuda.Core/Barracuda.md


Tanh
```
P.S. some of these operations are under limited support and not all configurations are properly supported
P.P.S. Python 3.5 or 3.6 is recommended

2
UnitySDK/Assets/ML-Agents/Plugins/Barracuda.Core/LICENSE.md


Barracuda cross-platform Neural Net engine copyright © 2018 Unity Technologies ApS
Licensed under the Unity Companion License for Unity-dependent projects--see [Unity Companion License](http://www.unity3d.com/legal/licenses/Unity_Companion_License).
Unless expressly provided otherwise, the Software under this license is made available strictly on an “AS IS” BASIS WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. Please review the license for details on these and other terms and conditions.

10
UnitySDK/Assets/ML-Agents/Plugins/Barracuda.Core/ReleaseNotes.md


- TF importer: made detection of actual output node from LSTM/GRU pattern more bullet proof by skipping Const nodes.
- TF importer: improved InstanceNormalization handling.
- TF importer: fixed SquareDifference pattern.
- TF importer: fixed Conv2DBackpropInput (transpose convolution) import.
- Fixed Conv2D performance regression on some GPUs.
- Fixed TextureAsTensorData.Download() to work properly with InterpretDepthAs.Channels.
- Fixed bug when identity/nop layers would reuse input as an output and later causing premature release of that tensor as part of intermediate data cleanup.

## 0.2.0
- Version bumped to 0.2.0 as it brings breaking API changes, for details look below.
- Significantly reduced temporary memory allocations by introducing internal allocator support. Now memory is re-used between layer execution as much as possible.
- Improved small workload performance on CSharp backend
- Added parallel implementation for multiple activation functions on CSharp backend

- Added `Summary()` method to `Worker`. Currently returns allocator information.
- Tabs to spaces! Aiming at higher salary (https://stackoverflow.blog/2017/06/15/developers-use-spaces-make-money-use-tabs/).
- Renamed worker type enum members: `CSharp` -> `CSharpRef`, `CSharpFast` -> `CSharp`, `Compute` -> `ComputeRef`, `ComputeFast` -> `Compute`.
- Implemented new optimized `ComputePrecompiled` worker. This worker caches Compute kernels and state beforehand to reduce CPU overhead.
- Added `ExecuteAsync()` to the `IWorker` interface; it returns an `IEnumerator`, which enables you to control how many layers to schedule per frame (one iteration == one layer).
- Added `Log` op support on Compute workers.
- Optimized activation functions and ScaleBias by accessing tensor as continuous array. Gained ~2.0ms on 4 batch MobileNet (MBP2016).

- Fixed compilation issues on Xbox One.
- TexConv2D support was temporary disabled.
- Barracuda logging can now be configured via static fields of the ``Barracuda.D`` class; you can disable specific logging levels or disable stack trace collection (which helps with performance when profiling).
- The Compute Concat implementation will now fall back to the C# implementation instead of throwing an exception when an unsupported configuration is encountered.
- Fixed several ``ComputeBuffer`` release issues.
- Added a constructor for ``Tensor`` that allows passing in a data array.
- Improved Flatten handling in TensorFlow models.
- Added helper func ``ModelLoader.LoadFromStreamingAssets``.

4
UnitySDK/README.md


# Unity ML-Agents SDK
Contains the ML-Agents Unity Project, including both the core plugin (in `Scripts`)
and a set of example environments (in `Examples`).

34
docs/Basic-Guide.md


## Setting up the ML-Agents Toolkit within Unity
In order to use the ML-Agents toolkit within Unity, you first need to change a few
Unity settings.
1. Launch Unity
2. On the Projects dialog, choose the **Open** option at the top of the window.

## Running a Pre-trained Model
We include pre-trained models for our agents (`.nn` files) and we use the
[Unity Inference Engine](Unity-Inference-Engine.md) to run these models
inside Unity. In this section, we will use the pre-trained model for the
2. In the **Project** window, go to the `Assets/ML-Agents/Examples/3DBall/Prefabs` folder.
3. In the **Project** window, drag the **3DBallLearning** Model located in
4. You should notice that each `Agent` under each `3DBall` in the **Hierarchy** window now contains **3DBallLearning** as `Model`. __Note__: You can modify multiple game objects in a scene by selecting them all at
once using the search bar in the Scene Hierarchy.
8. Select the **InferenceDevice** to use for this model (CPU or GPU) on the Agent.
_Note: CPU is faster for the majority of ML-Agents toolkit generated models_
9. Click the **Play** button and you will see the platforms balance the balls
using the pre-trained model.

### Setting up the environment for training
In order to set up the Agents for Training, you will need to edit the
same `Behavior Parameters`. You can make sure all your agents have the same
The `Behavior Name` corresponds to the name of the model that will be
generated by the training process and is used to select the hyperparameters
from the training configuration file.

16
docs/Creating-Custom-Protobuf-Messages.md


# Creating Custom Protobuf Messages
Unity and Python communicate by sending protobuf messages to and from each other. You can create custom protobuf messages if you want to exchange structured data beyond what is included by default.
## Implementing a Custom Message

By default, the Python API sends actions to Unity in the form of a floating point list and an optional string-valued text action for each agent.
You can define a custom action type, to either replace or augment the default, by adding fields to the `CustomAction` message, which you can do by editing the file `protobuf-definitions/proto/mlagents/envs/communicator_objects/custom_action.proto`.
Instances of custom actions are set via the `custom_action` parameter of the `env.step`. An agent receives a custom action by defining a method with the signature:

Below is an example of creating a custom action that instructs an agent to choose a cardinal direction to walk in and how far to walk.
The `custom_action.proto` file looks like:

EAST=2;
WEST=3;
}
float walkAmount = 1;
Direction direction = 2;
}
```
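As a rough sketch of the Python side (not shown in this diff): the generated message class can be constructed and passed to `env.step`. The import path and `CustomAction` class name below follow the generated protobuf code of this release and may differ in your version, and the build name is hypothetical.

```python
from mlagents.envs import UnityEnvironment
# Generated from custom_action.proto; the exact module path is an assumption.
from mlagents.envs.communicator_objects import CustomAction

env = UnityEnvironment(file_name="RollerBall")  # hypothetical build
env.reset()

# Ask the agents to walk two units to the north, using the fields
# defined in the message above (walkAmount and direction).
action = CustomAction(walkAmount=2.0, direction=CustomAction.NORTH)
env.step(custom_action=action)

env.close()
```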

### Custom Reset Parameters
By default, you can configure an environment `env` in the Python API by specifying a `config` parameter that is a dictionary mapping strings to floats.
You can also configure the environment reset using a custom protobuf message. To do this, add fields to the `CustomResetParameters` protobuf message in `custom_reset_parameters.proto`, analogously to `CustomAction` above. Then pass an instance of the message to `env.reset` via the `custom_reset_parameters` keyword parameter.
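As with custom actions, a hedged sketch of the Python side: construct the generated `CustomResetParameters` message and hand it to `env.reset`. The `difficulty` field is purely illustrative and stands in for whatever fields you add to the message; the module path and keyword name follow the API of this release and should be verified against your install.

```python
from mlagents.envs import UnityEnvironment
# Generated from custom_reset_parameters.proto; module path is an assumption.
from mlagents.envs.communicator_objects import CustomResetParameters

env = UnityEnvironment(file_name="RollerBall")  # hypothetical build

# `difficulty` stands in for whatever fields you added to the message.
params = CustomResetParameters(difficulty=2)
env.reset(custom_reset_parameters=params)
```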

### Custom Observations
By default, Unity returns observations to Python in the form of a floating-point vector.
You can define a custom observation message to supplement that. To do so, add fields to the `CustomObservation` protobuf message in `custom_observation.proto`.
Then in your agent, create an instance of a custom observation via `new CommunicatorObjects.CustomObservation`. Then in `CollectObservations`, call `SetCustomObservation` with the custom observation instance as the parameter.

var obs = new CustomObservation();
obs.CustomField = 1.0f;
SetCustomObservation(obs);
}
}
}
```
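On the Python side, the custom observation comes back alongside the other per-agent data. The sketch below assumes the `BrainInfo` objects returned by `env.step()` expose a `custom_observations` list and that the generated Python field is `customField`; both names are stated from memory, so check them against your `ml-agents-envs` version.

```python
from mlagents.envs import UnityEnvironment

env = UnityEnvironment(file_name="RollerBall")  # hypothetical build
env.reset()

brain_name = env.external_brain_names[0]
brain_info = env.step()[brain_name]

# Each entry is one agent's CustomObservation message; `customField`
# mirrors the CustomField value set in the C# snippet above.
for custom_obs in brain_info.custom_observations:
    print(custom_obs.customField)

env.close()
```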

2
docs/FAQ.md


There may be a number of possible causes:
* _Cause_: There may be no agent in the scene
* _Cause_: On OSX, the firewall may be preventing communication with the
environment. _Solution_: Add the built environment binary to the list of
exceptions on the firewall by following

8
docs/Getting-Started-with-Balance-Ball.md


"Agent" GameObjects. The base Agent object has a few properties that affect its
behavior:
* **Behavior Parameters** — Every Agent must have a Behavior. The Behavior
determines how an Agent makes decisions. More on Behavior Parameters in
the next section.
* **Visual Observations** — Defines any Camera objects used by the Agent to

training generalizes to more than a specific starting position and agent cube
attitude.
* agent.CollectObservations() — Called every simulation step. Responsible for
collecting the Agent's observations of the environment. Since the Behavior
Parameters of the Agent are set with vector observation
space with a state size of 8, the `CollectObservations()` must call
`AddVectorObs` such that vector size adds up to 8.

negative reward for dropping the ball. An Agent is also marked as done when it
drops the ball so that it will reset with a new ball for the next simulation
step.
* agent.Heuristic() - When the `Use Heuristic` checkbox is checked in the Behavior
keyboard inputs into actions.
#### Behavior Parameters : Vector Observation Space

2
docs/Glossary.md


logic should not be placed here.
* **External Coordinator** - ML-Agents class responsible for communication with
outside processes (in this case, the Python API).
* **Trainer** - Python class which is responsible for training a given
group of Agents.

16
docs/Installation.md


## Environment Setup
We now support a single mechanism for installing ML-Agents on Mac/Windows/Linux using Virtual
Environments. For more information on Virtual Environments and installation instructions,
follow this [guide](Using-Virtual-Environment.md).
### Clone the ML-Agents Toolkit Repository

It also contains many [example environments](Learning-Environment-Examples.md)
to help you get started.
The `ml-agents` subdirectory contains a Python package which provides deep reinforcement
the `ml-agents` package depends on.
In order to use ML-Agents toolkit, you need Python 3.6.1 or higher.
[Download](https://www.python.org/downloads/) and install the latest version of Python if you do not already have it.
If your Python environment doesn't include `pip3`, see these

pip3 install mlagents
```
Note that this will install `ml-agents` from PyPi, _not_ from the cloned repo.
parameters you can use with `mlagents-learn`.
By installing the `mlagents` package, the dependencies listed in the [setup.py file](../ml-agents/setup.py) are also installed.
Some of the primary dependencies include:

### Installing for Development
If you intend to make modifications to `ml-agents` or `ml-agents-envs`, you should install
the packages from the cloned repo rather than from PyPi. To do this, you will need to install
`ml-agents` and `ml-agents-envs` separately. From the repo's root directory, run:

Running pip with the `-e` flag will let you make changes to the Python files directly and have those
reflected when you run `mlagents-learn`. It is important to install these packages in this order as the
`mlagents` package depends on `mlagents_envs`, and installing it in the other
order will download `mlagents_envs` from PyPi.
## Next Steps

2
docs/Learning-Environment-Best-Practices.md


lessons which progressively increase in difficulty are presented to the agent
([learn more here](Training-Curriculum-Learning.md)).
* When possible, it is often helpful to ensure that you can complete the task by
using a heuristic to control the agent. To do so, check the `Use Heuristic`
checkbox on the Agent and implement the `Heuristic()` method on the Agent.
* It is often helpful to make many copies of the agent, and give them the same
`Behavior Name`. In this way the learning process can get more feedback

112
docs/Learning-Environment-Create-New.md


importing the ML-Agents assets into it:
1. Launch the Unity Editor and create a new project named "RollerBall".
2. Make sure that the Scripting Runtime Version for the project is set to use
**.NET 4.x Equivalent** (This is an experimental option in Unity 2017,
4. Drag the `ML-Agents` folder from `UnitySDK/Assets` to the Unity
Editor Project window.
Your Unity **Project** window should contain the following assets:

1. In the Unity Project window, double-click the `RollerAcademy` script to open
it in your code editor. (By default new scripts are placed directly in the
**Assets** folder.)
2. In the code editor, add the statement, `using MLAgents;`.
3. Change the base class from `MonoBehaviour` to `Academy`.
4. Delete the `Start()` and `Update()` methods that were added by default.

The default settings for the Academy properties are also fine for this
environment, so we don't need to change anything for the RollerAcademy component
in the Inspector window.
![The Academy properties](images/mlagents-NewTutAcademy.png)

1. In the Unity Project window, double-click the `RollerAgent` script to open it
in your code editor.
2. In the editor, add the `using MLAgents;` statement and then change the base
class from `MonoBehaviour` to `Agent`.
3. Delete the `Update()` method, but we will use the `Start()` function, so
leave it alone for now.

this reference, add a public field of type `Transform` to the RollerAgent class.
Public fields of a component in Unity get displayed in the Inspector window,
allowing you to choose which GameObject to use as the target in the Unity
Editor.
To reset the Agent's velocity (and later to apply force to move the
agent) we need a reference to the Rigidbody component. A

In our case, the information our Agent collects includes:
* Position of the target.
* Position of the Agent itself.
```csharp
AddVectorObs(this.transform.position);

### Rewards
Reinforcement learning requires rewards. Assign rewards in the `AgentAction()`
function. The learning algorithm uses the rewards assigned to the Agent during
assigned task. In this case, the Agent is given a reward of 1.0 for reaching the
reward of 1.0 and marks the agent as finished by calling the `Done()` method
on the Agent.
```csharp

## Testing the Environment
It is always a good idea to test your environment manually before embarking on
an extended training run. To do so, you will need to implement the `Heuristic()`
method on the RollerAgent class. This will allow you to control the Agent using
direct keyboard control.
The `Heuristic()` method will look like this:

## Training the Environment
The process is
the same as described in [Training ML-Agents](Training-ML-Agents.md). Note that the
pass to the `mlagents-learn` program. Using the default settings specified
RollerAgent takes about 300,000 steps to train. However, you can change the
Since this example creates a very simple training environment with only a few inputs
and outputs, using small batch and buffer sizes speeds up the training considerably.
However, if you add more complexity to the environment or change the reward or
observation functions, you might also find that training performs better with different
**Note:** In addition to setting these hyperparameter values, the Agent
in this simple environment, speeds up training.
To train in the editor, run the following Python command from a Terminal or Console
(where `config.yaml` is a copy of `trainer_config.yaml` that you have edited
**Note:** If you get a `command not found` error when running this command, make sure
that you have followed the *Install Python and mlagents Package* section of the
To monitor the statistics of Agent performance during training, use
[TensorBoard](Using-Tensorboard.md).
In particular, the *cumulative_reward* and *value_estimate* statistics show how
well the Agent is achieving the task. In this example, the maximum reward an
**Note:** If you use TensorBoard, always increment or change the `run-id`
you pass to the `mlagents-learn` command for each training run. If you use
the same id value, the statistics for multiple runs are combined and become
In many of the [example environments](Learning-Environment-Examples.md), many copies of
parallelize your RollerBall environment.
1. Right-click on your Project Hierarchy and create a new empty GameObject.
Name it TrainingArea.
2. Reset the TrainingArea’s Transform so that it is at (0,0,0) with Rotation (0,0,0)
and Scale (1,1,1).
3. Drag the Floor, Target, and RollerAgent GameObjects in the Hierarchy into the
TrainingArea GameObject.
4. Drag the TrainingArea GameObject, along with its attached GameObjects, into your
5. You can now instantiate copies of the TrainingArea prefab. Drag them into your scene,
positioning them so that they do not overlap.
### Editing the Scripts
You will notice that in the previous section, we wrote our scripts assuming that our
TrainingArea was at (0,0,0), performing checks such as `this.transform.position.y < 0`
to determine whether our agent has fallen off the platform. We will need to change
this if we are to use multiple TrainingAreas throughout the scene.
A quick way to adapt our current code is to use
localPosition rather than position, so that our position reference is relative
to the prefab TrainingArea's location, and not global coordinates.
This is only one way to achieve this objective. Refer to the
[example environments](Learning-Environment-Examples.md) for other ways we can achieve relative positioning.
## Review: Scene Layout

There are two kinds of game objects you need to include in your scene in order
to use Unity ML-Agents: an Academy and one or more Agents.
Keep in mind:

80
docs/Learning-Environment-Design-Agents.md


The Policy class abstracts out the decision making logic from the Agent itself so
that you can use the same Policy in multiple Agents. How a Policy makes its
decisions depends on the kind of Policy it is. You can change the Policy of an
Agent by changing its `Behavior Parameters`. If you check `Use Heuristic`, the
Agent will use its `Heuristic()` method to make decisions which can allow you to
## Decisions
The observation-decision-action-reward cycle repeats after a configurable number

agent in a robotic simulator that must provide fine-control of joint torques
should make its decisions every step of the simulation. On the other hand, an
agent that only needs to make decisions when certain game or simulation events
occur, should use on-demand decision making.
To control the frequency of step-based decision making, set the **Decision
Frequency** value for the Agent object in the Unity Inspector window. Agents

When you turn on **On Demand Decisions** for an Agent, your agent code must call
the `Agent.RequestDecision()` function. This function call starts one iteration
of the observation-decision-action-reward cycle. The Agent's
`CollectObservations()` method is called, the Policy makes a decision and
returns it by calling the
`AgentAction()` method. The Policy waits for the Agent to request the next
decision before starting another iteration.

When you use vector observations for an Agent, implement the
`Agent.CollectObservations()` method to create the feature vector. When you use
**Visual Observations**, you only need to identify which Unity Camera objects
or RenderTextures will provide images and the base Agent class handles the rest.
You do not need to implement the `CollectObservations()` method when your Agent
uses visual observations (unless it also uses vector observations).
### Vector Observation Space: Feature Vectors

### Multiple Visual Observations
Visual observations use rendered textures directly or from one or more
cameras in a scene. The Policy vectorizes the textures into a 3D Tensor which
can be fed into a convolutional neural network (CNN). For more information on
CNNs, see [this guide](http://cs231n.github.io/convolutional-networks/). You
Agents using visual observations can capture state of arbitrary complexity and
are useful when the state is difficult to describe numerically. However, they
are also typically less efficient and slower to train, and sometimes don't
Visual observations can be derived from Cameras or RenderTextures within your scene.
To add a visual observation to an Agent, either click on the `Add Camera` or
`Add RenderTexture` button in the Agent inspector. Then drag the camera or
render texture you want to add to the `Camera` or `RenderTexture` field.
You can have more than one camera or render texture and even use a combination
of both attached to an Agent.
![Agent Camera](images/visual-observation.png)

specify the number of Resolutions the Agent is using for its visual observations.
For each visual observation, set the width and height of the image (in pixels)
and whether or not the observation is color or grayscale (when `Black And White`
is checked).
three **Visual Observations** have to be added to the **Behavior Parameters**.
During runtime, if a combination of `Cameras` and `RenderTextures` is used, all
order they appear in the editor.
RenderTexture observations will throw an `Exception` if the width/height doesn't
When using `RenderTexture` visual observations, a handy feature for debugging is
adding a `Canvas`, then adding a `Raw Image` with its texture set to the Agent's
`RenderTexture`. This will render the agent observation on the game screen.
The [GridWorld environment](Learning-Environment-Examples.md#gridworld)
is an example on how to use a RenderTexture for both debugging and observation. Note
that in this example, a Camera is rendered to a RenderTexture, which is then used for
observations and debugging. To update the RenderTexture, the Camera must be asked to
render every time a decision is requested within the game code. When using Cameras
as observations directly, this is done automatically by the Agent.
![Agent RenderTexture Debug](images/gridworld.png)

is an array of indices. The number of indices in the array is determined by the
number of branches defined in the `Branches Size` property. Each branch
corresponds to an action table, you can specify the size of each table by
modifying the `Branches` property.
Neither the Policy nor the training algorithm know anything about what the action
values themselves mean. The training algorithm simply tries different values for

with values ranging from zero to one.
Note that when you are programming actions for an agent, it is often helpful to
test your action logic using the `Heuristic()` method of the Agent,
which lets you map keyboard
commands to actions.

Perhaps the best advice is to start simple and only add complexity as needed. In
general, you should reward results rather than actions you think will lead to
the desired results. To help develop your rewards, you can use the Monitor class
to display the cumulative reward received by an Agent. You can even use the
Agent's Heuristic to control the Agent while watching how it accumulates rewards.
Allocate rewards to an Agent by calling the `AddReward()` method in the

platform.
Note that all of these environments make use of the `Done()` method, which manually
terminates an episode when a termination condition is reached. This can be
called independently of the `Max Step` property.
## Agent Properties

* `Branches` (Discrete) - An array of integers, defines multiple concurrent
discrete actions. The values in the `Branches` array correspond to the
number of possible discrete values for each action branch.
* `Model` - The neural network model used for inference (obtained after
* `Visual Observations` - A list of `Cameras` or `RenderTextures` which will
be used to generate observations.
* `Max Step` - The per-agent maximum number of steps. Once this number is
reached, the Agent will be reset if `Reset On Done` is checked.

16
docs/Learning-Environment-Design.md


To train and use the ML-Agents toolkit in a Unity scene, the scene must contain
a single Academy subclass and as many Agent subclasses
as you need.
Agent instances should be attached to the GameObject representing that Agent.
### Academy

The Agent class represents an actor in the scene that collects observations and
carries out actions. The Agent class is typically attached to the GameObject in
the scene that otherwise represents the actor — for example, to a player object
in a football game or a car object in a vehicle simulation. Every Agent must
have appropriate `Behavior Parameters`.
To create an Agent, extend the Agent class and implement the essential

You must also determine how an Agent finishes its task or times out. You can
manually set an Agent to done in your `AgentAction()` function when the Agent
has finished (or irrevocably failed) its task by calling the `Done()` function.
You can also set the Agent's `Max Steps` property to a positive value and the
Agent will consider itself done after it has taken that many steps. If you
set an Agent's `ResetOnDone` property to true, then the Agent can attempt its
task several times in one episode. (Use the `Agent.AgentReset()` function to
prepare the Agent to start again.)
See [Agents](Learning-Environment-Design-Agents.md) for detailed information

properties that can be set differently for a training scene versus a regular
scene. The Academy's **Configuration** properties control rendering and time
scale. You can set the **Training Configuration** to minimize the time Unity
spends rendering graphics in order to speed up training.
When you create a training environment in Unity, you must set up the scene so
that it can be controlled by the external training process. Considerations
include:

8
docs/Learning-Environment-Examples.md


* Default: 1
* Recommended Minimum: 0.2
* Recommended Maximum: 5
* gravity: Magnitude of gravity
* Default: 9.81
* Recommended Minimum: 4
* Recommended Maximum: 105

* Reset Parameters: Three
* angle: Angle of the racket from the vertical (Y) axis.
* Default: 55
* Recommended Minimum: 35
* Recommended Maximum: 65
* gravity: Magnitude of gravity
* Default: 9.81

* Set-up: A platforming environment where the agent can jump over a wall.
* Goal: The agent must use the block to scale the wall and reach the goal.
* Agents: The environment contains one agent linked to two different
Models. The Policy the agent is linked to changes depending on the
height of the wall. The change of Policy is done in the WallJumpAgent class.
* Agent Reward Function:
* -0.0005 for every step.

10
docs/ML-Agents-Overview.md


an Agent, and each Agent has a Policy. The Policy receives observations
and rewards from the Agent and returns actions. The Academy ensures that all the
Agents are in sync in addition to controlling environment-wide
settings.
## Training Modes

learn the best policy for each medic. Once training concludes, the learned
policy for each medic can be exported. Given that all our implementations are
based on TensorFlow, the learned policy is just a TensorFlow model file. Then
during the inference phase, we use the
TensorFlow model generated from the training phase. Now during the inference
phase, the medics still continue to generate their observations, but instead of
being sent to the Python API, they will be fed into their (internal, embedded)

In the previous mode, the Agents were used for training to generate
a TensorFlow model that the Agents can later use. However,
any user of the ML-Agents toolkit can leverage their own algorithms for
training. In this case, the behaviors of all the Agents in the scene
will be controlled within Python.
You can even turn your environment into a [gym.](../gym-unity/README.md)
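As a hedged illustration of that gym wrapper, the loop below drives a built environment entirely from Python. The `UnityEnv` name and constructor arguments match the `gym-unity` package of this era but may differ in yours, and the build path is hypothetical.

```python
from gym_unity.envs import UnityEnv

# Wrap a built Unity environment so it exposes the standard gym interface.
env = UnityEnv("3DBall", worker_id=0, use_visual=False)  # hypothetical build path

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()           # stand-in for your own algorithm
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()

env.close()
```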

to perform, rather than attempting to have it learn via trial-and-error methods.
For example, instead of training the medic by setting up its reward function,
this mode allows providing real examples from a game controller on how the medic
should behave. More specifically, in this mode, the Agent must use its heuristic
to generate action, and all the actions performed with the controller (in addition
to the agent observations) will be recorded. The
imitation learning algorithm will then use these pairs of observations and

signals with same or different `Behavior Parameters`. In this
scenario, agents must compete with one another to either win a competition, or
obtain some limited set of resources. All team sports fall into this scenario.
- Ecosystem. Multiple interacting agents with independent reward signals with
same or different `Behavior Parameters`. This scenario can be thought
of as creating a small world in which animals with different goals all
interact, such as a savanna in which there might be zebras, elephants and

2
docs/Migrating.md


### Important Changes
* The definition of the gRPC service has changed.
* The online BC training feature has been removed.
* The BroadcastHub has been deprecated. If there is a training Python process, all LearningBrains in the scene will automatically be trained. If there is no Python process, inference will be used.
* The Brain ScriptableObjects have been deprecated. The Brain Parameters are now on the Agent and are referred to as Behavior Parameters. Make sure the Behavior Parameters is attached to the Agent GameObject.
* Several changes were made to the setup for visual observations (i.e. using Cameras or RenderTextures):

2
docs/Readme.md


* [Training Generalized Reinforcement Learning Agents](Training-Generalized-Reinforcement-Learning-Agents.md)
### Cloud Training (Deprecated)
Here are the cloud training set-up guides for Azure and AWS. We no longer use them ourselves and
so they may not work correctly. We've decided to keep them up just in case they are helpful to
you.

34
docs/Training-Behavioral-Cloning.md


# Training with Behavioral Cloning
There are a variety of possible imitation learning algorithms which can
be used; the simplest of these is Behavioral Cloning. It works by collecting
demonstrations from a teacher, and then simply uses them to directly learn a
policy, in the same way that supervised learning for image classification
With offline behavioral cloning, we can use demonstrations (`.demo` files)
1. Choose an agent you would like to have learn to imitate a set of demonstrations.
2. Record a set of demonstrations using the `Demonstration Recorder` (see [here](Training-Imitation-Learning.md)).
For illustrative purposes we will refer to this file as `AgentRecording.demo`.
3. Build the scene (make sure the Agent is not using its heuristic).
4. Open the `config/offline_bc_config.yaml` file.
5. Modify the `demo_path` parameter in the file to reference the path to the
demonstration file recorded in step 2. In our case this is:
6. Launch `mlagents-learn`, providing `./config/offline_bc_config.yaml`
as the config parameter, and include the `--run-id` and `--train` as usual.
Provide your environment as the `--env` parameter if it has been compiled
This will use the demonstration file to train a neural network driven agent
to directly imitate the actions provided in the demonstration. The environment
will launch and be used for evaluating the agent's performance during training.

4
docs/Training-Curriculum-Learning.md


## How-To
Each group of Agents under the same `Behavior Name` in an environment can have
a corresponding curriculum. These
curriculums are held in what we call a metacurriculum. A metacurriculum allows
different groups of Agents to follow different curriculums within the same environment.

We will save this file into our metacurriculum folder with the name of its
corresponding `Behavior Name`. For example, in the Wall Jump environment, there are two
different `Behavior Names` set via script in `WallJumpAgent.cs`
---BigWallBrainLearning and SmallWallBrainLearning. If we want to define a curriculum for
the BigWallBrainLearning, we will save `BigWallBrainLearning.json` into
`config/curricula/wall-jump/`.

70
docs/Training-Generalized-Reinforcement-Learning-Agents.md


agents are unable to generalize to any tweaks or variations in the environment.
This is analogous to a model being trained and tested on an identical dataset
in supervised learning. This becomes problematic in cases where environments
are randomly instantiated with varying objects or properties.
To make agents robust and generalizable to different environments, the agent
should be trained over multiple variations of the environment. Using this approach

## How to Enable Generalization Using Reset Parameters
We first need to provide a way to modify the environment by supplying a set of `Reset Parameters`
and vary them over time. This provision can be done either deterministically or randomly.
This is done by assigning each `Reset Parameter` a `sampler-type` (such as a uniform sampler),
`Reset Parameter`, the parameter maintains the default value throughout the
training procedure, remaining unchanged. The samplers for all the `Reset Parameters`
are handled by a **Sampler Manager**, which also handles the generation of new
values for the reset parameters when needed.
To set up the Sampler Manager, we create a YAML file that specifies how we wish to
generate new samples for each `Reset Parameter`. In this file, we specify the samplers and the
`resampling-interval` (the number of simulation steps after which reset parameters are
resampled). Below is an example of a sampler file for the 3D ball environment.
```yaml
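# The body of this example is elided in the diff; the lines below are an
# illustrative reconstruction, not copied from the repository. Parameter
# names follow the Reset Parameters discussed below (mass, gravity, scale);
# the ranges and the min_value/max_value sub-argument names are assumptions.
resampling-interval: 5000

mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10

gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]

scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3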

Below is the explanation of the fields in the above example.
* `resampling-interval` - Specifies the number of steps for the agent to
train under a particular environment configuration before resetting the
* `Reset Parameter` - Name of the `Reset Parameter` like `mass`, `gravity` and `scale`. This should match the name
specified in the academy of the intended environment for which the agent is
being trained. If a parameter specified in the file doesn't exist in the
* `sampler-type` - Specify the sampler type to use for the `Reset Parameter`.
This is a string that should exist in the `Sampler Factory` (explained
* `sampler-type-sub-arguments` - Specify the sub-arguments depending on the `sampler-type`.
In the example above, this would correspond to the `intervals`
under the `sampler-type` `"multirange_uniform"` for the `Reset Parameter` called `gravity`.
The key name should match the name of the corresponding argument in the sampler definition.
(See below)
The Sampler Manager allocates a sampler type for each `Reset Parameter` by using the *Sampler Factory*,

Below is a list of included `sampler-type` as part of the toolkit.
* `uniform` - Uniform sampler
* Uniformly samples a single float value between defined endpoints.
The sub-arguments for this sampler to specify the interval
endpoints are as below. The sampling is done in the range of
* `gaussian` - Gaussian sampler
the mean and standard deviation. The sub-arguments to specify the
* Uniformly samples a single float value between the specified intervals.
Samples by first performing a weight pick of an interval from the list
of intervals (weighted based on interval width) and samples uniformly
from the selected interval (half-closed interval, same as the uniform
sampler). This sampler can take an arbitrary number of intervals in a
list in the following format:
* **sub-arguments** - `intervals`
The implementation of the samplers can be found at `ml-agents-envs/mlagents/envs/sampler_class.py`.

If you want to define your own sampler type, you must first inherit the *Sampler*
base class (included in the `sampler_class` file) and preserve the interface.
Once the class for the required method is specified, it must be registered in the Sampler Factory.
This can be done by subscribing to the *register_sampler* method of the SamplerFactory. The command
is as follows:
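The registration call itself is elided from this diff. The sketch below shows the general shape, assuming a `SamplerFactory.register_sampler(name, sampler_class)` signature and a `sample_parameter()` method on the `Sampler` base class; check `ml-agents-envs/mlagents/envs/sampler_class.py` for the exact interface.

```python
import numpy as np

# Sampler and SamplerFactory live in mlagents/envs/sampler_class.py, as noted
# above; the method and argument names used here are assumptions.
from mlagents.envs.sampler_class import Sampler, SamplerFactory


class TriangularSampler(Sampler):
    """Example custom sampler drawing from a triangular distribution."""

    def __init__(self, left, mode, right, **kwargs):
        self.left = left
        self.mode = mode
        self.right = right

    def sample_parameter(self):
        # Assumed to be the single method the Sampler interface requires.
        return np.random.triangular(self.left, self.mode, self.right)


# Register under the string key you will use as `sampler-type` in the YAML file.
SamplerFactory.register_sampler("triangular", TriangularSampler)
```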

sampling setup, we would run
```sh
mlagents-learn config/trainer_config.yaml --sampler=config/3dball_generalize.yaml
--run-id=3D-Ball-generalization --train
```

20
docs/Training-ML-Agents.md


And then opening the URL: [localhost:6006](http://localhost:6006).
**Note:** The default port TensorBoard uses is 6006. If there is an existing session
running on port 6006 a new session can be launched on an open port using the --port
option.
When training is finished, you can find the saved model in the `models` folder

Default is set to 1. Set to higher values when benchmarking performance and
multiple training sessions is desired. Training sessions are independent, and
do not improve learning performance.
* `--num-envs=<n>`: Specifies the number of concurrent Unity environment instances to
collect experiences from when training. Defaults to 1.
* `--run-id=<path>`: Specifies an identifier for each training run. This
identifier is used to name the subdirectories in which the trained model and

All arguments after this flag will be passed to the executable. For example, setting
`mlagents-learn config/trainer_config.yaml --env-args --num-orcs 42` would result in
` --num-orcs 42` passed to the executable.
* `--base-port`: Specifies the starting port. Each concurrent Unity environment instance
will get assigned a port sequentially, starting from the `base-port`. Each instance
will use the port `(base_port + worker_id)`, where the `worker_id` is sequential IDs
given to each instance from 0 to `num_envs - 1`. Default is 5005. __Note:__ When
training using the Editor rather than an executable, the base port will be ignored.
* `--slow`: Specify this option to run the Unity environment at normal, game

The training config files `config/trainer_config.yaml`, `config/sac_trainer_config.yaml`,
`config/gail_config.yaml` and `config/offline_bc_config.yaml` specify the training method,
the hyperparameters, and a few additional values to use when training with Proximal Policy
Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative Adversarial Imitation Learning)
with PPO, and online and offline Behavioral Cloning (BC)/Imitation. These files are divided
The **default** section defines the default values for all the available settings. You can
override sections after the appropriate `Behavior Name`. Sections for the
example environments are included in the provided config file.
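To make the override behavior concrete, here is a minimal, illustrative sketch of how a section named after a Behavior Name shadows the `default` values once the YAML file is loaded (the `3DBall` key is just an example section name):

```python
# Illustrative sketch: a Behavior Name section overrides the defaults.
import yaml

with open("config/trainer_config.yaml") as f:
    config = yaml.safe_load(f)

defaults = config["default"]
overrides = config.get("3DBall", {})   # section named after the Behavior Name
resolved = {**defaults, **overrides}   # per-behavior values take precedence
print(resolved["trainer"], resolved["batch_size"])
```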
| **Setting** | **Description** | **Applies To Trainer\*** |

2
docs/Training-Using-Concurrent-Unity-Instances.md


# Training Using Concurrent Unity Instances
As part of release v0.8, we enabled developers to run concurrent, parallel instances of the Unity executable during training. For certain scenarios, this should speed up the training.
## How to Run Concurrent Unity Instances During Training

2
docs/Training-on-Amazon-Web-Service.md


# Training on Amazon Web Service
Note: We no longer use this guide ourselves and so it may not work correctly. We've
decided to keep it up just in case it is helpful to you.
This page contains instructions for setting up an EC2 instance on Amazon Web

2
docs/Training-on-Microsoft-Azure-Custom-Instance.md


6. Navigate to [http://developer.nvidia.com](http://developer.nvidia.com) and
create an account and verify it.
7. Download (to your own computer) cuDNN from [this url](https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v6/prod/8.0_20170307/Ubuntu16_04_x64/libcudnn6_6.0.20-1+cuda8.0_amd64-deb).
8. Copy the deb package to your VM:

2
docs/Training-on-Microsoft-Azure.md


# Training on Microsoft Azure (works with ML-Agents toolkit v0.3)
Note: We no longer use this guide ourselves and so it may not work correctly. We've
decided to keep it up just in case it is helpful to you.
This page contains instructions for setting up training on Microsoft Azure

22
docs/Unity-Inference-Engine.md


The ML-Agents toolkit allows you to use pre-trained neural network models
inside your Unity games. This support is possible thanks to the Unity Inference
Engine. The Unity Inference Engine uses
[compute shaders](https://docs.unity3d.com/Manual/class-ComputeShader.html)
to run the neural network within Unity.
Scripting Backends: The Unity Inference Engine is generally faster with
In the Editor, it is not possible to use the Unity Inference Engine with
GPU device selected when Editor Graphics Emulation is set to __OpenGL(ES)
3.0 or 2.0 emulation__. Also there might be non-fatal build time errors
when the target platform includes a Graphics API that does not support
__Unity Compute Shaders__.
The Unity Inference Engine should work on any Unity-supported platform,
but we have only tested it on the following platforms:

## Using the Unity Inference Engine
When using a model, drag the `.nn` file into the **Model** field
in the Inspector of the Agent.
You should use the GPU only if you use the
ResNet visual encoder or have a large number of agents with visual observations.

8
docs/Using-Tensorboard.md


4. Open a browser window and navigate to [localhost:6006](http://localhost:6006).
**Note:** The default port TensorBoard uses is 6006. If there is an existing session
running on port 6006, a new session can be launched on an open port using the `--port`
option.
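If you prefer to stay in Python, TensorBoard can also be started on an alternative port through its programmatic API; a minimal sketch (the log directory and port are illustrative):

```python
# Sketch: start TensorBoard on a non-default port, equivalent to
# `tensorboard --logdir=summaries --port=6007`.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "summaries", "--port", "6007"])
url = tb.launch()
print(f"TensorBoard is running at {url}")
```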
**Note:** If you don't assign a `run-id` identifier, `mlagents-learn` uses the

* `Environment/Cumulative Reward` - The mean cumulative episode reward over all agents. Should
increase during a successful training session.
* `Environment/Episode Length` - The mean length of each episode in the environment for all agents.
### Policy Statistics

* `Policy/Learning Rate` (PPO; BC) - How large a step the training algorithm takes as it searches
for the optimal policy. Should decrease over time.
* `Policy/Value Estimate` (PPO) - The mean value estimate for all states visited by the agent. Should increase during a successful training session.
* `Policy/Curiosity Reward` (PPO+Curiosity) - This corresponds to the mean cumulative intrinsic reward generated per-episode.

* `Losses/Inverse Loss` (PPO+Curiosity) - The mean magnitude of the inverse model
loss function. Corresponds to how well the model is able to predict the action
taken between two observations.
* `Losses/Cloning Loss` (BC) - The mean magnitude of the behavioral cloning loss. Corresponds to how well the model imitates the demonstration data.

30
docs/Using-Virtual-Environment.md


# Using Virtual Environment
## What is a Virtual Environment?
A Virtual Environment is a self-contained directory tree that contains a Python installation
for a particular version of Python, plus a number of additional packages. To learn more about
A Virtual Environment keeps all dependencies for the Python project separate from dependencies
1. It enables using and testing of different library versions by quickly
different version.
Requirement - Python 3.6 must be installed on the machine you would like
to run ML-Agents on (either local laptop/desktop or remote server). Python 3.6 can be
installed from [here](https://www.python.org/downloads/).
## Installing Pip (Required)

1. Check pip version using `pip3 -V`
Note (for Ubuntu users): If the `ModuleNotFoundError: No module named 'distutils.util'` error is encountered, then
python3-distutils needs to be installed. Install python3-distutils using `sudo apt-get install python3-distutils`
1. To create a new environment named `sample-env` execute `$ python3 -m venv ~/python-envs/sample-env`
1. Verify pip version is the same as in the __Installing Pip__ section. In case it is not the latest, upgrade to
the latest pip version using `pip3 install --upgrade pip`
## Ubuntu Setup
1. Install the python3-venv package using `$ sudo apt-get install python3-venv`
1. Follow the steps in the Mac OS X installation.

1. Create a folder where the virtual environments will reside `$ md python-envs`
1. To create a new environment named `sample-env` execute `$ python3 -m venv python-envs\sample-env`
1. Verify pip version is the same as in the __Installing Pip__ section. In case it is not the latest, upgrade to
the latest pip version using `pip3 install --upgrade pip`
1. Install ML-Agents package using `$ pip3 install mlagents`
1. To deactivate the environment execute `$ deactivate`
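The steps above can also be reproduced from Python itself using the built-in `venv` module; a minimal sketch (the path is illustrative):

```python
# Sketch: create a virtual environment programmatically and print how to activate it.
import venv
from pathlib import Path

env_dir = Path.home() / "python-envs" / "sample-env"
venv.EnvBuilder(with_pip=True).create(env_dir)   # same effect as `python3 -m venv ...`
print(f"Activate with: source {env_dir}/bin/activate")   # on Windows: Scripts\activate
```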

8
docs/localized/KR/README.md


[![license badge](https://img.shields.io/badge/license-Apache--2.0-green.svg)](LICENSE)
**Unity Machine Learning Agents Toolkit** (ML-Agents) is an open-source Unity plugin that enables games and
simulations to serve as environments for training intelligent agents. With an easy-to-use
They can be used for many purposes, including controlling NPC behavior (in settings such as multi-agent and adversarial), automated testing of game builds, and evaluating game design decisions pre-release.
The ML-Agents toolkit benefits both game developers and AI researchers by providing a central platform where AI agents can be developed and evaluated in Unity's rich environments, enabling broader research and game development.
## Features

## Community and Feedback
The ML-Agents toolkit is an open-source project and contributions are welcome. If you would like to contribute,
can we continue to improve and grow. Please take a few minutes to [let us know about it](https://github.com/Unity-Technologies/ml-agents/issues/1454).
For any other comments or feedback, please contact the ML-Agents team directly at ml-agents@unity3d.com.

40
docs/localized/KR/docs/Installation-Windows.md


# Installing the ML-Agents Toolkit for Windows Users
The ML-Agents toolkit supports Windows 10. While it may run on other versions of Windows, it
has not been tested on them. Furthermore, the ML-Agents toolkit has not been tested on a Windows VM (such as Bootcamp or similar
This guide also covers how to set up GPU-based training (for advanced users).
GPU-based training is not required for the ML-Agents toolkit, but it may be needed for future versions or specific features.
## Step 1: Install Python via Anaconda

Since Python 2 is no longer supported, Python 3.5 or 3.6 is required. In this guide we will
download Anaconda 5.1 with Python 3.6.
([64-bit](https://repo.continuum.io/archive/Anaconda3-5.1.0-Windows-x86_64.exe)
or [32-bit](https://repo.continuum.io/archive/Anaconda3-5.1.0-Windows-x86.exe)

you will see a "conda is not recognized as internal or external command" error.
To fix this, you need to set the environment variables correctly.
Type `environment variables` in the search bar (you can get there by pressing the Windows key or the Windows button in the bottom left)
to bring up the __Edit the system environment variables__ option.
<p align="center">

## Step 3: Install Required Python Packages
The ML-Agents toolkit depends on a number of Python packages. Use `pip` to install these Python dependencies.
If the ML-Agents Toolkit repository has not yet been cloned to your computer, do so now. You can [download](https://git-scm.com/download/win) Git and,
after installing it, run the following command in an Anaconda prompt. _(If the prompt window is already open, type `activate ml-agents` to

The `ml-agents` subdirectory contains the Python package with the deep reinforcement learning trainers used with Unity environments.
The `ml-agents-envs` subdirectory contains the Python API for interfacing with Unity, which the `ml-agents` package depends on.
The `gym-unity` subdirectory contains the package for interfacing with OpenAI Gym.

`--no-cache-dir` disables the cache in pip.
### Installing for Development
To do this, you must install `ml-agents` and `ml-agents-envs` separately.
The repository root is located at `C:\Downloads`. After cloning or downloading the files,
run the following from the repository's main directory:
```console

pip install -e .
```
Running pip with the `-e` flag lets you modify the Python files directly and have those changes reflected when you run `mlagents-learn`.
It is important to install these packages in this order.
## (Optional) Step 4: GPU Training using the ML-Agents Toolkit
A GPU is not required for the ML-Agents toolkit and it will not noticeably speed up PPO training (although a GPU may become useful in the future).
This guide is for advanced users who want to train using a GPU. You should also check that your GPU is CUDA-compatible.

### Install the Nvidia CUDA toolkit
[Download](https://developer.nvidia.com/cuda-toolkit-archive) and install CUDA toolkit 9.0 from the Nvidia archive.
For the ML-Agents toolkit, the CUDA toolkit provides GPU-accelerated libraries,
debugging and optimization tools, a C/C++ (Visual Studio 2017) compiler, and runtime libraries.
In this guide, we use version [9.0.176](https://developer.nvidia.com/compute/cuda/9.0/Prod/network_installers/cuda_9.0.176_win10_network-exe).

### Install the Nvidia cuDNN library
[Download](https://developer.nvidia.com/cudnn) and install the cuDNN library from Nvidia.
cuDNN is a GPU-accelerated library of primitives for deep neural networks. Before downloading, you will need to register for the Nvidia Developer Program (free).
<p align="center">

After registering, return to the cuDNN [download page](https://developer.nvidia.com/cudnn).
You may be asked to fill out a short survey. When you get to the list
of cuDNN releases, __make sure you download the version that matches the CUDA toolkit you installed in Step 1.__ In this guide,
After downloading the cuDNN files, you need to extract (unzip) them into the CUDA toolkit directory.
Inside the cuDNN zip file there are three folders: `bin`, `include`, and `lib`.
<p align="center">

</p>
Copy these three folders into the CUDA toolkit directory.
The CUDA toolkit directory is located at `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0`.
<p align="center">

You need to add one environment variable and two path variables.
To set the environment variable, type `environment variables` in the search bar (you can get there by pressing the Windows key or the Windows button in the bottom left)
to bring up the __Edit the system environment variables__ option.
<p align="center">

12
docs/localized/KR/docs/Installation.md


</p>
## Windows Users
To set up your environment on Windows, we have written a [detailed guide](Installation-Windows.md) describing the setup steps.
For Mac and Linux, check the guide below.
## Mac or Unix Users

pip3 install mlagents
```
This command installs `ml-agents` from PyPi (not from the cloned repository).
If you run the command, you will see the Unity logo and the command-line parameters you can use with `mlagents-learn`.
- If you are using Anaconda and have trouble with TensorFlow, check the following
[link](https://www.tensorflow.org/install/pip) for how to install TensorFlow in an Anaconda environment.
### Installing for Development

Running pip with the `-e` flag lets you change the Python files directly and have those changes reflected when you run `mlagents-learn`.
It is important to install these packages in this order because the `mlagents` package depends on `mlagents_envs`, and
installing them in a different order could install `mlagents_envs` from PyPi instead.
## Docker-based Installation

The [Basic Guide](Basic-Guide.md) page contains several short tutorials on setting up the ML-Agents toolkit within Unity, running a pre-trained model,
building environments, and training.
## Help

34
docs/localized/KR/docs/Training-Imitation-Learning.md


It is possible to record an agent's play from the Unity Editor and save it as an asset. These demonstrations contain the observations, actions, and rewards recorded during play. They can be managed as data and used for offline training such as Behavioral Cloning (see below).
To record an agent's play data, add a `Demonstration Recorder` component to the GameObject in the scene that holds the `Agent` component. Once it is added, play data can be recorded from that agent.
<p align="center">
<img src="images/demo_component.png"

When `Record` is checked, data is generated while the scene runs. Depending on the difficulty of the environment, anywhere from a few minutes to a few hours of play data may need to be collected for imitation learning. Once enough data has been recorded, stop the game in Unity. A `.demo` file is then created inside the `Assets/Demonstrations` folder; it stores the agent's play data. Clicking the file shows information about the demonstration in the Inspector, as shown below.
<p align="center">
<img src="images/demo_inspector.png"

## Training with Behavioral Cloning
There are a variety of algorithms for imitation learning, and the simplest of them is Behavioral Cloning. It trains a policy to directly imitate data collected from expert play, much like supervised learning for image classification or other classical machine learning techniques.
In offline Behavioral Cloning, we use the `demo` file generated with the `Demonstration Recorder` as the dataset for training the agent's behavior.
2. Record expert play with the `Demonstration Recorder` (see above).
For the rest of this explanation, we will call the recorded file `AgentRecording.demo`.
4. Open the `config/offline_bc_config.yaml` file.
This approach uses the demo file to train a neural network so that the agent directly imitates the expert's behavior. The environment will still be run during training and used to evaluate the agent's performance.
### Online Training

1. First create two Brains, one that will be the "Teacher" and one that will be the "Student". In this example, the two Brain assets are named "Teacher" and "Student" respectively.
2. The "Teacher" Brain must be a **Player Brain**.
4. The parameters of the "Teacher" and "Student" Brains must be set to match the agent's settings.
7. In the `config/online_bc_config.yaml` file, add an entry for the "Student" Brain. Set the `trainer` parameter to `online_bc` and the `brain_to_imitate` parameter to the name of the teacher agent's Brain, "Teacher". Additionally, set `batches_per_epoch`, which controls how much training is performed each moment. Increase `max_steps` if you want to train the agent for longer.
`--train --slow`, run the training process, and when the message _"Start training by pressing the Play button in the Unity Editor"_ appears on screen, press the :arrow_forward: button in Unity.
9. In the Unity window, control the agent that has the Teacher Brain and generate play data as you like.
10. Watch the agent(s) with the Student Brain; they will start to behave similarly to the agent with the Teacher Brain.
11. Once the student agents behave as desired, stop training by pressing `CTL+C` on the command line.
12. Move the generated `*.nn` file into the `TFModels` subfolder of the Assets folder and use it with a `Learning` Brain.

With this, the following keyboard shortcuts become available:
1. Start or stop recording. This is useful when you want to play the game through the agent without the agent learning from the play. The default key for this is `R`.
2. Reset the training buffer. This tells the agent to clear its buffer of recent experiences, which is useful when you want the agent to quickly learn a new behavior. The default key for resetting the buffer is `C`.

48
docs/localized/KR/docs/Training-PPO.md


# Training with Proximal Policy Optimization
ML-Agents uses a reinforcement learning technique called [Proximal Policy Optimization (PPO)](https://blog.openai.com/openai-baselines-ppo/).
PPO uses a neural network to approximate the ideal function that maps an agent's observations to the best action the agent can take in a given state. The ML-Agents PPO algorithm is implemented in TensorFlow and runs in a separate Python process (communicating with the running Unity application over a socket).
To train an agent, you must define one or more reward signals that the agent should try to maximize. See the [Reward Signals](Reward-Signals.md) document for the available reward signals and the associated hyperparameters.
See [Training ML-Agents](Training-ML-Agents.md) for how to run the training program using `learn.py`.

If you are using Curriculum Learning, in which the difficulty of the problem presented to the agent is gradually increased during training, see [Training with Curriculum Learning](Training-Curriculum-Learning.md).
For information about Imitation Learning, see [Training with Imitation Learning](Training-Imitation-Learning.md).

Successfully training a reinforcement learning model requires tuning the training hyperparameters. This guide explains how to tune them when training with the default parameters does not produce the performance you want.
In reinforcement learning, the goal is to learn a policy that maximizes reward. By default, the reward is provided by the environment. However, we can also imagine rewarding the agent for various other behaviors; for example, we could reward the agent when it explores a new state. Adding such reward signals can help the learning process.
`reward_signals` defines the [reward signals](Reward-Signals.md). By default, ML-Agents provides two reward signals: the extrinsic (environment) reward and the curiosity reward. The curiosity reward helps the agent explore more in environments where the extrinsic reward is sparse.
`lambd` corresponds to the `lambda` parameter used when computing the Generalized Advantage Estimate ([GAE](https://arxiv.org/abs/1506.02438)). It determines how much the agent relies on its current value estimate when calculating an updated value estimate. A low value means relying more on the current value estimate (which can produce high bias), while a high value means relying more on the actual rewards received from the environment (which can produce high variance). The choice of this parameter is therefore a trade-off between the two properties, and choosing it well leads to more stable training.
`buffer_size` determines how many experiences (observations, actions, rewards, etc.) are collected before the model is updated. **This value should be a multiple of `batch_size`.** A larger `buffer_size` generally corresponds to more stable training.
`batch_size` is the number of experiences used for one gradient descent update. **It should always be a divisor of `buffer_size`.** If you are using a continuous action space, this value should be large (on the order of thousands). If you are using a discrete action space, it should be smaller (on the order of tens).
Typical range (continuous actions): `512` - `5120`

`num_epoch` is the number of passes through the experience buffer during gradient descent. The larger `batch_size` is, the larger this value should be. Decreasing it guarantees more stable updates at the cost of slower learning.
`learning_rate` determines the strength of each gradient descent update step. It should typically be decreased if training is unstable and the agent's reward does not consistently increase.
`time_horizon` is the number of steps of experience to collect per agent before adding them to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. Depending on this setting, you therefore get a less biased but higher-variance estimate (long time horizon) or a more biased but lower-variance estimate (short time horizon). When rewards occur frequently within an episode, or when episodes are extremely long, a small time horizon is ideal. The value should be large enough to capture all the important behaviors within a sequence of the agent's actions.
`max_steps` determines how many simulation steps (multiplied by the frame-skip) are run during the training process. It should be larger for more complex problems.
`beta` controls the strength of entropy regularization, which makes the policy more random. It lets the agent properly explore the action space during training. Increasing it makes the agent take more random actions. Entropy (measurable via TensorBoard) should slowly decrease as reward increases. If entropy drops too quickly, increase `beta`; if it drops too slowly, decrease `beta`.
`epsilon` limits how much the ratio between the old and new policy can change during a gradient descent update. Setting it small results in more stable training, but training will progress more slowly.
Typical range: `0.1` - `0.3`

### Number of Layers
`num_layers` is the number of hidden layers used after the observation input, or after the CNN encoding of visual observations. For simple problems, use fewer layers to train quickly and efficiently. More layers may be needed for complex control problems.
`hidden_units` is the number of units in each fully connected layer of the neural network. For simple problems where the optimal action is a straightforward combination of the observation inputs, keep it small. For hard problems where the optimal action depends on complex relationships between the observation inputs, make it larger.
The hyperparameters below are used only when `use_recurrent` is set to `True`.
`sequence_length` is the length of the sequences of consecutive experiences passed through the network during training. If there is information the agent needs to remember over long stretches of time, make this long enough. For example, if the agent only needs to remember the velocity of an object, this value can stay small; if it needs to remember information given once at the start of an episode, it should be large.
`memory_size` is the size of the array used to store the hidden state of the recurrent neural network. It must be a multiple of 4 and should be scaled according to the amount of information the agent needs to remember to successfully complete its task.
Typical range: `64` - `512`

### Cumulative Reward
The reward should generally show a consistently increasing trend. Small ups and downs are expected. Depending on the complexity of the task, the reward may not increase until millions of steps into training.
### Entropy

This value decreases linearly over time.
These values oscillate during training. Generally, they should be less than 1.
These values should increase as the cumulative reward increases. They correspond to how much future reward the agent predicts it will receive at a given point.
### Value Loss

10
docs/localized/KR/docs/Using-Docker.md


- If you don't have Docker installed, [download](https://www.docker.com/community-edition#/download) and install it.
- Since Docker runs in an environment isolated from the host machine, a directory mounted inside the host machine is used to share the trainer configuration file,
For this purpose, we created an empty `unity-volume` directory at the root of the repository for convenience, but you are free to use a different directory.
The rest of this guide assumes that the `unity-volume` directory is used.
## Usage

_If you want to use the Editor for training, you can skip this step._
Since Docker typically runs containers that share the (Linux) kernel with the host machine,
the Unity environment has to be built for the Linux platform. When building the Unity environment, select the following
options in the Build Settings window:

This is optional, and if it is not set Docker generates a random name. _Note that every run of the Docker image must have a
unique name._
- `<image-name>` references the image name used when building the container.
- `<environment-name>` __(Optional)__: when training with a Linux executable, this argument is the name of the executable.
When training in the Editor, do not pass the `<environment-name>` argument and press the :arrow_forward: button in Unity when the _"Start training by pressing
the Play button in the Unity Editor"_ message is displayed on the screen.
- `source`: references the path on the host OS where the Unity executable will be stored.

- `trainer-config-file`, `train`, `run-id`: ML-Agents arguments passed to `mlagents-learn`. `trainer-config-file` is the name of the trainer configuration file,
`train` trains the algorithm, and `run-id` is used to tag each experiment with a unique identifier.
We recommend placing the trainer-config file inside `unity-volume` so that the container can access the file.
The following command is used to train the `3DBall` environment executable:

4
docs/localized/zh-CN/README.md


**Note:** This document is a partial translation of the v0.3 documentation and is not updated as the English documentation changes. For the newer and more complete English documentation, see [here](https://github.com/Unity-Technologies/ml-agents).
**Unity Machine Learning Agents** (ML-Agents) is an open-source Unity plugin
that enables games and simulation environments to serve as places to train intelligent agents. Agents can be trained using reinforcement learning, imitation learning, neuroevolution, or other machine learning methods, controlled through a simple-to-use Python API. We also provide implementations of state-of-the-art algorithms (based on
TensorFlow) to enable game developers and hobbyists to easily
train intelligent agents for 2D, 3D and VR/AR games.
These trained agents can be used for multiple purposes,

[Code of Conduct](/CODE_OF_CONDUCT.md).
You can connect with us and the broader community through Unity Connect and GitHub:
* Join our
[Unity Machine Learning Channel](https://connect.unity.com/messages/c/035fba4f88400000)
to connect with others using ML-Agents and Unity developers enthusiastic about machine learning. We use the channel to surface updates regarding ML-Agents

66
docs/localized/zh-CN/docs/Getting-Started-with-Balance-Ball.md


**Note:** In Unity, the base object of everything in a scene is the
_GameObject_. The GameObject is essentially a container for everything else,
including behaviors, graphics, physics, and so on. To see the components that make up a GameObject,
select the GameObject in the Scene window and open
When you open the 3D Balance Ball scene, you may first notice that it contains
not one, but several platforms. Each platform in the scene is an
independent agent, but they all share the same Brain. 3D Balance Ball does this

When you look at the Academy component in the Inspector, you can see several
properties that control how the environment works. For example, the Inspector shows the
**Training** and **Inference Configuration** properties, where we can set the
graphics and Time Scale properties of the Unity executable that is generated later. The Academy uses the
**Training Configuration** during training and the **Inference Configuration** when not training.
and a high Time Scale, while the **Inference Configuration** uses high graphics quality and
**Note:** If you want to observe the environment during training, you can adjust the
**Inference Configuration** settings to use a larger window and a time scale closer to
1:1. Be sure to set these parameters back before a real training run;
otherwise, training can take a very long time.

* Academy.InitializeAcademy() — called once when the environment is launched.
* Academy.AcademyStep() — called at every simulation step, before
Agent.AgentAction() (and after the agents collect their observations).
* Academy.AcademyReset() — called when the Academy starts or restarts the simulation
(including the first time).

### Brain
The Ball3DBrain GameObject in the scene contains a Brain component
and is a child of the Academy object. (All Brain objects in a scene must be
children of the Academy.) All the agents in the 3D Balance Ball environment use
the same Brain instance.
A Brain doesn't store any information about an agent,

implement your own CoreBrain to create your own type.
In this tutorial, you set the **Brain Type** to **External** for training;
when you embed the trained model into the Unity application, you change the
**Brain Type** to **Internal**.
**Vector Observation Space**

**Continuous** and **Discrete**. The **Continuous** vector observation space
collects observations in a vector of floating point numbers. The **Discrete**
The Brain instance used in the 3D Balance Ball example uses a **Continuous**
vector observation space with a **State Size** of 8. This means that the
feature vector containing the agent's observations contains eight elements:
the `x` and `z` components of the platform's rotation and the relative position and

**Vector Action Space**
A Brain provides instructions to the agents in the form of *actions*. As with states,
ML-Agents classifies actions into two types: **Continuous**
For example, an element might represent a force or torque applied to the agent's
`Rigidbody`. The **Discrete** vector action space defines its actions
as a table. The specific action given to the agent is an index into
this table.

You can try training with both settings and observe whether there is
a difference. (Set `Vector Action Space Size` to 4 when using the discrete action space
and to 2 when using the continuous action space.)
### Agent
The Agent is the actor that observes and takes actions in the environment.

* **Brain** — every Agent must have a Brain. The Brain determines how the agent
makes decisions. All the agents in the 3D Balance Ball scene share the same
Brain.
* **Visual Observations** — defines any Camera objects used by the agent to observe
its environment. 3D Balance Ball does not use camera observations.

3D Balance Ball sets this to true, so the agent restarts after reaching the
**Max Step** count or after dropping the ball.
Perhaps the more interesting aspect of an agent is the Agent subclass's

training is not limited to a particular starting position and platform
attitude.
* Agent.CollectObservations() — called at every simulation step. Responsible for
collecting the agent's observations of the environment. Since the Brain assigned to
`CollectObservations()` must therefore call
`AddVectorObs` 8 times.
* Agent.AgentAction() — called at every simulation step. Receives the action chosen by the
Brain. The Ball3DAgent example handles both the continuous and the discrete

## Building the Environment
The first step is to open the Unity scene containing the 3D Balance Ball environment:
3. Using the file dialog that opens, locate the folder inside the ML-Agents project
4. In the `Project` window, navigate to the folder
`Assets/ML-Agents/Examples/3DBall/`
5. Double-click the `Scene` file to load the scene containing the Balance Ball
environment.

* The environment application runs in the background
* No dialogs require interaction
* The correct scene loads automatically
1. Open Player Settings (menu: **Edit** > **Project Settings** > **Player**).
2. Under **Resolution and Presentation**:
   - Make sure that **Run in Background** is checked.

## Training the Brain with Reinforcement Learning
Now that we have a Unity executable containing the simulation environment, we
can perform the training. To first make sure that your environment and the Python
API work correctly, you can use the `python/Basics`
[Jupyter notebook](/docs/Background-Jupyter.md).
This notebook contains a simple walkthrough of the API's functionality. Within
`Basics`, be sure to set `env_name` to the name of the environment file you built earlier

To train the agents to correctly balance the ball, we will use a method called Proximal Policy Optimization (PPO).
It is an effective and more general-purpose method, which is why we chose it as the algorithm used with ML-Agents
To train the agents in the Balance Ball environment, we will use the Python
We use `run_id` to identify the experiment and create a folder to store the model and summary statistics. When using
TensorBoard to observe the training statistics, it is useful to set this to a sequential value
for each training run. That is, "BalanceBall1" for the first run,
"BalanceBall2" for the second, and so on. If you don't, every run's

To summarize, go to the command line, enter the `ml-agents` directory and type:
```
python3 python/learn.py <env_name> --run-id=<run-identifier> --train
```
The `--train` flag tells ML-Agents to run in training mode. `env_name` should be the name of the Unity executable you just created.

* Cumulative Reward - the mean cumulative episode reward over all agents.
Should increase during a successful training session.
* Entropy - how random the decisions of the model are. Should slowly decrease
during a successful training process. If it decreases too quickly, increase the `beta`
hyperparameter.
* Episode Length - the mean length of each episode in the environment for all
agents.

1. The trained model is stored in `models/<run-identifier>` inside the `ml-agents` folder. Once
training is complete, there will be a `<env_name>.bytes` file in that location, where `<env_name>` is the name of the executable used during
training.
2. Move `<env_name>.bytes` from `python/models/ppo/` into
6. Drag the `<env_name>.bytes` file from the Editor's Project window into
the `Graph Model` placeholder in the `3DBallBrain` Inspector window.
7. Press the Play button at the top of the Editor.

8
docs/localized/zh-CN/docs/Installation.md


## Install **Unity 2017.1** or Later
[Download](https://store.unity.com/download) and install Unity. If you want to
use our Docker set-up (introduced later), make sure to select the following when installing Unity
<img src="images/unity_linux_build_support.png"
alt="Linux Build Support"
width="500" border="10" />
</p>

git clone git@github.com:Unity-Technologies/ml-agents.git
The `unity-environment` directory in this repository contains the
Unity Assets to add to your project. The `python` directory contains the training code.
Both directories are located at the root of the repository.

24
docs/localized/zh-CN/docs/Learning-Environment-Create-New.md


using System.Collections.Generic;
using UnityEngine;
public class RollerAgent : Agent
{
Rigidbody rBody;
void Start () {

public override void AgentReset()
{
if (this.transform.position.y < -1.0)
{
// The agent fell
this.transform.position = Vector3.zero;
this.rBody.angularVelocity = Vector3.zero;

{
// Move the target to a new location
Target.position = new Vector3(Random.value * 8 - 4,
0.5f,

{
// Calculate the relative position
Vector3 relativePosition = Target.position - this.transform.position;
// Agent velocity
AddVectorObs(rBody.velocity.x/5);
AddVectorObs(rBody.velocity.z/5);

```
**AgentAction()**
Using the action and reward logic outlined above, the final version of the `AgentAction()` function looks like this:
```csharp

public override void AgentAction(float[] vectorAction, string textAction)
{
// Rewards
float distanceToTarget = Vector3.Distance(this.transform.position,
// Reached the target
if (distanceToTarget < 1.42f)
{

// Getting closer
if (distanceToTarget < previousDistance)
{

Press **Play** to run the scene and use the WASD keys to move the agent around the platform. Make sure there are no errors displayed in the Unity Editor Console window and that the agent resets when it reaches its target or falls from the platform. Note that for more involved debugging, the ML-Agents SDK includes a convenient Monitor class that you can use to easily display agent status information in the Game window.
One additional test you can perform is to first use the `python/Basics`
to make sure that your environment and the Python API work correctly. Within `Basics`, be sure to set
`env_name` to the name of the executable you generated for this
environment.

8
docs/localized/zh-CN/docs/Learning-Environment-Design.md


8. When the Academy reaches its own `Max Step` count, it starts the next episode by calling your Academy subclass's `AcademyReset()` function.
To create a training environment, extend the Academy and Agent classes to implement the methods above. The `Agent.CollectObservations()` and `Agent.AgentAction()` functions are required; the other methods are optional, i.e. whether you need to implement them depends on your specific situation.
**Note:** The Python API used here can also be used for other purposes. For example, with this API you could use Unity as the simulation engine for your own machine learning algorithms. See [Python API](/docs/Python-API.md) for more information.
## Organizing the Unity Scene

* `AcademyStep()` — prepares the environment for the next simulation step. The Academy base class calls this function before calling any `AgentAction()` methods for the current step. You can use this function to update other objects in the scene before the agents take their actions. Note that the agents have already collected their observations and chosen an action before the Academy calls this method.
The Academy base class also defines several important properties that you can set in the Unity Editor Inspector. For training, the most important of these is `Max Steps`, which determines how long each training episode lasts. Once the Academy's step counter reaches this value, it calls the `AcademyReset()` function to start the next round of simulation.
The Brain encapsulates the decision-making process. Brain objects must be children of the Academy in the Hierarchy view. Every Agent must be assigned a Brain, but the same Brain can be shared by multiple Agents.
You use the Brain class directly rather than a subclass of it. A Brain's behavior is determined by its type. During training, set the Brain Type of the Brain connected to your agents to **External**. To use a trained model, import the model file into the Unity project and change the corresponding Brain's Brain Type to **Internal**. See [Brain](/docs/Learning-Environment-Design-Brains.md) for details on using the different types of Brains. If the four built-in types do not meet your needs, you can extend the CoreBrain class to create other Brain types.

* `AgentAction()` — carries out the action chosen by the agent's Brain and assigns a reward for the current state.
The implementation of these functions determines how the properties of the Brain assigned to this agent must be set.
You must also decide how an Agent finishes its task and what happens when it times out. You can manually mark the agent as done in the `AgentAction()` function once it has completed (or irrevocably failed) its task. You can also set the agent's `Max Steps` property to a positive value, in which case the agent considers itself done after taking that number of steps. Once the Academy reaches its own `Max Steps` count, it starts the next episode. If you set the agent's `ResetOnDone` property to true, the agent can attempt its task several times within one episode. (Use the `Agent.AgentReset()` function to set up the agent's initialization logic so it is ready for the next attempt.)
See [Agents](/docs/Learning-Environment-Design-Agents.md) for detailed information on how to write your own agent.

2
docs/localized/zh-CN/docs/Learning-Environment-Examples.md


# Example Learning Environments
The Unity ML-Agents toolkit includes a set of pre-built example learning environments, and we are continuously adding new ones; these examples demonstrate the various features of the platform. The example environments are located in
`unity-environment/Assets/ML-Agents/Examples` and are briefly described below.
In addition, our
[first ML-Agents Challenge](https://connect.unity.com/challenges/ml-agents-1)

60
docs/localized/zh-CN/docs/ML-Agents-Overview.md


**Unity Machine Learning Agents** (ML-Agents) is an open-source Unity plugin that enables games and other simulation environments to serve as places to train intelligent agents. Agents can be trained using reinforcement learning,
imitation learning, neuroevolution, or other machine learning methods
through a simple-to-use Python API. We also provide implementations of state-of-the-art algorithms (based on
TensorFlow) to enable game developers and hobbyists to easily
train intelligent agents for 2D, 3D and VR/AR games.
These trained agents can be used for multiple purposes,

Depending on your background (e.g. researcher, game developer, hobbyist),
you may have very different questions in mind right now.
To make your transition to ML-Agents easier,
we provide several background pages that include overviews and helpful resources on
[Machine Learning](/docs/Background-Machine-Learning.md) and
[TensorFlow](/docs/Background-TensorFlow.md). If you are not familiar with Unity scenes, basic machine learning concepts, or have not previously heard of TensorFlow, we **strongly** recommend browsing the relevant background pages.
The remainder of this page dives into ML-Agents, its key components,

all machine learning algorithms. Note that,
unlike the Learning Environment, the Python API is not part of Unity, but lives outside
and communicates with Unity through the External Communicator.
* **External Communicator** - connects the Unity environment with the Python API
<img src="images/learning_environment_basic.png"
alt="Simplified ML-Agents Scene Block Diagram"
width="700" border="10" />
</p>

The Brain defines the space of all possible observations and actions,
while the Agents connected to it (in this example, the medics) can each have
their own unique observation and action values. If we expanded the game
to include tank driver NPCs, then the Agents attached to those characters
<img src="images/learning_environment_example.png"
alt="Example ML-Agents Scene Block Diagram"
We have not yet discussed how ML-Agents trains behaviors, or what roles the Python API and
External Communicator play. Before we dive into those details,
let's summarize the earlier components. Each character is attached to an Agent,
and each Agent is connected to a Brain. The Brain receives observations and rewards from the Agent and returns actions. The Academy, in addition to being able to control environment-wide parameters, ensures that all Agents and Brains stay in sync. So how does the Brain control what the Agent does?

The observations and rewards collected by the Brain are forwarded through the External Communicator
* **Internal** - decisions are made using an embedded
[TensorFlow](/docs/Background-TensorFlow.md) model.
The embedded TensorFlow model represents a learned policy and the Brain directly uses
this model to determine the action for each Agent.

an Agent with hard-coded behavior. It can also help to compare such a hard-coded Agent
with a trained Agent. In our example, once we have
trained a Brain for the medics, we could assign the medics on one team the
trained Brain, while assigning the medics on the other team a
Heuristic Brain with hard-coded behavior. We could then evaluate which medics are more effective.
From what we have described so far, it may seem that the External Communicator and Python API

see, this enables additional training modes.
<p align="center">
<img src="images/learning_environment.png"
alt="ML-Agents Scene Block Diagram"
border="10" />
</p>

so the learned policy is just a TensorFlow model file. Then during the inference phase,
we switch the Brain type to Internal and include the TensorFlow model
generated from the training phase. Now during the inference phase, the medics
still continue to generate their observations, but instead of being sent to the
During training, the Python API uses the observations it receives to learn a
TensorFlow model. This model is then embedded during inference within the
since it is limited to TensorFlow models and leverages the third-party
[TensorFlowSharp](https://github.com/migueldeicaza/TensorFlowSharp)
library.**

### Custom Training and Inference
In the previous mode, the External Brain type was used for training
to generate a TensorFlow model that the Internal Brain type can understand and use.
However, any user of ML-Agents can leverage their own algorithms
for training and inference. In this case, the Brain type would be set to External
for both the training and inference phases, and the behaviors of all the Agents in the scene

providing a foundation for harder tasks.
<p align="center">
<img src="images/math.png"
alt="Example Math Curriculum"
width="700"
border="10" />
</p>

that is, the policy keeps improving as the environment gradually becomes
more complex. In our example, we can consider first training the medic when each team only contains one
player, and then iteratively increasing the number of players
(i.e. the environment complexity). ML-Agents supports setting
custom environment parameters within the Academy. This allows
elements of the environment related to difficulty or complexity (such as game objects)
to be dynamically adjusted based on training progress.

the agent must learn to remember the past in order to make the
best decision. When an agent only partially observes the environment,
keeping track of past observations can help the agent learn. Our
trainers provide an implementation of _Long Short-Term Memory_
([LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory))
that enables the agent to store memories to be
used in future steps. You can learn more about

a navigation agent with first-person vision. You can learn more about adding
visual observations to an agent
[here](/docs/Learning-Environment-Design-Agents.md#multiple-visual-observations).
* **Broadcasting** - As mentioned earlier, an External Brain sends the
observations of all its Agents to the Python API by default. This is helpful
for training or inference. Broadcasting is a feature that can be

You can learn more about using the broadcasting feature
[here](/docs/Learning-Environment-Design-Brains.md#using-the-broadcast-feature).
* **Docker Set-up (Experimental)** - To facilitate setting up ML-Agents without installing
* **Cloud Training on AWS** - To facilitate using ML-Agents on Amazon Web Services (AWS)
machines, we provide a guide on how to set up EC2 instances, as well as a public pre-configured Amazon
* **Cloud Training on Microsoft Azure** - To facilitate using ML-Agents on
Azure machines, we provide a
[guide](Training-on-Microsoft-Azure.md)
on how to set up virtual machine instances in addition to a pre-configured data science image.

4
docs/localized/zh-CN/docs/Readme.md


* [Best Practices When Designing a Learning Environment](/docs/Learning-Environment-Best-Practices.md)
* [How to Use the Monitor Feature](/docs/Feature-Monitor.md)
* [How to Use the TensorFlowSharp Plugin (Experimental)](/docs/Using-TensorFlow-Sharp-in-Unity.md)
## Training
* [Training with ML-Agents](/docs/Training-ML-Agents.md)
* [Training with Proximal Policy Optimization](/docs/Training-PPO.md)

* [Frequently Asked Questions](/docs/FAQ.md)
* [ML-Agents Glossary](/docs/Glossary.md)
* [ML-Agents Limitations](/docs/Limitations.md)
## API Documentation
* [API Reference](/docs/API-Reference.md)
* [How to Use the Python API](/docs/Python-API.md)

92
gym-unity/README.md


observations (False) as the default observation provided by the `reset` and
`step` functions. Defaults to `False`.
* `uint8_visual` refers to whether to output visual observations as `uint8` values
(0-255). Many common Gym environments (e.g. Atari) do this. By default they
* `flatten_branched` will flatten a branched discrete action space into a Gym Discrete.
Otherwise, it will be converted into a MultiDiscrete. Defaults to `False`.
* `allow_multiple_visual_obs` will return a list of observation instead of only
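Putting these options together, a minimal sketch of constructing the wrapper might look like this (the environment path is a placeholder for your own build):

```python
# Illustrative: create the Gym wrapper with the options described above.
from gym_unity.envs import UnityEnv

env = UnityEnv(
    "./envs/GridWorld",      # path to a built Unity environment (placeholder)
    worker_id=0,
    use_visual=True,         # use visual observations
    uint8_visual=True,       # return pixels as uint8 (0-255), Atari-style
    flatten_branched=True,   # branched discrete actions -> a single Discrete space
)
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```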

## Limitations
* It is only possible to use an environment with a single Agent.
* By default, the first visual observation is provided as the `observation`, if
present. Otherwise, vector observations are provided. You can receive all visual
observations by using the `allow_multiple_visual_obs=True` option in the gym
parameters. If set to `True`, you will receive a list of `observation` instead

### Example - DQN Baseline
In order to train an agent to play the `GridWorld` environment using the
Baselines DQN algorithm, you first need to install the baselines package using
pip:
```

Next, create a file called `train_unity.py`. Then create an `/envs/` directory
and build the GridWorld environment to that directory. For more information on
building Unity environments, see
[here](../docs/Learning-Environment-Executable.md). Add the following code to
the `train_unity.py` file:
```python

### Other Algorithms
Other algorithms in the Baselines repository can be run using scripts similar to
the examples from the baselines package. In most cases, the primary changes needed
to use a Unity environment are to import `UnityEnv`, and to replace the environment
creation code, typically `gym.make()`, with a call to `UnityEnv(env_path)`
passing the environment binary path.
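In practice the swap usually amounts to something like the following sketch (the path is a placeholder for your own build):

```python
# Typical change in a Baselines script: swap gym.make() for UnityEnv().
from gym_unity.envs import UnityEnv

# env = gym.make("PongNoFrameskip-v4")   # before
env = UnityEnv("./envs/GridWorld", worker_id=0,
               use_visual=True, uint8_visual=True)   # after
```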

environments, modification should be done to Mujoco scripts.
Some algorithms will make use of `make_env()` or `make_mujoco_env()`
functions. You can define a similar function for Unity environments. An example of
such a method using the PPO2 baseline:
```python
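# Sketch (not the verbatim example elided here): a make_env()-style helper for
# Unity environments; `rank` offsets worker_id so several copies can run at once.
from gym_unity.envs import UnityEnv

def make_unity_env(env_path, rank=0, use_visual=True):
    def _thunk():
        return UnityEnv(env_path, worker_id=rank, use_visual=use_visual)
    return _thunk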

## Run Google Dopamine Algorithms
Google provides a framework [Dopamine](https://github.com/google/dopamine), and
implementations of algorithms, e.g. DQN, Rainbow, and the C51 variant of Rainbow.
Using the Gym wrapper, we can run Unity environments using Dopamine.
First, after installing the Gym wrapper, clone the Dopamine repository.
Then, follow the appropriate install instructions as specified on
[Dopamine's homepage](https://github.com/google/dopamine). Note that the Dopamine
guide specifies using a virtualenv. If you choose to do so, make sure your unity_env
First, open `dopamine/atari/run_experiment.py`. Alternatively, copy the entire `atari`
Within `run_experiment.py`, we will need to make changes to which environment is
instantiated, just as in the Baselines example. At the top of the file, insert
to import the Gym Wrapper. Navigate to the `create_atari_environment` method
the method with the following code.
```python
game_version = 'v0' if sticky_actions else 'v4'

```
`./envs/GridWorld` is the path to your built Unity executable. For more information on
building Unity environments, see [here](../docs/Learning-Environment-Executable.md), and note
the Limitations section below.
Note that we are not using the preprocessor from Dopamine,
as it uses many Atari-specific calls. Furthermore, frame-skipping can be done from within Unity,
rather than on the Python side.
that use branched discrete action spaces (e.g.
[VisualBanana](../docs/Learning-Environment-Examples.md)), you can enable the
`flatten_branched` parameter in `UnityEnv`, which treats each combination of branched
Dopamine's agents currently do not automatically adapt to the observation
dimensions or number of channels.
likely need to adjust them for ML-Agents environments. Here is a sample
`dopamine/agents/rainbow/configs/rainbow.gin` file that is known to work with
GridWorld.
```python
import dopamine.agents.rainbow.rainbow_agent

```
This example assumed you copied `atari` to a separate folder named `unity`.
Replace `unity` in `import dopamine.unity.run_experiment` with the folder you
If you directly modified the existing files, then use `atari` here.
### Starting a Run

--gin_files='dopamine/agents/rainbow/configs/rainbow.gin'
```
Again, we assume that you've copied `atari` into a separate folder.
Remember to replace `unity` with the directory you copied your files into. If you
edited the Atari files directly, this should be `atari`.

Dopamine as run on the GridWorld example environment. All Dopamine (DQN, Rainbow,
C51) runs were done with the same epsilon, epsilon decay, replay history, training steps,
the training buffer, and no learning happens.
We provide results from our PPO implementation and the DQN from Baselines as reference.
Note that all runs used the same greyscale GridWorld as Dopamine. For PPO, `num_layers`
was set to 2, and all other hyperparameters are the default for GridWorld in `trainer_config.yaml`.
For Baselines DQN, the provided hyperparameters in the previous section are used. Note
that Baselines implements certain features (e.g. dueling-Q) that are not enabled

### Example: VisualBanana
As an example of using the `flatten_branched` option, we also used the Rainbow
algorithm to train on the VisualBanana environment, and provide the results below.
The same hyperparameters were used as in the GridWorld case, except that
`replay_history` and `epsilon_decay` were increased to 100000.
![Dopamine on VisualBanana](images/dopamine_visualbanana_plot.png)

6
ml-agents-envs/README.md


The `mlagents_envs` Python package is part of the
[ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents).
`mlagents_envs` provides a Python API that allows direct interaction with the Unity
game engine. It is used by the trainer implementation in `mlagents` as well as
the `gym-unity` package to perform reinforcement learning within Unity. `mlagents_envs` can be
used independently of `mlagents` for Python communication.
The `mlagents_envs` Python package contains one sub package:

6
protobuf-definitions/README.md


`sudo apt-get install nuget`
Navigate to your installation of nuget and run the following:
`nuget install Grpc.Tools -Version 1.14.1 -OutputDirectory $MLAGENTS_ROOT\protobuf-definitions`

# if UNITY_EDITOR || UNITY_STANDALONE_WIN || UNITY_STANDALONE_OSX || UNITY_STANDALONE_LINUX
```
and the following line to the end
```csharp
#endif
```

mlagents-learn
```
The final line will test if everything was generated and installed correctly. If it worked, you should see the Unity logo.