Merge pull request #433 from Unity-Technologies/docs-training-brains-etc
Docs training brains etc/develop-generalizationTraining-TrainerController
GitHub · 7 years ago
Current commit: 54357ee8
17 files changed, with 443 additions and 144 deletions
- docs/Feature-Broadcasting.md (11 changes)
- docs/Feature-Memory.md (4 changes)
- docs/Getting-Started-with-Balance-Ball.md (6 changes)
- docs/Learning-Environment-Create-New.md (27 changes)
- docs/Learning-Environment-Design-Agents.md (46 changes)
- docs/Learning-Environment-Design-Brains.md (45 changes)
- docs/Learning-Environment-Design.md (14 changes)
- docs/Learning-Environment-Examples.md (6 changes)
- docs/Python-API.md (18 changes)
- docs/Training-ML-Agents.md (91 changes)
- docs/Training-PPO.md (16 changes)
- docs/Using-TensorFlow-Sharp-in-Unity.md (122 changes)
- docs/Using-Tensorboard.md (53 changes)
- docs/dox-ml-agents.conf (15 changes)
- docs/Learning-Environment-Design-External-Internal-Brains.md (62 changes)
- docs/Learning-Environment-Design-Heuristic-Brains.md (22 changes)
- docs/Learning-Environment-Design-Player-Brains.md (29 changes)
# Using the Broadcast Feature

The Player, Heuristic and Internal brains have been updated to support broadcast. The broadcast feature allows you to collect data from your agents using a Python program without controlling them.

When you launch your Unity Environment from a Python program, you can see what the agents connected to non-external brains are doing. When calling `step` or `reset` on your environment, you retrieve a dictionary mapping brain names to `BrainInfo` objects. The dictionary contains a `BrainInfo` object for each non-external brain set to broadcast, as well as for any external brains.

You can use the broadcast feature to collect data generated by Player, Heuristic or Internal brain game sessions. You can then use this data to train an agent in a supervised context.
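For example, a minimal sketch of reading broadcast data through the Python API might look like the following. It assumes the `unityagents` package from this release, a built environment (here called "3DBall"), and a scene with no External brains; the `BrainInfo` attribute names (`rewards`, `vector_observations`, `previous_vector_actions`) may differ in other versions.

```python
from unityagents import UnityEnvironment

# Launch a built environment (the name "3DBall" is only an example).
# Brains set to broadcast show up in the returned BrainInfo dictionaries
# even though this script never sends them actions.
env = UnityEnvironment(file_name="3DBall")

info = env.reset(train_mode=False)          # dict: brain name -> BrainInfo
for brain_name, brain_info in info.items():
    print(brain_name, brain_info.rewards)

# Step the simulation; agents driven by Player, Heuristic or Internal brains
# keep acting on their own, and we only observe the results. (Because the
# scene has no External brains, step() needs no actions.)
observations, actions = [], []
for _ in range(100):
    info = env.step()
    for brain_info in info.values():
        observations.append(brain_info.vector_observations)
        actions.append(brain_info.previous_vector_actions)

env.close()
```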
# Training ML-Agents

ML-Agents conducts training using an external Python training process. During training, this external process communicates with the Academy object in the Unity scene to generate a block of agent experiences. These experiences become the training set for a neural network used to optimize the agent's policy (which is essentially a mathematical function mapping observations to actions). In reinforcement learning, the neural network optimizes the policy by maximizing the expected rewards. In imitation learning, the neural network optimizes the policy to achieve the smallest difference between the actions chosen by the agent trainee and the actions chosen by the expert in the same situation.
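As a rough guide (these are standard textbook formulations, not taken from the ML-Agents code), the two objectives can be written as:

$$\text{Reinforcement learning:}\quad \max_{\theta}\ \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_{t} \gamma^{t} r_t\Big] \qquad\qquad \text{Imitation learning:}\quad \min_{\theta}\ \mathbb{E}\big[\mathcal{L}\big(\pi_\theta(o_t),\ a_t^{\text{expert}}\big)\big]$$

where $\pi_\theta$ is the policy computed by the neural network with weights $\theta$, $r_t$ is the reward at step $t$, $\gamma$ is the discount rate (the `gamma` hyperparameter in the training configuration), and $\mathcal{L}$ measures the difference between the trainee's action and the expert's action for the same observation $o_t$.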
The output of the training process is a model file containing the optimized policy. This model file is a TensorFlow data graph containing the mathematical operations and the optimized weights selected during the training process. You can use the generated model file with the Internal Brain type in your Unity project to decide the best course of action for an agent.

Use the Python program `learn.py` to train your agents. This program can be found in the `python` directory of the ML-Agents SDK. The [configuration file](#training-config-file), `trainer_config.yaml`, specifies the hyperparameters used during training. You can edit this file with a text editor to add a specific configuration for each brain.

For a broader overview of reinforcement learning, imitation learning and the ML-Agents training process, see [ML-Agents Overview](ML-Agents-Overview.md).

## Training with Learn.py
Use the Python `learn.py` program to train agents. `learn.py` supports training with [reinforcement learning](Background-Machine-Learning.md#reinforcement-learning), [curriculum learning](Training-Curriculum-Learning.md), and [behavioral cloning imitation learning](Training-Imitation-Learning.md).

Run `learn.py` from the command line to launch the training process. Use the command line patterns and the `trainer_config.yaml` file to control training options.

The basic command for training is:

    python learn.py <env_file_path> --run-id=<run-identifier> --train

where `<env_file_path>` is the path to your Unity executable containing the agents to be trained and `<run-identifier>` is an optional identifier you can use to identify the results of individual training runs.

For example, suppose you have a project in Unity named "CatsOnBicycles" which contains agents ready to train. To perform the training:

1. Build the project, making sure that you only include the training scene.
2. Open a terminal or console window.
3. Navigate to the ml-agents `python` folder.
4. Run the following to launch the training process using the path to the Unity environment you built in step 1:

        python learn.py ../../projects/Cats/CatsOnBicycles.app --run-id=cob_1 --train
During a training session, the training program prints out and saves updates at regular intervals (specified by the `summary_freq` option). The saved statistics are grouped by the `run-id` value, so you should assign a unique id to each training run if you plan to view the statistics. You can view these statistics using TensorBoard during or after training by running the following command (from the ML-Agents `python` directory):

    tensorboard --logdir=summaries

and then opening the URL: [localhost:6006](http://localhost:6006).

When training is finished, you can find the saved model in the `python/models` folder under the assigned run-id. In the cats example, the path to the model would be `python/models/cob_1/CatsOnBicycles_cob_1.bytes`.

While this example used the default training hyperparameters, you can edit the [trainer_config.yaml file](#training-config-file) with a text editor to set different values.

### Command line training options
In addition to passing the path of the Unity executable containing your training environment, you can set the following command line options when invoking `learn.py`:

* `--curriculum=<file>` – Specify a curriculum JSON file for defining the lessons for curriculum training. See [Curriculum Training](Training-Curriculum-Learning.md) for more information.
* `--keep-checkpoints=<n>` – Specify the maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the `save-freq` option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint. Defaults to 5.
* `--lesson=<n>` – Specify which lesson to start with when performing curriculum training. Defaults to 0.
* `--load` – If set, the training code loads an already trained model to initialize the neural network before training. The learning code looks for the model in `python/models/<run-id>/` (which is also where it saves models at the end of training). When not set (the default), the neural network weights are randomly initialized and an existing model is not loaded.
* `--run-id=<path>` – Specifies an identifier for each training run. This identifier is used to name the subdirectories in which the trained model and summary statistics are saved as well as the saved model itself. The default id is "ppo". If you use TensorBoard to view the training statistics, always set a unique run-id for each training run. (The statistics for all runs with the same id are combined as if they were produced by the same session.)
* `--save-freq=<n>` – Specifies how often (in steps) to save the model during training. Defaults to 50000.
* `--seed=<n>` – Specifies a number to use as a seed for the random number generator used by the training code.
* `--slow` – Specify this option to run the Unity environment at normal game speed. The `--slow` mode uses the **Time Scale** and **Target Frame Rate** specified in the Academy's **Inference Configuration**. By default, training runs using the speeds specified in your Academy's **Training Configuration**. See [Academy Properties](Learning-Environment-Design-Academy.md#academy-properties).
* `--train` – Specifies whether to train the model or only run in inference mode. When training, **always** use the `--train` option.
* `--worker-id=<n>` – When you are running more than one training environment at the same time, assign each a unique worker-id number. The worker-id is added to the communication port opened between the current instance of `learn.py` and the ExternalCommunicator object in the Unity environment. Defaults to 0. (See the sketch after this list for one way to launch several runs in parallel.)
* `--docker-target-name=<dt>` – The Docker Volume on which to store curriculum, executable, and model files. See [Using Docker](Using-Docker.md).
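For example, to run two training sessions side by side from the `python` folder, a minimal sketch (not part of the SDK; the environment path is the hypothetical one used earlier) could launch `learn.py` twice with distinct `--run-id` and `--worker-id` values:

```python
import subprocess

# Hypothetical environment executable; replace with your own build.
ENV_PATH = "../../projects/Cats/CatsOnBicycles.app"

processes = []
for worker_id in range(2):
    cmd = [
        "python", "learn.py", ENV_PATH,
        "--train",
        "--run-id=cob_parallel_{}".format(worker_id),  # separate statistics/model folders
        "--worker-id={}".format(worker_id),            # separate communication ports
    ]
    processes.append(subprocess.Popen(cmd))

# Wait for both training runs to finish.
for process in processes:
    process.wait()
```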
### Training config file

The training config file, `trainer_config.yaml`, specifies the training method, the hyperparameters, and a few additional values to use during training. The file is divided into sections. The **default** section defines the default values for all the available settings. You can also add new sections to override these defaults to train specific Brains. Name each of these override sections after the GameObject containing the Brain component that should use these settings. (This GameObject will be a child of the Academy in your scene.) Sections for the example environments are included in the provided config file. `learn.py` finds the config file by name and looks for it in the same directory as itself.

| **Setting** | **Description** | **Applies To Trainer** |
| :-- | :-- | :-- |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, BC |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | BC |
| beta | The strength of entropy regularization. | PPO, BC |
| brain_to_imitate | For imitation learning, the name of the GameObject containing the Brain component to imitate. | BC |
| buffer_size | The number of experiences to collect before updating the policy model. | PPO, BC |
| epsilon | Influences how rapidly the policy can evolve during training. | PPO, BC |
| gamma | The reward discount rate for the Generalized Advantage Estimator (GAE). | PPO |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, BC |
| lambd | The regularization parameter. | PPO |
| learning_rate | The initial learning rate for gradient descent. | PPO, BC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, BC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md). | PPO, BC |
| normalize | Whether to automatically normalize observations. | PPO, BC |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO, BC |
| num_layers | The number of hidden layers in the neural network. | PPO, BC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md). | PPO, BC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, BC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, BC |
| trainer | The type of training to perform: "ppo" or "imitation". | PPO, BC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md). | PPO, BC |

(PPO = Proximal Policy Optimization, BC = Behavioral Cloning, i.e., imitation learning)
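To make the override behaviour concrete, here is a small sketch of how a brain-specific section layers on top of the **default** section. It is an illustration only (not how `learn.py` itself is implemented), and the section name "CatsOnBicyclesBrain" and all values are hypothetical:

```python
import yaml  # PyYAML

# A trimmed-down configuration in the same shape as trainer_config.yaml.
CONFIG_TEXT = """
default:
    trainer: ppo
    batch_size: 1024
    buffer_size: 10240
    summary_freq: 1000

CatsOnBicyclesBrain:
    batch_size: 64
    buffer_size: 2048
"""

config = yaml.safe_load(CONFIG_TEXT)

def settings_for(brain_name):
    """Start from the defaults, then apply any brain-specific overrides."""
    merged = dict(config["default"])
    merged.update(config.get(brain_name, {}))
    return merged

print(settings_for("CatsOnBicyclesBrain"))  # batch_size 64, trainer still "ppo"
print(settings_for("SomeOtherBrain"))       # nothing overridden: pure defaults
```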
For specific advice on setting hyperparameters based on the type of training you are conducting, see:

* [Training with PPO](Training-PPO.md)
* [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md)
* [Imitation Learning](Training-Imitation-Learning.md)
* [Training with Curriculum Learning](Training-Curriculum-Learning.md)

You can also compare the [example environments](Learning-Environment-Examples.md) to the corresponding sections of the `trainer_config.yaml` file for each example to see how the hyperparameters and other configuration variables have been changed from the defaults.
# Using TensorBoard to Observe Training

ML-Agents saves statistics during a learning session that you can view with a TensorFlow utility named [TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard).

The `learn.py` program saves training statistics to a folder named `summaries`, organized by the `run-id` value you assign to a training session.

In order to observe the training process, either during training or afterward, start TensorBoard:

1. Open a terminal or console window.
2. Navigate to the ml-agents/python folder.
3. From the command line run:

        tensorboard --logdir=summaries

4. Open a browser window and navigate to [localhost:6006](http://localhost:6006).

**Note:** If you don't assign a `run-id` identifier, `learn.py` uses the default string, "ppo". All the statistics will be saved to the same sub-folder and displayed as one session in TensorBoard. After a few runs, the displays can become difficult to interpret in this situation. You can delete the folders under the `summaries` directory to clear out old statistics.
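If you prefer to script that cleanup rather than deleting folders by hand, a minimal sketch (run from the ml-agents `python` folder; the run-id and the exact sub-folder naming are assumptions) could be:

```python
import shutil
from pathlib import Path

RUN_ID = "ppo"  # hypothetical run-id whose stale statistics you want to remove

# Statistics live in sub-folders of `summaries`, grouped by run-id;
# remove every sub-folder that belongs to this run.
for folder in Path("summaries").glob(RUN_ID + "*"):
    if folder.is_dir():
        shutil.rmtree(folder)
```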
On the left side of the TensorBoard window, you can select which of the training runs you want to display. You can select multiple run-ids to compare statistics. The TensorBoard window also provides options for how to display and smooth graphs.

When you run the training program, `learn.py`, you can use the `--save-freq` option to specify how frequently to save the statistics.

## ML-Agents training statistics

The ML-Agents training program saves the following statistics:

* Lesson - Plots the progress from lesson to lesson. Only interesting when performing [curriculum training](Training-Curriculum-Learning.md).

* Cumulative Reward - The mean cumulative episode reward over all agents. Should increase during a successful training session.

* Entropy - How random the decisions of the model are. Should slowly decrease during a successful training process. If it decreases too quickly, the `beta` hyperparameter should be increased.

* Episode Length - The mean length of each episode in the environment for all agents.

* Learning Rate - How large a step the training algorithm takes as it searches for the optimal policy. Should decrease over time.

* Policy Loss - The mean loss of the policy function update. Correlates to how much the policy (process for deciding actions) is changing. The magnitude of this should decrease during a successful training session.

* Value Estimate - The mean value estimate for all states visited by the agent. Should increase during a successful training session.

* Value Loss - The mean loss of the value function update. Correlates to how well the model is able to predict the value of each state. This should decrease during a successful training session.
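If you ever need these statistics outside of TensorBoard (for example, to plot them yourself), one possible sketch uses the event-accumulator utility that ships with the `tensorboard` package. The summary folder layout and the scalar tag name are assumptions, so check the printed tag list first:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Hypothetical path: one run's sub-folder under `summaries`.
accumulator = EventAccumulator("summaries/cob_1")
accumulator.Reload()

# Discover which scalar tags this run actually recorded.
print(accumulator.Tags()["scalars"])

# Assuming a tag named "Cumulative Reward" exists in this run:
for event in accumulator.Scalars("Cumulative Reward"):
    print(event.step, event.value)
```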
# External and Internal Brains

The **External** and **Internal** types of Brains work in different phases of training. When training your agents, set their brain types to **External**; when using the trained models, set their brain types to **Internal**.

## External Brain

When [running an ML-Agents training algorithm](Training-ML-Agents.md), at least one Brain object in a scene must be set to **External**. This allows the training process to collect the observations of agents using that brain and give the agents their actions.

In addition to using an External brain for training with the ML-Agents learning algorithms, you can use an External brain to control agents in a Unity environment using an external Python program. See [Python API](Python-API.md) for more information.

Unlike the other types, the External Brain has no properties to set in the Unity Inspector window.

## Internal Brain

The Internal Brain type uses a [TensorFlow model](https://www.tensorflow.org/get_started/get_started_for_beginners#models_and_training) to make decisions. The Proximal Policy Optimization (PPO) and Behavioral Cloning algorithms included with the ML-Agents SDK produce trained TensorFlow models that you can use with the Internal Brain type.

A __model__ is a mathematical relationship mapping an agent's observations to its actions. TensorFlow is a software library for performing numerical computation through data flow graphs. A TensorFlow model, then, defines the mathematical relationship between your agent's observations and its actions using a TensorFlow data flow graph.
### Creating a graph model

The training algorithms included in the ML-Agents SDK produce TensorFlow graph models as the end result of the training process. See [Training ML-Agents](Training-ML-Agents.md) for instructions on how to train a model.

### Using a graph model

To use a graph model:

1. Select the Brain GameObject in the **Hierarchy** window of the Unity Editor. (The Brain GameObject must be a child of the Academy GameObject and must have a Brain component.)
2. Set the **Brain Type** to **Internal**.

    **Note:** In order to see the **Internal** Brain Type option, you must [enable TensorFlowSharp](Using-TensorFlow-Sharp-in-Unity.md).

3. Import the `environment_run-id.bytes` file produced by the PPO training program. (Where `environment_run-id` is the name of the model file, which is constructed from the name of your Unity environment executable and the run-id value you assigned when running the training process.)

    You can [import assets into Unity](https://docs.unity3d.com/Manual/ImportingAssets.html) in various ways. The easiest way is to simply drag the file into the **Project** window and drop it into an appropriate folder.

4. Once the `.bytes` file is imported, drag it from the **Project** window to the **Graph Model** field of the Brain component.

If you are using a model produced by the ML-Agents `learn.py` program, use the default values for the other Internal Brain parameters.
### Internal Brain properties

The default values of the TensorFlow graph parameters work with the models produced by the PPO and BC training code in the ML-Agents SDK. To use a default ML-Agents model, the only parameter that you need to set is the `Graph Model`, which must be set to the .bytes file containing the trained model itself.

![Internal Brain Inspector](images/internal_brain.png)

* `Graph Model` : This must be the `bytes` file corresponding to the pretrained TensorFlow graph. (You must first drag this file into your Resources folder and then from the Resources folder into the inspector.)

Only change the following Internal Brain properties if you have created your own TensorFlow model and are not using an ML-Agents model (a short sketch after this list shows one way to look up the node names in your graph):

* `Graph Scope` : If you set a scope while training your TensorFlow model, all your placeholder names will have a prefix. You must specify that prefix here.
* `Batch Size Node Name` : If the batch size is one of the inputs of your graph, you must specify the name of the placeholder here. The brain will automatically make the batch size equal to the number of agents connected to the brain.
* `State Node Name` : If your graph uses the state as an input, you must specify the name of the placeholder here.
* `Recurrent Input Node Name` : If your graph uses a recurrent input / memory as input and outputs new recurrent input / memory, you must specify the name of the input placeholder here.
* `Recurrent Output Node Name` : If your graph uses a recurrent input / memory as input and outputs new recurrent input / memory, you must specify the name of the output placeholder here.
* `Observation Placeholder Name` : If your graph uses observations as input, you must specify it here. Note that the number of observations is equal to the length of `Camera Resolutions` in the brain parameters.
* `Action Node Name` : Specify the name of the placeholder corresponding to the actions of the brain in your graph. If the action space type is continuous, the output must be a one-dimensional tensor of floats of length `Action Space Size`; if the action space type is discrete, the output must be a one-dimensional tensor of ints of length 1.
* `Graph Placeholder` : If your graph takes additional inputs that are fixed (for example, a noise level), you can specify them here. Note that in your graph, these must correspond to one-dimensional tensors of int or float of size 1.
  * `Name` : Corresponds to the name of the placeholder.
  * `Value Type` : Either Integer or Floating Point.
  * `Min Value` and `Max Value` : Specify the range of the value here. The value will be sampled from the uniform distribution ranging from `Min Value` to `Max Value` inclusive.
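As referenced above, if you built your own graph and are unsure of the exact node names to enter, one possible sketch with the TensorFlow 1.x Python API lists them from the frozen graph. The `.bytes` path follows the earlier cats example and is only illustrative, and the file is assumed to be a serialized `GraphDef`:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Hypothetical path to a trained graph exported as a .bytes file.
GRAPH_PATH = "python/models/cob_1/CatsOnBicycles_cob_1.bytes"

graph_def = tf.GraphDef()
with open(GRAPH_PATH, "rb") as f:
    graph_def.ParseFromString(f.read())

# Print every node's op type and name; placeholder nodes are the graph
# inputs, and their names are what the Internal Brain properties expect.
for node in graph_def.node:
    print(node.op, node.name)
```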
# Heuristic Brain

The **Heuristic** brain type allows you to hand code an agent's decision making process. A Heuristic brain requires an implementation of the Decision interface to which it delegates the decision making process.

When you set the **Brain Type** property of a Brain to **Heuristic**, you must add a component implementing the Decision interface to the same GameObject as the Brain.

## Implementing the Decision interface

When creating your Decision class, extend MonoBehaviour (so you can use the class as a Unity component) and implement the Decision interface.

    using UnityEngine;

    public class HeuristicLogic : MonoBehaviour, Decision
    {
        // ...
    }

The Decision interface defines two methods, `Decide()` and `MakeMemory()`.

The `Decide()` method receives an agent's current state, consisting of the agent's observations, reward, memory and other aspects of the agent's state, and must return an array containing the action that the agent should take. The format of the returned action array depends on the **Vector Action Space Type**. When using a **Continuous** action space, the action array is just a float array with a length equal to the **Vector Action Space Size** setting. When using a **Discrete** action space, the array contains just a single value. In the discrete action space, the **Space Size** value defines the number of discrete values that your `Decide()` function can return, which don't need to be consecutive integers.

The `MakeMemory()` function allows you to pass data forward to the next iteration of an agent's decision making process. The array you return from `MakeMemory()` is passed to the `Decide()` function in the next iteration. You can use the memory to allow the agent's decision process to take past actions and observations into account when making the current decision. If your heuristic logic does not require memory, just return an empty array.
# Player Brain

The **Player** brain type allows you to control an agent using keyboard commands. You can use Player brains to control a "teacher" agent that trains other agents during [imitation learning](Training-Imitation-Learning.md). You can also use Player brains to test your agents and environment before changing their brain types to **External** and running the training process.

## Player Brain properties

The **Player** brain properties allow you to assign one or more keyboard keys to each action and a unique value to send when a key is pressed.

![Player Brain Inspector](images/player_brain.png)

Note the differences between the discrete and continuous action spaces. When a brain uses the discrete action space, you can send one integer value as the action per step. In contrast, when a brain uses the continuous action space you can send any number of floating point values (up to the **Vector Action Space Size** setting).

| **Property** | | **Description** |
| :-- | :-- | :-- |
| **Continuous Player Actions** | | The mapping for the continuous vector action space. Shown when the action space is **Continuous**. |
| | **Size** | The number of key commands defined. You can assign more than one command to the same action index in order to send different values for that action. (If you press both keys at the same time, deterministic results are not guaranteed.) |
| | **Element 0–N** | The mapping of keys to action values. |
| | **Key** | The key on the keyboard. |
| | **Index** | The element of the agent's action vector to set when this key is pressed. The index value cannot exceed the size of the Action Space (minus 1, since it is an array index). |
| | **Value** | The value to send to the agent as its action for the specified index when the mapped key is pressed. All other members of the action vector are set to 0. |
| **Discrete Player Actions** | | The mapping for the discrete vector action space. Shown when the action space is **Discrete**. |
| | **Default Action** | The value to send when no keys are pressed. |
| | **Size** | The number of key commands defined. |
| | **Element 0–N** | The mapping of keys to action values. |
| | **Key** | The key on the keyboard. |
| | **Value** | The value to send to the agent as its action when the mapped key is pressed. |

For more information about the Unity input system, see [Input](https://docs.unity3d.com/ScriptReference/Input.html).