wip

7 years ago · e37b13e1
--- a/docs/Learning-Environment-Create-New.md
+++ b/docs/Learning-Environment-Create-New.md
    public float speed = 10;
    private float previousDistance = float.MaxValue;
    
-    public override void AgentAct(float[] action)
+    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Rewards
        float distanceToTarget = Vector3.Distance(this.transform.position, 

        // Actions, size = 2
        Vector3 controlSignal = Vector3.zero;
-        controlSignal.x = Mathf.Clamp(action[0], -1, 1);
-        controlSignal.z = Mathf.Clamp(action[1], -1, 1);
+        controlSignal.x = Mathf.Clamp(vectorAction[0], -1, 1);
+        controlSignal.z = Mathf.Clamp(vectorAction[1], -1, 1);
        rBody.AddForce(controlSignal * speed);
     }

--- a/docs/Learning-Environment-Design-Brains.md
+++ b/docs/Learning-Environment-Design-Brains.md

 The Brain encapsulates the decision making process. Brain objects must be children of the Academy in the Unity scene hierarchy. Every Agent must be assigned a Brain, but you can use the same Brain with more than one Agent. 

-Use the Brain class directly, rather than a subclass. Brain behavior is determined by the brain type. During training, set your agent's brain type to **External**. To use the trained model, import the model file into the Unity project and change the brain type to **Internal**. You can extend the CoreBrain class to create different brain types if the four built-in types don't do what you need.
+Use the Brain class directly, rather than a subclass. Brain behavior is determined by the **Brain Type**. ML-Agents defines four Brain Types:
+
+* [External](Learning-Environment-External-Brains.md) – The **External** and **Internal** types typically work together; set **External** when training your agents. You can also use the **External** brain to communicate with a Python script via the Python `UnityEnvironment` class included in the Python portion of the ML-Agents SDK.
+* [Internal](Learning-Environment-Internal-Brains.md) – Set **Internal** to make use of a trained model.
+* [Heuristic](Learning-Environment-Heuristic-Brains.md) – Set **Heuristic** to hand-code the agent's logic by extending the Decision class.
+* [Player](Learning-Environment-Player-Brains.md) – Set **Player** to map keyboard keys to agent actions, which can be useful to test your agent code.
+
+Each of these types is an implementation of the CoreBrain interface. You can implement the CoreBrain interface yourself to create a different brain type if the four built-in types don't do what you need.
+
+During training, set your agent's brain type to **External**. To use the trained model, import the model file into the Unity project and change the brain type to **Internal**. 
+
+The Brain Inspector window in the Unity Editor displays the properties assigned to a Brain component:

 ![Brain Inspector](images/brain.png)

 values (in _Discrete_ action space).
 		* `Action Descriptions` - A list of strings used to name the available actions for the Brain.
 * `Type of Brain` - Describes how the Brain will decide actions.
-    * `External` - Actions are decided using Python API.
+    * `External` - Actions are decided by an external process, such as the PPO training process.
-    * `Heuristic` - Actions are decided using custom `Decision` script, which should be attached to the Brain game object.
+    * `Heuristic` - Actions are decided using a custom `Decision` script, which must be attached to the Brain game object.

 ### Internal Brain


 ### Player Brain

-![Player Brain Inspector](images/player_brain.png)
+The Brain property settings must match the Agent implementation. For example, if you specify that the Brain use the **Continuous State Space** and a **State Size** of 23, then the Agent must provide a state vector with 23 elements. See [Agents](Learning-Environment-Design-Agents.md) for more information about programming agents.
-If the action space is discrete, you must map input keys to their corresponding integer values. If the action space is continuous, you must map input keys to their corresponding indices and float values.

--- a/docs/Training-ML-Agents.md
+++ b/docs/Training-ML-Agents.md
 # Training ML-Agents

-This document is still to be written. When finished it will provide an overview of the training process. The main algorithm implemented currently is PPO, but there are various flavors including multi-agent training, curriculum training and imitation learning to consider.
+ML-Agents conducts training using an external Python training process. During training, this external process communicates with the Academy object in the Unity scene to generate a block of agent experiences. These experiences become the training set for a neural network used to optimize the agent's policy (which is essentially a mathematical function mapping observations to actions). In reinforcement learning, the neural network optimizes the policy by maximizing the expected rewards. In imitation learning, the neural network optimizes the policy to achieve the smallest difference between the actions chosen by the agent trainee and the imitated actions. 
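
The two objectives described above can be illustrated with a small sketch. This is not the ML-Agents training code: the function names are hypothetical, the `discount` factor is an assumption borrowed from the usual discounted-return formulation, and mean squared error is only one possible way to measure the "difference" between trainee and imitated actions.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], discount: float = 0.99) -> float:
    """Reinforcement learning view: sum of rewards, weighted by discount**t."""
    total = 0.0
    # Walk backwards so each step adds its reward to the discounted future return.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


def imitation_loss(trainee_actions: Sequence[float],
                   imitated_actions: Sequence[float]) -> float:
    """Imitation learning view: mean squared difference between action vectors."""
    if len(trainee_actions) != len(imitated_actions):
        raise ValueError("action vectors must have the same length")
    squared = [(a - b) ** 2 for a, b in zip(trainee_actions, imitated_actions)]
    return sum(squared) / len(squared)
```

In this sketch, reinforcement learning would try to make `discounted_return` as large as possible, while imitation learning would try to drive `imitation_loss` toward zero.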
+
+The output of the training process is a model file containing the optimized policy. This model file is a TensorFlow data graph containing the mathematical operations and the optimized weights selected during the training process. You can use the generated model file with the Internal Brain type in your Unity project to decide the best course of action for an agent. 
+
+Use the Python program `learn.py` to train your agents. This program can be found in the `python` directory of the ML-Agents SDK. The [configuration file](#training-config-file), `trainer_config.yaml`, specifies the hyperparameters used during training. You can edit this file with a text editor to add a specific configuration for each brain.
+
+For a broader overview of reinforcement learning, imitation learning and the ML-Agents training process, see [ML-Agents Overview](ML-Agents-Overview.md).
+
+## Training with learn.py
+
+Use the Python `learn.py` program to train agents. `learn.py` supports training with [reinforcement learning](Background-Machine-Learning.md#reinforcement-learning), [curriculum learning](Training-Curriculum-Learning.md), and [behavioural cloning imitation learning](link).
+
+Run `learn.py` from the command line to launch the training process. Use the command-line options and the `trainer_config.yaml` file to control training options.
+
+The basic command for training is:
+
+    python learn.py <env_file_path> --run-id=<run-identifier> --train
+
+where `<env_file_path>` is the path to your Unity executable containing the agents to be trained and `<run-identifier>` is an optional identifier you can use to identify the results of individual training runs.
+
+For example, suppose you have a project in Unity named "CatsOnBicycles" which contains agents ready to train. To perform the training:
+
+1. Build the project, making sure that you only include the training scene.
+2. Open a terminal or console window.
+3. Navigate to the ml-agents `python` folder.
+4. Run the following to launch the training process using the path to the Unity environment you built in step 1:
+
+        python learn.py ../../projects/Cats/CatsOnBicycles.app --run-id=cob_1 --train
+
+During a training session, the training program prints out and saves updates at regular intervals (specified by the `summary_freq` option). The saved statistics are grouped by the `run-id` value so you should assign a unique id to each training run if you plan to view the statistics. You can view these statistics using TensorBoard during or after training by running the following command (from the ML-Agents python directory):
+
+    tensorboard --logdir=summaries
+
+Then open [localhost:6006](http://localhost:6006) in a browser.
+ 
+When training is finished, you can find the saved model in the `python/models` folder under the assigned run-id. In the cats example, the path to the model would be `python/models/cob_1/CatsOnBicycles_cob_1.bytes`.
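
As a rough illustration of that naming pattern, the sketch below derives the expected model path from the environment path and the run-id. The helper name is hypothetical, and the pattern is inferred only from the cats example above, so treat it as an assumption rather than a guarantee of what `learn.py` writes.

```python
from pathlib import PurePosixPath


def expected_model_path(env_file_path: str, run_id: str = "ppo") -> str:
    """Guess where the trained model lands, following the cats example."""
    # "CatsOnBicycles.app" -> "CatsOnBicycles"
    env_name = PurePosixPath(env_file_path).stem
    model_file = f"{env_name}_{run_id}.bytes"
    return str(PurePosixPath("python") / "models" / run_id / model_file)


print(expected_model_path("../../projects/Cats/CatsOnBicycles.app", "cob_1"))
```

The default `run_id` of "ppo" mirrors the default id documented for the `--run-id` option.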
+
+While this example used the default training hyperparameters, you can edit the [trainer_config.yaml file](#training-config-file) with a text editor to set different values. 
+
+### Command-line training options
+
+In addition to passing the path of the Unity executable containing your training environment, you can set the following command-line options when invoking `learn.py`:
+
+* `--curriculum=<file>` – Specify a curriculum JSON file for defining the lessons for curriculum training. See [Curriculum Training](Training-Curriculum-Learning.md) for more information.
+* `--keep-checkpoints=<n>` – Specify the maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the `save-freq` option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is overwritten. Defaults to 5.
+* `--lesson=<n>` – Specify which lesson to start with when performing curriculum training. Defaults to 0.
+* `--load` – If set, the training code loads an already trained model to initialize the neural network before training. A trained model must exist. The learning code looks for the model in `python/models/<run-id>/` (which is also where it saves models at the end of training). When not set (the default), the neural network weights are randomly initialized and an existing model is not loaded.
+* `--run-id=<run-identifier>` – Specifies an identifier for each training run. This identifier is used to name the subdirectories in which the trained model and summary statistics are saved. The default id is "ppo". If you use TensorBoard to view the training statistics, always set a unique run-id for each training run. (Otherwise, the statistics for all runs with the same id are merged together.)
+* `--save-freq=<n>` – Specifies how often (in steps) to save the model during training. Defaults to 50000.
+* `--seed=<n>` – Specifies a number to use as a seed for the random number generator used by the training code.
+* `--slow` – Specify this option to run the Unity environment at normal game speed. The `--slow` mode uses the **Time Scale** and **Target Frame Rate** specified in the Academy's **Inference Configuration**. By default, training runs using the speeds specified in your Academy's **Training Configuration**. See [Academy Properties](Learning-Environment-Design-Academy.md#academy-properties).
+* `--train` – Specifies whether to train the model or only run in inference mode. When training, **always** use the `--train` option.
+* `--worker-id=<n>` – When you are running more than one training environment at the same time, assign each a unique worker-id number. The worker-id is added to the communication port opened between the current instance of `learn.py` and the ExternalCommunicator object in the Unity environment. Defaults to 0.
+* `--docker-target-name=<dt>` – The Docker Volume on which to store curriculum, executable and model files. See [Using Docker](Using-Docker.md).
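
If you launch training from a script, the options above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_learn_command`; it only formats the flags listed above and does not validate their values.

```python
from typing import List

# Options from the list above that take a value (written with underscores here).
VALUED_OPTIONS = {
    "curriculum", "keep_checkpoints", "lesson", "run_id",
    "save_freq", "seed", "worker_id", "docker_target_name",
}


def build_learn_command(env_file_path: str, *, train: bool = True,
                        load: bool = False, slow: bool = False,
                        **valued_options: object) -> List[str]:
    """Build an argument list such as ['python', 'learn.py', <env>, '--train']."""
    unknown = set(valued_options) - VALUED_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    command = ["python", "learn.py", env_file_path]
    for name, value in valued_options.items():
        # run_id="cob_1" becomes --run-id=cob_1
        command.append(f"--{name.replace('_', '-')}={value}")
    for flag, enabled in (("--load", load), ("--slow", slow), ("--train", train)):
        if enabled:
            command.append(flag)
    return command
```

The resulting list can be passed to `subprocess.run()` with the working directory set to the ml-agents `python` folder.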
+
+### Training config file
+
+The training config file, `trainer_config.yaml`, specifies the training method, the hyperparameters, and a few additional values to use during training. The file is divided into sections. The **default** section defines the default values for all the available settings. You can also add new sections to override these defaults to train specific Brains. Name each of these override sections after the GameObject containing the Brain component that should use these settings. (This GameObject will be a child of the Academy in your scene.) Sections for the example environments are included in the provided config file. `learn.py` finds the config file by name and looks for it in the same directory as itself.
+
+| **Setting** | **Description** |
+| :--               | :--                     |
+| batch_size | The number of experiences in each iteration of gradient descent.|
+| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model.|
+| beta | The strength of entropy regularization.|
+| brain_to_imitate | For imitation learning, the name of the GameObject containing the Brain component to imitate. |
+| buffer_size | The number of experiences to collect before updating the model. |
+| epsilon | Influences how rapidly the policy can evolve during training.|
+| gamma | The reward discount rate for the Generalized Advantage Estimator (GAE). |
+| hidden_units | The number of units in the hidden layers of the neural network. |
+| lambd | The regularization parameter. |
+| learning_rate | The initial learning rate for gradient descent. |
+| max_steps | The maximum number of simulation steps to run during a training session. |
+| memory_size |  |
+| normalize | Whether to automatically normalize observations. |
+| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. |
+| num_layers | The number of hidden layers in the neural network. |
+| sequence_length | |
+| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. |
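+| time_horizon | How many steps of experience to collect per agent before adding them to the experience buffer. |
+| trainer | The type of training to perform, such as reinforcement learning with PPO or imitation learning. |
+| use_recurrent | Whether to train using a recurrent neural network. |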
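
The relationship between the **default** section and a Brain-specific override section can be pictured as a dictionary merge. The sketch below uses plain Python dictionaries instead of parsing YAML; the section name `CatBrain` and all numeric values are hypothetical, and the actual lookup in the training code may differ.

```python
from typing import Any, Dict

# Stand-in for the parsed contents of trainer_config.yaml (values are illustrative).
example_config: Dict[str, Dict[str, Any]] = {
    "default": {"trainer": "ppo", "batch_size": 1024, "max_steps": 50000},
    "CatBrain": {"batch_size": 64},  # override section named after the Brain GameObject
}


def resolve_brain_settings(config: Dict[str, Dict[str, Any]],
                           brain_name: str) -> Dict[str, Any]:
    """Start from the defaults, then apply the overrides for one Brain."""
    settings = dict(config.get("default", {}))
    settings.update(config.get(brain_name, {}))
    return settings
```

A Brain without its own section simply receives a copy of the defaults.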
+
+## Observing Training Progress
+
+Once you start training using `learn.py` in the way described in the previous section, the `ml-agents` folder will 
+contain a `summaries` directory. In order to observe the training process 
+in more detail, you can use TensorBoard. From the command line, run:
+
+`tensorboard --logdir=summaries`
+
+Then navigate to `localhost:6006`.
+
+From TensorBoard, you will see the summary statistics:
+
+* Lesson - only interesting when performing
+[curriculum training](Training-Curriculum-Learning.md). 
+This is not used in the 3D Balance Ball environment. 
+* Cumulative Reward - The mean cumulative episode reward over all agents. 
+Should increase during a successful training session.
+* Entropy - How random the decisions of the model are. Should slowly decrease 
+during a successful training process. If it decreases too quickly, the `beta` 
+hyperparameter should be increased.
+* Episode Length - The mean length of each episode in the environment for all 
+agents.
+* Learning Rate - How large a step the training algorithm takes as it searches 
+for the optimal policy. Should decrease over time.
+* Policy Loss - The mean loss of the policy function update. Correlates to how
+much the policy (process for deciding actions) is changing. The magnitude of 
+this should decrease during a successful training session.
+* Value Estimate - The mean value estimate for all states visited by the agent. 
+Should increase during a successful training session.
+* Value Loss - The mean loss of the value function update. Correlates to how
+well the model is able to predict the value of each state. This should decrease
+during a successful training session.
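
To make two of these statistics more concrete, here is a small sketch of how a mean cumulative reward and an entropy value could be computed. It assumes a discrete action distribution and hypothetical function names; the values reported by the training code may be computed differently.

```python
import math
from typing import Sequence


def mean_cumulative_reward(episode_rewards_per_agent: Sequence[Sequence[float]]) -> float:
    """Average of each agent's summed episode reward."""
    totals = [sum(rewards) for rewards in episode_rewards_per_agent]
    return sum(totals) / len(totals)


def policy_entropy(action_probabilities: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # very random decisions
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic decisions
print(policy_entropy(uniform) > policy_entropy(peaked))
```

A uniform distribution has the highest entropy, which is why entropy is expected to fall as the policy becomes more decisive.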
+
+![Example TensorBoard Run](images/mlagents-TensorBoard.png)
--- a/docs/Learning-Environment-Design-CoreBrains.md
+++ b/docs/Learning-Environment-Design-CoreBrains.md
+# CoreBrain Interface
+
+The behavior of a Brain object depends on its **Brain Type** setting. Each of the supported types of Brain implements the CoreBrain interface. You can implement your own CoreBrain if none of the four included types does exactly what you want.
+
+The CoreBrain interface defines the following methods:
+
+    void SetBrain(Brain b);
+    void InitializeCoreBrain(Communicator communicator);
+    void DecideAction();
+    void OnInspector();
+
+Note that the name of your implementation must start with "CoreBrain" in order to add it to the list of Brain types in the Brain Inspector window. See [Adding a new Brain Type to the Brain Inspector](#adding-a-new-brain-type-to-the-brain-inspector).
+
+## SetBrain
+
+Use the `SetBrain()` function to store a reference to the parent Brain instance. `SetBrain()` is called before any of the other runtime CoreBrain functions, so you can use this Brain reference to access important properties of the parent Brain.
+
+    private Brain brain;
+    public void SetBrain(Brain b)
+    {
+        brain = b;
+    }
+
+## InitializeCoreBrain
+
+Use `InitializeCoreBrain()` to initialize your CoreBrain instance at runtime. Since `SetBrain()` has already been called, you can access the parent Brain properties. This function is also a good place to connect your brain to the ExternalCommunicator, if you want the CoreBrain implementation to communicate with an external process:
+
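+    private ExternalCommunicator extComms;
+    private float[] actionValues;                    // reusable action vector
+    private Dictionary<int, float[]> agentActions;   // agent key -> action vector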
+    public void InitializeCoreBrain(Communicator communicator)
+    {
+        actionValues = new float[brain.brainParameters.actionSize];
+        agentActions  = new Dictionary<int, float[]>();
+    
+        extComms = communicator as ExternalCommunicator;
+        if(extComms != null)
+        {
+            extComms.SubscribeBrain(brain);
+        }
+    }
+
+
+## DecideAction
+
+Use `DecideAction()` to determine the actions of any agents using this brain. The parent Brain passes a dictionary containing each Agent object and its corresponding AgentInfo struct. The AgentInfo struct provides all of the agent's observations and rewards.
+
+For each agent, you must construct a `float[]` array containing the action vector elements and add this array to a Dictionary using the same integer key used by that agent in `Brain.agents`. Send this agent-action dictionary to the Brain using `Brain.SendActions()`.
+
+    public void DecideAction()
+    {
+        float[] actionValues = new float[brain.brainParameters.actionSize];
+        for(int i = 0; i < actionValues.Length; i++)
+        {
+            // Set actionValues[i]...
+        }
+        var agentActions = new Dictionary<int, float[]>();
+        foreach (KeyValuePair<int, Agent> idAgent in brain.agents)
+        {
+            agentActions.Add(idAgent.Key, actionValues);
+        }
+        brain.SendActions(agentActions);
+    }
+
+Of course, _how_ you decide an agent's actions is a key implementation detail for a CoreBrain. For example, the CoreBrainPlayer, which maps key commands to action values, simply checks for key presses using the `Input.GetKey()` function and sets the mapped element of the action vector to the corresponding, preset value. CoreBrainPlayer does not need to use any observations or memories of the agent and, thus, is very simple.
+
+In contrast, CoreBrainInternal feeds an agent's observations and other variables collected by the `SendState()` function into the TensorFlow data graph and then applies the output vector from the trained neural network to the agent's action vector.
+
+To support the ExternalCommunicator broadcast function, you must send the `Brain.BrainInfo` object using the `ExternalCommunicator.giveBrainInfo()` function. In fact, if all you need to do is send the agents' observations to an external process, you can simply call this function:
+
+    public void DecideAction()
+    {
+        // Assumes extComms has been set by InitializeCoreBrain() function
+        if (extComms != null)
+        {
+            extComms.giveBrainInfo(brain);
+        }
+    }
+
+The ExternalCommunicator class takes care of collecting each agent's observations and sending them to the process.
+
+## OnInspector
+
+Use `OnInspector()` to implement a Unity property inspector for your CoreBrain implementation. If you do not provide an implementation, users of your CoreBrain will not be able to set any of its fields or properties in the Unity Editor Inspector window. See [Extending the editor](https://docs.unity3d.com/Manual/ExtendingTheEditor.html) and [EditorGUI](https://docs.unity3d.com/ScriptReference/EditorGUI.html) for more information about creating custom Inspector controls.
+
+## Adding a new Brain Type to the Brain Inspector
+
+For your CoreBrain implementation to appear in the list of Brain Types, you must add an entry to the Brain class' BrainType enum, which is defined in Brain.cs:
+
+    public enum BrainType
+    {
+        Player,
+        Heuristic,
+        External,
+        Internal
+    }
+
+When the Brain creates an instance of your CoreBrain, it appends the enum name to the string "CoreBrain". Thus, the class name for the Internal brain is `CoreBrainInternal`. If you created a class named `CoreBrainFuzzyLogic`, you would add an entry named `FuzzyLogic` to the BrainType enum.
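
The naming rule is simple string concatenation. The sketch below restates it in Python purely as an illustration; the helper names are hypothetical and the real lookup happens in the C# Brain class.

```python
CORE_BRAIN_PREFIX = "CoreBrain"
BRAIN_TYPES = ["Player", "Heuristic", "External", "Internal"]


def core_brain_class_name(brain_type: str) -> str:
    """Class name the Brain would look for, given a BrainType enum name."""
    return CORE_BRAIN_PREFIX + brain_type


def brain_type_for_class(class_name: str) -> str:
    """Enum entry you would need to add for a given CoreBrain class name."""
    if not class_name.startswith(CORE_BRAIN_PREFIX):
        raise ValueError("class name must start with 'CoreBrain'")
    return class_name[len(CORE_BRAIN_PREFIX):]


print([core_brain_class_name(t) for t in BRAIN_TYPES])
```

For a custom class such as `CoreBrainFuzzyLogic`, `brain_type_for_class` yields `FuzzyLogic`, the name to add to the enum.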
+
+## Example CoreBrain implementation
+
+Once you have determined that the existing CoreBrain implementations do not fill your needs, you can implement your own. Use `SendState()` to collect the observations from your agents and store them for use in `DecideAction()`.
--- a/docs/Learning-Environment-Design-Internal-Brains.md
+++ b/docs/Learning-Environment-Design-Internal-Brains.md
+# Internal Brain
+
+The Internal Brain type uses a [TensorFlow model](https://www.tensorflow.org/get_started/get_started_for_beginners#models_and_training) to make decisions. The Proximal Policy Optimization (PPO) algorithm included with the ML-Agents SDK produces a trained TensorFlow model that you can use with the Internal Brain type.
+
+A __model__ is a mathematical relationship mapping an agent's observations to its actions. TensorFlow is a software library for performing numerical computation through data flow graphs. A TensorFlow model, then, defines the mathematical relationship between your agent's observations and its actions using a TensorFlow data flow graph. 
+
+## Creating a graph model
+
+The PPO algorithm included with the ML-Agents SDK produces a TensorFlow graph model as the end result of the training process. See [Training with Proximal Policy Optimization](Training-PPO.md) for instructions on how to create and train a model using PPO.
+
+## Using a graph model
+
+To use a graph model:
+
+1. Select the Brain GameObject in the **Hierarchy** window of the Unity Editor. (The Brain GameObject must be a child of the Academy GameObject and must have a Brain component.)
+2. Set the **Brain Type** to **Internal**.
+
+    **Note:** In order to see the **Internal** Brain Type option, you must [enable TensorFlowSharp](link).  
+
+3. Import the `environment_run-id.bytes` file produced by the PPO training program. (Where `environment_run-id` is the name of the model file, which is constructed from the name of your Unity environment executable and the run-id value you assigned when running the training process.)
+
+    You can [import assets into Unity](https://docs.unity3d.com/Manual/ImportingAssets.html) in various ways. The easiest way is to simply drag the file into the **Project** window and drop it into an appropriate folder.
+    
+4. Once the `environment_run-id.bytes` file is imported, drag it from the **Project** window to the **Graph Model** field of the Brain component.
+
+If you are using a model produced by the ML-Agents PPO program, use the default values for the other Internal Brain parameters.
+
+## Internal Brain properties
+
+The default values of the TensorFlow graph parameters work with the model produced by the PPO training code in the ML-Agents SDK. To use a default PPO model, the only parameter that you need to set is the `Graph Model`, which must be set to the .bytes file containing the trained model itself. 
+
+![Internal Brain Inspector](images/internal_brain.png)
+
+
+   *  `Graph Model` : This must be the `bytes` file corresponding to the pretrained TensorFlow graph. (You must first drag this file into your Resources folder and then from the Resources folder into the inspector.)
+
+Only change the following Internal Brain properties if you have created your own TensorFlow model and are not using the ML-Agents PPO model:
+
+   *  `Graph Scope` : If you set a scope while training your TensorFlow model, all of your placeholder names will have a prefix. You must specify that prefix here.
+   *  `Batch Size Node Name` : If the batch size is one of the inputs of your graph, you must specify the name of the placeholder here. The brain will make the batch size equal to the number of agents connected to the brain automatically.
+   *  `State Node Name` : If your graph uses the state as an input, you must specify the name of the placeholder here.
+   *  `Recurrent Input Node Name` : If your graph uses a recurrent input / memory as input and outputs new recurrent input / memory, you must specify the name of the input placeholder here.
+   *  `Recurrent Output Node Name` : If your graph uses a recurrent input / memory as input and outputs new recurrent input / memory, you must specify the name of the output placeholder here.
+   * `Observation Placeholder Name` : If your graph uses observations as input, you must specify it here. Note that the number of observations is equal to the length of `Camera Resolutions` in the brain parameters.
+   * `Action Node Name` : Specify the name of the placeholder corresponding to the actions of the brain in your graph. If the action space type is continuous, the output must be a one-dimensional tensor of floats of length `Action Space Size`. If the action space type is discrete, the output must be a one-dimensional tensor of ints of length 1.
+   * `Graph Placeholder` : If your graph takes additional inputs that are fixed (example: noise level) you can specify them here. Note that in your graph, these must correspond to one dimensional tensors of int or float of size 1.
+     * `Name` : Corresponds to the name of the placeholder.
+     * `Value Type` : Either Integer or Floating Point.
+     * `Min Value` and `Max Value` : Specify the range of the value here. The value will be sampled from the uniform distribution ranging from `Min Value` to `Max Value` inclusive.
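
The `Graph Placeholder` sampling described above can be sketched as follows. This is an illustration only: the function name and the string labels for the value types are assumptions, and the Unity implementation may sample differently.

```python
import random


def sample_placeholder(value_type: str, min_value: float, max_value: float,
                       rng: random.Random = None):
    """Draw one value uniformly between min_value and max_value."""
    rng = rng or random.Random()
    if value_type == "Integer":
        # randint includes both end points.
        return rng.randint(int(min_value), int(max_value))
    if value_type == "Floating Point":
        return rng.uniform(min_value, max_value)
    raise ValueError(f"unsupported value type: {value_type}")


noise_level = sample_placeholder("Floating Point", 0.0, 0.5, random.Random(0))
print(0.0 <= noise_level <= 0.5)
```

Passing a seeded `random.Random` instance makes the sampled values reproducible across runs.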
+
--- a/docs/Learning-Environment-Design-Player-Brains.md
+++ b/docs/Learning-Environment-Design-Player-Brains.md
+# Player Brain
+
+
+## Player Brain properties
+
+![Player Brain Inspector](images/player_brain.png)
+
+If the action space is discrete, you must map input keys to their corresponding integer values. If the action space is continuous, you must map input keys to their corresponding indices and float values.
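
The two kinds of key mapping can be modeled as lookups from pressed keys to action values. The sketch below is a hypothetical Python model of that idea, not the Player Brain code; the key names and the default action of 0 when no mapped key is pressed are assumptions.

```python
from typing import Dict, List, Set, Tuple


def discrete_action(pressed_keys: Set[str], key_to_action: Dict[str, int],
                    default_action: int = 0) -> int:
    """Discrete space: the first mapped key that is pressed selects the action."""
    for key, action in key_to_action.items():
        if key in pressed_keys:
            return action
    return default_action


def continuous_action(pressed_keys: Set[str],
                      key_to_index_value: Dict[str, Tuple[int, float]],
                      action_size: int) -> List[float]:
    """Continuous space: each pressed key writes its value at its mapped index."""
    action = [0.0] * action_size
    for key, (index, value) in key_to_index_value.items():
        if key in pressed_keys:
            action[index] = value
    return action
```

For example, mapping `a` and `d` to index 0 with values -1 and 1, and `w` to index 1 with value 1, produces the two-element action vector when `a` and `w` are held.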