This walkthrough uses the **3D Balance Ball** environment. 3D Balance Ball
contains a number of platforms and balls (which are all copies of each other).
Each platform tries to keep its ball from falling by rotating either
horizontally or vertically. In this environment, a platform is an **agent**
that receives a reward for every step that it balances the ball. An agent is
also penalized with a negative reward for dropping the ball.

Each agent's action takes the form of a vector of values. The **Continuous**
action vector space is an array of numbers that can vary continuously; what each
element of the vector means is defined by the agent logic (the training process
just learns what values are better given particular state observations based on
the rewards received when it tries different values). For example, an element
might represent a force or torque applied to a `Rigidbody` in the agent. The
**Discrete** action vector space defines its
actions as a table. A specific action given to the agent is an index into
this table.
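
To make the two action vector space types concrete, here is a small illustrative
sketch in plain Python (not the ML-Agents API; the element meanings and table
entries below are hypothetical):

```python
import numpy as np

# Continuous action space: the action is a vector of real numbers. What each
# element means is up to the agent logic -- here, hypothetically, torques
# applied around two axes of the platform.
continuous_action = np.array([0.7, -0.2])
torque_x, torque_z = continuous_action

# Discrete action space: the action is a single index into a table of choices.
action_table = ["tilt_x_positive", "tilt_x_negative",
                "tilt_z_positive", "tilt_z_negative"]
discrete_action = 2
print(action_table[discrete_action])  # -> tilt_z_positive
```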
The agents are trained using a reinforcement learning technique called Proximal
Policy Optimization (PPO). OpenAI has a
[blog post](https://blog.openai.com/openai-baselines-ppo/) explaining it.
In order to train the agents within the Ball Balance environment:

1. Open the `python/PPO.ipynb` notebook from Jupyter.
2. Set `env_name` to the name of the environment file you built earlier.
3. (optional) In order to get the best results quickly, set `max_steps` to
   50000, set `buffer_size` to 5000, and set `batch_size` to 512. For this
   exercise, this will train the model in approximately 5-10 minutes.
4. (optional) Set `run_path` to a directory of your choice. When using
   TensorBoard to observe the training statistics, it helps to set this to a
   sequential value for each individual run.
5. Run all cells of the notebook with the exception of the last one, under
   "Export the trained Tensorflow graph."

Alternatively, you can train from the command line using the Python package
directly. We have provided a convenient Python wrapper script called `learn.py`
which accepts arguments used to configure both training and inference phases.
We pass this script the path of the environment executable that we just built.
Optionally, we can use `run_id` to identify the experiment and create a folder
where the model and summary statistics are stored. When using TensorBoard to
observe the training statistics, it helps to set `run_id` to a sequential value
for each individual run.
To summarize, go to your command line, enter the `ml-agents` directory, and run
the `learn.py` script, pointing it at the environment executable you built.
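
The exact invocation depends on your version of the toolkit; as a sketch,
assuming `learn.py` lives in the `python` sub-directory and accepts the
environment name along with `--run-id` and `--train` options (run it with
`--help` to confirm the options for your copy), the command looks something
like:

```
python3 python/learn.py <env_name> --run-id=<run-identifier> --train
```

Here `<env_name>` is the path to the environment executable you built, and
`<run-identifier>` is the `run_id` string described above.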
For a complete walk-through of the PPO training process, refer to
[Getting Started with the Balance Ball Environment](Getting-Started-with-Balance-Ball.md).
## Best Practices when training with PPO
### Hyperparameters
#### Batch Size
`batch_size` corresponds to how many experiences are used for each gradient descent update. **This should always be a fraction of the `buffer_size`**. If you are using a continuous action space, this value should be large (on the order of 1000s). If you are using a discrete action space, this value should be smaller (on the order of 10s).
Typical Range (Continuous): `512` - `5120`
Typical Range (Discrete): `32` - `512`

#### Buffer Size

`buffer_size` corresponds to how many experiences (agent observations, actions and rewards obtained) should be collected before we do any learning or updating of the model. **This should be a multiple of `batch_size`**. Typically, a larger `buffer_size` corresponds to more stable training updates.

Typical Range: `2048` - `409600`

#### Beta (Used only in Discrete Control)

`beta` corresponds to the strength of the entropy regularization, which makes the policy "more random." This ensures that discrete action space agents properly explore during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.

Typical Range: `1e-4` - `1e-2`
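
Concretely, `beta` scales an entropy bonus on the policy: the training loss includes a term of roughly the form `- beta * entropy(policy)`, so a larger `beta` rewards keeping the action distribution closer to uniform. A small illustrative sketch (plain Python, not ML-Agents code):

```python
import numpy as np

def entropy(action_probs):
    """Shannon entropy of a discrete action distribution."""
    p = np.asarray(action_probs, dtype=float)
    return -np.sum(p * np.log(p + 1e-10))

# A uniform policy has maximum entropy; a near-deterministic one has low entropy.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.386 (very random / exploratory)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.168 (almost deterministic)

# With entropy regularization, the loss looks roughly like:
#   loss = policy_loss - beta * entropy(policy)
# so increasing beta pushes training toward the higher-entropy (more random) policy.
```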
#### Epsilon
`epsilon` corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process.
Typical Range: `0.1` - `0.3`
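
For intuition, `epsilon` is the clipping threshold in PPO's surrogate objective: the ratio between the new and old policy probabilities is clipped to `[1 - epsilon, 1 + epsilon]` before being weighted by the advantage, so updates that move the policy too far gain nothing extra. A minimal illustrative sketch (not ML-Agents code):

```python
import numpy as np

def clipped_surrogate(prob_ratio, advantage, epsilon=0.2):
    """PPO clipped objective for a single experience:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = prob_ratio * advantage
    clipped = np.clip(prob_ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# A step that changes the action probability by 50% is treated as if it only
# changed by 20% (the epsilon threshold), which keeps updates stable.
print(clipped_surrogate(prob_ratio=1.5, advantage=1.0))   # -> 1.2 (clipped)
print(clipped_surrogate(prob_ratio=1.1, advantage=1.0))   # -> 1.1 (within threshold)
```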

#### Number of Epochs

`num_epoch` is the number of passes through the experience buffer during gradient descent. The larger the `batch_size`, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning.

Typical Range: `3` - `10`

#### Hidden Units

`hidden_units` corresponds to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger.

Typical Range: `32` - `512`
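
As a sanity check on how `buffer_size`, `batch_size`, and `num_epoch` interact, the number of gradient descent updates performed each time the buffer fills is roughly `num_epoch * buffer_size / batch_size` (assuming, as is typical for PPO implementations, that each epoch is one pass over the buffer in minibatches of `batch_size`). An illustrative calculation:

```python
buffer_size = 2048   # experiences collected before any updates happen
batch_size  = 512    # a fraction of buffer_size (continuous control)
num_epoch   = 5      # passes over the buffer per round of updates

updates_per_buffer = num_epoch * buffer_size // batch_size
print(updates_per_buffer)  # -> 20 gradient descent updates per filled buffer
```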

#### Learning Rate

`learning_rate` corresponds to the strength of each gradient descent update step. It should typically be decreased if training is unstable and the reward does not consistently increase.

Typical Range: `1e-5` - `1e-3`

#### Time Horizon

`time_horizon` corresponds to how many steps of experience are collected per-agent before being added to the experience buffer. A longer time horizon gives a less biased, but higher-variance, estimate of the expected reward; a shorter one gives a more biased, but more consistent, estimate.
#### Max Steps
`max_steps` corresponds to how many steps of the simulation (multiplied by frame-skip) are run during the training process. This value should be increased for more complex problems.
Typical Range: `5e5` - `1e7`
#### Normalize

`normalize` corresponds to whether normalization is applied to the vector observation inputs, based on the running average and variance of the observations. Normalization can be helpful in complex continuous control problems, but may be harmful with simpler discrete control problems.

#### Number of Layers

`num_layers` corresponds to how many hidden layers are present in the neural network. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems.

Typical range: `1` - `3`
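
To tie the section together, here is a minimal sketch of one self-consistent set of values drawn from the typical ranges above, written as plain Python assignments (for example, as they might be set in the hyperparameter cell of `PPO.ipynb`; the variable names follow the parameters described here, and the exact way they are supplied depends on your ML-Agents version):

```python
# Illustrative picks for a continuous-control problem such as 3D Balance Ball.
# These are example values inside the typical ranges above, not tuned settings.
max_steps     = 5e5     # simulation steps (times frame-skip) to train for
buffer_size   = 2048    # experiences to collect before each round of updates
batch_size    = 512     # minibatch size; a fraction of buffer_size
num_epoch     = 5       # passes over the buffer per round of updates
time_horizon  = 64      # steps collected per agent before adding to the buffer
learning_rate = 3e-4    # strength of each gradient descent step
hidden_units  = 64      # units per fully connected layer
num_layers    = 2       # hidden layers in the network
beta          = 1e-3    # entropy regularization strength (discrete control)
epsilon       = 0.2     # PPO policy-update clipping threshold
normalize     = True    # normalize vector observations
```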