This walkthrough uses the **3D Balance Ball** environment. 3D Balance Ball
contains a number of platforms and balls (which are all copies of each other).
Each platform tries to keep its ball from falling by rotating either
horizontally or vertically. In this environment, a platform is an **agent**
that receives a reward for every step that it balances the ball. An agent is
also penalized with a negative reward for dropping the ball.

Each agent's action takes the form of a vector of values. The **Continuous**
action vector space is an array of numbers that can vary continuously; what each
element of the vector means is defined by the agent logic (the training process
just learns what values are better given particular state observations based on
the rewards received when it tries different values). For example, an element
might represent a force or torque applied to a `Rigidbody` in the agent. The
**Discrete** action vector space defines its
actions as a table. A specific action given to the agent is an index into
this table.
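
To make the two action vector space types concrete, here is a small illustrative
sketch in plain Python (not the ML-Agents API; the element meanings and table
entries below are hypothetical):

```python
import numpy as np

# Continuous action space: the action is a vector of real numbers. What each
# element means is up to the agent logic -- here, hypothetically, torques
# applied around two axes of the platform.
continuous_action = np.array([0.7, -0.2])
torque_x, torque_z = continuous_action

# Discrete action space: the action is a single index into a table of choices.
action_table = ["tilt_x_positive", "tilt_x_negative",
                "tilt_z_positive", "tilt_z_negative"]
discrete_action = 2
print(action_table[discrete_action])  # -> tilt_z_positive
```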
The agents are trained using a reinforcement learning technique called Proximal
Policy Optimization (PPO). OpenAI has a
[blog post](https://blog.openai.com/openai-baselines-ppo/) explaining it.
In order to train the agents within the Ball Balance environment:

1. Open the `python/PPO.ipynb` notebook from Jupyter.
2. Set `env_name` to the name of the environment file you built earlier.
3. (optional) In order to get the best results quickly, set `max_steps` to
   50000, set `buffer_size` to 5000, and set `batch_size` to 512. For this
   exercise, this will train the model in approximately 5-10 minutes.
4. (optional) Set `run_path` to a directory of your choice. When using
   TensorBoard to observe the training statistics, it helps to set this to a
   sequential value for each individual run.
5. Run all cells of the notebook with the exception of the last one, under
   "Export the trained Tensorflow graph."

Alternatively, you can train from the command line using the Python package
directly. We have provided a convenient Python wrapper script called `learn.py`
which accepts arguments used to configure both training and inference phases.
We pass this script the path of the environment executable that we just built.
Optionally, we can use `run_id` to identify the experiment and create a folder
where the model and summary statistics are stored. When using TensorBoard to
observe the training statistics, it helps to set `run_id` to a sequential value
for each individual run.
To summarize, go to your command line, enter the `ml-agents` directory, and run
the `learn.py` script, pointing it at the environment executable you built.
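
The exact invocation depends on your version of the toolkit; as a sketch,
assuming `learn.py` lives in the `python` sub-directory and accepts the
environment name along with `--run-id` and `--train` options (run it with
`--help` to confirm the options for your copy), the command looks something
like:

```
python3 python/learn.py <env_name> --run-id=<run-identifier> --train
```

Here `<env_name>` is the path to the environment executable you built, and
`<run-identifier>` is the `run_id` string described above.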
For a complete walk-through of the PPO training process, refer to
[Getting Started with the Balance Ball Environment](Getting-Started-with-Balance-Ball.md).
## Best Practices when training with PPO
### Hyperparameters
#### Batch Size
`batch_size` corresponds to how many experiences are used for each gradient descent update. **This should always be a fraction of the `buffer_size`**. If you are using a continuous action space, this value should be large (on the order of 1000s). If you are using a discrete action space, this value should be smaller (on the order of 10s).
Typical Range (Continuous): `512` - `5120`
Typical Range (Discrete): `32` - `512`

#### Buffer Size

`buffer_size` corresponds to how many experiences (agent observations, actions and rewards obtained) should be collected before we do any learning or updating of the model. **This should be a multiple of `batch_size`**. Typically, a larger `buffer_size` corresponds to more stable training updates.

Typical Range: `2048` - `409600`

#### Beta (Used only in Discrete Control)

`beta` corresponds to the strength of the entropy regularization, which makes the policy "more random." This ensures that discrete action space agents properly explore during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.

Typical Range: `1e-4` - `1e-2`
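
Concretely, `beta` scales an entropy bonus on the policy: the training loss includes a term of roughly the form `- beta * entropy(policy)`, so a larger `beta` rewards keeping the action distribution closer to uniform. A small illustrative sketch (plain Python, not ML-Agents code):

```python
import numpy as np

def entropy(action_probs):
    """Shannon entropy of a discrete action distribution."""
    p = np.asarray(action_probs, dtype=float)
    return -np.sum(p * np.log(p + 1e-10))

# A uniform policy has maximum entropy; a near-deterministic one has low entropy.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.386 (very random / exploratory)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.168 (almost deterministic)

# With entropy regularization, the loss looks roughly like:
#   loss = policy_loss - beta * entropy(policy)
# so increasing beta pushes training toward the higher-entropy (more random) policy.
```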
#### Epsilon
`epsilon` corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process.
Typical Range: `0.1` - `0.3`
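
For intuition, `epsilon` is the clipping threshold in PPO's surrogate objective: the ratio between the new and old policy probabilities is clipped to `[1 - epsilon, 1 + epsilon]` before being weighted by the advantage, so updates that move the policy too far gain nothing extra. A minimal illustrative sketch (not ML-Agents code):

```python
import numpy as np

def clipped_surrogate(prob_ratio, advantage, epsilon=0.2):
    """PPO clipped objective for a single experience:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = prob_ratio * advantage
    clipped = np.clip(prob_ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# A step that changes the action probability by 50% is treated as if it only
# changed by 20% (the epsilon threshold), which keeps updates stable.
print(clipped_surrogate(prob_ratio=1.5, advantage=1.0))   # -> 1.2 (clipped)
print(clipped_surrogate(prob_ratio=1.1, advantage=1.0))   # -> 1.1 (within threshold)
```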

#### Number of Epochs

`num_epoch` is the number of passes through the experience buffer during gradient descent. The larger the `batch_size`, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning.

Typical Range: `3` - `10`

#### Hidden Units

`hidden_units` corresponds to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger.

Typical Range: `32` - `512`
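
As a sanity check on how `buffer_size`, `batch_size`, and `num_epoch` interact, the number of gradient descent updates performed each time the buffer fills is roughly `num_epoch * buffer_size / batch_size` (assuming, as is typical for PPO implementations, that each epoch is one pass over the buffer in minibatches of `batch_size`). An illustrative calculation:

```python
buffer_size = 2048   # experiences collected before any updates happen
batch_size  = 512    # a fraction of buffer_size (continuous control)
num_epoch   = 5      # passes over the buffer per round of updates

updates_per_buffer = num_epoch * buffer_size // batch_size
print(updates_per_buffer)  # -> 20 gradient descent updates per filled buffer
```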

#### Learning Rate

`learning_rate` corresponds to the strength of each gradient descent update step. It should typically be decreased if training is unstable and the reward does not consistently increase.

Typical Range: `1e-5` - `1e-3`

#### Time Horizon

`time_horizon` corresponds to how many steps of experience are collected per-agent before being added to the experience buffer. A longer time horizon gives a less biased, but higher-variance, estimate of the expected reward; a shorter one gives a more biased, but more consistent, estimate.
#### Max Steps
`max_steps` corresponds to how many steps of the simulation (multiplied by frame-skip) are run during the training process. This value should be increased for more complex problems.
Typical Range: `5e5` - `1e7`
#### Normalize

`normalize` corresponds to whether normalization is applied to the vector observation inputs, based on the running average and variance of the observations. Normalization can be helpful in complex continuous control problems, but may be harmful with simpler discrete control problems.

#### Number of Layers

`num_layers` corresponds to how many hidden layers are present in the neural network. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems.

Typical range: `1` - `3`
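
To tie the section together, here is a minimal sketch of one self-consistent set of values drawn from the typical ranges above, written as plain Python assignments (for example, as they might be set in the hyperparameter cell of `PPO.ipynb`; the variable names follow the parameters described here, and the exact way they are supplied depends on your ML-Agents version):

```python
# Illustrative picks for a continuous-control problem such as 3D Balance Ball.
# These are example values inside the typical ranges above, not tuned settings.
max_steps     = 5e5     # simulation steps (times frame-skip) to train for
buffer_size   = 2048    # experiences to collect before each round of updates
batch_size    = 512     # minibatch size; a fraction of buffer_size
num_epoch     = 5       # passes over the buffer per round of updates
time_horizon  = 64      # steps collected per agent before adding to the buffer
learning_rate = 3e-4    # strength of each gradient descent step
hidden_units  = 64      # units per fully connected layer
num_layers    = 2       # hidden layers in the network
beta          = 1e-3    # entropy regularization strength (discrete control)
epsilon       = 0.2     # PPO policy-update clipping threshold
normalize     = True    # normalize vector observations
```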