Updates best ppo practices

/tag-0.2.0
Commit 3d03390a · 1 file changed, 16 insertions, 17 deletions
docs/best-practices-ppo.md

### Batch Size
`batch_size` corresponds to how many experiences are used for each gradient descent update. This should always be a fraction of the `buffer_size`. If you are using a continuous action space, this value should be large (in the 1000s). If you are using a discrete action space, this value should be smaller (in the 10s).
Typical Range (Continuous): `512` - `5120`
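
As a rough illustration of the relationship to `buffer_size` (the numbers below are hypothetical picks, not recommended settings), each pass over the buffer performs `buffer_size / batch_size` gradient updates:

```python
# Hypothetical hyperparameter choices, shown only to illustrate the ratio.
configs = {
    "continuous": {"buffer_size": 20480, "batch_size": 2048},  # batches in the 1000s
    "discrete": {"buffer_size": 2048, "batch_size": 32},       # batches in the 10s
}

for name, cfg in configs.items():
    updates_per_pass = cfg["buffer_size"] // cfg["batch_size"]
    print(f"{name}: {updates_per_pass} gradient updates per pass over the buffer")
```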

### Beta (Used only in Discrete Control)
`beta` corresponds to the strength of the entropy regularization, which makes the policy "more random." This ensures that discrete action space agents properly explore during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase `beta`. If entropy drops too slowly, decrease `beta`.
Typical Range: `1e-4` - `1e-2`
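
For intuition only, here is a minimal numpy sketch of an entropy bonus weighted by `beta`. The real ML-Agents loss has additional terms, and the probabilities, loss value, and `beta` below are made-up numbers:

```python
import numpy as np

def entropy(action_probs):
    """Entropy of a categorical (discrete-action) policy; higher means more random."""
    return -np.sum(action_probs * np.log(action_probs + 1e-10), axis=-1)

beta = 1e-3                                 # entropy regularization strength
action_probs = np.array([[0.7, 0.2, 0.1]])  # hypothetical policy output for one state
policy_loss = 0.05                          # hypothetical surrogate loss value

# Subtracting beta * entropy rewards the policy for staying random, which keeps
# the agent exploring; a larger beta slows the decay of entropy during training.
total_loss = policy_loss - beta * np.mean(entropy(action_probs))
print(total_loss)
```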

### Buffer Size
`buffer_size` corresponds to how many experiences should be collected before any learning or updating of the model takes place. This should be a multiple of `batch_size`. Typically larger buffer sizes correspond to more stable training updates.

### Epsilon
`epsilon` corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process.
Typical Range: `0.1` - `0.3`
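
Roughly, `epsilon` is the clipping range of the PPO surrogate objective. A simplified numpy sketch, not the ML-Agents code; the ratio and advantage values are hypothetical:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one experience.

    ratio is new_policy_prob / old_policy_prob for the action that was taken.
    Clipping to [1 - epsilon, 1 + epsilon] caps how far one update can push the policy.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum keeps the update conservative: moving the policy beyond
    # the clipping band earns no additional objective value.
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 with epsilon = 0.2 is treated as 1.2, so the objective is 1.2.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```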

### Number of Epochs
`num_epoch` is the number of passes through the experience buffer during gradient descent. The larger the batch size, the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning.
Typical Range: `3` - `10`
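
The sketch below (not the actual trainer code) shows how `num_epoch`, `buffer_size`, and `batch_size` interact: each epoch reshuffles the buffer and performs one gradient step per minibatch.

```python
import numpy as np

def update_schedule(buffer, batch_size=512, num_epoch=3, seed=0):
    """Count the gradient steps implied by one buffer of experiences."""
    rng = np.random.default_rng(seed)
    steps = 0
    for _ in range(num_epoch):
        order = rng.permutation(len(buffer))          # reshuffle each epoch
        for start in range(0, len(buffer), batch_size):
            minibatch = order[start:start + batch_size]
            # gradient_step(minibatch) would go here; more epochs reuse the same
            # data more heavily, which speeds learning but risks instability.
            steps += 1
    return steps

buffer = list(range(2048))              # pretend buffer of 2048 experiences
print(update_schedule(buffer))          # 3 epochs x 4 minibatches = 12 gradient steps
```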

### Time Horizon
`time_horizon` corresponds to how many steps of experience to collect per-agent before adding them to the experience buffer. In cases where there are frequent rewards within an episode, or episodes are prohibitively large, this can be a smaller number. For most stable training however, this number should be large enough to capture all the important behavior within a sequence of an agent's actions.
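
To see why the horizon should cover the important behavior, here is a hedged sketch of how a rollout cut off at `time_horizon` can be bootstrapped with a value estimate. This is not the exact ML-Agents computation (which also uses advantage estimation), and the rewards and value below are made up:

```python
import numpy as np

def bootstrapped_returns(rewards, value_at_cutoff, gamma=0.99):
    """Discounted returns for a rollout truncated at time_horizon.

    When the horizon ends before the episode does, the value estimate of the
    final state stands in for every reward the agent would have seen afterwards.
    """
    returns = np.zeros(len(rewards))
    running = value_at_cutoff
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical 5-step slice of a longer episode, cut off with a predicted value of 2.0.
print(bootstrapped_returns([0.0, 0.0, 1.0, 0.0, 0.0], value_at_cutoff=2.0))
```
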
### Max Steps
`max_steps` corresponds to how many steps of the simulation (multiplied by frame-skip) are run during the training process. This value should be increased for more complex problems.
Typical Range: `5e5` - `1e7`
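
For a sense of scale, the number of simulation frames is `max_steps` times the frame-skip; the frame-skip value here is an assumed setting, purely for illustration:

```python
max_steps = 5e5                  # training steps
frame_skip = 4                   # assumed frame-skip setting, for illustration only
print(max_steps * frame_skip)    # 2e6 simulation frames
```
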
## Training Statistics

### Cumulative Reward
The general trend in reward should consistently increase over time. Small ups and downs are to be expected. Depending on the complexity of the task, a significant increase in reward may not present itself until millions of steps into the training process.

### Entropy
This corresponds to how random the decisions of a brain are. This should consistently decrease during training. If it decreases too soon or not at all, `beta` should be adjusted (when using discrete action space).

### Learning Rate

### Value Estimate
These values should increase with the reward. They correspond to how much future reward the agent predicts itself receiving at any given point.
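
For intuition, the value head is trained to predict something close to the discounted sum of future rewards from a state. A toy example with made-up rewards (the actual PPO target also involves advantage estimation):

```python
import numpy as np

gamma = 0.99                                          # discount factor
future_rewards = np.array([0.0, 0.0, 1.0, 0.0, 5.0])  # hypothetical rewards ahead
discounts = gamma ** np.arange(len(future_rewards))
value_target = np.sum(discounts * future_rewards)     # what a well-trained critic predicts
print(value_target)                                   # ~5.78
```
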
### Value Loss