
Add additional explanation for time horizon

/develop-generalizationTraining-TrainerController
Arthur Juliani, 7 years ago
Commit 36e95a95
1 changed file, 5 insertions(+), 2 deletions(-)

docs/best-practices-ppo.md
```diff
 ### Time Horizon
 `time_horizon` corresponds to how many steps of experience to collect per-agent before adding it to the experience buffer.
-In cases where there are frequent rewards within an episode, or episodes are prohibitively large, this can be a smaller number. For most stable training however, this number should be large enough to capture all the important behavior within a sequence of an agent's actions.
+When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state.
+As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and more biased, but less varied estimate (short time horizon).
+In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal.
+This number should be large enough to capture all the important behavior within a sequence of an agent's actions.
-Typical Range: `64` - `2048`
+Typical Range: `32` - `2048`
 ### Max Steps
```
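
The added text describes truncating a rollout at `time_horizon` and bootstrapping the remaining return with the critic's value estimate. Below is a minimal sketch of that idea; it is illustrative only, not ML-Agents' actual implementation, and the function name, the use of plain discounted returns rather than GAE, and the example numbers are assumptions.

```python
import numpy as np

def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns for a rollout that may be truncated at time_horizon.

    If the rollout was cut off before the episode ended, `bootstrap_value`
    is the critic's value estimate for the last observed state; for a
    rollout that ends together with the episode, it is 0.
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical 5-step rollout cut off mid-episode: the unseen tail of the
# return is replaced by the value estimate (more bias, less variance).
# A longer time_horizon leans on the estimate less (less bias, more variance).
rewards = [0.0, 0.0, 1.0, 0.0, 0.0]
value_of_last_state = 0.8
print(discounted_returns(rewards, bootstrap_value=value_of_last_state))
```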
