
Address comments in docs

/develop/sac-apex
Ervin Teng, 5 years ago
Commit 8b52a2d0
2 files changed, 6 insertions(+), 8 deletions(-)
  1. docs/Training-SAC.md (2 changes)
  2. ml-agents/mlagents/trainers/sac/trainer.py (12 changes)

docs/Training-SAC.md (2 changes)


### Steps Per Update
- `steps_per_update` corresponds to the number agent steps (actions) taken for each mini-batch sampled and used during training. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
+ `steps_per_update` corresponds to the number of agent steps (actions) taken for each mini-batch sampled and used during training. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
replay buffer, and using this mini-batch to update the models. Typically, this should be greater than 1.
However, to imitate the training procedure in certain papers (e.g.
[Kostrikov et al.](http://arxiv.org/abs/1809.02925), [Blondé et al.](http://arxiv.org/abs/1809.02064)),

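For a concrete sense of what this ratio implies, here is a small back-of-the-envelope example. It is plain Python arithmetic, not an ML-Agents configuration file, and the specific values are illustrative rather than recommendations:

```python
batch_size = 256        # transitions drawn per mini-batch
steps_per_update = 10   # agent steps collected per mini-batch update

agent_steps = 20_000    # total agent steps taken during training
num_updates = agent_steps // steps_per_update    # 2,000 mini-batch updates
transitions_drawn = num_updates * batch_size     # 512,000 (re-)sampled transitions from the replay buffer
print(num_updates, transitions_drawn)
```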
ml-agents/mlagents/trainers/sac/trainer.py (12 changes)


@timed
def _update_policy(self) -> None:
    """
-   Update the SAC policy and reward signals until the steps_per_update ratio
-   is met.
+   Update the SAC policy and reward signals. The reward signal generators are updated using different mini batches.
+   By default we imitate http://arxiv.org/abs/1809.02925 and similar papers, where the policy is updated
+   N times, then the reward signals are updated N times.
    """
    self.update_sac_policy()
    self.update_reward_signals()

def update_sac_policy(self) -> None:
    """
-   Uses demonstration_buffer to update the policy.
-   The reward signal generators are updated using different mini batches.
-   If we want to imitate http://arxiv.org/abs/1809.02925 and similar papers, where the policy is updated
-   N times, then the reward signals are updated N times, then reward_signal_updates_per_train
-   is greater than 1 and the reward signals are not updated in parallel.
+   Uses update_buffer to update the policy. We sample the update_buffer and update
+   until the steps_per_update ratio is met.
    """
    self.cumulative_returns_since_policy_update.clear()
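The docstrings above describe the control flow only in words: keep updating until the `steps_per_update` ratio is met, and update the policy N times before updating each reward signal N times on different mini-batches. Below is a minimal sketch of that schedule, using hypothetical names (`policy`, `reward_signals`, `replay_buffer`, `agent_step`, `updates_done`) rather than the actual SACTrainer attributes or methods:

```python
import random

def run_updates(policy, reward_signals, replay_buffer, agent_step, updates_done,
                steps_per_update=10, batch_size=128):
    """Sketch of the update schedule described in the docstrings above.

    Roughly one update is owed per `steps_per_update` agent steps collected: the
    policy is updated N times first, then each reward signal is updated N times
    on separately sampled mini-batches.
    """
    if len(replay_buffer) < batch_size:
        return updates_done

    # How many updates are owed to satisfy the steps_per_update ratio.
    num_updates = max(0, agent_step // steps_per_update - updates_done)

    # Policy updates: N mini-batches drawn from the replay buffer.
    for _ in range(num_updates):
        policy.update(random.sample(replay_buffer, batch_size))

    # Reward-signal updates: N different mini-batches per reward signal.
    for signal in reward_signals:
        for _ in range(num_updates):
            signal.update(random.sample(replay_buffer, batch_size))

    return updates_done + num_updates
```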
