
Address comments in docs

/develop/sac-apex
Ervin Teng, 4 years ago
Current commit: 8b52a2d0
2 files changed, 6 insertions(+), 8 deletions(-)
  1. docs/Training-SAC.md (2 changes)
  2. ml-agents/mlagents/trainers/sac/trainer.py (12 changes)

docs/Training-SAC.md (2 changes)

### Steps Per Update
- `steps_per_update` corresponds to the number agent steps (actions) taken for each mini-batch sampled and used during training. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
+ `steps_per_update` corresponds to the number of agent steps (actions) taken for each mini-batch sampled and used during training. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
  replay buffer, and using this mini batch to update the models. Typically, this should be greater than 1.
  However, to imitate the training procedure in certain papers (e.g.
  [Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)),
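
To make the ratio concrete, here is a minimal sketch of the `steps_per_update` bookkeeping described above. It is not the ml-agents trainer code; the values and the `collect_agent_steps` helper are illustrative stand-ins.

```python
# Minimal sketch: one mini-batch update per `steps_per_update` agent steps.
steps_per_update = 10.0   # hypothetical setting: 10 agent steps per update
batch_size = 256          # hypothetical mini-batch size drawn from the replay buffer

total_agent_steps = 0
updates_done = 0

def collect_agent_steps() -> int:
    # Placeholder for stepping the environment; returns the number of new agent steps.
    return 20

for _ in range(5):
    total_agent_steps += collect_agent_steps()
    # Keep updating until one update has been run per `steps_per_update` agent steps.
    while updates_done < total_agent_steps / steps_per_update:
        # A real trainer would sample `batch_size` experiences from the replay
        # buffer here and run one gradient update (omitted in this sketch).
        updates_done += 1

print(total_agent_steps, updates_done)  # 100 agent steps -> 10 updates
```

Setting `steps_per_update` below 1 would instead run several updates per agent step, which is how the papers cited above train.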

ml-agents/mlagents/trainers/sac/trainer.py (12 changes)

@timed
def _update_policy(self) -> None:
    """
-   Update the SAC policy and reward signals until the steps_per_update ratio
-   is met.
+   Update the SAC policy and reward signals. The reward signal generators are updated using different mini batches.
+   By default we imitate http://arxiv.org/abs/1809.02925 and similar papers, where the policy is updated
+   N times, then the reward signals are updated N times.
    """
    self.update_sac_policy()
    self.update_reward_signals()

def update_sac_policy(self) -> None:
    """
-   Uses demonstration_buffer to update the policy.
-   The reward signal generators are updated using different mini batches.
-   If we want to imitate http://arxiv.org/abs/1809.02925 and similar papers, where the policy is updated
-   N times, then the reward signals are updated N times, then reward_signal_updates_per_train
-   is greater than 1 and the reward signals are not updated in parallel.
+   Uses update_buffer to update the policy. We sample the update_buffer and update
+   until the steps_per_update ratio is met.
    """
    self.cumulative_returns_since_policy_update.clear()
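
As a rough illustration of the behaviour the new `update_sac_policy` docstring describes (sample the update buffer and update until the `steps_per_update` ratio is met), a simplified loop might look like the sketch below. This is not the actual ml-agents implementation; `sample_mini_batch`, `apply_sac_update`, `update_steps`, and the `hyperparameters` attribute path are hypothetical names used only for illustration.

```python
def update_sac_policy_sketch(trainer) -> None:
    """Illustrative only: keep updating until one SAC update has been
    performed per `steps_per_update` agent steps observed so far."""
    batch_size = trainer.hyperparameters.batch_size              # hypothetical attribute
    steps_per_update = trainer.hyperparameters.steps_per_update  # hypothetical attribute

    while trainer.update_steps < trainer.get_step / steps_per_update:
        if trainer.update_buffer.num_experiences < batch_size:
            # Not enough experience collected yet to fill a mini-batch.
            break
        mini_batch = trainer.update_buffer.sample_mini_batch(batch_size)  # hypothetical helper
        trainer.apply_sac_update(mini_batch)                              # hypothetical helper
        trainer.update_steps += 1
```

The reward-signal updates follow the same ratio-driven pattern but draw their own mini batches, which is why the commit splits `_update_policy` into `update_sac_policy` and `update_reward_signals`.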
