Update steps_per_update documentation

Add constant
Tweak buffer max size
5 years ago · 817aab95
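
A rough illustration of the `steps_per_update` semantics that this commit documents: the number of agent steps taken per mini-batch update. This is a minimal sketch only, not the ML-Agents trainer code; the helper name `updates_due` and its arguments are hypothetical and exist purely for illustration.

```python
def updates_due(agent_steps: int, updates_done: int, steps_per_update: float) -> int:
    """Return how many mini-batch updates are still owed.

    Hypothetical helper (not part of ML-Agents): with `steps_per_update`
    agent steps per update, the total number of updates that should have
    run after `agent_steps` steps is agent_steps / steps_per_update.
    """
    if steps_per_update <= 0:
        raise ValueError("steps_per_update must be positive")
    target_updates = int(agent_steps / steps_per_update)
    # Never report a negative backlog if more updates ran than required.
    return max(0, target_updates - updates_done)


# One update per step (the default added in this commit).
print(updates_due(agent_steps=20, updates_done=0, steps_per_update=1))
# One update per 20 steps: fewer updates, less CPU time, lower sample efficiency.
print(updates_due(agent_steps=20, updates_done=0, steps_per_update=20))
# A value below 1 means several updates per step.
print(updates_due(agent_steps=10, updates_done=0, steps_per_update=0.5))
```

Under these assumptions, lowering `steps_per_update` raises the number of updates per collected step, which matches the trade-off described in the documentation change below.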
--- a/docs/Training-SAC.md
+++ b/docs/Training-SAC.md
 Curiosity reward, which can be used to encourage exploration in sparse extrinsic reward
 environments.

-#### Number of Updates for Reward Signal (Optional)
+#### Steps Per Update for Reward Signal (Optional)

 `reward_signal_steps_per_update` for the reward signals corresponds to the number of steps per mini batch sampled
 and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated.
 ### Steps Per Update

 `steps_per_update` corresponds to the number of agent steps (actions) taken for each mini-batch sampled and used during training. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience
-replay buffer, and using this mini batch to update the models. Typically, this should be greater than 1.
-However, to imitate the training procedure in certain papers (e.g.
-[Kostrikov et. al](http://arxiv.org/abs/1809.02925), [Blondé et. al](http://arxiv.org/abs/1809.02064)),
-we may want to update N times with different mini batches before grabbing additional samples.
-We can change `steps_per_update` to lower than 1 to accomplish this.
+replay buffer, and using this mini batch to update the models. Typically, this should be greater
+than 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train)
+but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example
+environments), setting `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps
+take 0.1 seconds or more), reducing `steps_per_update` may improve training speed.
+We can also set `steps_per_update` lower than 1 to update more often than once per step, though this is usually
+not necessary.
-Typical Range: `10` - `20`
+Typical Range: `1` - `20`

 ### Tau

--- a/ml-agents/mlagents/trainers/agent_processor.py
+++ b/ml-agents/mlagents/trainers/agent_processor.py

        pass

-    def __init__(self, behavior_id: str, maxlen: int = 1000):
+    def __init__(self, behavior_id: str, maxlen: int = 20):
        """
        Initializes an AgentManagerQueue. Note that we can give it a behavior_id so that it can be identified
        separately from an AgentManager.
--- a/ml-agents/mlagents/trainers/sac/trainer.py
+++ b/ml-agents/mlagents/trainers/sac/trainer.py
 logger = get_logger(__name__)

 BUFFER_TRUNCATE_PERCENT = 0.8
+DEFAULT_STEPS_PER_UPDATE = 1


 class SACTrainer(RLTrainer):
        self.steps_per_update = (
            trainer_parameters["steps_per_update"]
            if "steps_per_update" in trainer_parameters
-            else 1
+            else DEFAULT_STEPS_PER_UPDATE
        )
        self.reward_signal_steps_per_update = (
            trainer_parameters["reward_signals"]["reward_signal_steps_per_update"]