
Update docs

/develop/sac-apex
Ervin Teng, 5 years ago
Current commit: 8bf8c9a9
3 files changed, with 19 additions and 20 deletions
  1. docs/Migrating.md (11 changes)
  2. docs/Training-ML-Agents.md (2 changes)
  3. docs/Training-SAC.md (26 changes)

docs/Migrating.md (11 changes)


* The `--load` and `--train` command-line flags have been deprecated and replaced with `--resume` and `--inference`.
* Running with the same `--run-id` twice will now throw an error.
* The `play_against_current_self_ratio` self-play trainer hyperparameter has been renamed to `play_against_latest_model_ratio`.
* The Jupyter notebooks have been removed from the repository.
* `Academy.FloatProperties` was removed.
* `Academy.RegisterSideChannel` and `Academy.UnregisterSideChannel` were removed.
* `num_updates` and `train_interval` for SAC have been replaced with `steps_per_update`.
### Steps to Migrate
* `steps_per_update` should be roughly equal to the number of agents in your environment, multiplied by `num_updates` and divided by `train_interval`, as in the sketch below.
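
As a rough illustration of that rule of thumb, here is a minimal before/after sketch of a SAC trainer config block. The behavior name, the 12-agent environment, and every numeric value are hypothetical; only the keys `trainer`, `train_interval`, `num_update`, and `steps_per_update` come from these docs.

```yaml
BehaviorName:                # hypothetical behavior block in a trainer config YAML
  trainer: sac
  # Old settings (now removed): train_interval: 2, num_update: 1
  # For a hypothetical environment with 12 agents:
  #   steps_per_update ~= 12 agents * 1 num_update / 2 train_interval = 6
  steps_per_update: 6
```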
## Migrating from 0.14 to 0.15

docs/Training-ML-Agents.md (2 changes)


| Setting | Description | Applies To Trainer\* |
| :------ | :---------- | :------------------- |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC |
| train_interval | How often to update the agent. | SAC |
| num_update | Number of mini-batches to update the agent with during each update. | SAC |
| steps_per_update | Ratio of agent steps per mini-batch update. | SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
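
To show roughly where these settings live, here is a minimal sketch of one behavior's block in a trainer config file (e.g. `trainer_config.yaml`). The behavior name and all values are made up; only the keys are taken from the table above.

```yaml
BehaviorName:              # hypothetical name of the behavior being trained
  trainer: sac             # "ppo", "sac", "offline_bc" or "online_bc"
  time_horizon: 64         # steps of experience collected per agent before adding to the buffer
  steps_per_update: 10     # SAC only: agent steps per mini-batch update
  use_recurrent: false     # set true to train with a recurrent network
```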

docs/Training-SAC.md (26 changes)


#### Number of Updates for Reward Signal (Optional)

Before: `reward_signal_num_update` for the reward signals corresponds to the number of mini batches sampled and used for updating the reward signals during each update. By default, we update the reward signals once every time the main policy is updated. However, in some cases we may want to update the policy N times, then update the reward signal (GAIL) M times. We can change `train_interval` and `num_update` of SAC to N, as well as `reward_signal_num_update` under `reward_signals` to M, to accomplish this. By default, `reward_signal_num_update` is set to `num_update`.

Typical Range: `num_update`

After: `reward_signal_steps_per_update` for the reward signals corresponds to the number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, in some cases we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M, to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`.

Typical Range: `steps_per_update`
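
As a minimal sketch of the N and N / M relationship above, assuming a hypothetical GAIL setup: the demo path and all numbers are illustrative, and the placement of `reward_signal_steps_per_update` under `reward_signals` simply follows the wording above rather than a verified schema.

```yaml
BehaviorName:
  trainer: sac
  steps_per_update: 16                  # N: one policy update per ~16 agent steps
  reward_signals:
    reward_signal_steps_per_update: 8   # N / M with M = 2, i.e. two GAIL updates per policy update
    gail:
      strength: 1.0                     # illustrative value
      demo_path: demos/Expert.demo      # hypothetical demo file
```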
### Buffer Size

Typical Range: `1` - `5`
### Number of Updates

`num_update` corresponds to the number of mini batches sampled and used for training during each training event. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Typically, this can be left at 1. To perform N mini-batch updates every N steps, we can change `train_interval` and `num_update` to N.

Typical Range: `1`

### Steps Per Update

`steps_per_update` corresponds to the number of agent steps (actions) taken for each mini-batch sampled and used during training. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Typically, this should be greater than 1. If we want more than one update per agent step, we can set `steps_per_update` lower than 1 to accomplish this.

Typical Range: `10` - `20`
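
For intuition about this ratio, a small sketch with hypothetical numbers (the agent count and all values are made up):

```yaml
BehaviorName:
  trainer: sac
  # Suppose the environment contains 20 agents, so each environment step
  # produces 20 agent steps of experience.
  steps_per_update: 10   # one mini-batch update per 10 agent steps,
                         # i.e. about 2 updates per environment step on average
```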
### Tau
