doc update for team_change

5 years ago · 7b37204e
--- a/docs/Training-Self-Play.md
+++ b/docs/Training-Self-Play.md

 The reward signal should still be used as described in the documentation for the other trainers and [reward signals](Reward-Signals.md). However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 for winning, -1 for losing) and to allow for more iterations of training to compensate for the sparsity of reward.

+### Save Steps
+
+The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots.  For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
+
+A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles, since the policy receives more training between snapshots. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps, but it may also lead to a more general and robust policy at the end of training. This value also depends on how intrinsically difficult the environment is for the agent.
+
+Recommended Range : 10000-100000
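
As an illustration of the `save_steps` description above (this sketch is not part of the commit), the snippet below lists the trainer-step counts at which snapshots would be taken. The helper name `snapshot_steps` and the `total_trainer_steps` argument are hypothetical, and the sketch assumes the first snapshot is taken after `save_steps` steps rather than at step 0.

```python
def snapshot_steps(save_steps: int, total_trainer_steps: int) -> list[int]:
    """Trainer-step counts at which a policy snapshot would be saved.

    Assumption: the first snapshot is taken after `save_steps` trainer steps,
    not at step 0. This helper is hypothetical and not an ML-Agents API.
    """
    if save_steps <= 0:
        raise ValueError("save_steps must be positive")
    return list(range(save_steps, total_trainer_steps + 1, save_steps))


# With save_steps=10000, a 50000-step run yields five snapshots.
print(snapshot_steps(save_steps=10000, total_trainer_steps=50000))
# [10000, 20000, 30000, 40000, 50000]
```

A smaller `save_steps` gives more snapshots over the same run, but each one is closer in training progress to its neighbours.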
+
-The `team-change` ***command line argument*** corresponds to the number of *trainer_steps* between switching the learning team. So,
-if you run with the command line flag `--team-change=200000`, the learning team will change every `200000` trainer steps. This ensures each team trains
-for precisely the same number of steps. Note, this is not specified in the trainer configuration yaml file, but as a command line argument.
+The `team_change` parameter corresponds to the number of *trainer steps* between switches of the learning team.
+This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team
+becomes the new learning team. In asymmetric games, one team may require fewer trainer steps than its opponents to make similar
+performance gains. Setting this value per ghost trainer lets users train a more complicated team of agents for more trainer steps than a simpler team of agents
+per team switch.

 A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents
 the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies
-recommend setting this value as a function of the `save_steps` parameter which is discussed in the next section.
+recommend setting this value as a function of the `save_steps` parameter discussed previously.
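
One way to read "a function of `save_steps`" is to pick `team_change` as a whole multiple of `save_steps`, so that each learning window produces a known number of new snapshots. The sketch below is only an illustration (not part of the commit); the helper names and the `snapshots_per_window` argument are hypothetical, and the multiple to use is left to the user.

```python
def team_change_from_save_steps(save_steps: int, snapshots_per_window: int) -> int:
    """Pick team_change as a whole multiple of save_steps.

    `snapshots_per_window` is a hypothetical knob: how many snapshots the
    learning team should save before the learning team switches.
    """
    if save_steps <= 0 or snapshots_per_window <= 0:
        raise ValueError("both arguments must be positive")
    return save_steps * snapshots_per_window


def snapshots_per_window(team_change: int, save_steps: int) -> int:
    """Number of complete snapshot intervals inside one team_change window."""
    if team_change <= 0 or save_steps <= 0:
        raise ValueError("both arguments must be positive")
    return team_change // save_steps


print(team_change_from_save_steps(save_steps=10000, snapshots_per_window=5))  # 50000
print(snapshots_per_window(team_change=200000, save_steps=10000))  # 20
```
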
-### Save Steps
-
-The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots.  For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
-
-A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent.
-
-Recommended Range : 10000-100000
-The `swap_steps` parameter corresponds to the number of *ghost steps* between swapping the opponents policy with a different snapshot.
-This occurs when the team of this agent is not learning. A 'ghost step' refers
-to a step taken by an agent *that is following a fixed policy* i.e. is not the learning agent. The reason for this distinction is that in asymmetric games,
+The `swap_steps` parameter corresponds to the number of *ghost steps* (not trainer steps) between swapping the opponent's policy with a different snapshot.
+A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games,
 we may have teams with an unequal number of agents e.g. the 2v1 scenario in our Strikers Vs Goalie environment. The team with two agents collects
 twice as many agent steps per environment step as the team with one agent.  Thus, these two values will need to be distinct to ensure that the same number
 of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if