A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are Tennis and Soccer. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
this is not the case. Examples of asymmetric games are Hide and Seek or Strikers vs Goalie in Soccer. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not
necessarily ideal. Fortunately, both of these situations are supported with only a few extra command line
arguments and trainer configurations!
With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent
(which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared
to competing against the current, best opponent in every episode, which is constantly changing (because it's learning).
Even with self-play, from the perspective of an individual agent, these adversarial scenarios appear to have non-stationary dynamics because the opponent is often changing.
This can cause significant issues for the experience replay mechanism used by SAC, so we recommend that users train with PPO. For further reading on
this issue in particular, [see this paper](https://arxiv.org/pdf/1702.08887.pdf).
For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md).
For more algorithm specific instruction, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md).
***Team ID must be 0 or a positive integer. Negative numbers will cause unpredictable behavior.***
See the trainer configuration and agent prefabs for our Tennis environment for an example.
In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their
Behavior Parameters script. In asymmetric games, each team should have a different Behavior Name in its Behavior Parameters script.
Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration
for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy!
For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis, Soccer and Strikers Vs Goalie environments.
Tennis and Soccer provide examples of symmetric games whereas Strikers Vs Goalie provides an example of an asymmetric game.
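As a rough sketch, a symmetric setup could look like the following trainer configuration entry, where a single Behavior Name is shared by both teams and the self-play hyperparameters sit under a `self_play` block. The behavior name, trainer settings, and values here are illustrative assumptions; consult the Tennis trainer configuration in the repository for the exact keys and defaults.
```yaml
# Hypothetical sketch of a symmetric self-play configuration.
# All values are placeholders; see the Tennis example config for the real ones.
Tennis:                                      # Behavior Name shared by both teams
    trainer: ppo                             # PPO is recommended for adversarial games
    self_play:                               # the self-play hyperparameter hierarchy
        save_steps: 50000                    # snapshot the policy every 50000 trainer steps
        swap_steps: 50000                    # swap the opponent snapshot every 50000 ghost steps
        play_against_current_best_ratio: 0.5 # play the current opponent half of the time
```
The `--team-change` interval is supplied on the command line rather than in this file, as discussed below.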
## Best Practices Training with Self-Play
Training against a set of slowly changing or unchanging adversaries with low diversity
results in a more stable learning process than training against a set of quickly
changing adversaries with high diversity. With this context, this guide discusses the exposed self-play hyperparameters and intuitions for tuning them.
## Hyperparameters
### Reward Signals
The reward signal should still be used as described in the documentation for the other trainers and [reward signals](Reward-Signals.md). However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward.
### Team Change
The `team-change` ***command line argument*** corresponds to the number of *trainer steps* between switching the learning team. So,
if you run with the command line flag `--team-change=200000`, the learning team will change every `200000` trainer steps. This ensures each team trains
for precisely the same number of steps. Note, this is not specified in the trainer configuration yaml file, but as a command line argument.
A larger value of `team-change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents,
the better it will become at defeating them. However, training against them for too long may result in overfitting to those particular opponent strategies,
and so the agent may fail against the next batch of opponents.
The value of `team-change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we
recommend setting this value as a function of the `save_steps` parameter which is discussed in the next section.
Recommended Range : 4x-10x where x=`save_steps`
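For example, if `save_steps=20000`, this recommendation corresponds to choosing a `team-change` value roughly in the range:
```
4 * 20000 = 80000   <=   team-change   <=   10 * 20000 = 200000
```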
### Save Steps
The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps, but it may also lead to a more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent.
### Swap Steps
The `swap_steps` parameter corresponds to the number of *ghost steps* between swapping the opponent's policy with a different snapshot.
This occurs when the team of this agent is not learning. A 'ghost step' refers
to a step taken by an agent *that is following a fixed policy* i.e. is not the learning agent. The reason for this distinction is that in asymmetric games,
we may have teams with an unequal number of agents e.g. the 2v1 scenario in our Strikers Vs Goalie environment. The team with two agents collects
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps`, if
a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` agents during `team-change` total steps, is:
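```
swap_steps = (num_agents / num_opponent_agents) * (team_change / x)
```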
As an example, in our Strikers Vs Goalie environment, if we want the swap to occur `x=4` times during `team-change=200000` steps,
the `swap_steps` for the team of one agent is:
```
swap_steps = (1 / 2) * (200000 / 4) = 25000
```
The `swap_steps` for the team of two agents is:
```
swap_steps = (2 / 1) * (200000 / 4) = 100000
```
Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be calculated by just dividing the total steps by the desired number of swaps.
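Putting this together, the asymmetric setup would use one configuration entry per Behavior Name, each with its own `swap_steps`. The behavior names and the nesting below are illustrative assumptions; see the Strikers Vs Goalie trainer configuration in the repository for the exact setup.
```yaml
# Hypothetical sketch matching the worked example above
# (run with --team-change=200000, targeting x=4 swaps per team change).
Goalie:                      # team with one agent
    self_play:
        swap_steps: 25000    # (1 / 2) * (200000 / 4)
Striker:                     # team with two agents
    self_play:
        swap_steps: 100000   # (2 / 1) * (200000 / 4)
```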
### Play against current best ratio
The `play_against_current_best_ratio` parameter corresponds to the probability
an agent will play against the current opponent. With probability
1 - `play_against_current_best_ratio`, the agent will play against a snapshot of its
opponent from a past iteration.
A larger value of `play_against_current_best_ratio` indicates that an agent will play against the current opponent more often. Since that opponent's policy is still being updated, it will be different from iteration to iteration. This can lead to an unstable learning environment, but it also presents the agent with an [auto-curriculum](https://openai.com/blog/emergent-tool-use/) of increasingly challenging situations, which may lead to a stronger final policy.
Recommended Range : 0.0 - 1.0
### ELO
In adversarial games, the cumulative environment reward may not be a meaningful metric by which to track learning progress. This is because cumulative reward is entirely dependent on the skill of the opponent. An agent at a particular skill level will get more or less reward against a worse or better agent, respectively.
We provide an implementation of the ELO rating system, a method for calculating the relative skill level between two players from a given population in a zero-sum game. For more information on ELO, please see [the ELO wiki](https://en.wikipedia.org/wiki/Elo_rating_system).
In a proper training run, the ELO of the agent should steadily increase. The absolute value of the ELO is less important than the change in ELO over training iterations.
Note, this implementation will support any number of teams, but ELO is only applicable to games with two teams. It is ongoing work to implement
a reliable metric for measuring progress in scenarios with more than two teams. These scenarios can still train, though as of now, reward and qualitative observations
are the only metrics by which we can judge performance.