
Merge branch 'self-play-mutex' into soccer-2v1

/asymm-envs
Andrew Cohen, 4 years ago
Commit 650ec121
8 changed files with 23 additions and 39 deletions
  1. config/trainer_config.yaml (3 changes)
  2. docs/Training-Self-Play.md (32 changes)
  3. ml-agents/mlagents/trainers/ghost/controller.py (1 change)
  4. ml-agents/mlagents/trainers/ghost/trainer.py (4 changes)
  5. ml-agents/mlagents/trainers/learn.py (10 changes)
  6. ml-agents/mlagents/trainers/tests/test_simple_rl.py (1 change)
  7. ml-agents/mlagents/trainers/tests/test_trainer_util.py (8 changes)
  8. ml-agents/mlagents/trainers/trainer_util.py (3 changes)

config/trainer_config.yaml (3 changes)


play_against_current_best_ratio: 0.5
save_steps: 50000
swap_steps: 50000
team_change: 100000
SoccerOne:
normalize: false

play_against_current_best_ratio: 0.2
save_steps: 50000
swap_steps: 25000
team_change: 200000
SoccerTwos:
normalize: false

play_against_current_best_ratio: 0.2
save_steps: 50000
swap_steps: 100000
team_change: 200000
CrawlerStatic:
normalize: true
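
For orientation, the flattened keys in these hunks presumably sit in a nested `self_play` block under each behavior name in `trainer_config.yaml`. The Python dict below mirrors that presumed structure for one behavior; the nesting, the assignment of this particular value block to `SoccerTwos`, and any keys not visible above are assumptions inferred from the hunk ordering, not the full file.

```python
# Presumed shape of the self-play settings shown above (illustrative only).
trainer_config = {
    "SoccerTwos": {
        "normalize": False,
        "self_play": {
            "play_against_current_best_ratio": 0.2,
            "save_steps": 50_000,
            "swap_steps": 100_000,
            "team_change": 200_000,
        },
    },
}
```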

docs/Training-Self-Play.md (32 changes)


ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with
[Self-Play](https://openai.com/blog/competitive-self-play/).
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
this is not the case. Examples of asymmetric games are Hide and Seek or Strikers vs Goalie in Soccer. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not always possible.

The reward signal should still be used as described in the documentation for the other trainers and [reward signals.](Reward-Signals.md) However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward.
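To make that recommendation concrete, the schematic below assigns such a sparse, zero-sum terminal reward. It is purely illustrative Python (in the Unity environments themselves rewards are set on the agent side) and is not part of the ML-Agents API; the function and team ids are hypothetical.

```python
# Illustrative only: the simplest possible adversarial reward, +1 win / -1 loss,
# with 0 for a draw. The helper and team ids are hypothetical.
from typing import Dict, Optional

def terminal_rewards(winning_team: Optional[int], teams=(0, 1)) -> Dict[int, float]:
    if winning_team is None:  # draw
        return {team: 0.0 for team in teams}
    return {team: 1.0 if team == winning_team else -1.0 for team in teams}

print(terminal_rewards(0))  # {0: 1.0, 1: -1.0}
```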
### Save Steps
The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent.
Recommended Range : 10000-100000
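As a rough illustration of the snapshot cadence described above, the sketch below copies the current policy weights into a snapshot list every `save_steps` trainer steps. The helper and its arguments are hypothetical, although `policy.get_weights()` does appear in the ghost trainer hunk further down.

```python
# Hypothetical sketch of save_steps-driven snapshotting (not the GhostTrainer code).
def maybe_save_snapshot(trainer_step: int, save_steps: int, policy, snapshots: list) -> None:
    if trainer_step > 0 and trainer_step % save_steps == 0:
        snapshots.append(policy.get_weights())  # older snapshots become future opponents
```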
The `team_change` parameter corresponds to the number of *trainer_steps* between switching the learning team.
This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team
becomes the new learning team. In asymmetric games, it is possible that opposing teams require fewer trainer steps to make similar
performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents
per team switch.
A larger value of `team_change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents, the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies, and so the agent may fail against a new set of opponents.

We recommend setting this value as a function of the `save_steps` parameter discussed previously.
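One way to read that recommendation, consistent with the `trainer_config.yaml` hunks above (a `save_steps` of 50000 and a `team_change` of 200000), is to pick how many fresh snapshots each learning phase should produce and scale `team_change` accordingly. The factor of 4 below is an assumption for illustration, not a documented default.

```python
save_steps = 50_000
snapshots_per_team_change = 4                          # hypothetical choice
team_change = snapshots_per_team_change * save_steps
print(team_change)                                     # 200000, matching the config hunks above
```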
The `swap_steps` parameter corresponds to the number of *ghost steps* (note, not trainer steps) between swapping the opponent's policy with a different snapshot.
A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games,
we may have teams with an unequal number of agents e.g. the 2v1 scenario in our Strikers Vs Goalie environment. The team with two agents collects
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if
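The cut-off formula can at least be illustrated from the relationship described above: a team with more agents accumulates ghost steps proportionally faster, so its swap interval must scale with the team-size ratio if both teams are to see the same number of opponent swaps per `team_change` window. The helper below is a hedged reconstruction with illustrative names, but note that with four swaps per team change it reproduces the `swap_steps` values of 25000 and 100000 shown in the config hunks above.

```python
# Hedged reconstruction of the swap_steps relationship (names are illustrative).
def swap_steps_for(num_agents: int, num_opponent_agents: int,
                   team_change: int, swaps_per_team_change: int) -> int:
    # A team with num_agents agents takes num_agents ghost steps per environment
    # step while frozen, so the interval scales with the team-size ratio.
    return int((num_agents / num_opponent_agents) * (team_change / swaps_per_team_change))

print(swap_steps_for(2, 1, team_change=200_000, swaps_per_team_change=4))  # 100000
print(swap_steps_for(1, 2, team_change=200_000, swaps_per_team_change=4))  # 25000
```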

ml-agents/mlagents/trainers/ghost/controller.py (1 change)


import logging
from typing import Deque, Dict
from collections import deque
from mlagents.trainers.ghost.trainer import GhostTrainer

ml-agents/mlagents/trainers/ghost/trainer.py (4 changes)


self.trainer.advance()
if self.get_step - self.last_team_change > self.steps_to_train_team:
self.controller.finish_training()
self.controller.finish_training(self.get_step)
self.last_team_change = self.get_step
next_learning_team = self.controller.get_learning_team()

] = policy.get_weights()
self._save_snapshot() # Need to save after trainer initializes policy
self.trainer.add_policy(parsed_behavior_id, policy)
self._learning_team = self.controller.get_learning_team(self.ghost_step)
self._learning_team = self.controller.get_learning_team()
self.wrapped_trainer_team = team_id
else:
# for saving/swapping snapshots
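Taken together, the controller hunk above and this trainer hunk imply a small coordination object shared by all ghost trainers: the learning team's trainer reports when its training window ends, and every trainer queries which team is currently learning. The sketch below is a minimal reconstruction of that interface under those assumptions; only `finish_training()` and `get_learning_team()` are taken from the hunks, while the registration method and the queue-based rotation are guesses, not the actual `GhostController` implementation.

```python
from collections import deque
from typing import Deque


class SketchGhostController:
    """Hypothetical stand-in for GhostController: rotates the learning team."""

    def __init__(self) -> None:
        self._queue: Deque[int] = deque()   # team ids waiting for their turn
        self._learning_team: int = -1
        self._last_change_step: int = 0

    def subscribe_team_id(self, team_id: int) -> None:
        # Assumed registration hook: the first team to register learns first.
        if self._learning_team < 0:
            self._learning_team = team_id
        else:
            self._queue.append(team_id)

    def get_learning_team(self) -> int:
        return self._learning_team

    def finish_training(self, step: int) -> None:
        # Called by the learning team's trainer (the hunk above passes
        # self.get_step); rotate the learning slot to the next team.
        self._last_change_step = step
        if self._queue:
            self._queue.append(self._learning_team)
            self._learning_team = self._queue.popleft()
```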

ml-agents/mlagents/trainers/learn.py (10 changes)


type=int,
help="Number of parallel environments to use for training",
)
argparser.add_argument(
"--team-change",
default=50000,
type=int,
help="Number of trainer steps between changing the team_id that is learning",
)
argparser.add_argument(
"--docker-target-name",
default=None,

keep_checkpoints: int = parser.get_default("keep_checkpoints")
base_port: int = parser.get_default("base_port")
num_envs: int = parser.get_default("num_envs")
team_change: int = parser.get_default("team_change")
curriculum_config: Optional[Dict] = None
lesson: int = parser.get_default("lesson")
no_graphics: bool = parser.get_default("no_graphics")

options.keep_checkpoints,
options.train_model,
options.load_model,
options.team_change,
run_seed,
maybe_meta_curriculum,
options.multi_gpu,
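The hunk above wires a `--team-change` flag through argparse. The self-contained sketch below shows the same add_argument / get_default pattern in isolation; the parser and the surrounding options are illustrative, not the full mlagents-learn CLI.

```python
import argparse

parser = argparse.ArgumentParser(description="illustrative CLI sketch")
parser.add_argument(
    "--team-change",
    default=50000,
    type=int,
    help="Number of trainer steps between changing the team_id that is learning",
)

# argparse converts "--team-change" to the attribute/dest name "team_change",
# which is what parser.get_default("team_change") looks up in the defaults block above.
print(parser.get_default("team_change"))                            # 50000
print(parser.parse_args(["--team-change", "200000"]).team_change)   # 200000
```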

ml-agents/mlagents/trainers/tests/test_simple_rl.py (1 change)


keep_checkpoints=1,
train_model=True,
load_model=False,
team_change=10000,
seed=seed,
meta_curriculum=meta_curriculum,
multi_gpu=False,

ml-agents/mlagents/trainers/tests/test_trainer_util.py (8 changes)


keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=100,
seed=seed,
)
trainers = {}

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=100,
seed=seed,
)
trainers = {}

keep_checkpoints = 1
train_model = True
load_model = False
team_change = 100
seed = 11
bad_config = dummy_bad_config
BrainParametersMock.return_value.brain_name = "testbrain"

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=team_change,
seed=seed,
)
trainers = {}

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=team_change,
seed=seed,
)
trainers = {}

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=team_change,
seed=seed,
)
trainers = {}

keep_checkpoints=1,
train_model=True,
load_model=False,
team_change=100,
seed=42,
)
trainer_factory.generate(brain_parameters.brain_name)

keep_checkpoints=1,
train_model=True,
load_model=False,
team_change=100,
seed=42,
)
with pytest.raises(TrainerConfigError):

ml-agents/mlagents/trainers/trainer_util.py (3 changes)


keep_checkpoints: int,
train_model: bool,
load_model: bool,
team_change: int,
seed: int,
meta_curriculum: MetaCurriculum = None,
multi_gpu: bool = False,

self.seed = seed
self.meta_curriculum = meta_curriculum
self.multi_gpu = multi_gpu
self.ghost_controller = GhostController(team_change)
self.ghost_controller = GhostController()
def generate(self, brain_name: str) -> Trainer:
return initialize_trainer(
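For context, this last hunk shows the factory holding a single GhostController and, per the ghost trainer hunk earlier, handing it to every trainer it creates, so all ghost trainers share one view of which team is learning. The standalone sketch below illustrates that wiring only; the class, the stub controller, and the return value are hypothetical and not the real TrainerFactory or initialize_trainer.

```python
class _StubGhostController:
    """Minimal stand-in so this sketch runs on its own."""
    def get_learning_team(self) -> int:
        return 0


class SketchTrainerFactory:
    """Illustrative only: one controller instance shared by every generated trainer."""

    def __init__(self, trainer_config: dict, seed: int, multi_gpu: bool = False) -> None:
        self.trainer_config = trainer_config
        self.seed = seed
        self.multi_gpu = multi_gpu
        self.ghost_controller = _StubGhostController()   # created once per run

    def generate(self, brain_name: str) -> dict:
        # The real factory delegates to initialize_trainer(...); this sketch just
        # shows every trainer receiving the same shared controller.
        return {"brain_name": brain_name, "ghost_controller": self.ghost_controller}
```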
