
Merge branch 'self-play-mutex' into soccer-2v1

/asymm-envs
Andrew Cohen, 5 years ago
Current commit
650ec121
8 files changed, 23 insertions and 39 deletions
  1. config/trainer_config.yaml (3 changes)
  2. docs/Training-Self-Play.md (32 changes)
  3. ml-agents/mlagents/trainers/ghost/controller.py (1 change)
  4. ml-agents/mlagents/trainers/ghost/trainer.py (4 changes)
  5. ml-agents/mlagents/trainers/learn.py (10 changes)
  6. ml-agents/mlagents/trainers/tests/test_simple_rl.py (1 change)
  7. ml-agents/mlagents/trainers/tests/test_trainer_util.py (8 changes)
  8. ml-agents/mlagents/trainers/trainer_util.py (3 changes)

config/trainer_config.yaml (3 changes)


play_against_current_best_ratio: 0.5
save_steps: 50000
swap_steps: 50000
team_change: 100000
SoccerOne:
normalize: false

play_against_current_best_ratio: 0.2
save_steps: 50000
swap_steps: 25000
team_change: 200000
SoccerTwos:
normalize: false

play_against_current_best_ratio: 0.2
save_steps: 50000
swap_steps: 100000
team_change: 200000
CrawlerStatic:
normalize: true
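
For orientation, a minimal sketch of reading the self-play keys shown in these hunks with PyYAML. The nesting under a per-behavior `self_play` block, and the behavior name used below, are assumptions for illustration rather than a verbatim copy of the file.

```python
# Sketch only: the per-behavior `self_play` nesting and the behavior name
# are assumptions for illustration; the values mirror the hunks above.
import yaml

example_config = """
SoccerTwos:
  normalize: false
  self_play:
    play_against_current_best_ratio: 0.2
    save_steps: 50000
    swap_steps: 100000
    team_change: 200000
"""

config = yaml.safe_load(example_config)
self_play = config["SoccerTwos"]["self_play"]
print(self_play["swap_steps"])   # 100000
print(self_play["team_change"])  # 200000
```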

docs/Training-Self-Play.md (32 changes)


ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with
[Self-Play](https://openai.com/blog/competitive-self-play/).
A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
this is not the case. Examples of asymmetric games are Hide and Seek or Strikers vs Goalie in Soccer. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not

The reward signal should still be used as described in the documentation for the other trainers and [reward signals.](Reward-Signals.md) However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward.
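
As a concrete illustration of the advice above, a minimal sketch of the simplest possible adversarial reward; the function name and win/loss encoding are hypothetical, not part of the ML-Agents API.

```python
# Hypothetical helper illustrating the recommended sparse reward:
# +1 for winning, -1 for losing, no shaping terms.
def episode_reward(agent_won: bool) -> float:
    return 1.0 if agent_won else -1.0

print(episode_reward(True))   # 1.0
print(episode_reward(False))  # -1.0
```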
### Save Steps
The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13.
A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent.
Recommended Range : 10000-100000
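
A minimal sketch of the cadence `save_steps` implies: one snapshot of the current policy every `save_steps` trainer steps. The class and its snapshot store below are illustrative stand-ins, not the ghost trainer's actual internals.

```python
# Illustrative stand-in for the ghost trainer's snapshot bookkeeping:
# a snapshot is taken every `save_steps` trainer steps.
from typing import Any, List

class SnapshotStore:
    def __init__(self, save_steps: int) -> None:
        self.save_steps = save_steps
        self.snapshots: List[Any] = []
        self._last_save_step = 0

    def maybe_save(self, trainer_step: int, policy_weights: Any) -> None:
        # Save whenever another `save_steps` trainer steps have elapsed.
        if trainer_step - self._last_save_step >= self.save_steps:
            self.snapshots.append(policy_weights)
            self._last_save_step = trainer_step

store = SnapshotStore(save_steps=10000)
for step in range(0, 50001, 1000):
    store.maybe_save(step, policy_weights={"step": step})
print(len(store.snapshots))  # 5 snapshots over 50000 trainer steps
```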
### Team Change
The `team_change` parameter corresponds to the number of *trainer_steps* between switching the learning team.
This is the number of trainer steps the teams associated with a specific ghost trainer will train before a different team
becomes the new learning team. It is possible that, in asymmetric games, opposing teams require fewer trainer steps to make similar
performance gains. This enables users to train a more complicated team of agents for more trainer steps than a simpler team of agents
per team switch.
A larger value of `team-change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents
the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies

We recommend setting this value as a function of the `save_steps` parameter discussed previously.
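
Since the text recommends setting `team_change` as a function of `save_steps`, a minimal sketch of one such choice: enough steps per learning window to produce a fixed number of snapshots. The factor of four is illustrative, not prescribed by the docs.

```python
# Illustrative choice of team_change relative to save_steps: allow each
# learning window to produce `snapshots_per_window` snapshots.
def team_change_for(save_steps: int, snapshots_per_window: int = 4) -> int:
    return snapshots_per_window * save_steps

print(team_change_for(save_steps=50000))  # 200000
```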
### Swap Steps
The `swap_steps` parameter corresponds to the number of *ghost steps* (note, not trainer steps) between swapping the opponent's policy with a different snapshot.
A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games,
we may have teams with an unequal number of agents, e.g. the 2v1 scenario in our Strikers Vs Goalie environment. The team with two agents collects
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if
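
The sentence above is cut off by the diff hunk; assuming the formula it introduces is `swap_steps = (num_agents / num_opponent_agents) * (team_change / x)`, where `x` is the desired number of opponent swaps per team change, a minimal sketch:

```python
# Assumed formula (reconstructed, see the note above):
#   swap_steps = (num_agents / num_opponent_agents) * (team_change / x)
def swap_steps(num_agents: int, num_opponent_agents: int,
               team_change: int, swaps_per_team_change: int) -> int:
    return int((num_agents / num_opponent_agents)
               * (team_change / swaps_per_team_change))

# 2v1 Strikers vs Goalie style example: the two-agent team takes twice as
# many ghost steps per environment step as the one-agent team.
print(swap_steps(2, 1, team_change=200000, swaps_per_team_change=4))  # 100000
print(swap_steps(1, 2, team_change=200000, swaps_per_team_change=4))  # 25000
```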

ml-agents/mlagents/trainers/ghost/controller.py (1 change)


import logging
from typing import Deque, Dict
from collections import deque
from mlagents.trainers.ghost.trainer import GhostTrainer

ml-agents/mlagents/trainers/ghost/trainer.py (4 changes)


self.trainer.advance()
if self.get_step - self.last_team_change > self.steps_to_train_team:
self.controller.finish_training()
self.controller.finish_training(self.get_step)
self.last_team_change = self.get_step
next_learning_team = self.controller.get_learning_team()

] = policy.get_weights()
self._save_snapshot() # Need to save after trainer initializes policy
self.trainer.add_policy(parsed_behavior_id, policy)
self._learning_team = self.controller.get_learning_team(self.ghost_step)
self._learning_team = self.controller.get_learning_team()
self.wrapped_trainer_team = team_id
else:
# for saving/swapping snapshots
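
For readers following the hunks above: the trainer signals `finish_training` when its `team_change` window elapses, then asks the controller which team learns next. A minimal sketch of that handshake follows; the deque-based round robin is an assumption suggested by the `deque` import in controller.py, not the exact implementation.

```python
# Minimal sketch of the controller/trainer handshake visible above.
# The deque-based round robin over team ids is an assumption.
from collections import deque
from typing import Deque

class MinimalGhostController:
    def __init__(self) -> None:
        self._queue: Deque[int] = deque()
        self._learning_team = -1

    def subscribe_team_id(self, team_id: int) -> None:
        if self._learning_team < 0:
            self._learning_team = team_id
        else:
            self._queue.append(team_id)

    def get_learning_team(self) -> int:
        return self._learning_team

    def finish_training(self, step: int) -> None:
        # `step` mirrors the call site above (finish_training(self.get_step))
        # but is unused in this sketch. Rotate: the current learning team
        # goes to the back of the queue.
        if self._queue:
            self._queue.append(self._learning_team)
            self._learning_team = self._queue.popleft()

controller = MinimalGhostController()
controller.subscribe_team_id(0)
controller.subscribe_team_id(1)
print(controller.get_learning_team())  # 0
controller.finish_training(step=200000)
print(controller.get_learning_team())  # 1
```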

ml-agents/mlagents/trainers/learn.py (10 changes)


type=int,
help="Number of parallel environments to use for training",
)
argparser.add_argument(
"--team-change",
default=50000,
type=int,
help="Number of trainer steps between changing the team_id that is learning",
)
argparser.add_argument(
"--docker-target-name",
default=None,

keep_checkpoints: int = parser.get_default("keep_checkpoints")
base_port: int = parser.get_default("base_port")
num_envs: int = parser.get_default("num_envs")
team_change: int = parser.get_default("team_change")
curriculum_config: Optional[Dict] = None
lesson: int = parser.get_default("lesson")
no_graphics: bool = parser.get_default("no_graphics")

options.keep_checkpoints,
options.train_model,
options.load_model,
options.team_change,
run_seed,
maybe_meta_curriculum,
options.multi_gpu,
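
A self-contained sketch of how the `--team-change` flag shown in this hunk parses; the parser below is a stand-alone stand-in, not the full mlagents-learn CLI.

```python
# Stand-alone stand-in for the hunk above: argparse maps --team-change to
# the attribute `team_change`, with a default of 50000.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--team-change",
    default=50000,
    type=int,
    help="Number of trainer steps between changing the team_id that is learning",
)

print(parser.parse_args([]).team_change)                            # 50000
print(parser.parse_args(["--team-change", "200000"]).team_change)   # 200000
```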

ml-agents/mlagents/trainers/tests/test_simple_rl.py (1 change)


keep_checkpoints=1,
train_model=True,
load_model=False,
team_change=10000,
seed=seed,
meta_curriculum=meta_curriculum,
multi_gpu=False,

ml-agents/mlagents/trainers/tests/test_trainer_util.py (8 changes)


keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=100,
seed=seed,
)
trainers = {}

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=100,
seed=seed,
)
trainers = {}

keep_checkpoints = 1
train_model = True
load_model = False
team_change = 100
seed = 11
bad_config = dummy_bad_config
BrainParametersMock.return_value.brain_name = "testbrain"

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=team_change,
seed=seed,
)
trainers = {}

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=team_change,
seed=seed,
)
trainers = {}

keep_checkpoints=keep_checkpoints,
train_model=train_model,
load_model=load_model,
team_change=team_change,
seed=seed,
)
trainers = {}

keep_checkpoints=1,
train_model=True,
load_model=False,
team_change=100,
seed=42,
)
trainer_factory.generate(brain_parameters.brain_name)

keep_checkpoints=1,
train_model=True,
load_model=False,
team_change=100,
seed=42,
)
with pytest.raises(TrainerConfigError):

ml-agents/mlagents/trainers/trainer_util.py (3 changes)


keep_checkpoints: int,
train_model: bool,
load_model: bool,
team_change: int,
seed: int,
meta_curriculum: MetaCurriculum = None,
multi_gpu: bool = False,

self.seed = seed
self.meta_curriculum = meta_curriculum
self.multi_gpu = multi_gpu
self.ghost_controller = GhostController(team_change)
self.ghost_controller = GhostController()
def generate(self, brain_name: str) -> Trainer:
return initialize_trainer(
