
Merge branch 'soccer-2v1' into asymm-envs

/asymm-envs
Andrew Cohen, 5 years ago
Commit: 4468280a
15 files changed, 1417 insertions and 801 deletions
  1. Project/Assets/ML-Agents/Examples/Soccer/TFModels/Goalie.nn (1001 changes)
  2. Project/Assets/ML-Agents/Examples/Soccer/TFModels/Striker.nn (1001 changes)
  3. Project/Assets/ML-Agents/Examples/Tennis/Scripts/TennisAgent.cs (7 changes)
  4. com.unity.ml-agents/CHANGELOG.md (1 change)
  5. docs/Training-ML-Agents.md (8 changes)
  6. docs/Training-PPO.md (11 changes)
  7. docs/Training-SAC.md (11 changes)
  8. docs/Training-Self-Play.md (10 changes)
  9. ml-agents/mlagents/trainers/learn.py (20 changes)
 10. ml-agents/mlagents/trainers/policy/tf_policy.py (56 changes)
 11. ml-agents/mlagents/trainers/tests/test_learn.py (2 changes)
 12. ml-agents/mlagents/trainers/tests/test_nn_policy.py (49 changes)
 13. ml-agents/mlagents/trainers/tests/test_simple_rl.py (2 changes)
 14. ml-agents/mlagents/trainers/tests/test_trainer_util.py (12 changes)
 15. ml-agents/mlagents/trainers/trainer_util.py (27 changes)

Project/Assets/ML-Agents/Examples/Soccer/TFModels/Goalie.nn (1001 changes)
File diff is too large to display.

Project/Assets/ML-Agents/Examples/Soccer/TFModels/Striker.nn (1001 changes)
File diff is too large to display.

Project/Assets/ML-Agents/Examples/Tennis/Scripts/TennisAgent.cs (7 changes)


public override float[] Heuristic()
{
-var action = new float[2];
+var action = new float[3];
-action[0] = Input.GetAxis("Horizontal");
-action[1] = Input.GetKey(KeyCode.Space) ? 1f : 0f;
+action[0] = Input.GetAxis("Horizontal"); // Racket Movement
+action[1] = Input.GetKey(KeyCode.Space) ? 1f : 0f; // Racket Jumping
+action[2] = Input.GetAxis("Vertical"); // Racket Rotation
return action;
}

com.unity.ml-agents/CHANGELOG.md (1 change)


- The Jupyter notebooks have been removed from the repository.
- Introduced the `SideChannelUtils` to register, unregister and access side channels.
- `Academy.FloatProperties` was removed, please use `SideChannelUtils.GetSideChannel<FloatPropertiesChannel>()` instead.
+- Added ability to start training (initialize model weights) from a previous run ID. (#3710)
### Minor Changes
- Format of console output has changed slightly and now matches the name of the model/summary directory. (#3630, #3616)

docs/Training-ML-Agents.md (8 changes)


specified, you will not be able to continue with training. Use `--force` to force ML-Agents to
overwrite the existing data.
+Alternatively, you might want to start a new training run but _initialize_ it using an already-trained
+model. You may want to do this, for instance, if your environment changed and you want
+a new model, but the old behavior is still better than random. You can do this by specifying `--initialize-from=<run-identifier>`, where `<run-identifier>` is the old run ID.
### Command Line Training Options
In addition to passing the path of the Unity executable containing your training

as the current agents in your scene.
* `--force`: Attempting to train a model with a run-id that has been used before will
throw an error. Use `--force` to force-overwrite this run-id's summary and model data.
+* `--initialize-from=<run-identifier>`: Specify an old run-id here to initialize your model from
+a previously trained model. Note that the previously saved models _must_ have the same behavior
+parameters as your current environment.
* `--no-graphics`: Specify this option to run the Unity executable in
`-batchmode` without initializing the graphics driver. Use this only if your
training doesn't involve visual observations (reading from Pixels). See

| train_interval | How often to update the agent. | SAC |
| num_update | Number of mini-batches to update the agent with during each update. | SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
+| init_path | Initialize trainer from a previously saved model. | PPO, SAC |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
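A usage note on the documentation above: with the new option, initializing a new run from an old one is a single extra flag on the usual training command, e.g. `mlagents-learn config/trainer_config.yaml --run-id=SoccerTwos-2 --initialize-from=SoccerTwos-1` (the config path and run IDs here are illustrative placeholders, not values taken from this commit).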

docs/Training-PPO.md (11 changes)


Typical Range: Approximately equal to PPO's `buffer_size`
+### (Optional) Advanced: Initialize Model Path
+`init_path` can be specified to initialize your model from a previous run before starting.
+Note that the prior run should have used the same trainer configurations as the current run,
+and have been saved with the same version of ML-Agents. You should provide the full path
+to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`.
+This option is provided in case you want to initialize different behaviors from different runs;
+in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize
+all models from the same run.
## Training Statistics
To view training statistics, use TensorBoard. For information on launching and

docs/Training-SAC.md (11 changes)


Typical Range (Discrete): `32` - `512`
+### (Optional) Advanced: Initialize Model Path
+`init_path` can be specified to initialize your model from a previous run before starting.
+Note that the prior run should have used the same trainer configurations as the current run,
+and have been saved with the same version of ML-Agents. You should provide the full path
+to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`.
+This option is provided in case you want to initialize different behaviors from different runs;
+in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize
+all models from the same run.
## Training Statistics
To view training statistics, use TensorBoard. For information on launching and
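To make the `init_path` layout described in the PPO and SAC sections above concrete, here is a minimal sketch of a per-behavior trainer configuration entry that uses it, shown roughly as the Python dict the trainers receive after the YAML is loaded (the behavior name, run ID, and path are made up; in practice this lives in the YAML trainer config or is filled in from `--initialize-from`):

    # Hypothetical example: initialize the "Striker" behavior from the checkpoints
    # of an earlier run, which ML-Agents saves under ./models/{run-id}/{behavior_name}.
    trainer_config = {
        "Striker": {
            "trainer": "ppo",
            "init_path": "./models/SoccerRun-1/Striker",  # folder containing the saved checkpoints
        }
    }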

docs/Training-Self-Play.md (10 changes)


A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games
are our Tennis and Soccer example environments. In reinforcement learning, this means both agents have the same observation and
action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games,
-this is not the case. An example of an asymmetric games are Hide and Seek. Agents in these
+this is not the case. An example of an asymmetric game is our Strikers Vs Goalie example environment. Agents in these
types of games do not always have the same observation or action spaces and so sharing policy networks is not
necessarily ideal.

Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration
for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy!
-For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis and Soccer environments.
+Tennis and Soccer provide examples of symmetric games. To train an asymmetric game, specify trainer configurations for each of your behavior names
+and include the self-play hyperparameter hierarchy in both.
+For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis, Soccer, and
+Strikers Vs Goalie environments.
+Tennis and Soccer provide examples of symmetric games and Strikers Vs Goalie provides an example of an asymmetric game.
## Best Practices Training with Self-Play
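To illustrate the paragraph above, a minimal sketch of the structure of an asymmetric setup: two Behavior Names, each with its own trainer configuration and its own self-play section (the behavior names follow the Strikers Vs Goalie example; hyperparameter values are intentionally omitted):

    # Hypothetical structure only: each behavior gets its own config block, and each
    # block carries its own self_play section (save_steps, swap_steps, etc.).
    trainer_config = {
        "Striker": {"trainer": "ppo", "self_play": {}},  # self-play hyperparameters go here
        "Goalie": {"trainer": "ppo", "self_play": {}},   # and here, possibly with different values
    }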

The `swap_steps` parameter corresponds to the number of *ghost steps* (not trainer steps) between swapping the opponents policy with a different snapshot.
A 'ghost step' refers to a step taken by an agent *that is following a fixed policy and not learning*. The reason for this distinction is that in asymmetric games,
-we may have teams with an unequal number of agents e.g. a 2v1 scenario. The team with two agents collects
+we may have teams with an unequal number of agents e.g. a 2v1 scenario like our Strikers Vs Goalie example environment. The team with two agents collects
twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number
of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if
a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents`
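The hunk is truncated before the formula itself, but the 2v1 reasoning above can be sketched numerically (illustrative numbers only; the exact `swap_steps` formula is given in Training-Self-Play.md):

    # Ghost steps accrue once per fixed agent per environment step, so over the same
    # window of fixed play the two-striker team logs twice as many ghost steps as the
    # lone goalie, and needs a proportionally larger swap_steps for the same swap rate.
    env_steps_while_fixed = 100_000  # environment steps a team spends following fixed snapshots
    desired_swaps = 10               # snapshot swaps wanted within that window

    striker_agents, goalie_agents = 2, 1
    striker_swap_steps = striker_agents * env_steps_while_fixed // desired_swaps  # 20000
    goalie_swap_steps = goalie_agents * env_steps_while_fixed // desired_swaps    # 10000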

ml-agents/mlagents/trainers/learn.py (20 changes)


default=False,
dest="force",
action="store_true",
help="Force-overwrite existing models and summaries for a run-id that has been used "
help="Force-overwrite existing models and summaries for a run ID that has been used "
help="The directory name for model and summary statistics",
help="The run identifier for model and summary statistics.",
)
+argparser.add_argument(
+"--initialize-from",
+metavar="RUN_ID",
+default=None,
+help="Specify a previously saved run ID from which to initialize the model. "
+"This can be used, for instance, to fine-tune an existing model on a new environment. ",
+)
argparser.add_argument(
"--save-freq", default=50000, type=int, help="Frequency at which to save model"

dest="inference",
action="store_true",
help="Run in Python inference mode (don't train). Use with --resume to load a model trained with an "
"existing run-id.",
"existing run ID.",
)
argparser.add_argument(
"--base-port",

seed: int = parser.get_default("seed")
env_path: Optional[str] = parser.get_default("env_path")
run_id: str = parser.get_default("run_id")
+initialize_from: str = parser.get_default("initialize_from")
load_model: bool = parser.get_default("load_model")
resume: bool = parser.get_default("resume")
force: bool = parser.get_default("force")

"""
with hierarchical_timer("run_training.setup"):
model_path = f"./models/{options.run_id}"
+maybe_init_path = (
+f"./models/{options.initialize_from}" if options.initialize_from else None
+)
summaries_dir = "./summaries"
port = options.base_port

],
)
handle_existing_directories(
-model_path, summaries_dir, options.resume, options.force
+model_path, summaries_dir, options.resume, options.force, maybe_init_path
)
tb_writer = TensorboardWriter(summaries_dir, clear_past_data=not options.resume)
gauge_write = GaugeWriter()

not options.inference,
options.resume,
run_seed,
+maybe_init_path,
maybe_meta_curriculum,
options.multi_gpu,
)

ml-agents/mlagents/trainers/policy/tf_policy.py (56 changes)


if self.use_continuous_act:
self.num_branches = self.brain.vector_action_space_size[0]
self.model_path = trainer_parameters["model_path"]
+self.initialize_path = trainer_parameters.get("init_path", None)
self.keep_checkpoints = trainer_parameters.get("keep_checkpoints", 5)
self.graph = tf.Graph()
self.sess = tf.Session(

init = tf.global_variables_initializer()
self.sess.run(init)
-def _load_graph(self):
+def _load_graph(self, model_path: str, reset_global_steps: bool = False) -> None:
-logger.info("Loading Model for brain {}".format(self.brain.brain_name))
-ckpt = tf.train.get_checkpoint_state(self.model_path)
+logger.info(
+"Loading model for brain {} from {}.".format(
+self.brain.brain_name, model_path
+)
+)
+ckpt = tf.train.get_checkpoint_state(model_path)
-"--run-id. and that the previous run you are resuming from had the same "
-"behavior names.".format(self.model_path)
+"--run-id and that the previous run you are loading from had the same "
+"behavior names.".format(model_path)
-self.saver.restore(self.sess, ckpt.model_checkpoint_path)
+try:
+self.saver.restore(self.sess, ckpt.model_checkpoint_path)
+except tf.errors.NotFoundError:
+raise UnityPolicyException(
+"The model {0} was found but could not be loaded. Make "
+"sure the model is from the same version of ML-Agents, has the same behavior parameters, "
+"and is using the same trainer configuration as the current run.".format(
+model_path
+)
+)
+if reset_global_steps:
+logger.info(
+"Starting training from step 0 and saving to {}.".format(
+self.model_path
+)
+)
+else:
+logger.info(
+"Resuming training from step {}.".format(self.get_current_step())
+)
-if self.load:
-self._load_graph()
+# If there is an initialize path, load from that. Else, load from the set model path.
+# If load is set to True, don't reset steps to 0. Else, do. This allows a user to,
+# e.g., resume from an initialize path.
+reset_steps = not self.load
+if self.initialize_path is not None:
+self._load_graph(self.initialize_path, reset_global_steps=reset_steps)
+elif self.load:
+self._load_graph(self.model_path, reset_global_steps=reset_steps)
+else:
+self._initialize_graph()

"""
step = self.sess.run(self.global_step)
return step
+def _set_step(self, step: int) -> int:
+"""
+Sets current model step to step without creating additional ops.
+:param step: Step to set the current model step to.
+:return: The step the model was set to.
+"""
+current_step = self.get_current_step()
+# Increment a positive or negative number of steps.
+return self.increment_step(step - current_step)
def increment_step(self, n_steps):
"""

ml-agents/mlagents/trainers/tests/test_learn.py (2 changes)


None,
)
handle_dir_mock.assert_called_once_with(
"./models/ppo", "./summaries", False, False
"./models/ppo", "./summaries", False, False, None
)
StatsReporter.writers.clear() # make sure there aren't any writers as added by learn.py

ml-agents/mlagents/trainers/tests/test_nn_policy.py (49 changes)


import pytest
import os
from typing import Dict, Any
import numpy as np
from mlagents.tf_utils import tf

NUM_AGENTS = 12
-def create_policy_mock(dummy_config, use_rnn, use_discrete, use_visual):
+def create_policy_mock(
+dummy_config: Dict[str, Any],
+use_rnn: bool = False,
+use_discrete: bool = True,
+use_visual: bool = False,
+load: bool = False,
+seed: int = 0,
+) -> NNPolicy:
mock_brain = mb.setup_mock_brain(
use_discrete,
use_visual,

trainer_parameters = dummy_config
trainer_parameters["keep_checkpoints"] = 3
trainer_parameters["use_recurrent"] = use_rnn
-policy = NNPolicy(0, mock_brain, trainer_parameters, False, False)
+policy = NNPolicy(seed, mock_brain, trainer_parameters, False, load)
+def test_load_save(dummy_config, tmp_path):
+path1 = os.path.join(tmp_path, "runid1")
+path2 = os.path.join(tmp_path, "runid2")
+trainer_params = dummy_config
+trainer_params["model_path"] = path1
+policy = create_policy_mock(trainer_params)
+policy.initialize_or_load()
+policy.save_model(2000)
+assert len(os.listdir(tmp_path)) > 0
+# Try load from this path
+policy2 = create_policy_mock(trainer_params, load=True, seed=1)
+policy2.initialize_or_load()
+_compare_two_policies(policy, policy2)
+# Try initialize from path 1
+trainer_params["model_path"] = path2
+trainer_params["init_path"] = path1
+policy3 = create_policy_mock(trainer_params, load=False, seed=2)
+policy3.initialize_or_load()
+_compare_two_policies(policy2, policy3)
+def _compare_two_policies(policy1: NNPolicy, policy2: NNPolicy) -> None:
+"""
+Make sure two policies have the same output for the same input.
+"""
+step = mb.create_batchedstep_from_brainparams(policy1.brain, num_agents=1)
+run_out1 = policy1.evaluate(step, list(step.agent_id))
+run_out2 = policy2.evaluate(step, list(step.agent_id))
+np.testing.assert_array_equal(run_out2["log_probs"], run_out1["log_probs"])
@pytest.mark.parametrize("discrete", [True, False], ids=["discrete", "continuous"])

ml-agents/mlagents/trainers/tests/test_simple_rl.py (2 changes)


processed_rewards = [
default_reward_processor(rewards) for rewards in env.final_rewards.values()
]
-success_threshold = 0.99
+success_threshold = 0.9
assert any(reward > success_threshold for reward in processed_rewards) and any(
reward < success_threshold for reward in processed_rewards
)

ml-agents/mlagents/trainers/tests/test_trainer_util.py (12 changes)


trainer_util.handle_existing_directories(model_path, summary_path, True, False)
# Test try to train w/ force - should work
trainer_util.handle_existing_directories(model_path, summary_path, False, True)
+# Test initialize option
+init_path = os.path.join(tmp_path, "runid2")
+with pytest.raises(UnityTrainerException):
+trainer_util.handle_existing_directories(
+model_path, summary_path, False, True, init_path
+)
+os.mkdir(init_path)
+# Should pass since the directory exists now.
+trainer_util.handle_existing_directories(
+model_path, summary_path, False, True, init_path
+)

ml-agents/mlagents/trainers/trainer_util.py (27 changes)


train_model: bool,
load_model: bool,
seed: int,
+init_path: str = None,
meta_curriculum: MetaCurriculum = None,
multi_gpu: bool = False,
):

self.model_path = model_path
+self.init_path = init_path
self.keep_checkpoints = keep_checkpoints
self.train_model = train_model
self.load_model = load_model

self.load_model,
self.ghost_controller,
self.seed,
+self.init_path,
self.meta_curriculum,
self.multi_gpu,
)

load_model: bool,
ghost_controller: GhostController,
seed: int,
+init_path: str = None,
meta_curriculum: MetaCurriculum = None,
multi_gpu: bool = False,
) -> Trainer:

:param load_model: Whether to load the model or randomly initialize
:param ghost_controller: The object that coordinates ghost trainers
:param seed: The random seed to use
+:param init_path: Path from which to load model, if different from model_path.
:param meta_curriculum: Optional meta_curriculum, used to determine a reward buffer length for PPOTrainer
:return:
"""

trainer_parameters["model_path"] = "{basedir}/{name}".format(
basedir=model_path, name=brain_name
)
+if init_path is not None:
+trainer_parameters["init_path"] = "{basedir}/{name}".format(
+basedir=init_path, name=brain_name
+)
trainer_parameters["keep_checkpoints"] = keep_checkpoints
if brain_name in trainer_config:
_brain_key: Any = brain_name

def handle_existing_directories(
-model_path: str, summary_path: str, resume: bool, force: bool
+model_path: str, summary_path: str, resume: bool, force: bool, init_path: str = None
) -> None:
"""
Validates that if the run_id model exists, we do not overwrite it unless --force is specified.

if model_path_exists:
if not resume and not force:
raise UnityTrainerException(
"Previous data from this run-id was found. "
"Either specify a new run-id, use --resume to resume this run, "
"Previous data from this run ID was found. "
"Either specify a new run ID, use --resume to resume this run, "
"Previous data from this run-id was not found. "
"Previous data from this run ID was not found. "
# Verify init path if specified.
if init_path is not None:
if not os.path.isdir(init_path):
raise UnityTrainerException(
"Could not initialize from {}. "
"Make sure models have already been saved with that run ID.".format(
init_path
)
)