
[refactor] Move configuration files to single YAML file (#3791)

/whitepaper-experiments
GitHub, 5 years ago
Current commit
f86fc81d
76 files changed, with 1650 insertions and 641 deletions
  1. com.unity.ml-agents/CHANGELOG.md (3 changed lines)
  2. docs/Feature-Memory.md (2 changed lines)
  3. docs/Getting-Started.md (10 changed lines)
  4. docs/Learning-Environment-Create-New.md (43 changed lines)
  5. docs/Learning-Environment-Examples.md (4 changed lines)
  6. docs/Learning-Environment-Executable.md (6 changed lines)
  7. docs/Migrating.md (15 changed lines)
  8. docs/Reward-Signals.md (4 changed lines)
  9. docs/Training-Curriculum-Learning.md (61 changed lines)
  10. docs/Training-Environment-Parameter-Randomization.md (52 changed lines)
  11. docs/Training-Imitation-Learning.md (3 changed lines)
  12. docs/Training-ML-Agents.md (36 changed lines)
  13. docs/Training-Using-Concurrent-Unity-Instances.md (2 changed lines)
  14. gym-unity/README.md (2 changed lines)
  15. ml-agents/mlagents/trainers/learn.py (51 changed lines)
  16. ml-agents/mlagents/trainers/tests/test_learn.py (45 changed lines)
  17. ml-agents/mlagents/trainers/tests/test_trainer_util.py (41 changed lines)
  18. ml-agents/mlagents/trainers/trainer_util.py (25 changed lines)
  19. ml-agents/tests/yamato/training_int_tests.py (2 changed lines)
  20. ml-agents/tests/yamato/yamato_utils.py (3 changed lines)
  21. config/imitation/CrawlerStatic.yaml (29 changed lines)
  22. config/imitation/FoodCollector.yaml (29 changed lines)
  23. config/imitation/Hallway.yaml (28 changed lines)
  24. config/imitation/PushBlock.yaml (25 changed lines)
  25. config/imitation/Pyramids.yaml (36 changed lines)
  26. config/ppo/3DBall.yaml (25 changed lines)
  27. config/ppo/3DBallHard.yaml (25 changed lines)
  28. config/ppo/3DBall_randomize.yaml (40 changed lines)
  29. config/ppo/Basic.yaml (25 changed lines)
  30. config/ppo/Bouncer.yaml (25 changed lines)
  31. config/ppo/CrawlerDynamic.yaml (25 changed lines)
  32. config/ppo/CrawlerStatic.yaml (25 changed lines)
  33. config/ppo/FoodCollector.yaml (25 changed lines)
  34. config/ppo/GridWorld.yaml (25 changed lines)
  35. config/ppo/Hallway.yaml (25 changed lines)
  36. config/ppo/PushBlock.yaml (25 changed lines)
  37. config/ppo/Pyramids.yaml (29 changed lines)
  38. config/ppo/Reacher.yaml (25 changed lines)
  39. config/ppo/SoccerTwos.yaml (38 changed lines)
  40. config/ppo/StrikersVsGoalie.yaml (62 changed lines)
  41. config/ppo/Tennis.yaml (31 changed lines)
  42. config/ppo/VisualHallway.yaml (25 changed lines)
  43. config/ppo/VisualPushBlock.yaml (25 changed lines)
  44. config/ppo/VisualPyramids.yaml (29 changed lines)
  45. config/ppo/Walker.yaml (25 changed lines)
  46. config/ppo/WallJump.yaml (50 changed lines)
  47. config/ppo/WallJump_curriculum.yaml (65 changed lines)
  48. config/ppo/WormDynamic.yaml (25 changed lines)
  49. config/ppo/WormStatic.yaml (25 changed lines)
  50. config/sac/3DBall.yaml (25 changed lines)
  51. config/sac/3DBallHard.yaml (25 changed lines)
  52. config/sac/Basic.yaml (25 changed lines)
  53. config/sac/Bouncer.yaml (25 changed lines)
  54. config/sac/CrawlerDynamic.yaml (25 changed lines)
  55. config/sac/CrawlerStatic.yaml (25 changed lines)
  56. config/sac/FoodCollector.yaml (25 changed lines)
  57. config/sac/GridWorld.yaml (25 changed lines)
  58. config/sac/Hallway.yaml (25 changed lines)
  59. config/sac/PushBlock.yaml (25 changed lines)
  60. config/sac/Pyramids.yaml (31 changed lines)
  61. config/sac/Reacher.yaml (25 changed lines)
  62. config/sac/Tennis.yaml (30 changed lines)
  63. config/sac/VisualHallway.yaml (26 changed lines)
  64. config/sac/VisualPushBlock.yaml (26 changed lines)
  65. config/sac/VisualPyramids.yaml (31 changed lines)
  66. config/sac/Walker.yaml (25 changed lines)
  67. config/sac/WallJump.yaml (50 changed lines)
  68. config/gail_config.yaml (129 changed lines)
  69. config/3dball_randomize.yaml (16 changed lines)
  70. config/trainer_config.yaml (351 changed lines)

3
com.unity.ml-agents/CHANGELOG.md


C# style conventions. All public fields and properties now use "PascalCase"
instead of "camelCase"; for example, `Agent.maxStep` was renamed to
`Agent.MaxStep`. For a full list of changes, see the pull request. (#3828)
- Curriculum and Parameter Randomization configurations have been merged
into the main training configuration file. Note that this means training
configuration files are now environment-specific. (#3791)
- Update Barracuda to 0.7.0-preview which has breaking namespace and assembly name changes.
- Training artifacts (trained models, summaries) are now found in the `results/`
directory. (#3829)

2
docs/Feature-Memory.md


## How to use
- When configuring the trainer parameters in the `config/trainer_config.yaml`
+ When configuring the trainer parameters in the config YAML
file, add the following parameters to the Behavior you want to use.
```json
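# The example block is cut off by this hunk. As a hedged sketch, the memory
# settings match the recurrent options used by configs elsewhere in this commit
# (e.g. config/imitation/Hallway.yaml); the exact values here are illustrative:
use_recurrent: true
sequence_length: 64
memory_size: 256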

10
docs/Getting-Started.md


1. Navigate to the folder where you cloned the `ml-agents` repository. **Note**:
If you followed the default [installation](Installation.md), then you should
be able to run `mlagents-learn` from any directory.
- 1. Run `mlagents-learn config/trainer_config.yaml --run-id=first3DBallRun`.
-    - `config/trainer_config.yaml` is the path to a default training
-      configuration file that we provide. It includes training configurations for
-      all our example environments, including 3DBall.
+ 1. Run `mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun`.
+    - `config/ppo/3DBall.yaml` is the path to a default training
+      configuration file that we provide. The `config/ppo` folder includes training configuration
+      files for all our example environments, including 3DBall.
- `run-id` is a unique name for this training session.
1. When the message _"Start training by pressing the Play button in the Unity
Editor"_ is displayed on the screen, you can press the :arrow_forward: button

the same command again, appending the `--resume` flag:
```sh
- mlagents-learn config/trainer_config.yaml --run-id=first3DBallRun --resume
+ mlagents-learn config/ppo/3DBall.yaml --run-id=firstRun --resume
```
Your trained model will be at `results/<run-identifier>/<behavior_name>.nn` where

43
docs/Learning-Environment-Create-New.md


and include the following hyperparameter values:
```yml
behaviors:
  RollerBall:
    trainer: ppo
    batch_size: 10
    beta: 5.0e-3
    buffer_size: 100
    epsilon: 0.2
    hidden_units: 128
    lambd: 0.95
    learning_rate: 3.0e-4
    learning_rate_schedule: linear
    max_steps: 5.0e4
    normalize: false
    num_epoch: 3
    num_layers: 2
    time_horizon: 64
    summary_freq: 10000
    use_recurrent: false
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
```
Since this example creates a very simple training environment with only a few

4
docs/Learning-Environment-Examples.md


does not train with the provided default training parameters.**
- Float Properties: None
- Benchmark Mean Reward: 0.7
- - To speed up training, you can enable curiosity by adding the `curiosity`
-   reward signal in `config/trainer_config.yaml`
+ - To train this environment, you can enable curiosity by adding the `curiosity` reward signal
+   in `config/ppo/Hallway.yaml`
## Bouncer

6
docs/Learning-Environment-Executable.md


the directory where you installed the ML-Agents Toolkit, run:
```sh
- mlagents-learn ../config/trainer_config.yaml --env=3DBall --run-id=firstRun
+ mlagents-learn ../config/ppo/3DBall.yaml --env=3DBall --run-id=firstRun
- ml-agents$ mlagents-learn config/trainer_config.yaml --env=3DBall --run-id=first-run
+ ml-agents$ mlagents-learn config/ppo/3DBall.yaml --env=3DBall --run-id=first-run
▄▄▄▓▓▓▓

latest checkpoint. (**Note:** There is a known bug on Windows that causes the
saving of the model to fail when you terminate training early; it's
recommended to wait until Step has reached the max_steps parameter you set in
- trainer_config.yaml.) You can now embed this trained model into your Agent by
+ your config YAML.) You can now embed this trained model into your Agent by
following the steps below:
1. Move your model file into

15
docs/Migrating.md


- `WriteAdapter` was renamed to `ObservationWriter`. (#3834)
- Training artifacts (trained models, summaries) are now found under `results/`
instead of `summaries/` and `models/`.
- Trainer configuration, curriculum configuration, and parameter randomization
configuration have all been moved to a single YAML file. (#3791)
### Steps to Migrate

- Update uses of "camelCase" fields and properties to "PascalCase".
- If you have a custom `ISensor` implementation, you will need to change the signature of
its `Write()` method to use `ObservationWriter` instead of `WriteAdapter`.
- Before upgrading, copy your `Behavior Name` sections from `trainer_config.yaml` into
a separate trainer configuration file, under a `behaviors` section. You can move the `default` section too
if it's being used. This file should be specific to your environment, and not contain configurations for
multiple environments (unless they have the same Behavior Names).
- If your training uses [curriculum](Training-Curriculum-Learning.md), move those configurations under
the `Behavior Name` section.
- If your training uses [parameter randomization](Training-Environment-Parameter-Randomization.md), move
the contents of the sampler config to `parameter_randomization` in the main trainer configuration
(see the sketch after this list).
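A minimal sketch of the merged layout those steps describe, using a hypothetical Behavior Name `MyBehavior` and placeholder values; the real keys come from your existing `trainer_config.yaml`, curriculum, and sampler files:

```yaml
behaviors:
  MyBehavior:                 # copied from your old trainer_config.yaml section
    trainer: ppo
    batch_size: 1024
    # ... rest of the hyperparameters ...
    curriculum:               # moved here from the old curriculum file, if used
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        my_environment_parameter: [0.0, 4.0, 6.0]

parameter_randomization:      # moved here from the old sampler file, if used
  resampling-interval: 5000
  mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10
```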
## Migrating from 0.14 to 0.15

- Multiply `max_steps` and `summary_freq` in your `trainer_config.yaml` by the
number of Agents in the scene.
- Combine curriculum configs into a single file. See
- [the WallJump curricula](../config/curricula/wall_jump.yaml) for an example of
+ [the WallJump curricula](https://github.com/Unity-Technologies/ml-agents/blob/0.14.1/config/curricula/wall_jump.yaml) for an example of
the new curriculum config format. A tool like https://www.json2yaml.com may be
useful to help with the conversion.
- If you have a model trained which uses RayPerceptionSensor and has non-1.0

- It is now required to specify the path to the yaml trainer configuration file
when running `mlagents-learn`. For an example trainer configuration file, see
- [trainer_config.yaml](../config/trainer_config.yaml). An example of passing a
+ [trainer_config.yaml](https://github.com/Unity-Technologies/ml-agents/blob/0.5.0a/config/trainer_config.yaml). An example of passing a
trainer configuration to `mlagents-learn` is shown above.
- The environment name is now passed through the `--env` option.
- Curriculum learning has been changed. Refer to the

4
docs/Reward-Signals.md


## Enabling Reward Signals
- Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. An
- example is provided in `config/trainer_config.yaml` and `config/gail_config.yaml`. To enable a reward signal, add it to the
+ Reward signals, like other hyperparameters, are defined in the trainer config `.yaml` file. Examples of config files
+ are provided in `config/ppo/` and `config/imitation/`. To enable a reward signal, add it to the
`reward_signals:` section under the behavior name. For instance, to enable the extrinsic signal
in addition to a small curiosity reward and a GAIL reward signal, you would define your `reward_signals` as follows:
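The example block that follows in the original page is cut off by this hunk. As a hedged sketch consistent with `config/imitation/Pyramids.yaml` elsewhere in this commit (the strengths and sizes are illustrative):

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  curiosity:
    strength: 0.02
    gamma: 0.99
    encoding_size: 256
  gail:
    strength: 0.01
    gamma: 0.99
    encoding_size: 128
    demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```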

61
docs/Training-Curriculum-Learning.md


the height of the wall is what varies. We define this as a `Environment Parameters`
that can be accessed in `Academy.Instance.EnvironmentParameters`, and by doing
so it becomes adjustable via the Python API.
- Rather than adjusting it by hand, we will create a YAML file which defines the
- curricula for the Wall Jump environment.
+ Rather than adjusting it by hand, we will add a section to our YAML configuration file that
+ defines the curricula for the Wall Jump environment. You can find the full file in
+ `config/ppo/WallJump_curriculum.yaml`.
behaviors:
  BigWallJump:
    trainer: ppo
    ... # The rest of the hyperparameters
    vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
    curriculum: # Add this section for curriculum
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
        big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
  SmallWallJump:
    trainer: ppo
    ... # The rest of the hyperparameters
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
    curriculum: # Add this section for curriculum
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        small_wall_height: [1.5, 2.0, 2.5, 4.0]
- At the top level of the config is the behavior name. Note that this must be the
+ For each Behavior Name described in your training configuration file, we can specify a curriculum
+ by adding a `curriculum:` section under that particular Behavior Name. Note that these must be the
The curriculum for each
behavior has the following parameters:
* `measure` - What to measure learning progress, and advancement in lessons by.
* `reward` - Uses a measure received reward.

cumulative reward of the last `100` episodes exceeds the current threshold.
The mean reward logged to the console is dictated by the `summary_freq`
parameter in the
- [trainer configuration file](Training-ML-Agents.md#training-config-file).
+ [training configuration file](Training-ML-Agents.md#training-config-file).
* `signal_smoothing` (true/false) - Whether to weight the current progress
measure by previous values.
* If `true`, weighting will be 0.75 (new) 0.25 (old).

to train agents in the Wall Jump environment with curriculum learning, you can run:
```sh
- mlagents-learn config/trainer_config.yaml --curriculum=config/curricula/wall_jump.yaml --run-id=wall-jump-curriculum
+ mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum
```
You can then keep track of the current lessons and progresses via TensorBoard.

52
docs/Training-Environment-Parameter-Randomization.md


are handled by a **Sampler Manager**, which also handles the generation of new
values for the environment parameters when needed.
- To setup the Sampler Manager, we create a YAML file that specifies how we wish to
- generate new samples for each `Environment Parameters`. In this file, we specify the samplers and the
+ To setup the Sampler Manager, we edit our [training configuration file](Training-ML-Agents.md#training-config-file).
+ Add a `parameter_randomization` section that specifies how we wish to generate new samples for each `Environment
+ Parameters`. In this section, we specify the samplers and the
- resampled). Below is an example of a sampler file for the 3D ball environment.
+ resampled). Below is an example of a sampler file for the 3D ball environment. The full file is provided in
+ `config/ppo/3DBall_randomize.yaml`.
```yaml
behaviors:
  # Trainer hyperparameters

# New section
parameter_randomization:
  resampling-interval: 5000
  mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10
  gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]
  scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```

return np.random.choice(self.possible_vals)
```
- Now we need to specify the new sampler type in the sampler YAML file. For example, we use this new
+ Now we need to specify the new sampler type in the trainer configuration file. For example, we use this new
sampler type for the `Environment Parameter` *mass*.
```yaml
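# The example is cut off by this hunk. A hedged sketch, assuming the new sampler
# class was registered under the hypothetical key "custom-sampler"; it would be
# referenced from the parameter_randomization section like any built-in sampler:
mass:
  sampler-type: "custom-sampler"
  # ...any arguments expected by the custom sampler's constructor...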

### Training with Environment Parameter Randomization
- After the sampler YAML file is defined, we proceed by launching `mlagents-learn` and specify
- our configured sampler file with the `--sampler` flag. For example, if we wanted to train the
- 3D ball agent with parameter randomization using `Environment Parameters` with `config/3dball_randomize.yaml`
- sampling setup, we would run
+ After the parameter variations are defined in the training config file, we proceed by launching
+ `mlagents-learn` as usual. For example, if we wanted to train the
+ 3D ball agent with parameter randomization using `Environment Parameters` as specified in
+ `config/ppo/3DBall_randomize.yaml` sampling setup, we would run

- mlagents-learn config/trainer_config.yaml --sampler=config/3dball_randomize.yaml
-     --run-id=3D-Ball-randomize
+ mlagents-learn config/ppo/3DBall_randomize.yaml --run-id=3D-Ball-randomize

- We can observe progress and metrics via Tensorboard.
+ We can observe progress and metrics via Tensorboard as usual.

3
docs/Training-Imitation-Learning.md


width="375" border="10" />
</p>
- You can then specify the path to this file as the `demo_path` in your `trainer_config.yaml` file
+ You can then specify the path to this file as the `demo_path` in your
+ [training configuration file](Training-ML-Agents.md#training-config-file)
when using BC or GAIL. For instance, for BC:
```
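# The example is cut off by this hunk. As a hedged sketch, a BC section mirrors
# the behavioral_cloning block in config/imitation/CrawlerStatic.yaml from this
# same commit:
behavioral_cloning:
  demo_path: Project/Assets/ML-Agents/Examples/Crawler/Demos/ExpertCrawlerSta.demo
  strength: 0.5
  steps: 50000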

36
docs/Training-ML-Agents.md


This section offers a detailed guide into how to manage the different training
set-ups within the toolkit.
- The training config files `config/trainer_config.yaml`,
- `config/sac_trainer_config.yaml`, `config/gail_config.yaml` and
- `config/offline_bc_config.yaml` specify the training method, the
- hyperparameters, and a few additional values to use when training with Proximal
- Policy Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative Adversarial
- Imitation Learning) with PPO/SAC, and Behavioral Cloning (BC)/Imitation with
- PPO/SAC. These files are divided into sections. The **default** section defines
- the default values for all the available settings. You can also add new
- sections to override these defaults to train specific Behaviors. Name each of
- these override sections after the appropriate `Behavior Name`. Sections for the
- example environments are included in the provided config file.
+ For each training run, create a YAML file that contains the training method and the
+ hyperparameters for each of the Behaviors found in your environment. Example files for
+ Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) are provided in `config/ppo/` and
+ `config/sac/`, respectively. Examples for imitation learning through GAIL (Generative Adversarial
+ Imitation Learning) and Behavioral Cloning (BC) can be found in `config/imitation/`.
+ Each file is divided into sections. The `behaviors` section defines the hyperparameters
+ for each Behavior found in your environment. A section should be created for each `Behavior Name`.
+ The available parameters for PPO and SAC are listed below. Alternatively, if there are many
+ different Behaviors that all use similar hyperparameters, you can create a `default` behavior name
+ that specifies all hyperparameters that are not specified in the Behavior-specific sections.
To use [Curriculum Learning](Training-Curriculum-Learning.md) for a particular Behavior, add a
section under that `Behavior Name` called `curriculum`.
See the [Curriculum Learning](Training-Curriculum-Learning.md) page for more information.
To use Parameter Randomization, add a `parameter_randomization` section in the configuration
file. See the [Parameter Randomization](Training-Environment-Parameter-Randomization.md) docs
for more information.
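A hedged sketch of the resulting file layout; the behavior names (`default`, `MyBehavior`) and the values are illustrative, and the real parameter names are the PPO/SAC settings documented below:

```yaml
behaviors:
  default:                  # optional: shared hyperparameters for all Behaviors
    trainer: ppo
    batch_size: 1024
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
  MyBehavior:               # overrides for a specific Behavior Name
    max_steps: 5.0e5
    curriculum:             # optional, see Training-Curriculum-Learning.md
      measure: progress
      thresholds: [0.1, 0.3]
      parameters:
        my_environment_parameter: [0.0, 4.0, 6.0]

parameter_randomization:    # optional, see Training-Environment-Parameter-Randomization.md
  mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10
```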
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral
Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning

You can also compare the
[example environments](Learning-Environment-Examples.md) to the corresponding
- sections of the `config/trainer_config.yaml` file for each example to see how
- the hyperparameters and other configuration variables have been changed from the
- defaults.
+ files in the `config/ppo/` folder for each example to see how
+ the hyperparameters and other configuration variables have been changed from environment to environment.

2
docs/Training-Using-Concurrent-Unity-Instances.md


### Buffer Size
- If you are having trouble getting an agent to train, even with multiple concurrent Unity instances, you could increase `buffer_size` in the `config/trainer_config.yaml` file. A common practice is to multiply `buffer_size` by `num-envs`.
+ If you are having trouble getting an agent to train, even with multiple concurrent Unity instances, you could increase `buffer_size` in the [training configuration file](Training-ML-Agents.md#training-config-file). A common practice is to multiply `buffer_size` by `num-envs`.
### Resource Constraints

2
gym-unity/README.md


We provide results from our PPO implementation and the DQN from Baselines as reference.
Note that all runs used the same greyscale GridWorld as Dopamine. For PPO, `num_layers`
- was set to 2, and all other hyperparameters are the default for GridWorld in `trainer_config.yaml`.
+ was set to 2, and all other hyperparameters are the default for GridWorld in `config/ppo/GridWorld.yaml`.
For Baselines DQN, the provided hyperparameters in the previous section are used. Note
that Baselines implements certain features (e.g. dueling-Q) that are not enabled
in Dopamine DQN.

51
ml-agents/mlagents/trainers/learn.py


load_config,
TrainerFactory,
handle_existing_directories,
assemble_curriculum_config,
)
from mlagents.trainers.stats import (
TensorboardWriter,

)
from mlagents_envs.environment import UnityEnvironment
from mlagents.trainers.sampler_class import SamplerManager
- from mlagents.trainers.exception import SamplerException
+ from mlagents.trainers.exception import SamplerException, TrainerConfigError
from mlagents_envs.base_env import BaseEnv
from mlagents.trainers.subprocess_env_manager import SubprocessEnvManager
from mlagents_envs.side_channel.side_channel import SideChannel

help="Path to the Unity executable to train",
)
argparser.add_argument(
"--curriculum",
default=None,
dest="curriculum_config_path",
help="YAML file for defining the lessons for curriculum training",
)
argparser.add_argument(
)
argparser.add_argument(
"--sampler",
default=None,
dest="sampler_file_path",
help="YAML file for defining the sampler for environment parameter randomization",
)
argparser.add_argument(
"--keep-checkpoints",

class RunOptions(NamedTuple):
- trainer_config: Dict
+ behaviors: Dict
debug: bool = parser.get_default("debug")
seed: int = parser.get_default("seed")
env_path: Optional[str] = parser.get_default("env_path")

lesson: int = parser.get_default("lesson")
no_graphics: bool = parser.get_default("no_graphics")
multi_gpu: bool = parser.get_default("multi_gpu")
- sampler_config: Optional[Dict] = None
+ parameter_randomization: Optional[Dict] = None
env_args: Optional[List[str]] = parser.get_default("env_args")
cpu: bool = parser.get_default("cpu")
width: int = parser.get_default("width")

configs loaded from files.
"""
argparse_args = vars(args)
- trainer_config_path = argparse_args["trainer_config_path"]
- curriculum_config_path = argparse_args["curriculum_config_path"]
- argparse_args["trainer_config"] = load_config(trainer_config_path)
- if curriculum_config_path is not None:
-     argparse_args["curriculum_config"] = load_config(curriculum_config_path)
- if argparse_args["sampler_file_path"] is not None:
-     argparse_args["sampler_config"] = load_config(
-         argparse_args["sampler_file_path"]
-     )
+ config_path = argparse_args["trainer_config_path"]
+ # Load YAML and apply overrides as needed
+ yaml_config = load_config(config_path)
+ try:
+     argparse_args["behaviors"] = yaml_config["behaviors"]
+ except KeyError:
+     raise TrainerConfigError(
+         "Trainer configurations not found. Make sure your YAML file has a section for behaviors."
+     )
+ argparse_args["parameter_randomization"] = yaml_config.get(
+     "parameter_randomization", None
+ )
argparse_args.pop("sampler_file_path")
argparse_args.pop("curriculum_config_path")
return RunOptions(**vars(args))

capture_frame_rate=options.capture_frame_rate,
)
env_manager = SubprocessEnvManager(env_factory, engine_config, options.num_envs)
+ curriculum_config = assemble_curriculum_config(options.behaviors)
- options.curriculum_config, env_manager, options.lesson
+ curriculum_config, env_manager, options.lesson
- options.sampler_config, run_seed
+ options.parameter_randomization, run_seed
- options.trainer_config,
+ options.behaviors,
options.run_id,
write_path,
options.keep_checkpoints,

def try_create_meta_curriculum(
curriculum_config: Optional[Dict], env: SubprocessEnvManager, lesson: int
) -> Optional[MetaCurriculum]:
- if curriculum_config is None:
+ if curriculum_config is None or len(curriculum_config) <= 0:
return None
else:
meta_curriculum = MetaCurriculum(curriculum_config)

45
ml-agents/mlagents/trainers/tests/test_learn.py


import pytest
import yaml
from unittest.mock import MagicMock, patch, mock_open
from mlagents.trainers import learn
from mlagents.trainers.trainer_controller import TrainerController

return parse_command_line(args)
MOCK_YAML = """
    behaviors:
        {}
    """

MOCK_SAMPLER_CURRICULUM_YAML = """
    behaviors:
        behavior1:
            curriculum:
                curriculum1
        behavior2:
            curriculum:
                curriculum2
    parameter_randomization:
        sampler1
    """
@patch("mlagents.trainers.learn.write_timing_tree")
@patch("mlagents.trainers.learn.write_run_options")
@patch("mlagents.trainers.learn.handle_existing_directories")

mock_env.external_brain_names = []
mock_env.academy_name = "TestAcademyName"
create_environment_factory.return_value = mock_env
- trainer_config_mock = MagicMock()
- load_config.return_value = trainer_config_mock
+ load_config.return_value = yaml.safe_load(MOCK_YAML)
mock_init = MagicMock(return_value=None)
with patch.object(TrainerController, "__init__", mock_init):

)
- @patch("builtins.open", new_callable=mock_open, read_data="{}")
+ @patch("builtins.open", new_callable=mock_open, read_data=MOCK_YAML)
def test_commandline_args(mock_file):
# No args raises

# Test with defaults
opt = parse_command_line(["mytrainerpath"])
- assert opt.trainer_config == {}
+ assert opt.behaviors == {}
- assert opt.curriculum_config is None
- assert opt.sampler_config is None
+ assert opt.parameter_randomization is None
assert opt.keep_checkpoints == 5
assert opt.lesson == 0
assert opt.resume is False

full_args = [
"mytrainerpath",
"--env=./myenvfile",
"--curriculum=./mycurriculum",
"--sampler=./mysample",
"--keep-checkpoints=42",
"--lesson=3",
"--resume",

]
opt = parse_command_line(full_args)
- assert opt.trainer_config == {}
+ assert opt.behaviors == {}
- assert opt.curriculum_config == {}
- assert opt.sampler_config == {}
+ assert opt.parameter_randomization is None
assert opt.keep_checkpoints == 42
assert opt.lesson == 3
assert opt.run_id == "myawesomerun"

assert opt.resume is True
- @patch("builtins.open", new_callable=mock_open, read_data="{}")
+ @patch("builtins.open", new_callable=mock_open, read_data=MOCK_SAMPLER_CURRICULUM_YAML)
def test_sampler_configs(mock_file):
opt = parse_command_line(["mytrainerpath"])
assert opt.parameter_randomization == "sampler1"
@patch("builtins.open", new_callable=mock_open, read_data=MOCK_YAML)
def test_env_args(mock_file):
full_args = [
"mytrainerpath",

41
ml-agents/mlagents/trainers/tests/test_trainer_util.py


from unittest.mock import patch
from mlagents.trainers import trainer_util
- from mlagents.trainers.trainer_util import load_config, _load_config
+ from mlagents.trainers.trainer_util import (
+     load_config,
+     _load_config,
+     assemble_curriculum_config,
+ )
from mlagents.trainers.ppo.trainer import PPOTrainer
from mlagents.trainers.exception import TrainerConfigError, UnityTrainerException
from mlagents.trainers.brain import BrainParameters

with pytest.raises(TrainerConfigError):
fp = io.StringIO(file_contents)
_load_config(fp)

def test_assemble_curriculum_config():
    file_contents = """
    behavior1:
        curriculum:
            foo: 5
    behavior2:
        curriculum:
            foo: 6
    """
    trainer_config = _load_config(file_contents)
    curriculum_config = assemble_curriculum_config(trainer_config)
    assert curriculum_config == {"behavior1": {"foo": 5}, "behavior2": {"foo": 6}}

    # Check that nothing is returned if no curriculum.
    file_contents = """
    behavior1:
        foo: 3
    behavior2:
        foo: 4
    """
    trainer_config = _load_config(file_contents)
    curriculum_config = assemble_curriculum_config(trainer_config)
    assert curriculum_config == {}

    # Check that method doesn't break if 1st level entity isn't a dict.
    # Note: this is a malformed configuration.
    file_contents = """
    behavior1: 3
    behavior2: 4
    """
    trainer_config = _load_config(file_contents)
    curriculum_config = assemble_curriculum_config(trainer_config)
    assert curriculum_config == {}

def test_existing_directories(tmp_path):

25
ml-agents/mlagents/trainers/trainer_util.py


"""
if "default" not in trainer_config and brain_name not in trainer_config:
raise TrainerConfigError(
- f'Trainer config must have either a "default" section, or a section for the brain name ({brain_name}). '
- "See config/trainer_config.yaml for an example."
+ f'Trainer config must have either a "default" section, or a section for the brain name {brain_name}. '
+ "See the config/ directory for examples."
)
trainer_parameters = trainer_config.get("default", {}).copy()

while not isinstance(trainer_config[_brain_key], dict):
_brain_key = trainer_config[_brain_key]
trainer_parameters.update(trainer_config[_brain_key])
if init_path is not None:
trainer_parameters["init_path"] = "{basedir}/{name}".format(
basedir=init_path, name=brain_name
)
min_lesson_length = 1
if meta_curriculum:

"Error parsing yaml file. Please check for formatting errors. "
"A tool such as http://www.yamllint.com/ can be helpful with this."
) from e

def assemble_curriculum_config(trainer_config: Dict[str, Any]) -> Dict[str, Any]:
    """
    Assembles a curriculum config Dict from a trainer config. The resulting
    dictionary should have a mapping of {brain_name: config}, where config is another
    Dict that contains the curriculum config for that brain.
    :param trainer_config: Dict of trainer configurations (keys are brain_names).
    :return: Dict of curriculum configurations. Returns empty dict if none are found.
    """
    curriculum_config: Dict[str, Any] = {}
    for behavior_name, behavior_config in trainer_config.items():
        # Don't try to iterate non-Dicts. This probably means your config is malformed.
        if isinstance(behavior_config, dict) and "curriculum" in behavior_config:
            curriculum_config[behavior_name] = behavior_config["curriculum"]
    return curriculum_config

def handle_existing_directories(

2
ml-agents/tests/yamato/training_int_tests.py


# Copy the default training config but override the max_steps parameter,
# and reduce the batch_size and buffer_size enough to ensure an update step happens.
override_config_file(
-     "config/trainer_config.yaml",
+     "config/ppo/3DBall.yaml",
"override.yaml",
max_steps=100,
batch_size=10,

3
ml-agents/tests/yamato/yamato_utils.py


"""
with open(src_path) as f:
    configs = yaml.safe_load(f)
+ behavior_configs = configs["behaviors"]
- for config in configs.values():
+ for config in behavior_configs.values():
    config.update(**kwargs)
with open(dest_path, "w") as f:

29
config/imitation/CrawlerStatic.yaml


behaviors:
CrawlerStatic:
trainer: ppo
batch_size: 2024
beta: 0.005
buffer_size: 20240
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
max_steps: 1e7
memory_size: 256
normalize: true
num_epoch: 3
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
use_recurrent: false
reward_signals:
gail:
strength: 1.0
gamma: 0.99
encoding_size: 128
demo_path: Project/Assets/ML-Agents/Examples/Crawler/Demos/ExpertCrawlerSta.demo
behavioral_cloning:
demo_path: Project/Assets/ML-Agents/Examples/Crawler/Demos/ExpertCrawlerSta.demo
strength: 0.5
steps: 50000

29
config/imitation/FoodCollector.yaml


behaviors:
FoodCollector:
trainer: ppo
batch_size: 64
beta: 0.005
buffer_size: 10240
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
max_steps: 2.0e6
memory_size: 256
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 32
summary_freq: 10000
use_recurrent: false
reward_signals:
gail:
strength: 0.1
gamma: 0.99
encoding_size: 128
demo_path: Project/Assets/ML-Agents/Examples/FoodCollector/Demos/ExpertFood.demo
behavioral_cloning:
demo_path: Project/Assets/ML-Agents/Examples/FoodCollector/Demos/ExpertFood.demo
strength: 1.0
steps: 0

28
config/imitation/Hallway.yaml


behaviors:
Hallway:
trainer: ppo
batch_size: 128
beta: 0.01
buffer_size: 1024
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
max_steps: 1.0e7
memory_size: 256
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 10000
use_recurrent: true
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
gail:
strength: 0.1
gamma: 0.99
encoding_size: 128
demo_path: Project/Assets/ML-Agents/Examples/Hallway/Demos/ExpertHallway.demo

25
config/imitation/PushBlock.yaml


behaviors:
PushBlock:
trainer: ppo
batch_size: 128
beta: 0.01
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
max_steps: 1.5e7
memory_size: 256
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 60000
use_recurrent: false
reward_signals:
gail:
strength: 1.0
gamma: 0.99
encoding_size: 128
demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo

36
config/imitation/Pyramids.yaml


behaviors:
Pyramids:
trainer: ppo
batch_size: 128
beta: 0.01
buffer_size: 2048
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
max_steps: 1.0e7
memory_size: 256
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 30000
use_recurrent: false
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curiosity:
strength: 0.02
gamma: 0.99
encoding_size: 256
gail:
strength: 0.01
gamma: 0.99
encoding_size: 128
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
behavioral_cloning:
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
strength: 0.5
steps: 150000

25
config/ppo/3DBall.yaml


behaviors:
  3DBall:
    trainer: ppo
    batch_size: 64
    beta: 0.001
    buffer_size: 12000
    epsilon: 0.2
    hidden_units: 128
    lambd: 0.99
    learning_rate: 0.0003
    learning_rate_schedule: linear
    max_steps: 5.0e5
    memory_size: 128
    normalize: true
    num_epoch: 3
    num_layers: 2
    time_horizon: 1000
    sequence_length: 64
    summary_freq: 12000
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99

25
config/ppo/3DBallHard.yaml


behaviors:
3DBallHard:
trainer: ppo
batch_size: 1200
beta: 0.001
buffer_size: 12000
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 5.0e6
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 12000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

40
config/ppo/3DBall_randomize.yaml


behaviors:
  3DBall:
    trainer: ppo
    batch_size: 64
    beta: 0.001
    buffer_size: 12000
    epsilon: 0.2
    hidden_units: 128
    lambd: 0.99
    learning_rate: 3.0e-4
    learning_rate_schedule: linear
    max_steps: 5.0e5
    memory_size: 128
    normalize: true
    num_epoch: 3
    num_layers: 2
    time_horizon: 1000
    sequence_length: 64
    summary_freq: 12000
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99

parameter_randomization:
  resampling-interval: 500
  mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10
  gravity:
    sampler-type: "uniform"
    min_value: 7
    max_value: 12
  scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3

25
config/ppo/Basic.yaml


behaviors:
Basic:
trainer: ppo
batch_size: 32
beta: 0.005
buffer_size: 256
epsilon: 0.2
hidden_units: 20
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 5.0e5
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 1
time_horizon: 3
sequence_length: 64
summary_freq: 2000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.9

25
config/ppo/Bouncer.yaml


behaviors:
Bouncer:
trainer: ppo
batch_size: 1024
beta: 0.005
buffer_size: 10240
epsilon: 0.2
hidden_units: 64
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 4.0e6
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 10000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/ppo/CrawlerDynamic.yaml


behaviors:
CrawlerDynamic:
trainer: ppo
batch_size: 2024
beta: 0.005
buffer_size: 20240
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 1e7
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

25
config/ppo/CrawlerStatic.yaml


behaviors:
CrawlerStatic:
trainer: ppo
batch_size: 2024
beta: 0.005
buffer_size: 20240
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 1e7
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

25
config/ppo/FoodCollector.yaml


behaviors:
FoodCollector:
trainer: ppo
batch_size: 1024
beta: 0.005
buffer_size: 10240
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 2.0e6
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 10000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/ppo/GridWorld.yaml


behaviors:
GridWorld:
trainer: ppo
batch_size: 32
beta: 0.005
buffer_size: 256
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 500000
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 1
time_horizon: 5
sequence_length: 64
summary_freq: 20000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.9

25
config/ppo/Hallway.yaml


behaviors:
Hallway:
trainer: ppo
batch_size: 128
beta: 0.01
buffer_size: 1024
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 1.0e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 10000
use_recurrent: true
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/ppo/PushBlock.yaml


behaviors:
PushBlock:
trainer: ppo
batch_size: 128
beta: 0.01
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 2.0e6
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 60000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

29
config/ppo/Pyramids.yaml


behaviors:
Pyramids:
trainer: ppo
batch_size: 128
beta: 0.01
buffer_size: 2048
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 1.0e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 30000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curiosity:
strength: 0.02
gamma: 0.99
encoding_size: 256

25
config/ppo/Reacher.yaml


behaviors:
Reacher:
trainer: ppo
batch_size: 2024
beta: 0.005
buffer_size: 20240
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 2e7
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 60000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

38
config/ppo/SoccerTwos.yaml


behaviors:
SoccerTwos:
trainer: ppo
batch_size: 2048
beta: 0.005
buffer_size: 20480
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 10000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
self_play:
window: 10
play_against_latest_model_ratio: 0.5
save_steps: 50000
swap_steps: 50000
team_change: 200000
curriculum:
measure: progress
thresholds: [0.05, 0.1]
min_lesson_length: 100
signal_smoothing: true
parameters:
ball_touch: [1.0, 0.5, 0.0]

62
config/ppo/StrikersVsGoalie.yaml


behaviors:
Goalie:
trainer: ppo
batch_size: 2048
beta: 0.005
buffer_size: 20480
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 10000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
self_play:
window: 10
play_against_latest_model_ratio: 0.5
save_steps: 50000
swap_steps: 25000
team_change: 200000
Striker:
trainer: ppo
batch_size: 2048
beta: 0.005
buffer_size: 20480
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 10000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
self_play:
window: 10
play_against_latest_model_ratio: 0.5
save_steps: 50000
swap_steps: 100000
team_change: 200000

31
config/ppo/Tennis.yaml


behaviors:
Tennis:
trainer: ppo
batch_size: 1024
beta: 0.005
buffer_size: 10240
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e7
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 10000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
self_play:
window: 10
play_against_latest_model_ratio: 0.5
save_steps: 50000
swap_steps: 50000
team_change: 100000

25
config/ppo/VisualHallway.yaml


behaviors:
VisualHallway:
trainer: ppo
batch_size: 64
beta: 0.01
buffer_size: 1024
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 1.0e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 1
time_horizon: 64
sequence_length: 64
summary_freq: 10000
use_recurrent: true
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/ppo/VisualPushBlock.yaml


behaviors:
VisualPushBlock:
trainer: ppo
batch_size: 64
beta: 0.01
buffer_size: 1024
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 3.0e6
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 1
time_horizon: 64
sequence_length: 32
summary_freq: 60000
use_recurrent: true
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

29
config/ppo/VisualPyramids.yaml


behaviors:
VisualPyramids:
trainer: ppo
batch_size: 64
beta: 0.01
buffer_size: 2024
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 1.0e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 1
time_horizon: 128
sequence_length: 64
summary_freq: 10000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curiosity:
strength: 0.01
gamma: 0.99
encoding_size: 256

25
config/ppo/Walker.yaml


behaviors:
Walker:
trainer: ppo
batch_size: 2048
beta: 0.005
buffer_size: 20480
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 2e7
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

50
config/ppo/WallJump.yaml


behaviors:
BigWallJump:
trainer: ppo
batch_size: 128
beta: 0.005
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 2e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
SmallWallJump:
trainer: ppo
batch_size: 128
beta: 0.005
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 5e6
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

65
config/ppo/WallJump_curriculum.yaml


behaviors:
BigWallJump:
trainer: ppo
batch_size: 128
beta: 0.005
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 2e7
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curriculum:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
trainer: ppo
batch_size: 128
beta: 0.005
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 5e6
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
curriculum:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
small_wall_height: [1.5, 2.0, 2.5, 4.0]

25
config/ppo/WormDynamic.yaml


behaviors:
WormDynamic:
trainer: ppo
batch_size: 2024
beta: 0.005
buffer_size: 20240
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 3.5e6
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

25
config/ppo/WormStatic.yaml


behaviors:
WormStatic:
trainer: ppo
batch_size: 2024
beta: 0.005
buffer_size: 20240
epsilon: 0.2
hidden_units: 512
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 3.5e6
memory_size: 128
normalize: true
num_epoch: 3
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

25
config/sac/3DBall.yaml


behaviors:
3DBall:
trainer: sac
batch_size: 64
buffer_size: 12000
buffer_init_steps: 0
hidden_units: 64
init_entcoef: 0.5
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e5
memory_size: 128
normalize: true
steps_per_update: 10
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 12000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/sac/3DBallHard.yaml


behaviors:
3DBallHard:
trainer: sac
batch_size: 256
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e5
memory_size: 128
normalize: true
steps_per_update: 10
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 12000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/sac/Basic.yaml


behaviors:
Basic:
trainer: sac
batch_size: 64
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 20
init_entcoef: 0.01
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e5
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 2
time_horizon: 10
sequence_length: 64
summary_freq: 2000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/sac/Bouncer.yaml


behaviors:
Bouncer:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 64
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 1.0e6
memory_size: 128
normalize: true
steps_per_update: 10
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 20000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/sac/CrawlerDynamic.yaml


behaviors:
CrawlerDynamic:
trainer: sac
batch_size: 256
buffer_size: 500000
buffer_init_steps: 0
hidden_units: 512
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5e6
memory_size: 128
normalize: true
steps_per_update: 20
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

25
config/sac/CrawlerStatic.yaml


behaviors:
CrawlerStatic:
trainer: sac
batch_size: 256
buffer_size: 500000
buffer_init_steps: 2000
hidden_units: 512
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 3e6
memory_size: 128
normalize: true
steps_per_update: 20
num_layers: 3
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

25
config/sac/FoodCollector.yaml


behaviors:
FoodCollector:
trainer: sac
batch_size: 256
buffer_size: 500000
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 0.05
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 2.0e6
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 10000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/sac/GridWorld.yaml


behaviors:
GridWorld:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 1000
hidden_units: 128
init_entcoef: 0.5
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 500000
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 1
time_horizon: 5
sequence_length: 64
summary_freq: 20000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.9

25
config/sac/Hallway.yaml


behaviors:
Hallway:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 0.1
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5.0e6
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 2
time_horizon: 64
sequence_length: 32
summary_freq: 10000
tau: 0.005
use_recurrent: true
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

25
config/sac/PushBlock.yaml


behaviors:
PushBlock:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 256
init_entcoef: 0.05
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 2e6
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 100000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

31
config/sac/Pyramids.yaml


behaviors:
Pyramids:
trainer: sac
batch_size: 128
buffer_size: 500000
buffer_init_steps: 10000
hidden_units: 256
init_entcoef: 0.01
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 1.0e7
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 2
time_horizon: 128
sequence_length: 16
summary_freq: 30000
tau: 0.01
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 2.0
gamma: 0.99
gail:
strength: 0.02
gamma: 0.99
encoding_size: 128
use_actions: true
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo

25
config/sac/Reacher.yaml


behaviors:
Reacher:
trainer: sac
batch_size: 128
buffer_size: 500000
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 2e7
memory_size: 128
normalize: true
steps_per_update: 20
num_layers: 2
time_horizon: 1000
sequence_length: 64
summary_freq: 60000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

30
config/sac/Tennis.yaml


behaviors:
Tennis:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 256
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 2e7
memory_size: 128
normalize: true
steps_per_update: 10
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 10000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
self_play:
window: 10
play_against_current_self_ratio: 0.5
save_steps: 50000
swap_steps: 50000

26
config/sac/VisualHallway.yaml


behaviors:
VisualHallway:
trainer: sac
batch_size: 64
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 1.0e7
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 1
time_horizon: 64
sequence_length: 32
summary_freq: 10000
tau: 0.005
use_recurrent: true
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
gamma: 0.99

26
config/sac/VisualPushBlock.yaml


behaviors:
VisualPushBlock:
trainer: sac
batch_size: 64
buffer_size: 1024
buffer_init_steps: 0
hidden_units: 128
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 3.0e6
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 1
time_horizon: 64
sequence_length: 32
summary_freq: 60000
tau: 0.005
use_recurrent: true
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
gamma: 0.99

31
config/sac/VisualPyramids.yaml


behaviors:
VisualPyramids:
trainer: sac
batch_size: 64
buffer_size: 500000
buffer_init_steps: 1000
hidden_units: 256
init_entcoef: 0.01
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 1.0e7
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 1
time_horizon: 128
sequence_length: 64
summary_freq: 10000
tau: 0.01
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 2.0
gamma: 0.99
gail:
strength: 0.02
gamma: 0.99
encoding_size: 128
use_actions: true
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo

25
config/sac/Walker.yaml


behaviors:
Walker:
trainer: sac
batch_size: 256
buffer_size: 500000
buffer_init_steps: 0
hidden_units: 512
init_entcoef: 1.0
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 2e7
memory_size: 128
normalize: true
steps_per_update: 30
num_layers: 4
time_horizon: 1000
sequence_length: 64
summary_freq: 30000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.995

50
config/sac/WallJump.yaml


behaviors:
BigWallJump:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 256
init_entcoef: 0.1
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 2e7
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
SmallWallJump:
trainer: sac
batch_size: 128
buffer_size: 50000
buffer_init_steps: 0
hidden_units: 256
init_entcoef: 0.1
learning_rate: 0.0003
learning_rate_schedule: constant
max_steps: 5e6
memory_size: 128
normalize: false
steps_per_update: 10
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
tau: 0.005
use_recurrent: false
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99

129
config/gail_config.yaml


default:
  trainer: ppo
  batch_size: 1024
  beta: 5.0e-3
  buffer_size: 10240
  epsilon: 0.2
  hidden_units: 128
  lambd: 0.95
  learning_rate: 3.0e-4
  max_steps: 5.0e5
  memory_size: 256
  normalize: false
  num_epoch: 3
  num_layers: 2
  time_horizon: 64
  sequence_length: 64
  summary_freq: 10000
  use_recurrent: false
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99

Pyramids:
  summary_freq: 30000
  time_horizon: 128
  batch_size: 128
  buffer_size: 2048
  hidden_units: 512
  num_layers: 2
  beta: 1.0e-2
  max_steps: 1.0e7
  num_epoch: 3
  behavioral_cloning:
    demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
    strength: 0.5
    steps: 150000
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
    curiosity:
      strength: 0.02
      gamma: 0.99
      encoding_size: 256
    gail:
      strength: 0.01
      gamma: 0.99
      encoding_size: 128
      demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo

CrawlerStatic:
  normalize: true
  num_epoch: 3
  time_horizon: 1000
  batch_size: 2024
  buffer_size: 20240
  max_steps: 1e7
  summary_freq: 30000
  num_layers: 3
  hidden_units: 512
  behavioral_cloning:
    demo_path: Project/Assets/ML-Agents/Examples/Crawler/Demos/ExpertCrawlerSta.demo
    strength: 0.5
    steps: 50000
  reward_signals:
    gail:
      strength: 1.0
      gamma: 0.99
      encoding_size: 128
      demo_path: Project/Assets/ML-Agents/Examples/Crawler/Demos/ExpertCrawlerSta.demo

PushBlock:
  max_steps: 1.5e7
  batch_size: 128
  buffer_size: 2048
  beta: 1.0e-2
  hidden_units: 256
  summary_freq: 60000
  time_horizon: 64
  num_layers: 2
  reward_signals:
    gail:
      strength: 1.0
      gamma: 0.99
      encoding_size: 128
      demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo

Hallway:
  use_recurrent: true
  sequence_length: 64
  num_layers: 2
  hidden_units: 128
  memory_size: 256
  beta: 1.0e-2
  num_epoch: 3
  buffer_size: 1024
  batch_size: 128
  max_steps: 1.0e7
  summary_freq: 10000
  time_horizon: 64
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
    gail:
      strength: 0.1
      gamma: 0.99
      encoding_size: 128
      demo_path: Project/Assets/ML-Agents/Examples/Hallway/Demos/ExpertHallway.demo

FoodCollector:
  batch_size: 64
  max_steps: 2.0e6
  use_recurrent: false
  hidden_units: 128
  learning_rate: 3.0e-4
  num_layers: 2
  sequence_length: 32
  reward_signals:
    gail:
      strength: 0.1
      gamma: 0.99
      encoding_size: 128
      demo_path: Project/Assets/ML-Agents/Examples/FoodCollector/Demos/ExpertFood.demo
  behavioral_cloning:
    demo_path: Project/Assets/ML-Agents/Examples/FoodCollector/Demos/ExpertFood.demo
    strength: 1.0
    steps: 0
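
For contrast, gail_config.yaml above is still the legacy multi-behavior format: every behavior's overrides sit next to a shared default section, and the effective settings for a behavior are the defaults overlaid with its own block. A rough sketch of that resolution, assuming PyYAML; resolve_legacy() mirrors the idea only, not the actual trainer_util implementation:

import yaml

def resolve_legacy(config_path: str, behavior_name: str) -> dict:
    with open(config_path) as f:
        sections = yaml.safe_load(f)
    # Start from the shared defaults, then overlay the behavior-specific block.
    settings = dict(sections.get("default", {}))
    settings.update(sections.get(behavior_name, {}))
    return settings

pyramids = resolve_legacy("config/gail_config.yaml", "Pyramids")
print(pyramids["max_steps"], sorted(pyramids["reward_signals"]))  # 1.0e7 ['curiosity', 'extrinsic', 'gail']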

16
config/3dball_randomize.yaml


resampling-interval: 5000

mass:
  sampler-type: "uniform"
  min_value: 0.5
  max_value: 10

gravity:
  sampler-type: "uniform"
  min_value: 7
  max_value: 12

scale:
  sampler-type: "uniform"
  min_value: 0.75
  max_value: 3
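
This legacy sampler file is read on its own: resampling-interval is the number of training steps between redraws, and each remaining key names one environment parameter with a sampler-type and its bounds (after this refactor the same keys move into the single per-environment file, under a parameter_randomization section if I recall the new schema correctly). A purely illustrative sketch of what a uniform entry amounts to at resample time; draw() is a hypothetical helper, not the mlagents sampler classes:

import random
import yaml

def draw(sampler_path: str) -> dict:
    with open(sampler_path) as f:
        spec = yaml.safe_load(f)
    values = {}
    for key, entry in spec.items():
        if key == "resampling-interval":
            continue  # steps between redraws, not a parameter itself
        if entry.get("sampler-type") == "uniform":
            values[key] = random.uniform(entry["min_value"], entry["max_value"])
    return values

print(draw("config/3dball_randomize.yaml"))  # e.g. {'mass': 4.2, 'gravity': 9.8, 'scale': 1.7}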

351
config/trainer_config.yaml


default:
  trainer: ppo
  batch_size: 1024
  beta: 5.0e-3
  buffer_size: 10240
  epsilon: 0.2
  hidden_units: 128
  lambd: 0.95
  learning_rate: 3.0e-4
  learning_rate_schedule: linear
  max_steps: 5.0e5
  memory_size: 128
  normalize: false
  num_epoch: 3
  num_layers: 2
  time_horizon: 64
  sequence_length: 64
  summary_freq: 10000
  use_recurrent: false
  vis_encode_type: simple
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99

FoodCollector:
  normalize: false
  beta: 5.0e-3
  batch_size: 1024
  buffer_size: 10240
  max_steps: 2.0e6

Bouncer:
  normalize: true
  max_steps: 4.0e6
  num_layers: 2
  hidden_units: 64

PushBlock:
  max_steps: 2.0e6
  batch_size: 128
  buffer_size: 2048
  beta: 1.0e-2
  hidden_units: 256
  summary_freq: 60000
  time_horizon: 64
  num_layers: 2

SmallWallJump:
  max_steps: 5e6
  batch_size: 128
  buffer_size: 2048
  beta: 5.0e-3
  hidden_units: 256
  summary_freq: 20000
  time_horizon: 128
  num_layers: 2
  normalize: false

BigWallJump:
  max_steps: 2e7
  batch_size: 128
  buffer_size: 2048
  beta: 5.0e-3
  hidden_units: 256
  summary_freq: 20000
  time_horizon: 128
  num_layers: 2
  normalize: false

Pyramids:
  summary_freq: 30000
  time_horizon: 128
  batch_size: 128
  buffer_size: 2048
  hidden_units: 512
  num_layers: 2
  beta: 1.0e-2
  max_steps: 1.0e7
  num_epoch: 3
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
    curiosity:
      strength: 0.02
      gamma: 0.99
      encoding_size: 256

VisualPyramids:
  time_horizon: 128
  batch_size: 64
  buffer_size: 2024
  hidden_units: 256
  num_layers: 1
  beta: 1.0e-2
  max_steps: 1.0e7
  num_epoch: 3
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99
    curiosity:
      strength: 0.01
      gamma: 0.99
      encoding_size: 256

3DBall:
  normalize: true
  batch_size: 64
  buffer_size: 12000
  summary_freq: 12000
  time_horizon: 1000
  lambd: 0.99
  beta: 0.001

3DBallHard:
  normalize: true
  batch_size: 1200
  buffer_size: 12000
  summary_freq: 12000
  time_horizon: 1000
  max_steps: 5.0e6
  beta: 0.001
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.995

Tennis:
  normalize: true
  max_steps: 5.0e7
  learning_rate_schedule: constant
  batch_size: 1024
  buffer_size: 10240
  hidden_units: 256
  time_horizon: 1000
  self_play:
    window: 10
    play_against_latest_model_ratio: 0.5
    save_steps: 50000
    swap_steps: 50000
    team_change: 100000

Goalie:
  normalize: false
  max_steps: 5.0e7
  learning_rate_schedule: constant
  batch_size: 2048
  buffer_size: 20480
  hidden_units: 512
  time_horizon: 1000
  num_layers: 2
  self_play:
    window: 10
    play_against_latest_model_ratio: 0.5
    save_steps: 50000
    swap_steps: 25000
    team_change: 200000

Striker:
  normalize: false
  max_steps: 5.0e7
  learning_rate_schedule: constant
  batch_size: 2048
  buffer_size: 20480
  hidden_units: 512
  time_horizon: 1000
  num_layers: 2
  self_play:
    window: 10
    play_against_latest_model_ratio: 0.5
    save_steps: 50000
    swap_steps: 100000
    team_change: 200000

SoccerTwos:
  normalize: false
  max_steps: 5.0e7
  learning_rate_schedule: constant
  batch_size: 2048
  buffer_size: 20480
  hidden_units: 512
  time_horizon: 1000
  num_layers: 2
  self_play:
    window: 10
    play_against_latest_model_ratio: 0.5
    save_steps: 50000
    swap_steps: 50000
    team_change: 200000

CrawlerStatic:
  normalize: true
  num_epoch: 3
  time_horizon: 1000
  batch_size: 2024
  buffer_size: 20240
  max_steps: 1e7
  summary_freq: 30000
  num_layers: 3
  hidden_units: 512
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.995

CrawlerDynamic:
  normalize: true
  num_epoch: 3
  time_horizon: 1000
  batch_size: 2024
  buffer_size: 20240
  max_steps: 1e7
  summary_freq: 30000
  num_layers: 3
  hidden_units: 512
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.995

WormDynamic:
  normalize: true
  num_epoch: 3
  time_horizon: 1000
  batch_size: 2024
  buffer_size: 20240
  max_steps: 3.5e6
  summary_freq: 30000
  num_layers: 3
  hidden_units: 512
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.995

WormStatic:
  normalize: true
  num_epoch: 3
  time_horizon: 1000
  batch_size: 2024
  buffer_size: 20240
  max_steps: 3.5e6
  summary_freq: 30000
  num_layers: 3
  hidden_units: 512
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.995

Walker:
  normalize: true
  num_epoch: 3
  time_horizon: 1000
  batch_size: 2048
  buffer_size: 20480
  max_steps: 2e7
  summary_freq: 30000
  num_layers: 3
  hidden_units: 512
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.995

Reacher:
  normalize: true
  num_epoch: 3
  time_horizon: 1000
  batch_size: 2024
  buffer_size: 20240
  max_steps: 2e7
  summary_freq: 60000
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.995

Hallway:
  use_recurrent: true
  sequence_length: 64
  num_layers: 2
  hidden_units: 128
  memory_size: 128
  beta: 1.0e-2
  num_epoch: 3
  buffer_size: 1024
  batch_size: 128
  max_steps: 1.0e7
  summary_freq: 10000
  time_horizon: 64

VisualHallway:
  use_recurrent: true
  sequence_length: 64
  num_layers: 1
  hidden_units: 128
  memory_size: 128
  beta: 1.0e-2
  num_epoch: 3
  buffer_size: 1024
  batch_size: 64
  max_steps: 1.0e7
  summary_freq: 10000
  time_horizon: 64

VisualPushBlock:
  use_recurrent: true
  sequence_length: 32
  num_layers: 1
  hidden_units: 128
  memory_size: 128
  beta: 1.0e-2
  num_epoch: 3
  buffer_size: 1024
  batch_size: 64
  max_steps: 3.0e6
  summary_freq: 60000
  time_horizon: 64

GridWorld:
  batch_size: 32
  normalize: false
  num_layers: 1
  hidden_units: 256
  beta: 5.0e-3
  buffer_size: 256
  max_steps: 500000
  summary_freq: 20000
  time_horizon: 5
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.9

Basic:
  batch_size: 32
  normalize: false
  num_layers: 1
  hidden_units: 20
  beta: 5.0e-3
  buffer_size: 256
  max_steps: 5.0e5
  summary_freq: 2000
  time_horizon: 3
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.9

Some files were not shown because too many files changed in this diff.
