
Refactor of Curriculum and parameter sampling (#4160)

* Introduced the Constant Parameter Sampler that will be useful later as samplers and floats can be used interchangeably

* Refactored the settings.py to reflect the new format of the config.yaml

* First working version

* Added the unit tests

* Update to Upgrade for Updates

* fixing the tests

* Upgraded the config files

* Fixes

* Additional error catching

* addressing some comments

* Making the code nicer with cattr

* Added and registered an unstructure hook for ParameterRandomization

* Updating C# Walljump

* Adding comments

* Add test for settings export (#4164)

* Add test for settings export

* Update ml-agents/mlagents/trainers/tests/test_settings.py

Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>

Co-authored-by: Vincent-Pierre BERGES <vincentpierre@unity3d.com>

* Including environment parameters for the test for settings export

* First documentation up...
/MLA-1734-demo-provider
GitHub, 4 years ago
Current commit 8eefdcd3
26 files changed, with 1190 insertions and 868 deletions
  1. Project/Assets/ML-Agents/Examples/WallJump/Scripts/WallJumpAgent.cs (4 changes)
  2. com.unity.ml-agents/CHANGELOG.md (4 changes)
  3. config/ppo/3DBall_randomize.yaml (3 changes)
  4. config/ppo/WallJump_curriculum.yaml (86 changes)
  5. docs/Migrating.md (16 changes)
  6. docs/Training-ML-Agents.md (216 changes)
  7. docs/Using-Docker.md (2 changes)
  8. ml-agents/mlagents/trainers/env_manager.py (2 changes)
  9. ml-agents/mlagents/trainers/learn.py (44 changes)
  10. ml-agents/mlagents/trainers/settings.py (246 changes)
  11. ml-agents/mlagents/trainers/subprocess_env_manager.py (4 changes)
  12. ml-agents/mlagents/trainers/tests/test_config_conversion.py (101 changes)
  13. ml-agents/mlagents/trainers/tests/test_learn.py (61 changes)
  14. ml-agents/mlagents/trainers/tests/test_settings.py (135 changes)
  15. ml-agents/mlagents/trainers/tests/test_simple_rl.py (9 changes)
  16. ml-agents/mlagents/trainers/tests/test_trainer_controller.py (5 changes)
  17. ml-agents/mlagents/trainers/tests/test_trainer_util.py (3 changes)
  18. ml-agents/mlagents/trainers/trainer_controller.py (104 changes)
  19. ml-agents/mlagents/trainers/trainer_util.py (24 changes)
  20. ml-agents/mlagents/trainers/upgrade_config.py (125 changes)
  21. ml-agents/mlagents/trainers/environment_parameter_manager.py (156 changes)
  22. ml-agents/mlagents/trainers/tests/test_env_param_manager.py (256 changes)
  23. ml-agents/mlagents/trainers/tests/test_curriculum.py (77 changes)
  24. ml-agents/mlagents/trainers/tests/test_meta_curriculum.py (136 changes)
  25. ml-agents/mlagents/trainers/curriculum.py (91 changes)
  26. ml-agents/mlagents/trainers/meta_curriculum.py (148 changes)

Project/Assets/ML-Agents/Examples/WallJump/Scripts/WallJumpAgent.cs (4 changes)


}
else
{
var min = m_ResetParams.GetWithDefault("big_wall_min_height", 8);
var max = m_ResetParams.GetWithDefault("big_wall_max_height", 8);
var height = min + Random.value * (max - min);
var height = m_ResetParams.GetWithDefault("big_wall_height", 8);
localScale = new Vector3(
localScale.x,
height,

com.unity.ml-agents/CHANGELOG.md (4 changes)


#### ml-agents / ml-agents-envs / gym-unity (Python)
- The Parameter Randomization feature has been refactored to enable sampling of new parameters per episode to improve robustness. The
`resampling-interval` parameter has been removed and the config structure updated. More information [here](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-ML-Agents.md). (#4065)
- The Parameter Randomization feature has been merged with the Curriculum feature. It is now possible to specify a sampler
in the lesson of a Curriculum. Curriculum has been refactored and is now specified at the level of the parameter, not the
behavior. More information
[here](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-ML-Agents.md). (#4160)
### Minor Changes
#### com.unity.ml-agents (C#)

config/ppo/3DBall_randomize.yaml (3 changes)


time_horizon: 1000
summary_freq: 12000
threaded: true
parameter_randomization:
environment_parameters:
mass:
sampler_type: uniform
sampler_parameters:

config/ppo/WallJump_curriculum.yaml (86 changes)


time_horizon: 128
summary_freq: 20000
threaded: true
curriculum:
BigWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
small_wall_height: [1.5, 2.0, 2.5, 4.0]
environment_parameters:
big_wall_height:
curriculum:
- name: Lesson0 # The '-' is important as this is a list
completion_criteria:
measure: progress
behavior: BigWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.1
value:
sampler_type: uniform
sampler_parameters:
min_value: 0.0
max_value: 4.0
- name: Lesson1 # This is the start of the second lesson
completion_criteria:
measure: progress
behavior: BigWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.3
value:
sampler_type: uniform
sampler_parameters:
min_value: 4.0
max_value: 7.0
- name: Lesson2
completion_criteria:
measure: progress
behavior: BigWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.5
value:
sampler_type: uniform
sampler_parameters:
min_value: 6.0
max_value: 8.0
- name: Lesson3
value: 8.0
small_wall_height:
curriculum:
- name: Lesson0
completion_criteria:
measure: progress
behavior: SmallWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.1
value: 1.5
- name: Lesson1
completion_criteria:
measure: progress
behavior: SmallWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.3
value: 2.0
- name: Lesson2
completion_criteria:
measure: progress
behavior: SmallWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.5
value: 2.5
- name: Lesson3
value: 4.0

docs/Migrating.md (16 changes)


# Migrating
## Migrating from Release 1 to latest
## Migrating from Release 3 to latest
### Important changes
- The Parameter Randomization feature has been merged with the Curriculum feature. It is now possible to specify a sampler
in the lesson of a Curriculum. Curriculum has been refactored and is now specified at the level of the parameter, not the
behavior. More information
[here](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-ML-Agents.md). (#4160)
### Steps to Migrate
- The configuration format for curriculum and parameter randomization has changed. To upgrade your configuration files,
an upgrade script has been provided. Run `python -m mlagents.trainers.upgrade_config -h` to see the script usage. Note that you will need to upgrade to or install the current version of ML-Agents before running the script (a programmatic sketch follows this list). To update manually:
- If your config file used a `parameter_randomization` section, rename that section to `environment_parameters`
- If your config file used a `curriculum` section, you will need to rewrite your curriculum with this [format](Training-ML-Agents.md#curriculum).
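If you prefer to drive the conversion from Python, a minimal sketch along these lines should also work. It relies on the `convert` helper from `mlagents.trainers.upgrade_config` that this commit's `test_config_conversion.py` exercises; the file names below are placeholders:
```python
# Minimal sketch, assuming old-format config files exist at these placeholder
# paths. convert() takes the old trainer config, curriculum, and sampler dicts
# and returns a single new-format configuration dict.
import yaml
from mlagents.trainers.upgrade_config import convert

with open("old_trainer_config.yaml") as f:
    old_behaviors = yaml.safe_load(f)
with open("old_curriculum.yaml") as f:
    old_curriculum = yaml.safe_load(f)
with open("old_sampler.yaml") as f:
    old_sampler = yaml.safe_load(f)

new_config = convert(old_behaviors, old_curriculum, old_sampler)
with open("upgraded_config.yaml", "w") as f:
    yaml.dump(new_config, f)
```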
## Migrating from Release 1 to Release 3
### Important changes
- Training artifacts (trained models, summaries) are now found under `results/`

docs/Training-ML-Agents.md (216 changes)


- [Loading an Existing Model](#loading-an-existing-model)
- [Training Configurations](#training-configurations)
- [Behavior Configurations](#behavior-configurations)
- [Curriculum Learning](#curriculum-learning)
- [Specifying Curricula](#specifying-curricula)
- [Training with a Curriculum](#training-with-a-curriculum)
- [Environment Parameter Randomization](#environment-parameter-randomization)
- [Included Sampler Types](#included-sampler-types)
- [Defining a New Sampler Type](#defining-a-new-sampler-type)
- [Training with Environment Parameter Randomization](#training-with-environment-parameter-randomization)
- [Environment Parameters](#environment-parameters)
- [Environment Parameter Randomization](#environment-parameter-randomization)
- [Supported Sampler Types](#supported-sampler-types)
- [Training with Environment Parameter Randomization](#training-with-environment-parameter-randomization)
- [Curriculum Learning](#curriculum)
- [Training with a Curriculum](#training-with-a-curriculum)
- [Training Using Concurrent Unity Instances](#training-using-concurrent-unity-instances)
For a broad overview of reinforcement learning, imitation learning and all the

flags for `mlagents-learn` that control the training configurations:
- `<trainer-config-file>`: defines the training hyperparameters for each
Behavior in the scene, and the set-ups for Curriculum Learning and
Environment Parameter Randomization
Behavior in the scene, and the set-ups for the environment parameters
(Curriculum Learning and Environment Parameter Randomization)
- `--num-envs`: number of concurrent Unity instances to use during training
Reminder that a detailed description of all command-line options can be found by

The rest of this guide breaks down the different sub-sections of the trainer config file
and explains the possible settings for each.
**NOTE:** The configuration file format has been changed from 0.17.0 and onwards. To convert
**NOTE:** The configuration file format has been changed between 0.17.0 and 0.18.0 and
between 0.18.0 and onwards. To convert
an old set of configuration files (trainer config, curriculum, and sampler files) to the new
format, a script has been provided. Run `python -m mlagents.trainers.upgrade_config -h` in your
console to see the script's usage.

PPO trainer with all the possible training functionalities enabled (memory,
behavioral cloning, curiosity, GAIL and self-play). You will notice that
curriculum and environment parameter randomization settings are not part of the `behaviors`
configuration, but their settings live in different sections that we'll cover subsequently.
configuration, but in their own section called `environment_parameters`.
```yaml
behaviors:

description of all the configurations listed above, along with their defaults.
Unless otherwise specified, omitting a configuration will revert it to its default.
### Curriculum Learning
To enable curriculum learning, you need to add a `curriculum` sub-section to the trainer
configuration YAML file. Within this sub-section, add an entry for each behavior that defines
the curriculum for that behavior. Here is one example:
### Environment Parameters
In order to control the `EnvironmentParameters` in the Unity simulation during training,
you need to add a section called `environment_parameters`. For example, you can set the
value of an `EnvironmentParameter` called `my_environment_parameter` to `3.0` with
the following code:
```yml
behaviors:

# Add this section
curriculum:
BehaviorY:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
wall_height: [1.5, 2.0, 2.5, 4.0]
```
Each group of Agents under the same `Behavior Name` in an environment can have a
corresponding curriculum. These curricula are held in what we call a
"metacurriculum". A metacurriculum allows different groups of Agents to follow
different curricula within the same environment.
#### Specifying Curricula
In order to define the curricula, the first step is to decide which parameters
of the environment will vary. In the case of the Wall Jump environment, the
height of the wall is what varies. Rather than adjusting it by hand, we will
create a configuration which describes the structure of the curricula. Within it, we
can specify which points in the training process our wall height will change,
either based on the percentage of training steps which have taken place, or what
the average reward the agent has received in the recent past is. Below is an
example config for the curricula for the Wall Jump environment.
```yaml
behaviors:
BigWallJump:
# < Trainer parameters for BigWallJump >
SmallWallJump:
# < Trainer parameters for SmallWallJump >
curriculum:
BigWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
small_wall_height: [1.5, 2.0, 2.5, 4.0]
environment_parameters:
my_environment_parameter: 3.0
The curriculum for each Behavior has the following parameters:
| **Setting** | **Description** |
| :------------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `measure` | What to measure learning progress, and advancement in lessons by.<br><br> `reward` uses a measure received reward, while `progress` uses the ratio of steps/max_steps. |
| `thresholds` | Points in value of `measure` where lesson should be increased. |
| `min_lesson_length` | The minimum number of episodes that should be completed before the lesson can change. If `measure` is set to `reward`, the average cumulative reward of the last `min_lesson_length` episodes will be used to determine if the lesson should change. Must be nonnegative. <br><br> **Important**: the average reward that is compared to the thresholds is different than the mean reward that is logged to the console. For example, if `min_lesson_length` is `100`, the lesson will increment after the average cumulative reward of the last `100` episodes exceeds the current threshold. The mean reward logged to the console is dictated by the `summary_freq` parameter defined above. |
| `signal_smoothing` | Whether to weight the current progress measure by previous values. |
| `parameters` | Corresponds to environment parameters to control. Length of each array should be one greater than number of thresholds. |
Inside the Unity simulation, you can access your Environment Parameters by doing:
#### Training with a Curriculum
Once we have specified our metacurriculum and curricula, we can launch
`mlagents-learn` to point to the config file containing
our curricula and PPO will train using Curriculum Learning. For example, to
train agents in the Wall Jump environment with curriculum learning, we can run:
```sh
mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum
```csharp
Academy.Instance.EnvironmentParameters.GetWithDefault("my_environment_parameter", 0.0f);
We can then keep track of the current lessons and progresses via TensorBoard. If you've terminated
the run, you can resume it using `--resume` and lesson progress will start off where it
ended.
### Environment Parameter Randomization
#### Environment Parameter Randomization
To enable parameter randomization, you need to add a `parameter-randomization` sub-section
to your trainer config YAML file. Here is one example:
To enable environment parameter randomization, you need to edit the `environment_parameters`
section of your training configuration YAML file. Instead of providing a single float value
for your environment parameter, you can specify a sampler. Here is an example with
three environment parameters called `mass`, `length` and `scale`:
```yaml
```yml
# < Same as above>
BehaviorY:
# < Same as above >
parameter_randomization:
# Add this section
environment_parameters:
mass:
sampler_type: uniform
sampler_parameters:

st_dev: .3
```
Note that `mass`, `length` and `scale` are the names of the environment
parameters that will be sampled. These are used as keys by the `EnvironmentParameter`
class to sample new parameters via the function `GetWithDefault`.
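For reference, the same parameters can also be set from the low-level Python API rather than through `mlagents-learn`; here is a minimal sketch, assuming `mlagents_envs` is installed, a Unity environment is ready to connect, and an illustrative parameter name:
```python
# Minimal sketch, not part of this doc: setting an environment parameter
# through the EnvironmentParametersChannel side channel directly.
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.environment_parameters_channel import (
    EnvironmentParametersChannel,
)

channel = EnvironmentParametersChannel()
# file_name=None waits for the Unity editor to press Play
env = UnityEnvironment(file_name=None, side_channels=[channel])
channel.set_float_parameter("mass", 1.0)  # read in C# via GetWithDefault("mass", ...)
env.reset()
env.close()
```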
| **Setting** | **Description** |
| :--------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |

#### Supported Sampler Types
##### Supported Sampler Types
Below is a list of the `sampler_type` values supported by the toolkit.

The implementation of the samplers can be found in the
[Samplers.cs file](../com.unity.ml-agents/Runtime/Sampler.cs).
#### Training with Environment Parameter Randomization
##### Training with Environment Parameter Randomization
and specify trainer configuration with `parameter-randomization` defined. For example,
if we wanted to train the 3D ball agent with parameter randomization using
`Environment Parameters` with sampling setup, we would run
and specify trainer configuration with parameter randomization enabled. For example,
if we wanted to train the 3D ball agent with parameter randomization, we would run
```sh
mlagents-learn config/ppo/3DBall_randomize.yaml --run-id=3D-Ball-randomize

#### Curriculum
To enable curriculum learning, you need to add a `curriculum` sub-section to your environment
parameter. Here is one example with the environment parameter `my_environment_parameter`:
```yml
behaviors:
BehaviorY:
# < Same as above >
# Add this section
environment_parameters:
my_environment_parameter:
curriculum:
- name: MyFirstLesson # The '-' is important as this is a list
completion_criteria:
measure: progress
behavior: my_behavior
signal_smoothing: true
min_lesson_length: 100
threshold: 0.2
value: 0.0
- name: MySecondLesson # This is the start of the second lesson
completion_criteria:
measure: progress
behavior: my_behavior
signal_smoothing: true
min_lesson_length: 100
threshold: 0.6
require_reset: true
value:
sampler_type: uniform
sampler_parameters:
min_value: 4.0
max_value: 7.0
- name: MyLastLesson
value: 8.0
```
Note that this curriculum __only__ applies to `my_environment_parameter`. The `curriculum` section
contains a list of `Lessons`. In the example, the lessons are named `MyFirstLesson`, `MySecondLesson`
and `MyLastLesson`.
Each `Lesson` has 3 fields:
- `name`, a user-defined name for the lesson (the name of the lesson will be displayed in
the console when the lesson changes)
- `completion_criteria`, which determines what needs to happen in the simulation before the lesson
can be considered complete. When that condition is met, the curriculum moves on to the next
`Lesson`. Note that you do not need to specify a `completion_criteria` for the last `Lesson`.
- `value`, which is the value the environment parameter will take during the lesson. Note that this
can be a float or a sampler (a short parsing sketch follows the table below).
These are the different settings of the `completion_criteria`:
| **Setting** | **Description** |
| :------------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `measure` | What to measure learning progress, and advancement in lessons by.<br><br> `reward` uses a measure of the received reward, while `progress` uses the ratio of steps/max_steps. |
| `behavior` | Specifies which behavior is being tracked. There can be multiple behaviors with different names, each at different points of training. This setting allows the curriculum to track only one of them. |
| `threshold` | Determines at what point in value of `measure` the lesson should be increased. |
| `min_lesson_length` | The minimum number of episodes that should be completed before the lesson can change. If `measure` is set to `reward`, the average cumulative reward of the last `min_lesson_length` episodes will be used to determine if the lesson should change. Must be nonnegative. <br><br> **Important**: the average reward that is compared to the thresholds is different than the mean reward that is logged to the console. For example, if `min_lesson_length` is `100`, the lesson will increment after the average cumulative reward of the last `100` episodes exceeds the current threshold. The mean reward logged to the console is dictated by the `summary_freq` parameter defined above. |
| `signal_smoothing` | Whether to weight the current progress measure by previous values. |
| `require_reset` | Whether changing lesson requires the environment to reset (default: false) |
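A curriculum block written this way can be sanity-checked against the trainer's new settings classes. Here is a minimal sketch, mirroring `ml-agents/mlagents/trainers/tests/test_settings.py` from this commit (the parameter and behavior names are illustrative):
```python
# Minimal sketch mirroring test_settings.py: structure a curriculum dict into
# the new EnvironmentParameterSettings objects. Names are illustrative.
from typing import Dict
from mlagents.trainers.settings import EnvironmentParameterSettings

config = {
    "wall_height": {
        "curriculum": [
            {
                "name": "Lesson0",
                "completion_criteria": {
                    "measure": "progress",
                    "behavior": "BigWallJump",
                    "threshold": 0.1,
                },
                "value": 1.5,  # a plain float becomes a ConstantSettings sampler
            },
            {"name": "Lesson1", "value": 4.0},  # last lesson needs no criteria
        ]
    }
}

settings = EnvironmentParameterSettings.structure(
    config, Dict[str, EnvironmentParameterSettings]
)
lessons = settings["wall_height"].curriculum
print(lessons[0].completion_criteria.threshold)  # 0.1
print(lessons[1].value)  # a ConstantSettings instance holding 4.0
```
Malformed blocks, such as a non-terminal lesson without `completion_criteria` or a `progress` threshold above 1, raise `TrainerConfigError`, as covered by the tests in this commit.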
##### Training with a Curriculum
Once we have specified our metacurriculum and curricula, we can launch
`mlagents-learn` to point to the config file containing
our curricula and PPO will train using Curriculum Learning. For example, to
train agents in the Wall Jump environment with curriculum learning, we can run:
```sh
mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum
```
We can then keep track of the current lessons and progresses via TensorBoard. If you've terminated
the run, you can resume it using `--resume` and lesson progress will start off where it
ended.
### Training Using Concurrent Unity Instances

docs/Using-Docker.md (2 changes)


- Since Docker runs a container in an environment that is isolated from the host
machine, a mounted directory in your host machine is used to share data, e.g.
the trainer configuration file, Unity executable, curriculum files and
the trainer configuration file, Unity executable and
TensorFlow graph. For convenience, we created an empty `unity-volume`
directory at the root of the repository for this purpose, but feel free to use
any other directory. The remainder of this guide assumes that the

ml-agents/mlagents/trainers/env_manager.py (2 changes)


def set_env_parameters(self, config: Dict = None) -> None:
"""
Sends environment parameter settings to C# via the
EnvironmentParametersSidehannel.
EnvironmentParametersSideChannel.
:param config: Dict of environment parameter keys and values
"""
pass

ml-agents/mlagents/trainers/learn.py (44 changes)


import numpy as np
import json
from typing import Callable, Optional, List, Dict
from typing import Callable, Optional, List
from mlagents.trainers.meta_curriculum import MetaCurriculum
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
from mlagents.trainers.trainer_util import TrainerFactory, handle_existing_directories
from mlagents.trainers.stats import (
TensorboardWriter,

from mlagents.trainers.cli_utils import parser
from mlagents_envs.environment import UnityEnvironment
from mlagents.trainers.settings import RunOptions
from mlagents.trainers.training_status import GlobalTrainingStatus
from mlagents_envs.base_env import BaseEnv
from mlagents.trainers.subprocess_env_manager import SubprocessEnvManager

env_manager = SubprocessEnvManager(
env_factory, engine_config, env_settings.num_envs
)
maybe_meta_curriculum = try_create_meta_curriculum(
options.curriculum, env_manager, restore=checkpoint_settings.resume
env_parameter_manager = EnvironmentParameterManager(
options.environment_parameters, run_seed, restore=checkpoint_settings.resume
maybe_add_samplers(options.parameter_randomization, env_manager, run_seed)
trainer_factory = TrainerFactory(
options.behaviors,
write_path,

env_parameter_manager,
maybe_meta_curriculum,
False,
)
# Create controller and begin training.

checkpoint_settings.run_id,
maybe_meta_curriculum,
env_parameter_manager,
not checkpoint_settings.inference,
run_seed,
)

logger.warning(
f"Unable to save to {timing_path}. Make sure the directory exists"
)
def maybe_add_samplers(
sampler_config: Optional[Dict], env: SubprocessEnvManager, run_seed: int
) -> None:
"""
Adds samplers to env if sampler config provided and sets seed if not configured.
:param sampler_config: validated dict of sampler configs. None if not included.
:param env: env manager to pass samplers via reset
:param run_seed: Random seed used for training.
"""
if sampler_config is not None:
# If the seed is not specified in yaml, this will grab the run seed
for offset, v in enumerate(sampler_config.values()):
if v.seed == -1:
v.seed = run_seed + offset
env.set_env_parameters(config=sampler_config)
def try_create_meta_curriculum(
curriculum_config: Optional[Dict], env: SubprocessEnvManager, restore: bool = False
) -> Optional[MetaCurriculum]:
if curriculum_config is None or len(curriculum_config) <= 0:
return None
else:
meta_curriculum = MetaCurriculum(curriculum_config)
if restore:
meta_curriculum.try_restore_all_curriculum()
return meta_curriculum
def create_environment_factory(

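The hunks above replace the `MetaCurriculum` plumbing with a single `EnvironmentParameterManager`; the new `environment_parameter_manager.py` module itself (156 lines) is not shown in this view. As a hedged sketch only, the surface implied by the call sites visible in this commit (learn.py, trainer_util.py, trainer_controller.py) looks roughly like the following; argument names are placeholders, not the real signatures:
```python
# Hedged outline only: method names come from call sites in this commit;
# bodies and argument names are placeholders, not the actual implementation.
import abc
from typing import Dict, List, Tuple

from mlagents.trainers.settings import ParameterRandomizationSettings


class EnvParamManagerSurface(abc.ABC):
    @abc.abstractmethod
    def get_current_samplers(self) -> Dict[str, ParameterRandomizationSettings]:
        """Current sampler (or constant) per environment parameter, sent on env reset."""

    @abc.abstractmethod
    def get_current_lesson_number(self) -> Dict[str, int]:
        """Current lesson index per environment parameter, used for stats reporting."""

    @abc.abstractmethod
    def get_minimum_reward_buffer_size(self, behavior_name: str) -> int:
        """Reward-buffer length needed by the curricula that track this behavior."""

    @abc.abstractmethod
    def update_lessons(
        self,
        trainer_steps: Dict[str, int],
        trainer_max_steps: Dict[str, int],
        trainer_reward_buffer: Dict[str, List[float]],
    ) -> Tuple[bool, bool]:
        """Returns (any lesson changed, environment reset required)."""
```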
ml-agents/mlagents/trainers/settings.py (246 changes)


import attr
import cattr
from typing import Dict, Optional, List, Any, DefaultDict, Mapping, Tuple
from typing import Dict, Optional, List, Any, DefaultDict, Mapping, Tuple, Union
import numpy as np
import math
from mlagents.trainers.cli_utils import StoreConfigFile, DetectDefault, parser
from mlagents.trainers.cli_utils import load_config

return self.steps_per_update
# INTRINSIC REWARD SIGNALS #############################################################
class RewardSignalType(Enum):
EXTRINSIC: str = "extrinsic"
GAIL: str = "gail"

learning_rate: float = 3e-4
# SAMPLERS #############################################################################
CONSTANT: str = "constant"
def to_settings(self) -> type:
_mapping = {

ParameterRandomizationType.CONSTANT: ConstantSettings
# Constant type is handled if a float is provided instead of a config
}
return _mapping[self]

seed: int = parser.get_default("seed")
@staticmethod
def structure(d: Mapping, t: type) -> Any:
def structure(
d: Union[Mapping, float], t: type
) -> "ParameterRandomizationSettings":
Helper method to structure a Dict of ParameterRandomizationSettings class. Meant to be registered with
Helper method to structure a ParameterRandomizationSettings class. Meant to be registered with
if isinstance(d, (float, int)):
return ConstantSettings(value=d)
d_final: Dict[str, List[float]] = {}
for environment_parameter, environment_parameter_config in d.items():
if environment_parameter == "resampling-interval":
logger.warning(
"The resampling-interval is no longer necessary for parameter randomization. It is being ignored."
)
continue
if "sampler_type" not in environment_parameter_config:
raise TrainerConfigError(
f"Sampler configuration for {environment_parameter} does not contain sampler_type."
)
if "sampler_parameters" not in environment_parameter_config:
raise TrainerConfigError(
f"Sampler configuration for {environment_parameter} does not contain sampler_parameters."
)
enum_key = ParameterRandomizationType(
environment_parameter_config["sampler_type"]
if "sampler_type" not in d:
raise TrainerConfigError(
f"Sampler configuration does not contain sampler_type : {d}."
t = enum_key.to_settings()
d_final[environment_parameter] = strict_to_cls(
environment_parameter_config["sampler_parameters"], t
if "sampler_parameters" not in d:
raise TrainerConfigError(
f"Sampler configuration does not contain sampler_parameters : {d}."
return d_final
enum_key = ParameterRandomizationType(d["sampler_type"])
t = enum_key.to_settings()
return strict_to_cls(d["sampler_parameters"], t)
@staticmethod
def unstructure(d: "ParameterRandomizationSettings") -> Mapping:
"""
Helper method to unstructure a ParameterRandomizationSettings class. Meant to be registered with
cattr.register_unstructure_hook() and called with cattr.unstructure().
"""
_reversed_mapping = {
UniformSettings: ParameterRandomizationType.UNIFORM,
GaussianSettings: ParameterRandomizationType.GAUSSIAN,
MultiRangeUniformSettings: ParameterRandomizationType.MULTIRANGEUNIFORM,
ConstantSettings: ParameterRandomizationType.CONSTANT,
}
sampler_type: Optional[str] = None
for t, name in _reversed_mapping.items():
if isinstance(d, t):
sampler_type = name.value
sampler_parameters = attr.asdict(d)
return {"sampler_type": sampler_type, "sampler_parameters": sampler_parameters}
@abc.abstractmethod
def apply(self, key: str, env_channel: EnvironmentParametersChannel) -> None:

:param env_channel: The EnvironmentParametersChannel to communicate sampler settings to environment
"""
pass
@attr.s(auto_attribs=True)
class ConstantSettings(ParameterRandomizationSettings):
value: float = 0.0
def apply(self, key: str, env_channel: EnvironmentParametersChannel) -> None:
"""
Helper method to send sampler settings over EnvironmentParametersChannel
Calls the constant sampler type set method.
:param key: environment parameter to be sampled
:param env_channel: The EnvironmentParametersChannel to communicate sampler settings to environment
"""
env_channel.set_float_parameter(key, self.value)
@attr.s(auto_attribs=True)

)
# ENVIRONMENT PARAMETERS ###############################################################
@attr.s(auto_attribs=True)
class CompletionCriteriaSettings:
"""
CompletionCriteriaSettings contains the information needed to figure out if the next
lesson must start.
"""
class MeasureType(Enum):
PROGRESS: str = "progress"
REWARD: str = "reward"
measure: MeasureType = attr.ib(default=MeasureType.REWARD)
behavior: str = attr.ib(default="")
min_lesson_length: int = 0
signal_smoothing: bool = True
threshold: float = attr.ib(default=0.0)
require_reset: bool = False
@threshold.validator
def _check_threshold_value(self, attribute, value):
"""
Verify that the threshold has a value between 0 and 1 when the measure is
PROGRESS
"""
if self.measure == self.MeasureType.PROGRESS:
if self.threshold > 1.0:
raise TrainerConfigError(
"Threshold for next lesson cannot be greater than 1 when the measure is progress."
)
if self.threshold < 0.0:
raise TrainerConfigError(
"Threshold for next lesson cannot be negative when the measure is progress."
)
def need_increment(
self, progress: float, reward_buffer: List[float], smoothing: float
) -> Tuple[bool, float]:
"""
Given measures, this method returns a boolean indicating if the lesson
needs to change now, and a float corresponding to the new smoothed value.
"""
# Is the min number of episodes reached
if len(reward_buffer) < self.min_lesson_length:
return False, smoothing
if self.measure == CompletionCriteriaSettings.MeasureType.PROGRESS:
if progress > self.threshold:
return True, smoothing
if self.measure == CompletionCriteriaSettings.MeasureType.REWARD:
if len(reward_buffer) < 1:
return False, smoothing
measure = np.mean(reward_buffer)
if math.isnan(measure):
return False, smoothing
if self.signal_smoothing:
measure = 0.25 * smoothing + 0.75 * measure
smoothing = measure
if measure > self.threshold:
return True, smoothing
return False, smoothing
@attr.s(auto_attribs=True)
class Lesson:
"""
Gathers the data of one lesson for one environment parameter including its name,
the condition that must be fulfilled for the lesson to be completed and a sampler
for the environment parameter. If the completion_criteria is None, then this is
the last lesson in the curriculum.
"""
value: ParameterRandomizationSettings
name: str
completion_criteria: Optional[CompletionCriteriaSettings] = attr.ib(default=None)
@attr.s(auto_attribs=True)
class EnvironmentParameterSettings:
"""
EnvironmentParameterSettings is an ordered list of lessons for one environment
parameter.
"""
curriculum: List[Lesson]
@staticmethod
def _check_lesson_chain(lessons, parameter_name):
"""
Ensures that when using curriculum, all non-terminal lessons have a valid
CompletionCriteria
"""
num_lessons = len(lessons)
for index, lesson in enumerate(lessons):
if index < num_lessons - 1 and lesson.completion_criteria is None:
raise TrainerConfigError(
f"A non-terminal lesson does not have a completion_criteria for {parameter_name}."
)
@staticmethod
def structure(d: Mapping, t: type) -> Dict[str, "EnvironmentParameterSettings"]:
"""
Helper method to structure a Dict of EnvironmentParameterSettings class. Meant
to be registered with cattr.register_structure_hook() and called with
cattr.structure().
"""
if not isinstance(d, Mapping):
raise TrainerConfigError(
f"Unsupported parameter environment parameter settings {d}."
)
d_final: Dict[str, EnvironmentParameterSettings] = {}
for environment_parameter, environment_parameter_config in d.items():
if (
isinstance(environment_parameter_config, Mapping)
and "curriculum" in environment_parameter_config
):
d_final[environment_parameter] = strict_to_cls(
environment_parameter_config, EnvironmentParameterSettings
)
EnvironmentParameterSettings._check_lesson_chain(
d_final[environment_parameter].curriculum, environment_parameter
)
else:
sampler = ParameterRandomizationSettings.structure(
environment_parameter_config, ParameterRandomizationSettings
)
d_final[environment_parameter] = EnvironmentParameterSettings(
curriculum=[
Lesson(
completion_criteria=None,
value=sampler,
name=environment_parameter,
)
]
)
return d_final
# TRAINERS #############################################################################
@attr.s(auto_attribs=True)
class SelfPlaySettings:
save_steps: int = 20000

return t(**d_copy)
@attr.s(auto_attribs=True)
class CurriculumSettings:
class MeasureType:
PROGRESS: str = "progress"
REWARD: str = "reward"
measure: str = attr.ib(default=MeasureType.REWARD)
thresholds: List[float] = attr.ib(factory=list)
min_lesson_length: int = 0
signal_smoothing: bool = True
parameters: Dict[str, List[float]] = attr.ib(kw_only=True)
# COMMAND LINE #########################################################################
@attr.s(auto_attribs=True)
class CheckpointSettings:
run_id: str = parser.get_default("run_id")

)
env_settings: EnvironmentSettings = attr.ib(factory=EnvironmentSettings)
engine_settings: EngineSettings = attr.ib(factory=EngineSettings)
parameter_randomization: Optional[Dict[str, ParameterRandomizationSettings]] = None
curriculum: Optional[Dict[str, CurriculumSettings]] = None
environment_parameters: Optional[Dict[str, EnvironmentParameterSettings]] = None
checkpoint_settings: CheckpointSettings = attr.ib(factory=CheckpointSettings)
# These are options that are relevant to the run itself, and not the engine or environment.

cattr.register_structure_hook(EngineSettings, strict_to_cls)
cattr.register_structure_hook(CheckpointSettings, strict_to_cls)
cattr.register_structure_hook(
Dict[str, ParameterRandomizationSettings],
ParameterRandomizationSettings.structure,
Dict[str, EnvironmentParameterSettings], EnvironmentParameterSettings.structure
)
cattr.register_structure_hook(Lesson, strict_to_cls)
cattr.register_structure_hook(
ParameterRandomizationSettings, ParameterRandomizationSettings.structure
)
cattr.register_unstructure_hook(
ParameterRandomizationSettings, ParameterRandomizationSettings.unstructure
cattr.register_structure_hook(CurriculumSettings, strict_to_cls)
cattr.register_structure_hook(TrainerSettings, TrainerSettings.structure)
cattr.register_structure_hook(
DefaultDict[str, TrainerSettings], TrainerSettings.dict_to_defaultdict

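Besides the structure hooks, the commit registers an unstructure hook so sampler settings can be written back out as plain dicts (which the settings-export test added in this PR exercises). A minimal round-trip sketch with arbitrary values:
```python
# Minimal round-trip sketch using the structure/unstructure helpers above.
from mlagents.trainers.settings import (
    ParameterRandomizationSettings,
    UniformSettings,
)

raw = {
    "sampler_type": "uniform",
    "sampler_parameters": {"min_value": 0.0, "max_value": 4.0},
}

sampler = ParameterRandomizationSettings.structure(raw, ParameterRandomizationSettings)
assert isinstance(sampler, UniformSettings)

# Back to the YAML-friendly dict form; sampler_parameters will also carry
# default fields such as the sampler seed.
print(ParameterRandomizationSettings.unstructure(sampler))
```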
ml-agents/mlagents/trainers/subprocess_env_manager.py (4 changes)


_send_response(EnvironmentCommand.BEHAVIOR_SPECS, env.behavior_specs)
elif req.cmd == EnvironmentCommand.ENVIRONMENT_PARAMETERS:
for k, v in req.payload.items():
if isinstance(v, float):
env_parameters.set_float_parameter(k, v)
elif isinstance(v, ParameterRandomizationSettings):
if isinstance(v, ParameterRandomizationSettings):
v.apply(k, env_parameters)
elif req.cmd == EnvironmentCommand.RESET:
env.reset()

ml-agents/mlagents/trainers/tests/test_config_conversion.py (101 changes)


import yaml
import pytest
from unittest import mock
from argparse import Namespace
from mlagents.trainers.upgrade_config import convert_behaviors, main, remove_nones
from mlagents.trainers.upgrade_config import convert_behaviors, remove_nones, convert
from mlagents.trainers.settings import (
TrainerType,
PPOSettings,

encoding_size: 128
"""
CURRICULUM = """
BigWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 200
signal_smoothing: true
parameters:
big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
big_wall_max_height: [4.0, 7.0, 8.0, 8.0]
SmallWallJump:
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
small_wall_height: [1.5, 2.0, 2.5, 4.0]
"""
RANDOMIZATION = """
resampling-interval: 5000
mass:
sampler-type: uniform
min_value: 0.5
max_value: 10
gravity:
sampler-type: uniform
min_value: 7
max_value: 12
scale:
sampler-type: uniform
min_value: 0.75
max_value: 3
"""
@pytest.mark.parametrize("use_recurrent", [True, False])
@pytest.mark.parametrize("trainer_type", [TrainerType.PPO, TrainerType.SAC])

assert RewardSignalType.CURIOSITY in trainer_settings.reward_signals
@mock.patch("mlagents.trainers.upgrade_config.convert_samplers")
@mock.patch("mlagents.trainers.upgrade_config.convert_behaviors")
@mock.patch("mlagents.trainers.upgrade_config.remove_nones")
@mock.patch("mlagents.trainers.upgrade_config.write_to_yaml_file")
@mock.patch("mlagents.trainers.upgrade_config.parse_args")
@mock.patch("mlagents.trainers.upgrade_config.load_config")
def test_main(
mock_load,
mock_parse,
yaml_write_mock,
remove_none_mock,
mock_convert_behaviors,
mock_convert_samplers,
):
test_output_file = "test.yaml"
mock_load.side_effect = [
yaml.safe_load(PPO_CONFIG),
"test_curriculum_config",
"test_sampler_config",
]
mock_args = Namespace(
trainer_config_path="mock",
output_config_path=test_output_file,
curriculum="test",
sampler="test",
)
mock_parse.return_value = mock_args
mock_convert_behaviors.return_value = "test_converted_config"
mock_convert_samplers.return_value = "test_converted_sampler_config"
dict_without_nones = mock.Mock(name="nonones")
remove_none_mock.return_value = dict_without_nones
def test_convert():
old_behaviors = yaml.safe_load(PPO_CONFIG)
old_curriculum = yaml.safe_load(CURRICULUM)
old_sampler = yaml.safe_load(RANDOMIZATION)
config = convert(old_behaviors, old_curriculum, old_sampler)
assert BRAIN_NAME in config["behaviors"]
assert "big_wall_min_height" in config["environment_parameters"]
curriculum = config["environment_parameters"]["big_wall_min_height"]["curriculum"]
assert len(curriculum) == 4
for i, expected_value in enumerate([0.0, 4.0, 6.0, 8.0]):
assert curriculum[i][f"Lesson{i}"]["value"] == expected_value
for i, threshold in enumerate([0.1, 0.3, 0.5]):
criteria = curriculum[i][f"Lesson{i}"]["completion_criteria"]
assert criteria["threshold"] == threshold
assert criteria["behavior"] == "BigWallJump"
assert criteria["signal_smoothing"]
assert criteria["min_lesson_length"] == 200
assert criteria["measure"] == "progress"
main()
saved_dict = remove_none_mock.call_args[0][0]
# Check that the output of the remove_none call is here
yaml_write_mock.assert_called_with(dict_without_nones, test_output_file)
assert saved_dict["behaviors"] == "test_converted_config"
assert saved_dict["curriculum"] == "test_curriculum_config"
assert saved_dict["parameter_randomization"] == "test_converted_sampler_config"
assert "gravity" in config["environment_parameters"]
gravity = config["environment_parameters"]["gravity"]
assert gravity["sampler_type"] == "uniform"
assert gravity["sampler_parameters"]["min_value"] == 7
assert gravity["sampler_parameters"]["max_value"] == 12
def test_remove_nones():

ml-agents/mlagents/trainers/tests/test_learn.py (61 changes)


from mlagents.trainers.cli_utils import DetectDefault
from mlagents_envs.exception import UnityEnvironmentException
from mlagents.trainers.stats import StatsReporter
from mlagents.trainers.settings import UniformSettings
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
def basic_options(extra_args=None):

debug: false
"""
MOCK_SAMPLER_CURRICULUM_YAML = """
parameter_randomization:
sampler1:
sampler_type: uniform
sampler_parameters:
min_value: 0.2
curriculum:
behavior1:
parameters:
foo: [0.2, 0.5]
behavior2:
parameters:
foo: [0.2, 0.5]
"""
@patch("mlagents.trainers.learn.write_timing_tree")
@patch("mlagents.trainers.learn.write_run_options")

mock_env.academy_name = "TestAcademyName"
create_environment_factory.return_value = mock_env
load_config.return_value = yaml.safe_load(MOCK_INITIALIZE_YAML)
mock_param_manager = MagicMock(return_value="mock_param_manager")
with patch.object(TrainerController, "__init__", mock_init):
with patch.object(TrainerController, "start_learning", MagicMock()):
options = basic_options()
learn.run_training(0, options)
mock_init.assert_called_once_with(
trainer_factory_mock.return_value, "results/ppo", "ppo", None, True, 0
)
handle_dir_mock.assert_called_once_with(
"results/ppo", False, False, "results/notuselessrun"
)
write_timing_tree_mock.assert_called_once_with("results/ppo/run_logs")
write_run_options_mock.assert_called_once_with("results/ppo", options)
with patch.object(EnvironmentParameterManager, "__new__", mock_param_manager):
with patch.object(TrainerController, "__init__", mock_init):
with patch.object(TrainerController, "start_learning", MagicMock()):
options = basic_options()
learn.run_training(0, options)
mock_init.assert_called_once_with(
trainer_factory_mock.return_value,
"results/ppo",
"ppo",
"mock_param_manager",
True,
0,
)
handle_dir_mock.assert_called_once_with(
"results/ppo", False, False, "results/notuselessrun"
)
write_timing_tree_mock.assert_called_once_with("results/ppo/run_logs")
write_run_options_mock.assert_called_once_with("results/ppo", options)
StatsReporter.writers.clear() # make sure there aren't any writers as added by learn.py

opt = parse_command_line(["mytrainerpath"])
assert opt.behaviors == {}
assert opt.env_settings.env_path is None
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.resume is False
assert opt.checkpoint_settings.inference is False
assert opt.checkpoint_settings.run_id == "ppo"

opt = parse_command_line(full_args)
assert opt.behaviors == {}
assert opt.env_settings.env_path == "./myenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.run_id == "myawesomerun"
assert opt.checkpoint_settings.initialize_from == "testdir"
assert opt.env_settings.seed == 7890

opt = parse_command_line(["mytrainerpath"])
assert opt.behaviors == {}
assert opt.env_settings.env_path == "./oldenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.run_id == "uselessrun"
assert opt.checkpoint_settings.initialize_from == "notuselessrun"
assert opt.env_settings.seed == 9870

opt = parse_command_line(full_args)
assert opt.behaviors == {}
assert opt.env_settings.env_path == "./myenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.run_id == "myawesomerun"
assert opt.env_settings.seed == 7890
assert opt.env_settings.base_port == 4004

assert opt.checkpoint_settings.inference is True
assert opt.checkpoint_settings.resume is True
@patch("builtins.open", new_callable=mock_open, read_data=MOCK_SAMPLER_CURRICULUM_YAML)
def test_sampler_configs(mock_file):
opt = parse_command_line(["mytrainerpath"])
assert isinstance(opt.parameter_randomization["sampler1"], UniformSettings)
assert len(opt.curriculum.keys()) == 2
@patch("builtins.open", new_callable=mock_open, read_data=MOCK_YAML)

ml-agents/mlagents/trainers/tests/test_settings.py (135 changes)


RewardSignalType,
RewardSignalSettings,
CuriositySettings,
ParameterRandomizationSettings,
EnvironmentParameterSettings,
ConstantSettings,
UniformSettings,
GaussianSettings,
MultiRangeUniformSettings,

NetworkSettings.MemorySettings(sequence_length=128, memory_size=0)
def test_parameter_randomization_structure():
def test_env_parameter_structure():
Tests the ParameterRandomizationSettings structure method and all validators.
Tests the EnvironmentParameterSettings structure method and all validators.
parameter_randomization_dict = {
env_params_dict = {
"mass": {
"sampler_type": "uniform",
"sampler_parameters": {"min_value": 1.0, "max_value": 2.0},

"sampler_type": "multirangeuniform",
"sampler_parameters": {"intervals": [[1.0, 2.0], [3.0, 4.0]]},
},
"gravity": 1,
"wall_height": {
"curriculum": [
{
"name": "Lesson1",
"completion_criteria": {
"measure": "reward",
"behavior": "fake_behavior",
"threshold": 10,
},
"value": 1,
},
{"value": 4, "name": "Lesson2"},
]
},
parameter_randomization_distributions = ParameterRandomizationSettings.structure(
parameter_randomization_dict, Dict[str, ParameterRandomizationSettings]
env_param_settings = EnvironmentParameterSettings.structure(
env_params_dict, Dict[str, EnvironmentParameterSettings]
)
assert isinstance(env_param_settings["mass"].curriculum[0].value, UniformSettings)
assert isinstance(env_param_settings["scale"].curriculum[0].value, GaussianSettings)
assert isinstance(
env_param_settings["length"].curriculum[0].value, MultiRangeUniformSettings
)
assert isinstance(
env_param_settings["wall_height"].curriculum[0].value, ConstantSettings
assert isinstance(parameter_randomization_distributions["mass"], UniformSettings)
assert isinstance(parameter_randomization_distributions["scale"], GaussianSettings)
parameter_randomization_distributions["length"], MultiRangeUniformSettings
env_param_settings["wall_height"].curriculum[1].value, ConstantSettings
)
# Check invalid distribution type

}
}
with pytest.raises(ValueError):
ParameterRandomizationSettings.structure(
invalid_distribution_dict, Dict[str, ParameterRandomizationSettings]
EnvironmentParameterSettings.structure(
invalid_distribution_dict, Dict[str, EnvironmentParameterSettings]
)
# Check min less than max in uniform

}
}
with pytest.raises(TrainerConfigError):
ParameterRandomizationSettings.structure(
invalid_distribution_dict, Dict[str, ParameterRandomizationSettings]
EnvironmentParameterSettings.structure(
invalid_distribution_dict, Dict[str, EnvironmentParameterSettings]
)
# Check min less than max in multirange

}
}
with pytest.raises(TrainerConfigError):
ParameterRandomizationSettings.structure(
invalid_distribution_dict, Dict[str, ParameterRandomizationSettings]
EnvironmentParameterSettings.structure(
invalid_distribution_dict, Dict[str, EnvironmentParameterSettings]
)
# Check multirange has valid intervals

}
}
with pytest.raises(TrainerConfigError):
ParameterRandomizationSettings.structure(
invalid_distribution_dict, Dict[str, ParameterRandomizationSettings]
EnvironmentParameterSettings.structure(
invalid_distribution_dict, Dict[str, EnvironmentParameterSettings]
ParameterRandomizationSettings.structure(
"notadict", Dict[str, ParameterRandomizationSettings]
EnvironmentParameterSettings.structure(
"notadict", Dict[str, EnvironmentParameterSettings]
)
invalid_curriculum_dict = {
"wall_height": {
"curriculum": [
{
"name": "Lesson1",
"completion_criteria": {
"measure": "progress",
"behavior": "fake_behavior",
"threshold": 10,
}, # > 1 is too large
"value": 1,
},
{"value": 4, "name": "Lesson2"},
]
}
}
with pytest.raises(TrainerConfigError):
EnvironmentParameterSettings.structure(
invalid_curriculum_dict, Dict[str, EnvironmentParameterSettings]
)

train_model: false
inference: false
debug: true
environment_parameters:
big_wall_height:
curriculum:
- name: Lesson0
completion_criteria:
measure: progress
behavior: BigWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.1
value:
sampler_type: uniform
sampler_parameters:
min_value: 0.0
max_value: 4.0
- name: Lesson1
completion_criteria:
measure: reward
behavior: BigWallJump
signal_smoothing: true
min_lesson_length: 100
threshold: 0.2
value:
sampler_type: gaussian
sampler_parameters:
mean: 4.0
st_dev: 7.0
- name: Lesson2
completion_criteria:
measure: progress
behavior: BigWallJump
signal_smoothing: true
min_lesson_length: 20
threshold: 0.3
value:
sampler_type: multirangeuniform
sampler_parameters:
intervals: [[1.0, 2.0],[4.0, 5.0]]
- name: Lesson3
value: 8.0
small_wall_height: 42.0
other_wall_height:
sampler_type: multirangeuniform
sampler_parameters:
intervals: [[1.0, 2.0],[4.0, 5.0]]
"""
if not use_defaults:
loaded_yaml = yaml.safe_load(test_yaml)

dict_export = run_options.as_dict()
if not use_defaults: # Don't need to check if no yaml
check_dict_is_at_least(loaded_yaml, dict_export)
check_dict_is_at_least(
loaded_yaml, dict_export, exceptions=["environment_parameters"]
)
check_dict_is_at_least(dict_export, second_export)
# Should be able to use equality instead of back-and-forth once environment_parameters
# is working
check_dict_is_at_least(second_export, dict_export)
# Check that the two exports are the same
assert dict_export == second_export

ml-agents/mlagents/trainers/tests/test_simple_rl.py (9 changes)


TrainerType,
RewardSignalType,
)
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
from mlagents.trainers.models import EncoderType, ScheduleType
from mlagents_envs.side_channel.environment_parameters_channel import (
EnvironmentParametersChannel,

env,
trainer_config,
reward_processor=default_reward_processor,
meta_curriculum=None,
env_parameter_manager=None,
if env_parameter_manager is None:
env_parameter_manager = EnvironmentParameterManager()
# Create controller and begin training.
with tempfile.TemporaryDirectory() as dir:
run_id = "id"

train_model=True,
load_model=False,
seed=seed,
meta_curriculum=meta_curriculum,
param_manager=env_parameter_manager,
multi_gpu=False,
)

run_id=run_id,
meta_curriculum=meta_curriculum,
param_manager=env_parameter_manager,
train=True,
training_seed=seed,
)

ml-agents/mlagents/trainers/tests/test_trainer_controller.py (5 changes)


from mlagents.tf_utils import tf
from mlagents.trainers.trainer_controller import TrainerController
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
from mlagents.trainers.ghost.controller import GhostController

trainer_factory=trainer_factory_mock,
output_path="test_model_path",
run_id="test_run_id",
meta_curriculum=None,
param_manager=EnvironmentParameterManager(),
train=True,
training_seed=99,
)

trainer_factory=trainer_factory_mock,
output_path="",
run_id="1",
meta_curriculum=None,
param_manager=None,
train=True,
training_seed=seed,
)

ml-agents/mlagents/trainers/tests/test_trainer_util.py (3 changes)


from mlagents.trainers.exception import TrainerConfigError, UnityTrainerException
from mlagents.trainers.settings import RunOptions
from mlagents.trainers.tests.test_simple_rl import PPO_CONFIG
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
@pytest.fixture

train_model=train_model,
load_model=load_model,
seed=seed,
param_manager=EnvironmentParameterManager(),
)
trainers = {}
for brain_name in training_behaviors.keys():

train_model=True,
load_model=False,
seed=42,
param_manager=EnvironmentParameterManager(),
)
trainer_factory.generate(brain_name)

ml-agents/mlagents/trainers/trainer_controller.py (104 changes)


import os
import threading
from typing import Dict, Optional, Set, List
from typing import Dict, Set, List
from collections import defaultdict
import numpy as np

merge_gauges,
)
from mlagents.trainers.trainer import Trainer
from mlagents.trainers.meta_curriculum import MetaCurriculum
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
from mlagents.trainers.settings import CurriculumSettings
from mlagents.trainers.training_status import GlobalTrainingStatus, StatusType
class TrainerController(object):

output_path: str,
run_id: str,
meta_curriculum: Optional[MetaCurriculum],
param_manager: EnvironmentParameterManager,
train: bool,
training_seed: int,
):

:param run_id: The sub-directory name for model and summary statistics
:param meta_curriculum: MetaCurriculum object which stores information about all curricula.
:param param_manager: EnvironmentParameterManager object which stores information about all
environment parameters.
:param train: Whether to train model, or only run inference.
:param training_seed: Seed to use for Numpy and Tensorflow random number generation.
:param threaded: Whether or not to run trainers in a separate thread. Disable for testing/debugging.

self.logger = get_logger(__name__)
self.run_id = run_id
self.train_model = train
self.meta_curriculum = meta_curriculum
self.param_manager = param_manager
self.ghost_controller = self.trainer_factory.ghost_controller
self.trainer_threads: List[threading.Thread] = []

def _get_measure_vals(self):
brain_names_to_measure_vals = {}
if self.meta_curriculum:
for (
brain_name,
curriculum,
) in self.meta_curriculum.brains_to_curricula.items():
# Skip brains that are in the metacurriculum but no trainer yet.
if brain_name not in self.trainers:
continue
if curriculum.measure == CurriculumSettings.MeasureType.PROGRESS:
measure_val = self.trainers[brain_name].get_step / float(
self.trainers[brain_name].get_max_steps
)
brain_names_to_measure_vals[brain_name] = measure_val
elif curriculum.measure == CurriculumSettings.MeasureType.REWARD:
measure_val = np.mean(self.trainers[brain_name].reward_buffer)
brain_names_to_measure_vals[brain_name] = measure_val
else:
for brain_name, trainer in self.trainers.items():
measure_val = np.mean(trainer.reward_buffer)
brain_names_to_measure_vals[brain_name] = measure_val
return brain_names_to_measure_vals
@timed
def _save_model(self):
"""

A Data structure corresponding to the initial reset state of the
environment.
"""
new_meta_curriculum_config = (
self.meta_curriculum.get_config() if self.meta_curriculum else {}
)
env.reset(config=new_meta_curriculum_config)
new_config = self.param_manager.get_current_samplers()
env.reset(config=new_config)
def _not_done_training(self) -> bool:
return (

self._save_model()
self._export_graph()
def end_trainer_episodes(
self, env: EnvManager, lessons_incremented: Dict[str, bool]
) -> None:
self._reset_env(env)
def end_trainer_episodes(self) -> None:
for brain_name, changed in lessons_incremented.items():
if changed:
self.trainers[brain_name].reward_buffer.clear()
if self.meta_curriculum:
# Get the sizes of the reward buffers.
reward_buff_sizes = {
k: len(t.reward_buffer) for (k, t) in self.trainers.items()
}
# Attempt to increment the lessons of the brains who
# were ready.
lessons_incremented = self.meta_curriculum.increment_lessons(
self._get_measure_vals(), reward_buff_sizes=reward_buff_sizes
)
else:
lessons_incremented = {}
# If any lessons were incremented or the environment is
# ready to be reset
meta_curriculum_reset = any(lessons_incremented.values())
# Get the sizes of the reward buffers.
reward_buff = {k: list(t.reward_buffer) for (k, t) in self.trainers.items()}
curr_step = {k: int(t.step) for (k, t) in self.trainers.items()}
max_step = {k: int(t.get_max_steps) for (k, t) in self.trainers.items()}
# Attempt to increment the lessons of the brains who
# were ready.
updated, param_must_reset = self.param_manager.update_lessons(
curr_step, max_step, reward_buff
)
if updated:
for trainer in self.trainers.values():
trainer.reward_buffer.clear()
if meta_curriculum_reset or ghost_controller_reset:
self.end_trainer_episodes(env, lessons_incremented)
if param_must_reset or ghost_controller_reset:
self._reset_env(env) # This reset also sends the new config to env
self.end_trainer_episodes()
elif updated:
env.set_env_parameters(self.param_manager.get_current_samplers())
@timed
def advance(self, env: EnvManager) -> int:

# Report current lesson
if self.meta_curriculum:
for brain_name, curr in self.meta_curriculum.brains_to_curricula.items():
if brain_name in self.trainers:
self.trainers[brain_name].stats_reporter.set_stat(
"Environment/Lesson", curr.lesson_num
)
GlobalTrainingStatus.set_parameter_state(
brain_name, StatusType.LESSON_NUM, curr.lesson_num
)
# Report current lesson for each environment parameter
for (
param_name,
lesson_number,
) in self.param_manager.get_current_lesson_number().items():
for trainer in self.trainers.values():
trainer.stats_reporter.set_stat(
f"Environment/Lesson/{param_name}", lesson_number
)
for trainer in self.trainers.values():
if not trainer.threaded:

ml-agents/mlagents/trainers/trainer_util.py (24 changes)


from typing import Dict
from mlagents_envs.logging_util import get_logger
from mlagents.trainers.meta_curriculum import MetaCurriculum
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
from mlagents.trainers.exception import TrainerConfigError
from mlagents.trainers.trainer import Trainer
from mlagents.trainers.exception import UnityTrainerException

train_model: bool,
load_model: bool,
seed: int,
param_manager: EnvironmentParameterManager,
meta_curriculum: MetaCurriculum = None,
multi_gpu: bool = False,
):
self.trainer_config = trainer_config

self.load_model = load_model
self.seed = seed
self.meta_curriculum = meta_curriculum
self.param_manager = param_manager
self.multi_gpu = multi_gpu
self.ghost_controller = GhostController()

self.load_model,
self.ghost_controller,
self.seed,
self.param_manager,
self.meta_curriculum,
self.multi_gpu,
)

load_model: bool,
ghost_controller: GhostController,
seed: int,
param_manager: EnvironmentParameterManager,
meta_curriculum: MetaCurriculum = None,
multi_gpu: bool = False,
) -> Trainer:
"""

:param load_model: Whether to load the model or randomly initialize
:param ghost_controller: The object that coordinates ghost trainers
:param seed: The random seed to use
:param param_manager: EnvironmentParameterManager, used to determine a reward buffer length for PPOTrainer
:param meta_curriculum: Optional meta_curriculum, used to determine a reward buffer length for PPOTrainer
:return:
"""
trainer_artifact_path = os.path.join(output_path, brain_name)

min_lesson_length = 1
if meta_curriculum:
if brain_name in meta_curriculum.brains_to_curricula:
min_lesson_length = meta_curriculum.brains_to_curricula[
brain_name
].min_lesson_length
else:
logger.warning(
f"Metacurriculum enabled, but no curriculum for brain {brain_name}. "
f"Brains with curricula: {meta_curriculum.brains_to_curricula.keys()}. "
)
min_lesson_length = param_manager.get_minimum_reward_buffer_size(brain_name)
trainer: Trainer = None # type: ignore # will be set to one of these, or raise
trainer_type = trainer_settings.trainer_type

125
ml-agents/mlagents/trainers/upgrade_config.py


import attr
import cattr
import yaml
from typing import Dict, Any
from typing import Dict, Any, Optional
import argparse
from mlagents.trainers.settings import TrainerSettings, NetworkSettings, TrainerType
from mlagents.trainers.cli_utils import load_config

return new_sampler_config
def convert_samplers_and_curriculum(
parameter_dict: Dict[str, Any], curriculum: Dict[str, Any]
) -> Dict[str, Any]:
for key, sampler in parameter_dict.items():
if "sampler_parameters" not in sampler:
parameter_dict[key]["sampler_parameters"] = {}
for argument in [
"seed",
"min_value",
"max_value",
"mean",
"st_dev",
"intervals",
]:
if argument in sampler:
parameter_dict[key]["sampler_parameters"][argument] = sampler[argument]
parameter_dict[key].pop(argument)
param_set = set(parameter_dict.keys())
for behavior_name, behavior_dict in curriculum.items():
measure = behavior_dict["measure"]
min_lesson_length = behavior_dict.get("min_lesson_length", 1)
signal_smoothing = behavior_dict.get("signal_smoothing", False)
thresholds = behavior_dict["thresholds"]
num_lessons = len(thresholds) + 1
parameters = behavior_dict["parameters"]
for param_name in parameters.keys():
if param_name in param_set:
print(
f"The parameter {param_name} has both a sampler and a curriculum. Will ignore curriculum"
)
else:
param_set.add(param_name)
parameter_dict[param_name] = {"curriculum": []}
for lesson_index in range(num_lessons - 1):
parameter_dict[param_name]["curriculum"].append(
{
f"Lesson{lesson_index}": {
"completion_criteria": {
"measure": measure,
"behavior": behavior_name,
"signal_smoothing": signal_smoothing,
"min_lesson_length": min_lesson_length,
"threshold": thresholds[lesson_index],
},
"value": parameters[param_name][lesson_index],
}
}
)
lesson_index += 1 # This is the last lesson
parameter_dict[param_name]["curriculum"].append(
{
f"Lesson{lesson_index}": {
"value": parameters[param_name][lesson_index]
}
}
)
return parameter_dict
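Since convert_samplers_and_curriculum both normalizes flat sampler arguments into sampler_parameters and folds the old per-behavior curricula into per-parameter curricula, a worked example may help. The parameter and behavior names ("mass", "wall_height", "MyBehavior") are made up, and this assumes the package from this branch is importable; note that the input dict is mutated in place and returned.

from mlagents.trainers.upgrade_config import convert_samplers_and_curriculum

old_samplers = {"mass": {"sampler_type": "uniform", "min_value": 0.5, "max_value": 10}}
old_curriculum = {
    "MyBehavior": {
        "measure": "progress",
        "thresholds": [0.5],
        "parameters": {"wall_height": [4.0, 8.0]},
    }
}
new_params = convert_samplers_and_curriculum(old_samplers, old_curriculum)
# new_params["mass"] == {"sampler_type": "uniform",
#                        "sampler_parameters": {"min_value": 0.5, "max_value": 10}}
# new_params["wall_height"] == {"curriculum": [
#     {"Lesson0": {"completion_criteria": {"measure": "progress",
#                                           "behavior": "MyBehavior",
#                                           "signal_smoothing": False,
#                                           "min_lesson_length": 1,
#                                           "threshold": 0.5},
#                  "value": 4.0}},
#     {"Lesson1": {"value": 8.0}}]}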
def parse_args():
argparser = argparse.ArgumentParser(
formatter_class=argparse.ArgumentDefaultsHelpFormatter

help="Path to old format (<=0.16.X) trainer configuration YAML.",
help="Path to old format (<=0.18.X) trainer configuration YAML.",
)
argparser.add_argument(
"--curriculum",

return args
def convert(
config: Dict[str, Any],
old_curriculum: Optional[Dict[str, Any]],
old_param_random: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
if "behaviors" not in config:
print("Config file format version : version <= 0.16.X")
behavior_config_dict = convert_behaviors(config)
full_config = {"behaviors": behavior_config_dict}
# Convert curriculum and sampler. note that we don't validate these; if it was correct
# before it should be correct now.
if old_curriculum is not None:
full_config["curriculum"] = old_curriculum
if old_param_random is not None:
sampler_config_dict = convert_samplers(old_param_random)
full_config["parameter_randomization"] = sampler_config_dict
# Convert config to dict
config = cattr.unstructure(full_config)
if "curriculum" in config or "parameter_randomization" in config:
print("Config file format version : 0.16.X < version <= 0.18.X")
full_config = {"behaviors": config["behaviors"]}
param_randomization = config.get("parameter_randomization", {})
if "resampling-interval" in param_randomization:
param_randomization.pop("resampling-interval")
if len(param_randomization) > 0:
# check if we use the old format sampler-type vs sampler_type
if (
"sampler-type"
in param_randomization[list(param_randomization.keys())[0]]
):
param_randomization = convert_samplers(param_randomization)
full_config["environment_parameters"] = convert_samplers_and_curriculum(
param_randomization, config.get("curriculum", {})
)
# Convert config to dict
config = cattr.unstructure(full_config)
return config
def main() -> None:
args = parse_args()
print(

old_config = load_config(args.trainer_config_path)
behavior_config_dict = convert_behaviors(old_config)
full_config = {"behaviors": behavior_config_dict}
# Convert curriculum and sampler. note that we don't validate these; if it was correct
# before it should be correct now.
curriculum_config_dict = None
old_sampler_config_dict = None
full_config["curriculum"] = curriculum_config_dict
sampler_config_dict = convert_samplers(old_sampler_config_dict)
full_config["parameter_randomization"] = sampler_config_dict
# Convert config to dict
unstructed_config = cattr.unstructure(full_config)
unstructed_config = remove_nones(unstructed_config)
new_config = convert(old_config, curriculum_config_dict, old_sampler_config_dict)
unstructed_config = remove_nones(new_config)
write_to_yaml_file(unstructed_config, args.output_config_path)
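For reference, the same conversion can be driven from Python using the helpers main() relies on above; the file paths here are illustrative, and load_config comes from mlagents.trainers.cli_utils as imported at the top of this file.

from mlagents.trainers.cli_utils import load_config
from mlagents.trainers.upgrade_config import convert, remove_nones, write_to_yaml_file

old_config = load_config("config/trainer_config.old.yaml")
old_curriculum = load_config("config/curriculum.old.yaml")  # or None if there was none
new_config = convert(old_config, old_curriculum, old_param_random=None)
write_to_yaml_file(remove_nones(new_config), "config/ppo/MyBehavior.yaml")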

156
ml-agents/mlagents/trainers/environment_parameter_manager.py


from typing import Dict, List, Tuple, Optional
from mlagents.trainers.settings import (
EnvironmentParameterSettings,
ParameterRandomizationSettings,
)
from collections import defaultdict
from mlagents.trainers.training_status import GlobalTrainingStatus, StatusType
from mlagents_envs.logging_util import get_logger
logger = get_logger(__name__)
class EnvironmentParameterManager:
def __init__(
self,
settings: Optional[Dict[str, EnvironmentParameterSettings]] = None,
run_seed: int = -1,
restore: bool = False,
):
"""
EnvironmentParameterManager manages all the environment parameters of a training
session. It determines when parameters should change and gives access to the
current sampler of each parameter.
:param settings: A dictionary from environment parameter to
EnvironmentParameterSettings.
:param run_seed: When the seed is not provided for an environment parameter,
this seed will be used instead.
:param restore: If true, the EnvironmentParameterManager will use the
GlobalTrainingStatus to try and reload the lesson status of each environment
parameter.
"""
if settings is None:
settings = {}
self._dict_settings = settings
for parameter_name in self._dict_settings.keys():
initial_lesson = GlobalTrainingStatus.get_parameter_state(
parameter_name, StatusType.LESSON_NUM
)
if initial_lesson is None or not restore:
GlobalTrainingStatus.set_parameter_state(
parameter_name, StatusType.LESSON_NUM, 0
)
self._smoothed_values: Dict[str, float] = defaultdict(float)
for key in self._dict_settings.keys():
self._smoothed_values[key] = 0.0
# Update the seeds of the samplers
self._set_sampler_seeds(run_seed)
def _set_sampler_seeds(self, seed):
"""
Sets the seeds for the samplers (if no seed was already present), using the
provided seed as a base with a distinct offset for each sampler.
"""
offset = 0
for settings in self._dict_settings.values():
for lesson in settings.curriculum:
if lesson.value.seed == -1:
lesson.value.seed = seed + offset
offset += 1
def get_minimum_reward_buffer_size(self, behavior_name: str) -> int:
"""
Calculates the minimum size of the reward buffer a behavior must use. This
method uses the 'min_lesson_length' setting of the completion criteria to determine this value.
:param behavior_name: The name of the behavior the minimum reward buffer
size corresponds to.
"""
result = 1
for settings in self._dict_settings.values():
for lesson in settings.curriculum:
if lesson.completion_criteria is not None:
if lesson.completion_criteria.behavior == behavior_name:
result = max(
result, lesson.completion_criteria.min_lesson_length
)
return result
def get_current_samplers(self) -> Dict[str, ParameterRandomizationSettings]:
"""
Creates a dictionary from environment parameter name to their corresponding
ParameterRandomizationSettings. If curriculum is used, the
ParameterRandomizationSettings corresponds to the sampler of the current lesson.
"""
samplers: Dict[str, ParameterRandomizationSettings] = {}
for param_name, settings in self._dict_settings.items():
lesson_num = GlobalTrainingStatus.get_parameter_state(
param_name, StatusType.LESSON_NUM
)
lesson = settings.curriculum[lesson_num]
samplers[param_name] = lesson.value
return samplers
def get_current_lesson_number(self) -> Dict[str, int]:
"""
Creates a dictionary from environment parameter to the current lesson number.
If not using curriculum, this number is always 0 for that environment parameter.
"""
result: Dict[str, int] = {}
for parameter_name in self._dict_settings.keys():
result[parameter_name] = GlobalTrainingStatus.get_parameter_state(
parameter_name, StatusType.LESSON_NUM
)
return result
def update_lessons(
self,
trainer_steps: Dict[str, int],
trainer_max_steps: Dict[str, int],
trainer_reward_buffer: Dict[str, List[float]],
) -> Tuple[bool, bool]:
"""
Given progress metrics, calculates if at least one environment parameter is
in a new lesson and if at least one environment parameter requires the env
to reset.
:param trainer_steps: A dictionary from behavior_name to the number of training
steps this behavior's trainer has performed.
:param trainer_max_steps: A dictionary from behavior_name to the maximum number
of training steps this behavior's trainer is allowed to perform (its max_steps).
:param trainer_reward_buffer: A dictionary from behavior_name to the list of
the most recent episode returns for this behavior's trainer.
:returns: A tuple of two booleans: (True if any lesson has changed, True if
environment needs to reset)
"""
must_reset = False
updated = False
for param_name, settings in self._dict_settings.items():
lesson_num = GlobalTrainingStatus.get_parameter_state(
param_name, StatusType.LESSON_NUM
)
lesson = settings.curriculum[lesson_num]
if (
lesson.completion_criteria is not None
and len(settings.curriculum) > lesson_num
):
behavior_to_consider = lesson.completion_criteria.behavior
if behavior_to_consider in trainer_steps:
must_increment, new_smoothing = lesson.completion_criteria.need_increment(
float(trainer_steps[behavior_to_consider])
/ float(trainer_max_steps[behavior_to_consider]),
trainer_reward_buffer[behavior_to_consider],
self._smoothed_values[param_name],
)
self._smoothed_values[param_name] = new_smoothing
if must_increment:
GlobalTrainingStatus.set_parameter_state(
param_name, StatusType.LESSON_NUM, lesson_num + 1
)
new_lesson_name = settings.curriculum[lesson_num + 1].name
logger.info(
f"Parameter '{param_name}' has changed. Now in lesson '{new_lesson_name}'"
)
updated = True
if lesson.completion_criteria.require_reset:
must_reset = True
return updated, must_reset

256
ml-agents/mlagents/trainers/tests/test_env_param_manager.py


import pytest
import yaml
from mlagents.trainers.exception import TrainerConfigError
from mlagents.trainers.environment_parameter_manager import EnvironmentParameterManager
from mlagents.trainers.settings import (
RunOptions,
UniformSettings,
GaussianSettings,
ConstantSettings,
CompletionCriteriaSettings,
)
test_sampler_config_yaml = """
environment_parameters:
param_1:
sampler_type: uniform
sampler_parameters:
min_value: 0.5
max_value: 10
"""
def test_sampler_conversion():
run_options = RunOptions.from_dict(yaml.safe_load(test_sampler_config_yaml))
assert run_options.environment_parameters is not None
assert "param_1" in run_options.environment_parameters
lessons = run_options.environment_parameters["param_1"].curriculum
assert len(lessons) == 1
assert lessons[0].completion_criteria is None
assert isinstance(lessons[0].value, UniformSettings)
assert lessons[0].value.min_value == 0.5
assert lessons[0].value.max_value == 10
test_sampler_and_constant_config_yaml = """
environment_parameters:
param_1:
sampler_type: gaussian
sampler_parameters:
mean: 4
st_dev: 5
param_2: 20
"""
def test_sampler_and_constant_conversion():
run_options = RunOptions.from_dict(
yaml.safe_load(test_sampler_and_constant_config_yaml)
)
assert "param_1" in run_options.environment_parameters
assert "param_2" in run_options.environment_parameters
lessons_1 = run_options.environment_parameters["param_1"].curriculum
lessons_2 = run_options.environment_parameters["param_2"].curriculum
# gaussian
assert isinstance(lessons_1[0].value, GaussianSettings)
assert lessons_1[0].value.mean == 4
assert lessons_1[0].value.st_dev == 5
# constant
assert isinstance(lessons_2[0].value, ConstantSettings)
assert lessons_2[0].value.value == 20
test_curriculum_config_yaml = """
environment_parameters:
param_1:
curriculum:
- name: Lesson1
completion_criteria:
measure: reward
behavior: fake_behavior
threshold: 30
min_lesson_length: 100
require_reset: true
value: 1
- name: Lesson2
completion_criteria:
measure: reward
behavior: fake_behavior
threshold: 60
min_lesson_length: 100
require_reset: false
value: 2
- name: Lesson3
value:
sampler_type: uniform
sampler_parameters:
min_value: 1
max_value: 3
"""
def test_curriculum_conversion():
run_options = RunOptions.from_dict(yaml.safe_load(test_curriculum_config_yaml))
assert "param_1" in run_options.environment_parameters
lessons = run_options.environment_parameters["param_1"].curriculum
assert len(lessons) == 3
# First lesson
lesson = lessons[0]
assert lesson.completion_criteria is not None
assert (
lesson.completion_criteria.measure
== CompletionCriteriaSettings.MeasureType.REWARD
)
assert lesson.completion_criteria.behavior == "fake_behavior"
assert lesson.completion_criteria.threshold == 30.0
assert lesson.completion_criteria.min_lesson_length == 100
assert lesson.completion_criteria.require_reset
assert isinstance(lesson.value, ConstantSettings)
assert lesson.value.value == 1
# Second lesson
lesson = lessons[1]
assert lesson.completion_criteria is not None
assert (
lesson.completion_criteria.measure
== CompletionCriteriaSettings.MeasureType.REWARD
)
assert lesson.completion_criteria.behavior == "fake_behavior"
assert lesson.completion_criteria.threshold == 60.0
assert lesson.completion_criteria.min_lesson_length == 100
assert not lesson.completion_criteria.require_reset
assert isinstance(lesson.value, ConstantSettings)
assert lesson.value.value == 2
# Last lesson
lesson = lessons[2]
assert lesson.completion_criteria is None
assert isinstance(lesson.value, UniformSettings)
assert lesson.value.min_value == 1
assert lesson.value.max_value == 3
test_bad_curriculum_no_competion_criteria_config_yaml = """
environment_parameters:
param_1:
curriculum:
- name: Lesson1
completion_criteria:
measure: reward
behavior: fake_behavior
threshold: 30
min_lesson_length: 100
require_reset: true
value: 1
- name: Lesson2
value: 2
- name: Lesson3
value:
sampler_type: uniform
sampler_parameters:
min_value: 1
max_value: 3
"""
def test_curriculum_raises_no_completion_criteria_conversion():
with pytest.raises(TrainerConfigError):
RunOptions.from_dict(
yaml.safe_load(test_bad_curriculum_no_competion_criteria_config_yaml)
)
test_everything_config_yaml = """
environment_parameters:
param_1:
curriculum:
- name: Lesson1
completion_criteria:
measure: reward
behavior: fake_behavior
threshold: 30
min_lesson_length: 100
require_reset: true
value: 1
- name: Lesson2
completion_criteria:
measure: progress
behavior: fake_behavior
threshold: 0.5
min_lesson_length: 100
require_reset: false
value: 2
- name: Lesson3
value:
sampler_type: uniform
sampler_parameters:
min_value: 1
max_value: 3
param_2:
sampler_type: gaussian
sampler_parameters:
mean: 4
st_dev: 5
param_3: 20
"""
def test_create_manager():
run_options = RunOptions.from_dict(yaml.safe_load(test_everything_config_yaml))
param_manager = EnvironmentParameterManager(
run_options.environment_parameters, 1337, False
)
assert param_manager.get_minimum_reward_buffer_size("fake_behavior") == 100
assert param_manager.get_current_lesson_number() == {
"param_1": 0,
"param_2": 0,
"param_3": 0,
}
assert param_manager.get_current_samplers() == {
"param_1": ConstantSettings(seed=1337, value=1),
"param_2": GaussianSettings(seed=1337 + 3, mean=4, st_dev=5),
"param_3": ConstantSettings(seed=1337 + 3 + 1, value=20),
}
# Not enough episodes completed
assert param_manager.update_lessons(
trainer_steps={"fake_behavior": 500},
trainer_max_steps={"fake_behavior": 1000},
trainer_reward_buffer={"fake_behavior": [1000] * 99},
) == (False, False)
# Not enough episodes reward
assert param_manager.update_lessons(
trainer_steps={"fake_behavior": 500},
trainer_max_steps={"fake_behavior": 1000},
trainer_reward_buffer={"fake_behavior": [1] * 101},
) == (False, False)
assert param_manager.update_lessons(
trainer_steps={"fake_behavior": 500},
trainer_max_steps={"fake_behavior": 1000},
trainer_reward_buffer={"fake_behavior": [1000] * 101},
) == (True, True)
assert param_manager.get_current_lesson_number() == {
"param_1": 1,
"param_2": 0,
"param_3": 0,
}
param_manager_2 = EnvironmentParameterManager(
run_options.environment_parameters, 1337, restore=True
)
# The use of global status should make it so that the lesson numbers are maintained
assert param_manager_2.get_current_lesson_number() == {
"param_1": 1,
"param_2": 0,
"param_3": 0,
}
# No reset required
assert param_manager.update_lessons(
trainer_steps={"fake_behavior": 700},
trainer_max_steps={"fake_behavior": 1000},
trainer_reward_buffer={"fake_behavior": [0] * 101},
) == (True, False)
assert param_manager.get_current_samplers() == {
"param_1": UniformSettings(seed=1337 + 2, min_value=1, max_value=3),
"param_2": GaussianSettings(seed=1337 + 3, mean=4, st_dev=5),
"param_3": ConstantSettings(seed=1337 + 3 + 1, value=20),
}
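The seed offsets asserted above (1337, 1337 + 2, 1337 + 3, 1337 + 3 + 1) come from _set_sampler_seeds in environment_parameter_manager.py: every lesson of every parameter, in declaration order, receives run_seed plus an incrementing offset. A small sketch that reproduces the arithmetic for this config (lesson counts taken from test_everything_config_yaml):

run_seed = 1337
lessons_per_param = {"param_1": 3, "param_2": 1, "param_3": 1}
offset = 0
for name, n_lessons in lessons_per_param.items():
    for lesson_index in range(n_lessons):
        print(f"{name} lesson {lesson_index}: seed {run_seed + offset}")
        offset += 1
# param_1 lessons get 1337..1339, param_2 gets 1340 (1337 + 3), param_3 gets 1341.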

77
ml-agents/mlagents/trainers/tests/test_curriculum.py


import pytest
from mlagents.trainers.exception import CurriculumConfigError
from mlagents.trainers.curriculum import Curriculum
from mlagents.trainers.settings import CurriculumSettings
dummy_curriculum_config = CurriculumSettings(
measure="reward",
thresholds=[10, 20, 50],
min_lesson_length=3,
signal_smoothing=True,
parameters={
"param1": [0.7, 0.5, 0.3, 0.1],
"param2": [100, 50, 20, 15],
"param3": [0.2, 0.3, 0.7, 0.9],
},
)
bad_curriculum_config = CurriculumSettings(
measure="reward",
thresholds=[10, 20, 50],
min_lesson_length=3,
signal_smoothing=False,
parameters={
"param1": [0.7, 0.5, 0.3, 0.1],
"param2": [100, 50, 20],
"param3": [0.2, 0.3, 0.7, 0.9],
},
)
@pytest.fixture
def default_reset_parameters():
return {"param1": 1, "param2": 1, "param3": 1}
def test_init_curriculum_happy_path():
curriculum = Curriculum("TestBrain", dummy_curriculum_config)
assert curriculum.brain_name == "TestBrain"
assert curriculum.lesson_num == 0
assert curriculum.measure == "reward"
def test_increment_lesson():
curriculum = Curriculum("TestBrain", dummy_curriculum_config)
assert curriculum.lesson_num == 0
curriculum.lesson_num = 1
assert curriculum.lesson_num == 1
assert not curriculum.increment_lesson(10)
assert curriculum.lesson_num == 1
assert curriculum.increment_lesson(30)
assert curriculum.lesson_num == 2
assert not curriculum.increment_lesson(30)
assert curriculum.lesson_num == 2
assert curriculum.increment_lesson(10000)
assert curriculum.lesson_num == 3
def test_get_parameters():
curriculum = Curriculum("TestBrain", dummy_curriculum_config)
assert curriculum.get_config() == {"param1": 0.7, "param2": 100, "param3": 0.2}
curriculum.lesson_num = 2
assert curriculum.get_config() == {"param1": 0.3, "param2": 20, "param3": 0.7}
assert curriculum.get_config(0) == {"param1": 0.7, "param2": 100, "param3": 0.2}
def test_load_bad_curriculum_file_raises_error():
with pytest.raises(CurriculumConfigError):
Curriculum("TestBrain", bad_curriculum_config)

136
ml-agents/mlagents/trainers/tests/test_meta_curriculum.py


import pytest
from unittest.mock import patch, Mock, call
import yaml
import cattr
from mlagents.trainers.meta_curriculum import MetaCurriculum
from mlagents.trainers.tests.simple_test_envs import SimpleEnvironment
from mlagents.trainers.tests.test_simple_rl import (
_check_environment_trains,
BRAIN_NAME,
PPO_CONFIG,
)
from mlagents.trainers.tests.test_curriculum import dummy_curriculum_config
from mlagents.trainers.settings import CurriculumSettings
from mlagents.trainers.training_status import StatusType
@pytest.fixture
def measure_vals():
return {"Brain1": 0.2, "Brain2": 0.3}
@pytest.fixture
def reward_buff_sizes():
return {"Brain1": 7, "Brain2": 8}
def test_convert_from_dict():
config = yaml.safe_load(
"""
measure: progress
thresholds: [0.1, 0.3, 0.5]
min_lesson_length: 100
signal_smoothing: true
parameters:
param1: [0.0, 4.0, 6.0, 8.0]
"""
)
should_be_config = CurriculumSettings(
thresholds=[0.1, 0.3, 0.5],
min_lesson_length=100,
signal_smoothing=True,
measure=CurriculumSettings.MeasureType.PROGRESS,
parameters={"param1": [0.0, 4.0, 6.0, 8.0]},
)
assert cattr.structure(config, CurriculumSettings) == should_be_config
def test_curriculum_config(param_name="test_param1", min_lesson_length=100):
return CurriculumSettings(
thresholds=[0.1, 0.3, 0.5],
min_lesson_length=min_lesson_length,
parameters={f"{param_name}": [0.0, 4.0, 6.0, 8.0]},
)
test_meta_curriculum_config = {
"Brain1": test_curriculum_config("test_param1"),
"Brain2": test_curriculum_config("test_param2"),
}
def test_set_lesson_nums():
meta_curriculum = MetaCurriculum(test_meta_curriculum_config)
meta_curriculum.lesson_nums = {"Brain1": 1, "Brain2": 3}
assert meta_curriculum.brains_to_curricula["Brain1"].lesson_num == 1
assert meta_curriculum.brains_to_curricula["Brain2"].lesson_num == 3
def test_increment_lessons(measure_vals):
meta_curriculum = MetaCurriculum(test_meta_curriculum_config)
meta_curriculum.brains_to_curricula["Brain1"] = Mock()
meta_curriculum.brains_to_curricula["Brain2"] = Mock()
meta_curriculum.increment_lessons(measure_vals)
meta_curriculum.brains_to_curricula["Brain1"].increment_lesson.assert_called_with(
0.2
)
meta_curriculum.brains_to_curricula["Brain2"].increment_lesson.assert_called_with(
0.3
)
@patch("mlagents.trainers.curriculum.Curriculum")
@patch("mlagents.trainers.curriculum.Curriculum")
def test_increment_lessons_with_reward_buff_sizes(
curriculum_a, curriculum_b, measure_vals, reward_buff_sizes
):
curriculum_a.min_lesson_length = 5
curriculum_b.min_lesson_length = 10
meta_curriculum = MetaCurriculum(test_meta_curriculum_config)
meta_curriculum.brains_to_curricula["Brain1"] = curriculum_a
meta_curriculum.brains_to_curricula["Brain2"] = curriculum_b
meta_curriculum.increment_lessons(measure_vals, reward_buff_sizes=reward_buff_sizes)
curriculum_a.increment_lesson.assert_called_with(0.2)
curriculum_b.increment_lesson.assert_not_called()
@patch("mlagents.trainers.meta_curriculum.GlobalTrainingStatus")
def test_restore_curriculums(mock_trainingstatus):
meta_curriculum = MetaCurriculum(test_meta_curriculum_config)
# Test restore to value
mock_trainingstatus.get_parameter_state.return_value = 2
meta_curriculum.try_restore_all_curriculum()
mock_trainingstatus.get_parameter_state.assert_has_calls(
[call("Brain1", StatusType.LESSON_NUM), call("Brain2", StatusType.LESSON_NUM)],
any_order=True,
)
assert meta_curriculum.brains_to_curricula["Brain1"].lesson_num == 2
assert meta_curriculum.brains_to_curricula["Brain2"].lesson_num == 2
# Test restore to None
mock_trainingstatus.get_parameter_state.return_value = None
meta_curriculum.try_restore_all_curriculum()
assert meta_curriculum.brains_to_curricula["Brain1"].lesson_num == 0
assert meta_curriculum.brains_to_curricula["Brain2"].lesson_num == 0
def test_get_config():
meta_curriculum = MetaCurriculum(test_meta_curriculum_config)
assert meta_curriculum.get_config() == {"test_param1": 0.0, "test_param2": 0.0}
@pytest.mark.parametrize("curriculum_brain_name", [BRAIN_NAME, "WrongBrainName"])
def test_simple_metacurriculum(curriculum_brain_name):
env = SimpleEnvironment([BRAIN_NAME], use_discrete=False)
mc = MetaCurriculum({curriculum_brain_name: dummy_curriculum_config})
_check_environment_trains(
env, {BRAIN_NAME: PPO_CONFIG}, meta_curriculum=mc, success_threshold=None
)

91
ml-agents/mlagents/trainers/curriculum.py


import math
from typing import Dict, Any
from mlagents.trainers.exception import CurriculumConfigError
from mlagents_envs.logging_util import get_logger
from mlagents.trainers.settings import CurriculumSettings
logger = get_logger(__name__)
class Curriculum:
def __init__(self, brain_name: str, settings: CurriculumSettings):
"""
Initializes a Curriculum object.
:param brain_name: Name of the brain this Curriculum is associated with
:param config: Dictionary of fields needed to configure the Curriculum
"""
self.max_lesson_num = 0
self.measure = None
self._lesson_num = 0
self.brain_name = brain_name
self.settings = settings
self.smoothing_value = 0.0
self.measure = self.settings.measure
self.min_lesson_length = self.settings.min_lesson_length
self.max_lesson_num = len(self.settings.thresholds)
parameters = self.settings.parameters
for key in parameters:
if len(parameters[key]) != self.max_lesson_num + 1:
raise CurriculumConfigError(
f"The parameter {key} in {brain_name}'s curriculum must have {self.max_lesson_num + 1} values "
f"but {len(parameters[key])} were found"
)
@property
def lesson_num(self) -> int:
return self._lesson_num
@lesson_num.setter
def lesson_num(self, lesson_num: int) -> None:
self._lesson_num = max(0, min(lesson_num, self.max_lesson_num))
def increment_lesson(self, measure_val: float) -> bool:
"""
Increments the lesson number depending on the progress given.
:param measure_val: Measure of progress (either reward or percentage
steps completed).
:return Whether the lesson was incremented.
"""
if not self.settings or not measure_val or math.isnan(measure_val):
return False
if self.settings.signal_smoothing:
measure_val = self.smoothing_value * 0.25 + 0.75 * measure_val
self.smoothing_value = measure_val
if self.lesson_num < self.max_lesson_num:
if measure_val > self.settings.thresholds[self.lesson_num]:
self.lesson_num += 1
config = {}
parameters = self.settings.parameters
for key in parameters:
config[key] = parameters[key][self.lesson_num]
logger.info(
"{0} lesson changed. Now in lesson {1}: {2}".format(
self.brain_name,
self.lesson_num,
", ".join([str(x) + " -> " + str(config[x]) for x in config]),
)
)
return True
return False
def get_config(self, lesson: int = None) -> Dict[str, Any]:
"""
Returns reset parameters which correspond to the lesson.
:param lesson: The lesson you want to get the config of. If None, the
current lesson is returned.
:return: The configuration of the reset parameters.
"""
if not self.settings:
return {}
if lesson is None:
lesson = self.lesson_num
lesson = max(0, min(lesson, self.max_lesson_num))
config = {}
parameters = self.settings.parameters
for key in parameters:
config[key] = parameters[key][lesson]
return config

148
ml-agents/mlagents/trainers/meta_curriculum.py


"""Contains the MetaCurriculum class."""
from typing import Dict, Set
from mlagents.trainers.curriculum import Curriculum
from mlagents.trainers.settings import CurriculumSettings
from mlagents.trainers.training_status import GlobalTrainingStatus, StatusType
from mlagents_envs.logging_util import get_logger
logger = get_logger(__name__)
class MetaCurriculum:
"""A MetaCurriculum holds curricula. Each curriculum is associated to a
particular brain in the environment.
"""
def __init__(self, curriculum_configs: Dict[str, CurriculumSettings]):
"""Initializes a MetaCurriculum object.
:param curriculum_folder: Dictionary of brain_name to the
Curriculum for each brain.
"""
self._brains_to_curricula: Dict[str, Curriculum] = {}
used_reset_parameters: Set[str] = set()
for brain_name, curriculum_settings in curriculum_configs.items():
self._brains_to_curricula[brain_name] = Curriculum(
brain_name, curriculum_settings
)
config_keys: Set[str] = set(
self._brains_to_curricula[brain_name].get_config().keys()
)
# Check if any two curricula use the same reset params.
if config_keys & used_reset_parameters:
logger.warning(
"Two or more curricula will "
"attempt to change the same reset "
"parameter. The result will be "
"non-deterministic."
)
used_reset_parameters.update(config_keys)
@property
def brains_to_curricula(self):
"""A dict from brain_name to the brain's curriculum."""
return self._brains_to_curricula
@property
def lesson_nums(self):
"""A dict from brain name to the brain's curriculum's lesson number."""
lesson_nums = {}
for brain_name, curriculum in self.brains_to_curricula.items():
lesson_nums[brain_name] = curriculum.lesson_num
return lesson_nums
@lesson_nums.setter
def lesson_nums(self, lesson_nums):
for brain_name, lesson in lesson_nums.items():
self.brains_to_curricula[brain_name].lesson_num = lesson
def _lesson_ready_to_increment(
self, brain_name: str, reward_buff_size: int
) -> bool:
"""Determines whether the curriculum of a specified brain is ready
to attempt an increment.
Args:
brain_name (str): The name of the brain whose curriculum will be
checked for readiness.
reward_buff_size (int): The size of the reward buffer of the trainer
that corresponds to the specified brain.
Returns:
Whether the curriculum of the specified brain should attempt to
increment its lesson.
"""
if brain_name not in self.brains_to_curricula:
return False
return reward_buff_size >= (
self.brains_to_curricula[brain_name].min_lesson_length
)
def increment_lessons(self, measure_vals, reward_buff_sizes=None):
"""Attempts to increments all the lessons of all the curricula in this
MetaCurriculum. Note that calling this method does not guarantee the
lesson of a curriculum will increment. The lesson of a curriculum will
only increment if the specified measure threshold defined in the
curriculum has been reached and the minimum number of episodes in the
lesson have been completed.
Args:
measure_vals (dict): A dict of brain name to measure value.
reward_buff_sizes (dict): A dict of brain names to the size of their
corresponding reward buffers.
Returns:
A dict from brain name to whether that brain's lesson number was
incremented.
"""
ret = {}
if reward_buff_sizes:
for brain_name, buff_size in reward_buff_sizes.items():
if self._lesson_ready_to_increment(brain_name, buff_size):
measure_val = measure_vals[brain_name]
ret[brain_name] = self.brains_to_curricula[
brain_name
].increment_lesson(measure_val)
else:
for brain_name, measure_val in measure_vals.items():
ret[brain_name] = self.brains_to_curricula[brain_name].increment_lesson(
measure_val
)
return ret
def try_restore_all_curriculum(self):
"""
Tries to restore all the curriculums to what is saved in training_status.json
"""
for brain_name, curriculum in self.brains_to_curricula.items():
lesson_num = GlobalTrainingStatus.get_parameter_state(
brain_name, StatusType.LESSON_NUM
)
if lesson_num is not None:
logger.info(
f"Resuming curriculum for {brain_name} at lesson {lesson_num}."
)
curriculum.lesson_num = lesson_num
else:
curriculum.lesson_num = 0
def get_config(self):
"""Get the combined configuration of all curricula in this
MetaCurriculum.
:return: A dict from parameter to value.
"""
config = {}
for _, curriculum in self.brains_to_curricula.items():
curr_config = curriculum.get_config()
config.update(curr_config)
return config