
[docs] Update docs for new results and trainer config (#3916)

- Fix after release branch merge
/docs-update
GitHub, 5 years ago
Current commit: ed0a6006
3 files changed, 181 insertions and 165 deletions
  1. docs/Learning-Environment-Executable.md (2 changes)
  2. docs/Training-ML-Agents.md (336 changes)
  3. docs/Using-Tensorboard.md (8 changes)

docs/Learning-Environment-Executable.md (2 changes)


the directory where you installed the ML-Agents Toolkit, run:
```sh
mlagents-learn config/ppo/3DBall.yaml --env=3DBall --run-id=firstRun
```
And you should see something like

docs/Training-ML-Agents.md (336 changes)


#### Observing Training
Regardless of which training methods, configurations or hyperparameters you
provide, the training process will always generate three artifacts, all found
in the `results/<run-identifier>` folder:
1. Summaries: these are training metrics that
1. Models: these contain the model checkpoints that
1. Timers file (under `results/<run-identifier>/run_logs`): this contains aggregated
metrics on your training process, including time spent on specific code
blocks. See [Profiling in Python](Profiling-Python.md) for more information
on the timers generated.
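For orientation, here is a rough sketch of that folder for a hypothetical run named
`firstRun` (the exact file and folder layout can differ between toolkit versions):

```
results/
  firstRun/
    run_logs/   # the aggregated timers file described above
    ...         # training summaries and model checkpoints for each Behavior
```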

This section offers a detailed guide on how to manage the different training
set-ups within the toolkit.
More specifically, this section offers a detailed guide on the command-line
arguments of `mlagents-learn` that control training:
- `<trainer-config-file>`: defines the training hyperparameters for each
  Behavior in the scene, and the set-ups for Curriculum Learning and
  Environment Parameter Randomization
- `--num-envs`: number of concurrent Unity instances to use during training
Reminder that a detailed description of all command-line options can be found by

process when the default parameters don't seem to be giving the level of
performance you would like. We provide sample configuration files for our
example environments in the [config/](../config/) directory. The
`config/ppo/3DBall.yaml` was used to train the 3D Balance Ball in the
[Getting Started](Getting-Started.md) guide. That configuration file uses the
PPO trainer, but we also have configuration files for SAC and GAIL.

add typically has its own training configurations. For instance:
- Use PPO or SAC?
- Use Recurrent Neural Networks for adding memory to your agents?

demonstrations.)
- Use self-play? (Assuming your environment includes multiple agents.)
The trainer config file, `<trainer-config-file>`, determines the features you will
use during training, and the answers to the above questions will dictate its contents.
The rest of this guide breaks down the different sub-sections of the trainer config file
and explains the possible settings for each.
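Before breaking the file apart, here is a minimal sketch of its overall shape,
assembled from the examples later in this section (`YourBehaviorName` is a
placeholder, and only the top-level keys are shown):

```yaml
behaviors:
  YourBehaviorName:        # one entry per Behavior in your scene
    trainer: ppo           # or sac
    # trainer hyperparameters, reward_signals, behavioral_cloning,
    # self_play and an optional curriculum sub-section go here

parameter_randomization:   # optional top-level section for randomized environment parameters
  resampling-interval: 5000
  # one sampler entry per environment parameter
```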
### Behavior Configurations

The primary section of the trainer config file is a set of configurations for
each Behavior in your scene, defined under the `behaviors` sub-section. Some of the
curriculum and environment parameter randomization settings are not part of the
`behaviors` configuration; they live in different sections that we'll cover subsequently.
```yaml
behaviors:
  BehaviorPPO:
    trainer: ppo

    # Trainer configs common to PPO/SAC (excluding reward signals)
    batch_size: 1024
    buffer_size: 10240
    hidden_units: 128
    learning_rate: 3.0e-4
    learning_rate_schedule: linear
    max_steps: 5.0e5
    normalize: false
    num_layers: 2
    time_horizon: 64
    vis_encoder_type: simple

    # PPO-specific configs
    beta: 5.0e-3
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
    threaded: true

    # memory
    use_recurrent: true
    sequence_length: 64
    memory_size: 256

    # behavior cloning
    behavioral_cloning:
      demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
      strength: 0.5
      steps: 150000
      batch_size: 512
      num_epoch: 3
      samples_per_update: 0

    init_path:

    reward_signals:
      # environment reward
      extrinsic:
        strength: 1.0
        gamma: 0.99

      # curiosity module
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
        learning_rate: 3e-4

      # GAIL
      gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
        learning_rate: 3e-4
        use_actions: false
        use_vail: false

    # self-play
    self_play:
      window: 10
      play_against_latest_model_ratio: 0.5
      save_steps: 50000
      swap_steps: 50000
      team_change: 100000
```
Here is an equivalent file if we use an SAC trainer instead. Notice that the

```yaml
behaviors:
  BehaviorSAC:
    trainer: sac

    # Trainer configs common to PPO/SAC (excluding reward signals)
    # same as PPO config

    # SAC-specific configs (replaces the "PPO-specific configs" section above)
    buffer_init_steps: 0
    tau: 0.005
    steps_per_update: 1
    train_interval: 1
    init_entcoef: 1.0
    save_replay_buffer: false

    # memory
    # same as PPO config

    # pre-training using behavior cloning
    behavioral_cloning:
      # same as PPO config

    reward_signals:
      reward_signal_num_update: 1 # only applies to SAC

      # environment reward
      extrinsic:
        # same as PPO config

      # curiosity module
      curiosity:
        # same as PPO config

      # GAIL
      gail:
        # same as PPO config

    # self-play
    self_play:
      # same as PPO config
```
We now break apart the components of the configuration file and describe what

### Curriculum Learning
To enable curriculum learning, you need to add a `curriculum` sub-section to the
corresponding `behaviors` entry in the trainer config YAML file, defining the
curriculum for that behavior. Here is one example:
```yaml
behaviors:
  BehaviorY:
    # < Same as above >

    # Add this section
    curriculum:
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        wall_height: [1.5, 2.0, 2.5, 4.0]
```
Each group of Agents under the same `Behavior Name` in an environment can have a

In order to define the curricula, the first step is to decide which parameters
of the environment will vary. In the case of the Wall Jump environment, the
height of the wall is what varies. Rather than adjusting it by hand, we will
create a configuration which describes the structure of the curricula. Within it, we
can specify which points in the training process our wall height will change,
either based on the percentage of training steps which have taken place, or the
average reward the agent has received in the recent past. Below is an

```yaml
behaviors:
  BigWallJump:
    # < Trainer parameters for BigWallJump >

    # Curriculum configuration
    curriculum:
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        big_wall_min_height: [0.0, 4.0, 6.0, 8.0]
        big_wall_max_height: [4.0, 7.0, 8.0, 8.0]

  SmallWallJump:
    # < Trainer parameters for SmallWallJump >

    # Curriculum configuration
    curriculum:
      measure: progress
      thresholds: [0.1, 0.3, 0.5]
      min_lesson_length: 100
      signal_smoothing: true
      parameters:
        small_wall_height: [1.5, 2.0, 2.5, 4.0]
```
The curriculum for each Behavior has the following parameters:

train agents in the Wall Jump environment with curriculum learning, we can run:
```sh
mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum
```
We can then keep track of the current lessons and progress via TensorBoard.

### Environment Parameter Randomization
To enable parameter randomization, you need to add a `parameter_randomization`
sub-section to your trainer config YAML file. Here is one example:
```yaml
behaviors:
  # < Same as above >

parameter_randomization:
  resampling-interval: 5000

  mass:
    sampler-type: "uniform"
    min_value: 0.5
    max_value: 10

  gravity:
    sampler-type: "multirange_uniform"
    intervals: [[7, 10], [15, 20]]

  scale:
    sampler-type: "uniform"
    min_value: 0.75
    max_value: 3
```
Note that `mass`, `gravity` and `scale` are the names of the environment

`interval_2_max`], ...]
- **sub-arguments** - `intervals`
The implementation of the samplers can be found in the
[sampler_class.py file](../ml-agents/mlagents/trainers/sampler_class.py).
#### Defining a New Sampler Type

#### Training with Environment Parameter Randomization
After the sampler configuration is defined, we proceed by launching `mlagents-learn`
with a trainer configuration that has `parameter_randomization` defined. For example,
`Environment Parameters` with the sampling setup above, we would run
```sh
mlagents-learn config/ppo/3DBall_randomize.yaml --run-id=3D-Ball-randomize
```
We can observe progress and metrics via TensorBoard.

- **Buffer Size** - If you are having trouble getting an agent to train, even
with multiple concurrent Unity instances, you could increase `buffer_size` in
the trainer config file. A common practice is to multiply
`buffer_size` by `num-envs` (see the sketch after this list).
- **Resource Constraints** - Invoking concurrent Unity instances is constrained
by the resources on the machine. Please use discretion when setting
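As a sketch of that buffer-size practice, with hypothetical values (`3DBall` stands in
for your Behavior name, and 40960 is the sample PPO config's 10240 multiplied by
`--num-envs=4`):

```yaml
behaviors:
  3DBall:
    trainer: ppo
    batch_size: 1024
    buffer_size: 40960   # 10240 x 4 concurrent Unity instances
    # ... remaining settings unchanged
```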

docs/Using-Tensorboard.md (8 changes)


[TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard).
The `mlagents-learn` command saves training statistics to a folder named
`results`, organized by the `run-id` value you assign to a training session.
In order to observe the training process, either during training or afterward,
start TensorBoard:

the --port option.
**Note:** If you don't assign a `run-id` identifier, `mlagents-learn` uses the
default string, "ppo". All the statistics will be saved to the same sub-folder
and displayed as one session in TensorBoard. After a few runs, the displays can
become difficult to interpret in this situation. You can delete the folders
under the `summaries` directory to clear out old statistics.
default string, "ppo". You can delete the folders under the `results` directory
to clear out old statistics.
On the left side of the TensorBoard window, you can select which of the training
runs you want to display. You can select multiple run-ids to compare statistics.
