
Develop environment bc fix and doc update (#1317)

* split the config into two files

* fixed the Training-ML-Agents.md doc

* added the configs for all of the IL scenes
Branch: /develop-generalizationTraining-TrainerController
Current commit: bcd487a1 (GitHub, 6 years ago)

5 files changed, 156 insertions and 77 deletions
  1. docs/Training-Imitation-Learning.md (8 changes)
  2. docs/Training-ML-Agents.md (32 changes)
  3. config/offline_bc_config.yaml (27 additions)
  4. config/online_bc_config.yaml (110 additions)
  5. config/bc_config.yaml (56 deletions)

docs/Training-Imitation-Learning.md (8 changes)


1. Choose an agent you would like to train to imitate a set of demonstrations.
2. Record a set of demonstrations using the `Demonstration Recorder` (see above).
   For illustrative purposes we will refer to this file as `AgentRecording.demo`.
3. Build the scene, assigning the agent a Learning Brain, and set the Brain to
   Control in the Broadcast Hub. For more information on Brains, see
   [here](Learning-Environment-Design-Brains.md).
-4. Open the `config/bc_config.yaml` file.
+4. Open the `config/offline_bc_config.yaml` file.
-6. Launch `mlagents-learn`, providing `./config/bc_config.yaml` as the config
-   parameter, and your environment as the `--env` parameter.
+6. Launch `mlagents-learn`, providing `./config/offline_bc_config.yaml` as the
+   config parameter, and your environment as the `--env` parameter.
7. (Optional) Observe training performance using Tensorboard.

This will use the demonstration file to train a neural network driven agent to
directly imitate the actions provided in the demonstration. The environment will
launch and be used for evaluating the agent's performance during training.
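For example, once `AgentRecording.demo` exists, the `demo_path` entry in
`config/offline_bc_config.yaml` (shown in full further below) can be pointed at
it. A minimal sketch, assuming the recording was saved to the default
`Demonstrations` folder:

default:
    trainer: offline_bc
    # Hypothetical path for this walkthrough; use wherever the
    # Demonstration Recorder actually saved AgentRecording.demo.
    demo_path: ./UnitySDK/Assets/Demonstrations/AgentRecording.demo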

   and check the `Control` checkbox on the "Student" brain.
4. Link the Brains to the desired Agents (one Agent as the teacher and at least
   one Agent as a student).
-5. In `config/trainer_config.yaml`, add an entry for the "Student" Brain. Set
+5. In `config/online_bc_config.yaml`, add an entry for the "Student" Brain. Set
-6. Launch the training process with `mlagents-learn config/trainer_config.yaml
+6. Launch the training process with `mlagents-learn config/online_bc_config.yaml
   --train --slow`, and press the :arrow_forward: button in Unity when the
   message _"Start training by pressing the Play button in the Unity Editor"_ is
   displayed on the screen.
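The entry added in step 5 can mirror the `StudentBrain` section of
`config/online_bc_config.yaml` shown further below. A minimal sketch, assuming
the teacher Brain's GameObject is named `TeacherBrain`:

StudentBrain:
    trainer: online_bc
    brain_to_imitate: TeacherBrain   # name of the teacher Brain's GameObject
    max_steps: 10000
    batch_size: 16
    batches_per_epoch: 5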

docs/Training-ML-Agents.md (32 changes)


Use the command `mlagents-learn` to train your agents. This command is installed
with the `mlagents` package and its implementation can be found at
`ml-agents/mlagents/trainers/learn.py`. The [configuration file](#training-config-file),
-`config/trainer_config.yaml` specifies the hyperparameters used during training.
+like `config/trainer_config.yaml`, specifies the hyperparameters used during training.
You can edit this file with a text editor to add a specific configuration for
each Brain.

under the assigned run-id — in the cats example, the path to the model would be
`models/cob_1/CatsOnBicycles_cob_1.bytes`.
On Mac and Linux platforms, you can press Ctrl+C to terminate your training
early; the model will be saved as if you had set max_steps to the current step.
(**Note:** There is a known bug on Windows that causes the saving of the model
to fail when you terminate the training early; it is recommended to wait until
-Step has reached the max_steps parameter you set in trainer_config.yaml.) While
-this example used the default training hyperparameters, you can edit the
+Step has reached the max_steps parameter you set in trainer_config.yaml.)
+While this example used the default training hyperparameters, you can edit the
[training_config.yaml file](#training-config-file) with a text editor to set
different values.

### Training config file
-The training config file, `config/trainer_config.yaml` specifies the training
-method, the hyperparameters, and a few additional values to use during training.
-The file is divided into sections. The **default** section defines the default
-values for all the available settings. You can also add new sections to override
-these defaults to train specific Brains. Name each of these override sections
-after the GameObject containing the Brain component that should use these
-settings. (This GameObject will be a child of the Academy in your scene.)
-Sections for the example environments are included in the provided config file.
+The training config files `config/trainer_config.yaml`,
+`config/online_bc_config.yaml` and `config/offline_bc_config.yaml` specify the
+training method, the hyperparameters, and a few additional values to use during
+training with PPO, online and offline BC. These files are divided into sections.
+The **default** section defines the default values for all the available
+settings. You can also add new sections to override these defaults to train
+specific Brains. Name each of these override sections after the GameObject
+containing the Brain component that should use these settings. (This GameObject
+will be a child of the Academy in your scene.) Sections for the example
+environments are included in the provided config files.
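As a minimal sketch of that layout (`MyBrain` is a hypothetical section name;
in practice it must match the GameObject holding the Brain component), an
override section only restates the settings that differ from the defaults:

default:
    trainer: online_bc
    batch_size: 64
    use_recurrent: false

MyBrain:             # hypothetical Brain GameObject name
    batch_size: 16   # any setting not listed here falls back to default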
| **Setting** | **Description** | **Applies To Trainer\*** |
| :------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- |

-| brain\_to\_imitate | For imitation learning, the name of the GameObject containing the Brain component to imitate. | BC |
+| brain\_to\_imitate | For online imitation learning, the name of the GameObject containing the Brain component to imitate. | (online)BC |
+| demo_path | For offline imitation learning, the file path of the recorded demonstration file. | (offline)BC |
| buffer_size | The number of experiences to collect before updating the policy model. | PPO |
| curiosity\_enc\_size | The size of the encoding to use in the forward and inverse models in the Curiosity module. | PPO |
| curiosity_strength | Magnitude of intrinsic reward generated by the Intrinsic Curiosity Module. | PPO |

| num_layers | The number of hidden layers in the neural network. | PPO, BC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, BC |
-| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, BC |
+| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, (online)BC |
| trainer | The type of training to perform: "ppo" or "imitation". | PPO, BC |
| use_curiosity | Train using an additional intrinsic reward signal generated from the Intrinsic Curiosity Module. | PPO |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
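As an illustration of how these settings combine, the recurrent options from
the table (`use_recurrent`, `sequence_length`, `memory_size`) are enabled
together in the `HallwayBrain` entry of `config/offline_bc_config.yaml` below:

HallwayBrain:
    trainer: offline_bc
    use_recurrent: true    # train with a recurrent neural network
    sequence_length: 32    # length of the experience sequences fed to the RNN
    memory_size: 256       # size of the recurrent memory vector
    demo_path: ./UnitySDK/Assets/Demonstrations/Hallway.demo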

config/offline_bc_config.yaml (new file, 27 additions)


default:
    trainer: offline_bc
    batch_size: 64
    summary_freq: 1000
    max_steps: 5.0e4
    batches_per_epoch: 10
    use_recurrent: false
    hidden_units: 128
    learning_rate: 3.0e-4
    num_layers: 2
    sequence_length: 32
    memory_size: 256
    demo_path: ./UnitySDK/Assets/Demonstrations/<Your_Demo_File>.demo

HallwayBrain:
    trainer: offline_bc
    max_steps: 5.0e5
    num_epoch: 5
    batch_size: 64
    batches_per_epoch: 5
    num_layers: 2
    hidden_units: 128
    use_recurrent: true
    memory_size: 256
    sequence_length: 32
    demo_path: ./UnitySDK/Assets/Demonstrations/Hallway.demo

config/online_bc_config.yaml (new file, 110 additions)


default:
    trainer: online_bc
    brain_to_imitate: <Your_Brain_Asset_Name>
    batch_size: 64
    time_horizon: 64
    summary_freq: 1000
    max_steps: 5.0e4
    batches_per_epoch: 10
    use_recurrent: false
    hidden_units: 128
    learning_rate: 3.0e-4
    num_layers: 2
    sequence_length: 32
    memory_size: 256

BananaLearning:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: BananaPlayer
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: false
    sequence_length: 16

BouncerLearning:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 10
    brain_to_imitate: BouncerPlayer
    batch_size: 16
    batches_per_epoch: 1
    num_layers: 1
    hidden_units: 64
    use_recurrent: false
    sequence_length: 16

HallwayLearning:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: HallwayPlayer
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: false
    sequence_length: 16

PushBlockLearning:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: PushBlockPlayer
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: false
    sequence_length: 16

PyramidsLearning:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: PyramidsPlayer
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: false
    sequence_length: 16

TennisLearning:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: TennisPlayer
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: false
    sequence_length: 16

StudentBrain:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: TeacherBrain
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: false
    sequence_length: 16

StudentRecurrentBrain:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: TeacherBrain
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: true
    sequence_length: 32

config/bc_config.yaml (deleted, 56 deletions)


default:
    trainer: offline_bc
    batch_size: 64
    beta: 5.0e-3
    hidden_units: 128
    learning_rate: 3.0e-4
    max_steps: 5.0e4
    memory_size: 256
    batches_per_epoch: 10
    time_horizon: 64
    num_epoch: 5
    num_layers: 2
    summary_freq: 1000
    use_recurrent: false
    sequence_length: 32
    demo_path: ./UnitySDK/Assets/Demonstrations/Crawler_test.demo

HallwayBrain:
    trainer: offline_bc
    max_steps: 5.0e5
    num_epoch: 5
    batch_size: 64
    batches_per_epoch: 5
    num_layers: 2
    hidden_units: 128
    sequence_length: 16
    buffer_size: 512
    use_recurrent: true
    memory_size: 256
    sequence_length: 32
    demo_path: ./UnitySDK/Assets/Demonstrations/Hallway.demo

StudentBrain:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: TeacherBrain
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    sequence_length: 16
    buffer_size: 128

StudentRecurrentBrain:
    trainer: online_bc
    max_steps: 10000
    summary_freq: 1000
    brain_to_imitate: TeacherBrain
    batch_size: 16
    batches_per_epoch: 5
    num_layers: 4
    hidden_units: 64
    use_recurrent: true
    sequence_length: 32
    buffer_size: 128