Improvements to Training-ML-Agents (#3776)

* Improvements to Training-ML-Agents
  - Removed duplicate documentation
  - Moved CLI descriptions to learn.py
  - Reorganized "Training with mlagents-learn" into 5 sub-sections
* fixed formatting errors and incorporated minor feedback
* minor improvement
* Minor formatting.
* fixed run-id references
* Keeping link to use Inference consistent with master
  Will update the UIE page in a separate PR.
* Squashed commit of the following:
  commit 9600d0fbe6684eca69fb5bab84ab0f6754fc8b0f
  Author: Marwan Mattar <marwan@unity3d.com>
  Date: Tue Apr 14 17:45:33 2020 -0700
  Various doc improvements (#3775)
  * Various doc improvements
    For Using-Virtual-Environment.md:
    - Made a note regarding updating setuptools and pip.
    - Changed lists from "-" to "*"
    For Using-Tensorboard.md:
    - Changed the ordered list to use "1."
    For Training-on-Microsoft-Azure-Custom-Instance.md:
    - Deleted ...
5 years ago · 8c5edc99
--- a/docs/Training-ML-Agents.md
+++ b/docs/Training-ML-Agents.md
 # Training ML-Agents

-The ML-Agents toolkit conducts training using an external Python training
-process. During training, this external process communicates with the Academy
-to generate a block of agent experiences. These
-experiences become the training set for a neural network used to optimize the
-agent's policy (which is essentially a mathematical function mapping
-observations to actions). In reinforcement learning, the neural network
-optimizes the policy by maximizing the expected rewards. In imitation learning,
-the neural network optimizes the policy to achieve the smallest difference
-between the actions chosen by the agent trainee and the actions chosen by the
-expert in the same situation.
-
-The output of the training process is a model file containing the optimized
-policy. This model file is a TensorFlow data graph containing the mathematical
-operations and the optimized weights selected during the training process. You
-can set the generated model file in the Behaviors Parameters under your
-Agent in your Unity project to decide the best course of action for an agent.
-
-Use the command `mlagents-learn` to train your agents. This command is installed
-with the `mlagents` package and its implementation can be found at
-`ml-agents/mlagents/trainers/learn.py`. The [configuration file](#training-config-file),
-like `config/trainer_config.yaml` specifies the hyperparameters used during training.
-You can edit this file with a text editor to add a specific configuration for
-each Behavior.
+For a broad overview of reinforcement learning, imitation learning and all the
+training scenarios, methods and options within the ML-Agents Toolkit, see
+[ML-Agents Toolkit Overview](ML-Agents-Overview.md).
-For a broader overview of reinforcement learning, imitation learning and the
-ML-Agents training process, see [ML-Agents Toolkit
-Overview](ML-Agents-Overview.md).
+Once your learning environment has been created and is ready for training, the
+next step is to initiate a training run. Training in the ML-Agents Toolkit is
+powered by a dedicated Python package, `mlagents`. This package exposes a
+command `mlagents-learn` that is the single entry point for all training
+workflows (e.g. reinforcement learning, imitation learning, curriculum learning).
+Its implementation can be found at
+[ml-agents/mlagents/trainers/learn.py](../ml-agents/mlagents/trainers/learn.py).
-Use the `mlagents-learn` command to train agents. `mlagents-learn` supports
-training with
-[reinforcement learning](Background-Machine-Learning.md#reinforcement-learning),
-[curriculum learning](Training-Curriculum-Learning.md),
-and [behavioral cloning imitation learning](Training-Imitation-Learning.md).
+### Starting Training
-Run `mlagents-learn` from the command line to launch the training process. Use
-the command line patterns and the `config/trainer_config.yaml` file to control
-training options.
+`mlagents-learn` is the main training utility provided by the ML-Agents Toolkit.
+It accepts a number of CLI options in addition to a YAML configuration file that
+contains all the configurations and hyperparameters to be used during training.
+The set of configurations and hyperparameters to include in this file depends on
+the agents in your environment and the specific training method you wish to
+utilize. Keep in mind that the hyperparameter values can have a big impact on
+the training performance (i.e. your agent's ability to learn a policy that
+solves the task). On this page, we will review all the hyperparameters for all
+training methods and provide guidelines and advice on their values.
-The basic command for training is:
+To view a description of all the CLI options accepted by `mlagents-learn`, use
+the `--help` flag:
-mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier>
+mlagents-learn --help
-where
-
-* `<trainer-config-file>` is the file path of the trainer configuration yaml.
-* `<env_name>`__(Optional)__ is the name (including path) of your Unity
-  executable containing the agents to be trained. If `<env_name>` is not passed,
-  the training will happen in the Editor. Press the :arrow_forward: button in
-  Unity when the message _"Start training by pressing the Play button in the
-  Unity Editor"_ is displayed on the screen.
-* `<run-identifier>` is an optional identifier you can use to identify the
-  results of individual training runs.
-
-For example, suppose you have a project in Unity named "CatsOnBicycles" which
-contains agents ready to train. To perform the training:
-
-1. [Build the project](Learning-Environment-Executable.md), making sure that you
-   only include the training scene.
-2. Open a terminal or console window.
-3. Navigate to the directory where you installed the ML-Agents Toolkit.
-4. Run the following to launch the training process using the path to the Unity
-   environment you built in step 1:
+The basic command for training is:
-mlagents-learn config/trainer_config.yaml --env=../../projects/Cats/CatsOnBicycles.app --run-id=cob_1
+mlagents-learn <trainer-config-file> --env=<env_name> --run-id=<run-identifier>
-During a training session, the training program prints out and saves updates at
-regular intervals (specified by the `summary_freq` option). The saved statistics
-are grouped by the `run-id` value so you should assign a unique id to each
-training run if you plan to view the statistics. You can view these statistics
-using TensorBoard during or after training by running the following command:
+where
-```sh
-tensorboard --logdir=summaries --port 6006
-```
+- `<trainer-config-file>` is the file path of the trainer configuration yaml.
+  This contains all the hyperparameter values. We offer a detailed guide on the
+  structure of this file and the meaning of the hyperparameters (and advice on how
+  to set them) in the dedicated [Training Config File](#training-config-file)
+  section below.
+- `<env_name>`**(Optional)** is the name (including path) of your
+  [Unity executable](Learning-Environment-Executable.md) containing the agents
+  to be trained. If `<env_name>` is not passed, the training will happen in the
+  Editor. Press the :arrow_forward: button in Unity when the message _"Start
+  training by pressing the Play button in the Unity Editor"_ is displayed on
+  the screen.
+- `<run-identifier>` is a unique name you can use to identify the results of
+  your training runs.
-And then opening the URL: [localhost:6006](http://localhost:6006).
+See the
+[Getting Started Guide](Getting-Started.md#training-a-new-model-with-reinforcement-learning)
+for a sample execution of the `mlagents-learn` command.
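The command pattern above can also be assembled programmatically, for example when launching several runs from a script. The following is a minimal sketch, not part of the toolkit: the helper name `build_learn_command` and the run ID `first_run` are hypothetical, and the sketch uses only the `--env` and `--run-id` options described above.

```python
import shlex
from typing import List, Optional


def build_learn_command(
    config_path: str,
    run_id: str,
    env_path: Optional[str] = None,
) -> List[str]:
    """Assemble an `mlagents-learn` argument list (hypothetical helper)."""
    command = ["mlagents-learn", config_path]
    # --env is optional: when it is omitted, training happens in the Editor.
    if env_path is not None:
        command.append(f"--env={env_path}")
    command.append(f"--run-id={run_id}")
    return command


if __name__ == "__main__":
    cmd = build_learn_command("config/trainer_config.yaml", "first_run")
    # Print the command so it can be inspected before it is executed,
    # for example with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps paths with spaces intact if the command is later passed to `subprocess.run`.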
-**Note:** The default port TensorBoard uses is 6006. If there is an existing session
-running on port 6006 a new session can be launched on an open port using the --port
-option.
+#### Observing Training
-When training is finished, you can find the saved model in the `models` folder
-under the assigned run-id — in the cats example, the path to the model would be
-`models/cob_1/CatsOnBicycles_cob_1.nn`.
+Regardless of which training methods, configurations or hyperparameters you
+provide, the training process will always generate three artifacts:
-While this example used the default training hyperparameters, you can edit the
-[trainer_config.yaml file](#training-config-file) with a text editor to set
-different values.
+1. Summaries (under the `summaries/` folder): these are training metrics that
+   are updated throughout the training process. They are helpful to monitor your
+   training performance and may help inform how to update your hyperparameter
+   values. See [Using TensorBoard](Using-Tensorboard.md) for more details on how
+   to visualize the training metrics.
+1. Models (under the `models/` folder): these contain the model checkpoints that
+   are updated throughout training and the final model file (`.nn`). This final
+   model file is generated once, either when training completes or when it is
+   interrupted.
+1. Timers file (also under the `summaries/` folder): this contains aggregated
+   metrics on your training process, including time spent on specific code
+   blocks. See [Profiling in Python](Profiling-Python.md) for more information
+   on the timers generated.
-To interrupt training and save the current progress, hit Ctrl+C once and wait for the
-model to be saved out.
+These artifacts (except the `.nn` file) are updated throughout the training
+process and finalized when training completes or is interrupted.
-### Loading an Existing Model
+#### Stopping and Resuming Training
-If you've quit training early using Ctrl+C, you can resume the training run by running
-`mlagents-learn` again, specifying the same `<run-identifier>` and appending the `--resume` flag
-to the command.
+To interrupt training and save the current progress, hit `Ctrl+C` once and wait
+for the model(s) to be saved out.
-You can also use this mode to run inference of an already-trained model in Python.
-Append both the `--resume` and `--inference` to do this. Note that if you want to run
-inference in Unity, you should use the
-[Unity Inference Engine](Getting-started.md#running-a-pre-trained-model).
+To resume a previously interrupted or completed training run, use the `--resume`
+flag and make sure to specify the previously used run ID.
-If you've already trained a model using the specified `<run-identifier>` and `--resume` is not
-specified, you will not be able to continue with training. Use `--force` to force ML-Agents to
-overwrite the existing data.
+If you would like to re-run a previously interrupted or completed training run
+and re-use the same run ID (in this case, overwriting the previously generated
+artifacts), then use the `--force` flag.
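The choice between these flags can be summarized in a small helper. This is a sketch under stated assumptions, not toolkit code: the function name `run_mode_flags` and its `overwrite` and `models_dir` parameters are hypothetical, and it assumes that the artifacts of an earlier run are stored under `models/<run-id>/`.

```python
from pathlib import Path
from typing import List


def run_mode_flags(
    run_id: str,
    overwrite: bool = False,
    models_dir: str = "models",
) -> List[str]:
    """Pick the extra flag for a run ID (hypothetical helper).

    Assumes artifacts of an earlier run live under `<models_dir>/<run_id>/`.
    """
    previous_run_exists = (Path(models_dir) / run_id).is_dir()
    if not previous_run_exists:
        return []  # New run ID: no extra flag is needed.
    if overwrite:
        return ["--force"]  # Re-use the run ID and overwrite old artifacts.
    return ["--resume"]  # Continue from the previously saved progress.
```

The returned list can be appended to the rest of the `mlagents-learn` arguments before launching the run.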
-Alternatively, you might want to start a new training run but _initialize_ it using an already-trained
-model. You may want to do this, for instance, if your environment changed and you want
-a new model, but the old behavior is still better than random. You can do this by specifying `--initialize-from=<run-identifier>`, where `<run-identifier>` is the old run ID.
+#### Loading an Existing Model
-### Command Line Training Options
+You can also run inference of an already-trained model in Python by using both
+the `--resume` and `--inference` flags. Note that if you
+want to run inference in Unity, you should use the
+[Unity Inference Engine](Getting-Started.md#running-a-pre-trained-model).
-In addition to passing the path of the Unity executable containing your training
-environment, you can set the following command line options when invoking
-`mlagents-learn`:
+Alternatively, you might want to start a new training run but _initialize_ it
+using an already-trained model. You may want to do this, for instance, if your
+environment changed and you want a new model, but the old behavior is still
+better than random. You can do this by specifying
+`--initialize-from=<run-identifier>`, where `<run-identifier>` is the old run
+ID.
-* `--env=<env>`: Specify an executable environment to train.
-* `--curriculum=<file>`: Specify a curriculum JSON file for defining the
-  lessons for curriculum training. See [Curriculum
-  Training](Training-Curriculum-Learning.md) for more information.
-* `--sampler=<file>`: Specify a sampler YAML file for defining the
-  sampler for parameter randomization. See [Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md) for more information.
-* `--keep-checkpoints=<n>`: Specify the maximum number of model checkpoints to
-  keep. Checkpoints are saved after the number of steps specified by the
-  `save-freq` option. Once the maximum number of checkpoints has been reached,
-  the oldest checkpoint is deleted when saving a new checkpoint. Defaults to 5.
-* `--lesson=<n>`: Specify which lesson to start with when performing curriculum
-  training. Defaults to 0.
-* `--num-envs=<n>`: Specifies the number of concurrent Unity environment instances to
-  collect experiences from when training. Defaults to 1.
-* `--run-id=<run-identifier>`: Specifies an identifier for each training run. This
-  identifier is used to name the subdirectories in which the trained model and
-  summary statistics are saved as well as the saved model itself. The default id
-  is "ppo". If you use TensorBoard to view the training statistics, always set a
-  unique run-id for each training run. (The statistics for all runs with the
-  same id are combined as if they were produced by a the same session.)
-* `--save-freq=<n>`: Specifies how often (in  steps) to save the model during
-  training. Defaults to 50000.
-* `--seed=<n>`: Specifies a number to use as a seed for the random number
-  generator used by the training code.
-* `--env-args=<string>`: Specify arguments for the executable environment. Be aware that
-  the standalone build will also process these as
-  [Unity Command Line Arguments](https://docs.unity3d.com/Manual/CommandLineArguments.html).
-  You should choose different argument names if you want to create environment-specific arguments.
-  All arguments after this flag will be passed to the executable. For example, setting
-  `mlagents-learn config/trainer_config.yaml --env-args --num-orcs 42` would result in
-   ` --num-orcs 42` passed to the executable.
-* `--base-port`: Specifies the starting port. Each concurrent Unity environment instance
-  will get assigned a port sequentially, starting from the `base-port`. Each instance
-  will use the port `(base_port + worker_id)`, where the `worker_id` is sequential IDs
-  given to each instance from 0 to `num_envs - 1`. Default is 5005. __Note:__ When
-  training using the Editor rather than an executable, the base port will be ignored.
-* `--inference`: Specifies whether to only run in inference mode. Omit to train the model.
-  To load an existing model, specify a run-id and combine with `--resume`.
-* `--resume`: If set, the training code loads an already trained model to
-  initialize the neural network before training. The learning code looks for the
-  model in `models/<run-id>/` (which is also where it saves models at the end of
-  training). This option only works when the models exist, and have the same behavior names
-  as the current agents in your scene.
-* `--force`: Attempting to train a model with a run-id that has been used before will
-  throw an error. Use `--force` to force-overwrite this run-id's summary and model data.
-* `--initialize-from=<run-identifier>`: Specify an old run-id here to initialize your model from
-  a previously trained model. Note that the previously saved models _must_ have the same behavior
-  parameters as your current environment.
-* `--no-graphics`: Specify this option to run the Unity executable in
-  `-batchmode` and doesn't initialize the graphics driver. Use this only if your
-  training doesn't involve visual observations (reading from Pixels). See
-  [here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more
-  details.
-* `--debug`: Specify this option to enable debug-level logging for some parts of the code.
-* `--cpu`: Forces training using CPU only.
-* Engine Configuration :
-  * `--width` : The width of the executable window of the environment(s) in pixels
-  (ignored for editor training) (Default 84)
-  * `--height` : The height of the executable window of the environment(s) in pixels
-  (ignored for editor training). (Default 84)
-  * `--quality-level` : The quality level of the environment(s). Equivalent to
-  calling `QualitySettings.SetQualityLevel` in Unity. (Default 5)
-  * `--time-scale` : The time scale of the Unity environment(s). Equivalent to setting
-  `Time.timeScale` in Unity. (Default 20.0, maximum 100.0)
-  * `--target-frame-rate` : The target frame rate of the Unity environment(s).
-  Equivalent to setting `Application.targetFrameRate` in Unity. (Default: -1)
+## Training Config File
-### Training Config File
+The Unity ML-Agents Toolkit provides a wide range of training scenarios, methods
+and options. As such, specific training runs may require different training
+configurations and may generate different artifacts and TensorBoard statistics.
+This section offers a detailed guide on how to manage the different training
+set-ups within the toolkit.
-The training config files `config/trainer_config.yaml`, `config/sac_trainer_config.yaml`,
-`config/gail_config.yaml` and `config/offline_bc_config.yaml` specifies the training method,
-the hyperparameters, and a few additional values to use when training with Proximal Policy
-Optimization(PPO), Soft Actor-Critic(SAC), GAIL (Generative Adversarial Imitation Learning)
-with PPO/SAC, and Behavioral Cloning(BC)/Imitation with PPO/SAC. These files are divided
-into sections. The **default** section defines the default values for all the available
-training with PPO, SAC, GAIL (with PPO), and BC. These files are divided into sections.
-The **default** section defines the default values for all the available settings. You can
-also add new sections to override these defaults to train specific Behaviors. Name each of these
-override sections after the appropriate `Behavior Name`. Sections for the
+The training config files `config/trainer_config.yaml`,
+`config/sac_trainer_config.yaml`, `config/gail_config.yaml` and
+`config/offline_bc_config.yaml` specify the training method, the
+hyperparameters, and a few additional values to use when training with Proximal
+Policy Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative Adversarial
+Imitation Learning) with PPO/SAC, and Behavioral Cloning (BC)/Imitation with
+PPO/SAC. These files are divided into sections. The **default** section defines
+the default values for all the available settings. You can also add new
+sections to override these defaults to train specific Behaviors. Name each of
+these override sections after the appropriate `Behavior Name`. Sections for the
-|     **Setting**      |                                                                                     **Description**                                                                                     | **Applies To Trainer\*** |
-| :------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- |
-| batch_size           | The number of experiences in each iteration of gradient descent.                                                                                                                        | PPO, SAC             |
-| batches_per_epoch    | In imitation learning, the number of batches of training examples to collect before training the model.                                                                                 |                        |
-| beta                 | The strength of entropy regularization.                                                                                                                                                 | PPO                      |
-| buffer_size          | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer.                                                                   | PPO, SAC                 |
-| buffer_init_steps    | The number of experiences to collect into the buffer before updating the policy model.                                                                                                  | SAC                      |
-| epsilon              | Influences how rapidly the policy can evolve during training.                                                                                                                           | PPO                      |
-| hidden_units         | The number of units in the hidden layers of the neural network.                                                                                                                         | PPO, SAC             |
-| init_entcoef         | How much the agent should explore in the beginning of training.                                                                                                                         | SAC                      |
-| lambd                | The regularization parameter.                                                                                                                                                           | PPO                      |
-| learning_rate        | The initial learning rate for gradient descent.                                                                                                                                         | PPO, SAC             |
-| learning_rate_schedule | Determines how learning rate changes over time. | PPO, SAC |
-| max_steps            | The maximum number of simulation steps to run during a training session.                                                                                                                | PPO, SAC             |
-| memory_size          | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md).                                 | PPO, SAC             |
-| normalize            | Whether to automatically normalize observations.                                                                                                                                        | PPO, SAC                 |
-| num_epoch            | The number of passes to make through the experience buffer when performing gradient descent optimization.                                                                               | PPO                      |
-| num_layers           | The number of hidden layers in the neural network.                                                                                                                                      | PPO, SAC             |
-| behavioral_cloning          | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations).                           | PPO, SAC                 |
-| reward_signals       | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options.                                         | PPO, SAC             |
-| save_replay_buffer   | Saves the replay buffer when exiting training, and loads it on resume.                                                                                                                  | SAC                      |
-| sequence_length      | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC             |
-| summary_freq         | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard.                                                                       | PPO, SAC             |
-| tau                  | How aggressively to update the target network used for bootstrapping value estimation in SAC.                                                                                           | SAC                      |
-| time_horizon         | How many steps of experience to collect per-agent before adding it to the experience buffer.                                                                                            | PPO, SAC    |
-| trainer              | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc".                                                                                                             | PPO, SAC             |
-| train_interval       | How often to update the agent.                                                                                                                                                          | SAC                      |
-| num_update           | Number of mini-batches to update the agent with during each update.                                                                                                                     | SAC                      |
-| use_recurrent        | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md).                                                                                       | PPO, SAC             |
-| init_path        | Initialize trainer from a previously saved model.                                                                                       | PPO, SAC             |
+\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral
+Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
-\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitaiton Learning
+| **Setting**            | **Description**                                                                                                                                                                         | **Applies To Trainer\*** |
+| :--------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- |
+| batch_size             | The number of experiences in each iteration of gradient descent.                                                                                                                        | PPO, SAC                 |
+| batches_per_epoch      | In imitation learning, the number of batches of training examples to collect before training the model.                                                                                 |                          |
+| beta                   | The strength of entropy regularization.                                                                                                                                                 | PPO                      |
+| buffer_size            | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer.                                                                   | PPO, SAC                 |
+| buffer_init_steps      | The number of experiences to collect into the buffer before updating the policy model.                                                                                                  | SAC                      |
+| epsilon                | Influences how rapidly the policy can evolve during training.                                                                                                                           | PPO                      |
+| hidden_units           | The number of units in the hidden layers of the neural network.                                                                                                                         | PPO, SAC                 |
+| init_entcoef           | How much the agent should explore in the beginning of training.                                                                                                                         | SAC                      |
+| lambd                  | The regularization parameter.                                                                                                                                                           | PPO                      |
+| learning_rate          | The initial learning rate for gradient descent.                                                                                                                                         | PPO, SAC                 |
+| learning_rate_schedule | Determines how learning rate changes over time.                                                                                                                                         | PPO, SAC                 |
+| max_steps              | The maximum number of simulation steps to run during a training session.                                                                                                                | PPO, SAC                 |
+| memory_size            | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md).                                 | PPO, SAC                 |
+| normalize              | Whether to automatically normalize observations.                                                                                                                                        | PPO, SAC                 |
+| num_epoch              | The number of passes to make through the experience buffer when performing gradient descent optimization.                                                                               | PPO                      |
+| num_layers             | The number of hidden layers in the neural network.                                                                                                                                      | PPO, SAC                 |
+| behavioral_cloning     | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations).                    | PPO, SAC                 |
+| reward_signals         | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options.                                         | PPO, SAC                 |
+| save_replay_buffer     | Saves the replay buffer when exiting training, and loads it on resume.                                                                                                                  | SAC                      |
+| sequence_length        | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC                 |
+| summary_freq           | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard.                                                                       | PPO, SAC                 |
+| tau                    | How aggressively to update the target network used for bootstrapping value estimation in SAC.                                                                                           | SAC                      |
+| time_horizon           | How many steps of experience to collect per agent before adding it to the experience buffer.                                                                                            | PPO, SAC                 |
+| trainer                | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc".                                                                                                             | PPO, SAC                 |
+| train_interval         | How often to update the agent.                                                                                                                                                          | SAC                      |
+| num_update             | Number of mini-batches to update the agent with during each update.                                                                                                                     | SAC                      |
+| use_recurrent          | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md).                                                                                       | PPO, SAC                 |
+| init_path              | Initialize trainer from a previously saved model.                                                                                                                                       | PPO, SAC                 |
-* [Training with PPO](Training-PPO.md)
-* [Training with SAC](Training-SAC.md)
-* [Using Recurrent Neural Networks](Feature-Memory.md)
-* [Training with Curriculum Learning](Training-Curriculum-Learning.md)
-* [Training with Imitation Learning](Training-Imitation-Learning.md)
-* [Training with Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md)
+- [Training with PPO](Training-PPO.md)
+- [Training with SAC](Training-SAC.md)
+- [Training with Self-Play](Training-Self-Play.md)
+- [Using Recurrent Neural Networks](Feature-Memory.md)
+- [Training with Curriculum Learning](Training-Curriculum-Learning.md)
+- [Training with Imitation Learning](Training-Imitation-Learning.md)
+- [Training with Environment Parameter Randomization](Training-Environment-Parameter-Randomization.md)
-[example environments](Learning-Environment-Examples.md)
-to the corresponding sections of the `config/trainer_config.yaml` file for each
-example to see how the hyperparameters and other configuration variables have
-been changed from the defaults.
-
-### Debugging and Profiling
-If you enable the `--debug` flag in the command line, the trainer metrics are logged to a CSV file
-stored in the `summaries` directory. The metrics stored are:
-  * brain name
-  * time to update policy
-  * time since start of training
-  * time for last experience collection
-  * number of experiences used for training
-  * mean return
-
-This option is not available currently for Behavioral Cloning.
-
-Additionally, we have included basic [Profiling in Python](Profiling-Python.md) as part of the toolkit.
-This information is also saved in the `summaries` directory.
+[example environments](Learning-Environment-Examples.md) to the corresponding
+sections of the `config/trainer_config.yaml` file for each example to see how
+the hyperparameters and other configuration variables have been changed from the
+defaults.
--- a/ml-agents/mlagents/trainers/learn.py
+++ b/ml-agents/mlagents/trainers/learn.py
    )
    argparser.add_argument("trainer_config_path")
    argparser.add_argument(
-        "--env", default=None, dest="env_path", help="Name of the Unity executable "
+        "--env",
+        default=None,
+        dest="env_path",
+        help="Path to the Unity executable to train",
-        help="Curriculum config yaml file for environment",
+        help="YAML file for defining the lessons for curriculum training",
+    )
+    argparser.add_argument(
+        "--lesson",
+        default=0,
+        type=int,
+        help="The lesson to start with when performing curriculum training",
-        help="Reset parameter yaml file for environment",
+        help="YAML file for defining the sampler for environment parameter randomization",
-        help="How many model checkpoints to keep",
-    )
-    argparser.add_argument(
-        "--lesson", default=0, type=int, help="Start learning from this lesson"
+        help="The maximum number of model checkpoints to keep. Checkpoints are saved after the "
+        "number of steps specified by the save-freq option. Once the maximum number of checkpoints "
+        "has been reached, the oldest checkpoint is deleted when saving a new checkpoint.",
    )
    argparser.add_argument(
        "--load",
        default=False,
        dest="resume",
        action="store_true",
-        help="Resumes training from a checkpoint. Specify a --run-id to use this option.",
+        help="Whether to resume training from a checkpoint. Specify a --run-id to use this option. "
+        "If set, the training code loads an already trained model to initialize the neural network "
+        "before resuming training. This option is only valid when the models exist, and have the same "
+        "behavior names as the current agents in your scene.",
    )
    argparser.add_argument(
        "--force",
-        help="Force-overwrite existing models and summaries for a run ID that has been used "
-        "before.",
+        help="Whether to force-overwrite this run-id's existing summary and model data. (Without "
+        "this flag, attempting to train a model with a run-id that has been used before will throw "
+        "an error.)",
-        help="The run identifier for model and summary statistics.",
+        help="The identifier for the training run. This identifier is used to name the "
+        "subdirectories in which the trained model and summary statistics are saved as well "
+        "as the saved model itself. If you use TensorBoard to view the training statistics, "
+        "always set a unique run-id for each training run. (The statistics for all runs with the "
+        "same id are combined as if they were produced by the same session.)",
    )
    argparser.add_argument(
        "--initialize-from",
-        "This can be used, for instance, to fine-tune an existing model on a new environment. ",
+        "This can be used, for instance, to fine-tune an existing model on a new environment. "
+        "Note that the previously saved models must have the same behavior parameters as your "
+        "current environment.",
-        "--save-freq", default=50000, type=int, help="Frequency at which to save model"
+        "--save-freq",
+        default=50000,
+        type=int,
+        help="How often (in steps) to save the model during training",
-        "--seed", default=-1, type=int, help="Random seed used for training"
+        "--seed",
+        default=-1,
+        type=int,
+        help="A number to use as a seed for the random number generator used by the training code",
    )
    argparser.add_argument(
        "--train",
        default=False,
        dest="inference",
        action="store_true",
-        help="Run in Python inference mode (don't train). Use with --resume to load a model trained with an "
-        "existing run ID.",
+        help="Whether to run in Python inference mode (i.e. no training). Use with --resume to load "
+        "a model trained with an existing run ID.",
-        help="Base port for environment communication",
+        help="The starting port for environment communication. Each concurrent Unity environment "
+        "instance will get assigned a port sequentially, starting from the base-port. Each instance "
+        "will use the port (base_port + worker_id), where the worker_id is a sequential ID given to "
+        "each instance from 0 to (num_envs - 1). Note that when training using the Editor rather "
+        "than an executable, the base port will be ignored.",
-        help="Number of parallel environments to use for training",
+        help="The number of concurrent Unity environment instances to collect experiences "
+        "from when training",
-        help="Whether to run the environment in no-graphics mode",
+        help="Whether to run the Unity executable in no-graphics mode (i.e. without initializing "
+        "the graphics driver). Use this only if your agents don't use visual observations.",
-        help="Whether to run ML-Agents in debug mode with detailed logging",
+        help="Whether to enable debug-level logging for some parts of the code",
-        help="Arguments passed to the Unity executable.",
+        help="Arguments passed to the Unity executable. Be aware that the standalone build will also "
+        "process these as Unity Command Line Arguments. You should choose different argument names if "
+        "you want to create environment-specific arguments. All arguments after this flag will be "
+        "passed to the executable.",
-        "--cpu", default=False, action="store_true", help="Run with CPU only"
+        "--cpu",
+        default=False,
+        action="store_true",
+        help="Forces training using CPU only",
    )

    argparser.add_argument("--version", action="version", version="")
        "--width",
        default=84,
        type=int,
-        help="The width of the executable window of the environment(s)",
+        help="The width of the executable window of the environment(s) in pixels "
+        "(ignored for editor training).",
-        help="The height of the executable window of the environment(s)",
+        help="The height of the executable window of the environment(s) in pixels "
+        "(ignored for editor training).",
-        help="The quality level of the environment(s)",
+        help="The quality level of the environment(s). Equivalent to calling "
+        "QualitySettings.SetQualityLevel in Unity.",
-        help="The time scale of the Unity environment(s)",
+        help="The time scale of the Unity environment(s). Equivalent to setting "
+        "Time.timeScale in Unity.",
-        help="The target frame rate of the Unity environment(s)",
+        help="The target frame rate of the Unity environment(s). Equivalent to setting "
+        "Application.targetFrameRate in Unity.",
    )
    return argparser
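
The base-port help text above describes each concurrent environment instance using the port `base_port + worker_id`, with worker IDs running from 0 to `num_envs - 1`. A minimal sketch of that arithmetic follows. The helper name `assign_ports` and the example values are hypothetical and are not part of `learn.py`; the trainer's actual port handling may differ.

```python
from typing import List


def assign_ports(base_port: int, num_envs: int) -> List[int]:
    """Hypothetical helper: list the port each worker would use.

    Assumes the rule stated in the help text: port = base_port + worker_id,
    with worker_id running from 0 to num_envs - 1.
    """
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return [base_port + worker_id for worker_id in range(num_envs)]


# Illustrative values only: four concurrent environments starting at port 5005.
ports = assign_ports(base_port=5005, num_envs=4)
print(ports)
```

Under this rule, two training runs on the same machine would need base ports at least `num_envs` apart to avoid overlapping port ranges.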