* `--train`: Specifies whether to train the model or only run it in inference
  mode. When training, **always** use the `--train` option.
* `--num-envs=<n>`: Specifies the number of concurrent Unity environment instances to collect
experiences from when training. Defaults to 1.
* `--base-port`: Specifies the starting port. Each concurrent Unity environment
  instance is assigned a port sequentially, starting from `base-port`: instance
  `worker_id` uses port `(base_port + worker_id)`, where `worker_id` is a
  sequential ID from 0 to `num_envs - 1`. For example, with `--num-envs=3` and
  the default `--base-port=5005`, the three instances use ports 5005, 5006,
  and 5007. Defaults to 5005.
* `--docker-target-name=<dt>`: The Docker Volume on which to store curriculum,
executable and model files. See [Using Docker](Using-Docker.md).
* `--load`: If set, the training code loads an already trained model to
  initialize the neural network before training. It looks for the model in
  `models/<run-id>/` (which is also where models are saved at the end of
  training). When not set (the default), the neural network weights are
  randomly initialized and no existing model is loaded.
* `--no-graphics`: Specify this option to run the Unity executable in
  `-batchmode` without initializing the graphics driver. Use this only if your
  training doesn't involve visual observations (reading from pixels).
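
For example, the options above can be combined into a single call. The sketch
below uses the flags documented in this list; the executable name `3DBall` and
the run ID `firstRun` are illustrative placeholders:

```sh
# Resume training from models/firstRun/ using 4 parallel environment
# instances on ports 5005-5008, with graphics disabled.
mlagents-learn config/trainer_config.yaml --env=3DBall --run-id=firstRun \
    --num-envs=4 --base-port=5005 --load --no-graphics --train
```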
The training config files `config/trainer_config.yaml`, `config/sac_trainer_config.yaml`,
`config/gail_config.yaml`, `config/online_bc_config.yaml` and `config/offline_bc_config.yaml`
specify the training method, the hyperparameters, and a few additional values to use when
training with Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), GAIL (Generative
Adversarial Imitation Learning) with PPO, and online and offline Behavioral Cloning
(BC)/Imitation. These files are divided into sections. The **default** section defines the
default values for all the available settings. You can also add new sections to override
these defaults to train specific Brains. Name each of these override sections after the
GameObject containing the Brain component that should use these settings. (This GameObject
will be a child of the Academy in your scene.)
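
For instance, an override section might look like the following sketch. The
setting values here are illustrative rather than recommendations, and the
section name `3DBallLearning` stands in for a GameObject in your own scene:

```yaml
default:
    trainer: ppo
    batch_size: 1024
    buffer_size: 10240
    hidden_units: 128
    learning_rate: 3.0e-4
    max_steps: 5.0e4
    num_layers: 2
    summary_freq: 1000
    time_horizon: 64
    use_recurrent: false

# Overrides a few defaults for the Brain on the 3DBallLearning GameObject;
# every other setting falls back to the default section above.
3DBallLearning:
    max_steps: 5.0e5
    summary_freq: 2000
```

The available settings are described in the table below.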
| **Setting** | **Description** | **Applies To Trainer\*** |
| :---------- | :-------------- | :----------------------- |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC, BC |
| buffer_size | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer. | PPO, SAC |
| buffer_init_steps | The number of experiences to collect into the buffer before updating the policy model. | SAC |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC, BC |
| init_entcoef | How much the agent should explore in the beginning of training. | SAC |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC, BC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC, BC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC, BC |
| save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC, BC |
| tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC, (online)BC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC, BC |
| train_interval | How often to update the agent. | SAC |
| num_update | Number of mini-batches to update the agent with during each update. | SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
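
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral
Cloning (Imitation)

As a concrete illustration of the SAC-only settings above, a section in
`config/sac_trainer_config.yaml` might combine them as in this sketch (the
values are illustrative, not tuned recommendations):

```yaml
default:
    trainer: sac
    buffer_size: 50000        # max size of the experience replay buffer
    buffer_init_steps: 1000   # experiences to collect before the first update
    init_entcoef: 0.5         # how much to explore early in training
    tau: 0.005                # how aggressively to update the target network
    train_interval: 1         # steps between agent updates
    num_update: 1             # mini-batches per agent update
    save_replay_buffer: true  # persist the buffer when exiting and resuming
```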