
clarify introduction doc; fix broken links. (#123)

/develop-generalizationTraining-TrainerController
Arthur Juliani, 7 years ago
Current commit
43ac4148
2 files changed, 26 insertions and 26 deletions
  1. docs/Getting-Started-with-Balance-Ball.md (36)
  2. docs/best-practices-ppo.md (16)

docs/Getting-Started-with-Balance-Ball.md (36 changes)


![Balance Ball](../images/balance.png)
This tutorial will walk through the end-to-end process of installing Unity Agents, building an example environment, training an agent in it, and finally embedding the trained model into the Unity environment.
Unity ML Agents contains a number of example environments which can be used as templates for new environments, or as ways to test a new ML algorithm to ensure it is functioning correctly.
In this walkthrough we will be using the **3D Balance Ball** environment. The environment contains a number of platforms and balls. Platforms can act to keep the ball up by rotating either horizontally or vertically. Each platform is an agent which is rewarded the longer it can keep a ball balanced on it, and provided a negative reward for dropping the ball. The goal of the training process is to have the platforms learn to never drop the ball.
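
As a rough illustration of the reward structure described above, the per-step logic might look like the following sketch. The values shown here are placeholders; the real rewards are defined by the 3DBall agent script inside the Unity project.

```python
# Illustrative sketch of the per-step reward logic described above.
# The real rewards are defined by the 3DBall agent script in the Unity
# project; the numbers below are placeholders, not the actual values.
def step_reward(ball_on_platform: bool) -> float:
    if ball_on_platform:
        return 0.1   # small positive reward for each step the ball stays balanced
    return -1.0      # negative reward when the ball is dropped
```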

In order to install and set-up the Python and Unity environments, see the instructions [here](installation.md).
## Building Unity Environment
Launch the Unity Editor, and log in, if necessary.
1. Open the `unity-environment` folder using the Unity Editor. *(If this is not your first time running Unity, you will be able to skip most of these immediate steps and choose directly from the list of recently opened projects.)*
- On the initial dialog, choose `Open` from the options at the top.

To launch Jupyter, run the following from the command line:
`jupyter notebook`
Then navigate to `localhost:8888` to access the notebooks. If you're new to jupyter, check out the [quick start guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html) before you continue.

In order to train an agent to correctly balance the ball, we will use a Reinforcement Learning algorithm called Proximal Policy Optimization (PPO). This method has been shown to be safe, efficient, and more general purpose than many other RL algorithms; as such, we have chosen it as the example algorithm for use with ML Agents. For more information on PPO, OpenAI has a recent [blog post](https://blog.openai.com/openai-baselines-ppo/) explaining it.
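
For reference, the core of PPO is the clipped surrogate objective from the OpenAI paper, shown below for context; the exact loss optimized by the ML Agents trainer may include additional terms such as a value loss and an entropy bonus.

```latex
% PPO's clipped surrogate objective (Schulman et al., 2017)
L^{CLIP}(\theta) \;=\; \hat{\mathbb{E}}_t\!\left[\,\min\!\big(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\right],
\qquad r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
```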
2. Set `env_name` to the name of the environment file you built earlier (see the sketch after this list).
3. (optional) Set `run_path` to a directory of your choice.
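
A minimal sketch of what those notebook settings might look like; the variable names follow the steps above, but the exact cell contents depend on your version of the notebook.

```python
# Hypothetical notebook settings mirroring steps 2 and 3 above; adjust the
# values to match your own environment build and preferred output folder.
env_name = "3DBall"   # name of the Unity environment binary you built earlier
run_path = "ppo"      # directory where training summaries and models are written
```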
In order to observe the training process in more detail, you can use Tensorboard.
From the command line, navigate to the `python` directory and run:
`tensorboard --logdir=summaries`

## Embedding Trained Brain into Unity Environment _[Experimental]_
Once the training process displays an average reward of ~75 or greater, and there has been a recently saved model (denoted by the `Saved Model` message), you can choose to stop the training process by stopping the cell execution. Once this is done, you will have a trained TensorFlow model. You must now convert the saved model to a Unity-ready format which can be embedded directly into the Unity project by following the steps below.
### Setting up TensorFlowSharp Support
2. Make sure the TensorFlowSharp plugin is in your `Assets` folder. A Plugins folder which includes TF# can be downloaded [here](https://s3.amazonaws.com/unity-agents/TFSharpPlugin.unitypackage). Once downloaded, double-click the package to import it.
2. Set `Scripting Runtime Version` to `Experimental (.NET 4.6 Equivalent)`.
1. Run the final cell of the notebook under "Export the trained TensorFlow graph" to produce an `<env_name>.bytes` file.
2. Move `<env_name>.bytes` from `python/models/ppo/` into `unity-environment/Assets/ML-Agents/Examples/3DBall/TFModels/`.
3. Open the Unity Editor, and select the `3DBall` scene as described above.
4. Select the `3DBallBrain` object from the Scene hierarchy.
7. Set the `Graph Placeholder` size to 1 (_Note that steps 7 and 8 are needed because 3DBall is a continuous control environment, and the TensorFlow model requires a noise parameter to decide actions. In cases with discrete control, epsilon is not needed_).
8. Add a placeholder called `epsilon` with a type of `floating point` and a range of values from `0` to `0`.
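
To see why the `epsilon` placeholder exists, note that a continuous-control policy typically samples actions as a mean plus noise; fixing the placeholder's range to `0`–`0` feeds zero noise at inference time, so the embedded model simply acts on the mean. The following sketch illustrates the idea and is not the actual ML Agents graph code.

```python
import numpy as np

# Illustrative only: how a noise placeholder can enter continuous action
# selection. mu and sigma would come from the trained policy network.
def select_action(mu: np.ndarray, sigma: np.ndarray, epsilon: np.ndarray) -> np.ndarray:
    return mu + sigma * epsilon

mu = np.array([0.2, -0.1])
sigma = np.array([0.05, 0.05])
epsilon = np.zeros_like(mu)               # a range of 0 to 0 always yields zero noise
print(select_action(mu, sigma, epsilon))  # deterministic: just the mean action
```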
If you followed these steps correctly, you should now see the trained model being used to control the behavior of the balance ball within the Editor itself. From here you can re-build the Unity binary, and run it standalone with your agent's new learned behavior built right in.

docs/best-practices-ppo.md (16 changes)


# Best Practices when training with PPO
Training a Reinforcement Learning model often involves tuning the hyperparameters in order to achieve a desirable level of performance. This guide contains some best practices for tuning the training process when the default parameters don't seem to give the level of performance you would like.

`batch_size` corresponds to how many experiences are used for each gradient descent update. This should always be a fraction
of the `buffer_size`. If you are using a continuous action space, this value should be large. If you are using a discrete action space, this value should be smaller.
Typical Range (Continuous): `512` - `5120`

### Beta
`beta` corresponds to the strength of the entropy regularization. This ensures that discrete action space agents properly
explore during training. Increasing this will ensure more random actions are taken. This should be adjusted such that
the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly,
increase `beta`. If entropy drops too slowly, decrease `beta`.
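
To make the role of `beta` concrete, here is a schematic of how an entropy bonus typically enters a PPO-style loss. This is a sketch of the general technique, not the exact ML Agents implementation.

```python
# Schematic PPO-style loss with an entropy bonus; all inputs are placeholders.
# A larger beta weights the entropy term more heavily, encouraging more random
# (exploratory) actions; a smaller beta lets the policy settle down sooner.
def total_loss(policy_loss, value_loss, entropy, beta=1e-3, value_coeff=0.5):
    return policy_loss + value_coeff * value_loss - beta * entropy
```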

`buffer_size` corresponds to how many experiences should be collected before gradient descent is performed on them all.
This should be a multiple of `batch_size`.
Typical Range: `2048` - `409600`
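
A quick way to sanity-check the relationship between `batch_size` and `buffer_size` is shown below. The values are hypothetical; the defaults shipped with the trainer may differ.

```python
# Hypothetical hyperparameter choices; buffer_size should be a multiple of batch_size.
hyperparameters = {
    "batch_size": 1024,    # experiences per gradient descent update
    "buffer_size": 10240,  # experiences collected before any updates are performed
}

assert hyperparameters["buffer_size"] % hyperparameters["batch_size"] == 0, \
    "buffer_size should be a multiple of batch_size"
```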

`time_horizon` corresponds to how many steps of experience to collect per-agent before adding it to the experience buffer.
In cases where there are frequent rewards within an episode, or episodes are prohibitively large, this can be a smaller number.
For most stable training however, this number should be large enough to capture all the important behavior within a sequence of
an agent's actions.
Typical Range: `64` - `2048`
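
Conceptually, `time_horizon` caps how long a trajectory grows before it is cut and handed to the experience buffer, with a value estimate standing in for the reward beyond the cut. A rough sketch of that idea follows; the function and names are hypothetical, not the trainer's actual code.

```python
# Rough sketch: cut an agent's ongoing trajectory every `time_horizon` steps.
# When the cut is not a true episode end, a value estimate stands in for the
# reward that would have been collected beyond it. Names here are hypothetical.
def maybe_flush(trajectory, buffer, value_estimate, time_horizon=64, done=False):
    if done or len(trajectory) >= time_horizon:
        bootstrap = 0.0 if done else value_estimate  # stand-in for future reward
        buffer.append((list(trajectory), bootstrap))
        trajectory.clear()
```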

To view training statistics, use Tensorboard. For information on launching and using Tensorboard, see [here](./Getting-Started-with-Balance-Ball.md#observing-training-progress).
### Cumulative Reward

### Entropy

This corresponds to how random the decisions of a brain are. This should consistently decrease during training. If it decreases too soon or not at all, `beta` should be adjusted (when using a discrete action space).
### Learning Rate

### Value Estimate
These values should increase with the reward. They correspond to how much future reward the agent predicts itself receiving at any given point.
### Value Loss