
Wrapping lines in curriculum docs.

/develop-generalizationTraining-TrainerController
Deric Pang · 6 years ago
Commit 8cb290c4
1 file changed, 53 insertions and 46 deletions
docs/Training-Curriculum-Learning.md

## Sample Environment
Imagine a task in which an agent needs to scale a wall to arrive at a goal. The
starting point when training an agent to accomplish this task will be a random
policy. That starting policy will have the agent running in circles, and will
likely never, or only very rarely, scale the wall properly to achieve the
reward. If we start with a simpler task, such as moving toward an unobstructed
goal, then the agent can easily learn to accomplish the task. From there, we can
slowly add to the difficulty of the task by increasing the size of the wall,
until the agent can complete the initially near-impossible task of scaling the
wall. We are including just such an environment with the ML-Agents toolkit 0.2,
called __Wall Jump__.
_Demonstration of a curriculum training scenario in which a progressively taller
wall obstructs the path to the goal._
To see this in action, observe the two learning curves below. Each displays the
reward over time for an agent trained using PPO with the same set of training
hyperparameters. The difference is that one agent was trained using the
full-height wall version of the task, and the other agent was trained using the
curriculum version of the task. As you can see, without curriculum learning the
agent has a lot of difficulty. We think that by using well-crafted curricula,
agents trained using reinforcement learning will be able to accomplish tasks
that would otherwise be much more difficult.
![Log](images/curriculum_progress.png)

### Specifying a Metacurriculum
We first create a folder inside `python/curricula/` for the environment we want
to use curriculum learning with. For example, if we were creating a
metacurriculum for Wall Jump, we would create the folder
`python/curricula/wall-jump/`. We will place our curriculums inside this folder.
In order to define a curriculum, the first step is to decide which parameters of
the environment will vary. In the case of the Wall Jump environment, what varies
is the height of the wall. We define this as a `Reset Parameter` in the Academy
object of our scene, and by doing so it becomes adjustable via the Python API.
Rather than adjusting it by hand, we will create a simple JSON file which
describes the structure of the curriculum. Within it, we can specify at which
points in the training process the wall height will change, either based on the
percentage of training steps which have taken place, or on the average reward
the agent has received in the recent past. Below is an example curriculum for
the BigWallBrain in the Wall Jump environment (the keys under `parameters` must
match the Reset Parameters defined in the Academy; the threshold and height
values shown are illustrative).
```json
{
  "measure" : "progress",
  "thresholds" : [0.1, 0.3, 0.5],
  "min_lesson_length" : 100,
  "signal_smoothing" : true,
  "parameters" :
  {
    "big_wall_min_height" : [0.0, 4.0, 6.0, 8.0],
    "big_wall_max_height" : [4.0, 7.0, 8.0, 8.0]
  }
}
```

* `measure` - The metric by which learning progress is measured and lessons are
  advanced.
    * `reward` - Uses a measure of the received reward.
    * `progress` - Uses the ratio of steps/max_steps.
* `thresholds` (float array) - Points in the value of `measure` at which the
  lesson should be incremented.
* `min_lesson_length` (int) - How many times the progress measure should be
  reported before incrementing the lesson.
* `signal_smoothing` (true/false) - Whether to weight the current progress
  measure by previous values.
* `parameters` (dictionary of key:string, value:float array) - Corresponds to
  the Academy Reset Parameters to control. The length of each array should be
  one greater than the number of thresholds.
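
These fields interact as follows: each time the chosen measure is reported it is
optionally smoothed, and once it has been reported at least `min_lesson_length`
times and exceeds the next threshold, the lesson index advances and the
corresponding entry of each `parameters` array becomes the new reset value. The
sketch below illustrates that logic only; it is not the toolkit's actual
implementation, and the class name, method name, and 0.75/0.25 smoothing weights
are assumptions.

```python
class CurriculumSketch:
    """Illustrative-only model of how a curriculum advances lessons."""

    def __init__(self, config):
        self.config = config    # parsed curriculum JSON (e.g. BigWallBrain.json)
        self.lesson = 0         # current lesson index
        self.reports = 0        # times the measure has been reported this lesson
        self.smoothed = 0.0     # running value used when signal_smoothing is true

    def report_measure(self, value):
        """`value` is the recent reward or the steps/max_steps ratio, per `measure`."""
        if self.config["signal_smoothing"]:
            # Assumed weighting: blend the new value with the previous smoothed value.
            value = 0.75 * value + 0.25 * self.smoothed
            self.smoothed = value
        self.reports += 1

        thresholds = self.config["thresholds"]
        if (self.lesson < len(thresholds)
                and self.reports >= self.config["min_lesson_length"]
                and value > thresholds[self.lesson]):
            self.lesson += 1
            self.reports = 0

        # Each Reset Parameter takes the value listed for the current lesson.
        return {name: values[self.lesson]
                for name, values in self.config["parameters"].items()}
```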
Once our curriculum is defined, we have to use the reset parameters we defined
and modify the environment from the agent's `AgentReset()` function. See the
Wall Jump example environment for a reference implementation.
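
On the Python side, these are the same reset parameters that can be passed to
the environment when it resets. A minimal sketch, assuming the `unityagents`
package used by this version of the toolkit (the import path and exact `reset`
signature may differ in other releases) and a hypothetical build name:

```python
from unityagents import UnityEnvironment  # assumed import path for this toolkit version

# "wall-jump" is a hypothetical name for a built Wall Jump environment binary.
env = UnityEnvironment(file_name="wall-jump")

# Each key must match a Reset Parameter defined on the Academy; these values
# correspond to the second lesson of the example curriculum above.
reset_config = {"big_wall_min_height": 4.0, "big_wall_max_height": 7.0}
info = env.reset(train_mode=True, config=reset_config)

env.close()
```

During curriculum training, `learn.py` handles this for us, passing the values
for the current lesson whenever the environment resets.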
Each curriculum is saved into a JSON file named after its corresponding Brain.
For example, in the Wall Jump environment, there are two brains---BigWallBrain
and SmallWallBrain. If we want to define a curriculum for the BigWallBrain, we
will save `BigWallBrain.json` into the metacurriculum folder we created earlier,
`python/curricula/wall-jump/`.
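
If we also define a curriculum for the SmallWallBrain, the metacurriculum folder
would contain both `BigWallBrain.json` and `SmallWallBrain.json`, one file per
Brain whose difficulty we want to vary.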
Once we have specified our metacurriculum and curriculums, we can launch
`learn.py` using the `--curriculum` flag to point to the metacurriculum folder,
and PPO will train using curriculum learning. For example, to train agents in
the Wall Jump environment with curriculum learning, we can run
`python learn.py --curriculum=curricula/wall-jump/ --run-id=wall-jump-curriculum --train`.
We can then keep track of the current lessons and training progress via
TensorBoard.