|
|
|
|
|
|
|
|
|
|
## Sample Environment |
|
|
|
|
|
|
|
Imagine a task in which an agent needs to scale a wall to arrive at a goal. The
starting point when training an agent to accomplish this task will be a random
policy. That starting policy will have the agent running in circles, and will
likely never, or only very rarely, scale the wall properly to achieve the
reward. If we start with a simpler task, such as moving toward an unobstructed
goal, then the agent can easily learn to accomplish the task. From there, we can
slowly add to the difficulty of the task by increasing the size of the wall,
until the agent can complete the initially near-impossible task of scaling the
wall. We are including just such an environment with the ML-Agents toolkit 0.2,
called __Wall Jump__.
|
|
|
|
|
|
_Demonstration of a curriculum training scenario in which a progressively
taller wall obstructs the path to the goal._
|
|
|
|
|
|
To see this in action, observe the two learning curves below. Each displays the
reward over time for an agent trained using PPO with the same set of training
hyperparameters. The difference is that one agent was trained using the
full-height wall version of the task, and the other agent was trained using the
curriculum version of the task. As you can see, without curriculum learning the
agent has a lot of difficulty. We think that by using well-crafted curricula,
agents trained using reinforcement learning will be able to accomplish tasks
that would otherwise be much more difficult.
|
|
|
|
|
|
|
|
|
|
![Log](images/curriculum_progress.png) |
|
|
|
|
|
|
|
|
|
|
### Specifying a Metacurriculum |
|
|
|
|
|
|
|
We first create a folder inside `python/curricula/` for the environment we want
to use curriculum learning with. For example, if we were creating a
metacurriculum for Wall Jump, we would create the folder
`python/curricula/wall-jump/`. We will place our curriculums inside this folder.
|
|
|
|
|
|
In order to define a curriculum, the first step is to decide which parameters
of the environment will vary. In the case of the Wall Jump environment, what
varies is the height of the wall. We define this as a `Reset Parameter` in the
Academy object of our scene, and by doing so it becomes adjustable via the
Python API. Rather than adjusting it by hand, we will create a simple JSON file
which describes the structure of the curriculum. Within it, we can specify at
which points in the training process our wall height will change, based either
on the percentage of training steps that have taken place or on the average
reward the agent has received in the recent past. Below is an example
curriculum for the BigWallBrain in the Wall Jump environment.
|
|
|
|
|
|
|
|
|
|
```json
{
    "measure" : "reward",
    "thresholds" : [0.1, 0.3, 0.5],
    "min_lesson_length" : 2,
    "signal_smoothing" : true,
    "parameters" :
    {
        "big_wall_min_height" : [0.0, 4.0, 6.0, 8.0],
        "big_wall_max_height" : [4.0, 7.0, 8.0, 8.0]
    }
}
```
|
|
* `measure` - What to measure learning progress, and advancement in lessons, by.
    * `reward` - Uses a measure of received reward.
    * `progress` - Uses the ratio of steps/max_steps.
|
|
|
* `thresholds` (float array) - Points in the value of `measure` at which the
  lesson should be incremented.
* `min_lesson_length` (int) - How many times the progress measure should be
  reported before incrementing the lesson.
* `signal_smoothing` (true/false) - Whether to weight the current progress
  measure by previous values.
|
|
|
|
|
|
* `parameters` (dictionary of key:string, value:float array) - Corresponds to
  the academy reset parameters to control. The length of each array should be
  one greater than the number of thresholds.
|
|
|
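Note that in the example above, the three reward thresholds divide training
into four lessons (one before any threshold is crossed, then one per crossing),
which is why each parameter array holds four values.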
|
|
|
Once our curriculum is defined, we have to use the reset parameters we defined
and modify the environment from the agent's `AgentReset()` function. See the
Wall Jump example environment for a reference.
|
|
|
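To make this concrete, here is a minimal sketch of what such an `AgentReset()`
might look like. It assumes the Academy in this era of the toolkit exposes its
reset parameters as a `resetParameters` dictionary, and that `academy` and
`wall` references are wired up in the Unity Inspector; the parameter names
match the JSON example above. This is illustrative, not the shipped Wall Jump
code:

```csharp
using UnityEngine;

// Illustrative sketch, not the shipped Wall Jump agent.
public class CurriculumWallAgent : Agent
{
    public Academy academy;  // assumed reference to the scene's Academy
    public Transform wall;   // assumed reference to the wall's Transform

    public override void AgentReset()
    {
        // Read the bounds the curriculum set via the reset parameters.
        float minHeight = academy.resetParameters["big_wall_min_height"];
        float maxHeight = academy.resetParameters["big_wall_max_height"];

        // Sample a wall height within the current lesson's allowed range.
        float height = Random.Range(minHeight, maxHeight);
        wall.localScale = new Vector3(
            wall.localScale.x, height, wall.localScale.z);
    }
}
```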
|
|
|
Each curriculum file is saved into the metacurriculum folder and named after
its corresponding Brain. For example, in the Wall Jump environment, there are
two brains: BigWallBrain and SmallWallBrain. If we want to define a curriculum
for the BigWallBrain, we will save `BigWallBrain.json` into
`python/curricula/wall-jump/`.
|
|
|
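With a curriculum defined for each brain, the metacurriculum folder would look
something like this:

```
python/curricula/wall-jump/
    BigWallBrain.json
    SmallWallBrain.json
```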
|
|
|
Once we have specified our metacurriculum and curriculums, we can launch
`learn.py` using the `--curriculum` flag to point to the metacurriculum folder,
and PPO will train using curriculum learning. For example, to train agents in
the Wall Jump environment with curriculum learning, we can run `python learn.py
--curriculum=curricula/wall-jump/ --run-id=wall-jump-curriculum --train`. We
can then keep track of the current lesson and progress via TensorBoard.
|
|
|
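For example, assuming the default output location for this version of the
toolkit, running `tensorboard --logdir=summaries` from within the `python`
directory should display the current lesson alongside the usual training
statistics.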