
Action Docs part2 (#4739)

* reduce usage of "vector action" and "action space"

* more cleanup

* undo GettingStarted change for now

* batch size description

* Apply suggestions from code review

Co-authored-by: andrewcoh <54679309+andrewcoh@users.noreply.github.com>

Branch: MLA-1734-demo-provider
Commit: a0d1c829 (GitHub, 4 years ago)
13 files changed, 60 insertions(+), 163 deletions(-)
1. docs/Getting-Started.md (5)
2. docs/Learning-Environment-Design-Agents.md (59)
3. docs/Learning-Environment-Examples.md (38)
4. docs/Learning-Environment-Executable.md (3)
5. docs/ML-Agents-Overview.md (4)
6. docs/Python-API.md (3)
7. docs/Training-Configuration-File.md (4)
8. ml-agents-envs/mlagents_envs/base_env.py (4)
9. ml-agents/mlagents/trainers/demo_loader.py (2)
10. ml-agents/mlagents/trainers/policy/tf_policy.py (3)
11. ml-agents/mlagents/trainers/policy/torch_policy.py (2)
12. ml-agents/mlagents/trainers/tests/mock_brain.py (3)
13. docs/images/monitor.png (93)

docs/Getting-Started.md (5)

eight elements: the `x` and `z` components of the agent cube's rotation and the
`x`, `y`, and `z` components of the ball's relative position and velocity.
- #### Behavior Parameters : Vector Action Space
+ #### Behavior Parameters : Actions
+ An Agent is given instructions in the form of actions.
+ ML-Agents Toolkit classifies actions into two types: continuous and discrete.

Number of Visual Observations (per agent): 0
Vector Observation space size (per agent): 8
Number of stacked Vector Observation: 1
Vector Action space type: continuous
Vector Action space size (per agent): [2]
Vector Action descriptions: ,
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain 3DBallLearning:
batch_size: 64
beta: 0.001
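
The same summary can also be pulled from Python through the `mlagents_envs` low-level API. A minimal sketch, assuming a local 3DBall build (the executable path is a placeholder) and the `BehaviorSpec`/`ActionSpec` fields from roughly this release:

```
from mlagents_envs.environment import UnityEnvironment

# Connect to a 3DBall build; replace the path with your own executable.
env = UnityEnvironment(file_name="./3DBall")
env.reset()

# Each behavior advertises its observation shapes and an ActionSpec,
# which is where the continuous/discrete action sizes above come from.
for name, spec in env.behavior_specs.items():
    print(name)
    print("  observation shapes:      ", spec.observation_shapes)
    print("  continuous action size:  ", spec.action_spec.continuous_size)
    print("  discrete action branches:", spec.action_spec.discrete_branches)

env.close()
```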

docs/Learning-Environment-Design-Agents.md (59)

- [Raycast Observations](#raycast-observations)
- [RayCast Observation Summary & Best Practices](#raycast-observation-summary--best-practices)
- [Actions](#actions)
- - [Continuous Action Space](#continuous-action-space)
- - [Discrete Action Space](#discrete-action-space)
+ - [Continuous Actions](#continuous-actions)
+ - [Discrete Actions](#discrete-actions)
- [Masking Discrete Actions](#masking-discrete-actions)
- [Actions Summary & Best Practices](#actions-summary--best-practices)
- [Rewards](#rewards)

method calls `VectorSensor.AddObservation()` such that vector size adds up to 8,
the Behavior Parameters of the Agent are set with vector observation space
with a state size of 8.
- - `Agent.OnActionReceived()` — The vector action spaces result
+ - `Agent.OnActionReceived()` — The action results
in a small change in the agent cube's rotation at each step. In this example,
an Agent receives a small positive reward for each step it keeps the ball on the
agent cube's head and a larger, negative reward for dropping the ball. An

An action is an instruction from the Policy that the agent carries out. The
action is passed to the Agent as the `ActionBuffers` parameter when the Academy invokes the
- agent's `OnActionReceived()` function. There are two types of actions supported:
+ agent's `OnActionReceived()` function. There are two types of actions that an Agent can use:
**Continuous** and **Discrete**.
Neither the Policy nor the training algorithm know anything about what the

for an Agent is in the `OnActionReceived()` function.
For example, if you designed an agent to move in two dimensions, you could use
- either continuous or the discrete vector actions. In the continuous case, you
- would set the vector action size to two (one for each dimension), and the
- agent's Policy would create an action with two floating point values. In the
+ either continuous or the discrete actions. In the continuous case, you
+ would set the action size to two (one for each dimension), and the
+ agent's Policy would output an action with two floating point values. In the
- movement), and the Policy would create an action array containing two elements
- with values ranging from zero to one.
+ movement), and the Policy would output an action array containing two elements
+ with values ranging from zero to one. You could alternatively use a combination of continuous
+ and discrete actions e.g., using one continuous action for horizontal movement
+ and a discrete branch of size two for the vertical movement.
The [3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) and
[Area](Learning-Environment-Examples.md#push-block) example environments are set
up to use either the continuous or the discrete vector action spaces.
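
To make the sizes concrete, the designs described above map onto the Python-side `ActionSpec` roughly as follows; a minimal sketch, assuming the `ActionSpec` NamedTuple fields (`continuous_size`, `discrete_branches`) from this release:

```
from mlagents_envs.base_env import ActionSpec

# Continuous design: two floats, one per movement dimension.
continuous_spec = ActionSpec(continuous_size=2, discrete_branches=())

# Discrete design: two branches, one per movement dimension, each with two
# possible values (hence action elements ranging from zero to one).
discrete_spec = ActionSpec(continuous_size=0, discrete_branches=(2, 2))

# Mixed design: one continuous action for horizontal movement plus a
# discrete branch of size two for vertical movement.
hybrid_spec = ActionSpec(continuous_size=1, discrete_branches=(2,))

print(continuous_spec.continuous_size)  # 2
print(discrete_spec.discrete_branches)  # (2, 2)
print(hybrid_spec)
```
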
- ### Continuous Action Space
+ ### Continuous Actions
- is an array with length equal to the `Vector Action Space Size` property value. The
+ is an array with length equal to the `Continuous Action Size` property value. The
- The [Reacher example](Learning-Environment-Examples.md#reacher) defines a
- continuous action space with four control values.
+ The [Reacher example](Learning-Environment-Examples.md#reacher) uses
+ continuous actions with four control values.
![reacher](images/reacher.png)

```
By default the output from our provided PPO algorithm pre-clamps the values of
- `vectorAction` into the [-1, 1] range. It is a best practice to manually clip
+ `ActionBuffers.ContinuousActions` into the [-1, 1] range. It is a best practice to manually clip
- ### Discrete Action Space
+ ### Discrete Actions
- is an array of integers. When defining the discrete vector action space, `Branches`
+ is an array of integers with length equal to `Discrete Branch Size`. When defining the discrete actions, `Branches`
is an array of integers, each value corresponds to the number of possibilities for each branch.
For example, if we wanted an Agent that can move in a plane and jump, we could

### Actions Summary & Best Practices
- - Agents can either use `Discrete` or `Continuous` actions.
+ - Agents can use `Discrete` and/or `Continuous` actions.
- - In general, smaller action spaces will make for easier learning.
- - Be sure to set the Vector Action's Space Size to the number of used Vector
- Actions, and not greater, as doing the latter can interfere with the
+ - In general, fewer actions will make for easier learning.
+ - Be sure to set the Continuous Action Size and Discrete Branch Size to the desired
+ number for each type of action, and not greater, as doing the latter can interfere with the
efficiency of the training process.
- Continuous action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
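
The clipping advice above is also easy to apply on the Python side when you drive an environment or post-process policy output; a minimal NumPy sketch with illustrative values:

```
import numpy as np

# One row per agent, one column per continuous action; a policy's raw
# output can fall slightly outside the recommended [-1, 1] range.
actions = np.array([[1.3, -0.2],
                    [-1.7, 0.8]], dtype=np.float32)

# Clip every continuous value into [-1, 1] before it is used.
clipped = np.clip(actions, -1.0, 1.0)
print(clipped)  # values outside [-1, 1] become -1 or 1
```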

be stacked and used collectively for decision making. This results in the
effective size of the vector observation being passed to the Policy being:
_Space Size_ x _Stacked Vectors_.
- - `Vector Action`
- - `Space Type` - Corresponds to whether action vector contains a single
- integer (Discrete) or a series of real-valued floats (Continuous).
- - `Space Size` (Continuous) - Length of action vector.
- - `Branches` (Discrete) - An array of integers, defines multiple concurrent
- discrete actions. The values in the `Branches` array correspond to the
- number of possible discrete values for each action branch.
+ - `Actions`
+ - `Continuous Actions` - The number of concurrent continuous actions that
+ the Agent can take.
+ - `Discrete Branches` - An array of integers, defines multiple concurrent
+ discrete actions. The values in the `Discrete Branches` array correspond
+ to the number of possible discrete values for each action branch.
- `Model` - The neural network model used for inference (obtained after
training)
- `Inference Device` - Whether to use CPU or GPU to run the model during

docs/Learning-Environment-Examples.md (38)

- +1.0 for arriving at optimal state.
- Behavior Parameters:
- Vector Observation space: One variable corresponding to current state.
- - Vector Action space: (Discrete) Two possible actions (Move left, move
+ - Actions: 1 discrete action branch with 3 actions (Move left, do nothing, move
right).
- Visual Observations: None
- Float Properties: None

cube, and position and velocity of ball.
- Vector Observation space (Hard Version): 5 variables corresponding to
rotation of the agent cube and position of ball.
- - Vector Action space: (Continuous) Size of 2, with one value corresponding to
+ - Actions: 2 continuous actions, with one value corresponding to
X-rotation, and the other to Z-rotation.
- Visual Observations: Third-person view from the upper-front of the agent. Use
`Visual3DBall` scene.

- -1.0 if the agent navigates to an obstacle (episode ends).
- Behavior Parameters:
- Vector Observation space: None
- - Vector Action space: (Discrete) Size of 4, corresponding to movement in
- cardinal directions. Note that for this environment,
+ - Actions: 1 discrete action branch with 5 actions, corresponding to movement in
+ cardinal directions or not moving. Note that for this environment,
[action masking](Learning-Environment-Design-Agents.md#masking-discrete-actions)
is turned on by default (this option can be toggled using the `Mask Actions`
checkbox within the `trueAgent` GameObject). The trained model file provided

- Behavior Parameters:
- Vector Observation space: 9 variables corresponding to position, velocity
and orientation of ball and racket.
- - Vector Action space: (Continuous) Size of 3, corresponding to movement
+ - Actions: 3 continuous actions, corresponding to movement
toward net or away from net, jumping and rotation.
- Visual Observations: None
- Float Properties: Three

- Vector Observation space: (Continuous) 70 variables corresponding to 14
ray-casts each detecting one of three possible objects (wall, goal, or
block).
- - Vector Action space: (Discrete) Size of 6, corresponding to turn clockwise
- and counterclockwise and move along four different face directions.
+ - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
+ and counterclockwise, move along four different face directions, or do nothing.
- Visual Observations (Optional): One first-person camera. Use
`VisualPushBlock` scene. **The visual observation version of this
environment does not train with the provided default training parameters.**

- Vector Observation space: Size of 74, corresponding to 14 ray casts each
detecting 4 possible objects, plus the global position of the agent and
whether or not the agent is grounded.
- - Vector Action space: (Discrete) 4 Branches:
+ - Actions: 4 discrete action branches:
- Forward Motion (3 possible actions: Forward, Backwards, No Action)
- Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
- Side Motion (3 possible actions: Left, Right, No Action)

- Behavior Parameters:
- Vector Observation space: 26 variables corresponding to position, rotation,
velocity, and angular velocities of the two arm rigid bodies.
- - Vector Action space: (Continuous) Size of 4, corresponding to torque
+ - Actions: 4 continuous actions, corresponding to torque
applicable to two joints.
- Visual Observations: None.
- Float Properties: Five

- Vector Observation space: 172 variables corresponding to position, rotation,
velocity, and angular velocities of each limb plus the acceleration and
angular acceleration of the body.
- - Vector Action space: (Continuous) Size of 20, corresponding to target
+ - Actions: 20 continuous actions, corresponding to target
rotations for joints.
- Visual Observations: None
- Float Properties: None

- Vector Observation space: 64 variables corresponding to position, rotation,
velocity, and angular velocities of each limb plus the acceleration and
angular acceleration of the body.
- - Vector Action space: (Continuous) Size of 9, corresponding to target
+ - Actions: 9 continuous actions, corresponding to target
rotations for joints.
- Visual Observations: None
- Float Properties: None

agent is frozen and/or shot its laser (2), plus ray-based perception of
objects around agent's forward direction (49; 7 raycast angles with 7
measurements for each).
- - Vector Action space: (Discrete) 4 Branches:
+ - Actions: 4 discrete action branches:
- Forward Motion (3 possible actions: Forward, Backwards, No Action)
- Side Motion (3 possible actions: Left, Right, No Action)
- Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)

- Behavior Parameters:
- Vector Observation space: 30 corresponding to local ray-casts detecting
objects, goals, and walls.
- - Vector Action space: (Discrete) 1 Branch, 4 actions corresponding to agent
+ - Actions: 1 discrete action branch, with 4 actions corresponding to agent
rotation and forward/backward movement.
- Visual Observations (Optional): First-person view for the agent. Use
`VisualHallway` scene. **The visual observation version of this environment

- Behavior Parameters:
- Vector Observation space: 6 corresponding to local position of agent and
green cube.
- - Vector Action space: (Continuous) 3 corresponding to agent force applied for
+ - Actions: 3 continuous actions corresponding to agent force applied for
the jump.
- Visual Observations: None
- Float Properties: Two

degrees each detecting 6 possible object types, along with the object's
distance. The forward ray-casts contribute 264 state dimensions and backward
72 state dimensions over three observation stacks.
- - Vector Action space: (Discrete) Three branched actions corresponding to
+ - Actions: 3 discrete branched actions corresponding to
forward, backward, sideways movement, as well as rotation.
- Visual Observations: None
- Float Properties: Two

degrees each detecting 5 possible object types, along with the object's
distance. The forward ray-casts contribute 231 state dimensions and backward
63 state dimensions over three observation stacks.
- - Striker Vector Action space: (Discrete) Three branched actions corresponding
+ - Striker Actions: 3 discrete branched actions corresponding
- - Goalie Vector Action space: (Discrete) Three branched actions corresponding
+ - Goalie Actions: 3 discrete branched actions corresponding
to forward, backward, sideways movement, as well as rotation.
- Visual Observations: None
- Float Properties: Two

- Behavior Parameters:
- Vector Observation space: 243 variables corresponding to position, rotation,
velocity, and angular velocities of each limb, along with goal direction.
- - Vector Action space: (Continuous) Size of 39, corresponding to target
+ - Actions: 39 continuous actions, corresponding to target
rotations and strength applicable to the joints.
- Visual Observations: None
- Float Properties: Four

- Vector Observation space: 148 corresponding to local ray-casts detecting
switch, bricks, golden brick, and walls, plus variable indicating switch
state.
- - Vector Action space: (Discrete) 4 corresponding to agent rotation and
+ - Actions: 1 discrete action branch, with 4 actions corresponding to agent rotation and
forward/backward movement.
- Visual Observations (Optional): First-person camera per-agent. Use
`VisualPyramids` scene. **The visual observation version of this environment

docs/Learning-Environment-Executable.md (3)

Number of Visual Observations (per agent): 0
Vector Observation space size (per agent): 8
Number of stacked Vector Observation: 1
Vector Action space type: continuous
Vector Action space size (per agent): [2]
Vector Action descriptions: ,
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain Ball3DLearning:
batch_size: 64
beta: 0.001

docs/ML-Agents-Overview.md (4)

one in which opposing agents are equal in form, function and objective. Examples
of symmetric games are our Tennis and Soccer example environments. In
reinforcement learning, this means both agents have the same observation and
- action spaces and learn from the same reward function and so _they can share the
+ actions and learn from the same reward function and so _they can share the
- have the same observation or action spaces and so sharing policy networks is not
+ have the same observation or actions and so sharing policy networks is not
necessarily ideal.
With self-play, an agent learns in adversarial games by competing against fixed,

docs/Python-API.md (3)

name of the group the Agent belongs to and `agent_id` is the integer
identifier of the Agent. `action` is an `ActionTuple` as described above.
**Note:** If no action is provided for an agent group between two calls to
- `env.step()` then the default action will be all zeros (in either discrete or
- continuous action space)
+ `env.step()` then the default action will be all zeros.
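
For illustration, sending that all-zeros default explicitly for every agent of a behavior could look like the sketch below (assuming the `ActionTuple` constructor accepts `continuous` and `discrete` arrays, as in this release):

```
import numpy as np
from mlagents_envs.base_env import ActionTuple
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name=None)  # None connects to a running Editor
env.reset()
behavior_name = list(env.behavior_specs)[0]
action_spec = env.behavior_specs[behavior_name].action_spec

decision_steps, terminal_steps = env.get_steps(behavior_name)
n_agents = len(decision_steps)

# All-zeros action: (n_agents, continuous_size) floats for continuous actions
# and (n_agents, number_of_discrete_branches) ints for discrete actions.
zeros = ActionTuple(
    continuous=np.zeros((n_agents, action_spec.continuous_size), dtype=np.float32),
    discrete=np.zeros((n_agents, len(action_spec.discrete_branches)), dtype=np.int32),
)
env.set_actions(behavior_name, zeros)
env.step()
env.close()
```
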
#### DecisionSteps and DecisionStep

docs/Training-Configuration-File.md (4)

| `init_path` | (default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `threaded` | (default = `true`) By default, model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. There is usually no reason to turn `threaded` off for SAC. |
| `hyperparameters -> learning_rate` | (default = `3e-4`) Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |
- | `hyperparameters -> batch_size` | Number of experiences in each iteration of gradient descent. **This should always be multiple times smaller than `buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value should be smaller (in order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
+ | `hyperparameters -> batch_size` | Number of experiences in each iteration of gradient descent. **This should always be multiple times smaller than `buffer_size`**. If you are using continuous actions, this value should be large (on the order of 1000s). If you are using only discrete actions, this value should be smaller (on the order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
| `hyperparameters -> buffer_size` | (default = `10240` for PPO and `50000` for SAC)<br> **PPO:** Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be multiple times larger than `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. <br> **SAC:** The max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `hyperparameters -> learning_rate_schedule` | (default = `linear` for PPO and `constant` for SAC) Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` decays the learning_rate linearly, reaching 0 at max_steps, while `constant` keeps the learning rate constant for the entire training run. |
| `network_settings -> hidden_units` | (default = `128`) Number of units in the hidden layers of the neural network. Correspond to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger. <br><br> Typical range: `32` - `512` |

A few considerations when deciding to use memory:
- - LSTM does not work well with continuous vector actions. Please use
+ - LSTM does not work well with continuous actions. Please use
discrete actions for better results.
- Since the memories must be sent back and forth between Python and Unity, using
too large `memory_size` will slow down training.

ml-agents-envs/mlagents_envs/base_env.py (4)

since the last simulation step.
- agent_id is an int and an unique identifier for the corresponding Agent.
- action_mask is an optional list of one dimensional array of booleans.
- Only available in multi-discrete action space type.
+ Only available when using multi-discrete actions.
Each array corresponds to an action branch. Each array contains a mask
for each action of the branch. If true, the action is not available for
the agent during this simulation step.
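
For a single DecisionStep, a hypothetical helper that samples only unmasked actions for each branch might look like this (mask semantics as documented above: True means the action is unavailable):

```
import numpy as np

def sample_valid_actions(action_mask, rng=np.random):
    # action_mask: list with one 1-D boolean array per discrete branch;
    # True marks an action that is NOT available this step.
    actions = []
    for branch_mask in action_mask:
        valid = np.flatnonzero(~branch_mask)
        actions.append(rng.choice(valid))
    return np.array(actions, dtype=np.int32)

# Example: branch 0 has 3 actions with action 1 masked out;
# branch 1 has 2 actions, both available.
mask = [np.array([False, True, False]), np.array([False, False])]
print(sample_valid_actions(mask))
```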

identifier for the corresponding Agent. This is used to track Agents
across simulation steps.
- action_mask is an optional list of two dimensional array of booleans.
- Only available in multi-discrete action space type.
+ Only available when using multi-discrete actions.
Each array corresponds to an action branch. The first dimension of each
array is the batch size and the second contains a mask for each action of
the branch. If true, the action is not available for the agent during

ml-agents/mlagents/trainers/demo_loader.py (2)

# check action dimensions in demonstration match
if behavior_spec.action_spec != expected_behavior_spec.action_spec:
raise RuntimeError(
"The action spaces {} in demonstration do not match the policy's {}.".format(
"The actions {} in demonstration do not match the policy's {}.".format(
behavior_spec.action_spec, expected_behavior_spec.action_spec
)
)

ml-agents/mlagents/trainers/policy/tf_policy.py (3)

and self.behavior_spec.action_spec.discrete_size > 0
):
raise UnityPolicyException(
"TensorFlow does not support mixed action spaces. Please run with the Torch framework."
"TensorFlow does not support continuous and discrete actions on the same behavior. "
"Please run with the Torch framework."
)
# for ghost trainer save/load snapshots
self.assign_phs: List[tf.Tensor] = []

ml-agents/mlagents/trainers/policy/torch_policy.py (2)

"""
Policy that uses a multilayer perceptron to map the observations to actions. Could
also use a CNN to encode visual input prior to the MLP. Supports discrete and
- continuous action spaces, as well as recurrent networks.
+ continuous actions, as well as recurrent networks.
:param seed: Random seed.
:param behavior_spec: Assigned BehaviorSpec object.
:param trainer_settings: Defined training parameters.

ml-agents/mlagents/trainers/tests/mock_brain.py (3)

:int num_agents: Number of "agents" to imitate.
:List observation_shapes: A List of the observation spaces in your steps
- :int num_vector_acts: Number of actions in your action space
- :bool discrete: Whether or not action space is discrete
+ :int action_spec: ActionSpec for the agent
:bool done: Whether all the agents in the batch are done
"""
obs_list = []

docs/images/monitor.png (93)

(binary image; before/after preview not included)