
Action Docs part2 (#4739)

* reduce usage of "vector action" and "action space"

* more cleanup

* undo GettingStarted change for now

* batch size description

* Apply suggestions from code review

Co-authored-by: andrewcoh <54679309+andrewcoh@users.noreply.github.com>

Branch: MLA-1734-demo-provider
Commit: a0d1c829 (GitHub, 4 years ago)
13 files changed, 60 insertions(+), 163 deletions(-)
1. docs/Getting-Started.md (5)
2. docs/Learning-Environment-Design-Agents.md (59)
3. docs/Learning-Environment-Examples.md (38)
4. docs/Learning-Environment-Executable.md (3)
5. docs/ML-Agents-Overview.md (4)
6. docs/Python-API.md (3)
7. docs/Training-Configuration-File.md (4)
8. ml-agents-envs/mlagents_envs/base_env.py (4)
9. ml-agents/mlagents/trainers/demo_loader.py (2)
10. ml-agents/mlagents/trainers/policy/tf_policy.py (3)
11. ml-agents/mlagents/trainers/policy/torch_policy.py (2)
12. ml-agents/mlagents/trainers/tests/mock_brain.py (3)
13. docs/images/monitor.png (93)

docs/Getting-Started.md (5)

eight elements: the `x` and `z` components of the agent cube's rotation and the
`x`, `y`, and `z` components of the ball's relative position and velocity.
- #### Behavior Parameters : Vector Action Space
+ #### Behavior Parameters : Actions
+ An Agent is given instructions in the form of actions.
+ ML-Agents Toolkit classifies actions into two types: continuous and discrete.

Number of Visual Observations (per agent): 0
Vector Observation space size (per agent): 8
Number of stacked Vector Observation: 1
Vector Action space type: continuous
Vector Action space size (per agent): [2]
Vector Action descriptions: ,
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain 3DBallLearning:
batch_size: 64
beta: 0.001
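
The same summary can also be pulled from Python through the `mlagents_envs` low-level API. A minimal sketch, assuming a local 3DBall build (the executable path is a placeholder) and the `BehaviorSpec`/`ActionSpec` fields from roughly this release:

```
from mlagents_envs.environment import UnityEnvironment

# Connect to a 3DBall build; replace the path with your own executable.
env = UnityEnvironment(file_name="./3DBall")
env.reset()

# Each behavior advertises its observation shapes and an ActionSpec,
# which is where the continuous/discrete action sizes above come from.
for name, spec in env.behavior_specs.items():
    print(name)
    print("  observation shapes:      ", spec.observation_shapes)
    print("  continuous action size:  ", spec.action_spec.continuous_size)
    print("  discrete action branches:", spec.action_spec.discrete_branches)

env.close()
```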

docs/Learning-Environment-Design-Agents.md (59)

- [Raycast Observations](#raycast-observations)
- [RayCast Observation Summary & Best Practices](#raycast-observation-summary--best-practices)
- [Actions](#actions)
- - [Continuous Action Space](#continuous-action-space)
- - [Discrete Action Space](#discrete-action-space)
+ - [Continuous Actions](#continuous-actions)
+ - [Discrete Actions](#discrete-actions)
- [Masking Discrete Actions](#masking-discrete-actions)
- [Actions Summary & Best Practices](#actions-summary--best-practices)
- [Rewards](#rewards)

method calls `VectorSensor.AddObservation()` such that vector size adds up to 8,
the Behavior Parameters of the Agent are set with vector observation space
with a state size of 8.
- - `Agent.OnActionReceived()` — The vector action spaces result
+ - `Agent.OnActionReceived()` — The action results
in a small change in the agent cube's rotation at each step. In this example,
an Agent receives a small positive reward for each step it keeps the ball on the
agent cube's head and a larger, negative reward for dropping the ball. An

An action is an instruction from the Policy that the agent carries out. The
action is passed to the Agent as the `ActionBuffers` parameter when the Academy invokes the
- agent's `OnActionReceived()` function. There are two types of actions supported:
+ agent's `OnActionReceived()` function. There are two types of actions that an Agent can use:
**Continuous** and **Discrete**.
Neither the Policy nor the training algorithm know anything about what the

for an Agent is in the `OnActionReceived()` function.
For example, if you designed an agent to move in two dimensions, you could use
- either continuous or the discrete vector actions. In the continuous case, you
- would set the vector action size to two (one for each dimension), and the
- agent's Policy would create an action with two floating point values. In the
+ either continuous or the discrete actions. In the continuous case, you
+ would set the action size to two (one for each dimension), and the
+ agent's Policy would output an action with two floating point values. In the
- movement), and the Policy would create an action array containing two elements
- with values ranging from zero to one.
+ movement), and the Policy would output an action array containing two elements
+ with values ranging from zero to one. You could alternatively use a combination of continuous
+ and discrete actions e.g., using one continuous action for horizontal movement
+ and a discrete branch of size two for the vertical movement.
The [3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) and
[Area](Learning-Environment-Examples.md#push-block) example environments are set
up to use either the continuous or the discrete vector action spaces.
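
To make the sizes concrete, the designs described above map onto the Python-side `ActionSpec` roughly as follows; a minimal sketch, assuming the `ActionSpec` NamedTuple fields (`continuous_size`, `discrete_branches`) from this release:

```
from mlagents_envs.base_env import ActionSpec

# Continuous design: two floats, one per movement dimension.
continuous_spec = ActionSpec(continuous_size=2, discrete_branches=())

# Discrete design: two branches, one per movement dimension, each with two
# possible values (hence action elements ranging from zero to one).
discrete_spec = ActionSpec(continuous_size=0, discrete_branches=(2, 2))

# Mixed design: one continuous action for horizontal movement plus a
# discrete branch of size two for vertical movement.
hybrid_spec = ActionSpec(continuous_size=1, discrete_branches=(2,))

print(continuous_spec.continuous_size)  # 2
print(discrete_spec.discrete_branches)  # (2, 2)
print(hybrid_spec)
```
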
- ### Continuous Action Space
+ ### Continuous Actions
- is an array with length equal to the `Vector Action Space Size` property value. The
+ is an array with length equal to the `Continuous Action Size` property value. The
- The [Reacher example](Learning-Environment-Examples.md#reacher) defines a
- continuous action space with four control values.
+ The [Reacher example](Learning-Environment-Examples.md#reacher) uses
+ continuous actions with four control values.
![reacher](images/reacher.png)

```
By default the output from our provided PPO algorithm pre-clamps the values of
- `vectorAction` into the [-1, 1] range. It is a best practice to manually clip
+ `ActionBuffers.ContinuousActions` into the [-1, 1] range. It is a best practice to manually clip
- ### Discrete Action Space
+ ### Discrete Actions
- is an array of integers. When defining the discrete vector action space, `Branches`
+ is an array of integers with length equal to `Discrete Branch Size`. When defining the discrete actions, `Branches`
is an array of integers, each value corresponds to the number of possibilities for each branch.
For example, if we wanted an Agent that can move in a plane and jump, we could

### Actions Summary & Best Practices
- - Agents can either use `Discrete` or `Continuous` actions.
+ - Agents can use `Discrete` and/or `Continuous` actions.
- - In general, smaller action spaces will make for easier learning.
- - Be sure to set the Vector Action's Space Size to the number of used Vector
- Actions, and not greater, as doing the latter can interfere with the
+ - In general, fewer actions will make for easier learning.
+ - Be sure to set the Continuous Action Size and Discrete Branch Size to the desired
+ number for each type of action, and not greater, as doing the latter can interfere with the
efficiency of the training process.
- Continuous action values should be clipped to an
appropriate range. The provided PPO model automatically clips these values
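
The clipping advice above is also easy to apply on the Python side when you drive an environment or post-process policy output; a minimal NumPy sketch with illustrative values:

```
import numpy as np

# One row per agent, one column per continuous action; a policy's raw
# output can fall slightly outside the recommended [-1, 1] range.
actions = np.array([[1.3, -0.2],
                    [-1.7, 0.8]], dtype=np.float32)

# Clip every continuous value into [-1, 1] before it is used.
clipped = np.clip(actions, -1.0, 1.0)
print(clipped)  # values outside [-1, 1] become -1 or 1
```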

be stacked and used collectively for decision making. This results in the
effective size of the vector observation being passed to the Policy being:
_Space Size_ x _Stacked Vectors_.
- - `Vector Action`
- - `Space Type` - Corresponds to whether action vector contains a single
- integer (Discrete) or a series of real-valued floats (Continuous).
- - `Space Size` (Continuous) - Length of action vector.
- - `Branches` (Discrete) - An array of integers, defines multiple concurrent
- discrete actions. The values in the `Branches` array correspond to the
- number of possible discrete values for each action branch.
+ - `Actions`
+ - `Continuous Actions` - The number of concurrent continuous actions that
+ the Agent can take.
+ - `Discrete Branches` - An array of integers, defines multiple concurrent
+ discrete actions. The values in the `Discrete Branches` array correspond
+ to the number of possible discrete values for each action branch.
- `Model` - The neural network model used for inference (obtained after
training)
- `Inference Device` - Whether to use CPU or GPU to run the model during

docs/Learning-Environment-Examples.md (38)

- +1.0 for arriving at optimal state.
- Behavior Parameters:
- Vector Observation space: One variable corresponding to current state.
- - Vector Action space: (Discrete) Two possible actions (Move left, move
+ - Actions: 1 discrete action branch with 3 actions (Move left, do nothing, move
right).
- Visual Observations: None
- Float Properties: None

cube, and position and velocity of ball.
- Vector Observation space (Hard Version): 5 variables corresponding to
rotation of the agent cube and position of ball.
- - Vector Action space: (Continuous) Size of 2, with one value corresponding to
+ - Actions: 2 continuous actions, with one value corresponding to
X-rotation, and the other to Z-rotation.
- Visual Observations: Third-person view from the upper-front of the agent. Use
`Visual3DBall` scene.

- -1.0 if the agent navigates to an obstacle (episode ends).
- Behavior Parameters:
- Vector Observation space: None
- - Vector Action space: (Discrete) Size of 4, corresponding to movement in
- cardinal directions. Note that for this environment,
+ - Actions: 1 discrete action branch with 5 actions, corresponding to movement in
+ cardinal directions or not moving. Note that for this environment,
[action masking](Learning-Environment-Design-Agents.md#masking-discrete-actions)
is turned on by default (this option can be toggled using the `Mask Actions`
checkbox within the `trueAgent` GameObject). The trained model file provided

- Behavior Parameters:
- Vector Observation space: 9 variables corresponding to position, velocity
and orientation of ball and racket.
- - Vector Action space: (Continuous) Size of 3, corresponding to movement
+ - Actions: 3 continuous actions, corresponding to movement
toward net or away from net, jumping and rotation.
- Visual Observations: None
- Float Properties: Three

- Vector Observation space: (Continuous) 70 variables corresponding to 14
ray-casts each detecting one of three possible objects (wall, goal, or
block).
- - Vector Action space: (Discrete) Size of 6, corresponding to turn clockwise
- and counterclockwise and move along four different face directions.
+ - Actions: 1 discrete action branch with 7 actions, corresponding to turn clockwise
+ and counterclockwise, move along four different face directions, or do nothing.
- Visual Observations (Optional): One first-person camera. Use
`VisualPushBlock` scene. **The visual observation version of this
environment does not train with the provided default training parameters.**

- Vector Observation space: Size of 74, corresponding to 14 ray casts each
detecting 4 possible objects, plus the global position of the agent and
whether or not the agent is grounded.
- - Vector Action space: (Discrete) 4 Branches:
+ - Actions: 4 discrete action branches:
- Forward Motion (3 possible actions: Forward, Backwards, No Action)
- Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
- Side Motion (3 possible actions: Left, Right, No Action)

- Behavior Parameters:
- Vector Observation space: 26 variables corresponding to position, rotation,
velocity, and angular velocities of the two arm rigid bodies.
- - Vector Action space: (Continuous) Size of 4, corresponding to torque
+ - Actions: 4 continuous actions, corresponding to torque
applicable to two joints.
- Visual Observations: None.
- Float Properties: Five

- Vector Observation space: 172 variables corresponding to position, rotation,
velocity, and angular velocities of each limb plus the acceleration and
angular acceleration of the body.
- - Vector Action space: (Continuous) Size of 20, corresponding to target
+ - Actions: 20 continuous actions, corresponding to target
rotations for joints.
- Visual Observations: None
- Float Properties: None

- Vector Observation space: 64 variables corresponding to position, rotation,
velocity, and angular velocities of each limb plus the acceleration and
angular acceleration of the body.
- - Vector Action space: (Continuous) Size of 9, corresponding to target
+ - Actions: 9 continuous actions, corresponding to target
rotations for joints.
- Visual Observations: None
- Float Properties: None

agent is frozen and/or shot its laser (2), plus ray-based perception of
objects around agent's forward direction (49; 7 raycast angles with 7
measurements for each).
- - Vector Action space: (Discrete) 4 Branches:
+ - Actions: 4 discrete action branches:
- Forward Motion (3 possible actions: Forward, Backwards, No Action)
- Side Motion (3 possible actions: Left, Right, No Action)
- Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)

- Behavior Parameters:
- Vector Observation space: 30 corresponding to local ray-casts detecting
objects, goals, and walls.
- - Vector Action space: (Discrete) 1 Branch, 4 actions corresponding to agent
+ - Actions: 1 discrete action branch, with 4 actions corresponding to agent
rotation and forward/backward movement.
- Visual Observations (Optional): First-person view for the agent. Use
`VisualHallway` scene. **The visual observation version of this environment

- Behavior Parameters:
- Vector Observation space: 6 corresponding to local position of agent and
green cube.
- - Vector Action space: (Continuous) 3 corresponding to agent force applied for
+ - Actions: 3 continuous actions corresponding to agent force applied for
the jump.
- Visual Observations: None
- Float Properties: Two

degrees each detecting 6 possible object types, along with the object's
distance. The forward ray-casts contribute 264 state dimensions and backward
72 state dimensions over three observation stacks.
- - Vector Action space: (Discrete) Three branched actions corresponding to
+ - Actions: 3 discrete branched actions corresponding to
forward, backward, sideways movement, as well as rotation.
- Visual Observations: None
- Float Properties: Two

degrees each detecting 5 possible object types, along with the object's
distance. The forward ray-casts contribute 231 state dimensions and backward
63 state dimensions over three observation stacks.
- - Striker Vector Action space: (Discrete) Three branched actions corresponding
+ - Striker Actions: 3 discrete branched actions corresponding
- - Goalie Vector Action space: (Discrete) Three branched actions corresponding
+ - Goalie Actions: 3 discrete branched actions corresponding
to forward, backward, sideways movement, as well as rotation.
- Visual Observations: None
- Float Properties: Two

- Behavior Parameters:
- Vector Observation space: 243 variables corresponding to position, rotation,
velocity, and angular velocities of each limb, along with goal direction.
- - Vector Action space: (Continuous) Size of 39, corresponding to target
+ - Actions: 39 continuous actions, corresponding to target
rotations and strength applicable to the joints.
- Visual Observations: None
- Float Properties: Four

- Vector Observation space: 148 corresponding to local ray-casts detecting
switch, bricks, golden brick, and walls, plus variable indicating switch
state.
- - Vector Action space: (Discrete) 4 corresponding to agent rotation and
+ - Actions: 1 discrete action branch, with 4 actions corresponding to agent rotation and
forward/backward movement.
- Visual Observations (Optional): First-person camera per-agent. Use
`VisualPyramids` scene. **The visual observation version of this environment

docs/Learning-Environment-Executable.md (3)

Number of Visual Observations (per agent): 0
Vector Observation space size (per agent): 8
Number of stacked Vector Observation: 1
Vector Action space type: continuous
Vector Action space size (per agent): [2]
Vector Action descriptions: ,
INFO:mlagents_envs:Hyperparameters for the PPO Trainer of brain Ball3DLearning:
batch_size: 64
beta: 0.001

docs/ML-Agents-Overview.md (4)

one in which opposing agents are equal in form, function and objective. Examples
of symmetric games are our Tennis and Soccer example environments. In
reinforcement learning, this means both agents have the same observation and
- action spaces and learn from the same reward function and so _they can share the
+ actions and learn from the same reward function and so _they can share the
- have the same observation or action spaces and so sharing policy networks is not
+ have the same observation or actions and so sharing policy networks is not
necessarily ideal.
With self-play, an agent learns in adversarial games by competing against fixed,

docs/Python-API.md (3)

name of the group the Agent belongs to and `agent_id` is the integer
identifier of the Agent. `action` is an `ActionTuple` as described above.
**Note:** If no action is provided for an agent group between two calls to
- `env.step()` then the default action will be all zeros (in either discrete or
- continuous action space)
+ `env.step()` then the default action will be all zeros.
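
For illustration, sending that all-zeros default explicitly for every agent of a behavior could look like the sketch below (assuming the `ActionTuple` constructor accepts `continuous` and `discrete` arrays, as in this release):

```
import numpy as np
from mlagents_envs.base_env import ActionTuple
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name=None)  # None connects to a running Editor
env.reset()
behavior_name = list(env.behavior_specs)[0]
action_spec = env.behavior_specs[behavior_name].action_spec

decision_steps, terminal_steps = env.get_steps(behavior_name)
n_agents = len(decision_steps)

# All-zeros action: (n_agents, continuous_size) floats for continuous actions
# and (n_agents, number_of_discrete_branches) ints for discrete actions.
zeros = ActionTuple(
    continuous=np.zeros((n_agents, action_spec.continuous_size), dtype=np.float32),
    discrete=np.zeros((n_agents, len(action_spec.discrete_branches)), dtype=np.int32),
)
env.set_actions(behavior_name, zeros)
env.step()
env.close()
```
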
#### DecisionSteps and DecisionStep

docs/Training-Configuration-File.md (4)

| `init_path` | (default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `threaded` | (default = `true`) By default, model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. There is usually no reason to turn `threaded` off for SAC. |
| `hyperparameters -> learning_rate` | (default = `3e-4`) Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |
- | `hyperparameters -> batch_size` | Number of experiences in each iteration of gradient descent. **This should always be multiple times smaller than `buffer_size`**. If you are using a continuous action space, this value should be large (in the order of 1000s). If you are using a discrete action space, this value should be smaller (in order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
+ | `hyperparameters -> batch_size` | Number of experiences in each iteration of gradient descent. **This should always be multiple times smaller than `buffer_size`**. If you are using continuous actions, this value should be large (on the order of 1000s). If you are using only discrete actions, this value should be smaller (on the order of 10s). <br><br> Typical range: (Continuous - PPO): `512` - `5120`; (Continuous - SAC): `128` - `1024`; (Discrete, PPO & SAC): `32` - `512`. |
| `hyperparameters -> buffer_size` | (default = `10240` for PPO and `50000` for SAC)<br> **PPO:** Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be multiple times larger than `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. <br> **SAC:** The max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `hyperparameters -> learning_rate_schedule` | (default = `linear` for PPO and `constant` for SAC) Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` decays the learning_rate linearly, reaching 0 at max_steps, while `constant` keeps the learning rate constant for the entire training run. |
| `network_settings -> hidden_units` | (default = `128`) Number of units in the hidden layers of the neural network. Correspond to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger. <br><br> Typical range: `32` - `512` |

A few considerations when deciding to use memory:
- - LSTM does not work well with continuous vector actions. Please use
+ - LSTM does not work well with continuous actions. Please use
discrete actions for better results.
- Since the memories must be sent back and forth between Python and Unity, using
too large `memory_size` will slow down training.

ml-agents-envs/mlagents_envs/base_env.py (4)

since the last simulation step.
- agent_id is an int and an unique identifier for the corresponding Agent.
- action_mask is an optional list of one dimensional array of booleans.
- Only available in multi-discrete action space type.
+ Only available when using multi-discrete actions.
Each array corresponds to an action branch. Each array contains a mask
for each action of the branch. If true, the action is not available for
the agent during this simulation step.
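
For a single DecisionStep, a hypothetical helper that samples only unmasked actions for each branch might look like this (mask semantics as documented above: True means the action is unavailable):

```
import numpy as np

def sample_valid_actions(action_mask, rng=np.random):
    # action_mask: list with one 1-D boolean array per discrete branch;
    # True marks an action that is NOT available this step.
    actions = []
    for branch_mask in action_mask:
        valid = np.flatnonzero(~branch_mask)
        actions.append(rng.choice(valid))
    return np.array(actions, dtype=np.int32)

# Example: branch 0 has 3 actions with action 1 masked out;
# branch 1 has 2 actions, both available.
mask = [np.array([False, True, False]), np.array([False, False])]
print(sample_valid_actions(mask))
```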

identifier for the corresponding Agent. This is used to track Agents
across simulation steps.
- action_mask is an optional list of two dimensional array of booleans.
- Only available in multi-discrete action space type.
+ Only available when using multi-discrete actions.
Each array corresponds to an action branch. The first dimension of each
array is the batch size and the second contains a mask for each action of
the branch. If true, the action is not available for the agent during

ml-agents/mlagents/trainers/demo_loader.py (2)

# check action dimensions in demonstration match
if behavior_spec.action_spec != expected_behavior_spec.action_spec:
raise RuntimeError(
"The action spaces {} in demonstration do not match the policy's {}.".format(
"The actions {} in demonstration do not match the policy's {}.".format(
behavior_spec.action_spec, expected_behavior_spec.action_spec
)
)

ml-agents/mlagents/trainers/policy/tf_policy.py (3)

and self.behavior_spec.action_spec.discrete_size > 0
):
raise UnityPolicyException(
"TensorFlow does not support mixed action spaces. Please run with the Torch framework."
"TensorFlow does not support continuous and discrete actions on the same behavior. "
"Please run with the Torch framework."
)
# for ghost trainer save/load snapshots
self.assign_phs: List[tf.Tensor] = []

ml-agents/mlagents/trainers/policy/torch_policy.py (2)

"""
Policy that uses a multilayer perceptron to map the observations to actions. Could
also use a CNN to encode visual input prior to the MLP. Supports discrete and
- continuous action spaces, as well as recurrent networks.
+ continuous actions, as well as recurrent networks.
:param seed: Random seed.
:param behavior_spec: Assigned BehaviorSpec object.
:param trainer_settings: Defined training parameters.

ml-agents/mlagents/trainers/tests/mock_brain.py (3)

:int num_agents: Number of "agents" to imitate.
:List observation_shapes: A List of the observation spaces in your steps
- :int num_vector_acts: Number of actions in your action space
- :bool discrete: Whether or not action space is discrete
+ :int action_spec: ActionSpec for the agent
:bool done: Whether all the agents in the batch are done
"""
obs_list = []

docs/images/monitor.png (93)

(binary image; before/after preview not included)