浏览代码

Merge branch 'develop-hybrid-actions-singleton' into develop-hybrid-actions-csharp

/MLA-1734-demo-provider
Ruo-Ping Dong 4 年前
当前提交
8ed14762
共有 120 个文件被更改,包括 1530 次插入和 1287 次删除
  1. 2
      .github/ISSUE_TEMPLATE/bug_report.md
  2. 4
      .github/workflows/pytest.yml
  3. 11
      .yamato/test_versions.metafile
  4. 2
      .yamato/training-int-tests.yml
  5. 22
      Project/Assets/ML-Agents/Examples/SharedAssets/Scripts/ModelOverrider.cs
  6. 8
      README.md
  7. 11
      com.unity.ml-agents/CHANGELOG.md
  8. 27
      com.unity.ml-agents/Runtime/Academy.cs
  9. 24
      com.unity.ml-agents/Runtime/Agent.cs
  10. 56
      com.unity.ml-agents/Tests/Editor/MLAgentsEditModeTest.cs
  11. 2
      com.unity.ml-agents/package.json
  12. 2
      docs/Background-Machine-Learning.md
  13. 10
      docs/Getting-Started.md
  14. 24
      docs/Installation.md
  15. 4
      docs/Learning-Environment-Executable.md
  16. 8
      docs/ML-Agents-Overview.md
  17. 2
      docs/Readme.md
  18. 2
      docs/Training-Configuration-File.md
  19. 35
      docs/Training-ML-Agents.md
  20. 2
      docs/Training-on-Amazon-Web-Service.md
  21. 5
      docs/Unity-Inference-Engine.md
  22. 21
      gym-unity/gym_unity/envs/__init__.py
  23. 8
      gym-unity/gym_unity/tests/test_gym.py
  24. 206
      ml-agents-envs/mlagents_envs/base_env.py
  25. 56
      ml-agents-envs/mlagents_envs/environment.py
  26. 28
      ml-agents-envs/mlagents_envs/rpc_utils.py
  27. 19
      ml-agents-envs/mlagents_envs/tests/test_envs.py
  28. 30
      ml-agents-envs/mlagents_envs/tests/test_rpc_utils.py
  29. 71
      ml-agents-envs/mlagents_envs/tests/test_steps.py
  30. 1
      ml-agents/mlagents/tf_utils/__init__.py
  31. 63
      ml-agents/mlagents/tf_utils/tf.py
  32. 1
      ml-agents/mlagents/torch_utils/__init__.py
  33. 66
      ml-agents/mlagents/torch_utils/torch.py
  34. 12
      ml-agents/mlagents/trainers/agent_processor.py
  35. 2
      ml-agents/mlagents/trainers/buffer.py
  36. 11
      ml-agents/mlagents/trainers/cli_utils.py
  37. 23
      ml-agents/mlagents/trainers/demo_loader.py
  38. 13
      ml-agents/mlagents/trainers/env_manager.py
  39. 16
      ml-agents/mlagents/trainers/ghost/trainer.py
  40. 6
      ml-agents/mlagents/trainers/learn.py
  41. 4
      ml-agents/mlagents/trainers/optimizer/tf_optimizer.py
  42. 29
      ml-agents/mlagents/trainers/policy/policy.py
  43. 19
      ml-agents/mlagents/trainers/policy/tf_policy.py
  44. 58
      ml-agents/mlagents/trainers/policy/torch_policy.py
  45. 9
      ml-agents/mlagents/trainers/ppo/optimizer_tf.py
  46. 17
      ml-agents/mlagents/trainers/ppo/optimizer_torch.py
  47. 41
      ml-agents/mlagents/trainers/ppo/trainer.py
  48. 6
      ml-agents/mlagents/trainers/sac/optimizer_tf.py
  49. 71
      ml-agents/mlagents/trainers/sac/optimizer_torch.py
  50. 61
      ml-agents/mlagents/trainers/sac/trainer.py
  51. 2
      ml-agents/mlagents/trainers/settings.py
  52. 3
      ml-agents/mlagents/trainers/simple_env_manager.py
  53. 86
      ml-agents/mlagents/trainers/stats.py
  54. 5
      ml-agents/mlagents/trainers/subprocess_env_manager.py
  55. 71
      ml-agents/mlagents/trainers/tests/mock_brain.py
  56. 76
      ml-agents/mlagents/trainers/tests/simple_test_envs.py
  57. 46
      ml-agents/mlagents/trainers/tests/tensorflow/test_ghost.py
  58. 5
      ml-agents/mlagents/trainers/tests/tensorflow/test_models.py
  59. 8
      ml-agents/mlagents/trainers/tests/tensorflow/test_nn_policy.py
  60. 12
      ml-agents/mlagents/trainers/tests/tensorflow/test_ppo.py
  61. 9
      ml-agents/mlagents/trainers/tests/tensorflow/test_sac.py
  62. 4
      ml-agents/mlagents/trainers/tests/tensorflow/test_saver.py
  63. 116
      ml-agents/mlagents/trainers/tests/tensorflow/test_simple_rl.py
  64. 27
      ml-agents/mlagents/trainers/tests/tensorflow/test_tf_policy.py
  65. 45
      ml-agents/mlagents/trainers/tests/test_agent_processor.py
  66. 10
      ml-agents/mlagents/trainers/tests/test_demo_loader.py
  67. 6
      ml-agents/mlagents/trainers/tests/test_rl_trainer.py
  68. 20
      ml-agents/mlagents/trainers/tests/test_stats.py
  69. 2
      ml-agents/mlagents/trainers/tests/test_subprocess_env_manager.py
  70. 6
      ml-agents/mlagents/trainers/tests/test_trajectory.py
  71. 9
      ml-agents/mlagents/trainers/tests/torch/saver/test_saver.py
  72. 46
      ml-agents/mlagents/trainers/tests/torch/test_ghost.py
  73. 77
      ml-agents/mlagents/trainers/tests/torch/test_networks.py
  74. 39
      ml-agents/mlagents/trainers/tests/torch/test_policy.py
  75. 33
      ml-agents/mlagents/trainers/tests/torch/test_ppo.py
  76. 34
      ml-agents/mlagents/trainers/tests/torch/test_reward_providers/test_curiosity.py
  77. 18
      ml-agents/mlagents/trainers/tests/torch/test_reward_providers/test_extrinsic.py
  78. 29
      ml-agents/mlagents/trainers/tests/torch/test_reward_providers/test_gail.py
  79. 25
      ml-agents/mlagents/trainers/tests/torch/test_reward_providers/test_rnd.py
  80. 17
      ml-agents/mlagents/trainers/tests/torch/test_reward_providers/utils.py
  81. 122
      ml-agents/mlagents/trainers/tests/torch/test_simple_rl.py
  82. 44
      ml-agents/mlagents/trainers/tests/torch/test_utils.py
  83. 2
      ml-agents/mlagents/trainers/tf/components/bc/model.py
  84. 10
      ml-agents/mlagents/trainers/tf/components/bc/module.py
  85. 2
      ml-agents/mlagents/trainers/tf/components/reward_signals/curiosity/model.py
  86. 10
      ml-agents/mlagents/trainers/tf/components/reward_signals/curiosity/signal.py
  87. 2
      ml-agents/mlagents/trainers/tf/components/reward_signals/gail/model.py
  88. 17
      ml-agents/mlagents/trainers/tf/components/reward_signals/gail/signal.py
  89. 4
      ml-agents/mlagents/trainers/tf/model_serialization.py
  90. 48
      ml-agents/mlagents/trainers/torch/components/bc/module.py
  91. 80
      ml-agents/mlagents/trainers/torch/components/reward_providers/curiosity_reward_provider.py
  92. 9
      ml-agents/mlagents/trainers/torch/components/reward_providers/gail_reward_provider.py
  93. 1
      ml-agents/mlagents/trainers/torch/components/reward_providers/rnd_reward_provider.py
  94. 26
      ml-agents/mlagents/trainers/torch/distributions.py
  95. 4
      ml-agents/mlagents/trainers/torch/model_serialization.py
  96. 207
      ml-agents/mlagents/trainers/torch/networks.py
  97. 50
      ml-agents/mlagents/trainers/torch/utils.py
  98. 34
      ml-agents/mlagents/trainers/trainer/rl_trainer.py
  99. 18
      ml-agents/mlagents/trainers/trainer/trainer_factory.py
  100. 10
      ml-agents/mlagents/trainers/trainer_controller.py

2
.github/ISSUE_TEMPLATE/bug_report.md


- Unity Version: [e.g. Unity 2020.1f1]
- OS + version: [e.g. Windows 10]
- _ML-Agents version_: (e.g. ML-Agents v0.8, or latest `develop` branch from source)
- _TensorFlow version_: (you can run `pip3 show tensorflow` to get this)
- _Torch version_: (you can run `pip3 show torch` to get this)
- _Environment_: (which example environment you used to reproduce the error)
**NOTE:** We are unable to help reproduce bugs with custom environments. Please attempt to reproduce your issue with one of the example environments, or provide a minimal patch to one of the environments needed to reproduce the issue.

4
.github/workflows/pytest.yml


python -m pip install --progress-bar=off -r test_requirements.txt -c ${{ matrix.pip_constraints }}
python -m pip install --progress-bar=off -e ./gym-unity -c ${{ matrix.pip_constraints }}
- name: Save python dependencies
run: pip freeze > pip_versions-${{ matrix.python-version }}.txt
run: |
pip freeze > pip_versions-${{ matrix.python-version }}.txt
cat pip_versions-${{ matrix.python-version }}.txt
- name: Run pytest
run: pytest --cov=ml-agents --cov=ml-agents-envs --cov=gym-unity --cov-report html --junitxml=junit/test-results-${{ matrix.python-version }}.xml -p no:warnings
- name: Upload pytest test results

11
.yamato/test_versions.metafile


# List of editor versions for standalone-build-test and its dependencies.
# csharp_backcompat_version is used in training-int-tests to determine the
# older package version to run the backwards compat tests against.
csharp_backcompat_version: 1.0.0
csharp_backcompat_version: 1.0.0
# Waiting on a barracuda fix, see https://jira.unity3d.com/browse/MLA-1464
# - version: 2020.2
csharp_backcompat_version: 1.0.0
- version: 2020.2
# 2020.2 moved the AssetImporters namespace
# but we didn't handle this until 1.2.0
csharp_backcompat_version: 1.2.0

2
.yamato/training-int-tests.yml


# If we make a breaking change to the communication protocol, these will need
# to be disabled until the next release.
- python -u -m ml-agents.tests.yamato.training_int_tests --python=0.16.0
- python -u -m ml-agents.tests.yamato.training_int_tests --csharp=1.0.0
- python -u -m ml-agents.tests.yamato.training_int_tests --csharp={{ editor.csharp_backcompat_version }}
dependencies:
- .yamato/standalone-build-test.yml#test_mac_standalone_{{ editor.version }}
triggers:

22
Project/Assets/ML-Agents/Examples/SharedAssets/Scripts/ModelOverrider.cs


var bp = m_Agent.GetComponent<BehaviorParameters>();
var behaviorName = bp.BehaviorName;
var nnModel = GetModelForBehaviorName(behaviorName);
NNModel nnModel = null;
try
{
nnModel = GetModelForBehaviorName(behaviorName);
}
catch (Exception e)
{
overrideError = $"Exception calling GetModelForBehaviorName: {e}";
}
overrideError =
$"Didn't find a model for behaviorName {behaviorName}. Make " +
$"sure the behaviorName is set correctly in the commandline " +
$"and that the model file exists";
if (string.IsNullOrEmpty(overrideError))
{
overrideError =
$"Didn't find a model for behaviorName {behaviorName}. Make " +
"sure the behaviorName is set correctly in the commandline " +
"and that the model file exists";
}
}
else
{

8
README.md


**The Unity Machine Learning Agents Toolkit** (ML-Agents) is an open-source
project that enables games and simulations to serve as environments for
training intelligent agents. Agents can be trained using reinforcement learning,
imitation learning, neuroevolution, or other machine learning methods through a
simple-to-use Python API. We also provide implementations (based on TensorFlow)
training intelligent agents. We provide implementations (based on PyTorch)
train intelligent agents for 2D, 3D and VR/AR games. These trained agents can be
train intelligent agents for 2D, 3D and VR/AR games. Researchers can also use the
provided simple-to-use Python API to train Agents using reinforcement learning,
imitation learning, neuroevolution, or any other methods. These trained agents can be
used for multiple purposes, including controlling NPC behavior (in a variety of
settings such as multi-agent and adversarial), automated testing of game builds
and evaluating different game design decisions pre-release. The ML-Agents

11
com.unity.ml-agents/CHANGELOG.md


### Major Changes
#### com.unity.ml-agents (C#)
#### ml-agents / ml-agents-envs / gym-unity (Python)
- PyTorch trainers are now the default. See the
[installation docs](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) for
more information on installing PyTorch. For the time being, TensorFlow is still available;
you can use the TensorFlow backend by adding `--tensorflow` to the CLI, or
adding `framework: tensorflow` in the configuration YAML. (#4517)
- The Barracuda dependency was upgraded to 1.1.2 (#4571)
- The `action_probs` node is no longer listed as an output in TensorFlow models (#4613).
- `Agent.CollectObservations()` and `Agent.EndEpisode()` will now throw an exception
if they are called recursively (for example, if they call `Agent.EndEpisode()`).
Previously, this would result in an infinite loop and cause the editor to hang. (#4573)
- Fixed an issue where runs could not be resumed when using TensorFlow and Ghost Training. (#4593)
## [1.5.0-preview] - 2020-10-14

27
com.unity.ml-agents/Runtime/Academy.cs


// Flag used to keep track of the first time the Academy is reset.
bool m_HadFirstReset;
// Whether the Academy is in the middle of a step. This is used to detect an Academy
// step called by user code that is also called by the Academy.
bool m_IsStepping;
// Detect an Academy step called by user code that is also called by the Academy.
private RecursionChecker m_StepRecursionChecker = new RecursionChecker("EnvironmentStep");
// Random seed used for inference.
int m_InferenceSeed;

/// </summary>
public void EnvironmentStep()
{
// Check whether we're already in the middle of a step.
// This shouldn't happen generally, but could happen if user code (e.g. CollectObservations)
// that is called by EnvironmentStep() also calls EnvironmentStep(). This would result
// in an infinite loop and/or stack overflow, so stop it before it happens.
if (m_IsStepping)
{
throw new UnityAgentsException(
"Academy.EnvironmentStep() called recursively. " +
"This might happen if you call EnvironmentStep() from custom code such as " +
"CollectObservations() or OnActionReceived()."
);
}
m_IsStepping = true;
try
using (m_StepRecursionChecker.Start())
{
if (!m_HadFirstReset)
{

{
AgentAct?.Invoke();
}
}
finally
{
// Reset m_IsStepping when we're done (or if an exception occurred).
m_IsStepping = false;
}
}

24
com.unity.ml-agents/Runtime/Agent.cs


/// </summary>
internal VectorSensor collectObservationsSensor;
private RecursionChecker m_CollectObservationsChecker = new RecursionChecker("CollectObservations");
private RecursionChecker m_OnEpisodeBeginChecker = new RecursionChecker("OnEpisodeBegin");
/// <summary>
/// List of IActuators that this Agent will delegate actions to if any exist.
/// </summary>

// episode when initializing until after the Academy had its first reset.
if (Academy.Instance.TotalStepCount != 0)
{
OnEpisodeBegin();
using (m_OnEpisodeBeginChecker.Start())
{
OnEpisodeBegin();
}
}
}

{
// Make sure the latest observations are being passed to training.
collectObservationsSensor.Reset();
CollectObservations(collectObservationsSensor);
using (m_CollectObservationsChecker.Start())
{
CollectObservations(collectObservationsSensor);
}
}
// Request the last decision with no callbacks
// We request a decision so Python knows the Agent is done immediately

UpdateSensors();
using (TimerStack.Instance.Scoped("CollectObservations"))
{
CollectObservations(collectObservationsSensor);
using (m_CollectObservationsChecker.Start())
{
CollectObservations(collectObservationsSensor);
}
}
using (TimerStack.Instance.Scoped("CollectDiscreteActionMasks"))
{

{
ResetData();
m_StepCount = 0;
OnEpisodeBegin();
using (m_OnEpisodeBeginChecker.Start())
{
OnEpisodeBegin();
}
}
/// <summary>

56
com.unity.ml-agents/Tests/Editor/MLAgentsEditModeTest.cs


}
}
}
// Tests that recursive calls into the Agent/Academy step loop (e.g. calling
// EndEpisode() from inside a callback invoked by the step) throw a
// UnityAgentsException instead of hanging the editor in an infinite loop.
[TestFixture]
public class AgentRecursionTests
{
// Dispose any Academy singleton left over from a previous test so each
// test starts from a clean, uninitialized state.
[SetUp]
public void SetUp()
{
if (Academy.IsInitialized)
{
Academy.Instance.Dispose();
}
}
// Agent that illegally calls EndEpisode() from inside CollectObservations(),
// which is itself invoked during the Academy step.
class CollectObsEndEpisodeAgent : Agent
{
public override void CollectObservations(VectorSensor sensor)
{
// NEVER DO THIS IN REAL CODE!
EndEpisode();
}
}
// Agent that illegally calls EndEpisode() from inside OnEpisodeBegin(),
// which would otherwise recurse (EndEpisode triggers a new episode begin).
class OnEpisodeBeginEndEpisodeAgent : Agent
{
public override void OnEpisodeBegin()
{
// NEVER DO THIS IN REAL CODE!
EndEpisode();
}
}
// Shared driver: create an agent of the given misbehaving type, request a
// decision, and assert that stepping the Academy throws rather than loops.
void TestRecursiveThrows<T>() where T : Agent
{
var gameObj = new GameObject();
var agent = gameObj.AddComponent<T>();
agent.LazyInitialize();
agent.RequestDecision();
Assert.Throws<UnityAgentsException>(() =>
{
Academy.Instance.EnvironmentStep();
});
}
[Test]
public void TestRecursiveCollectObsEndEpisodeThrows()
{
TestRecursiveThrows<CollectObsEndEpisodeAgent>();
}
[Test]
public void TestRecursiveOnEpisodeBeginEndEpisodeThrows()
{
TestRecursiveThrows<OnEpisodeBeginEndEpisodeAgent>();
}
}
}

2
com.unity.ml-agents/package.json


"unity": "2018.4",
"description": "Use state-of-the-art machine learning to create intelligent character behaviors in any Unity environment (games, robotics, film, etc.).",
"dependencies": {
"com.unity.barracuda": "1.1.1-preview",
"com.unity.barracuda": "1.1.2-preview",
"com.unity.modules.imageconversion": "1.0.0",
"com.unity.modules.jsonserialize": "1.0.0",
"com.unity.modules.physics": "1.0.0",

2
docs/Background-Machine-Learning.md


one where the number of observations an agent perceives and the number of
actions they can take are large). Many of the algorithms we provide in ML-Agents
use some form of deep learning, built on top of the open-source library,
[TensorFlow](Background-TensorFlow.md).
[PyTorch](Background-PyTorch.md).

10
docs/Getting-Started.md


## Running a pre-trained model
We include pre-trained models for our agents (`.nn` files) and we use the
We include pre-trained models for our agents (`.onnx` files) and we use the
[Unity Inference Engine](Unity-Inference-Engine.md) to run these models inside
Unity. In this section, we will use the pre-trained model for the 3D Ball
example.

## Training a new model with Reinforcement Learning
While we provide pre-trained `.nn` files for the agents in this environment, any
While we provide pre-trained models for the agents in this environment, any
environment you make yourself will require training agents from scratch to
generate a new model file. In this section we will demonstrate how to use the
reinforcement learning algorithms that are part of the ML-Agents Python package

use it with compatible Agents (the Agents that generated the model). **Note:**
Do not just close the Unity Window once the `Saved Model` message appears.
Either wait for the training process to close the window or press `Ctrl+C` at
the command-line prompt. If you close the window manually, the `.nn` file
the command-line prompt. If you close the window manually, the `.onnx` file
containing the trained model is not exported into the ml-agents folder.
If you've quit the training early using `Ctrl+C` and want to resume training,

mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun --resume
```
Your trained model will be at `results/<run-identifier>/<behavior_name>.nn` where
Your trained model will be at `results/<run-identifier>/<behavior_name>.onnx` where
`<behavior_name>` is the name of the `Behavior Name` of the agents corresponding
to the model. This file corresponds to your model's latest checkpoint. You can
now embed this trained model into your Agents by following the steps below,

`Project/Assets/ML-Agents/Examples/3DBall/TFModels/`.
1. Open the Unity Editor, and select the **3DBall** scene as described above.
1. Select the **3DBall** prefab Agent object.
1. Drag the `<behavior_name>.nn` file from the Project window of the Editor to
1. Drag the `<behavior_name>.onnx` file from the Project window of the Editor to
the **Model** placeholder in the **Ball3DAgent** inspector window.
1. Press the **Play** button at the top of the Editor.

24
docs/Installation.md


[instructions](https://packaging.python.org/guides/installing-using-linux-tools/#installing-pip-setuptools-wheel-with-linux-package-managers)
on installing it.
Although we do not provide support for Anaconda installation on Windows, the
previous
[Windows Anaconda Installation (Deprecated) guide](Installation-Anaconda-Windows.md)
is still available.
### Clone the ML-Agents Toolkit Repository (Optional)
Now that you have installed Unity and Python, you can now install the Unity and

dependencies for each project and are supported on Mac / Windows / Linux. We
offer a dedicated [guide on Virtual Environments](Using-Virtual-Environment.md).
#### (Windows) Installing PyTorch
On Windows, you'll have to install the PyTorch package separately prior to
installing ML-Agents. Activate your virtual environment and run from the command line:
```sh
pip3 install torch==1.7.0 -f https://download.pytorch.org/whl/torch_stable.html
```
Note that on Windows, you may also need Microsoft's
[Visual C++ Redistributable](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads)
if you don't have it already. See the [PyTorch installation guide](https://pytorch.org/get-started/locally/)
for more installation options and versions.
#### Installing `mlagents`
To install the `mlagents` Python package, activate your virtual environment and
run from the command line:

By installing the `mlagents` package, the dependencies listed in the
[setup.py file](../ml-agents/setup.py) are also installed. These include
[TensorFlow](Background-TensorFlow.md) (Requires a CPU w/ AVX support).
[PyTorch](Background-PyTorch.md) (Requires a CPU w/ AVX support).
#### Advanced: Local Installation for Development

the repository's root directory, run:
```sh
pip3 install torch -f https://download.pytorch.org/whl/torch_stable.html
pip3 install -e ./ml-agents-envs
pip3 install -e ./ml-agents
```

4
docs/Learning-Environment-Executable.md


```
You can press Ctrl+C to stop the training, and your trained model will be at
`results/<run-identifier>/<behavior_name>.nn`, which corresponds to your model's
`results/<run-identifier>/<behavior_name>.onnx`, which corresponds to your model's
latest checkpoint. (**Note:** There is a known bug on Windows that causes the
saving of the model to fail when you early terminate the training, it's
recommended to wait until Step has reached the max_steps parameter you set in

`Project/Assets/ML-Agents/Examples/3DBall/TFModels/`.
1. Open the Unity Editor, and select the **3DBall** scene as described above.
1. Select the **3DBall** prefab from the Project window and select **Agent**.
1. Drag the `<behavior_name>.nn` file from the Project window of the Editor to
1. Drag the `<behavior_name>.onnx` file from the Project window of the Editor to
the **Model** placeholder in the **Ball3DAgent** inspector window.
1. Press the **Play** button at the top of the Editor.

8
docs/ML-Agents-Overview.md


for training intelligent agents. Agents can be trained using reinforcement
learning, imitation learning, neuroevolution, or other machine learning methods
through a simple-to-use Python API. We also provide implementations (based on
TensorFlow) of state-of-the-art algorithms to enable game developers and
PyTorch) of state-of-the-art algorithms to enable game developers and
hobbyists to easily train intelligent agents for 2D, 3D and VR/AR games. These
trained agents can be used for multiple purposes, including controlling NPC
behavior (in a variety of settings such as multi-agent and adversarial),

that include overviews and helpful resources on the
[Unity Engine](Background-Unity.md),
[machine learning](Background-Machine-Learning.md) and
[TensorFlow](Background-TensorFlow.md). We **strongly** recommend browsing the
[PyTorch](Background-PyTorch.md). We **strongly** recommend browsing the
machine learning concepts or have not previously heard of TensorFlow.
machine learning concepts or have not previously heard of PyTorch.
The remainder of this page contains a deep dive into ML-Agents, its key
components, different training modes and scenarios. By the end of it, you should

### Custom Training and Inference
In the previous mode, the Agents were used for training to generate a TensorFlow
In the previous mode, the Agents were used for training to generate a PyTorch
model that the Agents can later use. However, any user of the ML-Agents Toolkit
can leverage their own algorithms for training. In this case, the behaviors of
all the Agents in the scene will be controlled within Python. You can even turn

2
docs/Readme.md


- [ML-Agents Toolkit Overview](ML-Agents-Overview.md)
- [Background: Unity](Background-Unity.md)
- [Background: Machine Learning](Background-Machine-Learning.md)
- [Background: TensorFlow](Background-TensorFlow.md)
- [Background: PyTorch](Background-PyTorch.md)
- [Example Environments](Learning-Environment-Examples.md)
## Creating Learning Environments

2
docs/Training-Configuration-File.md


| `time_horizon` | (default = `64`) How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `max_steps` | (default = `500000`) Total number of steps (i.e., observation collected and action taken) that must be taken in the environment (or across all environments if using multiple in parallel) before ending the training process. If you have multiple agents with the same behavior name within your environment, all steps taken by those agents will contribute to the same `max_steps` count. <br><br>Typical range: `5e5` - `1e7` |
| `keep_checkpoints` | (default = `5`) The maximum number of model checkpoints to keep. Checkpoints are saved after the number of steps specified by the checkpoint_interval option. Once the maximum number of checkpoints has been reached, the oldest checkpoint is deleted when saving a new checkpoint. |
| `checkpoint_interval` | (default = `500000`) The number of experiences collected between each checkpoint by the trainer. A maximum of `keep_checkpoints` checkpoints are saved before old ones are deleted. Each checkpoint saves the `.nn` (and `.onnx` if applicable) files in `results/` folder.|
| `checkpoint_interval` | (default = `500000`) The number of experiences collected between each checkpoint by the trainer. A maximum of `keep_checkpoints` checkpoints are saved before old ones are deleted. Each checkpoint saves the `.onnx` (and `.nn` if using TensorFlow) files in `results/` folder.|
| `init_path` | (default = None) Initialize trainer from a previously saved model. Note that the prior run should have used the same trainer configurations as the current run, and have been saved with the same version of ML-Agents. <br><br>You should provide the full path to the folder where the checkpoints were saved, e.g. `./models/{run-id}/{behavior_name}`. This option is provided in case you want to initialize different behaviors from different runs; in most cases, it is sufficient to use the `--initialize-from` CLI parameter to initialize all models from the same run. |
| `threaded` | (default = `true`) By default, model updates can happen while the environment is being stepped. This violates the [on-policy](https://spinningup.openai.com/en/latest/user/algorithms.html#the-on-policy-algorithms) assumption of PPO slightly in exchange for a training speedup. To maintain the strict on-policyness of PPO, you can disable parallel updates by setting `threaded` to `false`. There is usually no reason to turn `threaded` off for SAC. |
| `hyperparameters -> learning_rate` | (default = `3e-4`) Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |

35
docs/Training-ML-Agents.md


- [Curriculum Learning](#curriculum)
- [Training with a Curriculum](#training-with-a-curriculum)
- [Training Using Concurrent Unity Instances](#training-using-concurrent-unity-instances)
- [Using PyTorch (Experimental)](#using-pytorch-experimental)
For a broad overview of reinforcement learning, imitation learning and all the
training scenarios, methods and options within the ML-Agents Toolkit, see

values. See [Using TensorBoard](Using-Tensorboard.md) for more details on how
to visualize the training metrics.
1. Models: these contain the model checkpoints that
are updated throughout training and the final model file (`.nn`). This final
are updated throughout training and the final model file (`.onnx`). This final
model file is generated once either when training completes or is
interrupted.
1. Timers file (under `results/<run-identifier>/run_logs`): this contains aggregated

- **Result Variation Using Concurrent Unity Instances** - If you keep all the
hyperparameters the same, but change `--num-envs=<n>`, the results and model
would likely change.
### Using PyTorch (Experimental)
ML-Agents, by default, uses TensorFlow as its backend, but experimental support
for PyTorch has been added. To use PyTorch, the `torch` Python package must
be installed, and PyTorch must be enabled for your trainer.
#### Installing PyTorch
If you've already installed ML-Agents, follow the
[official PyTorch install instructions](https://pytorch.org/get-started/locally/) for
your platform and configuration. Note that on Windows, you may also need Microsoft's
[Visual C++ Redistributable](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads) if you don't have it already.
If you're installing or upgrading ML-Agents on Linux or Mac, you can also run
`pip3 install mlagents[torch]` instead of `pip3 install mlagents`
during [installation](Installation.md). On Windows, install ML-Agents first and then
separately install PyTorch.
#### Enabling PyTorch
PyTorch can be enabled in one of two ways. First, by adding `--torch` to the
`mlagents-learn` command. This will make all behaviors train with PyTorch.
Second, by changing the `framework` option for your agent behavior in the
configuration YAML as below. This will use PyTorch just for that behavior.
```yaml
behaviors:
YourAgentBehavior:
framework: pytorch
```

2
docs/Training-on-Amazon-Web-Service.md


# Download and install the latest Nvidia driver for ubuntu
# Please refer to http://download.nvidia.com/XFree86/Linux-x86_64/latest.txt
$ wget http://download.nvidia.com/XFree86/Linux-x86_64/390.87/NVIDIA-Linux-x86_64-390.87.run
$ sudo /bin/bash ./NVIDIA-Linux-x86_64-390.67.run --accept-license --no-questions --ui=none
$ sudo /bin/bash ./NVIDIA-Linux-x86_64-390.87.run --accept-license --no-questions --ui=none
# Disable Nouveau as it will clash with the Nvidia driver
$ sudo echo 'blacklist nouveau' | sudo tee -a /etc/modprobe.d/blacklist.conf

5
docs/Unity-Inference-Engine.md


[industry-standard open format](https://onnx.ai/about.html) produced by the
[tf2onnx package](https://github.com/onnx/tensorflow-onnx).
Export to ONNX is currently considered beta. To enable it, make sure
`tf2onnx>=1.5.5` is installed in pip. tf2onnx does not currently support
tensorflow 2.0.0 or later, or earlier than 1.12.0.
Export to ONNX is used if using PyTorch (the default). To enable it
while using TensorFlow, make sure `tf2onnx>=1.6.1` is installed in pip.
## Using the Unity Inference Engine

21
gym-unity/gym_unity/envs/__init__.py


self._previous_decision_step = decision_steps
# Set action spaces
if self.group_spec.is_action_discrete():
branches = self.group_spec.discrete_action_branches
if self.group_spec.action_size == 1:
if self.group_spec.action_spec.is_discrete():
self.action_size = self.group_spec.action_spec.discrete_size
branches = self.group_spec.action_spec.discrete_branches
if self.group_spec.action_spec.discrete_size == 1:
self._action_space = spaces.Discrete(branches[0])
else:
if flatten_branched:

self._action_space = spaces.MultiDiscrete(branches)
else:
elif self.group_spec.action_spec.is_continuous():
high = np.array([1] * self.group_spec.action_shape)
self.action_size = self.group_spec.action_spec.continuous_size
high = np.array([1] * self.group_spec.action_spec.continuous_size)
else:
raise UnityGymException(
"The gym wrapper does not provide explicit support for both discrete "
"and continuous actions."
)
# Set observations space
list_spaces: List[gym.Space] = []

# Translate action into list
action = self._flattener.lookup_action(action)
spec = self.group_spec
action = np.array(action).reshape((1, spec.action_size))
action = np.array(action).reshape((1, self.action_size))
self._env.set_actions(self.name, action)
self._env.step()

8
gym-unity/gym_unity/tests/test_gym.py


from gym_unity.envs import UnityToGymWrapper
from mlagents_envs.base_env import (
BehaviorSpec,
ActionType,
ActionSpec,
DecisionSteps,
TerminalSteps,
BehaviorMapping,

Creates a mock BrainParameters object with parameters.
"""
# Avoid using mutable object as default param
act_type = ActionType.DISCRETE
act_type = ActionType.CONTINUOUS
action_spec = ActionSpec.create_continuous(vector_action_space_size)
action_spec = ActionSpec.create_discrete(vector_action_space_size)
return BehaviorSpec(obs_shapes, act_type, vector_action_space_size)
return BehaviorSpec(obs_shapes, action_spec)
def create_mock_vector_steps(specs, num_agents=1, number_visual_observations=0):

206
ml-agents-envs/mlagents_envs/base_env.py


NamedTuple,
Tuple,
Optional,
Union,
Dict,
Iterator,
Any,

from enum import Enum
from mlagents_envs.exception import UnityActionException
AgentId = int
BehaviorName = str

)
class ActionType(Enum):
DISCRETE = 0
CONTINUOUS = 1
class ActionTuple:
"""
An object whose fields correspond to actions of different types.
Continuous and discrete actions are numpy arrays of type float32 and
int32, respectively and are type checked on construction.
Dimensions are of (n_agents, continuous_size) and (n_agents, discrete_size),
respectively.
"""
def __init__(
self,
continuous: Optional[np.ndarray] = None,
discrete: Optional[np.ndarray] = None,
):
if continuous is not None and continuous.dtype != np.float32:
continuous = continuous.astype(np.float32, copy=False)
self._continuous = continuous
if discrete is not None and discrete.dtype != np.int32:
discrete = discrete.astype(np.int32, copy=False)
self._discrete = discrete
class BehaviorSpec(NamedTuple):
@property
def continuous(self) -> np.ndarray:
return self._continuous
@property
def discrete(self) -> np.ndarray:
return self._discrete
class ActionSpec(NamedTuple):
A NamedTuple to containing information about the observations and actions
spaces for a group of Agents under the same behavior.
- observation_shapes is a List of Tuples of int : Each Tuple corresponds
to an observation's dimensions. The shape tuples have the same ordering as
the ordering of the DecisionSteps and TerminalSteps.
- action_type is the type of data of the action. it can be discrete or
continuous. If discrete, the action tensors are expected to be int32. If
continuous, the actions are expected to be float32.
- action_shape is:
- An int in continuous action space corresponding to the number of
floats that constitute the action.
- A Tuple of int in discrete action space where each int corresponds to
the number of discrete actions available to the agent.
A NamedTuple containing utility functions and information about the action spaces
for a group of Agents under the same behavior.
- num_continuous_actions is an int corresponding to the number of floats which
constitute the action.
- discrete_branch_sizes is a Tuple of int where each int corresponds to
the number of discrete actions available to the agent on an independent action branch.
observation_shapes: List[Tuple]
action_type: ActionType
action_shape: Union[int, Tuple[int, ...]]
continuous_size: int
discrete_branches: Tuple[int, ...]
def __eq__(self, other):
return (
self.continuous_size == other.continuous_size
and self.discrete_branches == other.discrete_branches
)
def __str__(self):
return f"Continuous: {self.continuous_size}, Discrete: {self.discrete_branches}"
def is_action_discrete(self) -> bool:
# For backwards compatibility
def is_discrete(self) -> bool:
return self.action_type == ActionType.DISCRETE
return self.discrete_size > 0 and self.continuous_size == 0
def is_action_continuous(self) -> bool:
# For backwards compatibility
def is_continuous(self) -> bool:
return self.action_type == ActionType.CONTINUOUS
return self.discrete_size == 0 and self.continuous_size > 0
def action_size(self) -> int:
"""
Returns the dimension of the action.
- In the continuous case, will return the number of continuous actions.
- In the (multi-)discrete case, will return the number of action.
branches.
"""
if self.action_type == ActionType.DISCRETE:
return len(self.action_shape) # type: ignore
else:
return self.action_shape # type: ignore
@property
def discrete_action_branches(self) -> Optional[Tuple[int, ...]]:
def discrete_size(self) -> int:
Returns a Tuple of int corresponding to the number of possible actions
for each branch (only for discrete actions). Will return None in
for continuous actions.
Returns a an int corresponding to the number of discrete branches.
if self.action_type == ActionType.DISCRETE:
return self.action_shape # type: ignore
else:
return None
return len(self.discrete_branches)
def create_empty_action(self, n_agents: int) -> np.ndarray:
def empty_action(self, n_agents: int) -> ActionTuple:
Generates a numpy array corresponding to an empty action (all zeros)
Generates ActionTuple corresponding to an empty action (all zeros)
if self.action_type == ActionType.DISCRETE:
return np.zeros((n_agents, self.action_size), dtype=np.int32)
else:
return np.zeros((n_agents, self.action_size), dtype=np.float32)
continuous = np.zeros((n_agents, self.continuous_size), dtype=np.float32)
discrete = np.zeros((n_agents, self.discrete_size), dtype=np.int32)
return ActionTuple(continuous, discrete)
def create_random_action(self, n_agents: int) -> np.ndarray:
def random_action(self, n_agents: int) -> ActionTuple:
Generates a numpy array corresponding to a random action (either discrete
Generates ActionTuple corresponding to a random action (either discrete
:param generator: The random number generator used for creating random action
if self.is_action_continuous():
action = np.random.uniform(
low=-1.0, high=1.0, size=(n_agents, self.action_size)
).astype(np.float32)
return action
elif self.is_action_discrete():
branch_size = self.discrete_action_branches
action = np.column_stack(
continuous = np.random.uniform(
low=-1.0, high=1.0, size=(n_agents, self.continuous_size)
)
discrete = np.zeros((n_agents, self.discrete_size), dtype=np.int32)
if self.discrete_size > 0:
discrete = np.column_stack(
branch_size[i], # type: ignore
self.discrete_branches[i], # type: ignore
for i in range(self.action_size)
for i in range(self.discrete_size)
return action
return ActionTuple(continuous, discrete)
def _validate_action(
self, actions: ActionTuple, n_agents: int, name: str
) -> ActionTuple:
"""
Validates that action has the correct action dim
for the correct number of agents and ensures the type.
"""
_expected_shape = (n_agents, self.continuous_size)
if self.continuous_size > 0 and actions.continuous.shape != _expected_shape:
raise UnityActionException(
f"The behavior {name} needs a continuous input of dimension "
f"{_expected_shape} for (<number of agents>, <action size>) but "
f"received input of dimension {actions.continuous.shape}"
)
_expected_shape = (n_agents, self.discrete_size)
if self.discrete_size > 0 and actions.discrete.shape != _expected_shape:
raise UnityActionException(
f"The behavior {name} needs a discrete input of dimension "
f"{_expected_shape} for (<number of agents>, <action size>) but "
f"received input of dimension {actions.discrete.shape}"
)
return actions
@staticmethod
def create_continuous(continuous_size: int) -> "ActionSpec":
"""
Creates an ActionSpec that is homogenously continuous
"""
return ActionSpec(continuous_size, ())
@staticmethod
def create_discrete(discrete_branches: Tuple[int]) -> "ActionSpec":
"""
Creates an ActionSpec that is homogenously discrete
"""
return ActionSpec(0, discrete_branches)
class BehaviorSpec(NamedTuple):
"""
A NamedTuple containing information about the observation and action
spaces for a group of Agents under the same behavior.
- observation_shapes is a List of Tuples of int : Each Tuple corresponds
to an observation's dimensions. The shape tuples have the same ordering as
the ordering of the DecisionSteps and TerminalSteps.
- action_spec is an ActionSpec NamedTuple
"""
observation_shapes: List[Tuple]
action_spec: ActionSpec
class BehaviorMapping(Mapping):

"""
@abstractmethod
def set_actions(self, behavior_name: BehaviorName, action: np.ndarray) -> None:
def set_actions(self, behavior_name: BehaviorName, action: ActionTuple) -> None:
:param action: A two dimensional np.ndarray corresponding to the action
(either int or float)
:param action: ActionTuple tuple of continuous and/or discrete action.
Actions are np.arrays with dimensions (n_agents, continuous_size) and
(n_agents, discrete_size), respectively.
self, behavior_name: BehaviorName, agent_id: AgentId, action: np.ndarray
self, behavior_name: BehaviorName, agent_id: AgentId, action: ActionTuple
) -> None:
"""
Sets the action for one of the agents in the simulation for the next

:param action: A one dimensional np.ndarray corresponding to the action
(either int or float)
:param action: ActionTuple tuple of continuous and/or discrete action
Actions are np.arrays with dimensions (1, continuous_size) and
(1, discrete_size), respectively. Note, this initial dimensions of 1 is because
this action is meant for a single agent.
"""
@abstractmethod

56
ml-agents-envs/mlagents_envs/environment.py


DecisionSteps,
TerminalSteps,
BehaviorSpec,
ActionTuple,
BehaviorName,
AgentId,
BehaviorMapping,

self._env_state: Dict[str, Tuple[DecisionSteps, TerminalSteps]] = {}
self._env_specs: Dict[str, BehaviorSpec] = {}
self._env_actions: Dict[str, np.ndarray] = {}
self._env_actions: Dict[str, ActionTuple] = {}
self._is_first_message = True
self._update_behavior_specs(aca_output)

n_agents = len(self._env_state[group_name][0])
self._env_actions[group_name] = self._env_specs[
group_name
].create_empty_action(n_agents)
].action_spec.empty_action(n_agents)
step_input = self._generate_step_input(self._env_actions)
with hierarchical_timer("communicator.exchange"):
outputs = self._communicator.exchange(step_input)

f"agent group in the environment"
)
def set_actions(self, behavior_name: BehaviorName, action: np.ndarray) -> None:
def set_actions(self, behavior_name: BehaviorName, action: ActionTuple) -> None:
spec = self._env_specs[behavior_name]
expected_type = np.float32 if spec.is_action_continuous() else np.int32
expected_shape = (len(self._env_state[behavior_name][0]), spec.action_size)
if action.shape != expected_shape:
raise UnityActionException(
f"The behavior {behavior_name} needs an input of dimension "
f"{expected_shape} for (<number of agents>, <action size>) but "
f"received input of dimension {action.shape}"
)
if action.dtype != expected_type:
action = action.astype(expected_type)
action_spec = self._env_specs[behavior_name].action_spec
num_agents = len(self._env_state[behavior_name][0])
action = action_spec._validate_action(action, num_agents, behavior_name)
self, behavior_name: BehaviorName, agent_id: AgentId, action: np.ndarray
self, behavior_name: BehaviorName, agent_id: AgentId, action: ActionTuple
spec = self._env_specs[behavior_name]
expected_shape = (spec.action_size,)
if action.shape != expected_shape:
raise UnityActionException(
f"The Agent {agent_id} with BehaviorName {behavior_name} needs "
f"an input of dimension {expected_shape} but received input of "
f"dimension {action.shape}"
)
expected_type = np.float32 if spec.is_action_continuous() else np.int32
if action.dtype != expected_type:
action = action.astype(expected_type)
action_spec = self._env_specs[behavior_name].action_spec
num_agents = len(self._env_state[behavior_name][0])
action = action_spec._validate_action(action, num_agents, behavior_name)
self._env_actions[behavior_name] = spec.create_empty_action(
len(self._env_state[behavior_name][0])
)
self._env_actions[behavior_name] = action_spec.empty_action(num_agents)
try:
index = np.where(self._env_state[behavior_name][0].agent_id == agent_id)[0][
0

agent_id
)
) from ie
self._env_actions[behavior_name][index] = action
if action_spec.continuous_size > 0:
self._env_actions[behavior_name].continuous[index] = action.continuous[0, :]
if action_spec.discrete_size > 0:
self._env_actions[behavior_name].discrete[index] = action.discrete[0, :]
def get_steps(
self, behavior_name: BehaviorName

@timed
def _generate_step_input(
self, vector_action: Dict[str, np.ndarray]
self, vector_action: Dict[str, ActionTuple]
) -> UnityInputProto:
rl_in = UnityRLInputProto()
for b in vector_action:

for i in range(n_agents):
action = AgentActionProto(vector_actions=vector_action[b][i])
# TODO: extend to AgentBuffers
if vector_action[b].continuous is not None:
_act = vector_action[b].continuous[i]
else:
_act = vector_action[b].discrete[i]
action = AgentActionProto(vector_actions=_act)
rl_in.agent_actions[b].value.extend([action])
rl_in.command = STEP
rl_in.side_channel = bytes(

28
ml-agents-envs/mlagents_envs/rpc_utils.py


from mlagents_envs.base_env import (
ActionSpec,
ActionType,
DecisionSteps,
TerminalSteps,
)

from mlagents_envs.communicator_objects.brain_parameters_pb2 import BrainParametersProto
import numpy as np
import io
from typing import cast, List, Tuple, Union, Collection, Optional, Iterable
from typing import cast, List, Tuple, Collection, Optional, Iterable
from PIL import Image

:return: BehaviorSpec object.
"""
observation_shape = [tuple(obs.shape) for obs in agent_info.observations]
action_type = (
ActionType.DISCRETE
if brain_param_proto.vector_action_space_type_deprecated == 0
else ActionType.CONTINUOUS
action_spec_proto = brain_param_proto.action_spec
action_spec = ActionSpec(
action_spec_proto.num_continuous_actions,
tuple(branch for branch in action_spec_proto.discrete_branch_sizes),
if action_type == ActionType.CONTINUOUS:
action_shape: Union[
int, Tuple[int, ...]
] = brain_param_proto.vector_action_size_deprecated[0]
else:
action_shape = tuple(brain_param_proto.vector_action_size_deprecated)
return BehaviorSpec(observation_shape, action_type, action_shape)
return BehaviorSpec(observation_shape, action_spec)
class OffsetBytesIO:

[agent_info.id for agent_info in terminal_agent_info_list], dtype=np.int32
)
action_mask = None
if behavior_spec.is_action_discrete():
if behavior_spec.action_spec.discrete_size > 0:
a_size = np.sum(behavior_spec.discrete_action_branches)
a_size = np.sum(behavior_spec.action_spec.discrete_branches)
mask_matrix = np.ones((n_agents, a_size), dtype=np.bool)
for agent_index, agent_info in enumerate(decision_agent_info_list):
if agent_info.action_mask is not None:

for k in range(a_size)
]
action_mask = (1 - mask_matrix).astype(np.bool)
indices = _generate_split_indices(behavior_spec.discrete_action_branches)
indices = _generate_split_indices(
behavior_spec.action_spec.discrete_branches
)
action_mask = np.split(action_mask, indices, axis=1)
return (
DecisionSteps(

19
ml-agents-envs/mlagents_envs/tests/test_envs.py


from unittest import mock
import pytest
import numpy as np
from mlagents_envs.base_env import DecisionSteps, TerminalSteps
from mlagents_envs.base_env import DecisionSteps, TerminalSteps, ActionTuple
from mlagents_envs.exception import UnityEnvironmentException, UnityActionException
from mlagents_envs.mock_communicator import MockCommunicator

env.step()
decision_steps, terminal_steps = env.get_steps("RealFakeBrain")
n_agents = len(decision_steps)
env.set_actions(
"RealFakeBrain", np.zeros((n_agents, spec.action_size), dtype=np.float32)
)
env.set_actions("RealFakeBrain", spec.action_spec.empty_action(n_agents))
env.set_actions(
"RealFakeBrain",
np.zeros((n_agents - 1, spec.action_size), dtype=np.float32),
)
env.set_actions("RealFakeBrain", spec.action_spec.empty_action(n_agents - 1))
env.set_actions(
"RealFakeBrain", -1 * np.ones((n_agents, spec.action_size), dtype=np.float32)
)
_empty_act = spec.action_spec.empty_action(n_agents)
next_action = ActionTuple(_empty_act.continuous - 1, _empty_act.discrete - 1)
env.set_actions("RealFakeBrain", next_action)
env.step()
env.close()

30
ml-agents-envs/mlagents_envs/tests/test_rpc_utils.py


from mlagents_envs.communicator_objects.agent_action_pb2 import AgentActionProto
from mlagents_envs.base_env import (
BehaviorSpec,
ActionType,
ActionSpec,
DecisionSteps,
TerminalSteps,
)

def test_batched_step_result_from_proto():
n_agents = 10
shapes = [(3,), (4,)]
spec = BehaviorSpec(shapes, ActionType.CONTINUOUS, 3)
spec = BehaviorSpec(shapes, ActionSpec.create_continuous(3))
ap_list = generate_list_agent_proto(n_agents, shapes)
decision_steps, terminal_steps = steps_from_proto(ap_list, spec)
for agent_id in range(n_agents):

def test_action_masking_discrete():
n_agents = 10
shapes = [(3,), (4,)]
behavior_spec = BehaviorSpec(shapes, ActionType.DISCRETE, (7, 3))
behavior_spec = BehaviorSpec(shapes, ActionSpec.create_discrete((7, 3)))
ap_list = generate_list_agent_proto(n_agents, shapes)
decision_steps, terminal_steps = steps_from_proto(ap_list, behavior_spec)
masks = decision_steps.action_mask

def test_action_masking_discrete_1():
n_agents = 10
shapes = [(3,), (4,)]
behavior_spec = BehaviorSpec(shapes, ActionType.DISCRETE, (10,))
behavior_spec = BehaviorSpec(shapes, ActionSpec.create_discrete((10,)))
ap_list = generate_list_agent_proto(n_agents, shapes)
decision_steps, terminal_steps = steps_from_proto(ap_list, behavior_spec)
masks = decision_steps.action_mask

def test_action_masking_discrete_2():
n_agents = 10
shapes = [(3,), (4,)]
behavior_spec = BehaviorSpec(shapes, ActionType.DISCRETE, (2, 2, 6))
behavior_spec = BehaviorSpec(shapes, ActionSpec.create_discrete((2, 2, 6)))
ap_list = generate_list_agent_proto(n_agents, shapes)
decision_steps, terminal_steps = steps_from_proto(ap_list, behavior_spec)
masks = decision_steps.action_mask

def test_action_masking_continuous():
n_agents = 10
shapes = [(3,), (4,)]
behavior_spec = BehaviorSpec(shapes, ActionType.CONTINUOUS, 10)
behavior_spec = BehaviorSpec(shapes, ActionSpec.create_continuous(10))
ap_list = generate_list_agent_proto(n_agents, shapes)
decision_steps, terminal_steps = steps_from_proto(ap_list, behavior_spec)
masks = decision_steps.action_mask

bp.vector_action_size_deprecated.extend([5, 4])
bp.vector_action_space_type_deprecated = 0
behavior_spec = behavior_spec_from_proto(bp, agent_proto)
assert behavior_spec.is_action_discrete()
assert not behavior_spec.is_action_continuous()
assert behavior_spec.action_spec.is_discrete()
assert not behavior_spec.action_spec.is_continuous()
assert behavior_spec.discrete_action_branches == (5, 4)
assert behavior_spec.action_size == 2
assert behavior_spec.action_spec.discrete_branches == (5, 4)
assert behavior_spec.action_spec.discrete_size == 2
assert not behavior_spec.is_action_discrete()
assert behavior_spec.is_action_continuous()
assert behavior_spec.action_size == 6
assert not behavior_spec.action_spec.is_discrete()
assert behavior_spec.action_spec.is_continuous()
assert behavior_spec.action_spec.continuous_size == 6
behavior_spec = BehaviorSpec(shapes, ActionType.CONTINUOUS, 3)
behavior_spec = BehaviorSpec(shapes, ActionSpec.create_continuous(3))
ap_list = generate_list_agent_proto(n_agents, shapes, infinite_rewards=True)
with pytest.raises(RuntimeError):
steps_from_proto(ap_list, behavior_spec)

n_agents = 10
shapes = [(3,), (4,)]
behavior_spec = BehaviorSpec(shapes, ActionType.CONTINUOUS, 3)
behavior_spec = BehaviorSpec(shapes, ActionSpec.create_continuous(3))
ap_list = generate_list_agent_proto(n_agents, shapes, nan_observations=True)
with pytest.raises(RuntimeError):
steps_from_proto(ap_list, behavior_spec)

71
ml-agents-envs/mlagents_envs/tests/test_steps.py


from mlagents_envs.base_env import (
DecisionSteps,
TerminalSteps,
ActionType,
ActionSpec,
BehaviorSpec,
)

def test_empty_decision_steps():
specs = BehaviorSpec(
observation_shapes=[(3, 2), (5,)],
action_type=ActionType.CONTINUOUS,
action_shape=3,
observation_shapes=[(3, 2), (5,)], action_spec=ActionSpec.create_continuous(3)
)
ds = DecisionSteps.empty(specs)
assert len(ds.obs) == 2

def test_empty_terminal_steps():
specs = BehaviorSpec(
observation_shapes=[(3, 2), (5,)],
action_type=ActionType.CONTINUOUS,
action_shape=3,
observation_shapes=[(3, 2), (5,)], action_spec=ActionSpec.create_continuous(3)
)
ts = TerminalSteps.empty(specs)
assert len(ts.obs) == 2

def test_specs():
specs = BehaviorSpec(
observation_shapes=[(3, 2), (5,)],
action_type=ActionType.CONTINUOUS,
action_shape=3,
)
assert specs.discrete_action_branches is None
assert specs.action_size == 3
assert specs.create_empty_action(5).shape == (5, 3)
assert specs.create_empty_action(5).dtype == np.float32
specs = ActionSpec.create_continuous(3)
assert specs.discrete_branches == ()
assert specs.discrete_size == 0
assert specs.continuous_size == 3
assert specs.empty_action(5).continuous.shape == (5, 3)
assert specs.empty_action(5).continuous.dtype == np.float32
specs = BehaviorSpec(
observation_shapes=[(3, 2), (5,)],
action_type=ActionType.DISCRETE,
action_shape=(3,),
)
assert specs.discrete_action_branches == (3,)
assert specs.action_size == 1
assert specs.create_empty_action(5).shape == (5, 1)
assert specs.create_empty_action(5).dtype == np.int32
specs = ActionSpec.create_discrete((3,))
assert specs.discrete_branches == (3,)
assert specs.discrete_size == 1
assert specs.continuous_size == 0
assert specs.empty_action(5).discrete.shape == (5, 1)
assert specs.empty_action(5).discrete.dtype == np.int32
specs = ActionSpec(3, (3,))
assert specs.continuous_size == 3
assert specs.discrete_branches == (3,)
assert specs.discrete_size == 1
assert specs.empty_action(5).continuous.shape == (5, 3)
assert specs.empty_action(5).continuous.dtype == np.float32
assert specs.empty_action(5).discrete.shape == (5, 1)
assert specs.empty_action(5).discrete.dtype == np.int32
specs = BehaviorSpec(
observation_shapes=[(5,)],
action_type=ActionType.CONTINUOUS,
action_shape=action_len,
)
zero_action = specs.create_empty_action(4)
specs = ActionSpec.create_continuous(action_len)
zero_action = specs.empty_action(4).continuous
random_action = specs.create_random_action(4)
print(specs.random_action(4))
random_action = specs.random_action(4).continuous
print(random_action)
assert random_action.dtype == np.float32
assert random_action.shape == (4, action_len)
assert np.min(random_action) >= -1

action_shape = (10, 20, 30)
specs = BehaviorSpec(
observation_shapes=[(5,)],
action_type=ActionType.DISCRETE,
action_shape=action_shape,
)
zero_action = specs.create_empty_action(4)
specs = ActionSpec.create_discrete(action_shape)
zero_action = specs.empty_action(4).discrete
random_action = specs.create_random_action(4)
random_action = specs.random_action(4).discrete
assert random_action.dtype == np.int32
assert random_action.shape == (4, len(action_shape))
assert np.min(random_action) >= 0

1
ml-agents/mlagents/tf_utils/__init__.py


from mlagents.tf_utils.tf import tf as tf # noqa
from mlagents.tf_utils.tf import set_warnings_enabled # noqa
from mlagents.tf_utils.tf import generate_session_config # noqa
from mlagents.tf_utils.tf import is_available # noqa

63
ml-agents/mlagents/tf_utils/tf.py


# This should be the only place that we import tensorflow directly.
# Everywhere else is caught by the banned-modules setting for flake8
import tensorflow as tf # noqa I201
try:
import tensorflow as tf # noqa I201
# LooseVersion handles things "1.2.3a" or "4.5.6-rc7" fairly sensibly.
_is_tensorflow2 = LooseVersion(tf.__version__) >= LooseVersion("2.0.0")
# LooseVersion handles things "1.2.3a" or "4.5.6-rc7" fairly sensibly.
_is_tensorflow2 = LooseVersion(tf.__version__) >= LooseVersion("2.0.0")
if _is_tensorflow2:
import tensorflow.compat.v1 as tf
if _is_tensorflow2:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
tf_logging = tf.logging
else:
try:
# Newer versions of tf 1.x will complain that tf.logging is deprecated
tf_logging = tf.compat.v1.logging
except AttributeError:
# Fall back to the safe import, even if it might generate a warning or two.
tf.disable_v2_behavior()
else:
try:
# Newer versions of tf 1.x will complain that tf.logging is deprecated
tf_logging = tf.compat.v1.logging
except AttributeError:
# Fall back to the safe import, even if it might generate a warning or two.
tf_logging = tf.logging
except ImportError:
tf = None
def is_available():
"""