
Merge branch 'master' into develop-sampler-refactor

/sampler-refactor-copy
Andrew Cohen, 4 years ago
Current commit e7750fc9
31 changed files with 443 additions and 99 deletions
1. .pre-commit-search-and-replace.yaml (4)
2. README.md (4)
3. com.unity.ml-agents/CHANGELOG.md (16)
4. config/sac/Walker.yaml (10)
5. config/sac/WallJump.yaml (8)
6. docs/Background-Machine-Learning.md (2)
7. docs/FAQ.md (14)
8. docs/Installation.md (2)
9. docs/Learning-Environment-Create-New.md (38)
10. docs/Migrating.md (4)
11. docs/Training-ML-Agents.md (8)
12. docs/Training-on-Amazon-Web-Service.md (2)
13. ml-agents-envs/mlagents_envs/rpc_communicator.py (5)
14. ml-agents/README.md (2)
15. ml-agents/mlagents/model_serialization.py (10)
16. ml-agents/mlagents/trainers/cli_utils.py (7)
17. ml-agents/mlagents/trainers/ghost/trainer.py (2)
18. ml-agents/mlagents/trainers/learn.py (22)
19. ml-agents/mlagents/trainers/meta_curriculum.py (25)
20. ml-agents/mlagents/trainers/policy/tf_policy.py (62)
21. ml-agents/mlagents/trainers/sac/trainer.py (22)
22. ml-agents/mlagents/trainers/settings.py (1)
23. ml-agents/mlagents/trainers/tests/test_learn.py (7)
24. ml-agents/mlagents/trainers/tests/test_meta_curriculum.py (23)
25. ml-agents/mlagents/trainers/tests/test_nn_policy.py (21)
26. ml-agents/mlagents/trainers/tests/test_policy.py (8)
27. ml-agents/mlagents/trainers/tests/test_sac.py (16)
28. ml-agents/mlagents/trainers/tests/test_simple_rl.py (18)
29. ml-agents/mlagents/trainers/trainer_controller.py (4)
30. ml-agents/mlagents/trainers/tests/test_training_status.py (60)
31. ml-agents/mlagents/trainers/training_status.py (115)

.pre-commit-search-and-replace.yaml (4)


search: /ML[ -]Agents toolkit/
replacement: ML-Agents Toolkit
insensitive: true
- description: Replace "the the"
search: /the the/
replacement: the
insensitive: true

README.md (4)


[contribution guidelines](com.unity.ml-agents/CONTRIBUTING.md) and
[code of conduct](CODE_OF_CONDUCT.md).
For problems with the installation and setup of the the ML-Agents Toolkit, or
For problems with the installation and setup of the ML-Agents Toolkit, or
using the ML-Agents Toolkit, or have a specific feature requests, please
using the ML-Agents Toolkit or have a specific feature request, please
[submit a GitHub issue](https://github.com/Unity-Technologies/ml-agents/issues).
Your opinion matters a great deal to us. Only by hearing your thoughts on the

com.unity.ml-agents/CHANGELOG.md (16)


- `use_visual` and `allow_multiple_visual_obs` in the `UnityToGymWrapper` constructor
were replaced by `allow_multiple_obs` which allows one or more visual observations and
vector observations to be used simultaneously. (#3981) Thank you @shakenes !
### Minor Changes
#### com.unity.ml-agents (C#)
- `ObservableAttribute` was added. Adding the attribute to fields or properties on an Agent will allow it to generate
observations via reflection. (#3925, #4006)
#### ml-agents / ml-agents-envs / gym-unity (Python)
- Curriculum and Parameter Randomization configurations have been merged
into the main training configuration file. Note that this means training
configuration files are now environment-specific. (#3791)

directory. (#3829)
- When using Curriculum, the current lesson will resume if training is quit and resumed. As such,
the `--lesson` CLI option has been removed. (#4025)
### Minor Changes
#### com.unity.ml-agents (C#)
- `ObservableAttribute` was added. Adding the attribute to fields or properties on an Agent will allow it to generate
observations via reflection. (#3925, #4006)
#### ml-agents / ml-agents-envs / gym-unity (Python)
- When trying to load/resume from a checkpoint created with an earlier version of ML-Agents,
a warning will be thrown. (#4035)
- Fixed an issue where SAC would perform too many model updates when resuming from a
checkpoint, and too few when using `buffer_init_steps`. (#4038)
#### com.unity.ml-agents (C#)
#### ml-agents / ml-agents-envs / gym-unity (Python)
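The changelog entry above replaces `use_visual` and `allow_multiple_visual_obs` in the `UnityToGymWrapper` constructor with a single `allow_multiple_obs` flag. A minimal sketch of how the wrapper could be constructed with the new flag is below; the environment path is a placeholder, and the exact signature should be checked against the gym-unity package for your release.

```python
# Hedged sketch: using the new allow_multiple_obs flag from the changelog entry.
# "./3DBall" is a placeholder path to a built Unity environment.
from mlagents_envs.environment import UnityEnvironment
from gym_unity.envs import UnityToGymWrapper

unity_env = UnityEnvironment("./3DBall")
env = UnityToGymWrapper(unity_env, allow_multiple_obs=True)

# With allow_multiple_obs=True the observation is a list that can hold both
# visual and vector observations for the agent.
obs = env.reset()
print(type(obs), len(obs))
env.close()
```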

config/sac/Walker.yaml (10)


hyperparameters:
learning_rate: 0.0003
learning_rate_schedule: constant
batch_size: 256
buffer_size: 500000
batch_size: 1024
buffer_size: 2000000
buffer_init_steps: 0
tau: 0.005
steps_per_update: 30.0

network_settings:
normalize: true
hidden_units: 512
num_layers: 4
hidden_units: 256
num_layers: 3
vis_encode_type: simple
reward_signals:
extrinsic:

keep_checkpoints: 5
max_steps: 20000000
max_steps: 15000000
time_horizon: 1000
summary_freq: 30000
threaded: true

config/sac/WallJump.yaml (8)


learning_rate: 0.0003
learning_rate_schedule: constant
batch_size: 128
buffer_size: 50000
buffer_size: 200000
steps_per_update: 10.0
steps_per_update: 20.0
save_replay_buffer: false
init_entcoef: 0.1
reward_signal_steps_per_update: 10.0

strength: 1.0
output_path: default
keep_checkpoints: 5
max_steps: 20000000
max_steps: 15000000
time_horizon: 128
summary_freq: 20000
threaded: true

buffer_size: 50000
buffer_init_steps: 0
tau: 0.005
steps_per_update: 10.0
steps_per_update: 20.0
save_replay_buffer: false
init_entcoef: 0.1
reward_signal_steps_per_update: 10.0

docs/Background-Machine-Learning.md (2)


water hose and whether the hose is on or off).
The last remaining piece of the reinforcement learning task is the **reward
signal**. When training a robot to be a mean firefighting machine, we provide it
signal**. The robot is trained to learn a policy that maximizes its overall rewards. When training a robot to be a mean firefighting machine, we provide it
with rewards (positive and negative) indicating how well it is doing on
completing the task. Note that the robot does not _know_ how to put out fires
before it is trained. It learns the objective because it receives a large

docs/FAQ.md (14)


search the tensorflow github issues for similar problems and solutions before
creating a new issue.
#### Visual C++ Dependency (Windows Users)
When running `mlagents-learn`, if you see a stack trace with a message like this:
```console
ImportError: DLL load failed: The specified module could not be found.
```
then one of the required DLLs, `msvcp140.dll` (old) or `msvcp140_1.dll` (new), is missing on your machine. The `import tensorflow` command triggers this error.
To solve it, download and install (then reboot) the [Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017 and 2019](https://support.microsoft.com/en-my/help/2977003/the-latest-supported-visual-c-downloads).
For more details, please see the [TensorFlow 2.1.0 release notes](https://github.com/tensorflow/tensorflow/releases/tag/v2.1.0)
and the [TensorFlow github issue](https://github.com/tensorflow/tensorflow/issues/22794#issuecomment-573297027).
## Environment Permission Error
If you directly import your Unity environment without building it in the editor,

docs/Installation.md (2)


order to find it.
**NOTE:** If you do not see the ML-Agents package listed in the Package Manager
please follow the the [advanced installation instructions](#advanced-local-installation-for-development) below.
please follow the [advanced installation instructions](#advanced-local-installation-for-development) below.
#### Advanced: Local Installation for Development

docs/Learning-Environment-Create-New.md (38)


```yml
behaviors:
RollerBall:
trainer: ppo
batch_size: 10
beta: 5.0e-3
buffer_size: 100
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 3.0e-4
learning_rate_schedule: linear
max_steps: 5.0e4
memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2
trainer_type: ppo
hyperparameters:
batch_size: 10
buffer_size: 100
learning_rate: 3.0e-4
beta: 5.0e-4
epsilon: 0.2
lambd: 0.99
num_epoch: 3
learning_rate_schedule: linear
network_settings:
normalize: false
hidden_units: 128
num_layers: 2
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
max_steps: 500000
use_recurrent: false
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
```
Since this example creates a very simple training environment with only a few

docs/Migrating.md (4)


- `use_visual` and `allow_multiple_visual_obs` in the `UnityToGymWrapper` constructor
were replaced by `allow_multiple_obs` which allows one or more visual observations and
vector observations to be used simultaneously.
- `--lesson` has been removed from the CLI. Lessons will resume when using `--resume`.
To start at a different lesson, modify your Curriculum configuration.
### Steps to Migrate
- To upgrade your configuration files, an upgrade script has been provided. Run `python config/update_config.py

`RayPerception3d.Perceive()` that was causing the `endOffset` to be used
incorrectly. However this may produce different behavior from previous
versions if you use a non-zero `startOffset`. To reproduce the old behavior,
you should increase the the value of `endOffset` by `startOffset`. You can
you should increase the value of `endOffset` by `startOffset`. You can
verify your raycasts are performing as expected in scene view using the debug
rays.
- If you use RayPerception3D, replace it with RayPerceptionSensorComponent3D

docs/Training-ML-Agents.md (8)


mlagents-learn config/ppo/WallJump_curriculum.yaml --run-id=wall-jump-curriculum
```
We can then keep track of the current lessons and progresses via TensorBoard.
**Note**: If you are resuming a training session that uses curriculum, please
pass the number of the last-reached lesson using the `--lesson` flag when
running `mlagents-learn`.
We can then keep track of the current lessons and progress via TensorBoard. If you've terminated
the run, you can resume it using `--resume`, and lesson progress will start off where it
ended.
### Environment Parameter Randomization

docs/Training-on-Amazon-Web-Service.md (2)


Fatal server error:
(EE) no screens found(EE)
(EE)
Please consult the The X.Org Foundation support
Please consult the X.Org Foundation support
at http://wiki.x.org
for help.
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.

ml-agents-envs/mlagents_envs/rpc_communicator.py (5)


import grpc
from typing import Optional
from sys import platform
import socket
from multiprocessing import Pipe
from concurrent.futures import ThreadPoolExecutor

Attempts to bind to the requested communicator port, checking if it is already in use.
"""
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
if platform == "linux" or platform == "linux2":
# On linux, the port remains unusable for TIME_WAIT=60 seconds after closing
# SO_REUSEADDR frees the port right after closing the environment
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
try:
s.bind(("localhost", port))
except socket.error:
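The hunk above only shows part of the port check in `rpc_communicator.py`. Below is a self-contained sketch of the same idea, not the ML-Agents implementation itself; `port_in_use` is a made-up helper name.

```python
# Hedged sketch of a port-availability check with SO_REUSEADDR, mirroring the
# rpc_communicator.py change above. port_in_use is an illustrative helper.
import socket
from sys import platform


def port_in_use(port: int) -> bool:
    """Return True if nothing can currently bind to `port` on localhost."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if platform == "linux" or platform == "linux2":
        # On Linux the port would otherwise stay unusable for TIME_WAIT (~60s)
        # after the previous environment closed; SO_REUSEADDR frees it right away.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("localhost", port))
        return False
    except socket.error:
        return True
    finally:
        s.close()


print(port_in_use(5005))  # 5005 is typically the ML-Agents base environment port
```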

ml-agents/README.md (2)


cooperative behavior among different agents is not stable.
- Resuming self-play from a checkpoint resets the reported ELO to the default
value.
- Resuming curriculum learning from a checkpoint requires the last lesson be
specified using the `--lesson` CLI option

ml-agents/mlagents/model_serialization.py (10)


)
MODEL_CONSTANTS = frozenset(
["action_output_shape", "is_continuous_control", "memory_size", "version_number"]
[
"action_output_shape",
"is_continuous_control",
"memory_size",
"version_number",
"trainer_major_version",
"trainer_minor_version",
"trainer_patch_version",
]
)
VISUAL_OBSERVATION_PREFIX = "visual_observation_"

ml-agents/mlagents/trainers/cli_utils.py (7)


action=DetectDefault,
)
argparser.add_argument(
"--lesson",
default=0,
type=int,
help="The lesson to start with when performing curriculum training",
action=DetectDefault,
)
argparser.add_argument(
"--load",
default=False,
dest="load_model",

ml-agents/mlagents/trainers/ghost/trainer.py (2)


)
)
# Counts the The number of steps of the ghost policies. Snapshot swapping
# Counts the number of steps of the ghost policies. Snapshot swapping
# depends on this counter whereas snapshot saving and team switching depends
# on the wrapped. This ensures that all teams train for the same number of trainer
# steps.

ml-agents/mlagents/trainers/learn.py (22)


from mlagents.trainers.cli_utils import parser
from mlagents_envs.environment import UnityEnvironment
from mlagents.trainers.settings import RunOptions
from mlagents.trainers.training_status import GlobalTrainingStatus
from mlagents_envs.base_env import BaseEnv
from mlagents.trainers.subprocess_env_manager import SubprocessEnvManager
from mlagents_envs.side_channel.side_channel import SideChannel

from mlagents_envs import logging_util
logger = logging_util.get_logger(__name__)
TRAINING_STATUS_FILE_NAME = "training_status.json"
def get_version_string() -> str:

)
# Make run logs directory
os.makedirs(run_logs_dir, exist_ok=True)
# Load any needed states
if checkpoint_settings.resume:
GlobalTrainingStatus.load_state(
os.path.join(run_logs_dir, "training_status.json")
)
# Configure CSV, Tensorboard Writers and StatsReporter
# We assume reward and episode length are needed in the CSV.
csv_writer = CSVWriter(

env_factory, engine_config, env_settings.num_envs
)
maybe_meta_curriculum = try_create_meta_curriculum(
options.curriculum, env_manager, checkpoint_settings.lesson
options.curriculum, env_manager, restore=checkpoint_settings.resume
)
maybe_add_samplers(options.parameter_randomization, env_manager)

env_manager.close()
write_run_options(write_path, options)
write_timing_tree(run_logs_dir)
write_training_status(run_logs_dir)
def write_run_options(output_dir: str, run_options: RunOptions) -> None:

)
def write_training_status(output_dir: str) -> None:
GlobalTrainingStatus.save_state(os.path.join(output_dir, TRAINING_STATUS_FILE_NAME))
def write_timing_tree(output_dir: str) -> None:
timing_path = os.path.join(output_dir, "timers.json")
try:

def try_create_meta_curriculum(
curriculum_config: Optional[Dict], env: SubprocessEnvManager, lesson: int
curriculum_config: Optional[Dict], env: SubprocessEnvManager, restore: bool = False
# TODO: Should be able to start learning at different lesson numbers
# for each curriculum.
meta_curriculum.set_all_curricula_to_lesson_num(lesson)
if restore:
meta_curriculum.try_restore_all_curriculum()
return meta_curriculum
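Taken together, the `learn.py` changes above wire the training-status file into the run lifecycle: load it when resuming, restore curricula from it, and write it back with the other run logs when training ends. Below is a rough sketch of that flow, using only the `GlobalTrainingStatus` calls that appear in this diff; the `run_training` wrapper and its arguments are illustrative.

```python
# Hedged sketch of the resume lifecycle added in learn.py. The wrapper function
# is made up for illustration; load_state/save_state are the calls from the diff.
import os

from mlagents.trainers.training_status import GlobalTrainingStatus

TRAINING_STATUS_FILE_NAME = "training_status.json"


def run_training(run_logs_dir: str, resume: bool) -> None:
    os.makedirs(run_logs_dir, exist_ok=True)
    if resume:
        # Restore lesson numbers and other global state from the previous run.
        GlobalTrainingStatus.load_state(
            os.path.join(run_logs_dir, TRAINING_STATUS_FILE_NAME)
        )
    try:
        pass  # ... set up env manager, meta curriculum, trainer controller ...
    finally:
        # Persist global state so a later --resume picks up where this run stopped.
        GlobalTrainingStatus.save_state(
            os.path.join(run_logs_dir, TRAINING_STATUS_FILE_NAME)
        )
```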

ml-agents/mlagents/trainers/meta_curriculum.py (25)


from typing import Dict, Set
from mlagents.trainers.curriculum import Curriculum
from mlagents.trainers.settings import CurriculumSettings
from mlagents.trainers.training_status import GlobalTrainingStatus, StatusType
from mlagents_envs.logging_util import get_logger

)
return ret
def set_all_curricula_to_lesson_num(self, lesson_num):
"""Sets all the curricula in this meta curriculum to a specified
lesson number.
Args:
lesson_num (int): The lesson number which all the curricula will
be set to.
"""
for _, curriculum in self.brains_to_curricula.items():
curriculum.lesson_num = lesson_num
def try_restore_all_curriculum(self):
"""
Tries to restore all the curriculums to what is saved in training_status.json
"""
for brain_name, curriculum in self.brains_to_curricula.items():
lesson_num = GlobalTrainingStatus.get_parameter_state(
brain_name, StatusType.LESSON_NUM
)
if lesson_num is not None:
logger.info(
f"Resuming curriculum for {brain_name} at lesson {lesson_num}."
)
curriculum.lesson_num = lesson_num
else:
curriculum.lesson_num = 0
def get_config(self):
"""Get the combined configuration of all curricula in this

ml-agents/mlagents/trainers/policy/tf_policy.py (62)


from typing import Any, Dict, List, Optional
from typing import Any, Dict, List, Optional, Tuple
from distutils.version import LooseVersion
from mlagents.tf_utils import tf
from mlagents import tf_utils
from mlagents_envs.exception import UnityException

from mlagents.trainers.models import ModelUtils
from mlagents.trainers.settings import TrainerSettings, NetworkSettings
from mlagents.trainers.brain import BrainParameters
from mlagents.trainers import __version__
# This is the version number of the inputs and outputs of the model, and
# determines compatibility with inference in Barracuda.
MODEL_FORMAT_VERSION = 2
class UnityPolicyException(UnityException):

:param brain: The corresponding Brain for this policy.
:param trainer_settings: The trainer parameters.
"""
self._version_number_ = 2
self.m_size = 0
self.trainer_settings = trainer_settings
self.network_settings: NetworkSettings = trainer_settings.network_settings

"""
pass
@staticmethod
def _convert_version_string(version_string: str) -> Tuple[int, ...]:
"""
Converts the version string into a Tuple of ints (major_ver, minor_ver, patch_ver).
:param version_string: The semantic-versioned version string (X.Y.Z).
:return: A Tuple containing (major_ver, minor_ver, patch_ver).
"""
ver = LooseVersion(version_string)
return tuple(map(int, ver.version[0:3]))
def _check_model_version(self, version: str) -> None:
"""
Checks whether the model being loaded was created with the same version of
ML-Agents, and throw a warning if not so.
"""
if self.version_tensors is not None:
loaded_ver = tuple(
num.eval(session=self.sess) for num in self.version_tensors
)
if loaded_ver != TFPolicy._convert_version_string(version):
logger.warning(
f"The model checkpoint you are loading from was saved with ML-Agents version "
f"{loaded_ver[0]}.{loaded_ver[1]}.{loaded_ver[2]} but your current ML-Agents"
f"version is {version}. Model may not behave properly."
)
def _initialize_graph(self):
with self.graph.as_default():
self.saver = tf.train.Saver(max_to_keep=self.keep_checkpoints)

model_path
)
)
self._check_model_version(__version__)
if reset_global_steps:
self._set_step(0)
logger.info(

self.prev_action: Optional[tf.Tensor] = None
self.memory_in: Optional[tf.Tensor] = None
self.memory_out: Optional[tf.Tensor] = None
self.version_tensors: Optional[Tuple[tf.Tensor, tf.Tensor, tf.Tensor]] = None
def create_input_placeholders(self):
with self.graph.as_default():

trainable=False,
dtype=tf.int32,
)
int_version = TFPolicy._convert_version_string(__version__)
major_ver_t = tf.Variable(
int_version[0],
name="trainer_major_version",
trainable=False,
dtype=tf.int32,
)
minor_ver_t = tf.Variable(
int_version[1],
name="trainer_minor_version",
trainable=False,
dtype=tf.int32,
)
patch_ver_t = tf.Variable(
int_version[2],
name="trainer_patch_version",
trainable=False,
dtype=tf.int32,
)
self.version_tensors = (major_ver_t, minor_ver_t, patch_ver_t)
self._version_number_,
MODEL_FORMAT_VERSION,
name="version_number",
trainable=False,
dtype=tf.int32,
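The new version tensors above let a loaded checkpoint report which trainer version produced it, and `_check_model_version` compares that against the running `mlagents.trainers.__version__`. Below is a stripped-down sketch of the comparison in plain Python, with no TensorFlow and with made-up helper names.

```python
# Hedged sketch of the version-compatibility warning, not the TFPolicy code.
# parse_semver and warn_if_version_mismatch are illustrative names.
import logging
from typing import Tuple

logger = logging.getLogger(__name__)


def parse_semver(version_string: str) -> Tuple[int, int, int]:
    """Return (major, minor, patch) from e.g. '1.0.3' or '1.0.3.dev0'."""
    parts = []
    for token in version_string.split("."):
        if not token.isdigit():
            break
        parts.append(int(token))
    if len(parts) < 3:
        raise ValueError(f"Cannot parse version string: {version_string}")
    return parts[0], parts[1], parts[2]


def warn_if_version_mismatch(checkpoint_version: str, current_version: str) -> None:
    if parse_semver(checkpoint_version) != parse_semver(current_version):
        logger.warning(
            "Checkpoint was saved with ML-Agents %s but the current version is %s. "
            "The model may not behave properly.",
            checkpoint_version,
            current_version,
        )


warn_if_version_mismatch("0.16.1", "0.17.0")  # logs a warning
warn_if_version_mismatch("0.17.0.dev0", "0.17.0")  # same (major, minor, patch): no warning
```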

ml-agents/mlagents/trainers/sac/trainer.py (22)


:param training: Whether the trainer is set for training.
:param load: Whether the model should be loaded.
:param seed: The seed the model will be initialized with
:param run_id: The The identifier of the current run
:param run_id: The identifier of the current run
"""
super().__init__(
brain_name, trainer_settings, training, run_id, reward_buff_cap

)
self.step = 0
# Don't count buffer_init_steps in steps_per_update ratio, but also don't divide-by-0
self.update_steps = max(1, self.hyperparameters.buffer_init_steps)
self.reward_signal_update_steps = max(1, self.hyperparameters.buffer_init_steps)
# Don't divide by zero
self.update_steps = 1
self.reward_signal_update_steps = 1
self.steps_per_update = self.hyperparameters.steps_per_update
self.reward_signal_steps_per_update = (

)
batch_update_stats: Dict[str, list] = defaultdict(list)
while self.step / self.update_steps > self.steps_per_update:
while (
self.step - self.hyperparameters.buffer_init_steps
) / self.update_steps > self.steps_per_update:
logger.debug("Updating SAC policy at step {}".format(self.step))
buffer = self.update_buffer
if self.update_buffer.num_experiences >= self.hyperparameters.batch_size:

)
batch_update_stats: Dict[str, list] = defaultdict(list)
while (
self.step / self.reward_signal_update_steps
> self.reward_signal_steps_per_update
):
self.step - self.hyperparameters.buffer_init_steps
) / self.reward_signal_update_steps > self.reward_signal_steps_per_update:
# Get minibatches for reward signal update if needed
reward_signal_minibatches = {}
for name, signal in self.optimizer.reward_signals.items():

self.collected_rewards[_reward_signal] = defaultdict(lambda: 0)
# Needed to resume loads properly
self.step = policy.get_current_step()
# Assume steps were updated at the correct ratio before
self.update_steps = int(max(1, self.step / self.steps_per_update))
self.reward_signal_update_steps = int(
max(1, self.step / self.reward_signal_steps_per_update)
)
self.next_summary_step = self._get_next_summary_step()
def get_policy(self, name_behavior_id: str) -> TFPolicy:
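The while-loop change above subtracts `buffer_init_steps` from the step count, so warm-up steps collected before learning starts no longer inflate the number of SAC updates, and `update_steps` is restored from the ratio when resuming from a checkpoint. Below is a small, self-contained sketch of the bookkeeping; the function and the numbers are illustrative, not part of the trainer.

```python
# Hedged sketch of the steps_per_update bookkeeping after the fix above.
# count_policy_updates is a made-up helper, not SACTrainer code.
def count_policy_updates(
    total_env_steps: int, buffer_init_steps: int, steps_per_update: float
) -> int:
    update_steps = 1  # updates performed so far; start at 1 to avoid divide-by-zero
    num_updates = 0
    for step in range(1, total_env_steps + 1):
        # Mirrors: while (step - buffer_init_steps) / update_steps > steps_per_update
        while (step - buffer_init_steps) / update_steps > steps_per_update:
            num_updates += 1
            update_steps += 1
    return num_updates


# With 1000 warm-up steps excluded, 2000 env steps at steps_per_update=2 gives
# roughly (2000 - 1000) / 2 ~= 500 updates; the old condition (step / update_steps)
# counted the warm-up too and would have done roughly twice as many here.
print(count_policy_updates(2000, 1000, 2.0))
```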

ml-agents/mlagents/trainers/settings.py (1)


force: bool = parser.get_default("force")
train_model: bool = parser.get_default("train_model")
inference: bool = parser.get_default("inference")
lesson: int = parser.get_default("lesson")
@attr.s(auto_attribs=True)

ml-agents/mlagents/trainers/tests/test_learn.py (7)


base_port: 4001
seed: 9870
checkpoint_settings:
lesson: 2
run_id: uselessrun
save_freq: 654321
debug: false

assert opt.behaviors == {}
assert opt.env_settings.env_path is None
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.lesson == 0
assert opt.checkpoint_settings.resume is False
assert opt.checkpoint_settings.inference is False
assert opt.checkpoint_settings.run_id == "ppo"

full_args = [
"mytrainerpath",
"--env=./myenvfile",
"--lesson=3",
"--resume",
"--inference",
"--run-id=myawesomerun",

assert opt.behaviors == {}
assert opt.env_settings.env_path == "./myenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.lesson == 3
assert opt.checkpoint_settings.run_id == "myawesomerun"
assert opt.checkpoint_settings.save_freq == 123456
assert opt.env_settings.seed == 7890

assert opt.behaviors == {}
assert opt.env_settings.env_path == "./oldenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.lesson == 2
assert opt.checkpoint_settings.run_id == "uselessrun"
assert opt.checkpoint_settings.save_freq == 654321
assert opt.env_settings.seed == 9870

full_args = [
"mytrainerpath",
"--env=./myenvfile",
"--lesson=3",
"--resume",
"--inference",
"--run-id=myawesomerun",

assert opt.behaviors == {}
assert opt.env_settings.env_path == "./myenvfile"
assert opt.parameter_randomization is None
assert opt.checkpoint_settings.lesson == 3
assert opt.checkpoint_settings.run_id == "myawesomerun"
assert opt.checkpoint_settings.save_freq == 123456
assert opt.env_settings.seed == 7890

ml-agents/mlagents/trainers/tests/test_meta_curriculum.py (23)


import pytest
from unittest.mock import patch, Mock
from unittest.mock import patch, Mock, call
from mlagents.trainers.meta_curriculum import MetaCurriculum

)
from mlagents.trainers.tests.test_curriculum import dummy_curriculum_config
from mlagents.trainers.settings import CurriculumSettings
from mlagents.trainers.training_status import StatusType
@pytest.fixture

curriculum_b.increment_lesson.assert_not_called()
def test_set_all_curriculums_to_lesson_num():
meta_curriculum.set_all_curricula_to_lesson_num(2)
@patch("mlagents.trainers.meta_curriculum.GlobalTrainingStatus")
def test_restore_curriculums(mock_trainingstatus):
# Test restore to value
mock_trainingstatus.get_parameter_state.return_value = 2
meta_curriculum.try_restore_all_curriculum()
mock_trainingstatus.get_parameter_state.assert_has_calls(
[call("Brain1", StatusType.LESSON_NUM), call("Brain2", StatusType.LESSON_NUM)],
any_order=True,
)
# Test restore to None
mock_trainingstatus.get_parameter_state.return_value = None
meta_curriculum.try_restore_all_curriculum()
assert meta_curriculum.brains_to_curricula["Brain1"].lesson_num == 0
assert meta_curriculum.brains_to_curricula["Brain2"].lesson_num == 0
def test_get_config():

ml-agents/mlagents/trainers/tests/test_nn_policy.py (21)


import pytest
import os
import unittest
import tempfile
import numpy as np
from mlagents.tf_utils import tf

from mlagents.trainers.tests import mock_brain as mb
from mlagents.trainers.settings import TrainerSettings, NetworkSettings
from mlagents.trainers.tests.test_trajectory import make_fake_trajectory
from mlagents.trainers import __version__
VECTOR_ACTION_SPACE = [2]

_compare_two_policies(policy2, policy3)
# Assert that the steps are 0.
assert policy3.get_current_step() == 0
class ModelVersionTest(unittest.TestCase):
def test_version_compare(self):
# Test write_stats
with self.assertLogs("mlagents.trainers", level="WARNING") as cm:
path1 = tempfile.mkdtemp()
trainer_params = TrainerSettings(output_path=path1)
policy = create_policy_mock(trainer_params)
policy.initialize_or_load()
policy._check_model_version(
"0.0.0"
) # This is not the right version for sure
# Assert that 1 warning has been thrown with incorrect version
assert len(cm.output) == 1
policy._check_model_version(__version__) # This should be the right version
# Assert that no additional warnings have been thrown with the correct version
assert len(cm.output) == 1
def _compare_two_policies(policy1: NNPolicy, policy2: NNPolicy) -> None:

ml-agents/mlagents/trainers/tests/test_policy.py (8)


policy_eval_out["action"], policy_eval_out["value"], policy_eval_out, [0]
)
assert result == expected
def test_convert_version_string():
result = TFPolicy._convert_version_string("200.300.100")
assert result == (200, 300, 100)
# Test dev versions
result = TFPolicy._convert_version_string("200.300.100.dev0")
assert result == (200, 300, 100)

ml-agents/mlagents/trainers/tests/test_sac.py (16)


discrete_action=False, visual_inputs=0, vec_obs_size=6
)
dummy_config.hyperparameters.steps_per_update = 20
dummy_config.hyperparameters.reward_signal_steps_per_update = 20
dummy_config.hyperparameters.buffer_init_steps = 0
trainer = SACTrainer(brain_params, 0, dummy_config, True, False, 0, "0")
policy = trainer.create_policy(brain_params.brain_name, brain_params)

trainer.advance()
with pytest.raises(AgentManagerQueue.Empty):
policy_queue.get_nowait()
# Call add_policy and check that we update the correct number of times.
# This is to emulate a load from checkpoint.
policy = trainer.create_policy(brain_params.brain_name, brain_params)
policy.get_current_step = lambda: 200
trainer.add_policy(brain_params.brain_name, policy)
trainer.optimizer.update = mock.Mock()
trainer.optimizer.update_reward_signals = mock.Mock()
trainer.optimizer.update_reward_signals.return_value = {}
trainer.optimizer.update.return_value = {}
trajectory_queue.put(trajectory)
trainer.advance()
# Make sure we did exactly 1 update
assert trainer.optimizer.update.call_count == 1
assert trainer.optimizer.update_reward_signals.call_count == 1
if __name__ == "__main__":

ml-agents/mlagents/trainers/tests/test_simple_rl.py (18)


@pytest.mark.parametrize("use_discrete", [True, False])
def test_2d_ppo(use_discrete):
env = SimpleEnvironment(
[BRAIN_NAME], use_discrete=use_discrete, action_size=2, step_size=0.5
[BRAIN_NAME], use_discrete=use_discrete, action_size=2, step_size=0.8
)
new_hyperparams = attr.evolve(
PPO_CONFIG.hyperparameters, batch_size=64, buffer_size=640
config = attr.evolve(PPO_CONFIG)
config = attr.evolve(PPO_CONFIG, hyperparameters=new_hyperparams, max_steps=10000)
_check_environment_trains(env, {BRAIN_NAME: config})

@pytest.mark.parametrize("use_discrete", [True, False])
def test_recurrent_sac(use_discrete):
env = MemoryEnvironment([BRAIN_NAME], use_discrete=use_discrete)
step_size = 0.2 if use_discrete else 1.0
env = MemoryEnvironment(
[BRAIN_NAME], use_discrete=use_discrete, step_size=step_size
)
memory=NetworkSettings.MemorySettings(memory_size=16, sequence_length=32),
memory=NetworkSettings.MemorySettings(memory_size=16, sequence_length=16),
batch_size=64,
batch_size=128,
buffer_init_steps=500,
buffer_init_steps=1000,
steps_per_update=2,
)
config = attr.evolve(

ml-agents/mlagents/trainers/trainer_controller.py (4)


from mlagents.trainers.behavior_id_utils import BehaviorIdentifiers
from mlagents.trainers.agent_processor import AgentManager
from mlagents.trainers.settings import CurriculumSettings
from mlagents.trainers.training_status import GlobalTrainingStatus, StatusType
class TrainerController(object):

if brain_name in self.trainers:
self.trainers[brain_name].stats_reporter.set_stat(
"Environment/Lesson", curr.lesson_num
)
GlobalTrainingStatus.set_parameter_state(
brain_name, StatusType.LESSON_NUM, curr.lesson_num
)
for trainer in self.trainers.values():

ml-agents/mlagents/trainers/tests/test_training_status.py (60)


import os
import unittest
import json
from enum import Enum
from mlagents.trainers.training_status import (
StatusType,
StatusMetaData,
GlobalTrainingStatus,
)
def test_globaltrainingstatus(tmpdir):
path_dir = os.path.join(tmpdir, "test.json")
GlobalTrainingStatus.set_parameter_state("Category1", StatusType.LESSON_NUM, 3)
GlobalTrainingStatus.save_state(path_dir)
with open(path_dir, "r") as fp:
test_json = json.load(fp)
assert "Category1" in test_json
assert StatusType.LESSON_NUM.value in test_json["Category1"]
assert test_json["Category1"][StatusType.LESSON_NUM.value] == 3
assert "metadata" in test_json
GlobalTrainingStatus.load_state(path_dir)
restored_val = GlobalTrainingStatus.get_parameter_state(
"Category1", StatusType.LESSON_NUM
)
assert restored_val == 3
# Test unknown categories and status types (keys)
unknown_category = GlobalTrainingStatus.get_parameter_state(
"Category3", StatusType.LESSON_NUM
)
class FakeStatusType(Enum):
NOTAREALKEY = "notarealkey"
unknown_key = GlobalTrainingStatus.get_parameter_state(
"Category1", FakeStatusType.NOTAREALKEY
)
assert unknown_category is None
assert unknown_key is None
class StatsMetaDataTest(unittest.TestCase):
def test_metadata_compare(self):
# Test write_stats
with self.assertLogs("mlagents.trainers", level="WARNING") as cm:
default_metadata = StatusMetaData()
version_statsmetadata = StatusMetaData(mlagents_version="test")
default_metadata.check_compatibility(version_statsmetadata)
tf_version_statsmetadata = StatusMetaData(tensorflow_version="test")
default_metadata.check_compatibility(tf_version_statsmetadata)
# Assert that 2 warnings have been thrown
assert len(cm.output) == 2

ml-agents/mlagents/trainers/training_status.py (115)


from typing import Dict, Any
from enum import Enum
from collections import defaultdict
import json
import attr
import cattr
from mlagents.tf_utils import tf
from mlagents_envs.logging_util import get_logger
from mlagents.trainers import __version__
from mlagents.trainers.exception import TrainerError
logger = get_logger(__name__)
STATUS_FORMAT_VERSION = "0.1.0"
class StatusType(Enum):
LESSON_NUM = "lesson_num"
STATS_METADATA = "metadata"
@attr.s(auto_attribs=True)
class StatusMetaData:
stats_format_version: str = STATUS_FORMAT_VERSION
mlagents_version: str = __version__
tensorflow_version: str = tf.__version__
def to_dict(self) -> Dict[str, str]:
return cattr.unstructure(self)
@staticmethod
def from_dict(import_dict: Dict[str, str]) -> "StatusMetaData":
return cattr.structure(import_dict, StatusMetaData)
def check_compatibility(self, other: "StatusMetaData") -> None:
"""
Check compatibility with a loaded StatsMetaData and warn the user
if versions mismatch. This is used for resuming from old checkpoints.
"""
# This should cover all stats version mismatches as well.
if self.mlagents_version != other.mlagents_version:
logger.warning(
"Checkpoint was loaded from a different version of ML-Agents. Some things may not resume properly."
)
if self.tensorflow_version != other.tensorflow_version:
logger.warning(
"Tensorflow checkpoint was saved with a different version of Tensorflow. Model may not resume properly."
)
class GlobalTrainingStatus:
"""
GlobalTrainingStatus class that contains static methods to save global training status and
load it on a resume. These are values that might be needed for the training resume that
cannot/should not be captured in a model checkpoint, such as the current curriculum lesson.
"""
saved_state: Dict[str, Dict[str, Any]] = defaultdict(lambda: {})
@staticmethod
def load_state(path: str) -> None:
"""
Load a JSON file that contains saved state.
:param path: Path to the JSON file containing the state.
"""
try:
with open(path, "r") as f:
loaded_dict = json.load(f)
# Compare the metadata
_metadata = loaded_dict[StatusType.STATS_METADATA.value]
StatusMetaData.from_dict(_metadata).check_compatibility(StatusMetaData())
# Update saved state.
GlobalTrainingStatus.saved_state.update(loaded_dict)
except FileNotFoundError:
logger.warning(
"Training status file not found. Not all functions will resume properly."
)
except KeyError:
raise TrainerError(
"Metadata not found, resuming from an incompatible version of ML-Agents."
)
@staticmethod
def save_state(path: str) -> None:
"""
Save a JSON file that contains saved state.
:param path: Path to the JSON file containing the state.
"""
GlobalTrainingStatus.saved_state[
StatusType.STATS_METADATA.value
] = StatusMetaData().to_dict()
with open(path, "w") as f:
json.dump(GlobalTrainingStatus.saved_state, f, indent=4)
@staticmethod
def set_parameter_state(category: str, key: StatusType, value: Any) -> None:
"""
Stores an arbitrary-named parameter in the global saved state.
:param category: The category (usually behavior name) of the parameter.
:param key: The parameter, e.g. lesson number.
:param value: The value.
"""
GlobalTrainingStatus.saved_state[category][key.value] = value
@staticmethod
def get_parameter_state(category: str, key: StatusType) -> Any:
"""
Loads an arbitrary-named parameter from training_status.json.
If not found, returns None.
:param category: The category (usually behavior name) of the parameter.
:param key: The statistic, e.g. lesson number.
:param value: The value.
"""
return GlobalTrainingStatus.saved_state[category].get(key.value, None)