
Remove Standalone Offline BC Training (#2969)

/develop
GitHub, 5 years ago
Current commit
1fa07edb
20 changed files with 170 additions and 500 deletions
  1. config/gail_config.yaml (6)
  2. docs/Migrating.md (2)
  3. docs/Reward-Signals.md (9)
  4. docs/Training-Imitation-Learning.md (34)
  5. docs/Training-ML-Agents.md (35)
  6. docs/Training-PPO.md (18)
  7. docs/Training-SAC.md (18)
  8. docs/images/mlagents-ImitationAndRL.png (142)
  9. ml-agents/mlagents/trainers/components/bc/module.py (6)
  10. ml-agents/mlagents/trainers/ppo/policy.py (8)
  11. ml-agents/mlagents/trainers/sac/models.py (2)
  12. ml-agents/mlagents/trainers/sac/policy.py (8)
  13. ml-agents/mlagents/trainers/tests/test_barracuda_converter.py (29)
  14. ml-agents/mlagents/trainers/tests/test_bcmodule.py (14)
  15. ml-agents/mlagents/trainers/tests/test_reward_signals.py (2)
  16. ml-agents/mlagents/trainers/tests/test_trainer_util.py (62)
  17. ml-agents/mlagents/trainers/trainer_util.py (9)
  18. docs/Training-Behavioral-Cloning.md (30)
  19. ml-agents/mlagents/trainers/tests/test_bc.py (236)

config/gail_config.yaml (6)


beta: 1.0e-2
max_steps: 5.0e5
num_epoch: 3
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000

summary_freq: 3000
num_layers: 3
hidden_units: 512
behavioral_cloning:
demo_path: ./demos/ExpertCrawlerSta.demo
strength: 0.5
steps: 5000
reward_signals:
gail:
strength: 1.0
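
To make the rename above concrete, here is a minimal sketch of reading the renamed `behavioral_cloning` section from this config in Python. It assumes the script runs from the repository root so that `config/gail_config.yaml` resolves; the loader itself is only an illustration, not part of the ml-agents API.

# Minimal sketch (not ml-agents API): load the config and inspect the
# section that this commit renames from "pretraining" to "behavioral_cloning".
import yaml

with open("config/gail_config.yaml") as f:
    config = yaml.safe_load(f)

bc_settings = config["CrawlerStaticLearning"].get("behavioral_cloning", {})
print(bc_settings.get("demo_path"))  # ./demos/ExpertCrawlerSta.demo
print(bc_settings.get("strength"))   # 0.5
print(bc_settings.get("steps"))      # 5000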

docs/Migrating.md (2)


* `reset()` on the Low-Level Python API no longer takes a `config` argument. `UnityEnvironment` no longer has a `reset_parameters` field. To modify float properties in the environment, you must use a `FloatPropertiesChannel`. For more information, refer to the [Low Level Python API documentation](Python-API.md)
* The Academy no longer has a `Training Configuration` nor `Inference Configuration` field in the inspector. To modify the configuration from the Low-Level Python API, use an `EngineConfigurationChannel`. To modify it during training, use the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate` in `mlagents-learn`.
* The Academy no longer has a `Default Reset Parameters` field in the inspector. The Academy class no longer has a `ResetParameters`. To access shared float properties with Python, use the new `FloatProperties` field on the Academy.
* Offline Behavioral Cloning has been removed. To learn from demonstrations, use the GAIL and
Behavioral Cloning features with either PPO or SAC. See [Imitation Learning](Training-Imitation-Learning.md) for more information.
### Steps to Migrate
* If you had a custom `Training Configuration` in the Academy inspector, you will need to pass your custom configuration at every training run using the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate`.

docs/Reward-Signals.md (9)


In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It
is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can
also be used independently to purely learn from demonstrations.
This approach learns a _policy_ that produces states and actions similar to the demonstrations,
requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely
from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide
the learning process.
Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.
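
As a rough illustration of mixing GAIL with an extrinsic signal, the sketch below builds a `reward_signals` section the same way this repository's test fixtures do (via `yaml.safe_load`). The `strength` and `demo_path` keys appear elsewhere in this commit; the `gamma` values and the relative weighting are placeholders, not values taken from the source.

# Illustrative only: an extrinsic reward mixed with a GAIL reward signal.
import yaml

reward_signal_config = yaml.safe_load(
    """
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
      gail:
        strength: 0.1  # keep GAIL weak when an extrinsic reward is present
        gamma: 0.99    # placeholder discount, not taken from this commit
        demo_path: ./demos/ExpertPyramid.demo
    """
)
assert "gail" in reward_signal_config["reward_signals"]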

docs/Training-Imitation-Learning.md (34)


reduce the time the agent takes to solve the environment.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
using 6 episodes of demonstrations can cut the required training steps by more than a factor of four.
See PreTraining + GAIL + Curiosity + RL below.
See Behavioral Cloning + GAIL + Curiosity + RL below.
<p align="center">
<img src="images/mlagents-ImitationAndRL.png"

The ML-Agents toolkit provides several ways to learn from demonstrations.
The ML-Agents toolkit provides two features that enable your agent to learn from demonstrations.
In most scenarios, you should combine these two features
* To train using GAIL (Generative Adversarial Imitation Learning) you can add the
* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to
reward your Agent for behaving similar to a set of demonstrations. To use GAIL, you can add the
* To help bootstrap reinforcement learning, you can enable
[pretraining](Training-PPO.md#optional-pretraining-using-demonstrations)
on the PPO trainer, in addition to using a small GAIL reward signal.
* To train an agent to exactly mimic demonstrations, you can use the
[Behavioral Cloning](Training-Behavioral-Cloning.md) trainer. Behavioral Cloning can be
used with demonstrations (in-editor), and learns very quickly. However, it usually is ineffective
on more complex environments without a large number of demonstrations.
* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions
shown in a set of demonstrations.
[The BC feature](Training-PPO.md#optional-behavioral-cloning-using-demonstrations)
can be enabled on the PPO or SAC trainer. BC tends to work best when
there are a lot of demonstrations, or in conjunction with GAIL and/or an extrinsic reward.
using pre-recorded demonstrations, you can generally enable both GAIL and Pretraining.
using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning
at low strengths in addition to having an extrinsic reward.
If you want to train purely from demonstrations, GAIL is generally the preferred approach, especially
if you have few (<10) episodes of demonstrations. An example of this is provided for the Crawler example
environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.
If you have plenty of demonstrations and/or a very simple environment, Offline Behavioral Cloning can be effective and quick. However, it cannot be combined with RL.
If you want to train purely from demonstrations, GAIL and BC _without_ an
extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler
example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.
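
To make the pure-from-demonstrations recipe concrete, here is a rough sketch of such a behavior section, written in the `yaml.safe_load` style used by this repository's tests. The section name `CrawlerStaticLearning`, the demo path, and the `strength`/`steps` values mirror `config/gail_config.yaml` in this commit; the remaining hyperparameters are placeholders.

# Illustrative sketch: training purely from demonstrations with
# behavioral_cloning plus a GAIL reward signal and no extrinsic reward.
import yaml

pure_imitation_config = yaml.safe_load(
    """
    CrawlerStaticLearning:
      trainer: ppo
      max_steps: 5.0e5  # placeholder budget
      behavioral_cloning:
        demo_path: ./demos/ExpertCrawlerSta.demo
        strength: 0.5
        steps: 5000
      reward_signals:
        gail:
          strength: 1.0
          demo_path: ./demos/ExpertCrawlerSta.demo
    """
)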
## Recording Demonstrations

They can be managed from the Editor, as well as used for training with Offline
Behavioral Cloning and GAIL.
They can be managed from the Editor, as well as used for training with BC and GAIL.
In order to record demonstrations from an agent, add the `Demonstration Recorder`
component to a GameObject in the scene which contains an `Agent` component.

docs/Training-ML-Agents.md (35)


`config/gail_config.yaml` and `config/offline_bc_config.yaml` specifies the training method,
the hyperparameters, and a few additional values to use when training with Proximal Policy
Optimization(PPO), Soft Actor-Critic(SAC), GAIL (Generative Adversarial Imitation Learning)
with PPO, and online and offline Behavioral Cloning(BC)/Imitation. These files are divided
with PPO/SAC, and Behavioral Cloning(BC)/Imitation with PPO/SAC. These files are divided
training with PPO, SAC, GAIL (with PPO), and offline BC. These files are divided into sections.
training with PPO, SAC, GAIL (with PPO), and BC. These files are divided into sections.
The **default** section defines the default values for all the available settings. You can
also add new sections to override these defaults to train specific Behaviors. Name each of these
override sections after the appropriate `Behavior Name`. Sections for the

| :------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC, BC |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | BC |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | |
| demo_path | For offline imitation learning, the file path of the recorded demonstration file | (offline)BC |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC, BC |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC, BC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC, BC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC, BC |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC |
| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC, BC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC, (online)BC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC, BC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation)
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
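
The section above explains that a `default` section supplies baseline values and that sections named after a `Behavior Name` override them. As a rough illustration (the behavior name `MyBehavior`, the shallow merge helper, and the placeholder values below are assumptions, not the ml-agents implementation):

# Illustrative only: a default section plus a per-behavior override section.
import yaml

trainer_config = yaml.safe_load(
    """
    default:
      trainer: ppo
      batch_size: 1024
      hidden_units: 128
    MyBehavior:
      hidden_units: 512
      behavioral_cloning:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 0.5
        steps: 10000
    """
)

merged = dict(trainer_config["default"])
merged.update(trainer_config["MyBehavior"])  # per-behavior values win over defaults
print(merged["hidden_units"])  # 512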
For specific advice on setting hyperparameters based on the type of training you
are conducting, see:

docs/Training-PPO.md (18)


Typical Range: `64` - `512`
## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
It is essentially equivalent to running [behavioral cloning](Training-Behavioral-Cloning.md)
in-line with PPO.
To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:
pretraining:
behavioral_cloning:
Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.
rate of PPO, and roughly corresponds to how strongly we allow the behavioral cloning
rate of PPO, and roughly corresponds to how strongly we allow BC
to influence the policy.
Typical Range: `0.1` - `0.5`

### Steps
During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.
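
To make the interaction of `strength` and `steps` concrete, the sketch below computes the BC learning rate as `policy_learning_rate * strength` (this product appears in `components/bc/module.py` later in this diff) and anneals it to zero over the configured steps. The linear schedule is an illustrative assumption; the exact annealing used by ml-agents may differ.

# Sketch of how strength and steps shape the BC learning rate.
# The linear anneal is an assumption, not the exact ml-agents schedule.
def bc_learning_rate(
    policy_learning_rate: float, strength: float, steps: int, current_step: int
) -> float:
    initial_lr = policy_learning_rate * strength
    if steps <= 0:
        return initial_lr  # steps == 0: constant imitation for the whole run
    remaining = max(0.0, 1.0 - current_step / steps)
    return initial_lr * remaining


print(bc_learning_rate(3.0e-4, 0.5, 10000, 0))      # 1.5e-4 at the start
print(bc_learning_rate(3.0e-4, 0.5, 10000, 10000))  # 0.0 once steps are exhausted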
### (Optional) Batch Size

docs/Training-SAC.md (18)


Default: `False`
## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
It is essentially equivalent to running [behavioral cloning](./Training-Behavioral-Cloning.md)
in-line with SAC.
To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:
pretraining:
behavioral_cloning:
Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.
rate of SAC, and roughly corresponds to how strongly we allow the behavioral cloning
rate of SAC, and roughly corresponds to how strongly we allow BC
to influence the policy.
Typical Range: `0.1` - `0.5`

### Steps
During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.
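
The same `behavioral_cloning` section applies to SAC, with one caveat visible in the `sac/policy.py` hunk later in this diff: `samples_per_update` is not a valid setting and only triggers a warning. A minimal sketch of that check, assuming `trainer_params` is the parsed trainer configuration and using a hypothetical logger name:

# Sketch of the SAC-specific caveat: samples_per_update is ignored with a warning.
import logging

logger = logging.getLogger("mlagents.trainers")  # assumed logger name


def warn_on_invalid_sac_bc_settings(trainer_params: dict) -> None:
    bc_params = trainer_params.get("behavioral_cloning", {})
    if "samples_per_update" in bc_params:
        logger.warning(
            "Pretraining: Samples Per Update is not a valid setting for SAC."
        )


warn_on_invalid_sac_bc_settings(
    {"behavioral_cloning": {"strength": 0.5, "steps": 5000, "samples_per_update": 64}}
)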
### (Optional) Batch Size

docs/images/mlagents-ImitationAndRL.png (142)

Image replaced (before/after). Width: 600 | Height: 371 | Size: 23 KiB

ml-agents/mlagents/trainers/components/bc/module.py (6)


samples_per_update: int = 0,
):
"""
A BC trainer that can be used inline with RL, especially for pretraining.
A BC trainer that can be used inline with RL.
:param policy: The policy of the learning model
:param policy_learning_rate: The initial Learning Rate of the policy. Used to set an appropriate learning rate
for the pretrainer.

:param demo_path: The path to the demonstration file.
:param batch_size: The batch size to use during BC training.
:param num_epoch: Number of epochs to train for during each update.
:param samples_per_update: Maximum number of samples to train on during each pretraining update.
:param samples_per_update: Maximum number of samples to train on during each BC update.
"""
self.policy = policy
self.current_lr = policy_learning_rate * strength

@staticmethod
def check_config(config_dict: Dict[str, Any]) -> None:
"""
Check the pretraining config for the required keys.
Check the behavioral_cloning config for the required keys.
:param config_dict: Pretraining section of trainer_config
"""
param_keys = ["strength", "demo_path", "steps"]
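
A rough sketch of what a required-key check over `param_keys` could look like; the key list matches the hunk above, while the exception type is a stand-in for the trainer exception the real module raises.

# Illustrative required-key check for the behavioral_cloning section.
from typing import Any, Dict


def check_config(config_dict: Dict[str, Any]) -> None:
    param_keys = ["strength", "demo_path", "steps"]
    for key in param_keys:
        if key not in config_dict:
            raise ValueError(  # stand-in exception type
                f"The required behavioral_cloning hyperparameter {key} was not defined."
            )


check_config({"strength": 0.5, "demo_path": "./demos/ExpertPyramid.demo", "steps": 10000})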

ml-agents/mlagents/trainers/ppo/policy.py (8)


with self.graph.as_default():
self.bc_module: Optional[BCModule] = None
# Create pretrainer if needed
if "pretraining" in trainer_params:
BCModule.check_config(trainer_params["pretraining"])
if "behavioral_cloning" in trainer_params:
BCModule.check_config(trainer_params["behavioral_cloning"])
default_num_epoch=trainer_params["num_epoch"],
**trainer_params["pretraining"],
default_num_epoch=3,
**trainer_params["behavioral_cloning"],
)
if load:
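
The hunk above is truncated at `if load:`. As an illustrative reconstruction of the wiring it shows (not the verbatim source), the BC module is only created when a `behavioral_cloning` section is present, and `default_num_epoch` is now fixed at 3 instead of inheriting the PPO `num_epoch`. `BCModuleStub` and the helper below are stand-ins for sketching the pattern:

# Sketch of the conditional BC module creation; BCModuleStub is a stand-in.
from typing import Any, Dict, Optional


class BCModuleStub:
    @staticmethod
    def check_config(config: Dict[str, Any]) -> None:
        assert {"strength", "demo_path", "steps"} <= config.keys()

    def __init__(self, policy, **settings):
        self.policy = policy
        self.settings = settings


def maybe_create_bc_module(policy, trainer_params: Dict[str, Any]) -> Optional[BCModuleStub]:
    if "behavioral_cloning" not in trainer_params:
        return None
    BCModuleStub.check_config(trainer_params["behavioral_cloning"])
    return BCModuleStub(
        policy,
        policy_learning_rate=trainer_params["learning_rate"],
        default_num_epoch=3,  # previously trainer_params["num_epoch"]
        **trainer_params["behavioral_cloning"],
    )


maybe_create_bc_module(
    policy=None,
    trainer_params={
        "learning_rate": 3.0e-4,
        "behavioral_cloning": {
            "strength": 0.5,
            "demo_path": "./demos/ExpertPyramid.demo",
            "steps": 10000,
        },
    },
)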

ml-agents/mlagents/trainers/sac/models.py (2)


self.dones_holder = tf.placeholder(
shape=[None], dtype=tf.float32, name="dones_holder"
)
# This is just a dummy to get pretraining to work. PPO has this but SAC doesn't.
# This is just a dummy to get BC to work. PPO has this but SAC doesn't.
# TODO: Proper input and output specs for models
self.epsilon = tf.placeholder(
shape=[None, self.act_size[0]], dtype=tf.float32, name="epsilon"

ml-agents/mlagents/trainers/sac/policy.py (8)


with self.graph.as_default():
# Create pretrainer if needed
self.bc_module: Optional[BCModule] = None
if "pretraining" in trainer_params:
BCModule.check_config(trainer_params["pretraining"])
if "behavioral_cloning" in trainer_params:
BCModule.check_config(trainer_params["behavioral_cloning"])
self.bc_module = BCModule(
self,
policy_learning_rate=trainer_params["learning_rate"],

**trainer_params["pretraining"],
**trainer_params["behavioral_cloning"],
if "samples_per_update" in trainer_params["pretraining"]:
if "samples_per_update" in trainer_params["behavioral_cloning"]:
logger.warning(
"Pretraining: Samples Per Update is not a valid setting for SAC."
)

ml-agents/mlagents/trainers/tests/test_barracuda_converter.py (29)


import os
import yaml
import pytest
from mlagents.trainers.tests.test_bc import create_bc_trainer
def test_barracuda_converter():

# cleanup
os.remove(tmpfile)
@pytest.fixture
def bc_dummy_config():
return yaml.safe_load(
"""
hidden_units: 32
learning_rate: 3.0e-4
num_layers: 1
use_recurrent: false
sequence_length: 32
memory_size: 64
batches_per_epoch: 1
batch_size: 64
summary_freq: 2000
max_steps: 4000
"""
)
@pytest.mark.parametrize("use_lstm", [False, True], ids=["nolstm", "lstm"])
@pytest.mark.parametrize("use_discrete", [True, False], ids=["disc", "cont"])
def test_bc_export(bc_dummy_config, use_lstm, use_discrete):
bc_dummy_config["use_recurrent"] = use_lstm
trainer, env = create_bc_trainer(bc_dummy_config, use_discrete)
trainer.export_model()

ml-agents/mlagents/trainers/tests/test_bcmodule.py (14)


summary_freq: 1000
use_recurrent: false
memory_size: 8
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 1.0
steps: 10000000

tau: 0.005
use_recurrent: false
vis_encode_type: simple
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 1.0
steps: 10000000

trainer_config["model_path"] = model_path
trainer_config["keep_checkpoints"] = 3
trainer_config["use_recurrent"] = use_rnn
trainer_config["pretraining"]["demo_path"] = (
trainer_config["behavioral_cloning"]["demo_path"] = (
os.path.dirname(os.path.abspath(__file__)) + "/" + demo_file
)

env, policy = create_policy_with_bc_mock(
mock_env, mock_brain, trainer_config, False, "test.demo"
)
assert policy.bc_module.num_epoch == trainer_config["num_epoch"]
assert policy.bc_module.num_epoch == 3
trainer_config["pretraining"]["num_epoch"] = 100
trainer_config["pretraining"]["batch_size"] = 10000
trainer_config["behavioral_cloning"]["num_epoch"] = 100
trainer_config["behavioral_cloning"]["batch_size"] = 10000
env, policy = create_policy_with_bc_mock(
mock_env, mock_brain, trainer_config, False, "test.demo"
)

@mock.patch("mlagents.envs.environment.UnityEnvironment")
def test_bcmodule_constant_lr_update(mock_env, trainer_config):
mock_brain = mb.create_mock_3dball_brain()
trainer_config["pretraining"]["steps"] = 0
trainer_config["behavioral_cloning"]["steps"] = 0
env, policy = create_policy_with_bc_mock(
mock_env, mock_brain, trainer_config, False, "test.demo"
)

ml-agents/mlagents/trainers/tests/test_reward_signals.py (2)


tau: 0.005
use_recurrent: false
vis_encode_type: simple
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 1.0
steps: 10000000

ml-agents/mlagents/trainers/tests/test_trainer_util.py (62)


import pytest
import yaml
import os
import io
from unittest.mock import patch

from mlagents.trainers.ppo.trainer import PPOTrainer
from mlagents.trainers.bc.offline_trainer import OfflineBCTrainer
from mlagents.envs.exception import UnityEnvironmentException

@pytest.fixture
def dummy_offline_bc_config():
return yaml.safe_load(
"""
default:
trainer: offline_bc
demo_path: """
+ os.path.dirname(os.path.abspath(__file__))
+ """/test.demo
batches_per_epoch: 16
batch_size: 32
beta: 5.0e-3
buffer_size: 512
epsilon: 0.2
gamma: 0.99
hidden_units: 128
lambd: 0.95
learning_rate: 3.0e-4
max_steps: 5.0e4
normalize: true
num_epoch: 5
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 1000
use_recurrent: false
memory_size: 8
use_curiosity: false
curiosity_strength: 0.0
curiosity_enc_size: 1
"""
)
@pytest.fixture
def dummy_offline_bc_config_with_override():
base = dummy_offline_bc_config()
def dummy_config_with_override():
base = dummy_config()
base["testbrain"] = {}
base["testbrain"]["normalize"] = False
return base

train_model = True
load_model = False
seed = 11
expected_reward_buff_cap = 1
base_config = dummy_offline_bc_config_with_override()
base_config = dummy_config_with_override()
expected_config = base_config["default"]
expected_config["summary_path"] = summaries_dir + f"/{run_id}_testbrain"
expected_config["model_path"] = model_path + "/testbrain"

BrainParametersMock.return_value.brain_name = "testbrain"
external_brains = {"testbrain": brain_params_mock}
def mock_constructor(self, brain, trainer_parameters, training, load, seed, run_id):
def mock_constructor(
self,
brain,
reward_buff_cap,
trainer_parameters,
training,
load,
seed,
run_id,
multi_gpu,
):
self.trainer_metrics = TrainerMetrics("", "")
assert reward_buff_cap == expected_reward_buff_cap
assert multi_gpu == multi_gpu
with patch.object(OfflineBCTrainer, "__init__", mock_constructor):
with patch.object(PPOTrainer, "__init__", mock_constructor):
trainer_factory = trainer_util.TrainerFactory(
trainer_config=base_config,
summaries_dir=summaries_dir,

for _, brain_parameters in external_brains.items():
trainers["testbrain"] = trainer_factory.generate(brain_parameters)
assert "testbrain" in trainers
assert isinstance(trainers["testbrain"], OfflineBCTrainer)
assert isinstance(trainers["testbrain"], PPOTrainer)
@patch("mlagents.trainers.brain.BrainParameters")

ml-agents/mlagents/trainers/trainer_util.py (9)


from mlagents.trainers.meta_curriculum import MetaCurriculum
from mlagents.envs.exception import UnityEnvironmentException
from mlagents.trainers.trainer import Trainer
from mlagents.trainers.trainer import Trainer, UnityTrainerException
from mlagents.trainers.bc.offline_trainer import OfflineBCTrainer
class TrainerFactory:

trainer: Trainer = None # type: ignore # will be set to one of these, or raise
if trainer_parameters["trainer"] == "offline_bc":
trainer = OfflineBCTrainer(
brain_parameters, trainer_parameters, train_model, load_model, seed, run_id
raise UnityTrainerException(
"The offline_bc trainer has been removed. To train with demonstrations, "
"please use a PPO or SAC trainer with the GAIL Reward Signal and/or the "
"Behavioral Cloning feature enabled."
)
elif trainer_parameters["trainer"] == "ppo":
trainer = PPOTrainer(
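
The hunk above ends mid-call at `trainer = PPOTrainer(`. As a compact sketch of the new guard (simplified, not the verbatim factory), selecting a trainer type now rejects `offline_bc` outright; only the exception message is taken from the hunk, and the exception class below is a stand-in for `UnityTrainerException`:

# Simplified sketch of the trainer-type dispatch after this change.
class UnityTrainerExceptionStub(Exception):
    """Stand-in for mlagents.trainers.trainer.UnityTrainerException."""


def select_trainer_type(trainer_parameters: dict) -> str:
    trainer_type = trainer_parameters["trainer"]
    if trainer_type == "offline_bc":
        raise UnityTrainerExceptionStub(
            "The offline_bc trainer has been removed. To train with demonstrations, "
            "please use a PPO or SAC trainer with the GAIL Reward Signal and/or the "
            "Behavioral Cloning feature enabled."
        )
    if trainer_type in ("ppo", "sac"):
        return trainer_type
    raise UnityTrainerExceptionStub(f"Unknown trainer type: {trainer_type}")


select_trainer_type({"trainer": "ppo"})  # "offline_bc" would now raise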

docs/Training-Behavioral-Cloning.md (30)


# Training with Behavioral Cloning
There are a variety of possible imitation learning algorithms that can
be used; the simplest of them is Behavioral Cloning. It works by collecting
demonstrations from a teacher and then simply using them to directly learn a
policy, in the same way supervised learning works for image classification
or other traditional machine learning tasks.
## Offline Training
With offline behavioral cloning, we can use demonstrations (`.demo` files)
generated using the `Demonstration Recorder` as the dataset used to train a behavior.
1. Choose an agent that you would like to train to imitate a set of demonstrations.
2. Record a set of demonstrations using the `Demonstration Recorder` (see [here](Training-Imitation-Learning.md)).
For illustrative purposes we will refer to this file as `AgentRecording.demo`.
3. Build the scene (make sure the Agent is not using its heuristic).
4. Open the `config/offline_bc_config.yaml` file.
5. Modify the `demo_path` parameter in the file to reference the path to the
demonstration file recorded in step 2. In our case this is:
`./UnitySDK/Assets/Demonstrations/AgentRecording.demo`
6. Launch `mlagents-learn`, providing `./config/offline_bc_config.yaml`
as the config parameter, and include the `--run-id` and `--train` as usual.
Provide your environment as the `--env` parameter if it has been compiled
as standalone, or omit to train in the editor.
7. (Optional) Observe training performance using TensorBoard.
This will use the demonstration file to train a neural network driven agent
to directly imitate the actions provided in the demonstration. The environment
will launch and be used for evaluating the agent's performance during training.

ml-agents/mlagents/trainers/tests/test_bc.py (236)


import unittest.mock as mock
import pytest
import os
import numpy as np
from mlagents.tf_utils import tf
import yaml
from mlagents.trainers.bc.models import BehavioralCloningModel
import mlagents.trainers.tests.mock_brain as mb
from mlagents.trainers.bc.policy import BCPolicy
from mlagents.trainers.bc.offline_trainer import BCTrainer
from mlagents.envs.mock_communicator import MockCommunicator
from mlagents.trainers.tests.mock_brain import make_brain_parameters
from mlagents.envs.environment import UnityEnvironment
from mlagents.trainers.brain_conversion_utils import (
step_result_to_brain_info,
group_spec_to_brain_parameters,
)
@pytest.fixture
def dummy_config():
return yaml.safe_load(
"""
hidden_units: 32
learning_rate: 3.0e-4
num_layers: 1
use_recurrent: false
sequence_length: 32
memory_size: 32
batches_per_epoch: 100 # Force code to use all possible batches
batch_size: 32
summary_freq: 2000
max_steps: 4000
"""
)
def create_bc_trainer(dummy_config, is_discrete=False, use_recurrent=False):
mock_env = mock.Mock()
if is_discrete:
mock_brain = mb.create_mock_pushblock_brain()
mock_braininfo = mb.create_mock_braininfo(
num_agents=12, num_vector_observations=70
)
else:
mock_brain = mb.create_mock_3dball_brain()
mock_braininfo = mb.create_mock_braininfo(
num_agents=12, num_vector_observations=8
)
mb.setup_mock_unityenvironment(mock_env, mock_brain, mock_braininfo)
env = mock_env()
trainer_parameters = dummy_config
trainer_parameters["summary_path"] = "tmp"
trainer_parameters["model_path"] = "tmp"
trainer_parameters["demo_path"] = (
os.path.dirname(os.path.abspath(__file__)) + "/test.demo"
)
trainer_parameters["use_recurrent"] = use_recurrent
trainer = BCTrainer(
mock_brain, trainer_parameters, training=True, load=False, seed=0, run_id=0
)
trainer.demonstration_buffer = mb.simulate_rollout(env, trainer.policy, 100)
return trainer, env
@pytest.mark.parametrize("use_recurrent", [True, False])
def test_bc_trainer_step(dummy_config, use_recurrent):
trainer, env = create_bc_trainer(dummy_config, use_recurrent=use_recurrent)
# Test get_step
assert trainer.get_step == 0
# Test update policy
trainer.update_policy()
assert len(trainer.stats["Losses/Cloning Loss"]) > 0
# Test increment step
trainer.increment_step(1)
assert trainer.step == 1
def test_bc_trainer_add_proc_experiences(dummy_config):
trainer, env = create_bc_trainer(dummy_config)
# Test add_experiences
returned_braininfo = env.step()
brain_name = "Ball3DBrain"
trainer.add_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name], {}
) # Take action outputs is not used
for agent_id in returned_braininfo[brain_name].agents:
assert trainer.evaluation_buffer[agent_id].last_brain_info is not None
assert trainer.episode_steps[agent_id] > 0
assert trainer.cumulative_rewards[agent_id] > 0
# Test process_experiences by setting done
returned_braininfo[brain_name].local_done = 12 * [True]
trainer.process_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name]
)
for agent_id in returned_braininfo[brain_name].agents:
assert trainer.episode_steps[agent_id] == 0
assert trainer.cumulative_rewards[agent_id] == 0
def test_bc_trainer_end_episode(dummy_config):
trainer, env = create_bc_trainer(dummy_config)
returned_braininfo = env.step()
brain_name = "Ball3DBrain"
trainer.add_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name], {}
) # Take action outputs is not used
trainer.process_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name]
)
# Should set everything to 0
trainer.end_episode()
for agent_id in returned_braininfo[brain_name].agents:
assert trainer.episode_steps[agent_id] == 0
assert trainer.cumulative_rewards[agent_id] == 0
@mock.patch("mlagents.envs.environment.UnityEnvironment.executable_launcher")
@mock.patch("mlagents.envs.environment.UnityEnvironment.get_communicator")
def test_bc_policy_evaluate(mock_communicator, mock_launcher, dummy_config):
tf.reset_default_graph()
mock_communicator.return_value = MockCommunicator(
discrete_action=False, visual_inputs=0
)
env = UnityEnvironment(" ")
env.reset()
brain_name = env.get_agent_groups()[0]
brain_info = step_result_to_brain_info(
env.get_step_result(brain_name), env.get_agent_group_spec(brain_name)
)
brain_params = group_spec_to_brain_parameters(
brain_name, env.get_agent_group_spec(brain_name)
)
trainer_parameters = dummy_config
model_path = brain_name
trainer_parameters["model_path"] = model_path
trainer_parameters["keep_checkpoints"] = 3
policy = BCPolicy(0, brain_params, trainer_parameters, False)
run_out = policy.evaluate(brain_info)
assert run_out["action"].shape == (3, 2)
env.close()
def test_cc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=False, visual_inputs=0)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.policy]
feed_dict = {
model.batch_size: 2,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
}
sess.run(run_list, feed_dict=feed_dict)
# env.close()
def test_dc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=True, visual_inputs=0)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.action_probs]
feed_dict = {
model.batch_size: 2,
model.dropout_rate: 1.0,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
model.action_masks: np.ones([2, 2], dtype=np.float32),
}
sess.run(run_list, feed_dict=feed_dict)
def test_visual_dc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=True, visual_inputs=2)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.action_probs]
feed_dict = {
model.batch_size: 2,
model.dropout_rate: 1.0,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
model.visual_in[0]: np.ones([2, 40, 30, 3], dtype=np.float32),
model.visual_in[1]: np.ones([2, 40, 30, 3], dtype=np.float32),
model.action_masks: np.ones([2, 2], dtype=np.float32),
}
sess.run(run_list, feed_dict=feed_dict)
def test_visual_cc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=False, visual_inputs=2)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.policy]
feed_dict = {
model.batch_size: 2,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
model.visual_in[0]: np.ones([2, 40, 30, 3], dtype=np.float32),
model.visual_in[1]: np.ones([2, 40, 30, 3], dtype=np.float32),
}
sess.run(run_list, feed_dict=feed_dict)
if __name__ == "__main__":
pytest.main()