
Remove Standalone Offline BC Training (#2969)

/develop
GitHub, 5 years ago
Current commit
1fa07edb
20 changed files with 170 additions and 500 deletions
  1. config/gail_config.yaml (6)
  2. docs/Migrating.md (2)
  3. docs/Reward-Signals.md (9)
  4. docs/Training-Imitation-Learning.md (34)
  5. docs/Training-ML-Agents.md (35)
  6. docs/Training-PPO.md (18)
  7. docs/Training-SAC.md (18)
  8. docs/images/mlagents-ImitationAndRL.png (142)
  9. ml-agents/mlagents/trainers/components/bc/module.py (6)
  10. ml-agents/mlagents/trainers/ppo/policy.py (8)
  11. ml-agents/mlagents/trainers/sac/models.py (2)
  12. ml-agents/mlagents/trainers/sac/policy.py (8)
  13. ml-agents/mlagents/trainers/tests/test_barracuda_converter.py (29)
  14. ml-agents/mlagents/trainers/tests/test_bcmodule.py (14)
  15. ml-agents/mlagents/trainers/tests/test_reward_signals.py (2)
  16. ml-agents/mlagents/trainers/tests/test_trainer_util.py (62)
  17. ml-agents/mlagents/trainers/trainer_util.py (9)
  18. docs/Training-Behavioral-Cloning.md (30)
  19. ml-agents/mlagents/trainers/tests/test_bc.py (236)

config/gail_config.yaml (6)


beta: 1.0e-2
max_steps: 5.0e5
num_epoch: 3
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000

summary_freq: 3000
num_layers: 3
hidden_units: 512
behavioral_cloning:
demo_path: ./demos/ExpertCrawlerSta.demo
strength: 0.5
steps: 5000
reward_signals:
gail:
strength: 1.0
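
To make the rename above concrete, here is a minimal sketch of reading the renamed `behavioral_cloning` section from this config in Python. It assumes the script runs from the repository root so that `config/gail_config.yaml` resolves; the loader itself is only an illustration, not part of the ml-agents API.

# Minimal sketch (not ml-agents API): load the config and inspect the
# section that this commit renames from "pretraining" to "behavioral_cloning".
import yaml

with open("config/gail_config.yaml") as f:
    config = yaml.safe_load(f)

bc_settings = config["CrawlerStaticLearning"].get("behavioral_cloning", {})
print(bc_settings.get("demo_path"))  # ./demos/ExpertCrawlerSta.demo
print(bc_settings.get("strength"))   # 0.5
print(bc_settings.get("steps"))      # 5000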

docs/Migrating.md (2)


* `reset()` on the Low-Level Python API no longer takes a `config` argument. `UnityEnvironment` no longer has a `reset_parameters` field. To modify float properties in the environment, you must use a `FloatPropertiesChannel`. For more information, refer to the [Low Level Python API documentation](Python-API.md)
* The Academy no longer has a `Training Configuration` nor `Inference Configuration` field in the inspector. To modify the configuration from the Low-Level Python API, use an `EngineConfigurationChannel`. To modify it during training, use the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate` in `mlagents-learn`.
* The Academy no longer has a `Default Reset Parameters` field in the inspector. The Academy class no longer has a `ResetParameters`. To access shared float properties with Python, use the new `FloatProperties` field on the Academy.
* Offline Behavioral Cloning has been removed. To learn from demonstrations, use the GAIL and
Behavioral Cloning features with either PPO or SAC. See [Imitation Learning](Training-Imitation-Learning.md) for more information.
### Steps to Migrate
* If you had a custom `Training Configuration` in the Academy inspector, you will need to pass your custom configuration at every training run using the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate`.

docs/Reward-Signals.md (9)


In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.
This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It
is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can
also be used independently to purely learn from demonstrations.
This approach learns a _policy_ that produces states and actions similar to the demonstrations,
requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely
from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide
the learning process.
Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.
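
As a rough illustration of mixing GAIL with an extrinsic signal, the sketch below builds a `reward_signals` section the same way this repository's test fixtures do (via `yaml.safe_load`). The `strength` and `demo_path` keys appear elsewhere in this commit; the `gamma` values and the relative weighting are placeholders, not values taken from the source.

# Illustrative only: an extrinsic reward mixed with a GAIL reward signal.
import yaml

reward_signal_config = yaml.safe_load(
    """
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
      gail:
        strength: 0.1  # keep GAIL weak when an extrinsic reward is present
        gamma: 0.99    # placeholder discount, not taken from this commit
        demo_path: ./demos/ExpertPyramid.demo
    """
)
assert "gail" in reward_signal_config["reward_signals"]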

docs/Training-Imitation-Learning.md (34)


reduce the time the agent takes to solve the environment.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
using 6 episodes of demonstrations can cut the required training steps by more than a factor of four.
See PreTraining + GAIL + Curiosity + RL below.
See Behavioral Cloning + GAIL + Curiosity + RL below.
<p align="center">
<img src="images/mlagents-ImitationAndRL.png"

The ML-Agents toolkit provides several ways to learn from demonstrations.
The ML-Agents toolkit provides two features that enable your agent to learn from demonstrations.
In most scenarios, you should combine these two features
* To train using GAIL (Generative Adversarial Imitation Learning) you can add the
* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to
reward your Agent for behaving similar to a set of demonstrations. To use GAIL, you can add the
* To help bootstrap reinforcement learning, you can enable
[pretraining](Training-PPO.md#optional-pretraining-using-demonstrations)
on the PPO trainer, in addition to using a small GAIL reward signal.
* To train an agent to exactly mimic demonstrations, you can use the
[Behavioral Cloning](Training-Behavioral-Cloning.md) trainer. Behavioral Cloning can be
used with demonstrations (in-editor), and learns very quickly. However, it usually is ineffective
on more complex environments without a large number of demonstrations.
* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions
shown in a set of demonstrations.
[The BC feature](Training-PPO.md#optional-behavioral-cloning-using-demonstrations)
can be enabled on the PPO or SAC trainer. BC tends to work best when
there are a lot of demonstrations, or in conjunction with GAIL and/or an extrinsic reward.
using pre-recorded demonstrations, you can generally enable both GAIL and Pretraining.
using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning
at low strengths in addition to having an extrinsic reward.
If you want to train purely from demonstrations, GAIL is generally the preferred approach, especially
if you have few (<10) episodes of demonstrations. An example of this is provided for the Crawler example
environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.
If you have plenty of demonstrations and/or a very simple environment, Offline Behavioral Cloning can be effective and quick. However, it cannot be combined with RL.
If you want to train purely from demonstrations, GAIL and BC _without_ an
extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler
example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.
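
To make the pure-from-demonstrations recipe concrete, here is a rough sketch of such a behavior section, written in the `yaml.safe_load` style used by this repository's tests. The section name `CrawlerStaticLearning`, the demo path, and the `strength`/`steps` values mirror `config/gail_config.yaml` in this commit; the remaining hyperparameters are placeholders.

# Illustrative sketch: training purely from demonstrations with
# behavioral_cloning plus a GAIL reward signal and no extrinsic reward.
import yaml

pure_imitation_config = yaml.safe_load(
    """
    CrawlerStaticLearning:
      trainer: ppo
      max_steps: 5.0e5  # placeholder budget
      behavioral_cloning:
        demo_path: ./demos/ExpertCrawlerSta.demo
        strength: 0.5
        steps: 5000
      reward_signals:
        gail:
          strength: 1.0
          demo_path: ./demos/ExpertCrawlerSta.demo
    """
)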
## Recording Demonstrations

They can be managed from the Editor, as well as used for training with Offline
Behavioral Cloning and GAIL.
They can be managed from the Editor, as well as used for training with BC and GAIL.
In order to record demonstrations from an agent, add the `Demonstration Recorder`
component to a GameObject in the scene which contains an `Agent` component.

docs/Training-ML-Agents.md (35)


`config/gail_config.yaml` and `config/offline_bc_config.yaml` specifies the training method,
the hyperparameters, and a few additional values to use when training with Proximal Policy
Optimization(PPO), Soft Actor-Critic(SAC), GAIL (Generative Adversarial Imitation Learning)
with PPO, and online and offline Behavioral Cloning(BC)/Imitation. These files are divided
with PPO/SAC, and Behavioral Cloning(BC)/Imitation with PPO/SAC. These files are divided
training with PPO, SAC, GAIL (with PPO), and offline BC. These files are divided into sections.
training with PPO, SAC, GAIL (with PPO), and BC. These files are divided into sections.
The **default** section defines the default values for all the available settings. You can
also add new sections to override these defaults to train specific Behaviors. Name each of these
override sections after the appropriate `Behavior Name`. Sections for the

| :------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC, BC |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | BC |
| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC |
| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | |
| demo_path | For offline imitation learning, the file path of the recorded demonstration file | (offline)BC |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC, BC |
| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC, BC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC, BC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| learning_rate | The initial learning rate for gradient descent. | PPO, SAC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC |
| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC, BC |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC |
| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC, BC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC, (online)BC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC, BC |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation)
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
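
The section above explains that a `default` section supplies baseline values and that sections named after a `Behavior Name` override them. As a rough illustration (the behavior name `MyBehavior`, the shallow merge helper, and the placeholder values below are assumptions, not the ml-agents implementation):

# Illustrative only: a default section plus a per-behavior override section.
import yaml

trainer_config = yaml.safe_load(
    """
    default:
      trainer: ppo
      batch_size: 1024
      hidden_units: 128
    MyBehavior:
      hidden_units: 512
      behavioral_cloning:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 0.5
        steps: 10000
    """
)

merged = dict(trainer_config["default"])
merged.update(trainer_config["MyBehavior"])  # per-behavior values win over defaults
print(merged["hidden_units"])  # 512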
For specific advice on setting hyperparameters based on the type of training you
are conducting, see:

docs/Training-PPO.md (18)


Typical Range: `64` - `512`
## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
It is essentially equivalent to running [behavioral cloning](Training-Behavioral-Cloning.md)
in-line with PPO.
To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:
pretraining:
behavioral_cloning:
Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.
rate of PPO, and roughly corresponds to how strongly we allow the behavioral cloning
rate of PPO, and roughly corresponds to how strongly we allow BC
to influence the policy.
Typical Range: `0.1` - `0.5`

### Steps
During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.
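
To make the interaction of `strength` and `steps` concrete, the sketch below computes the BC learning rate as `policy_learning_rate * strength` (this product appears in `components/bc/module.py` later in this diff) and anneals it to zero over the configured steps. The linear schedule is an illustrative assumption; the exact annealing used by ml-agents may differ.

# Sketch of how strength and steps shape the BC learning rate.
# The linear anneal is an assumption, not the exact ml-agents schedule.
def bc_learning_rate(
    policy_learning_rate: float, strength: float, steps: int, current_step: int
) -> float:
    initial_lr = policy_learning_rate * strength
    if steps <= 0:
        return initial_lr  # steps == 0: constant imitation for the whole run
    remaining = max(0.0, 1.0 - current_step / steps)
    return initial_lr * remaining


print(bc_learning_rate(3.0e-4, 0.5, 10000, 0))      # 1.5e-4 at the start
print(bc_learning_rate(3.0e-4, 0.5, 10000, 10000))  # 0.0 once steps are exhausted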
### (Optional) Batch Size

docs/Training-SAC.md (18)


Default: `False`
## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
It is essentially equivalent to running [behavioral cloning](./Training-Behavioral-Cloning.md)
in-line with SAC.
To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:
pretraining:
behavioral_cloning:
Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.
rate of SAC, and roughly corresponds to how strongly we allow the behavioral cloning
rate of SAC, and roughly corresponds to how strongly we allow BC
to influence the policy.
Typical Range: `0.1` - `0.5`

### Steps
During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.
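
The same `behavioral_cloning` section applies to SAC, with one caveat visible in the `sac/policy.py` hunk later in this diff: `samples_per_update` is not a valid setting and only triggers a warning. A minimal sketch of that check, assuming `trainer_params` is the parsed trainer configuration and using a hypothetical logger name:

# Sketch of the SAC-specific caveat: samples_per_update is ignored with a warning.
import logging

logger = logging.getLogger("mlagents.trainers")  # assumed logger name


def warn_on_invalid_sac_bc_settings(trainer_params: dict) -> None:
    bc_params = trainer_params.get("behavioral_cloning", {})
    if "samples_per_update" in bc_params:
        logger.warning(
            "Pretraining: Samples Per Update is not a valid setting for SAC."
        )


warn_on_invalid_sac_bc_settings(
    {"behavioral_cloning": {"strength": 0.5, "steps": 5000, "samples_per_update": 64}}
)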
### (Optional) Batch Size

docs/images/mlagents-ImitationAndRL.png (142)

Image replaced (before/after). Width: 600 | Height: 371 | Size: 23 KiB

ml-agents/mlagents/trainers/components/bc/module.py (6)


samples_per_update: int = 0,
):
"""
A BC trainer that can be used inline with RL, especially for pretraining.
A BC trainer that can be used inline with RL.
:param policy: The policy of the learning model
:param policy_learning_rate: The initial Learning Rate of the policy. Used to set an appropriate learning rate
for the pretrainer.

:param demo_path: The path to the demonstration file.
:param batch_size: The batch size to use during BC training.
:param num_epoch: Number of epochs to train for during each update.
:param samples_per_update: Maximum number of samples to train on during each pretraining update.
:param samples_per_update: Maximum number of samples to train on during each BC update.
"""
self.policy = policy
self.current_lr = policy_learning_rate * strength

@staticmethod
def check_config(config_dict: Dict[str, Any]) -> None:
"""
Check the pretraining config for the required keys.
Check the behavioral_cloning config for the required keys.
:param config_dict: Pretraining section of trainer_config
"""
param_keys = ["strength", "demo_path", "steps"]
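
A rough sketch of what a required-key check over `param_keys` could look like; the key list matches the hunk above, while the exception type is a stand-in for the trainer exception the real module raises.

# Illustrative required-key check for the behavioral_cloning section.
from typing import Any, Dict


def check_config(config_dict: Dict[str, Any]) -> None:
    param_keys = ["strength", "demo_path", "steps"]
    for key in param_keys:
        if key not in config_dict:
            raise ValueError(  # stand-in exception type
                f"The required behavioral_cloning hyperparameter {key} was not defined."
            )


check_config({"strength": 0.5, "demo_path": "./demos/ExpertPyramid.demo", "steps": 10000})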

ml-agents/mlagents/trainers/ppo/policy.py (8)


with self.graph.as_default():
self.bc_module: Optional[BCModule] = None
# Create pretrainer if needed
if "pretraining" in trainer_params:
BCModule.check_config(trainer_params["pretraining"])
if "behavioral_cloning" in trainer_params:
BCModule.check_config(trainer_params["behavioral_cloning"])
default_num_epoch=trainer_params["num_epoch"],
**trainer_params["pretraining"],
default_num_epoch=3,
**trainer_params["behavioral_cloning"],
)
if load:
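
The hunk above is truncated at `if load:`. As an illustrative reconstruction of the wiring it shows (not the verbatim source), the BC module is only created when a `behavioral_cloning` section is present, and `default_num_epoch` is now fixed at 3 instead of inheriting the PPO `num_epoch`. `BCModuleStub` and the helper below are stand-ins for sketching the pattern:

# Sketch of the conditional BC module creation; BCModuleStub is a stand-in.
from typing import Any, Dict, Optional


class BCModuleStub:
    @staticmethod
    def check_config(config: Dict[str, Any]) -> None:
        assert {"strength", "demo_path", "steps"} <= config.keys()

    def __init__(self, policy, **settings):
        self.policy = policy
        self.settings = settings


def maybe_create_bc_module(policy, trainer_params: Dict[str, Any]) -> Optional[BCModuleStub]:
    if "behavioral_cloning" not in trainer_params:
        return None
    BCModuleStub.check_config(trainer_params["behavioral_cloning"])
    return BCModuleStub(
        policy,
        policy_learning_rate=trainer_params["learning_rate"],
        default_num_epoch=3,  # previously trainer_params["num_epoch"]
        **trainer_params["behavioral_cloning"],
    )


maybe_create_bc_module(
    policy=None,
    trainer_params={
        "learning_rate": 3.0e-4,
        "behavioral_cloning": {
            "strength": 0.5,
            "demo_path": "./demos/ExpertPyramid.demo",
            "steps": 10000,
        },
    },
)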

ml-agents/mlagents/trainers/sac/models.py (2)


self.dones_holder = tf.placeholder(
shape=[None], dtype=tf.float32, name="dones_holder"
)
# This is just a dummy to get pretraining to work. PPO has this but SAC doesn't.
# This is just a dummy to get BC to work. PPO has this but SAC doesn't.
# TODO: Proper input and output specs for models
self.epsilon = tf.placeholder(
shape=[None, self.act_size[0]], dtype=tf.float32, name="epsilon"

ml-agents/mlagents/trainers/sac/policy.py (8)


with self.graph.as_default():
# Create pretrainer if needed
self.bc_module: Optional[BCModule] = None
if "pretraining" in trainer_params:
BCModule.check_config(trainer_params["pretraining"])
if "behavioral_cloning" in trainer_params:
BCModule.check_config(trainer_params["behavioral_cloning"])
self.bc_module = BCModule(
self,
policy_learning_rate=trainer_params["learning_rate"],

**trainer_params["pretraining"],
**trainer_params["behavioral_cloning"],
if "samples_per_update" in trainer_params["pretraining"]:
if "samples_per_update" in trainer_params["behavioral_cloning"]:
logger.warning(
"Pretraining: Samples Per Update is not a valid setting for SAC."
)

ml-agents/mlagents/trainers/tests/test_barracuda_converter.py (29)


import os
import yaml
import pytest
from mlagents.trainers.tests.test_bc import create_bc_trainer
def test_barracuda_converter():

# cleanup
os.remove(tmpfile)
@pytest.fixture
def bc_dummy_config():
return yaml.safe_load(
"""
hidden_units: 32
learning_rate: 3.0e-4
num_layers: 1
use_recurrent: false
sequence_length: 32
memory_size: 64
batches_per_epoch: 1
batch_size: 64
summary_freq: 2000
max_steps: 4000
"""
)
@pytest.mark.parametrize("use_lstm", [False, True], ids=["nolstm", "lstm"])
@pytest.mark.parametrize("use_discrete", [True, False], ids=["disc", "cont"])
def test_bc_export(bc_dummy_config, use_lstm, use_discrete):
bc_dummy_config["use_recurrent"] = use_lstm
trainer, env = create_bc_trainer(bc_dummy_config, use_discrete)
trainer.export_model()

ml-agents/mlagents/trainers/tests/test_bcmodule.py (14)


summary_freq: 1000
use_recurrent: false
memory_size: 8
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 1.0
steps: 10000000

tau: 0.005
use_recurrent: false
vis_encode_type: simple
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 1.0
steps: 10000000

trainer_config["model_path"] = model_path
trainer_config["keep_checkpoints"] = 3
trainer_config["use_recurrent"] = use_rnn
trainer_config["pretraining"]["demo_path"] = (
trainer_config["behavioral_cloning"]["demo_path"] = (
os.path.dirname(os.path.abspath(__file__)) + "/" + demo_file
)

env, policy = create_policy_with_bc_mock(
mock_env, mock_brain, trainer_config, False, "test.demo"
)
assert policy.bc_module.num_epoch == trainer_config["num_epoch"]
assert policy.bc_module.num_epoch == 3
trainer_config["pretraining"]["num_epoch"] = 100
trainer_config["pretraining"]["batch_size"] = 10000
trainer_config["behavioral_cloning"]["num_epoch"] = 100
trainer_config["behavioral_cloning"]["batch_size"] = 10000
env, policy = create_policy_with_bc_mock(
mock_env, mock_brain, trainer_config, False, "test.demo"
)

@mock.patch("mlagents.envs.environment.UnityEnvironment")
def test_bcmodule_constant_lr_update(mock_env, trainer_config):
mock_brain = mb.create_mock_3dball_brain()
trainer_config["pretraining"]["steps"] = 0
trainer_config["behavioral_cloning"]["steps"] = 0
env, policy = create_policy_with_bc_mock(
mock_env, mock_brain, trainer_config, False, "test.demo"
)

ml-agents/mlagents/trainers/tests/test_reward_signals.py (2)


tau: 0.005
use_recurrent: false
vis_encode_type: simple
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 1.0
steps: 10000000

ml-agents/mlagents/trainers/tests/test_trainer_util.py (62)


import pytest
import yaml
import os
import io
from unittest.mock import patch

from mlagents.trainers.ppo.trainer import PPOTrainer
from mlagents.trainers.bc.offline_trainer import OfflineBCTrainer
from mlagents.envs.exception import UnityEnvironmentException

@pytest.fixture
def dummy_offline_bc_config():
return yaml.safe_load(
"""
default:
trainer: offline_bc
demo_path: """
+ os.path.dirname(os.path.abspath(__file__))
+ """/test.demo
batches_per_epoch: 16
batch_size: 32
beta: 5.0e-3
buffer_size: 512
epsilon: 0.2
gamma: 0.99
hidden_units: 128
lambd: 0.95
learning_rate: 3.0e-4
max_steps: 5.0e4
normalize: true
num_epoch: 5
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 1000
use_recurrent: false
memory_size: 8
use_curiosity: false
curiosity_strength: 0.0
curiosity_enc_size: 1
"""
)
@pytest.fixture
def dummy_offline_bc_config_with_override():
base = dummy_offline_bc_config()
def dummy_config_with_override():
base = dummy_config()
base["testbrain"] = {}
base["testbrain"]["normalize"] = False
return base

train_model = True
load_model = False
seed = 11
expected_reward_buff_cap = 1
base_config = dummy_offline_bc_config_with_override()
base_config = dummy_config_with_override()
expected_config = base_config["default"]
expected_config["summary_path"] = summaries_dir + f"/{run_id}_testbrain"
expected_config["model_path"] = model_path + "/testbrain"

BrainParametersMock.return_value.brain_name = "testbrain"
external_brains = {"testbrain": brain_params_mock}
def mock_constructor(self, brain, trainer_parameters, training, load, seed, run_id):
def mock_constructor(
self,
brain,
reward_buff_cap,
trainer_parameters,
training,
load,
seed,
run_id,
multi_gpu,
):
self.trainer_metrics = TrainerMetrics("", "")
assert reward_buff_cap == expected_reward_buff_cap
assert multi_gpu == multi_gpu
with patch.object(OfflineBCTrainer, "__init__", mock_constructor):
with patch.object(PPOTrainer, "__init__", mock_constructor):
trainer_factory = trainer_util.TrainerFactory(
trainer_config=base_config,
summaries_dir=summaries_dir,

for _, brain_parameters in external_brains.items():
trainers["testbrain"] = trainer_factory.generate(brain_parameters)
assert "testbrain" in trainers
assert isinstance(trainers["testbrain"], OfflineBCTrainer)
assert isinstance(trainers["testbrain"], PPOTrainer)
@patch("mlagents.trainers.brain.BrainParameters")

ml-agents/mlagents/trainers/trainer_util.py (9)


from mlagents.trainers.meta_curriculum import MetaCurriculum
from mlagents.envs.exception import UnityEnvironmentException
from mlagents.trainers.trainer import Trainer
from mlagents.trainers.trainer import Trainer, UnityTrainerException
from mlagents.trainers.bc.offline_trainer import OfflineBCTrainer
class TrainerFactory:

trainer: Trainer = None # type: ignore # will be set to one of these, or raise
if trainer_parameters["trainer"] == "offline_bc":
trainer = OfflineBCTrainer(
brain_parameters, trainer_parameters, train_model, load_model, seed, run_id
raise UnityTrainerException(
"The offline_bc trainer has been removed. To train with demonstrations, "
"please use a PPO or SAC trainer with the GAIL Reward Signal and/or the "
"Behavioral Cloning feature enabled."
)
elif trainer_parameters["trainer"] == "ppo":
trainer = PPOTrainer(
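
The hunk above ends mid-call at `trainer = PPOTrainer(`. As a compact sketch of the new guard (simplified, not the verbatim factory), selecting a trainer type now rejects `offline_bc` outright; only the exception message is taken from the hunk, and the exception class below is a stand-in for `UnityTrainerException`:

# Simplified sketch of the trainer-type dispatch after this change.
class UnityTrainerExceptionStub(Exception):
    """Stand-in for mlagents.trainers.trainer.UnityTrainerException."""


def select_trainer_type(trainer_parameters: dict) -> str:
    trainer_type = trainer_parameters["trainer"]
    if trainer_type == "offline_bc":
        raise UnityTrainerExceptionStub(
            "The offline_bc trainer has been removed. To train with demonstrations, "
            "please use a PPO or SAC trainer with the GAIL Reward Signal and/or the "
            "Behavioral Cloning feature enabled."
        )
    if trainer_type in ("ppo", "sac"):
        return trainer_type
    raise UnityTrainerExceptionStub(f"Unknown trainer type: {trainer_type}")


select_trainer_type({"trainer": "ppo"})  # "offline_bc" would now raise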

docs/Training-Behavioral-Cloning.md (30)


# Training with Behavioral Cloning
There are a variety of possible imitation learning algorithms that can
be used; the simplest of them is Behavioral Cloning. It works by collecting
demonstrations from a teacher and then simply using them to directly learn a
policy, in the same way supervised learning works for image classification
or other traditional machine learning tasks.
## Offline Training
With offline behavioral cloning, we can use demonstrations (`.demo` files)
generated using the `Demonstration Recorder` as the dataset used to train a behavior.
1. Choose an agent that you would like to train to imitate a set of demonstrations.
2. Record a set of demonstrations using the `Demonstration Recorder` (see [here](Training-Imitation-Learning.md)).
For illustrative purposes we will refer to this file as `AgentRecording.demo`.
3. Build the scene (make sure the Agent is not using its heuristic).
4. Open the `config/offline_bc_config.yaml` file.
5. Modify the `demo_path` parameter in the file to reference the path to the
demonstration file recorded in step 2. In our case this is:
`./UnitySDK/Assets/Demonstrations/AgentRecording.demo`
6. Launch `mlagents-learn`, providing `./config/offline_bc_config.yaml`
as the config parameter, and include the `--run-id` and `--train` as usual.
Provide your environment as the `--env` parameter if it has been compiled
as standalone, or omit to train in the editor.
7. (Optional) Observe training performance using TensorBoard.
This will use the demonstration file to train a neural network driven agent
to directly imitate the actions provided in the demonstration. The environment
will launch and be used for evaluating the agent's performance during training.

ml-agents/mlagents/trainers/tests/test_bc.py (236)


import unittest.mock as mock
import pytest
import os
import numpy as np
from mlagents.tf_utils import tf
import yaml
from mlagents.trainers.bc.models import BehavioralCloningModel
import mlagents.trainers.tests.mock_brain as mb
from mlagents.trainers.bc.policy import BCPolicy
from mlagents.trainers.bc.offline_trainer import BCTrainer
from mlagents.envs.mock_communicator import MockCommunicator
from mlagents.trainers.tests.mock_brain import make_brain_parameters
from mlagents.envs.environment import UnityEnvironment
from mlagents.trainers.brain_conversion_utils import (
step_result_to_brain_info,
group_spec_to_brain_parameters,
)
@pytest.fixture
def dummy_config():
return yaml.safe_load(
"""
hidden_units: 32
learning_rate: 3.0e-4
num_layers: 1
use_recurrent: false
sequence_length: 32
memory_size: 32
batches_per_epoch: 100 # Force code to use all possible batches
batch_size: 32
summary_freq: 2000
max_steps: 4000
"""
)
def create_bc_trainer(dummy_config, is_discrete=False, use_recurrent=False):
mock_env = mock.Mock()
if is_discrete:
mock_brain = mb.create_mock_pushblock_brain()
mock_braininfo = mb.create_mock_braininfo(
num_agents=12, num_vector_observations=70
)
else:
mock_brain = mb.create_mock_3dball_brain()
mock_braininfo = mb.create_mock_braininfo(
num_agents=12, num_vector_observations=8
)
mb.setup_mock_unityenvironment(mock_env, mock_brain, mock_braininfo)
env = mock_env()
trainer_parameters = dummy_config
trainer_parameters["summary_path"] = "tmp"
trainer_parameters["model_path"] = "tmp"
trainer_parameters["demo_path"] = (
os.path.dirname(os.path.abspath(__file__)) + "/test.demo"
)
trainer_parameters["use_recurrent"] = use_recurrent
trainer = BCTrainer(
mock_brain, trainer_parameters, training=True, load=False, seed=0, run_id=0
)
trainer.demonstration_buffer = mb.simulate_rollout(env, trainer.policy, 100)
return trainer, env
@pytest.mark.parametrize("use_recurrent", [True, False])
def test_bc_trainer_step(dummy_config, use_recurrent):
trainer, env = create_bc_trainer(dummy_config, use_recurrent=use_recurrent)
# Test get_step
assert trainer.get_step == 0
# Test update policy
trainer.update_policy()
assert len(trainer.stats["Losses/Cloning Loss"]) > 0
# Test increment step
trainer.increment_step(1)
assert trainer.step == 1
def test_bc_trainer_add_proc_experiences(dummy_config):
trainer, env = create_bc_trainer(dummy_config)
# Test add_experiences
returned_braininfo = env.step()
brain_name = "Ball3DBrain"
trainer.add_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name], {}
) # Take action outputs is not used
for agent_id in returned_braininfo[brain_name].agents:
assert trainer.evaluation_buffer[agent_id].last_brain_info is not None
assert trainer.episode_steps[agent_id] > 0
assert trainer.cumulative_rewards[agent_id] > 0
# Test process_experiences by setting done
returned_braininfo[brain_name].local_done = 12 * [True]
trainer.process_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name]
)
for agent_id in returned_braininfo[brain_name].agents:
assert trainer.episode_steps[agent_id] == 0
assert trainer.cumulative_rewards[agent_id] == 0
def test_bc_trainer_end_episode(dummy_config):
trainer, env = create_bc_trainer(dummy_config)
returned_braininfo = env.step()
brain_name = "Ball3DBrain"
trainer.add_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name], {}
) # Take action outputs is not used
trainer.process_experiences(
returned_braininfo[brain_name], returned_braininfo[brain_name]
)
# Should set everything to 0
trainer.end_episode()
for agent_id in returned_braininfo[brain_name].agents:
assert trainer.episode_steps[agent_id] == 0
assert trainer.cumulative_rewards[agent_id] == 0
@mock.patch("mlagents.envs.environment.UnityEnvironment.executable_launcher")
@mock.patch("mlagents.envs.environment.UnityEnvironment.get_communicator")
def test_bc_policy_evaluate(mock_communicator, mock_launcher, dummy_config):
tf.reset_default_graph()
mock_communicator.return_value = MockCommunicator(
discrete_action=False, visual_inputs=0
)
env = UnityEnvironment(" ")
env.reset()
brain_name = env.get_agent_groups()[0]
brain_info = step_result_to_brain_info(
env.get_step_result(brain_name), env.get_agent_group_spec(brain_name)
)
brain_params = group_spec_to_brain_parameters(
brain_name, env.get_agent_group_spec(brain_name)
)
trainer_parameters = dummy_config
model_path = brain_name
trainer_parameters["model_path"] = model_path
trainer_parameters["keep_checkpoints"] = 3
policy = BCPolicy(0, brain_params, trainer_parameters, False)
run_out = policy.evaluate(brain_info)
assert run_out["action"].shape == (3, 2)
env.close()
def test_cc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=False, visual_inputs=0)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.policy]
feed_dict = {
model.batch_size: 2,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
}
sess.run(run_list, feed_dict=feed_dict)
# env.close()
def test_dc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=True, visual_inputs=0)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.action_probs]
feed_dict = {
model.batch_size: 2,
model.dropout_rate: 1.0,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
model.action_masks: np.ones([2, 2], dtype=np.float32),
}
sess.run(run_list, feed_dict=feed_dict)
def test_visual_dc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=True, visual_inputs=2)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.action_probs]
feed_dict = {
model.batch_size: 2,
model.dropout_rate: 1.0,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
model.visual_in[0]: np.ones([2, 40, 30, 3], dtype=np.float32),
model.visual_in[1]: np.ones([2, 40, 30, 3], dtype=np.float32),
model.action_masks: np.ones([2, 2], dtype=np.float32),
}
sess.run(run_list, feed_dict=feed_dict)
def test_visual_cc_bc_model():
tf.reset_default_graph()
with tf.Session() as sess:
with tf.variable_scope("FakeGraphScope"):
model = BehavioralCloningModel(
make_brain_parameters(discrete_action=False, visual_inputs=2)
)
init = tf.global_variables_initializer()
sess.run(init)
run_list = [model.sample_action, model.policy]
feed_dict = {
model.batch_size: 2,
model.sequence_length: 1,
model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]),
model.visual_in[0]: np.ones([2, 40, 30, 3], dtype=np.float32),
model.visual_in[1]: np.ones([2, 40, 30, 3], dtype=np.float32),
}
sess.run(run_list, feed_dict=feed_dict)
if __name__ == "__main__":
pytest.main()