GAIL and Pretraining (#2118)
Based on the new reward signals architecture, add BC pretrainer and GAIL for PPO. Main changes:

- A new GAILRewardSignal and GAILModel for GAIL/VAIL
- A BCModule component (not a reward signal) to do pretraining during RL
- Documentation for both of these
- Change to Demo Loader that lets you load multiple demo files in a folder
- Example Demo files for all of our tested sample environments (for future regression testing)

Branch: /develop-generalizationTraining-TrainerController
GitHub · 5 years ago
Current commit: 9c50abcf
44 files changed, with 15,563 additions and 155 deletions
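For context, the sketch below shows what a PPO trainer-config entry enabling both new components might look like, written with `yaml.safe_load` in the same style as the new test fixtures. The `pretraining` keys mirror the dummy config in `test_bcmodule.py`; the `gail` keys and all numeric values are illustrative assumptions, not a verbatim excerpt from the shipped configs.

```python
import yaml

# Sketch only: enabling both the BC pretrainer and the GAIL reward signal for PPO.
# `pretraining` drives the BCModule; `gail` under `reward_signals` drives the
# GAILRewardSignal. The gail keys/values below are assumptions for illustration.
config = yaml.safe_load(
    """
    trainer: ppo
    pretraining:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 1.0
        steps: 10000000
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        gail:
            strength: 0.01
            gamma: 0.99
            demo_path: ./demos/ExpertPyramid.demo
    """
)
print(sorted(config["reward_signals"]))  # ['extrinsic', 'gail']
```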
Changed files (number of changed lines in parentheses):

- docs/Training-Imitation-Learning.md (125)
- docs/Training-PPO.md (70)
- docs/Training-RewardSignals.md (99)
- ml-agents/mlagents/trainers/components/reward_signals/curiosity/signal.py (7)
- ml-agents/mlagents/trainers/components/reward_signals/reward_signal_factory.py (2)
- ml-agents/mlagents/trainers/demo_loader.py (37)
- ml-agents/mlagents/trainers/ppo/policy.py (14)
- ml-agents/mlagents/trainers/ppo/trainer.py (4)
- ml-agents/mlagents/trainers/tests/mock_brain.py (49)
- ml-agents/mlagents/trainers/tests/test_demo_loader.py (13)
- ml-agents/mlagents/trainers/tests/test_reward_signals.py (154)
- docs/Training-BehavioralCloning.md (92)
- docs/images/mlagents-ImitationAndRL.png (80)
- ml-agents/mlagents/trainers/tests/test_bcmodule.py (158)
- ml-agents/mlagents/trainers/tests/testdcvis.demo (1001)
- demos/Expert3DBall.demo (442)
- demos/Expert3DBallHard.demo (1001)
- demos/ExpertBanana.demo (1001)
- demos/ExpertBasic.demo (171)
- demos/ExpertBouncer.demo (198)
- demos/ExpertCrawlerSta.demo (1001)
- demos/ExpertGrid.demo (1001)
- demos/ExpertHallway.demo (1001)
- demos/ExpertPush.demo (1001)
- demos/ExpertPyramid.demo (1001)
- demos/ExpertReacher.demo (1001)
- demos/ExpertSoccerGoal.demo (1001)
- demos/ExpertSoccerStri.demo (1001)
- demos/ExpertTennis.demo (1001)
- demos/ExpertWalker.demo (1001)
- ml-agents/mlagents/trainers/components/bc/__init__.py (1)
- ml-agents/mlagents/trainers/components/bc/model.py (101)
- ml-agents/mlagents/trainers/components/bc/module.py (172)
- ml-agents/mlagents/trainers/components/reward_signals/gail/__init__.py (1)
- ml-agents/mlagents/trainers/components/reward_signals/gail/model.py (265)
- ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (270)
- ml-agents/mlagents/trainers/tests/test_demo_dir/test.demo (60)
- ml-agents/mlagents/trainers/tests/test_demo_dir/test2.demo (60)
- ml-agents/mlagents/trainers/tests/test_demo_dir/test3.demo (60)
docs/Training-BehavioralCloning.md

# Training with Behavioral Cloning

There are a variety of possible imitation learning algorithms that can be
used; the simplest of them is Behavioral Cloning. It works by collecting
demonstrations from a teacher, and then simply using them to directly learn a
policy, in the same way that supervised learning works for image
classification or other traditional machine learning tasks.
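To make the supervised-learning analogy concrete, below is a minimal, self-contained sketch of a behavioral-cloning update on a toy linear policy in NumPy. It is not the BCModule implementation from this PR, just the core idea: regress the student's actions onto the teacher's recorded actions.

```python
import numpy as np

# Toy behavioral cloning: fit a linear policy to (observation, action) pairs
# recorded by a teacher, exactly like ordinary supervised regression.
rng = np.random.default_rng(0)
obs = rng.normal(size=(256, 8))                   # demonstration observations
teacher_actions = obs @ rng.normal(size=(8, 2))   # demonstrated actions

W = np.zeros((8, 2))                              # student policy parameters
lr = 0.1
for _ in range(200):
    pred = obs @ W                                # student actions for the same observations
    grad = obs.T @ (pred - teacher_actions) / len(obs)
    W -= lr * grad                                # minimize mean squared imitation error

print(float(np.mean((obs @ W - teacher_actions) ** 2)))  # loss approaches 0
```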

## Offline Training

With offline behavioral cloning, we can use demonstrations (`.demo` files)
generated using the `Demonstration Recorder` as the dataset used to train a behavior.

1. Choose an agent you would like to have learn to imitate a set of demonstrations.
2. Record a set of demonstrations using the `Demonstration Recorder` (see [here](Training-Imitation-Learning.md)).
   For illustrative purposes we will refer to this file as `AgentRecording.demo`.
3. Build the scene, assigning the agent a Learning Brain, and set the Brain to
   Control in the Broadcast Hub. For more information on Brains, see
   [here](Learning-Environment-Design-Brains.md).
4. Open the `config/offline_bc_config.yaml` file.
5. Modify the `demo_path` parameter in the file to reference the path to the
   demonstration file recorded in step 2. In our case this is:
   `./UnitySDK/Assets/Demonstrations/AgentRecording.demo` (a sketch of such a
   config entry appears at the end of this section).
6. Launch `mlagents-learn`, providing `./config/offline_bc_config.yaml`
   as the config parameter, and include the `--run-id` and `--train` flags as usual.
   Provide your environment as the `--env` parameter if it has been compiled
   as a standalone, or omit it to train in the Editor.
7. (Optional) Observe training performance using TensorBoard.

This will use the demonstration file to train a neural-network-driven agent
to directly imitate the actions provided in the demonstration. The environment
will launch and be used for evaluating the agent's performance during training.
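As referenced in step 5, here is a minimal sketch of the kind of config entry involved, expressed with `yaml.safe_load` so it can be inspected in Python. Only `demo_path` comes directly from the steps above; the `trainer: offline_bc` value, the other keys, and all numeric values are illustrative assumptions rather than the shipped `offline_bc_config.yaml`.

```python
import yaml

# Sketch only: the piece of config that steps 4-6 ask you to edit.
offline_bc_entry = yaml.safe_load(
    """
    default:
        trainer: offline_bc                                             # assumed trainer name
        demo_path: ./UnitySDK/Assets/Demonstrations/AgentRecording.demo # path from step 5
        batches_per_epoch: 5                                            # illustrative value
        max_steps: 5.0e4                                                # illustrative value
    """
)
print(offline_bc_entry["default"]["demo_path"])
```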

## Online Training

It is also possible to provide demonstrations in real time during training,
without pre-recording a demonstration file. The steps to do this are as follows:

1. First create two Brains, one that will be the "Teacher" and one that
   will be the "Student." We will assume that the names of the Brain
   Assets are "Teacher" and "Student" respectively.
2. The "Teacher" Brain must be a **Player Brain**. You must properly
   configure the inputs to map to the corresponding actions.
3. The "Student" Brain must be a **Learning Brain**.
4. The Brain Parameters of both the "Teacher" and "Student" Brains must be
   compatible with the agent.
5. Drag both the "Teacher" and "Student" Brains into the Academy's `Broadcast Hub`
   and check the `Control` checkbox on the "Student" Brain.
6. Link the Brains to the desired Agents (one Agent as the teacher and at least
   one Agent as a student).
7. In `config/online_bc_config.yaml`, add an entry for the "Student" Brain. Set
   the `trainer` parameter of this entry to `online_bc`, and the
   `brain_to_imitate` parameter to the name of the teacher Brain: "Teacher".
   Additionally, set `batches_per_epoch`, which controls how much training is done
   per update. Increase the `max_steps` option if you'd like to keep training
   the Agents for a longer period of time. (A sketch of such an entry appears
   after this list.)
8. Launch the training process with `mlagents-learn config/online_bc_config.yaml
   --train --slow`, and press the :arrow_forward: button in Unity when the
   message _"Start training by pressing the Play button in the Unity Editor"_ is
   displayed on the screen.
9. From the Unity window, control the Agent with the Teacher Brain by providing
   "teacher demonstrations" of the behavior you would like to see.
10. Watch as the Agent(s) with the Student Brain attached begin to behave
    similarly to the demonstrations.
11. Once the Student Agents are exhibiting the desired behavior, end the training
    process with `CTRL+C` from the command line.
12. Move the resulting `*.nn` file into the `TFModels` subdirectory of the
    Assets folder (or a subdirectory of your choosing within Assets), and use
    it with a `Learning` Brain.
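As referenced in step 7, here is a minimal sketch of the "Student" entry, again via `yaml.safe_load`. The key names come from step 7; the numeric values are illustrative assumptions, not the shipped `online_bc_config.yaml`.

```python
import yaml

# Sketch of an online BC entry for the "Student" Brain (values are illustrative).
student_entry = yaml.safe_load(
    """
    Student:
        trainer: online_bc          # use the online behavioral-cloning trainer
        brain_to_imitate: Teacher   # name of the Player Brain providing demonstrations
        batches_per_epoch: 10       # how much training to do per update
        max_steps: 5.0e4            # raise to keep training longer
    """
)
assert student_entry["Student"]["trainer"] == "online_bc"
```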

**BC Teacher Helper**

We provide a convenience utility, the `BC Teacher Helper` component, which you can add
to the Teacher Agent.

<p align="center">
  <img src="images/bc_teacher_helper.png"
       alt="BC Teacher Helper"
       width="375" border="10" />
</p>

This utility enables you to use keyboard shortcuts to do the following:

1. Start and stop recording experiences. This is useful in case you'd like to
   interact with the game _but not have the agents learn from these
   interactions_. The default command to toggle this is to press `R` on the
   keyboard.

2. Reset the training buffer. This enables you to instruct the agents to forget
   their buffer of recent experiences. This is useful if you'd like to get them
   to quickly learn a new behavior. The default command to reset the buffer is
   to press `C` on the keyboard.

ml-agents/mlagents/trainers/tests/test_bcmodule.py
import unittest.mock as mock
import pytest
import mlagents.trainers.tests.mock_brain as mb

import numpy as np
import yaml
import os

from mlagents.trainers.ppo.policy import PPOPolicy


@pytest.fixture
def dummy_config():
    # PPO trainer config with a `pretraining` section so that the policy
    # builds a BCModule alongside the usual reward signals.
    return yaml.safe_load(
        """
        trainer: ppo
        batch_size: 32
        beta: 5.0e-3
        buffer_size: 512
        epsilon: 0.2
        hidden_units: 128
        lambd: 0.95
        learning_rate: 3.0e-4
        max_steps: 5.0e4
        normalize: true
        num_epoch: 5
        num_layers: 2
        time_horizon: 64
        sequence_length: 64
        summary_freq: 1000
        use_recurrent: false
        memory_size: 8
        pretraining:
            demo_path: ./demos/ExpertPyramid.demo
            strength: 1.0
            steps: 10000000
        reward_signals:
            extrinsic:
                strength: 1.0
                gamma: 0.99
        """
    )


def create_mock_3dball_brain():
    mock_brain = mb.create_mock_brainparams(
        vector_action_space_type="continuous",
        vector_action_space_size=[2],
        vector_observation_space_size=8,
    )
    return mock_brain


def create_mock_banana_brain():
    mock_brain = mb.create_mock_brainparams(
        number_visual_observations=1,
        vector_action_space_type="discrete",
        vector_action_space_size=[3, 3, 3, 2],
        vector_observation_space_size=0,
    )
    return mock_brain


def create_ppo_policy_with_bc_mock(
    mock_env, mock_brain, dummy_config, use_rnn, demo_file
):
    mock_braininfo = mb.create_mock_braininfo(num_agents=12, num_vector_observations=8)
    mb.setup_mock_unityenvironment(mock_env, mock_brain, mock_braininfo)
    env = mock_env()

    trainer_parameters = dummy_config
    model_path = env.brain_names[0]
    trainer_parameters["model_path"] = model_path
    trainer_parameters["keep_checkpoints"] = 3
    trainer_parameters["use_recurrent"] = use_rnn
    # Point the pretrainer at a demo file that lives next to this test module.
    trainer_parameters["pretraining"]["demo_path"] = (
        os.path.dirname(os.path.abspath(__file__)) + "/" + demo_file
    )
    policy = PPOPolicy(0, mock_brain, trainer_parameters, False, False)
    return env, policy


# Test default values
@mock.patch("mlagents.envs.UnityEnvironment")
def test_bcmodule_defaults(mock_env, dummy_config):
    # See if default values match
    mock_brain = create_mock_3dball_brain()
    env, policy = create_ppo_policy_with_bc_mock(
        mock_env, mock_brain, dummy_config, False, "test.demo"
    )
    assert policy.bc_module.num_epoch == dummy_config["num_epoch"]
    assert policy.bc_module.batch_size == dummy_config["batch_size"]
    env.close()
    # Assign strange values and see if it overrides properly
    dummy_config["pretraining"]["num_epoch"] = 100
    dummy_config["pretraining"]["batch_size"] = 10000
    env, policy = create_ppo_policy_with_bc_mock(
        mock_env, mock_brain, dummy_config, False, "test.demo"
    )
    assert policy.bc_module.num_epoch == 100
    assert policy.bc_module.batch_size == 10000
    env.close()


# Test with continuous control env and vector actions
@mock.patch("mlagents.envs.UnityEnvironment")
def test_bcmodule_update(mock_env, dummy_config):
    mock_brain = create_mock_3dball_brain()
    env, policy = create_ppo_policy_with_bc_mock(
        mock_env, mock_brain, dummy_config, False, "test.demo"
    )
    stats = policy.bc_module.update()
    for _, item in stats.items():
        assert isinstance(item, np.float32)
    env.close()


# Test with RNN
@mock.patch("mlagents.envs.UnityEnvironment")
def test_bcmodule_rnn_update(mock_env, dummy_config):
    mock_brain = create_mock_3dball_brain()
    env, policy = create_ppo_policy_with_bc_mock(
        mock_env, mock_brain, dummy_config, True, "test.demo"
    )
    stats = policy.bc_module.update()
    for _, item in stats.items():
        assert isinstance(item, np.float32)
    env.close()


# Test with discrete control and visual observations
@mock.patch("mlagents.envs.UnityEnvironment")
def test_bcmodule_dc_visual_update(mock_env, dummy_config):
    mock_brain = create_mock_banana_brain()
    env, policy = create_ppo_policy_with_bc_mock(
        mock_env, mock_brain, dummy_config, False, "testdcvis.demo"
    )
    stats = policy.bc_module.update()
    for _, item in stats.items():
        assert isinstance(item, np.float32)
    env.close()


# Test with discrete control, visual observations and RNN
@mock.patch("mlagents.envs.UnityEnvironment")
def test_bcmodule_rnn_dc_update(mock_env, dummy_config):
    mock_brain = create_mock_banana_brain()
    env, policy = create_ppo_policy_with_bc_mock(
        mock_env, mock_brain, dummy_config, True, "testdcvis.demo"
    )
    stats = policy.bc_module.update()
    for _, item in stats.items():
        assert isinstance(item, np.float32)
    env.close()


if __name__ == "__main__":
    pytest.main()

ml-agents/mlagents/trainers/tests/testdcvis.demo (1001)

File diff too large to display.
(Binary .demo content: a "BallDemo" demonstration recorded with the 3DBallBrain; raw bytes omitted as they are not human-readable.)