Split Policy and Optimizer, common Policy for PPO and SAC (#3345)

/asymm-envs
GitHub · 5 years ago
Current commit c145e75b
51 changed files with 2931 additions and 3472 deletions
  1. com.unity.ml-agents/CHANGELOG.md (1)
  2. config/sac_trainer_config.yaml (9)
  3. config/trainer_config.yaml (8)
  4. docs/Migrating.md (1)
  5. docs/Training-ML-Agents.md (1)
  6. docs/Training-PPO.md (6)
  7. docs/Training-SAC.md (6)
  8. ml-agents/mlagents/trainers/agent_processor.py (3)
  9. ml-agents/mlagents/trainers/components/bc/model.py (39)
  10. ml-agents/mlagents/trainers/components/bc/module.py (36)
  11. ml-agents/mlagents/trainers/components/reward_signals/__init__.py (19)
  12. ml-agents/mlagents/trainers/components/reward_signals/curiosity/model.py (74)
  13. ml-agents/mlagents/trainers/components/reward_signals/curiosity/signal.py (43)
  14. ml-agents/mlagents/trainers/components/reward_signals/gail/model.py (66)
  15. ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (37)
  16. ml-agents/mlagents/trainers/components/reward_signals/reward_signal_factory.py (10)
  17. ml-agents/mlagents/trainers/exception.py (8)
  18. ml-agents/mlagents/trainers/ghost/trainer.py (14)
  19. ml-agents/mlagents/trainers/learn.py (6)
  20. ml-agents/mlagents/trainers/models.py (270)
  21. ml-agents/mlagents/trainers/ppo/trainer.py (78)
  22. ml-agents/mlagents/trainers/rl_trainer.py (9)
  23. ml-agents/mlagents/trainers/sac/trainer.py (39)
  24. ml-agents/mlagents/trainers/tests/mock_brain.py (3)
  25. ml-agents/mlagents/trainers/tests/test_bcmodule.py (134)
  26. ml-agents/mlagents/trainers/tests/test_ghost.py (10)
  27. ml-agents/mlagents/trainers/tests/test_learn.py (3)
  28. ml-agents/mlagents/trainers/tests/test_meta_curriculum.py (2)
  29. ml-agents/mlagents/trainers/tests/test_policy.py (16)
  30. ml-agents/mlagents/trainers/tests/test_ppo.py (403)
  31. ml-agents/mlagents/trainers/tests/test_reward_signals.py (66)
  32. ml-agents/mlagents/trainers/tests/test_sac.py (257)
  33. ml-agents/mlagents/trainers/tests/test_trainer_util.py (22)
  34. ml-agents/mlagents/trainers/tf_policy.py (221)
  35. ml-agents/mlagents/trainers/trainer.py (10)
  36. ml-agents/mlagents/trainers/trainer_util.py (5)
  37. ml-agents/mlagents/trainers/ppo/optimizer.py (352)
  38. ml-agents/mlagents/trainers/sac/network.py (447)
  39. ml-agents/mlagents/trainers/sac/optimizer.py (643)
  40. ml-agents/mlagents/trainers/tests/test_nn_policy.py (189)
  41. ml-agents/mlagents/trainers/common/__init__.py (0)
  42. ml-agents/mlagents/trainers/common/nn_policy.py (393)
  43. ml-agents/mlagents/trainers/common/optimizer.py (21)
  44. ml-agents/mlagents/trainers/common/tf_optimizer.py (156)
  45. ml-agents/mlagents/trainers/ppo/models.py (382)
  46. ml-agents/mlagents/trainers/ppo/multi_gpu_policy.py (219)
  47. ml-agents/mlagents/trainers/ppo/policy.py (227)
  48. ml-agents/mlagents/trainers/sac/models.py (1001)
  49. ml-agents/mlagents/trainers/sac/policy.py (315)
  50. ml-agents/mlagents/trainers/tests/test_multigpu.py (123)
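Judging from the file list, the per-algorithm policy and model modules (`ppo/policy.py`, `ppo/models.py`, `ppo/multi_gpu_policy.py`, `sac/policy.py`, `sac/models.py`) go away in favour of a shared `common/nn_policy.py` plus per-algorithm optimizers (`ppo/optimizer.py`, `sac/optimizer.py`, built on `common/tf_optimizer.py`), and the `ppo/trainer.py` hunk near the end of this page shows the trainer asking the optimizer rather than the policy for value estimates and reward signals. The stand-in sketch below only illustrates that division of labour; the class internals and constructor arguments are assumptions, not the actual mlagents signatures.

```python
from typing import Dict, Tuple
import numpy as np

class NNPolicyStandIn:
    """Stand-in for common/nn_policy.py: only turns observations into actions;
    it owns no value heads, reward signals or update op."""
    def __init__(self, obs_size: int, act_size: int, seed: int = 0):
        self.weights = np.random.default_rng(seed).standard_normal((obs_size, act_size))

    def evaluate(self, obs: np.ndarray) -> np.ndarray:
        return obs @ self.weights

class PPOOptimizerStandIn:
    """Stand-in for ppo/optimizer.py: built on top of an existing policy and
    owning everything update-related (value estimates, reward signals, LR)."""
    def __init__(self, policy: NNPolicyStandIn, trainer_params: Dict):
        self.policy = policy
        self.reward_signals = {"extrinsic": None}   # real code builds RewardSignal objects
        self.learning_rate = trainer_params["learning_rate"]

    def get_trajectory_value_estimates(
        self, batch: np.ndarray
    ) -> Tuple[Dict[str, np.ndarray], Dict[str, float]]:
        # The real optimizer runs its value heads; here we just average the obs.
        values = {name: batch.mean(axis=1) for name in self.reward_signals}
        value_next = {name: 0.0 for name in self.reward_signals}
        return values, value_next

policy = NNPolicyStandIn(obs_size=8, act_size=2)
optimizer = PPOOptimizerStandIn(policy, {"learning_rate": 3.0e-4})
print(optimizer.get_trajectory_value_estimates(np.ones((5, 8))))
```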

1
com.unity.ml-agents/CHANGELOG.md


- Agent.CollectObservations now takes a VectorSensor argument. It was also overloaded to optionally take an ActionMasker argument. (#3352, #3389)
- Beta support for ONNX export was added. If the `tf2onnx` python package is installed, models will be saved to `.onnx` as well as `.nn` format.
Note that Barracuda 0.6.0 or later is required to import the `.onnx` files properly
- Multi-GPU training and the `--multi-gpu` option has been removed temporarily. (#3345)
### Minor Changes
- Monitor.cs was moved to Examples. (#3372)
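The ONNX note above describes opt-in behaviour: the extra `.onnx` file is only written when the optional `tf2onnx` package can be imported. A hedged sketch of that availability check, purely illustrative rather than the actual mlagents export code:

```python
# Illustrative only: mirrors the "if tf2onnx is installed" behaviour described above.
try:
    import tf2onnx  # noqa: F401
    ONNX_EXPORT_AVAILABLE = True
except ImportError:
    ONNX_EXPORT_AVAILABLE = False

def export_formats() -> list:
    """Formats a trained model would be saved in under this scheme."""
    return [".nn", ".onnx"] if ONNX_EXPORT_AVAILABLE else [".nn"]

print(export_formats())
```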

9
config/sac_trainer_config.yaml


learning_rate: 3.0e-4
learning_rate_schedule: constant
max_steps: 5.0e5
-memory_size: 256
+memory_size: 128
normalize: false
num_update: 1
train_interval: 1

sequence_length: 32
num_layers: 2
hidden_units: 128
-memory_size: 256
+memory_size: 128
init_entcoef: 0.1
max_steps: 1.0e7
summary_freq: 10000

sequence_length: 32
num_layers: 1
hidden_units: 128
-memory_size: 256
+memory_size: 128
summary_freq: 10000
time_horizon: 64
use_recurrent: true

num_layers: 1
hidden_units: 128
-memory_size: 256
+memory_size: 128
gamma: 0.99
buffer_size: 1024
batch_size: 64

8
config/trainer_config.yaml


learning_rate: 3.0e-4
learning_rate_schedule: linear
max_steps: 5.0e5
-memory_size: 256
+memory_size: 128
normalize: false
num_epoch: 3
num_layers: 2

sequence_length: 64
num_layers: 2
hidden_units: 128
-memory_size: 256
+memory_size: 128
beta: 1.0e-2
num_epoch: 3
buffer_size: 1024

sequence_length: 64
num_layers: 1
hidden_units: 128
-memory_size: 256
+memory_size: 128
beta: 1.0e-2
num_epoch: 3
buffer_size: 1024

sequence_length: 32
num_layers: 1
hidden_units: 128
-memory_size: 256
+memory_size: 128
beta: 1.0e-2
num_epoch: 3
buffer_size: 1024

1
docs/Migrating.md


* The interface for `RayPerceptionSensor.PerceiveStatic()` was changed to take an input class and write to an output class.
* The `SetActionMask` method must now be called on the optional `ActionMasker` argument of the `CollectObservations` method. (We now consider an action mask as a type of observation)
* The method `GetStepCount()` on the Agent class has been replaced with the property getter `StepCount`
* The `--multi-gpu` option has been removed temporarily.
### Steps to Migrate
* Replace your Agent's implementation of `CollectObservations()` with `CollectObservations(VectorSensor sensor)`. In addition, replace all calls to `AddVectorObs()` with `sensor.AddObservation()` or `sensor.AddOneHotObservation()` on the `VectorSensor` passed as argument.

1
docs/Training-ML-Agents.md


[here](https://docs.unity3d.com/Manual/CommandLineArguments.html) for more
details.
* `--debug`: Specify this option to enable debug-level logging for some parts of the code.
-* `--multi-gpu`: Setting this flag enables the use of multiple GPU's (if available) during training.
* `--cpu`: Forces training using CPU only.
* Engine Configuration :
* `--width`: The width of the executable window of the environment(s) in pixels

6
docs/Training-PPO.md


### Memory Size
`memory_size` corresponds to the size of the array of floating point numbers
-used to store the hidden state of the recurrent neural network. This value must
-be a multiple of 4, and should scale with the amount of information you expect
+used to store the hidden state of the recurrent neural network of the policy. This value must
+be a multiple of 2, and should scale with the amount of information you expect
-Typical Range: `64` - `512`
+Typical Range: `32` - `256`
## (Optional) Behavioral Cloning Using Demonstrations

6
docs/Training-SAC.md


### Memory Size
`memory_size` corresponds to the size of the array of floating point numbers
-used to store the hidden state of the recurrent neural network. This value must
-be a multiple of 4, and should scale with the amount of information you expect
+used to store the hidden state of the recurrent neural network in the policy.
+This value must be a multiple of 2, and should scale with the amount of information you expect
-Typical Range: `64` - `512`
+Typical Range: `32` - `256`
### (Optional) Save Replay Buffer

3
ml-agents/mlagents/trainers/agent_processor.py


if take_action_outputs:
for _entropy in take_action_outputs["entropy"]:
self.stats_reporter.add_stat("Policy/Entropy", _entropy)
self.stats_reporter.add_stat(
"Policy/Learning Rate", take_action_outputs["learning_rate"]
)
terminated_agents: Set[str] = set()
# Make unique agent_ids that are global across workers

39
ml-agents/mlagents/trainers/components/bc/model.py


from mlagents.tf_utils import tf
from mlagents.trainers.models import LearningModel
from mlagents.trainers.tf_policy import TFPolicy
self,
policy_model: LearningModel,
learning_rate: float = 3e-4,
anneal_steps: int = 0,
self, policy: TFPolicy, learning_rate: float = 3e-4, anneal_steps: int = 0
:param policy_model: The policy of the learning algorithm
:param policy: The policy of the learning algorithm
self.policy_model = policy_model
self.expert_visual_in = self.policy_model.visual_in
self.obs_in_expert = self.policy_model.vector_in
self.policy = policy
self.expert_visual_in = self.policy.visual_in
self.obs_in_expert = self.policy.vector_in
self.make_inputs()
self.create_loss(learning_rate, anneal_steps)

self.done_expert = tf.placeholder(shape=[None, 1], dtype=tf.float32)
self.done_policy = tf.placeholder(shape=[None, 1], dtype=tf.float32)
if self.policy_model.brain.vector_action_space_type == "continuous":
action_length = self.policy_model.act_size[0]
if self.policy.brain.vector_action_space_type == "continuous":
action_length = self.policy.act_size[0]
action_length = len(self.policy_model.act_size)
action_length = len(self.policy.act_size)
self.action_in_expert = tf.placeholder(
shape=[None, action_length], dtype=tf.int32
)

for i, act_size in enumerate(self.policy_model.act_size)
for i, act_size in enumerate(self.policy.act_size)
],
axis=1,
)

:param learning_rate: The learning rate for the optimizer
:param anneal_steps: Number of steps over which to anneal the learning_rate
"""
selected_action = self.policy_model.output
if self.policy_model.brain.vector_action_space_type == "continuous":
selected_action = self.policy.output
if self.policy.use_continuous_act:
log_probs = self.policy_model.all_log_probs
log_probs = self.policy.all_log_probs
self.loss = tf.reduce_mean(
-tf.log(tf.nn.softmax(log_probs) + 1e-7) * self.expert_action
)

learning_rate,
self.policy_model.global_step,
anneal_steps,
0.0,
power=1.0,
learning_rate, self.policy.global_step, anneal_steps, 0.0, power=1.0
optimizer = tf.train.AdamOptimizer(learning_rate=self.annealed_learning_rate)
optimizer = tf.train.AdamOptimizer(
learning_rate=self.annealed_learning_rate, name="bc_adam"
)
self.update_batch = optimizer.minimize(self.loss)

36
ml-agents/mlagents/trainers/components/bc/module.py


from mlagents.trainers.tf_policy import TFPolicy
from .model import BCModel
from mlagents.trainers.demo_loader import demo_to_buffer
from mlagents.trainers.trainer import UnityTrainerException
from mlagents.trainers.exception import UnityTrainerException
class BCModule:

"""
self.policy = policy
self.current_lr = policy_learning_rate * strength
self.model = BCModel(policy.model, self.current_lr, steps)
self.model = BCModel(policy, self.current_lr, steps)
_, self.demonstration_buffer = demo_to_buffer(demo_path, policy.sequence_length)
self.batch_size = batch_size if batch_size else default_batch_size

Helper function for update_batch.
"""
feed_dict = {
self.policy.model.batch_size: n_sequences,
self.policy.model.sequence_length: self.policy.sequence_length,
self.policy.batch_size_ph: n_sequences,
self.policy.sequence_length_ph: self.policy.sequence_length,
if self.policy.model.brain.vector_action_space_type == "continuous":
feed_dict[self.policy.model.epsilon] = np.random.normal(
size=(1, self.policy.model.act_size[0])
)
else:
feed_dict[self.policy.model.action_masks] = np.ones(
if not self.policy.use_continuous_act:
feed_dict[self.policy.action_masks] = np.ones(
sum(self.policy.model.brain.vector_action_space_size),
sum(self.policy.brain.vector_action_space_size),
if self.policy.model.brain.vector_observation_space_size > 0:
feed_dict[self.policy.model.vector_in] = mini_batch_demo["vector_obs"]
for i, _ in enumerate(self.policy.model.visual_in):
feed_dict[self.policy.model.visual_in[i]] = mini_batch_demo[
"visual_obs%d" % i
]
if self.policy.brain.vector_observation_space_size > 0:
feed_dict[self.policy.vector_in] = mini_batch_demo["vector_obs"]
for i, _ in enumerate(self.policy.visual_in):
feed_dict[self.policy.visual_in[i]] = mini_batch_demo["visual_obs%d" % i]
feed_dict[self.policy.model.memory_in] = np.zeros(
feed_dict[self.policy.memory_in] = np.zeros(
if not self.policy.model.brain.vector_action_space_type == "continuous":
feed_dict[self.policy.model.prev_action] = mini_batch_demo[
"prev_action"
]
if not self.policy.use_continuous_act:
feed_dict[self.policy.prev_action] = mini_batch_demo["prev_action"]
network_out = self.policy.sess.run(
list(self.out_dict.values()), feed_dict=feed_dict
)

19
ml-agents/mlagents/trainers/components/reward_signals/__init__.py


from mlagents.tf_utils import tf
from mlagents.trainers.trainer import UnityTrainerException
from mlagents.trainers.exception import UnityTrainerException
from mlagents.trainers.models import LearningModel
logger = logging.getLogger("mlagents.trainers")

class RewardSignal(abc.ABC):
def __init__(
self,
policy: TFPolicy,
policy_model: LearningModel,
strength: float,
gamma: float,
):
def __init__(self, policy: TFPolicy, strength: float, gamma: float):
:param policy: The Policy object (e.g. PPOPolicy) that this Reward Signal will apply to.
:param policy: The Policy object (e.g. NNPolicy) that this Reward Signal will apply to.
:param strength: The strength of the reward. The reward's raw value will be multiplied by this value.
:param gamma: The time discounting factor used for this reward.
:return: A RewardSignal object.

self.update_dict: Dict[str, tf.Tensor] = {}
self.gamma = gamma
self.policy = policy
self.policy_model = policy_model
self.strength = strength
self.stats_name_to_update_name: Dict[str, str] = {}

)
def prepare_update(
self,
policy_model: LearningModel,
mini_batch: Dict[str, np.ndarray],
num_sequences: int,
self, policy: TFPolicy, mini_batch: Dict[str, np.ndarray], num_sequences: int
) -> Dict[tf.Tensor, Any]:
"""
If the reward signal has an internal model (e.g. GAIL or Curiosity), get the feed_dict

74
ml-agents/mlagents/trainers/components/reward_signals/curiosity/model.py


from typing import List, Tuple
from mlagents.tf_utils import tf
from mlagents.trainers.models import LearningModel
from mlagents.trainers.models import ModelUtils
from mlagents.trainers.tf_policy import TFPolicy
self,
policy_model: LearningModel,
encoding_size: int = 128,
learning_rate: float = 3e-4,
self, policy: TFPolicy, encoding_size: int = 128, learning_rate: float = 3e-4
:param policy_model: The model being used by the learning policy
:param policy: The policy being trained
self.policy_model = policy_model
self.policy = policy
self.next_visual_in: List[tf.Tensor] = []
encoded_state, encoded_next_state = self.create_curiosity_encoders()
self.create_inverse_model(encoded_state, encoded_next_state)

encoded_state_list = []
encoded_next_state_list = []
if self.policy_model.vis_obs_size > 0:
if self.policy.vis_obs_size > 0:
for i in range(self.policy_model.vis_obs_size):
for i in range(self.policy.vis_obs_size):
next_visual_input = LearningModel.create_visual_input(
self.policy_model.brain.camera_resolutions[i],
next_visual_input = ModelUtils.create_visual_input(
self.policy.brain.camera_resolutions[i],
name="curiosity_next_visual_observation_" + str(i),
)
self.next_visual_in.append(next_visual_input)

encoded_visual = self.policy_model.create_visual_observation_encoder(
self.policy_model.visual_in[i],
encoded_visual = ModelUtils.create_visual_observation_encoder(
self.policy.visual_in[i],
LearningModel.swish,
ModelUtils.swish,
encoded_next_visual = self.policy_model.create_visual_observation_encoder(
encoded_next_visual = ModelUtils.create_visual_observation_encoder(
LearningModel.swish,
ModelUtils.swish,
1,
"curiosity_stream_{}_visual_obs_encoder".format(i),
True,

encoded_state_list.append(hidden_visual)
encoded_next_state_list.append(hidden_next_visual)
if self.policy_model.vec_obs_size > 0:
if self.policy.vec_obs_size > 0:
shape=[None, self.policy_model.vec_obs_size],
shape=[None, self.policy.vec_obs_size],
encoded_vector_obs = self.policy_model.create_vector_observation_encoder(
self.policy_model.vector_in,
encoded_vector_obs = ModelUtils.create_vector_observation_encoder(
self.policy.vector_in,
LearningModel.swish,
ModelUtils.swish,
encoded_next_vector_obs = self.policy_model.create_vector_observation_encoder(
encoded_next_vector_obs = ModelUtils.create_vector_observation_encoder(
LearningModel.swish,
ModelUtils.swish,
2,
"curiosity_vector_obs_encoder",
True,

:param encoded_next_state: Tensor corresponding to encoded next state.
"""
combined_input = tf.concat([encoded_state, encoded_next_state], axis=1)
hidden = tf.layers.dense(combined_input, 256, activation=LearningModel.swish)
if self.policy_model.brain.vector_action_space_type == "continuous":
hidden = tf.layers.dense(combined_input, 256, activation=ModelUtils.swish)
if self.policy.brain.vector_action_space_type == "continuous":
hidden, self.policy_model.act_size[0], activation=None
hidden, self.policy.act_size[0], activation=None
tf.squared_difference(pred_action, self.policy_model.selected_actions),
axis=1,
tf.squared_difference(pred_action, self.policy.selected_actions), axis=1
tf.dynamic_partition(squared_difference, self.policy_model.mask, 2)[1]
tf.dynamic_partition(squared_difference, self.policy.mask, 2)[1]
hidden, self.policy_model.act_size[i], activation=tf.nn.softmax
hidden, self.policy.act_size[i], activation=tf.nn.softmax
for i in range(len(self.policy_model.act_size))
for i in range(len(self.policy.act_size))
-tf.log(pred_action + 1e-10) * self.policy_model.selected_actions,
axis=1,
-tf.log(pred_action + 1e-10) * self.policy.selected_actions, axis=1
tf.dynamic_partition(cross_entropy, self.policy_model.mask, 2)[1]
tf.dynamic_partition(cross_entropy, self.policy.mask, 2)[1]
)
def create_forward_model(

:param encoded_next_state: Tensor corresponding to encoded next state.
"""
combined_input = tf.concat(
[encoded_state, self.policy_model.selected_actions], axis=1
[encoded_state, self.policy.selected_actions], axis=1
hidden = tf.layers.dense(combined_input, 256, activation=LearningModel.swish)
hidden = tf.layers.dense(combined_input, 256, activation=ModelUtils.swish)
* (
self.policy_model.vis_obs_size + int(self.policy_model.vec_obs_size > 0)
),
* (self.policy.vis_obs_size + int(self.policy.vec_obs_size > 0)),
activation=None,
)
squared_difference = 0.5 * tf.reduce_sum(

self.forward_loss = tf.reduce_mean(
tf.dynamic_partition(squared_difference, self.policy_model.mask, 2)[1]
tf.dynamic_partition(squared_difference, self.policy.mask, 2)[1]
)
def create_loss(self, learning_rate: float) -> None:

43
ml-agents/mlagents/trainers/components/reward_signals/curiosity/signal.py


from mlagents.trainers.components.reward_signals import RewardSignal, RewardSignalResult
from mlagents.trainers.components.reward_signals.curiosity.model import CuriosityModel
from mlagents.trainers.tf_policy import TFPolicy
from mlagents.trainers.models import LearningModel
class CuriosityRewardSignal(RewardSignal):

policy_model: LearningModel,
strength: float,
gamma: float,
encoding_size: int = 128,

:param encoding_size: The size of the hidden encoding layer for the ICM
:param learning_rate: The learning rate for the ICM.
"""
super().__init__(policy, policy_model, strength, gamma)
super().__init__(policy, strength, gamma)
policy_model, encoding_size=encoding_size, learning_rate=learning_rate
policy, encoding_size=encoding_size, learning_rate=learning_rate
)
self.use_terminal_states = False
self.update_dict = {

def evaluate_batch(self, mini_batch: Dict[str, np.array]) -> RewardSignalResult:
feed_dict: Dict[tf.Tensor, Any] = {
self.policy.model.batch_size: len(mini_batch["actions"]),
self.policy.model.sequence_length: self.policy.sequence_length,
self.policy.batch_size_ph: len(mini_batch["actions"]),
self.policy.sequence_length_ph: self.policy.sequence_length,
feed_dict[self.policy.model.vector_in] = mini_batch["vector_obs"]
feed_dict[self.policy.vector_in] = mini_batch["vector_obs"]
if self.policy.model.vis_obs_size > 0:
for i in range(len(self.policy.model.visual_in)):
if self.policy.vis_obs_size > 0:
for i in range(len(self.policy.visual_in)):
feed_dict[self.policy.model.visual_in[i]] = _obs
feed_dict[self.policy.visual_in[i]] = _obs
feed_dict[self.policy.model.selected_actions] = mini_batch["actions"]
feed_dict[self.policy.selected_actions] = mini_batch["actions"]
feed_dict[self.policy.model.action_holder] = mini_batch["actions"]
feed_dict[self.policy.action_holder] = mini_batch["actions"]
unscaled_reward = self.policy.sess.run(
self.model.intrinsic_reward, feed_dict=feed_dict
)

super().check_config(config_dict, param_keys)
def prepare_update(
self,
policy_model: LearningModel,
mini_batch: Dict[str, np.ndarray],
num_sequences: int,
self, policy: TFPolicy, mini_batch: Dict[str, np.ndarray], num_sequences: int
) -> Dict[tf.Tensor, Any]:
"""
Prepare for update and get feed_dict.

"""
feed_dict = {
policy_model.batch_size: num_sequences,
policy_model.sequence_length: self.policy.sequence_length,
policy_model.mask_input: mini_batch["masks"],
policy.batch_size_ph: num_sequences,
policy.sequence_length_ph: self.policy.sequence_length,
policy.mask_input: mini_batch["masks"],
feed_dict[policy_model.selected_actions] = mini_batch["actions"]
feed_dict[policy.selected_actions] = mini_batch["actions"]
feed_dict[policy_model.action_holder] = mini_batch["actions"]
feed_dict[policy.action_holder] = mini_batch["actions"]
feed_dict[policy_model.vector_in] = mini_batch["vector_obs"]
feed_dict[policy.vector_in] = mini_batch["vector_obs"]
if policy_model.vis_obs_size > 0:
for i, vis_in in enumerate(policy_model.visual_in):
if policy.vis_obs_size > 0:
for i, vis_in in enumerate(policy.visual_in):
feed_dict[vis_in] = mini_batch["visual_obs%d" % i]
for i, next_vis_in in enumerate(self.model.next_visual_in):
feed_dict[next_vis_in] = mini_batch["next_visual_obs%d" % i]
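Taken together, the base-class hunk in `reward_signals/__init__.py` and the curiosity hunks above narrow the reward-signal surface from `(policy, policy_model, strength, gamma)` to `(policy, strength, gamma)`, with feed placeholders such as `batch_size_ph`, `sequence_length_ph` and `vector_in` read off the policy itself. A minimal, self-contained sketch of that calling shape, using stand-in classes instead of the real `TFPolicy`/`RewardSignal` types:

```python
from typing import Any, Dict
import numpy as np

class PolicyStandIn:
    """Stand-in for TFPolicy: after this change the feed placeholders live
    directly on the policy instead of on a separate policy.model object."""
    def __init__(self, sequence_length: int = 16):
        self.batch_size_ph = "batch_size_ph"            # tf.placeholder in the real code
        self.sequence_length_ph = "sequence_length_ph"
        self.vector_in = "vector_in"
        self.sequence_length = sequence_length

class RewardSignalSketch:
    # New-style constructor: no separate policy_model argument.
    def __init__(self, policy: PolicyStandIn, strength: float, gamma: float):
        self.policy, self.strength, self.gamma = policy, strength, gamma

    def prepare_update(
        self, policy: PolicyStandIn, mini_batch: Dict[str, np.ndarray], num_sequences: int
    ) -> Dict[Any, Any]:
        # Mirrors the feed_dict construction in the curiosity signal above.
        return {
            policy.batch_size_ph: num_sequences,
            policy.sequence_length_ph: policy.sequence_length,
            policy.vector_in: mini_batch["vector_obs"],
        }

signal = RewardSignalSketch(PolicyStandIn(), strength=0.02, gamma=0.99)
print(signal.prepare_update(signal.policy, {"vector_obs": np.zeros((4, 8))}, num_sequences=2))
```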

66
ml-agents/mlagents/trainers/components/reward_signals/gail/model.py


from mlagents.tf_utils import tf
from mlagents.trainers.models import LearningModel
from mlagents.trainers.tf_policy import TFPolicy
from mlagents.trainers.models import ModelUtils
EPSILON = 1e-7

self,
policy_model: LearningModel,
policy: TFPolicy,
h_size: int = 128,
learning_rate: float = 3e-4,
encoding_size: int = 64,

self.z_size = 128
self.alpha = 0.0005
self.mutual_information = 0.5
self.policy_model = policy_model
self.policy = policy
self.encoding_size = encoding_size
self.gradient_penalty_weight = gradient_penalty_weight
self.use_vail = use_vail

self.done_expert = tf.expand_dims(self.done_expert_holder, -1)
self.done_policy = tf.expand_dims(self.done_policy_holder, -1)
if self.policy_model.brain.vector_action_space_type == "continuous":
action_length = self.policy_model.act_size[0]
if self.policy.brain.vector_action_space_type == "continuous":
action_length = self.policy.act_size[0]
action_length = len(self.policy_model.act_size)
action_length = len(self.policy.act_size)
self.action_in_expert = tf.placeholder(
shape=[None, action_length], dtype=tf.int32
)

for i, act_size in enumerate(self.policy_model.act_size)
for i, act_size in enumerate(self.policy.act_size)
],
axis=1,
)

if self.policy_model.vec_obs_size > 0:
if self.policy.vec_obs_size > 0:
shape=[None, self.policy_model.vec_obs_size], dtype=tf.float32
shape=[None, self.policy.vec_obs_size], dtype=tf.float32
if self.policy_model.normalize:
if self.policy.normalize:
self.policy_model.normalize_vector_obs(self.obs_in_expert)
)
encoded_policy_list.append(
self.policy_model.normalize_vector_obs(self.policy_model.vector_in)
ModelUtils.normalize_vector_obs(
self.obs_in_expert,
self.policy.running_mean,
self.policy.running_variance,
self.policy.normalization_steps,
)
encoded_policy_list.append(self.policy.processed_vector_in)
encoded_policy_list.append(self.policy_model.vector_in)
encoded_policy_list.append(self.policy.vector_in)
if self.policy_model.vis_obs_size > 0:
if self.policy.vis_obs_size > 0:
for i in range(self.policy_model.vis_obs_size):
for i in range(self.policy.vis_obs_size):
visual_input = self.policy_model.create_visual_input(
self.policy_model.brain.camera_resolutions[i],
visual_input = ModelUtils.create_visual_input(
self.policy.brain.camera_resolutions[i],
encoded_policy_visual = self.policy_model.create_visual_observation_encoder(
self.policy_model.visual_in[i],
encoded_policy_visual = ModelUtils.create_visual_observation_encoder(
self.policy.visual_in[i],
LearningModel.swish,
ModelUtils.swish,
encoded_expert_visual = self.policy_model.create_visual_observation_encoder(
encoded_expert_visual = ModelUtils.create_visual_observation_encoder(
LearningModel.swish,
ModelUtils.swish,
1,
"gail_stream_{}_visual_obs_encoder".format(i),
True,

hidden_1 = tf.layers.dense(
concat_input,
self.h_size,
activation=LearningModel.swish,
activation=ModelUtils.swish,
name="gail_d_hidden_1",
reuse=reuse,
)

self.h_size,
activation=LearningModel.swish,
activation=ModelUtils.swish,
name="gail_d_hidden_2",
reuse=reuse,
)

self.z_size,
reuse=reuse,
name="gail_z_mean",
kernel_initializer=LearningModel.scaled_init(0.01),
kernel_initializer=ModelUtils.scaled_init(0.01),
)
self.noise = tf.random_normal(tf.shape(z_mean), dtype=tf.float32)

)
self.policy_estimate, self.z_mean_policy, _ = self.create_encoder(
self.encoded_policy,
self.policy_model.selected_actions,
self.policy.selected_actions,
self.done_policy,
reuse=True,
)

for off-policy. Compute gradients w.r.t randomly interpolated input.
"""
expert = [self.encoded_expert, self.expert_action, self.done_expert]
policy = [
self.encoded_policy,
self.policy_model.selected_actions,
self.done_policy,
]
policy = [self.encoded_policy, self.policy.selected_actions, self.done_policy]
interp = []
for _expert_in, _policy_in in zip(expert, policy):
alpha = tf.random_uniform(tf.shape(_expert_in))

37
ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py


from mlagents.trainers.components.reward_signals import RewardSignal, RewardSignalResult
from mlagents.trainers.tf_policy import TFPolicy
from mlagents.trainers.models import LearningModel
from .model import GAILModel
from mlagents.trainers.demo_loader import demo_to_buffer

def __init__(
self,
policy: TFPolicy,
policy_model: LearningModel,
strength: float,
gamma: float,
demo_path: str,

:param use_vail: Whether or not to use a variational bottleneck for the discriminator.
See https://arxiv.org/abs/1810.00821.
"""
super().__init__(policy, policy_model, strength, gamma)
super().__init__(policy, strength, gamma)
policy.model, 128, learning_rate, encoding_size, use_actions, use_vail
policy, 128, learning_rate, encoding_size, use_actions, use_vail
)
_, self.demonstration_buffer = demo_to_buffer(demo_path, policy.sequence_length)
self.has_updated = False

def evaluate_batch(self, mini_batch: Dict[str, np.array]) -> RewardSignalResult:
feed_dict: Dict[tf.Tensor, Any] = {
self.policy.model.batch_size: len(mini_batch["actions"]),
self.policy.model.sequence_length: self.policy.sequence_length,
self.policy.batch_size_ph: len(mini_batch["actions"]),
self.policy.sequence_length_ph: self.policy.sequence_length,
feed_dict[self.policy.model.vector_in] = mini_batch["vector_obs"]
if self.policy.model.vis_obs_size > 0:
for i in range(len(self.policy.model.visual_in)):
feed_dict[self.policy.vector_in] = mini_batch["vector_obs"]
if self.policy.vis_obs_size > 0:
for i in range(len(self.policy.visual_in)):
feed_dict[self.policy.model.visual_in[i]] = _obs
feed_dict[self.policy.visual_in[i]] = _obs
feed_dict[self.policy.model.selected_actions] = mini_batch["actions"]
feed_dict[self.policy.selected_actions] = mini_batch["actions"]
feed_dict[self.policy.model.action_holder] = mini_batch["actions"]
feed_dict[self.policy.action_holder] = mini_batch["actions"]
feed_dict[self.model.done_policy_holder] = np.array(
mini_batch["done"]
).flatten()

super().check_config(config_dict, param_keys)
def prepare_update(
self,
policy_model: LearningModel,
mini_batch: Dict[str, np.ndarray],
num_sequences: int,
self, policy: TFPolicy, mini_batch: Dict[str, np.ndarray], num_sequences: int
) -> Dict[tf.Tensor, Any]:
"""
Prepare inputs for update.

feed_dict[self.model.action_in_expert] = np.array(mini_batch_demo["actions"])
if self.policy.use_continuous_act:
feed_dict[policy_model.selected_actions] = mini_batch["actions"]
feed_dict[policy.selected_actions] = mini_batch["actions"]
feed_dict[policy_model.action_holder] = mini_batch["actions"]
feed_dict[policy.action_holder] = mini_batch["actions"]
for i in range(len(policy_model.visual_in)):
feed_dict[policy_model.visual_in[i]] = mini_batch["visual_obs%d" % i]
for i in range(len(policy.visual_in)):
feed_dict[policy.visual_in[i]] = mini_batch["visual_obs%d" % i]
feed_dict[policy_model.vector_in] = mini_batch["vector_obs"]
feed_dict[policy.vector_in] = mini_batch["vector_obs"]
feed_dict[self.model.obs_in_expert] = mini_batch_demo["vector_obs"]
self.has_updated = True
return feed_dict

10
ml-agents/mlagents/trainers/components/reward_signals/reward_signal_factory.py


import logging
from typing import Any, Dict, Type
from mlagents.trainers.trainer import UnityTrainerException
from mlagents.trainers.exception import UnityTrainerException
from mlagents.trainers.components.reward_signals import RewardSignal
from mlagents.trainers.components.reward_signals.extrinsic.signal import (
ExtrinsicRewardSignal,

CuriosityRewardSignal,
)
from mlagents.trainers.tf_policy import TFPolicy
from mlagents.trainers.models import LearningModel
logger = logging.getLogger("mlagents.trainers")

def create_reward_signal(
policy: TFPolicy,
policy_model: LearningModel,
name: str,
config_entry: Dict[str, Any],
policy: TFPolicy, name: str, config_entry: Dict[str, Any]
) -> RewardSignal:
"""
Creates a reward signal class based on the name and config entry provided as a dict.

raise UnityTrainerException("Unknown reward signal type {0}".format(name))
rcls.check_config(config_entry)
try:
class_inst = rcls(policy, policy_model, **config_entry)
class_inst = rcls(policy, **config_entry)
except TypeError:
raise UnityTrainerException(
"Unknown parameters given for reward signal {0}".format(name)

8
ml-agents/mlagents/trainers/exception.py


"""
pass
class UnityTrainerException(TrainerError):
"""
Related to errors with the Trainer.
"""
pass

14
ml-agents/mlagents/trainers/ghost/trainer.py


return self.trainer.create_policy(brain_parameters)
def add_policy(self, name_behavior_id: str, policy: TFPolicy) -> None:
# for saving/swapping snapshots
policy.init_load_weights()
"""
Adds policy to trainer. For the first policy added, add a trainer
to the policy and set the learning behavior name to name_behavior_id.
:param name_behavior_id: Behavior ID that the policy should belong to.
:param policy: Policy to associate with name_behavior_id.
"""
policy.create_tf_graph()
self._save_snapshot(policy)
self._save_snapshot(policy) # Need to save after trainer initializes policy
else:
# for saving/swapping snapshots
policy.init_load_weights()
def get_policy(self, name_behavior_id: str) -> TFPolicy:
return self.policies[name_behavior_id]

6
ml-agents/mlagents/trainers/learn.py


help="Whether to run ML-Agents in debug mode with detailed logging",
)
-argparser.add_argument(
-    "--multi-gpu",
-    default=False,
-    action="store_true",
-    help="Setting this flag enables the use of multiple GPU's (if available) during training",
-)
argparser.add_argument(
"--env-args",
default=None,
nargs=argparse.REMAINDER,

270
ml-agents/mlagents/trainers/models.py


import logging
from enum import Enum
from typing import Callable, Dict, List, Optional
from typing import Callable, Dict, List, Tuple, NamedTuple
from mlagents.trainers.trainer import UnityTrainerException
from mlagents.trainers.exception import UnityTrainerException
from mlagents.trainers.brain import CameraResolution
logger = logging.getLogger("mlagents.trainers")

LINEAR = "linear"
class LearningModel:
_version_number_ = 2
class NormalizerTensors(NamedTuple):
update_op: tf.Operation
steps: tf.Tensor
running_mean: tf.Tensor
running_variance: tf.Tensor
class ModelUtils:
# Minimum supported side for each encoder type. If refactoring an encoder, please
# adjust these also.
MIN_RESOLUTION_FOR_ENCODER = {

}
def __init__(
self, m_size, normalize, use_recurrent, brain, seed, stream_names=None
):
tf.set_random_seed(seed)
self.brain = brain
self.vector_in = None
self.global_step, self.increment_step, self.steps_to_increment = (
self.create_global_steps()
)
self.visual_in = []
self.batch_size = tf.placeholder(shape=None, dtype=tf.int32, name="batch_size")
self.sequence_length = tf.placeholder(
shape=None, dtype=tf.int32, name="sequence_length"
)
self.mask_input = tf.placeholder(shape=[None], dtype=tf.float32, name="masks")
self.mask = tf.cast(self.mask_input, tf.int32)
self.stream_names = stream_names or []
self.use_recurrent = use_recurrent
if self.use_recurrent:
self.m_size = m_size
else:
self.m_size = 0
self.normalize = normalize
self.act_size = brain.vector_action_space_size
self.vec_obs_size = brain.vector_observation_space_size
self.vis_obs_size = brain.number_visual_observations
tf.Variable(
int(brain.vector_action_space_type == "continuous"),
name="is_continuous_control",
trainable=False,
dtype=tf.int32,
)
tf.Variable(
self._version_number_,
name="version_number",
trainable=False,
dtype=tf.int32,
)
tf.Variable(self.m_size, name="memory_size", trainable=False, dtype=tf.int32)
if brain.vector_action_space_type == "continuous":
tf.Variable(
self.act_size[0],
name="action_output_shape",
trainable=False,
dtype=tf.int32,
)
else:
tf.Variable(
sum(self.act_size),
name="action_output_shape",
trainable=False,
dtype=tf.int32,
)
self.value_heads: Dict[str, tf.Tensor] = {}
self.normalization_steps: Optional[tf.Variable] = None
self.running_mean: Optional[tf.Variable] = None
self.running_variance: Optional[tf.Variable] = None
self.update_normalization: Optional[tf.Operation] = None
self.value: Optional[tf.Tensor] = None
self.all_log_probs: Optional[tf.Tensor] = None
self.output: Optional[tf.Tensor] = None
self.selected_actions: Optional[tf.Tensor] = None
self.action_holder: Optional[tf.Tensor] = None
@staticmethod
def create_global_steps():
"""Creates TF ops to track and increment global training step."""

global_step: tf.Tensor,
max_step: int,
) -> tf.Tensor:
"""
Create a learning rate tensor.
:param lr_schedule: Type of learning rate schedule.
:param lr: Base learning rate.
:param global_step: A TF Tensor representing the total global step.
:param max_step: The maximum number of steps in the training run.
:return: A Tensor containing the learning rate.
"""
if lr_schedule == LearningRateSchedule.CONSTANT:
learning_rate = tf.Variable(lr)
elif lr_schedule == LearningRateSchedule.LINEAR:

)
return visual_in
def create_vector_input(self, name="vector_observation"):
@staticmethod
def create_visual_input_placeholders(
camera_resolutions: List[CameraResolution]
) -> List[tf.Tensor]:
"""
Creates input placeholders for visual inputs.
:param camera_resolutions: A List of CameraResolutions that specify the resolutions
of the input visual observations.
:returns: A List of Tensorflow placeholders where the input images should be fed.
"""
visual_in: List[tf.Tensor] = []
for i, camera_resolution in enumerate(camera_resolutions):
visual_input = ModelUtils.create_visual_input(
camera_resolution, name="visual_observation_" + str(i)
)
visual_in.append(visual_input)
return visual_in
@staticmethod
def create_vector_input(
vec_obs_size: int, name: str = "vector_observation"
) -> tf.Tensor:
:param name: Name of the placeholder op.
:return:
:param name: Name of the placeholder op.
:return: Placeholder for vector observations.
self.vector_in = tf.placeholder(
shape=[None, self.vec_obs_size], dtype=tf.float32, name=name
vector_in = tf.placeholder(
shape=[None, vec_obs_size], dtype=tf.float32, name=name
if self.normalize:
self.create_normalizer(self.vector_in)
return self.normalize_vector_obs(self.vector_in)
else:
return self.vector_in
return vector_in
def normalize_vector_obs(self, vector_obs):
@staticmethod
def normalize_vector_obs(
vector_obs: tf.Tensor,
running_mean: tf.Tensor,
running_variance: tf.Tensor,
normalization_steps: tf.Tensor,
) -> tf.Tensor:
"""
Create a normalized version of an input tensor.
:param vector_obs: Input vector observation tensor.
:param running_mean: Tensorflow tensor representing the current running mean.
:param running_variance: Tensorflow tensor representing the current running variance.
:param normalization_steps: Tensorflow tensor representing the current number of normalization_steps.
:return: A normalized version of vector_obs.
"""
(vector_obs - self.running_mean)
(vector_obs - running_mean)
self.running_variance
/ (tf.cast(self.normalization_steps, tf.float32) + 1)
running_variance / (tf.cast(normalization_steps, tf.float32) + 1)
),
-5,
5,

def create_normalizer(self, vector_obs):
self.normalization_steps = tf.get_variable(
@staticmethod
def create_normalizer(vector_obs: tf.Tensor) -> NormalizerTensors:
"""
Creates the normalizer and the variables required to store its state.
:param vector_obs: A Tensor representing the next value to normalize. When the
update operation is called, it will use vector_obs to update the running mean
and variance.
:return: A NormalizerTensors tuple that holds running mean, running variance, number of steps,
and the update operation.
"""
vec_obs_size = vector_obs.shape[1]
steps = tf.get_variable(
"normalization_steps",
[],
trainable=False,

self.running_mean = tf.get_variable(
running_mean = tf.get_variable(
[self.vec_obs_size],
[vec_obs_size],
self.running_variance = tf.get_variable(
running_variance = tf.get_variable(
[self.vec_obs_size],
[vec_obs_size],
self.update_normalization = self.create_normalizer_update(vector_obs)
update_normalization = ModelUtils.create_normalizer_update(
vector_obs, steps, running_mean, running_variance
)
return NormalizerTensors(
update_normalization, steps, running_mean, running_variance
)
def create_normalizer_update(self, vector_input):
@staticmethod
def create_normalizer_update(
vector_input: tf.Tensor,
steps: tf.Tensor,
running_mean: tf.Tensor,
running_variance: tf.Tensor,
) -> tf.Operation:
"""
Creates the update operation for the normalizer.
:param vector_input: Vector observation to use for updating the running mean and variance.
:param running_mean: Tensorflow tensor representing the current running mean.
:param running_variance: Tensorflow tensor representing the current running variance.
:param steps: Tensorflow tensor representing the current number of steps that have been normalized.
:return: A TF operation that updates the normalization based on vector_input.
"""
total_new_steps = tf.add(self.normalization_steps, steps_increment)
total_new_steps = tf.add(steps, steps_increment)
input_to_old_mean = tf.subtract(vector_input, self.running_mean)
new_mean = self.running_mean + tf.reduce_sum(
input_to_old_mean = tf.subtract(vector_input, running_mean)
new_mean = running_mean + tf.reduce_sum(
new_variance = self.running_variance + tf.reduce_sum(
new_variance = running_variance + tf.reduce_sum(
update_mean = tf.assign(self.running_mean, new_mean)
update_variance = tf.assign(self.running_variance, new_variance)
update_norm_step = tf.assign(self.normalization_steps, total_new_steps)
update_mean = tf.assign(running_mean, new_mean)
update_variance = tf.assign(running_variance, new_variance)
update_norm_step = tf.assign(steps, total_new_steps)
return tf.group([update_mean, update_variance, update_norm_step])
@staticmethod

hidden = tf.layers.flatten(conv2)
with tf.variable_scope(scope + "/" + "flat_encoding"):
hidden_flat = LearningModel.create_vector_observation_encoder(
hidden_flat = ModelUtils.create_vector_observation_encoder(
hidden, h_size, activation, num_layers, scope, reuse
)
return hidden_flat

hidden = tf.layers.flatten(conv3)
with tf.variable_scope(scope + "/" + "flat_encoding"):
hidden_flat = LearningModel.create_vector_observation_encoder(
hidden_flat = ModelUtils.create_vector_observation_encoder(
hidden, h_size, activation, num_layers, scope, reuse
)
return hidden_flat

hidden = tf.layers.flatten(hidden)
with tf.variable_scope(scope + "/" + "flat_encoding"):
hidden_flat = LearningModel.create_vector_observation_encoder(
hidden_flat = ModelUtils.create_vector_observation_encoder(
hidden, h_size, activation, num_layers, scope, reuse
)
return hidden_flat

ENCODER_FUNCTION_BY_TYPE = {
EncoderType.SIMPLE: LearningModel.create_visual_observation_encoder,
EncoderType.NATURE_CNN: LearningModel.create_nature_cnn_visual_observation_encoder,
EncoderType.RESNET: LearningModel.create_resnet_visual_observation_encoder,
EncoderType.SIMPLE: ModelUtils.create_visual_observation_encoder,
EncoderType.NATURE_CNN: ModelUtils.create_nature_cnn_visual_observation_encoder,
EncoderType.RESNET: ModelUtils.create_resnet_visual_observation_encoder,
encoder_type, LearningModel.create_visual_observation_encoder
encoder_type, ModelUtils.create_visual_observation_encoder
)
@staticmethod

@staticmethod
def _check_resolution_for_encoder(
camera_res: CameraResolution, vis_encoder_type: EncoderType
vis_in: tf.Tensor, vis_encoder_type: EncoderType
min_res = LearningModel.MIN_RESOLUTION_FOR_ENCODER[vis_encoder_type]
if camera_res.height < min_res or camera_res.width < min_res:
min_res = ModelUtils.MIN_RESOLUTION_FOR_ENCODER[vis_encoder_type]
height = vis_in.shape[1]
width = vis_in.shape[2]
if height < min_res or width < min_res:
f"Visual observation resolution ({camera_res.width}x{camera_res.height}) is too small for"
f"Visual observation resolution ({width}x{height}) is too small for"
@staticmethod
self,
visual_in: List[tf.Tensor],
vector_in: tf.Tensor,
num_streams: int,
h_size: int,
num_layers: int,

the scopes for each of the streams. None if all under the same TF scope.
:return: List of encoded streams.
"""
brain = self.brain
activation_fn = self.swish
self.visual_in = []
for i in range(brain.number_visual_observations):
LearningModel._check_resolution_for_encoder(
brain.camera_resolutions[i], vis_encode_type
)
visual_input = self.create_visual_input(
brain.camera_resolutions[i], name="visual_observation_" + str(i)
)
self.visual_in.append(visual_input)
vector_observation_input = self.create_vector_input()
activation_fn = ModelUtils.swish
vector_observation_input = vector_in
create_encoder_func = LearningModel.get_encoder_for_type(vis_encode_type)
create_encoder_func = ModelUtils.get_encoder_for_type(vis_encode_type)
if self.vis_obs_size > 0:
for j in range(brain.number_visual_observations):
if len(visual_in) > 0:
for j, vis_in in enumerate(visual_in):
ModelUtils._check_resolution_for_encoder(vis_in, vis_encode_type)
self.visual_in[j],
vis_in,
h_size,
activation_fn,
num_layers,

visual_encoders.append(encoded_visual)
hidden_visual = tf.concat(visual_encoders, axis=1)
if brain.vector_observation_space_size > 0:
hidden_state = self.create_vector_observation_encoder(
if vector_in.get_shape()[-1] > 0: # Don't encode 0-shape inputs
hidden_state = ModelUtils.create_vector_observation_encoder(
vector_observation_input,
h_size,
activation_fn,

recurrent_output = tf.reshape(recurrent_output, shape=[-1, half_point])
return recurrent_output, tf.concat([lstm_state_out.c, lstm_state_out.h], axis=1)
def create_value_heads(self, stream_names, hidden_input):
@staticmethod
def create_value_heads(
stream_names: List[str], hidden_input: tf.Tensor
) -> Tuple[Dict[str, tf.Tensor], tf.Tensor]:
"""
Creates one value estimator head for each reward signal in stream_names.
Also creates the node corresponding to the mean of all the value heads in self.value.

of the hidden input.
"""
value_heads = {}
self.value_heads[name] = value
self.value = tf.reduce_mean(list(self.value_heads.values()), 0)
value_heads[name] = value
value = tf.reduce_mean(list(value_heads.values()), 0)
return value_heads, value
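The `models.py` hunk above replaces normalizer instance methods that mutated `self.running_mean` / `self.running_variance` with static `ModelUtils` helpers that hand their state back as a `NormalizerTensors` tuple, so the policy (or anything else) can own those tensors explicitly. Below is a plain-numpy analogue of that design and of the update/normalize math visible in the hunk; it is a sketch of the idea, not the TF ops themselves:

```python
from typing import NamedTuple
import numpy as np

class NormalizerState(NamedTuple):       # counterpart of NormalizerTensors
    steps: int
    running_mean: np.ndarray
    running_variance: np.ndarray

def create_normalizer(vec_obs_size: int) -> NormalizerState:
    return NormalizerState(0, np.zeros(vec_obs_size), np.ones(vec_obs_size))

def normalizer_update(state: NormalizerState, batch: np.ndarray) -> NormalizerState:
    # Same incremental mean/variance bookkeeping as create_normalizer_update above.
    total_new_steps = state.steps + batch.shape[0]
    input_to_old_mean = batch - state.running_mean
    new_mean = state.running_mean + input_to_old_mean.sum(axis=0) / total_new_steps
    new_variance = state.running_variance + (
        (batch - new_mean) * input_to_old_mean
    ).sum(axis=0)
    return NormalizerState(total_new_steps, new_mean, new_variance)

def normalize_vector_obs(state: NormalizerState, obs: np.ndarray) -> np.ndarray:
    # Same clip-to-[-5, 5] normalization as ModelUtils.normalize_vector_obs.
    return np.clip(
        (obs - state.running_mean)
        / np.sqrt(state.running_variance / (state.steps + 1)),
        -5, 5,
    )

state = normalizer_update(create_normalizer(3), np.random.randn(32, 3))
print(normalize_vector_obs(state, np.random.randn(3)))
```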

78
ml-agents/mlagents/trainers/ppo/trainer.py


import numpy as np
from mlagents.trainers.ppo.policy import PPOPolicy
from mlagents.trainers.ppo.multi_gpu_policy import MultiGpuPPOPolicy, get_devices
from mlagents.trainers.common.nn_policy import NNPolicy
from mlagents.trainers.ppo.optimizer import PPOOptimizer
from mlagents.trainers.trajectory import Trajectory
logger = logging.getLogger("mlagents.trainers")

load: bool,
seed: int,
run_id: str,
multi_gpu: bool,
):
"""
Responsible for collecting experiences and training PPO model.

:param load: Whether the model should be loaded.
:param seed: The seed the model will be initialized with
:param run_id: The identifier of the current run
:param multi_gpu: Boolean for multi-gpu policy model
"""
super(PPOTrainer, self).__init__(
brain_name, trainer_parameters, training, run_id, reward_buff_cap

]
self._check_param_keys()
self.load = load
self.multi_gpu = multi_gpu
self.policy: PPOPolicy = None # type: ignore
self.policy: NNPolicy = None # type: ignore
def _process_trajectory(self, trajectory: Trajectory) -> None:
"""

self.policy.update_normalization(agent_buffer_trajectory["vector_obs"])
# Get all value estimates
value_estimates = self.policy.get_batched_value_estimates(
agent_buffer_trajectory
value_estimates, value_next = self.optimizer.get_trajectory_value_estimates(
agent_buffer_trajectory,
trajectory.next_obs,
trajectory.done_reached and not trajectory.max_step_reached,
self.policy.reward_signals[name].value_name, np.mean(v)
self.optimizer.reward_signals[name].value_name, np.mean(v)
value_next = self.policy.get_value_estimates(
trajectory.next_obs,
agent_id,
trajectory.done_reached and not trajectory.max_step_reached,
)
for name, reward_signal in self.policy.reward_signals.items():
for name, reward_signal in self.optimizer.reward_signals.items():
evaluate_result = reward_signal.evaluate_batch(
agent_buffer_trajectory
).scaled_reward

# Compute GAE and returns
tmp_advantages = []
tmp_returns = []
for name in self.policy.reward_signals:
for name in self.optimizer.reward_signals:
bootstrap_value = value_next[name]
local_rewards = agent_buffer_trajectory[

rewards=local_rewards,
value_estimates=local_value_estimates,