
Improvements for GAIL (#2296)

* Don't 0 value bootstrap for GAIL and Curiosity
* Add gradient penalties to GAN to help with stability
* Add gail_config.yaml with GAIL examples
* Cleaned up trainer_config.yaml and unnecessary gammas
* Documentation updates
* Code cleanup
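The first bullet addresses value bootstrapping at episode boundaries: reward signals with no natural episode end (GAIL, Curiosity) should keep bootstrapping from the value estimate even when the environment reports Done. Below is a minimal sketch of the idea using simplified stand-in classes rather than the real trainer types; the actual change lives in `reward_signal.py` and `ppo/policy.py` further down.

```python
# Sketch only: mirrors the use_terminal_states flag introduced in this commit,
# using hypothetical stand-in classes instead of the real ML-Agents trainer code.
from typing import Dict


class RewardSignalStub:
    def __init__(self, use_terminal_states: bool):
        # True  -> discounted return stops at Done (e.g. the extrinsic reward)
        # False -> keep bootstrapping past Done (e.g. GAIL, Curiosity)
        self.use_terminal_states = use_terminal_states


def bootstrap_value_estimates(
    value_estimates: Dict[str, float],
    reward_signals: Dict[str, RewardSignalStub],
    done: bool,
) -> Dict[str, float]:
    # Only zero out the value heads whose reward signal terminates at Done,
    # instead of zeroing every head as before.
    if done:
        for name in value_estimates:
            if reward_signals[name].use_terminal_states:
                value_estimates[name] = 0.0
    return value_estimates


signals = {
    "extrinsic": RewardSignalStub(use_terminal_states=True),
    "gail": RewardSignalStub(use_terminal_states=False),
}
print(bootstrap_value_estimates({"extrinsic": 1.2, "gail": 0.8}, signals, done=True))
# -> {'extrinsic': 0.0, 'gail': 0.8}
```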
Branch: /develop-generalizationTraining-TrainerController
GitHub · 5 years ago
Commit 6a212f73
10 files changed, 235 insertions(+), 59 deletions(-)
  1. config/trainer_config.yaml (35 changes)
  2. docs/Training-Imitation-Learning.md (62 changes)
  3. ml-agents/mlagents/trainers/components/bc/model.py (6 changes)
  4. ml-agents/mlagents/trainers/components/reward_signals/curiosity/signal.py (1 change)
  5. ml-agents/mlagents/trainers/components/reward_signals/gail/model.py (61 changes)
  6. ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (1 change)
  7. ml-agents/mlagents/trainers/components/reward_signals/reward_signal.py (3 changes)
  8. ml-agents/mlagents/trainers/ppo/policy.py (12 changes)
  9. ml-agents/mlagents/trainers/tests/test_ppo.py (7 changes)
  10. config/gail_config.yaml (106 changes)

config/trainer_config.yaml (35 changes)


    time_horizon: 64
    num_layers: 2

SmallWallJumpLearning:
    max_steps: 1.0e6
    batch_size: 128
    buffer_size: 2048

    num_layers: 2
    normalize: false

BigWallJumpLearning:
    max_steps: 1.0e6
    batch_size: 128
    buffer_size: 2048

    max_steps: 5.0e5
    num_epoch: 3
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:

VisualPyramidsLearning:
    time_horizon: 128
    batch_size: 64

    max_steps: 5.0e5
    num_epoch: 3
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:

    max_steps: 5.0e5
    beta: 0.001
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995

    num_layers: 3
    hidden_units: 512
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995

    num_layers: 3
    hidden_units: 512
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995

    num_layers: 3
    hidden_units: 512
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995

    time_horizon: 1000
    batch_size: 2024
    buffer_size: 20240
    gamma: 0.995
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995

HallwayLearning:
    use_recurrent: true
    hidden_units: 128
    memory_size: 256
    beta: 1.0e-2
    gamma: 0.99
    num_epoch: 3
    buffer_size: 1024
    batch_size: 64

    num_layers: 1
    hidden_units: 256
    beta: 5.0e-3
    gamma: 0.9
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.9

BasicLearning:
    batch_size: 32
    beta: 5.0e-3
    gamma: 0.9
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.9

docs/Training-Imitation-Learning.md (62 changes)


Imitation Learning uses pairs of observations and actions from
a demonstration to learn a policy. [Video Link](https://youtu.be/kpb8ZkMBFYs).

Imitation learning can also be used to help reinforcement learning. Especially in
it is easier to just show the agent how to achieve the reward. In these cases,
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
just 6 episodes of demonstrations can reduce training steps by more than 4 times.

ML-Agents provides several ways to learn from demonstrations. For most situations,
[GAIL](Training-RewardSignals.md#the-gail-reward-signal) is the preferred approach.
  number of demonstrations.
* To help bootstrap reinforcement learning, you can enable
  [pretraining](Training-PPO.md#optional-pretraining-using-demonstrations)
  on the PPO trainer, in addition to using a small GAIL reward signal.
* To train an agent to exactly mimic demonstrations, you can use the

### How to Choose

If you want to help your agents learn (especially with environments that have sparse rewards)
using pre-recorded demonstrations, you can generally enable both GAIL and Pretraining.
An example of this is provided for the Pyramids example environment under
`PyramidsLearning` in `config/gail_config.yaml`.

If you want to train purely from demonstrations, GAIL is generally the preferred approach, especially
if you have few (<10) episodes of demonstrations. An example of this is provided for the Crawler example
environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.

If you have plenty of demonstrations and/or a very simple environment, Behavioral Cloning
(online and offline) can be effective and quick. However, it cannot be combined with RL.
It is possible to record demonstrations of agent behavior from the Unity Editor,
and save them as assets. These demonstrations contain information on the
observations, actions, and rewards for a given agent during the recording session.
They can be managed from the Editor, as well as used for training with Offline

In order to record demonstrations from an agent, add the `Demonstration Recorder`
component to a GameObject in the scene which contains an `Agent` component.
Once added, it is possible to name the demonstration that will be recorded
from the agent.

When `Record` is checked, a demonstration will be created whenever the scene
is played from the Editor. Depending on the complexity of the task, anywhere
from a few minutes to a few hours of demonstration data may be necessary to
be useful for imitation learning. When you have recorded enough data, end
the Editor play session, and a `.demo` file will be created in the
`Assets/Demonstrations` folder. This file contains the demonstrations.
Clicking on the file will provide metadata about the demonstration in the
inspector.

ml-agents/mlagents/trainers/components/bc/model.py (6 changes)


        )
        self.expert_action = tf.concat(
            [
                tf.one_hot(self.action_in_expert[:, i], act_size)
                for i, act_size in enumerate(self.policy_model.act_size)
            ],
            axis=1,
        )
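For readers unfamiliar with the multi-branch discrete action setup, the new comprehension builds one one-hot block per action branch and concatenates them along the feature axis. A standalone sketch of the same pattern, using hypothetical branch sizes that are not part of this diff:

```python
import numpy as np
import tensorflow as tf

# Hypothetical example: 4 timesteps of expert actions over 2 discrete branches.
act_size = [3, 2]  # branch 0 has 3 choices, branch 1 has 2
actions = np.array([[0, 1], [2, 0], [1, 1], [0, 0]], dtype=np.int32)

# One one-hot block per branch, concatenated on the last axis,
# mirroring the comprehension in bc/model.py above.
expert_action = tf.concat(
    [tf.one_hot(actions[:, i], size) for i, size in enumerate(act_size)],
    axis=1,
)
# expert_action has shape (4, 3 + 2) = (4, 5)
```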

ml-agents/mlagents/trainers/components/reward_signals/curiosity/signal.py (1 change)


            policy.model, encoding_size=encoding_size, learning_rate=learning_rate
        )
        self.num_epoch = num_epoch
        self.use_terminal_states = False
        self.update_dict = {
            "forward_loss": self.model.forward_loss,
            "inverse_loss": self.model.inverse_loss,

ml-agents/mlagents/trainers/components/reward_signals/gail/model.py (61 changes)


import tensorflow as tf
from mlagents.trainers.models import LearningModel

EPSILON = 1e-7


class GAILModel(object):
    def __init__(

        encoding_size: int = 64,
        use_actions: bool = False,
        use_vail: bool = False,
        gradient_penalty_weight: float = 10.0,
    ):
        """
        The initializer for the GAIL reward generator.

        self.mutual_information = 0.5
        self.policy_model = policy_model
        self.encoding_size = encoding_size
        self.gradient_penalty_weight = gradient_penalty_weight
        self.use_vail = use_vail
        self.use_actions = use_actions  # True # Not using actions
        self.make_beta()

        )
        self.kl_div_input = tf.placeholder(shape=[], dtype=tf.float32)
        new_beta = tf.maximum(
            self.beta + self.alpha * (self.kl_div_input - self.mutual_information),
            EPSILON,
        )
        self.update_beta = tf.assign(self.beta, new_beta)
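Written out, the β update above (used when `use_vail` is enabled) performs dual gradient ascent toward the mutual-information target, clamped away from zero; here `kl_div_input` is the measured KL divergence, `mutual_information` is the target I_C (0.5), and `EPSILON` is the 1e-7 floor:

```latex
\beta \leftarrow \max\bigl(\beta + \alpha\,(D_{\mathrm{KL}} - I_C),\; \epsilon\bigr)
```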

        )
        self.expert_action = tf.concat(
            [
                tf.one_hot(self.action_in_expert[:, i], act_size)
                for i, act_size in enumerate(self.policy_model.act_size)
            ],
            axis=1,
        )

    def create_encoder(
        self, state_in: tf.Tensor, action_in: tf.Tensor, done_in: tf.Tensor, reuse: bool
    ) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:
        """
        Creates the encoder for the discriminator
        :param state_in: The encoded observation input

            name="d_estimate",
            reuse=reuse,
        )
        return estimate, z_mean, concat_input

    def create_network(self) -> None:
        """

            initializer=tf.ones_initializer(),
        )
        self.z_sigma_sq = self.z_sigma * self.z_sigma
        self.z_log_sigma_sq = tf.log(self.z_sigma_sq + EPSILON)
        self.expert_estimate, self.z_mean_expert, _ = self.create_encoder(
            self.encoded_expert, self.expert_action, self.done_expert, reuse=False
        )
        self.policy_estimate, self.z_mean_policy, _ = self.create_encoder(
            self.encoded_policy,
            self.policy_model.selected_actions,
            self.done_policy,

            self.policy_estimate, [-1], name="GAIL_reward"
        )
        self.intrinsic_reward = -tf.log(1.0 - self.discriminator_score + EPSILON)
    def create_gradient_magnitude(self) -> tf.Tensor:
        """
        Gradient penalty from https://arxiv.org/pdf/1704.00028. Adds stability esp.
        for off-policy. Compute gradients w.r.t randomly interpolated input.
        """
        expert = [self.encoded_expert, self.expert_action, self.done_expert]
        policy = [
            self.encoded_policy,
            self.policy_model.selected_actions,
            self.done_policy,
        ]
        interp = []
        for _expert_in, _policy_in in zip(expert, policy):
            alpha = tf.random_uniform(tf.shape(_expert_in))
            interp.append(alpha * _expert_in + (1 - alpha) * _policy_in)

        grad_estimate, _, grad_input = self.create_encoder(
            interp[0], interp[1], interp[2], reuse=True
        )

        grad = tf.gradients(grad_estimate, [grad_input])[0]

        # Norm's gradient could be NaN at 0. Use our own safe_norm
        safe_norm = tf.sqrt(tf.reduce_sum(grad ** 2, axis=-1) + EPSILON)
        gradient_mag = tf.reduce_mean(tf.pow(safe_norm - 1, 2))

        return gradient_mag
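The penalty follows the WGAN-GP formulation from the cited paper (https://arxiv.org/pdf/1704.00028): sample random interpolations between expert and policy inputs and push the discriminator's gradient norm at those points toward 1. This is also why `create_encoder` now returns its concatenated input, so the gradient can be taken with respect to it:

```latex
\hat{x} = \alpha\, x_{\mathrm{expert}} + (1 - \alpha)\, x_{\mathrm{policy}}, \quad \alpha \sim U(0, 1),
\qquad
\mathcal{L}_{GP} = \mathbb{E}_{\hat{x}}\!\left[\bigl(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\bigr)^2\right]
```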
    def create_loss(self, learning_rate: float) -> None:
        """

        self.mean_policy_estimate = tf.reduce_mean(self.policy_estimate)

        self.discriminator_loss = -tf.reduce_mean(
            tf.log(self.expert_estimate + EPSILON)
            + tf.log(1.0 - self.policy_estimate + EPSILON)
        )

        if self.use_vail:

            )
        else:
            self.loss = self.discriminator_loss

        if self.gradient_penalty_weight > 0.0:
            self.loss += self.gradient_penalty_weight * self.create_gradient_magnitude()

        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        self.update_batch = optimizer.minimize(self.loss)
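Putting the pieces together (and ignoring the optional VAIL bottleneck term), the objective minimized by the Adam update is the usual GAN discriminator loss plus the weighted penalty, while the reward passed to PPO is the log-survival term built in `create_network`; λ is `gradient_penalty_weight` (default 10.0):

```latex
\mathcal{L}_D = -\,\mathbb{E}_{\mathrm{expert}}\bigl[\log D(s,a)\bigr]
               - \mathbb{E}_{\mathrm{policy}}\bigl[\log\bigl(1 - D(s,a)\bigr)\bigr]
               + \lambda\,\mathcal{L}_{GP},
\qquad
r_{\mathrm{GAIL}}(s,a) = -\log\bigl(1 - D(s,a) + \epsilon\bigr)
```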

ml-agents/mlagents/trainers/components/reward_signals/gail/signal.py (1 change)


        super().__init__(policy, strength, gamma)
        self.num_epoch = num_epoch
        self.samples_per_update = samples_per_update
        self.use_terminal_states = False
        self.model = GAILModel(
            policy.model, 128, learning_rate, encoding_size, use_actions, use_vail

ml-agents/mlagents/trainers/components/reward_signals/reward_signal.py (3 changes)


        short_name = class_name.replace("RewardSignal", "")
        self.stat_name = f"Policy/{short_name} Reward"
        self.value_name = f"Policy/{short_name} Value Estimate"
        # Terminate discounted reward computation at Done. Can disable to mitigate
        # positive bias in rewards with no natural end, e.g. GAIL or Curiosity.
        self.use_terminal_states = True
        self.gamma = gamma
        self.policy = policy
        self.strength = strength

ml-agents/mlagents/trainers/ppo/policy.py (12 changes)


        :return: The value estimate dictionary with key being the name of the reward signal
            and the value the corresponding value estimate.
        """
        feed_dict: Dict[tf.Tensor, Any] = {
            self.model.batch_size: 1,

        ].reshape([-1, len(self.model.act_size)])
        value_estimates = self.sess.run(self.model.value_heads, feed_dict)

        value_estimates = {k: float(v) for k, v in value_estimates.items()}

        # If we're done, reassign all of the value estimates that need terminal states.
        if done:
            for k in value_estimates:
                if self.reward_signals[k].use_terminal_states:
                    value_estimates[k] = 0.0

        return value_estimates

    def get_action(self, brain_info: BrainInfo) -> ActionInfo:
        """
def get_action(self, brain_info: BrainInfo) -> ActionInfo:
"""

ml-agents/mlagents/trainers/tests/test_ppo.py (7 changes)


        assert type(key) is str
        assert val == 0.0

    # Check if we ignore terminal states properly
    policy.reward_signals["extrinsic"].use_terminal_states = False
    run_out = policy.get_value_estimates(brain_info, 0, done=True)
    for key, val in run_out.items():
        assert type(key) is str
        assert val != 0.0

    env.close()

config/gail_config.yaml (106 changes)


default:
    trainer: ppo
    batch_size: 1024
    beta: 5.0e-3
    buffer_size: 10240
    epsilon: 0.2
    hidden_units: 128
    lambd: 0.95
    learning_rate: 3.0e-4
    max_steps: 5.0e4
    memory_size: 256
    normalize: false
    num_epoch: 3
    num_layers: 2
    time_horizon: 64
    sequence_length: 64
    summary_freq: 1000
    use_recurrent: false
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99

PyramidsLearning:
    summary_freq: 2000
    time_horizon: 128
    batch_size: 128
    buffer_size: 2048
    hidden_units: 512
    num_layers: 2
    beta: 1.0e-2
    max_steps: 5.0e5
    num_epoch: 3
    pretraining:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 0.5
        steps: 10000
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:
            strength: 0.02
            gamma: 0.99
            encoding_size: 256
        gail:
            strength: 0.01
            gamma: 0.99
            encoding_size: 128
            demo_path: demos/ExpertPyramid.demo

CrawlerStaticLearning:
    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 2024
    buffer_size: 20240
    max_steps: 1e6
    summary_freq: 3000
    num_layers: 3
    hidden_units: 512
    reward_signals:
        gail:
            strength: 1.0
            gamma: 0.99
            encoding_size: 128
            demo_path: demos/ExpertCrawlerSta.demo

PushBlockLearning:
    max_steps: 5.0e4
    batch_size: 128
    buffer_size: 2048
    beta: 1.0e-2
    hidden_units: 256
    summary_freq: 2000
    time_horizon: 64
    num_layers: 2
    reward_signals:
        gail:
            strength: 1.0
            gamma: 0.99
            encoding_size: 128
            demo_path: demos/ExpertPush.demo

HallwayLearning:
    use_recurrent: true
    sequence_length: 64
    num_layers: 2
    hidden_units: 128
    memory_size: 256
    beta: 1.0e-2
    num_epoch: 3
    buffer_size: 1024
    batch_size: 128
    max_steps: 5.0e5
    summary_freq: 1000
    time_horizon: 64
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        gail:
            strength: 0.1
            gamma: 0.99
            encoding_size: 128
            demo_path: demos/ExpertHallway.demo
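For orientation when reading these entries: each key under `reward_signals` gets its own value head and summary statistics (see `reward_signal.py` above). As an assumption about the surrounding trainer code, which is not shown in this diff, the per-signal rewards are combined as a strength-weighted sum, so for example `HallwayLearning` trains on the full environment reward plus a 0.1-weighted GAIL reward:

```latex
r_t = \sum_i \mathrm{strength}_i \cdot r^{(i)}_t
```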