
Report means instead of totals for losses (#580)

* Report means instead of totals for losses.
* Report absolute loss for policy.
Branch: develop-generalizationTraining-TrainerController
vincentpierre, 7 years ago
Current commit: 076c8744
4 files changed, 405 insertions(+), 266 deletions(-)
1. docs/Training-PPO.md (6)
2. docs/Using-Tensorboard.md (6)
3. docs/images/mlagents-TensorBoard.png (649)
4. python/unitytrainers/ppo/trainer.py (10)

docs/Training-PPO.md (6)


#### Policy Loss
- These values will oscillate with training.
+ These values will oscillate during training. Generally they should be less than 1.0.

- These values should increase with the reward. They corresponds to how much future reward the agent predicts itself receiving at any given point.
+ These values should increase as the cumulative reward increases. They correspond to how much future reward the agent predicts itself receiving at any given point.

- These values will increase as the reward increases, and should decrease when reward becomes stable.
+ These values will increase as the reward increases, and then should decrease once reward becomes stable.
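
For context on what the value estimate tracks, here is a minimal sketch (not part of this commit; the function name is illustrative) of the discounted return the value head is trained to predict:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# As episode rewards grow, so does the return the value head has to predict,
# which is why the value estimate curve rises with the cumulative reward.
print(discounted_returns([0.0, 0.0, 1.0]))  # -> [0.9801 0.99   1.    ]
```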

docs/Using-Tensorboard.md (6)


  * Learning Rate - How large a step the training algorithm takes as it searches
    for the optimal policy. Should decrease over time.
- * Policy Loss - The mean loss of the policy function update. Correlates to how
+ * Policy Loss - The mean magnitude of policy loss function. Correlates to how
    much the policy (process for deciding actions) is changing. The magnitude of
    this should decrease during a successful training session.

  * Value Loss - The mean loss of the value function update. Correlates to how
-   well the model is able to predict the value of each state. This should decrease
-   during a successful training session.
+   well the model is able to predict the value of each state. This should increase
+   while the agent is learning, and then decrease once the reward stabilizes.
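
To make the two reported quantities concrete, a minimal sketch of what they measure, assuming the standard clipped-surrogate PPO objective; the function and argument names are illustrative, not the ml-agents API:

```python
import numpy as np

def ppo_losses(new_log_probs, old_log_probs, advantages,
               value_preds, returns, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Policy loss: negative clipped surrogate objective; reporting its mean
    # magnitude (np.abs) keeps the TensorBoard curve positive across updates.
    policy_loss = np.abs(-np.mean(np.minimum(unclipped, clipped)))
    # Value loss: mean squared error between predicted values and observed returns.
    value_loss = np.mean((returns - value_preds) ** 2)
    return policy_loss, value_loss

# Example call with toy numbers for a batch of 3 samples.
p, v = ppo_losses(np.log([0.5, 0.6, 0.4]), np.log([0.5, 0.5, 0.5]),
                  advantages=np.array([1.0, -0.5, 0.2]),
                  value_preds=np.array([0.8, 0.9, 1.1]),
                  returns=np.array([1.0, 1.0, 1.0]))
print(p, v)
```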

docs/images/mlagents-TensorBoard.png (649)

Before | After
Width: 1395 | Height: 548 | Size: 138 KiB

python/unitytrainers/ppo/trainer.py (10)


"""
num_epoch = self.trainer_parameters['num_epoch']
n_sequences = max(int(self.trainer_parameters['batch_size'] / self.sequence_length), 1)
total_v, total_p = 0, 0
total_v, total_p = [], []
advantages = self.training_buffer.update_buffer['advantages'].get_batch()
self.training_buffer.update_buffer['advantages'].set(
(advantages - advantages.mean()) / (advantages.std() + 1e-10))

v_loss, p_loss, _ = self.sess.run(
[self.model.value_loss, self.model.policy_loss,
self.model.update_batch], feed_dict=feed_dict)
total_v += v_loss
total_p += p_loss
self.stats['value_loss'].append(total_v)
self.stats['policy_loss'].append(total_p)
total_v.append(v_loss)
total_p.append(np.abs(p_loss))
self.stats['value_loss'].append(np.mean(total_v))
self.stats['policy_loss'].append(np.mean(total_p))
self.training_buffer.reset_update_buffer()
def write_summary(self, lesson_number):
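
In isolation, the reporting pattern this commit adopts is: accumulate the per-minibatch losses in lists over the update epochs, then log their means (absolute value for the policy loss). A self-contained sketch with made-up loss values, not copied from trainer.py:

```python
import numpy as np

# Stand-in per-minibatch losses; in the trainer these come from sess.run
# over num_epoch passes through the update buffer.
minibatch_losses = [(0.42, -0.013), (0.39, 0.008), (0.35, -0.005)]

total_v, total_p = [], []
for v_loss, p_loss in minibatch_losses:
    total_v.append(v_loss)
    total_p.append(np.abs(p_loss))   # absolute policy loss, as in the commit

stats = {
    'value_loss': float(np.mean(total_v)),    # one mean value per trainer update
    'policy_loss': float(np.mean(total_p)),
}
print(stats)  # e.g. {'value_loss': 0.3866..., 'policy_loss': 0.0086...}
```

Averaging over the update rather than summing makes the curves comparable across runs with different batch_size and num_epoch settings, which is the point of the change.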
