Clear cumulative_returns_since_policy_update (#2120)

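Previously, the mean rewards written to the CSV file lagged far behind the values reported elsewhere in the code, because this buffer was never cleared: each mean was taken over every return accumulated so far rather than only those since the last policy update.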
6 years ago · a15763f8
--- a/ml-agents/mlagents/trainers/ppo/trainer.py
+++ b/ml-agents/mlagents/trainers/ppo/trainer.py
            number_experiences=len(self.training_buffer.update_buffer["actions"]),
            mean_return=float(np.mean(self.cumulative_returns_since_policy_update)),
        )
+        self.cumulative_returns_since_policy_update = []
        n_sequences = max(
            int(self.trainer_parameters["batch_size"] / self.policy.sequence_length), 1
        )
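
To illustrate why the added line matters, here is a minimal standalone sketch, not the trainer's actual code. The helper `mean_returns_per_update` and the sample batches are hypothetical, and `statistics.fmean` stands in for `np.mean` so the sketch needs only the standard library. It compares the mean reported after each policy update with and without clearing the buffer.

```python
from statistics import fmean


def mean_returns_per_update(returns_per_update, clear_buffer):
    """Return the mean return reported after each simulated policy update.

    Hypothetical helper: `buffer` plays the role of
    cumulative_returns_since_policy_update in the trainer.
    """
    buffer = []
    reported = []
    for new_returns in returns_per_update:
        # Episodes finished since the previous update add their returns.
        buffer.extend(new_returns)
        # The reported statistic is the mean over whatever the buffer holds.
        reported.append(fmean(buffer))
        if clear_buffer:
            # The fix: start from an empty buffer after every update.
            buffer = []
    return reported


# Made-up returns that improve from one policy update to the next.
batches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

stale = mean_returns_per_update(batches, clear_buffer=False)
fresh = mean_returns_per_update(batches, clear_buffer=True)

print(stale)  # [0.0, 0.5, 1.0]
print(fresh)  # [0.0, 1.0, 2.0]
```

Without clearing, old returns keep diluting the mean, so the reported value trails the policy's recent performance. With clearing, each value reflects only the episodes finished since the previous update.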