
Report means instead of totals for losses (#580)

* Report means instead of totals for losses.
* Report absolute loss for policy.
Branch: develop-generalizationTraining-TrainerController
vincentpierre, 7 years ago
Current commit: 076c8744
4 files changed, 405 insertions(+), 266 deletions(-)
1. docs/Training-PPO.md (6)
2. docs/Using-Tensorboard.md (6)
3. docs/images/mlagents-TensorBoard.png (649)
4. python/unitytrainers/ppo/trainer.py (10)

docs/Training-PPO.md (6)


#### Policy Loss
- These values will oscillate with training.
+ These values will oscillate during training. Generally they should be less than 1.0.

- These values should increase with the reward. They corresponds to how much future reward the agent predicts itself receiving at any given point.
+ These values should increase as the cumulative reward increases. They correspond to how much future reward the agent predicts itself receiving at any given point.

- These values will increase as the reward increases, and should decrease when reward becomes stable.
+ These values will increase as the reward increases, and then should decrease once reward becomes stable.
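
For context on what the value estimate tracks, here is a minimal sketch (not part of this commit; the function name is illustrative) of the discounted return the value head is trained to predict:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# As episode rewards grow, so does the return the value head has to predict,
# which is why the value estimate curve rises with the cumulative reward.
print(discounted_returns([0.0, 0.0, 1.0]))  # -> [0.9801 0.99   1.    ]
```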

docs/Using-Tensorboard.md (6)


  * Learning Rate - How large a step the training algorithm takes as it searches
    for the optimal policy. Should decrease over time.
- * Policy Loss - The mean loss of the policy function update. Correlates to how
+ * Policy Loss - The mean magnitude of policy loss function. Correlates to how
    much the policy (process for deciding actions) is changing. The magnitude of
    this should decrease during a successful training session.

  * Value Loss - The mean loss of the value function update. Correlates to how
-   well the model is able to predict the value of each state. This should decrease
-   during a successful training session.
+   well the model is able to predict the value of each state. This should increase
+   while the agent is learning, and then decrease once the reward stabilizes.
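
To make the two reported quantities concrete, a minimal sketch of what they measure, assuming the standard clipped-surrogate PPO objective; the function and argument names are illustrative, not the ml-agents API:

```python
import numpy as np

def ppo_losses(new_log_probs, old_log_probs, advantages,
               value_preds, returns, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Policy loss: negative clipped surrogate objective; reporting its mean
    # magnitude (np.abs) keeps the TensorBoard curve positive across updates.
    policy_loss = np.abs(-np.mean(np.minimum(unclipped, clipped)))
    # Value loss: mean squared error between predicted values and observed returns.
    value_loss = np.mean((returns - value_preds) ** 2)
    return policy_loss, value_loss

# Example call with toy numbers for a batch of 3 samples.
p, v = ppo_losses(np.log([0.5, 0.6, 0.4]), np.log([0.5, 0.5, 0.5]),
                  advantages=np.array([1.0, -0.5, 0.2]),
                  value_preds=np.array([0.8, 0.9, 1.1]),
                  returns=np.array([1.0, 1.0, 1.0]))
print(p, v)
```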

docs/images/mlagents-TensorBoard.png (649)

Before | After
Width: 1395 | Height: 548 | Size: 138 KiB

python/unitytrainers/ppo/trainer.py (10)


"""
num_epoch = self.trainer_parameters['num_epoch']
n_sequences = max(int(self.trainer_parameters['batch_size'] / self.sequence_length), 1)
total_v, total_p = 0, 0
total_v, total_p = [], []
advantages = self.training_buffer.update_buffer['advantages'].get_batch()
self.training_buffer.update_buffer['advantages'].set(
(advantages - advantages.mean()) / (advantages.std() + 1e-10))

v_loss, p_loss, _ = self.sess.run(
[self.model.value_loss, self.model.policy_loss,
self.model.update_batch], feed_dict=feed_dict)
total_v += v_loss
total_p += p_loss
self.stats['value_loss'].append(total_v)
self.stats['policy_loss'].append(total_p)
total_v.append(v_loss)
total_p.append(np.abs(p_loss))
self.stats['value_loss'].append(np.mean(total_v))
self.stats['policy_loss'].append(np.mean(total_p))
self.training_buffer.reset_update_buffer()
def write_summary(self, lesson_number):
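
In isolation, the reporting pattern this commit adopts is: accumulate the per-minibatch losses in lists over the update epochs, then log their means (absolute value for the policy loss). A self-contained sketch with made-up loss values, not copied from trainer.py:

```python
import numpy as np

# Stand-in per-minibatch losses; in the trainer these come from sess.run
# over num_epoch passes through the update buffer.
minibatch_losses = [(0.42, -0.013), (0.39, 0.008), (0.35, -0.005)]

total_v, total_p = [], []
for v_loss, p_loss in minibatch_losses:
    total_v.append(v_loss)
    total_p.append(np.abs(p_loss))   # absolute policy loss, as in the commit

stats = {
    'value_loss': float(np.mean(total_v)),    # one mean value per trainer update
    'policy_loss': float(np.mean(total_p)),
}
print(stats)  # e.g. {'value_loss': 0.3866..., 'policy_loss': 0.0086...}
```

Averaging over the update rather than summing makes the curves comparable across runs with different batch_size and num_epoch settings, which is the point of the change.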
