
Update docs

/develop/sac-apex
Ervin Teng, 5 years ago
Current commit: 8bf8c9a9
3 files changed, with 19 additions and 20 deletions
  1. docs/Migrating.md (11 changes)
  2. docs/Training-ML-Agents.md (2 changes)
  3. docs/Training-SAC.md (26 changes)

docs/Migrating.md (11 changes)


* The `--load` and `--train` command-line flags have been deprecated and replaced with `--resume` and `--inference`.
* Running with the same `--run-id` twice will now throw an error.
* The `play_against_current_self_ratio` self-play trainer hyperparameter has been renamed to `play_against_latest_model_ratio`.
* The Jupyter notebooks have been removed from the repository.
* `Academy.FloatProperties` was removed.
* `Academy.RegisterSideChannel` and `Academy.UnregisterSideChannel` were removed.
* `num_updates` and `train_interval` for SAC have been replaced with `steps_per_update`.
### Steps to Migrate
* `steps_per_update` should be roughly equal to the number of agents in your environment, multiplied by `num_updates` and divided by `train_interval`, as in the sketch below.
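
As a rough illustration of that rule of thumb, here is a minimal before/after sketch of a SAC trainer config block. The behavior name, the 12-agent environment, and every numeric value are hypothetical; only the keys `trainer`, `train_interval`, `num_update`, and `steps_per_update` come from these docs.

```yaml
BehaviorName:                # hypothetical behavior block in a trainer config YAML
  trainer: sac
  # Old settings (now removed): train_interval: 2, num_update: 1
  # For a hypothetical environment with 12 agents:
  #   steps_per_update ~= 12 agents * 1 num_update / 2 train_interval = 6
  steps_per_update: 6
```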
## Migrating from 0.14 to 0.15

docs/Training-ML-Agents.md (2 changes)


| Setting | Description | Applies To Trainer\* |
| :------ | :---------- | :------------------- |
| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC |
| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC |
| train_interval | How often to update the agent. | SAC |
| num_update | Number of mini-batches to update the agent with during each update. | SAC |
| steps_per_update | Ratio of agent steps per mini-batch update. | SAC |
| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC |
\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning
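
To show roughly where these settings live, here is a minimal sketch of one behavior's block in a trainer config file (e.g. `trainer_config.yaml`). The behavior name and all values are made up; only the keys are taken from the table above.

```yaml
BehaviorName:              # hypothetical name of the behavior being trained
  trainer: sac             # "ppo", "sac", "offline_bc" or "online_bc"
  time_horizon: 64         # steps of experience collected per agent before adding to the buffer
  steps_per_update: 10     # SAC only: agent steps per mini-batch update
  use_recurrent: false     # set true to train with a recurrent network
```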

docs/Training-SAC.md (26 changes)


#### Number of Updates for Reward Signal (Optional)

Before: `reward_signal_num_update` for the reward signals corresponds to the number of mini batches sampled and used for updating the reward signals during each update. By default, we update the reward signals once every time the main policy is updated. However, in some cases we may want to update the policy N times, then update the reward signal (GAIL) M times. We can change `train_interval` and `num_update` of SAC to N, as well as `reward_signal_num_update` under `reward_signals` to M, to accomplish this. By default, `reward_signal_num_update` is set to `num_update`.

Typical Range: `num_update`

After: `reward_signal_steps_per_update` for the reward signals corresponds to the number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, in some cases we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M, to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`.

Typical Range: `steps_per_update`
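
As a minimal sketch of the N and N / M relationship above, assuming a hypothetical GAIL setup: the demo path and all numbers are illustrative, and the placement of `reward_signal_steps_per_update` under `reward_signals` simply follows the wording above rather than a verified schema.

```yaml
BehaviorName:
  trainer: sac
  steps_per_update: 16                  # N: one policy update per ~16 agent steps
  reward_signals:
    reward_signal_steps_per_update: 8   # N / M with M = 2, i.e. two GAIL updates per policy update
    gail:
      strength: 1.0                     # illustrative value
      demo_path: demos/Expert.demo      # hypothetical demo file
```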
### Buffer Size

Typical Range: `1` - `5`
### Number of Updates

`num_update` corresponds to the number of mini batches sampled and used for training during each training event. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Typically, this can be left at 1. To perform N mini-batch updates every N steps, we can change `train_interval` and `num_update` to N.

Typical Range: `1`

### Steps Per Update

`steps_per_update` corresponds to the number of agent steps (actions) taken for each mini-batch sampled and used during training. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Typically, this should be greater than 1. If we want more than one update per agent step, we can set `steps_per_update` lower than 1 to accomplish this.

Typical Range: `10` - `20`
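
For intuition about this ratio, a small sketch with hypothetical numbers (the agent count and all values are made up):

```yaml
BehaviorName:
  trainer: sac
  # Suppose the environment contains 20 agents, so each environment step
  # produces 20 agent steps of experience.
  steps_per_update: 10   # one mini-batch update per 10 agent steps,
                         # i.e. about 2 updates per environment step on average
```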
### Tau
