
Set ignore done=False in GAIL (#4971)

/develop/gail-srl-hack
GitHub, 4 years ago
Current commit
4d5545c8
17 files changed, with 116 insertions and 59 deletions
  1. com.unity.ml-agents/CHANGELOG.md (4 lines changed)
  2. config/imitation/CrawlerStatic.yaml (6 lines changed)
  3. config/imitation/FoodCollector.yaml (6 lines changed)
  4. config/imitation/Hallway.yaml (3 lines changed)
  5. config/imitation/PushBlock.yaml (18 lines changed)
  6. config/imitation/Pyramids.yaml (4 lines changed)
  7. config/ppo/Pyramids.yaml (3 lines changed)
  8. config/ppo/PyramidsRND.yaml (4 lines changed)
  9. config/ppo/VisualPyramids.yaml (3 lines changed)
  10. config/sac/Pyramids.yaml (1 line changed)
  11. config/sac/VisualPyramids.yaml (1 line changed)
  12. docs/ML-Agents-Overview.md (23 lines changed)
  13. docs/Training-Configuration-File.md (6 lines changed)
  14. ml-agents/mlagents/trainers/settings.py (20 lines changed)
  15. ml-agents/mlagents/trainers/torch/components/reward_providers/curiosity_reward_provider.py (28 lines changed)
  16. ml-agents/mlagents/trainers/torch/components/reward_providers/gail_reward_provider.py (27 lines changed)
  17. ml-agents/mlagents/trainers/torch/components/reward_providers/rnd_reward_provider.py (18 lines changed)

4
com.unity.ml-agents/CHANGELOG.md


### Minor Changes
#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
#### ml-agents / ml-agents-envs / gym-unity (Python)
- The `encoding_size` setting for RewardSignals has been deprecated. Please use `network_settings` instead. (#4982)
- An issue that caused `GAIL` to fail for environments where agents can terminate episodes by self-sacrifice has been fixed. (#4971)
## [1.8.0-preview] - 2021-02-17
### Major Changes

6
config/imitation/CrawlerStatic.yaml


gail:
gamma: 0.99
strength: 1.0
encoding_size: 128
network_settings:
normalize: true
hidden_units: 128
num_layers: 2
vis_encode_type: simple
learning_rate: 0.0003
use_actions: false
use_vail: false

6
config/imitation/FoodCollector.yaml


gail:
gamma: 0.99
strength: 0.1
encoding_size: 128
network_settings:
normalize: false
hidden_units: 128
num_layers: 2
vis_encode_type: simple
learning_rate: 0.0003
use_actions: false
use_vail: false

3
config/imitation/Hallway.yaml


strength: 1.0
gail:
gamma: 0.99
strength: 0.1
encoding_size: 128
strength: 0.01
learning_rate: 0.0003
use_actions: false
use_vail: false

18
config/imitation/PushBlock.yaml


num_layers: 2
vis_encode_type: simple
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
strength: 1.0
encoding_size: 128
strength: 0.01
network_settings:
normalize: false
hidden_units: 128
num_layers: 2
vis_encode_type: simple
max_steps: 15000000
max_steps: 1000000
behavioral_cloning:
demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
steps: 50000
strength: 1.0
samples_per_update: 0

4
config/imitation/Pyramids.yaml


curiosity:
strength: 0.02
gamma: 0.99
encoding_size: 256
network_settings:
hidden_units: 256
encoding_size: 128
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
behavioral_cloning:
demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo

3
config/ppo/Pyramids.yaml


curiosity:
gamma: 0.99
strength: 0.02
encoding_size: 256
network_settings:
hidden_units: 256
learning_rate: 0.0003
keep_checkpoints: 5
max_steps: 10000000

4
config/ppo/PyramidsRND.yaml


rnd:
gamma: 0.99
strength: 0.01
encoding_size: 64
network_settings:
hidden_units: 64
framework: pytorch
threaded: true

3
config/ppo/VisualPyramids.yaml


curiosity:
gamma: 0.99
strength: 0.01
encoding_size: 256
network_settings:
hidden_units: 256
learning_rate: 0.0003
keep_checkpoints: 5
max_steps: 10000000

1
config/sac/Pyramids.yaml


gail:
gamma: 0.99
strength: 0.01
encoding_size: 128
learning_rate: 0.0003
use_actions: true
use_vail: false

1
config/sac/VisualPyramids.yaml


gail:
gamma: 0.99
strength: 0.02
encoding_size: 128
learning_rate: 0.0003
use_actions: true
use_vail: false

23
docs/ML-Agents-Overview.md


- If you want to help your agents learn (especially with environments that have
sparse rewards) using pre-recorded demonstrations, you can generally enable
both GAIL and Behavioral Cloning at low strengths in addition to having an
extrinsic reward. An example of this is provided for the Pyramids example
environment under `PyramidsLearning` in `config/gail_config.yaml`.
- If you want to train purely from demonstrations, GAIL and BC _without_ an
extrinsic reward signal is the preferred approach. An example of this is
provided for the Crawler example environment under `CrawlerStaticLearning` in
`config/gail_config.yaml`.
extrinsic reward. An example of this is provided for the PushBlock example
environment in `config/imitation/PushBlock.yaml`.
- If you want to train purely from demonstrations with GAIL and BC _without_ an
extrinsic reward signal, please see the CrawlerStatic example environment in
`config/imitation/CrawlerStatic.yaml`.
***Note:*** GAIL introduces a [_survivor bias_](https://arxiv.org/pdf/1809.02925.pdf)
to the learning process. That is, by giving positive rewards based on similarity
to the expert, the agent is incentivized to remain alive for as long as possible.
This can directly conflict with goal-oriented tasks like our PushBlock or Pyramids
example environments where an agent must reach a goal state thus ending the
episode as quickly as possible. In these cases, we strongly recommend that you
use a low strength GAIL reward signal and a sparse extrinsic signal when
the agent achieves the task. This way, the GAIL reward signal will guide the
agent until it discovers the extrinsic signal and will not overpower it. If the
agent appears to be ignoring the extrinsic reward signal, you should reduce
the strength of GAIL.
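
To make the survivor-bias note above concrete, here is a minimal sketch (not ML-Agents code) that compares the discounted return of a short episode that reaches the goal with a long episode that never does. The constant per-step imitation reward of 1.0, the goal reward of 1.0, the episode lengths, and the helper names `discounted_return` and `episode_return` are all illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a single episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def episode_return(steps, reaches_goal, gail_strength, extrinsic_strength=1.0):
    """Return of an episode with a constant positive imitation reward per step
    and a sparse extrinsic reward of 1.0 on the last step if the goal is reached.
    All magnitudes here are hypothetical."""
    rewards = [gail_strength * 1.0 for _ in range(steps)]
    if reaches_goal:
        rewards[-1] += extrinsic_strength * 1.0
    return discounted_return(rewards)


# Hypothetical comparison: finish in 20 steps vs. stay alive for 200 steps.
for strength in (1.0, 0.01):
    fast = episode_return(20, True, strength)
    slow = episode_return(200, False, strength)
    print(f"gail strength={strength}: fast={fast:.3f} slow={slow:.3f}")
```

Under these assumptions, a GAIL strength of 1.0 makes the long, goal-less episode the more valuable one, while a strength of 0.01 lets the sparse extrinsic reward dominate.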
#### GAIL (Generative Adversarial Imitation Learning)

6
docs/Training-Configuration-File.md


| :--------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `curiosity -> strength` | (default = `1.0`) Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.1` |
| `curiosity -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
| `curiosity -> encoding_size` | (default = `64`) Size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
| `curiosity -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs used by the intrinsic curiosity model. The value of `hidden_units` should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
| `curiosity -> learning_rate` | (default = `3e-4`) Learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |
### GAIL Intrinsic Reward

| `gail -> strength` | (default = `1.0`) Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br>Typical range: `0.01` - `1.0` |
| `gail -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.9` |
| `gail -> demo_path` | (Required, no default) The path to your .demo file or directory of .demo files. |
| `gail -> encoding_size` | (default = `64`) Size of the hidden layer used by the discriminator. This value should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br>Typical range: `64` - `256` |
| `gail -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs for the GAIL discriminator. The value of `hidden_units` should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br>Typical range: `64` - `256` |
| `gail -> learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |
| `gail -> use_actions` | (default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
| `gail -> use_vail` | (default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |

| :--------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `rnd -> strength` | (default = `1.0`) Magnitude of the curiosity reward generated by the intrinsic rnd module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.01` |
| `rnd -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
| `rnd -> encoding_size` | (default = `64`) Size of the encoding used by the intrinsic RND model. <br><br>Typical range: `64` - `256` |
| `rnd -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs for the RND model. |
| `rnd -> learning_rate` | (default = `3e-4`) Learning rate used to update the RND module. This should be large enough for the RND module to quickly learn the state representation, but small enough to allow for stable learning. <br><br>Typical range: `1e-5` - `1e-3` |

20
ml-agents/mlagents/trainers/settings.py


class RewardSignalSettings:
gamma: float = 0.99
strength: float = 1.0
network_settings: NetworkSettings = attr.ib(factory=NetworkSettings)
@staticmethod
def structure(d: Mapping, t: type) -> Any:

enum_key = RewardSignalType(key)
t = enum_key.to_settings()
d_final[enum_key] = strict_to_cls(val, t)
# Checks to see if user specifying deprecated encoding_size for RewardSignals.
# If network_settings is not specified, this updates the default hidden_units
# to the value of encoding size. If specified, this ignores encoding size and
# uses network_settings values.
if "encoding_size" in val:
logger.warning(
"'encoding_size' was deprecated for RewardSignals. Please use network_settings."
)
# If network settings was not specified, use the encoding size. Otherwise, use hidden_units
if "network_settings" not in val:
d_final[enum_key].network_settings.hidden_units = val[
"encoding_size"
]
encoding_size: int = 64
encoding_size: Optional[int] = None
use_actions: bool = False
use_vail: bool = False
demo_path: str = attr.ib(kw_only=True)

class CuriositySettings(RewardSignalSettings):
encoding_size: int = 64
encoding_size: Optional[int] = None
encoding_size: int = 64
encoding_size: Optional[int] = None
# SAMPLERS #############################################################################
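
The deprecation fallback in the `settings.py` hunk above can be sketched in isolation as follows. This is a simplified stand-in that uses plain dataclasses instead of the `attr`/`strict_to_cls` machinery in the real file. The class names `NetworkSettingsSketch` and `RewardSignalSketch`, the function `structure_reward_signal`, and the default of 128 hidden units are assumptions made for illustration.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


@dataclass
class NetworkSettingsSketch:
    # Hypothetical defaults; the real NetworkSettings has more fields.
    hidden_units: int = 128
    num_layers: int = 2


@dataclass
class RewardSignalSketch:
    gamma: float = 0.99
    strength: float = 1.0
    network_settings: NetworkSettingsSketch = field(
        default_factory=NetworkSettingsSketch
    )
    encoding_size: Optional[int] = None


def structure_reward_signal(val: Dict[str, Any]) -> RewardSignalSketch:
    """Build settings from a config mapping, honoring the deprecated key."""
    settings = RewardSignalSketch(
        gamma=val.get("gamma", 0.99),
        strength=val.get("strength", 1.0),
        network_settings=NetworkSettingsSketch(**val.get("network_settings", {})),
        encoding_size=val.get("encoding_size"),
    )
    if "encoding_size" in val:
        logger.warning(
            "'encoding_size' was deprecated for RewardSignals. "
            "Please use network_settings."
        )
        # Only fall back to encoding_size when network_settings is absent.
        if "network_settings" not in val:
            settings.network_settings.hidden_units = val["encoding_size"]
    return settings


legacy = structure_reward_signal({"encoding_size": 64})
explicit = structure_reward_signal(
    {"encoding_size": 64, "network_settings": {"hidden_units": 256}}
)
print(legacy.network_settings.hidden_units, explicit.network_settings.hidden_units)
```

In this sketch an explicit `network_settings` always wins, and `encoding_size` only changes `hidden_units` for legacy configs that never mention `network_settings`.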

28
ml-agents/mlagents/trainers/torch/components/reward_providers/curiosity_reward_provider.py


from mlagents.trainers.settings import CuriositySettings
from mlagents_envs.base_env import BehaviorSpec
from mlagents_envs import logging_util
from mlagents.trainers.settings import NetworkSettings, EncoderType
logger = logging_util.get_logger(__name__)
class ActionPredictionTuple(NamedTuple):

def __init__(self, specs: BehaviorSpec, settings: CuriositySettings) -> None:
super().__init__()
self._action_spec = specs.action_spec
state_encoder_settings = NetworkSettings(
normalize=False,
hidden_units=settings.encoding_size,
num_layers=2,
vis_encode_type=EncoderType.SIMPLE,
memory=None,
)
state_encoder_settings = settings.network_settings
if state_encoder_settings.memory is not None:
state_encoder_settings.memory = None
logger.warning(
"memory was specified in network_settings but is not supported by Curiosity. It is being ignored."
)
self._state_encoder = NetworkBody(
specs.observation_specs, state_encoder_settings
)

self.inverse_model_action_encoding = torch.nn.Sequential(
LinearEncoder(2 * settings.encoding_size, 1, 256)
LinearEncoder(2 * state_encoder_settings.hidden_units, 1, 256)
)
if self._action_spec.continuous_size > 0:

self.forward_model_next_state_prediction = torch.nn.Sequential(
LinearEncoder(
settings.encoding_size + self._action_flattener.flattened_size, 1, 256
state_encoder_settings.hidden_units
+ self._action_flattener.flattened_size,
1,
256,
linear_layer(256, settings.encoding_size),
linear_layer(256, state_encoder_settings.hidden_units),
)
def get_current_state(self, mini_batch: AgentBuffer) -> torch.Tensor:
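
As a quick check on the layer sizes in the hunk above, the following sketch computes the input and output widths that follow from `hidden_units` once it replaces `encoding_size`. It is plain arithmetic rather than the actual PyTorch modules. The names `CuriositySizes` and `curiosity_layer_sizes` are hypothetical, and reading the factor of two as "current plus next state encoding" is an assumption based on how inverse models are usually built.

```python
from typing import NamedTuple


class CuriositySizes(NamedTuple):
    inverse_model_input: int   # width fed to the inverse model encoder
    forward_model_input: int   # width fed to the forward model encoder
    forward_model_output: int  # width of the predicted next-state encoding


def curiosity_layer_sizes(
    hidden_units: int, flattened_action_size: int
) -> CuriositySizes:
    """Derive layer widths from the state encoder's hidden_units."""
    return CuriositySizes(
        # Assumed: two state encodings concatenated.
        inverse_model_input=2 * hidden_units,
        # One state encoding concatenated with the flattened action.
        forward_model_input=hidden_units + flattened_action_size,
        # The forward model predicts an encoding of the same width.
        forward_model_output=hidden_units,
    )


sizes = curiosity_layer_sizes(hidden_units=256, flattened_action_size=3)
print(sizes)
```

Because every width is derived from `hidden_units`, changing `network_settings.hidden_units` in the YAML config resizes the inverse and forward models together.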

27
ml-agents/mlagents/trainers/torch/components/reward_providers/gail_reward_provider.py


)
from mlagents.trainers.settings import GAILSettings
from mlagents_envs.base_env import BehaviorSpec
from mlagents_envs import logging_util
from mlagents.trainers.settings import NetworkSettings, EncoderType
logger = logging_util.get_logger(__name__)
self._ignore_done = True
self._ignore_done = False
self._discriminator_network = DiscriminatorNetwork(specs, settings)
self._discriminator_network.to(default_device())
_, self._demo_buffer = demo_to_buffer(

)
def update(self, mini_batch: AgentBuffer) -> Dict[str, np.ndarray]:
self._discriminator_network.encoder.update_normalization(expert_batch)
loss, stats_dict = self._discriminator_network.compute_loss(
mini_batch, expert_batch
)

self._use_vail = settings.use_vail
self._settings = settings
encoder_settings = NetworkSettings(
normalize=False,
hidden_units=settings.encoding_size,
num_layers=2,
vis_encode_type=EncoderType.SIMPLE,
memory=None,
)
encoder_settings = settings.network_settings
if encoder_settings.memory is not None:
encoder_settings.memory = None
logger.warning(
"memory was specified in network_settings but is not supported by GAIL. It is being ignored."
)
self._action_flattener = ActionFlattener(specs.action_spec)
unencoded_size = (
self._action_flattener.flattened_size + 1 if settings.use_actions else 0

)
estimator_input_size = settings.encoding_size
estimator_input_size = encoder_settings.hidden_units
if settings.use_vail:
estimator_input_size = self.z_size
self._z_sigma = torch.nn.Parameter(

settings.encoding_size,
encoder_settings.hidden_units,
self.z_size,
kernel_init=Initialization.KaimingHeNormal,
kernel_gain=0.1,
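
The diff above only shows `_ignore_done` flipping from `True` to `False`. How the trainer consumes the flag is not part of this commit view. The sketch below assumes the flag decides whether the value estimate is still bootstrapped after a terminal step. Under that assumption, `ignore_done=False` zeroes the bootstrap value when an episode ends, so ending an episode is no longer treated as if imitation rewards kept flowing. The helper names `bootstrap_value` and `one_step_target` are hypothetical.

```python
def bootstrap_value(next_value_estimate: float, done: bool, ignore_done: bool) -> float:
    """Value used to bootstrap the return at the end of a trajectory.

    Assumed semantics: a terminal step contributes no future value unless
    the reward signal asks for `done` to be ignored.
    """
    if done and not ignore_done:
        return 0.0
    return next_value_estimate


def one_step_target(
    reward: float,
    next_value_estimate: float,
    done: bool,
    ignore_done: bool,
    gamma: float = 0.99,
) -> float:
    """One-step return target: r + gamma * bootstrapped value."""
    return reward + gamma * bootstrap_value(next_value_estimate, done, ignore_done)


# Hypothetical terminal step with reward 0.5 and a next-state estimate of 10.0.
print(one_step_target(0.5, 10.0, done=True, ignore_done=True))
print(one_step_target(0.5, 10.0, done=True, ignore_done=False))
```

With `ignore_done=True` the terminal step still inherits the next-state estimate; with `ignore_done=False` the target reduces to the immediate reward.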

18
ml-agents/mlagents/trainers/torch/components/reward_providers/rnd_reward_provider.py


from mlagents.trainers.settings import RNDSettings
from mlagents_envs.base_env import BehaviorSpec
from mlagents_envs import logging_util
from mlagents.trainers.settings import NetworkSettings, EncoderType
logger = logging_util.get_logger(__name__)
class RNDRewardProvider(BaseRewardProvider):

def __init__(self, specs: BehaviorSpec, settings: RNDSettings) -> None:
super().__init__()
state_encoder_settings = NetworkSettings(
normalize=True,
hidden_units=settings.encoding_size,
num_layers=3,
vis_encode_type=EncoderType.SIMPLE,
memory=None,
)
state_encoder_settings = settings.network_settings
if state_encoder_settings.memory is not None:
state_encoder_settings.memory = None
logger.warning(
"memory was specified in network_settings but is not supported by RND. It is being ignored."
)
self._encoder = NetworkBody(specs.observation_specs, state_encoder_settings)
def forward(self, mini_batch: AgentBuffer) -> torch.Tensor:
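
All three reward providers in this commit apply the same guard: take `settings.network_settings`, and if `memory` is set, clear it and log a warning. The hunks do this by mutating the settings object in place. The sketch below shows a non-mutating variant of the same idea, which is a swapped-in design and not what the commit does. `NetworkSettingsSketch` and `encoder_settings_without_memory` are hypothetical names, and the fields are a reduced stand-in for the real `NetworkSettings`.

```python
import logging
from dataclasses import dataclass, replace
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class NetworkSettingsSketch:
    # Reduced, hypothetical stand-in for the real NetworkSettings.
    hidden_units: int = 128
    num_layers: int = 2
    memory: Optional[dict] = None


def encoder_settings_without_memory(
    settings: NetworkSettingsSketch, provider_name: str
) -> NetworkSettingsSketch:
    """Return settings usable by a reward provider that has no memory support."""
    if settings.memory is None:
        return settings
    logger.warning(
        "memory was specified in network_settings but is not supported by %s. "
        "It is being ignored.",
        provider_name,
    )
    # Copy instead of mutating, so the caller's object is left untouched.
    return replace(settings, memory=None)


user_settings = NetworkSettingsSketch(hidden_units=64, memory={"memory_size": 128})
encoder_settings = encoder_settings_without_memory(user_settings, "RND")
print(encoder_settings.memory, user_settings.memory)
```

Returning a copy keeps the user's original `memory` setting intact if the same settings object is inspected elsewhere; whether that matters for ML-Agents is not something this diff shows.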
