Fixes shuffling issue with newer versions of numpy (#1798).
* make get_value_estimates output a dict of floats
* Use np.append instead of convert to list, unconvert
* Add type hints and test for get_value_estimates
Based on the new reward signals architecture, add BC pretrainer and GAIL for PPO. Main changes:
- A new GAILRewardSignal and GAILModel for GAIL/VAIL
- A BCModule component (not a reward signal) to do pretraining during RL
- Documentation for both of these
- Change to Demo Loader that lets you load multiple demo files in a folder
- Example Demo files for all of our tested sample environments (for future regression testing)
* Using-Docker.md miss a backslash in 3DBall command
Hi,
Just a quick edit because a backslash seems to be missing from the 3DBall command example.
* Added interactive options and Tensorboard documentation for Docker training
* Timer proof-of-concept
* micro optimizations
* add some timers
* cleanup, add asserts
* Cleanup (no start/end methods) and handle exceptions
* unit test and decorator
* move output code, add a decorator
* cleanup
* module docstring
* actually write the timings when done with training
* use __qualname__ instead
* add a few more timers
* fix mock import
* fix unit test
* don't need fwd reference
* cleanup root
* always write timers, add comments
* undo accidental change
TrainerController depended on an external_brains dictionary with
brain params in its constructor but only used it in a single function
call. The same function call (start_learning) takes the environment
as an argument, which is the source of the external_brains.
This change removes the dependency of TrainerController on external
brains and removes the two class members related to external_brains
and retrieves the brains directly from the environment.
Previously in v0.8 we added parallel environments via the
SubprocessUnityEnvironment, which exposed the same abstraction as
UnityEnvironment while actually wrapping many parallel environments
via subprocesses.
Wrapping many environments with the same interface as a single
environment had some downsides, however:
* Ordering needed to be preserved for agents across different envs,
complicating the SubprocessEnvironment logic
* Asynchronous environments with steps taken out of sync with the
trainer aren't viable with the Environment abstraction
This PR introduces a new EnvManager abstraction which exposes a
reduced subset of the UnityEnvironment abstraction and a
SubprocessEnvManager implementation which replaces the
SubprocessUnityEnvironment.
At each step, an unused `last_reward` variable in the TF graph is
updated in our PPO trainer. There are also related unused methods
in various places in the codebase. This change removes them.
* Create new class (RewardSignal) that represents a reward signal.
* Add value heads for each reward signal in the PPO model.
* Make summaries agnostic to the type of reward signals, and log weighted rewards per reward signal.
* Move extrinsic and curiosity rewards into this new structure.
* Allow defining multiple reward signals in YAML file. Add documentation for this new structure.