* Implement behavioral cloning for cc/dc, fc/rnn, state/observations.
* Re-organize folder structure in anticipation of unitytrainers as a package.
* Create demo environment BananaImitation to validate behavioral cloning.
* Fixes #336
* Reorganized Python tests into a separate folder, and made individual test files for different (sub)modules.
* Add tests for trainer_controller, PPO, and behavioral cloning. More to come soon.
* Minor bug fixes discovered while writing tests.
* Reworked GridWorld to reset much faster.
* Cleaned ObservationToTex and reworked GetObservationMatrixList to be 3x faster.
* On Demand Decision: Use RequestDecision and RequestAction
* New Agent Inspector: Use it to set On Demand Decision
* New BrainParameters interface
* LSTM memory size is now set in python
* New C# API
* Semantic Changes
* Replaced RunMDP
* New Bouncer Environment to test On Demand Decision
* [Previous Text Actions] Renamed previous_action to previous_vector_action and added previous_text_action to the BrainInfo
* [Semantics] Carried the modifications to the semantics of previous_vector_action to the trainers
* Fixes internal brain for Banana Imitation.
* Fixes Discrete Control training for Imitation Learning.
* Fixes Visual Observations in internal brain with non-square inputs.
Fixes the following issues:
* Missing component reference in BananaRL environment.
* Neural Network for multiple visual observations was not properly generated.
* Episode time-out value estimate bootstrapping used incorrect observation as input.
This PR makes the following changes:
* Moves clipping of continuous control model into model itself. Output is now always [-1, 1].
* Internal model values are now clipped between [-3, 3] before being rescaled to [-1, 1] for output. This improves training performance by providing a wider range of values within which the pdf of the Gaussian can fall. An output range of [-1, 1] is used to be more environment-creator friendly.
* Fixes issue where epsilon was erroneously being used to reconstruct old probabilities during PPO update, leading to reduced learning performance.
* Introduce ScaleAction() function within Python to easily rescale values from [-1, 1] to an arbitrary range (see the sketch after this list).
* Re-train all CC models using the improved algorithm. All performance levels are equal or improved; in the case of Crawler, the improvement is drastic.
* Update documentation appropriately.
* Made miscellaneous minor code style and optimization improvements within environments.
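As a rough illustration of the clipping and rescaling described above (the names `clip_action` and `scale_action` are hypothetical; only the ranges come from this PR):

```python
import numpy as np

def clip_action(pre_activation: np.ndarray) -> np.ndarray:
    # Internal model values are clipped to [-3, 3] and rescaled to [-1, 1] for output.
    return np.clip(pre_activation, -3.0, 3.0) / 3.0

def scale_action(action: np.ndarray, low: float, high: float) -> np.ndarray:
    # Rescale an action in [-1, 1] to an arbitrary [low, high] range,
    # in the spirit of the ScaleAction() helper mentioned above.
    return low + (action + 1.0) * 0.5 * (high - low)

# Example: a raw value of 1.5 is clipped/rescaled to 0.5, then mapped to [0, 10] -> 7.5.
print(scale_action(clip_action(np.array([1.5])), 0.0, 10.0))
```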
* [Cold Fix] Split the way cumulative rewards and episode length are counted.
The reward is appended at each step to the cumulative reward.
The episode length is ONLY incremented when d_t+1 is false (see the sketch below).
Fixes the issue raised by @hsaikia in #552
Added the memory_size variable to the BC model
Added memory_size and recurrent_out to the output nodes of the graph when using BC with LSTM
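A minimal sketch of the counting split described above; the function and variable names are hypothetical:

```python
def update_episode_stats(cumulative_reward: float, episode_length: int,
                         reward: float, next_done: bool):
    # The reward is appended to the cumulative reward at every step.
    cumulative_reward += reward
    # The episode length is only incremented while the episode continues,
    # i.e. while d_{t+1} is false.
    if not next_done:
        episode_length += 1
    return cumulative_reward, episode_length
```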
* Adds implementation of Curiosity-driven Exploration by Self-supervised Prediction (https://arxiv.org/abs/1705.05363) to PPO trainer.
* To enable, set the use_curiosity flag to true in the trainer hyperparameter file (see the example after this list).
* Includes refactor of unitytrainers model code to accommodate new feature.
* Adds a new Pyramids environment (with documentation). The environment contains a sparse reward and can only be solved using PPO + Curiosity.
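An illustrative view of what the curiosity-related hyperparameters might look like once the trainer config is parsed; only use_curiosity is confirmed by this PR, the other keys and values are assumptions:

```python
# Hypothetical parsed trainer configuration (normally read from the YAML hyperparameter file).
pyramids_config = {
    "trainer": "ppo",
    "use_curiosity": True,       # enables the curiosity reward signal in the PPO trainer
    "curiosity_strength": 0.01,  # illustrative weight of the intrinsic reward
    "curiosity_enc_size": 256,   # illustrative size of the curiosity-module encoding
}
```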
In the case the agent is done immediately after spawning, its stats are empty because at least 2 successive experiences are needed to create the stats.
By specifying a default value of 0, the error no longer appears.
* [Initial Commit] Modified the model.py file and the ppo/trainer.py file to use masked actions
* Preliminary modifications to the Python side of the code to enable action masking (a minimal sketch follows after this list)
* Preliminary modifications to the C# side of the code to enable action masking
* Preliminary modifications to the communication side of the code to enable action masking
* Implemented action masking for BC (note: the actions of the teacher are not masked)
* More error messages for the action masking
* fix pytests
* Added Documentation
* Address comment
* Addressed Comments on docs
* Addressed second comment on docs
* Addressed comments for the python side of the code
* Created the action masker and associated unit tests
* Addressed comments on the C# side
* Addressed the comment regarding action_masking_name
* Addressed the comments
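A minimal sketch of how discrete-action masking is typically applied to logits on the Python side; the mask convention (1 = allowed) and function name are assumptions, not the exact model.py code:

```python
import numpy as np

def masked_action_probabilities(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    # Push the logits of disallowed actions to a large negative value so their
    # probability after the softmax is effectively zero.
    masked_logits = np.where(action_mask.astype(bool), logits, -1e8)
    exp = np.exp(masked_logits - masked_logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Example: the second of three actions is masked out.
print(masked_action_probabilities(np.array([1.0, 2.0, 0.5]), np.array([1, 0, 1])))
```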
* Initial Commit
* attempt at refactor
* Put all static methods into the CoreInternalBrain
* improvements
* more testing
* modifications
* renamed epsilon
* misc
* Now supports discrete actions
* Added discrete, RNN, and visual support. Left to do: refactor and save variables into the models
* code cleaning
* made a tensor generator and applier
* fix on the models.py file
* Moved the Checks to a different Class
* Added some unit tests
* BugFix
* Need to generate the output tensors as well as inputs before executing the graph
* Made NodeNames static and created a new namespace
* Added comments to the TensorAppliers
* Started adding comments on the TensorGenerators code
* Added comments for the Tensor Generator
* Moving the helper classes into a separate folder
* Added initial comments to the TensorChecks
* Renamed NodeNames -> TensorNames
* Removing warnings in tests
* Now using Aut...
* Move 'take_action' into Policy class
This refactor is part of Actor-Trainer separation. Since policies
will be distributed across actors in separate processes which share
a single trainer, taking an action should be the responsibility of
the policy.
This change also includes a few smaller modifications:
* Combines `take_action` logic between trainers, making it more
generic
* Adds an `ActionInfo` data class to be more explicit about the
data returned by the policy, only used by TrainerController and
the policy for now (see the sketch after this list).
* Moves trainer stats logic out of `take_action` and into
`add_experiences`
* Renames 'take_action' to 'get_action'
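A rough sketch of the `get_action` / `ActionInfo` shape after this refactor; the field names and the `evaluate` stub are assumptions for illustration:

```python
from typing import Any, Dict, NamedTuple

class ActionInfo(NamedTuple):
    # Explicit container for what the policy returns (field names are illustrative).
    action: Any
    value: Any
    outputs: Dict[str, Any]  # raw model outputs, e.g. log-probs, memories

class Policy:
    def evaluate(self, brain_info) -> Dict[str, Any]:
        # Stub standing in for the actual TF session run.
        return {"action": [0.0], "value": [0.0]}

    def get_action(self, brain_info) -> ActionInfo:
        # Formerly Trainer.take_action; the policy now owns action selection.
        run_out = self.evaluate(brain_info)
        return ActionInfo(action=run_out.get("action"),
                          value=run_out.get("value"),
                          outputs=run_out)
```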
* WIP precommit on top level
* update CI
* circleci fixes
* intentionally fail black
* use --show-diff-on-failure in CI
* fix command order
* rebreak a file
* apply black
* WIP enable mypy
* run mypy on each package
* fix trainer_metrics mypy errors
* more mypy errors
* more mypy
* Fix some partially typed functions
* types for take_action_outputs
* fix formatting
* cleanup
* generate stubs for proto objects
* fix ml-agents-env mypy errors
* disallow-incomplete-defs for gym-unity
* Add CI notes to CONTRIBUTING.md
* Create new class (RewardSignal) that represents a reward signal.
* Add value heads for each reward signal in the PPO model.
* Make summaries agnostic to the type of reward signals, and log weighted rewards per reward signal.
* Move extrinsic and curiosity rewards into this new structure.
* Allow defining multiple reward signals in the YAML config file, as sketched below. Add documentation for this new structure.
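A hedged sketch of the reward-signal structure this list describes; the class shape and config keys are illustrative, not the exact unitytrainers implementation:

```python
from typing import Dict, List

class RewardSignal:
    # Each reward signal gets its own value head in the PPO model and its own summaries.
    def __init__(self, strength: float, gamma: float):
        self.strength = strength
        self.gamma = gamma

    def evaluate(self, rewards: List[float]) -> List[float]:
        # Weight the raw signal so weighted rewards can be logged per signal.
        return [self.strength * r for r in rewards]

# Illustrative parsed equivalent of defining multiple reward signals in the YAML file.
reward_signal_config: Dict[str, Dict[str, float]] = {
    "extrinsic": {"strength": 1.0, "gamma": 0.99},
    "curiosity": {"strength": 0.02, "gamma": 0.99},
}
signals = {name: RewardSignal(**params) for name, params in reward_signal_config.items()}
```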
At each step, an unused `last_reward` variable in the TF graph is
updated in our PPO trainer. There are also related unused methods
in various places in the codebase. This change removes them.
Previously in v0.8 we added parallel environments via the
SubprocessUnityEnvironment, which exposed the same abstraction as
UnityEnvironment while actually wrapping many parallel environments
via subprocesses.
Wrapping many environments with the same interface as a single
environment had some downsides, however:
* Ordering needed to be preserved for agents across different envs,
complicating the SubprocessEnvironment logic
* Asynchronous environments with steps taken out of sync with the
trainer aren't viable with the Environment abstraction
This PR introduces a new EnvManager abstraction which exposes a
reduced subset of the UnityEnvironment abstraction and a
SubprocessEnvManager implementation which replaces the
SubprocessUnityEnvironment.
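A minimal sketch of the reduced interface an EnvManager might expose; the method names are assumptions for illustration:

```python
from abc import ABC, abstractmethod
from typing import List

class EnvManager(ABC):
    # Reduced subset of the UnityEnvironment abstraction: the trainer only needs to
    # advance all managed environments and collect their step results.

    @abstractmethod
    def step(self) -> List[object]:
        """Advance every managed environment and return the new step infos."""

    @abstractmethod
    def reset(self, config=None) -> List[object]:
        """Reset every managed environment."""

    @abstractmethod
    def close(self) -> None:
        """Shut down the managed environments and any subprocesses."""
```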
* Removes unused SubprocessEnvManager import in trainer_controller
* Removes unused `steps` argument to `TrainerController._save_model`
* Consolidates unnecessary branching for curricula in
`TrainerController.advance`
* Moves `reward_buffer` into `TFPolicy` from `PPOPolicy` and adds
`BCTrainer` support so that we don't have a broken interface /
undefined behavior when BCTrainer is used with curricula.
- Move common functions to trainer.py and model.py from ppo/trainer.py, ppo/policy.py, and ppo/model.py
- Introduce RLTrainer class and move most of add_experiences and some common reward
signal code there. PPO and SAC will inherit from this, though BC Trainer largely will not.
- Add methods to Buffer to enable sampling, truncating, and save/loading (see the sketch after this list).
- Add scoping to create encoders in model.py
- Fix issue with BC Trainer `increment_steps`.
- Fix issue with Demonstration Recorder and visual observations (memory leak fix was deleting vis obs too early).
- Make Samplers sample from the same random seed every time, so generalization runs are repeatable.
- Fix crash when using GAIL, Curiosity, and visual observations together.
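A hedged sketch of the kind of sampling and truncation helpers added to the Buffer; class and method names here are illustrative:

```python
import random
from typing import Dict, List

class UpdateBuffer:
    # Illustrative stand-in for the update buffer with sample/truncate helpers.

    def __init__(self):
        self.fields: Dict[str, List] = {}

    def append(self, key: str, value) -> None:
        self.fields.setdefault(key, []).append(value)

    def num_experiences(self) -> int:
        return len(next(iter(self.fields.values()), []))

    def sample_mini_batch(self, batch_size: int) -> Dict[str, List]:
        # Sample a random mini-batch across all fields (useful for off-policy updates).
        indices = random.sample(range(self.num_experiences()), batch_size)
        return {key: [values[i] for i in indices] for key, values in self.fields.items()}

    def truncate(self, max_length: int) -> None:
        # Drop the oldest experiences so the buffer never exceeds max_length.
        for key in self.fields:
            self.fields[key] = self.fields[key][-max_length:]
```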
We have been ignoring unused imports and star imports via flake8. These are
both bad practice and grow over time without automated checking. This
commit attempts to fix all existing import errors and add back the corresponding
flake8 checks.
* Feature Deprecation: Online Behavioral Cloning
In this PR :
- Delete the online_bc_trainer
- Delete the tests for online BC
- Delete the configuration file for online BC training
* Deleting the BCTeacherHelper.cs Script
TODO:
- Remove usages in the scene
- Documentation Edits
*DO NOT MERGE*
* IMPORTANT: REMOVED ALL IL SCENES
- Removed all the IL scenes from the Examples folder
* Removed all mentions of online BC training in the Documentation
* Made a note in the Migrating.md doc about the removal of the Online BC feature.
* Initial commit removing memories from C# and deprecating memory fields in proto
* initial changes to Python
* Adding functionalities
* Fixes
* Adding the memories to the dictionary (see the sketch after this list)
* Fixing bugs
* tweaks
* Resolving bugs
* Recreating the proto
* Addressing comments
* Passing by reference does not work. Do not merge
* Fixing huge bug in Inference
* Applying patches
* fixing tests
* Addressing comments
* Renaming variable to reflect type
* test
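A minimal sketch of keeping recurrent memories on the Python side in a dictionary keyed by agent id, as the bullets above describe at a high level; the exact structure and names are assumptions:

```python
import numpy as np
from typing import Dict, List

class MemoryStore:
    # Holds per-agent LSTM memories now that C# no longer stores them (illustrative).

    def __init__(self, memory_size: int):
        self.memory_size = memory_size
        self.memories: Dict[int, np.ndarray] = {}

    def get(self, agent_ids: List[int]) -> np.ndarray:
        # Return a [num_agents, memory_size] matrix, zero-initialised for unseen agents.
        return np.array([
            self.memories.get(agent_id, np.zeros(self.memory_size))
            for agent_id in agent_ids
        ])

    def store(self, agent_ids: List[int], new_memories: np.ndarray) -> None:
        for agent_id, memory in zip(agent_ids, new_memories):
            self.memories[agent_id] = memory
```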
This is the first in a series of PRs that intend to move the agent processing logic (add_experiences and process_experiences) out of the trainer and into a separate class. The plan is to do so in steps:
- Split the processing buffer (which keeps track of agent trajectories and assembles them) from the update buffer (completed trajectories to be used for training) within the Trainer (this PR)
- Move the processing buffer and add/process experiences into a separate, outside class
- Change the data type of the update buffer to be a Trajectory
- Place and read Trajectories from queues, and add a subscription mechanism for both AgentProcessor and Trainers (a high-level sketch follows)
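A high-level, hedged sketch of the direction these steps describe: a per-agent processing buffer that assembles trajectories and a queue-based hand-off to trainers; all names are illustrative:

```python
from queue import Queue
from typing import Dict, List

class Trajectory:
    # A completed sequence of experiences for a single agent (illustrative).
    def __init__(self, steps: List[dict]):
        self.steps = steps

class AgentProcessor:
    # Tracks in-progress experiences per agent and publishes finished trajectories
    # to a queue that subscribed trainers can read from.

    def __init__(self):
        self.processing_buffer: Dict[int, List[dict]] = {}
        self.trajectory_queue: "Queue[Trajectory]" = Queue()

    def add_experience(self, agent_id: int, experience: dict, done: bool) -> None:
        self.processing_buffer.setdefault(agent_id, []).append(experience)
        if done:
            # Hand the completed trajectory to trainers via the queue.
            self.trajectory_queue.put(Trajectory(self.processing_buffer.pop(agent_id)))
```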