
Improve Gym wrapper compatibility and add Dopamine documentation (#1541)

* Add option to set gym visual observation to uint8

* Add option to flatten branched discrete actions

* Add game_over variable to gym wrapper

* Add guide on how to use Dopamine with the gym wrapper and comparisons with Baselines and PPO
Branch: develop-generalizationTraining-TrainerController
GitHub · 5 years ago · commit 0f1cb4d8
7 files changed, with 671 insertions and 28 deletions
  1. docs/Readme.md (2)
  2. gym-unity/README.md (218)
  3. gym-unity/gym_unity/envs/unity_env.py (84)
  4. gym-unity/tests/test_gym.py (26)
  5. gym-unity/images/dopamine_gridworld_plot.png (236)
  6. gym-unity/images/dopamine_visualbanana_plot.png (133)

docs/Readme.md (2)


* [API Reference](API-Reference.md)
* [How to use the Python API](Python-API.md)
* [Wrapping Learning Environment as a Gym (+Baselines/Dopamine Integration)](../gym-unity/README.md)

gym-unity/README.md (218)


```python
from gym_unity.envs import UnityEnv
env = UnityEnv(environment_filename, worker_id, use_visual, uint8_visual, multiagent)
```
* `environment_filename` refers to the path to the Unity environment.
* `worker_id` refers to the port to use for communication with the environment.
Defaults to `0`.
* `use_visual` refers to whether to use visual observations (True) or vector
observations (False) as the default observation provided by the `reset` and
`step` functions. Defaults to `False`.
* `uint8_visual` refers to whether to output visual observations as `uint8` values
(0-255). Many common Gym environments (e.g. Atari) do this. By default they
will be floats (0.0-1.0). Defaults to `False`.
* `multiagent` refers to whether you intend to launch an environment which
contains more than one agent. Defaults to `False`.
* `flatten_branched` will flatten a branched discrete action space into a Gym Discrete.
Otherwise, it will be converted into a MultiDiscrete. Defaults to `False`.
The returned environment `env` will function as a gym.
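For reference, the wrapped environment can be driven with the standard Gym loop. A minimal sketch (the `./envs/GridWorld` path is illustrative and should point at your own built executable):

```python
from gym_unity.envs import UnityEnv

# Illustrative path to a built Unity executable; adjust to your own build.
env = UnityEnv("./envs/GridWorld", worker_id=0, use_visual=True, uint8_visual=True)

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()          # sample a random action from the Gym space
    obs, reward, done, info = env.step(action)  # standard Gym (obs, reward, done, info) tuple
    if done:
        obs = env.reset()
env.close()
```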

Using the provided Gym wrapper, it is possible to train ML-Agents environments
using these algorithms. This requires the creation of custom training scripts to
launch each algorithm. In most cases these scripts can be created by making
slight modifications to the ones provided for Atari and Mujoco environments.
### Example - DQN Baseline

```python
import gym

from baselines import deepq
from baselines import logger

from gym_unity.envs.unity_env import UnityEnv

def main():
    env = UnityEnv("./envs/GridWorld", 0, use_visual=True, uint8_visual=True)
    logger.configure('./logs') # Change to log in a different directory
    act = deepq.learn(
        env,
        "cnn", # conv_only is also a good choice for GridWorld
        lr=2.5e-4,
        total_timesteps=1000000,
        exploration_fraction=0.05,
        exploration_final_eps=0.1,
        print_freq=20,
        train_freq=5,
        learning_starts=20000,
        target_network_update_freq=50,
        gamma=0.99,
        prioritized_replay=False,
        checkpoint_freq=1000,
        checkpoint_path='./logs', # Change to save model in a different directory
        dueling=True
    )
    print("Saving model to unity_model.pkl")
    act.save("unity_model.pkl")

if __name__ == '__main__':
    main()
```
To start the training process, run the following from the directory containing
`train_unity.py`:
```sh
python -m train_unity
```

"""
def make_env(rank, use_visual=True): # pylint: disable=C0111
def _thunk():
env = UnityEnv(env_directory, rank, use_visual=use_visual)
env = UnityEnv(env_directory, rank, use_visual=use_visual, uint8_visual=True)
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
return env
return _thunk

if __name__ == '__main__':
main()
```
## Run Google Dopamine Algorithms
Google provides the [Dopamine](https://github.com/google/dopamine) framework along with
implementations of algorithms such as DQN, Rainbow, and the C51 variant of Rainbow.
Using the Gym wrapper, we can run Unity environments with Dopamine.
First, after installing the Gym wrapper, clone the Dopamine repository.
```
git clone https://github.com/google/dopamine
```
Then, follow the appropriate install instructions as specified on
[Dopamine's homepage](https://github.com/google/dopamine). Note that the Dopamine
guide specifies using a virtualenv. If you choose to do so, make sure the `gym_unity`
package is also installed within the same virtualenv as Dopamine.
### Adapting Dopamine's Scripts
First, open `dopamine/atari/run_experiment.py`. Alternatively, copy the entire `atari`
folder, and name it something else (e.g. `unity`). If you choose the copy approach,
be sure to change the package names in the import statements in `train.py` to point to your
new directory.
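For example, if the copied folder is named `unity`, the import of the experiment runner in `train.py` would change along these lines (a sketch; the exact import statements depend on your Dopamine version):

```python
# In dopamine/unity/train.py (copied from dopamine/atari/train.py):

# Before:
# from dopamine.atari import run_experiment

# After, pointing at the copied folder:
from dopamine.unity import run_experiment
```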
Within `run_experiment.py`, we will need to make changes to which environment is
instantiated, just as in the Baselines example. At the top of the file, insert
```python
from gym_unity.envs import UnityEnv
```
to import the Gym Wrapper. Navigate to the `create_atari_environment` method
in the same file, and switch to instantiating a Unity environment by replacing
the method with the following code.
```python
game_version = 'v0' if sticky_actions else 'v4'
full_game_name = '{}NoFrameskip-{}'.format(game_name, game_version)
env = UnityEnv('./envs/GridWorld', 0, use_visual=True, uint8_visual=True)
return env
```
`./envs/GridWorld` is the path to your built Unity executable. For more information on
building Unity environments, see [here](../docs/Learning-Environment-Executable.md), and note
the Limitations section below.
Note that we are not using the preprocessor from Dopamine,
as it uses many Atari-specific calls. Furthermore, frame-skipping can be done from within Unity,
rather than on the Python side.
### Limitations
Since Dopamine is designed around variants of DQN, it is only compatible
with discrete action spaces, and specifically the Discrete Gym space. For environments
that use branched discrete action spaces (e.g.
[VisualBanana](../docs/Learning-Environment-Examples.md)), you can enable the
`flatten_branched` parameter in `UnityEnv`, which treats each combination of branched
actions as a separate action.
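For instance, an action space with three branches of sizes 2, 3, and 3 is exposed as a single `Discrete(18)` space when `flatten_branched=True`. A minimal sketch (the `./envs/VisualBanana` path is illustrative; adjust it to your own build):

```python
from gym_unity.envs import UnityEnv

# Illustrative path to a built VisualBanana executable; adjust to your own build.
env = UnityEnv("./envs/VisualBanana", 0, use_visual=True,
               uint8_visual=True, flatten_branched=True)

# Each combination of branch choices maps to one scalar action, so the action
# space is a single gym.spaces.Discrete that DQN-style agents can sample from.
print(env.action_space)           # Discrete(n), where n is the product of the branch sizes
print(env.action_space.sample())  # a single integer in [0, n)
```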
Furthermore, when building your environments, ensure that your
[Learning Brain](../docs/Learning-Environment-Design-Brains.md) is using visual
observations with greyscale enabled, and that the dimensions of the visual observations
are 84 by 84 (matching the parameters found in `dqn_agent.py` and `rainbow_agent.py`).
Dopamine's agents currently do not automatically adapt to the observation
dimensions or number of channels.
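As a quick sanity check before starting a run, you can inspect the spaces reported by the wrapper (a sketch; the exact shape you see depends on the camera settings baked into your executable):

```python
from gym_unity.envs import UnityEnv

env = UnityEnv("./envs/GridWorld", 0, use_visual=True, uint8_visual=True)

# For a single 84x84 greyscale camera this should be a Box with shape (84, 84, 1),
# which is what Dopamine's dqn_agent.py and rainbow_agent.py expect.
print(env.observation_space)
print(env.action_space)
```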
### Hyperparameters
The hyperparameters provided by Dopamine are tailored to the Atari games, and you will
likely need to adjust them for ML-Agents environments. Here is a sample
`dopamine/agents/rainbow/configs/rainbow.gin` file that is known to work with
GridWorld.
```python
import dopamine.agents.rainbow.rainbow_agent
import dopamine.unity.run_experiment
import dopamine.replay_memory.prioritized_replay_buffer
import gin.tf.external_configurables
RainbowAgent.num_atoms = 51
RainbowAgent.stack_size = 1
RainbowAgent.vmax = 10.
RainbowAgent.gamma = 0.99
RainbowAgent.update_horizon = 3
RainbowAgent.min_replay_history = 20000 # agent steps
RainbowAgent.update_period = 5
RainbowAgent.target_update_period = 50 # agent steps
RainbowAgent.epsilon_train = 0.1
RainbowAgent.epsilon_eval = 0.01
RainbowAgent.epsilon_decay_period = 50000 # agent steps
RainbowAgent.replay_scheme = 'prioritized'
RainbowAgent.tf_device = '/cpu:0' # use '/gpu:0' for the GPU version
RainbowAgent.optimizer = @tf.train.AdamOptimizer()
tf.train.AdamOptimizer.learning_rate = 0.00025
tf.train.AdamOptimizer.epsilon = 0.0003125
Runner.game_name = "Unity" # any name can be used here
Runner.sticky_actions = False
Runner.num_iterations = 200
Runner.training_steps = 10000 # agent steps
Runner.evaluation_steps = 500 # agent steps
Runner.max_steps_per_episode = 27000 # agent steps
WrappedPrioritizedReplayBuffer.replay_capacity = 1000000
WrappedPrioritizedReplayBuffer.batch_size = 32
```
This example assumes you copied `atari` to a separate folder named `unity`.
Replace `unity` in `import dopamine.unity.run_experiment` with the folder you
copied your `run_experiment.py` and `train.py` files to.
If you directly modified the existing files, then use `atari` here.
### Starting a Run
You can now run Dopamine as you would normally:
```
python -um dopamine.unity.train \
--agent_name=rainbow \
--base_dir=/tmp/dopamine \
--gin_files='dopamine/agents/rainbow/configs/rainbow.gin'
```
Again, we assume that you've copied `atari` into a separate folder.
Remember to replace `unity` with the directory you copied your files into. If you
edited the Atari files directly, this should be `atari`.
### Example: GridWorld
As a baseline, here are rewards over time for the three algorithms provided with
Dopamine as run on the GridWorld example environment. All Dopamine (DQN, Rainbow,
C51) runs were done with the same epsilon, epsilon decay, replay history, training steps,
and buffer settings as specified above. Note that the first 20000 steps are used to pre-fill
the training buffer, during which no learning happens.
We provide results from our PPO implementation and the DQN from Baselines as reference.
Note that all runs used the same greyscale GridWorld as Dopamine. For PPO, `num_layers`
was set to 2, and all other hyperparameters are the default for GridWorld in `trainer_config.yaml`.
For Baselines DQN, the provided hyperparameters in the previous section are used. Note
that Baselines implements certain features (e.g. dueling-Q) that are not enabled
in Dopamine DQN.
![Dopamine on GridWorld](images/dopamine_gridworld_plot.png)
### Example: VisualBanana
As an example of using the `flatten_branched` option, we also used the Rainbow
algorithm to train on the VisualBanana environment, and provide the results below.
The same hyperparameters were used as in the GridWorld case, except that
`replay_history` and `epsilon_decay` were increased to 100000.
![Dopamine on VisualBanana](images/dopamine_visualbanana_plot.png)

gym-unity/gym_unity/envs/unity_env.py (84)


import logging
import itertools
import gym
import numpy as np
from mlagents.envs import UnityEnvironment
from gym import error, spaces

https://github.com/openai/multiagent-particle-envs
"""
def __init__(self, environment_filename: str, worker_id=0, use_visual=False, uint8_visual=False, multiagent=False, flatten_branched=False):
:param uint8_visual: Return visual observations as uint8 (0-255) matrices instead of float (0.0-1.0).
:param flatten_branched: If True, turn branched discrete action spaces into a Discrete space rather than MultiDiscrete.
"""
self._env = UnityEnvironment(environment_filename, worker_id)
self.name = self._env.academy_name

self._multiagent = multiagent
self._flattener = None
self.game_over = False # Hidden flag used by Atari environments to determine if the game is over
# Check brain configuration
if len(self._env.brains) != 1:

" visual observations as part of this environment.")
self.use_visual = brain.number_visual_observations >= 1 and use_visual
if not use_visual and uint8_visual:
logger.warning("`uint8_visual was set to true, but visual observations are not in use. "
"This setting will not have any effect.")
else:
self.uint8_visual = uint8_visual
if brain.number_visual_observations > 1:
logger.warning("The environment contains more than one visual observation. "
"Please note that only the first will be provided in the observation.")

if len(brain.vector_action_space_size) == 1:
self._action_space = spaces.Discrete(brain.vector_action_space_size[0])
else:
if flatten_branched:
self._flattener = ActionFlattener(brain.vector_action_space_size)
self._action_space = self._flattener.action_space
else:
self._action_space = spaces.MultiDiscrete(brain.vector_action_space_size)
if flatten_branched:
logger.warning("The environment has a non-discrete action space. It will "
"not be flattened.")
high = np.array([1] * brain.vector_action_space_size[0])
self._action_space = spaces.Box(-high, high, dtype=np.float32)
high = np.array([np.inf] * brain.vector_observation_space_size)

info = self._env.reset()[self.brain_name]
n_agents = len(info.agents)
self._check_agents(n_agents)
self.game_over = False
if not self._multiagent:
obs, reward, done, info = self._single_step(info)

raise UnityGymException(
"The environment was expecting a list of {} actions.".format(self._n_agents))
else:
if self._flattener is not None:
# Action space is discrete and flattened - we expect a list of scalars
action = [self._flattener.lookup_action(_act) for _act in action]
else:
if self._flattener is not None:
# Translate action into list
action = self._flattener.lookup_action(action)
info = self._env.step(action)[self.brain_name]
n_agents = len(info.agents)

if not self._multiagent:
obs, reward, done, info = self._single_step(info)
self.game_over = done
self.game_over = all(done)
self.visual_obs = self._preprocess_single(info.visual_observations[0][0, :, :, :])
default_observation = self.visual_obs
else:
default_observation = info.vector_observations[0, :]

"brain_info": info}
def _preprocess_single(self, single_visual_obs):
    if self.uint8_visual:
        return (255.0 * single_visual_obs).astype(np.uint8)
    else:
        return single_visual_obs
self.visual_obs = self._preprocess_multi(info.visual_observations)
default_observation = self.visual_obs
else:
default_observation = info.vector_observations

def _preprocess_multi(self, multiple_visual_obs):
    if self.uint8_visual:
        return [(255.0 * _visual_obs).astype(np.uint8) for _visual_obs in multiple_visual_obs]
    else:
        return multiple_visual_obs

def render(self, mode='rgb_array'):
    return self.visual_obs

@property
def number_agents(self):
    return self._n_agents
class ActionFlattener():
    """
    Flattens branched discrete action spaces into single-branch discrete action spaces.
    """

    def __init__(self, branched_action_space):
        """
        Initialize the flattener.
        :param branched_action_space: A List containing the sizes of each branch of the action
        space, e.g. [2,3,3] for three branches with size 2, 3, and 3 respectively.
        """
        self._action_shape = branched_action_space
        self.action_lookup = self._create_lookup(self._action_shape)
        self.action_space = spaces.Discrete(len(self.action_lookup))

    @classmethod
    def _create_lookup(cls, branched_action_space):
        """
        Creates a Dict that maps discrete actions (scalars) to branched actions (lists).
        Each key in the Dict maps to one unique set of branched actions, and each value
        contains the List of branched actions.
        """
        possible_vals = [range(_num) for _num in branched_action_space]
        all_actions = [list(_action) for _action in itertools.product(*possible_vals)]
        # Dict should be faster than List for large action spaces
        action_lookup = {_scalar: _action for (_scalar, _action) in enumerate(all_actions)}
        return action_lookup

    def lookup_action(self, action):
        """
        Convert a scalar discrete action into a unique set of branched actions.
        :param action: A scalar value representing one of the discrete actions.
        :return: The List containing the branched actions.
        """
        return self.action_lookup[action]
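For illustration, here is how the flattener behaves for three branches of sizes 2, 2, and 3 (a small sketch mirroring the unit test below):

```python
flattener = ActionFlattener([2, 2, 3])

print(flattener.action_space)       # Discrete(12): 2 * 2 * 3 combinations
print(flattener.lookup_action(0))   # [0, 0, 0]
print(flattener.lookup_action(11))  # [1, 1, 2]
```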

gym-unity/tests/test_gym.py (26)


import pytest
import numpy as np
from gym import spaces
from mock_communicator import MockCommunicator
@mock.patch('mlagents.envs.UnityEnvironment.executable_launcher')
@mock.patch('mlagents.envs.UnityEnvironment.get_communicator')

assert isinstance(rew, list)
assert isinstance(done, list)
assert isinstance(info, dict)
@mock.patch('gym_unity.envs.unity_env.UnityEnvironment')
def test_branched_flatten(mock_env):
    mock_env.return_value.academy_name = 'MockAcademy'
    mock_brain = mock.Mock()
    mock_brain.return_value.number_visual_observations = 0
    mock_brain.return_value.num_stacked_vector_observations = 1
    mock_brain.return_value.vector_action_space_type = 'discrete'
    mock_brain.return_value.vector_observation_space_size = 1
    # Unflattened action space
    mock_brain.return_value.vector_action_space_size = [2, 2, 3]
    mock_env.return_value.brains = {'MockBrain': mock_brain()}
    mock_env.return_value.external_brain_names = ['MockBrain']

    env = UnityEnv(' ', use_visual=False, multiagent=False, flatten_branched=True)
    assert isinstance(env.action_space, spaces.Discrete)
    assert env.action_space.n == 12
    assert env._flattener.lookup_action(0) == [0, 0, 0]
    assert env._flattener.lookup_action(11) == [1, 1, 2]

    # Check that False produces a MultiDiscrete
    env = UnityEnv(' ', use_visual=False, multiagent=False, flatten_branched=False)
    assert isinstance(env.action_space, spaces.MultiDiscrete)

gym-unity/images/dopamine_gridworld_plot.png (236): binary image, 705 × 448, 67 KiB

gym-unity/images/dopamine_visualbanana_plot.png (133): binary image, 704 × 462, 36 KiB