
[docs] Documentation for POCA and cooperative behaviors (#5056)

Co-authored-by: Ruo-Ping Dong <ruoping.dong@unity3d.com>
/develop/input-actuator-tanks
GitHub · 4 years ago
Current commit
14489fd8
6 files changed, 702 insertions(+), 6 deletions(-)
  1. com.unity.ml-agents/CHANGELOG.md (4 changes)
  2. docs/Learning-Environment-Design-Agents.md (101 changes)
  3. docs/ML-Agents-Overview.md (33 changes)
  4. docs/Training-Configuration-File.md (10 changes)
  5. docs/images/cooperative_pushblock.png (366 changes)
  6. docs/images/groupmanager_teamid.png (194 changes)

com.unity.ml-agents/CHANGELOG.md (4 changes)


### Major Changes
#### com.unity.ml-agents (C#)
- The `BufferSensor` and `BufferSensorComponent` have been added. They allow the Agent to observe a variable number of entities. (#4909)
- The `SimpleMultiAgentGroup` class and `IMultiAgentGroup` interface have been added. These allow Agents to be given rewards and
end episodes in groups. (#4923)
- The MA-POCA trainer has been added. This is a new trainer that enables Agents to learn how to work together in groups. Configure
`poca` as the trainer in the configuration YAML after instantiating a `SimpleMultiAgentGroup` to use this feature. (#5005)
### Minor Changes
#### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)

docs/Learning-Environment-Design-Agents.md (101 changes)


- [Rewards Summary & Best Practices](#rewards-summary--best-practices)
- [Agent Properties](#agent-properties)
- [Destroying an Agent](#destroying-an-agent)
- [Defining Multi-agent Scenarios](#defining-multi-agent-scenarios)
- [Recording Demonstrations](#recording-demonstrations)
An agent is an entity that can observe its environment, decide on the best

the order of the entities, so there is no need to properly "order" the
entities before feeding them into the `BufferSensor`.
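For illustration only, here is a minimal sketch of appending per-entity observations with `BufferSensorComponent.AppendObservation()`; the agent class, field names, and the 3-float entity encoding are assumptions, and each appended array must match the sensor's `Observation Size`:
```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class EntityObserverAgent : Agent  // hypothetical agent class
{
    // Hypothetical references assigned in the Inspector.
    // The BufferSensorComponent's `Observation Size` is assumed to be 3 here.
    [SerializeField] BufferSensorComponent m_BufferSensor;
    [SerializeField] Transform[] m_Entities;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Entities can be appended in any order; each entry must be exactly
        // `Observation Size` floats.
        foreach (var entity in m_Entities)
        {
            var relativePos = entity.position - transform.position;
            m_BufferSensor.AppendObservation(new float[]
            {
                relativePos.x, relativePos.y, relativePos.z
            });
        }
    }
}
```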
The `BufferSensorComponent` Editor inspector has two arguments:
- `Observation Size` : This is how many floats each entity will be
represented with. This number is fixed and all entities must
have the same representation. For example, if the entities you want to

Agent every time one is destroyed or by re-spawning new Agents when the whole
environment resets.
## Defining Multi-agent Scenarios
### Teams for Adversarial Scenarios
Self-play is triggered by including the self-play hyperparameter hierarchy in
the [trainer configuration](Training-ML-Agents.md#training-configurations). To

provide examples of symmetric games. To train an asymmetric game, specify
trainer configurations for each of your behavior names and include the self-play
hyperparameter hierarchy in both.
### Groups for Cooperative Scenarios
Cooperative behavior in ML-Agents can be enabled by instantiating a `SimpleMultiAgentGroup`,
typically in an environment controller or similar script, and adding agents to it
using the `RegisterAgent(Agent agent)` method. Note that all agents added to the same `SimpleMultiAgentGroup`
must have the same behavior name and Behavior Parameters. Using `SimpleMultiAgentGroup` enables the
agents within a group to learn how to work together to achieve a common goal (i.e.,
maximize a group-given reward), even if one or more of the group members are removed
before the episode ends. You can then use this group to add/set rewards, end or interrupt episodes
at a group level using the `AddGroupReward()`, `SetGroupReward()`, `EndGroupEpisode()`, and
`GroupEpisodeInterrupted()` methods. For example:
```csharp
// Create a Multi Agent Group in Start() or Initialize()
m_AgentGroup = new SimpleMultiAgentGroup();
// Register agents in group at the beginning of an episode
foreach (var agent in AgentList)
{
    m_AgentGroup.RegisterAgent(agent);
}
// if the team scores a goal
m_AgentGroup.AddGroupReward(rewardForGoal);
// If the goal is reached and the episode is over
m_AgentGroup.EndGroupEpisode();
ResetScene();
// If time ran out and we need to interrupt the episode
m_AgentGroup.GroupEpisodeInterrupted();
ResetScene();
```
Multi Agent Groups should be used with the MA-POCA trainer, which is explicitly designed to train
cooperative environments. This can be enabled by using the `poca` trainer - see the
[training configurations](Training-Configuration-File.md) doc for more information on
configuring MA-POCA. When using MA-POCA, agents which are deactivated or removed from the Scene
during the episode will still learn to contribute to the group's long-term rewards, even
if they are not active in the scene to experience them.
**NOTE**: Groups differ from Teams (for competitive settings) in the following way - Agents
working together should be added to the same Group, while agents playing against each other
should be given different Team Ids. If in the Scene there is one playing field and two teams,
there should be two Groups, one for each team, and each team should be assigned a different
Team Id. If this playing field is duplicated many times in the Scene (e.g. for training
speedup), there should be two Groups _per playing field_, and two unique Team Ids
_for the entire Scene_. In environments with both Groups and Team Ids configured, MA-POCA and
self-play can be used together for training. In the diagram below, there are two agents on each team,
and two playing fields where teams are pitted against each other. All the blue agents should share a Team Id
(and the orange ones a different ID), and there should be four group managers, one per pair of agents.
<p align="center">
<img src="images/groupmanager_teamid.png"
alt="Group Manager vs Team Id"
width="650" border="10" />
</p>
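To make the layout in the diagram concrete, here is a hedged sketch of a per-playing-field controller (the class and field names are hypothetical) that creates one Group per team on its field; the Team Ids are assumed to be set on each agent's Behavior Parameters in the Inspector and reused across every field in the Scene:
```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using UnityEngine;

// Hypothetical controller attached to each playing field. Every duplicated
// field gets its own two Groups, but the Team Ids (set on each agent's
// Behavior Parameters in the Inspector) are shared across the entire Scene,
// e.g. blue = 0 and orange = 1 for all fields.
public class PlayingFieldController : MonoBehaviour
{
    [SerializeField] List<Agent> m_BlueAgents;    // hypothetical, filled in the Inspector
    [SerializeField] List<Agent> m_OrangeAgents;

    SimpleMultiAgentGroup m_BlueGroup;
    SimpleMultiAgentGroup m_OrangeGroup;

    void Start()
    {
        // Two Groups per playing field: one per cooperating team.
        m_BlueGroup = new SimpleMultiAgentGroup();
        m_OrangeGroup = new SimpleMultiAgentGroup();

        foreach (var agent in m_BlueAgents)
        {
            m_BlueGroup.RegisterAgent(agent);
        }
        foreach (var agent in m_OrangeAgents)
        {
            m_OrangeGroup.RegisterAgent(agent);
        }
    }
}
```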
#### Cooperative Behaviors Notes and Best Practices
* An agent can only be registered to one MultiAgentGroup at a time. If you want to re-assign an
agent from one group to another, you have to unregister it from the current group first.
* Agents with different behavior names in the same group are not supported.
* Agents within groups should always have the `Max Steps` parameter in the Agent script set to 0.
Instead, handle the maximum episode length through the MultiAgentGroup by ending the episode for the entire
Group using `GroupEpisodeInterrupted()`.
* `EndGroupEpisode` and `GroupEpisodeInterrupted` do the same job in the game, but have
slightly different effects on training. If the episode is completed, you would want to call
`EndGroupEpisode`. But if the episode is not over yet and has been running for enough steps, i.e.
it has reached the max step count, you would call `GroupEpisodeInterrupted`.
* If an agent finishes early, e.g. it completes its task, is removed, or is killed in the game, do not call
`EndEpisode()` on the Agent. Instead, disable the agent and re-enable it when the next episode starts,
or destroy the agent entirely (see the sketch after this list). This is because calling `EndEpisode()` will call
`OnEpisodeBegin()`, which will reset the agent immediately. While it is possible to call `EndEpisode()` in this way,
it is usually not the desired behavior when training groups of agents.
* If an agent that was disabled in a scene needs to be re-enabled, it must be re-registered to the MultiAgentGroup.
* Group rewards are meant to encourage agents to act in the group's best interest instead of
their individual interests, and are treated differently from individual agent rewards during
training. So calling `AddGroupReward()` is not equivalent to calling `Agent.AddReward()` on each agent
in the group.
* You can still add incremental rewards to agents using `Agent.AddReward()` if they are
in a Group. These rewards will only be given to those agents and are received when the
Agent is active.
* Environments which use Multi Agent Groups can be trained using PPO or SAC, but agents will
not be able to learn from group rewards after deactivation/removal, nor will they behave as cooperatively.
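As a rough sketch of the disable-and-re-register pattern referenced in the notes above (the class name, method names, and the way the Group reference is provided are all hypothetical):
```csharp
using Unity.MLAgents;
using UnityEngine;

// Hypothetical helper showing the "disable instead of EndEpisode()" pattern.
public class GroupAgentLifecycle : MonoBehaviour
{
    [SerializeField] Agent m_Agent;      // hypothetical reference to the group member
    SimpleMultiAgentGroup m_AgentGroup;  // the Group this agent belongs to

    // Called by the environment controller that owns the Group.
    public void SetGroup(SimpleMultiAgentGroup group)
    {
        m_AgentGroup = group;
        m_AgentGroup.RegisterAgent(m_Agent);
    }

    // Called when the agent is "killed" or finishes its task mid-episode.
    public void RemoveFromPlay()
    {
        // Do NOT call m_Agent.EndEpisode() here: that would invoke
        // OnEpisodeBegin() and reset this agent immediately.
        m_Agent.gameObject.SetActive(false);
    }

    // Called when the next group episode begins.
    public void Respawn()
    {
        m_Agent.gameObject.SetActive(true);
        // A disabled agent must be registered to the Group again after re-enabling.
        m_AgentGroup.RegisterAgent(m_Agent);
    }
}
```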
## Recording Demonstrations

docs/ML-Agents-Overview.md (33 changes)


previous section, the ML-Agents Toolkit provides additional methods that can aid
in training behaviors for specific types of environments.
### Training in Competitive Multi-Agent Environments with Self-Play
ML-Agents provides the functionality to train both symmetric and asymmetric
adversarial games with

our
[blog post on self-play](https://blogs.unity3d.com/2020/02/28/training-intelligent-adversaries-using-self-play-with-ml-agents/)
for additional information.
### Training in Cooperative Multi-Agent Environments with MA-POCA
![PushBlock with Agents Working Together](images/cooperative_pushblock.png)
ML-Agents provides the functionality for training cooperative behaviors - i.e.,
groups of agents working towards a common goal, where the success of the individual
is linked to the success of the whole group. In such a scenario, agents typically receive
rewards as a group. For instance, if a team of agents wins a game against an opposing
team, everyone is rewarded - even agents who did not directly contribute to the win. This
makes learning what to do as an individual difficult - you may get a win
for doing nothing, and a loss for doing your best.
In ML-Agents, we provide MA-POCA (MultiAgent POsthumous Credit Assignment), which
is a novel multi-agent trainer that trains a _centralized critic_, a neural network
that acts as a "coach" for a whole group of agents. You can then give rewards to the team
as a whole, and the agents will learn how best to contribute to achieving that reward.
Agents can _also_ be given rewards individually, and the team will work together to help the
individual achieve those goals. During an episode, agents can be added or removed from the group,
such as when agents spawn or die in a game. If agents are removed mid-episode (e.g., if teammates die
or are removed from the game), they will still learn whether their actions contributed
to the team winning later, enabling agents to take group-beneficial actions even if
they result in the individual being removed from the game (i.e., self-sacrifice).
MA-POCA can also be combined with self-play to train teams of agents to play against each other.
To learn more about enabling cooperative behaviors for agents in an ML-Agents environment,
check out [this page](Learning-Environment-Design-Agents.md#groups-for-cooperative-scenarios).
For further reading, MA-POCA builds on previous work in multi-agent cooperative learning
([Lowe et al.](https://arxiv.org/abs/1706.02275), [Foerster et al.](https://arxiv.org/pdf/1705.08926.pdf),
among others) to enable the above use-cases.
### Solving Complex Tasks using Curriculum Learning

docs/Training-Configuration-File.md (10 changes)


## Common Trainer Configurations
One of the first decisions you need to make regarding your training run is which
trainer to use: PPO, SAC, or POCA. There are some training configurations that are
| `trainer_type` | (default = `ppo`) The type of trainer to use: `ppo`, `sac`, or `poca`. |
| `summary_freq` | (default = `50000`) Number of experiences that need to be collected before generating and displaying training statistics. This determines the granularity of the graphs in Tensorboard. |
| `time_horizon` | (default = `64`) How many steps of experience to collect per-agent before adding it to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon) and more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal. This number should be large enough to capture all the important behavior within a sequence of an agent's actions. <br><br> Typical range: `32` - `2048` |
| `max_steps` | (default = `500000`) Total number of steps (i.e., observation collected and action taken) that must be taken in the environment (or across all environments if using multiple in parallel) before ending the training process. If you have multiple agents with the same behavior name within your environment, all steps taken by those agents will contribute to the same `max_steps` count. <br><br>Typical range: `5e5` - `1e7` |

| `hyperparameters -> tau` | (default = `0.005`) How aggressively to update the target network used for bootstrapping value estimation in SAC. Corresponds to the magnitude of the target Q update during the SAC model update. In SAC, there are two neural networks: the target and the policy. The target network is used to bootstrap the policy's estimate of the future rewards at a given state, and is fixed while the policy is being updated. This target is then slowly updated according to tau. Typically, this value should be left at 0.005. For simple problems, increasing tau to 0.01 might reduce the time it takes to learn, at the cost of stability. <br><br>Typical range: `0.005` - `0.01` |
| `hyperparameters -> steps_per_update` | (default = `1`) Average ratio of agent steps (actions) taken to updates made of the agent's policy. In SAC, a single "update" corresponds to grabbing a batch of size `batch_size` from the experience replay buffer, and using this mini batch to update the models. Note that it is not guaranteed that after exactly `steps_per_update` steps an update will be made, only that the ratio will hold true over many steps. Typically, `steps_per_update` should be greater than or equal to 1. Note that setting `steps_per_update` lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates. For most environments where steps are fairly fast (e.g. our example environments) `steps_per_update` equal to the number of agents in the scene is a good balance. For slow environments (steps take 0.1 seconds or more) reducing `steps_per_update` may improve training speed. We can also change `steps_per_update` to lower than 1 to update more often than once per step, though this will usually result in a slowdown unless the environment is very slow. <br><br>Typical range: `1` - `20` |
| `hyperparameters -> reward_signal_num_update` | (default = `steps_per_update`) Number of steps per mini batch sampled and used for updating the reward signals. By default, we update the reward signals once every time the main policy is updated. However, to imitate the training procedure in certain imitation learning papers (e.g. [Kostrikov et al.](http://arxiv.org/abs/1809.02925), [Blondé et al.](http://arxiv.org/abs/1809.02064)), we may want to update the reward signal (GAIL) M times for every update of the policy. We can change `steps_per_update` of SAC to N, as well as `reward_signal_steps_per_update` under `reward_signals` to N / M to accomplish this. By default, `reward_signal_steps_per_update` is set to `steps_per_update`. |
### MA-POCA-specific Configurations
MA-POCA uses the same configurations as PPO, and there are no additional POCA-specific parameters.
**NOTE**: Reward signals other than Extrinsic Rewards have not been extensively tested with MA-POCA,
though they can still be added and used for training on a your-mileage-may-vary basis.
## Reward Signals

docs/images/cooperative_pushblock.png (366 changes)

Width: 1484  |  Height: 828  |  Size: 135 KiB

docs/images/groupmanager_teamid.png (194 changes)

Width: 787  |  Height: 416  |  Size: 52 KiB