
Fixed code formatting and links.

/develop-generalizationTraining-TrainerController
Marwan Mattar, 6 years ago
Current commit: c471ceca
14 files changed, with 316 insertions and 271 deletions.
  1. docs/Background-Unity.md (3 changes)
  2. docs/Feature-Memory.md (12 changes)
  3. docs/Feature-Monitor.md (2 changes)
  4. docs/Getting-Started-with-Balance-Ball.md (6 changes)
  5. docs/Installation-Windows.md (4 changes)
  6. docs/Learning-Environment-Create-New.md (263 changes)
  7. docs/Learning-Environment-Design-Agents.md (251 changes)
  8. docs/Learning-Environment-Design-Heuristic-Brains.md (14 changes)
  9. docs/Learning-Environment-Design.md (4 changes)
  10. docs/Learning-Environment-Examples.md (4 changes)
  11. docs/ML-Agents-Overview.md (6 changes)
  12. docs/Training-ML-Agents.md (12 changes)
  13. docs/Training-PPO.md (4 changes)
  14. docs/Using-TensorFlow-Sharp-in-Unity.md (2 changes)

docs/Background-Unity.md (3 changes)


* [Editor](https://docs.unity3d.com/Manual/UsingTheEditor.html)
* [Interface](https://docs.unity3d.com/Manual/LearningtheInterface.html)
* [Scene](https://docs.unity3d.com/Manual/CreatingScenes.html)
* [GameObject](https://docs.unity3d.com/Manual/GameObjects.html)
* [Physics](https://docs.unity3d.com/Manual/PhysicsSection.html)
* [Ordering of event functions](https://docs.unity3d.com/Manual/ExecutionOrder.html)
(e.g. FixedUpdate, Update)

docs/Feature-Memory.md (12 changes)


# Memory-enhanced Agents using Recurrent Neural Networks
## What are memories for?
Have you ever entered a room to get something and immediately forgot what you were looking for? Don't let that happen to your agents.

When configuring the trainer parameters in the `trainer_config.yaml`
file, add the following parameters to the Brain you want to use.
```yaml
use_recurrent: true
sequence_length: 64
memory_size: 256
```
* `use_recurrent` is a flag that notifies the trainer that you want to train using a recurrent neural network.

* Adding a recurrent layer increases the complexity of the neural
network; it is recommended to decrease `num_layers` when using a recurrent network.
* It is required that `memory_size` be divisible by 4.
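As an illustration of what `sequence_length` controls, here is a small Python sketch (not part of ML-Agents; the function name is hypothetical) of splitting one recorded trajectory into fixed-length training sequences, zero-padding the last chunk:

```python
def to_sequences(trajectory, sequence_length):
    """Split a trajectory (a list of values) into fixed-length
    sequences, zero-padding the final chunk if it is short."""
    sequences = []
    for start in range(0, len(trajectory), sequence_length):
        chunk = trajectory[start:start + sequence_length]
        # Pad so every sequence fed to the recurrent network has equal length.
        chunk = chunk + [0.0] * (sequence_length - len(chunk))
        sequences.append(chunk)
    return sequences

# A 10-step trajectory becomes three sequences of length 4:
print(to_sequences([float(i) for i in range(10)], 4))
```

Longer sequences let the network relate observations that are further apart in time, at the cost of slower training.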

docs/Feature-Monitor.md (2 changes)


You can track many different things both related and unrelated to the agents themselves. To use the Monitor, call the Log function anywhere in your code:
```csharp
Monitor.Log(key, value, displayType, target)
```
* *`key`* is the name of the information you want to display.
* *`value`* is the information you want to display.

docs/Getting-Started-with-Balance-Ball.md (6 changes)


To summarize, go to your command line, enter the `ml-agents` directory and type:
```
python3 learn.py <env_file_path> --train
```
The `--train` flag tells ML-Agents to run in training mode. `env_file_path` should be the path to the Unity executable that was just created.

Once you start training using `learn.py` in the way described in the previous section, the `ml-agents` folder will
contain a `summaries` directory. In order to observe the training process
in more detail, you can use TensorBoard. From the command line run:
`tensorboard --logdir=summaries`

docs/Installation-Windows.md (4 changes)


Lastly, you should test to see if everything installed properly and that TensorFlow can identify your GPU. In the same Anaconda Prompt, type in the following command:
```
python
```
Then, in the Python interpreter, run:
```python
import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
```

docs/Learning-Environment-Create-New.md (263 changes)


# Making a New Learning Environment
This tutorial walks through the process of creating a Unity Environment. A Unity Environment is an application built using the Unity Engine which can be used to train Reinforcement Learning agents.

3. Delete the `Start()` and `Update()` methods that were added by default.
In such a basic scene, we don't need the Academy to initialize, reset, or otherwise control any objects in the environment so we have the simplest possible Academy implementation:
```csharp
public class RollerAcademy : Academy { }
```
The default settings for the Academy properties are also fine for this environment, so we don't need to change anything for the RollerAcademy component in the Inspector window.

So far, our RollerAgent script looks like:
```csharp
using System.Collections.Generic;
using UnityEngine;

public class RollerAgent : Agent
{
    Rigidbody rBody;
    void Start()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public Transform Target;
    public override void AgentReset()
    {
        if (this.transform.position.y < -1.0)
        {
            // The agent fell
            this.transform.position = Vector3.zero;
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
        }
        else
        {
            // Move the target to a new spot
            Target.position = new Vector3(Random.value * 8 - 4,
                                          0.5f,
                                          Random.value * 8 - 4);
        }
    }
}
```
Next, let's implement the Agent.CollectObservations() function.

* Position of the target. In general, it is better to use the relative position of other objects rather than the absolute position for more generalizable training. Note that the agent only collects the x and z coordinates since the floor is aligned with the x-z plane and the y component of the target's position never changes.
```csharp
// Calculate relative position
Vector3 relativePosition = Target.position - this.transform.position;

// Relative position
AddVectorObs(relativePosition.x / 5);
AddVectorObs(relativePosition.z / 5);
```
```csharp
// Distance to edges of platform
AddVectorObs((this.transform.position.x + 5) / 5);
AddVectorObs((this.transform.position.x - 5) / 5);
AddVectorObs((this.transform.position.z + 5) / 5);
AddVectorObs((this.transform.position.z - 5) / 5);
```
```csharp
// Agent velocity
AddVectorObs(rBody.velocity.x / 5);
AddVectorObs(rBody.velocity.z / 5);
```
```csharp
List<float> observation = new List<float>();
public override void CollectObservations()
{
    // Calculate relative position
    Vector3 relativePosition = Target.position - this.transform.position;

    // Relative position
    AddVectorObs(relativePosition.x / 5);
    AddVectorObs(relativePosition.z / 5);

    // Distance to edges of platform
    AddVectorObs((this.transform.position.x + 5) / 5);
    AddVectorObs((this.transform.position.x - 5) / 5);
    AddVectorObs((this.transform.position.z + 5) / 5);
    AddVectorObs((this.transform.position.z - 5) / 5);

    // Agent velocity
    AddVectorObs(rBody.velocity.x / 5);
    AddVectorObs(rBody.velocity.z / 5);
}
```
The final part of the Agent code is the Agent.AgentAction() function, which receives the decision from the Brain.

With the reference to the Rigidbody, the agent can apply the values from the action[] array using the `Rigidbody.AddForce` function:
```csharp
Vector3 controlSignal = Vector3.zero;
controlSignal.x = Mathf.Clamp(action[0], -1, 1);
controlSignal.z = Mathf.Clamp(action[1], -1, 1);
rBody.AddForce(controlSignal * speed);
```
The agent clamps the action values to the range [-1,1] for two reasons. First, the learning algorithm has less incentive to try very large values (since there won't be any effect on agent behavior), which can avoid numeric instability in the neural network calculations. Second, nothing prevents the neural network from returning excessively large values, so we want to limit them to reasonable ranges in any case.

The RollerAgent calculates the distance to detect when it reaches the target. When it does, the code increments the Agent.reward variable by 1.0 and marks the agent as finished by setting the agent to done.
```csharp
float distanceToTarget = Vector3.Distance(this.transform.position,
                                          Target.position);

// Reached target
if (distanceToTarget < 1.42f)
{
    Done();
    AddReward(1.0f);
}
```
**Note:** When you mark an agent as done, it stops its activity until it is reset. You can have the agent reset immediately by setting the Agent.ResetOnDone property in the Inspector, or you can wait for the Academy to reset the environment. This RollerBall environment relies on the `ResetOnDone` mechanism and doesn't set a `Max Steps` limit for the Academy (so it never resets the environment).
To encourage the agent along, we also reward it for getting closer to the target (saving the previous distance measurement between steps):
```csharp
// Getting closer
if (distanceToTarget < previousDistance)
{
AddReward(0.1f);
}
```
You can also encourage an agent to finish a task more quickly by assigning a negative reward at each step:
```csharp
// Time penalty
AddReward(-0.05f);
```
Finally, to punish the agent for falling off the platform, assign a large negative reward and, of course, set the agent to done so that it resets itself in the next step:
```csharp
// Fell off platform
if (this.transform.position.y < -1.0)
{
Done();
AddReward(-1.0f);
}
```
**AgentAction()**
With the action and reward logic outlined above, the final version of the `AgentAction()` function looks like:
```csharp
public float speed = 10;
private float previousDistance = float.MaxValue;

public override void AgentAction(float[] vectorAction, string textAction)
{
    // Rewards
    float distanceToTarget = Vector3.Distance(this.transform.position,
                                              Target.position);

    // Reached target
    if (distanceToTarget < 1.42f)
    {
        Done();
        AddReward(1.0f);
    }

    // Getting closer
    if (distanceToTarget < previousDistance)
    {
        AddReward(0.1f);
    }

    // Time penalty
    AddReward(-0.05f);

    // Fell off platform
    if (this.transform.position.y < -1.0)
    {
        Done();
        AddReward(-1.0f);
    }
    previousDistance = distanceToTarget;

    // Actions, size = 2
    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = Mathf.Clamp(vectorAction[0], -1, 1);
    controlSignal.z = Mathf.Clamp(vectorAction[1], -1, 1);
    rBody.AddForce(controlSignal * speed);
}
```
Note the `speed` and `previousDistance` class variables defined before the function. Since `speed` is public, you can set the value from the Inspector window.

docs/Learning-Environment-Design-Agents.md (251 changes)


An agent is an actor that can observe its environment and decide on the best course of action using those observations. Create agents in Unity by extending the Agent class. The most important aspects of creating agents that can successfully learn are the observations the agent collects and, for reinforcement learning, the reward you assign to estimate the value of the agent's current state toward accomplishing its tasks.
An agent passes its observations to its brain. The brain, then, makes a decision and passes the chosen action back to the agent. Your agent code must execute the action, for example, move the agent in one direction or another. In order to [train an agent using reinforcement learning](Learning-Environment-Design.md), your agent must calculate a reward value at each action. The reward is used to discover the optimal decision-making policy. (A reward is not used by already trained agents or for imitation learning.)
The Brain class abstracts out the decision making logic from the agent itself so that you can use the same brain in multiple agents.
How a brain makes its decisions depends on the type of brain it is. An **External** brain simply passes the observations from its agents to an external process and then passes the decisions made externally back to the agents. An **Internal** brain uses the trained policy parameters to make decisions (and no longer adjusts the parameters in search of a better decision). The other types of brains do not directly involve training, but you might find them useful as part of a training project. See [Brains](Learning-Environment-Design-Brains.md).

The observation must include all the information an agent needs to accomplish its task. Without sufficient and relevant information, an agent may learn poorly or may not learn at all. A reasonable approach for determining what information should be included is to consider what you would need to calculate an analytical solution to the problem.
For examples of various state observation functions, you can look at the [example environments](Learning-Environment-Examples.md) included in the ML-Agents SDK. For instance, the 3DBall example uses the rotation of the platform, the relative position of the ball, and the velocity of the ball as its state observation. As an experiment, you can remove the velocity components from the observation and retrain the 3DBall agent. While it will learn to balance the ball reasonably well, the performance of the agent without using velocity is noticeably worse.
```csharp
public GameObject ball;
private List<float> state = new List<float>();

public override void CollectObservations()
{
    AddVectorObs(gameObject.transform.rotation.z);
    AddVectorObs(gameObject.transform.rotation.x);
    AddVectorObs((ball.transform.position.x - gameObject.transform.position.x));
    AddVectorObs((ball.transform.position.y - gameObject.transform.position.y));
    AddVectorObs((ball.transform.position.z - gameObject.transform.position.z));
    AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.x);
    AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.y);
    AddVectorObs(ball.transform.GetComponent<Rigidbody>().velocity.z);
}
```
The feature vector must always contain the same number of elements and observations must always be in the same position within the list. If the number of observed entities in an environment can vary you can pad the feature vector with zeros for any missing entities in a specific observation or you can limit an agent's observations to a fixed subset. For example, instead of observing every enemy agent in an environment, you could only observe the closest five.
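For instance, one way to build such a fixed-size vector is to observe only the closest five enemies and zero-pad the remaining slots. A minimal Python sketch of the idea (the function name and data layout are hypothetical, not part of the ML-Agents API):

```python
def closest_enemy_observations(agent_pos, enemy_positions, max_enemies=5):
    """Fixed-size observation: (x, z) offsets of the closest
    `max_enemies` enemies, zero-padded if fewer are present."""
    offsets = sorted(
        ((ex - agent_pos[0], ez - agent_pos[1]) for ex, ez in enemy_positions),
        key=lambda o: o[0] ** 2 + o[1] ** 2,  # sort by squared distance
    )[:max_enemies]
    obs = [coord for offset in offsets for coord in offset]
    # Pad so the feature vector always has 2 * max_enemies elements.
    obs += [0.0] * (2 * max_enemies - len(obs))
    return obs

# Two enemies observed, padded out to five slots (10 values total):
print(closest_enemy_observations((0, 0), [(3, 4), (1, 1)]))
```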

Integers can be added directly to the observation vector. You must explicitly convert Boolean values to a number:
```csharp
AddVectorObs(isTrueOrFalse ? 1 : 0);
```
```csharp
Vector3 speed = ball.transform.GetComponent<Rigidbody>().velocity;
AddVectorObs(speed.x);
AddVectorObs(speed.y);
AddVectorObs(speed.z);
```
```csharp
enum CarriedItems { Sword, Shield, Bow, LastItem }
private List<float> state = new List<float>();

public override void CollectObservations()
{
    for (int ci = 0; ci < (int)CarriedItems.LastItem; ci++)
    {
        AddVectorObs((int)currentItem == ci ? 1.0f : 0.0f);
    }
}
```
#### Normalization

```csharp
normalizedValue = (currentValue - minValue) / (maxValue - minValue)
```
```csharp
Quaternion rotation = transform.rotation;
Vector3 normalized = rotation.eulerAngles / 180.0f - Vector3.one; // [-1,1]
Vector3 normalized = rotation.eulerAngles / 360.0f; // [0,1]
```
For angles that can be outside the range [0,360], you can either reduce the angle, or, if the number of turns is significant, increase the maximum value used in your normalization formula.
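The min-max formula above is easy to verify in isolation; here is a one-line Python version (illustrative only, assuming the value's bounds are known in advance):

```python
def normalize(value, min_value, max_value):
    """Min-max normalize `value` into the range [0, 1]."""
    return (value - min_value) / (max_value - min_value)

# A 90-degree angle normalized over [0, 360]:
print(normalize(90.0, 0.0, 360.0))  # 0.25
```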
### Multiple Visual Observations

### Discrete Vector Observation Space: Table Lookup
You can use the discrete vector observation space when an agent only has a limited number of possible states and those states can be enumerated by a single number. For instance, the [Basic example environment](Learning-Environment-Examples.md#basic) in ML-Agents defines an agent with a discrete vector observation space. The states of this agent are the integer steps between two linear goals. In the Basic example, the agent learns to move to the goal that provides the greatest reward.
```csharp
public override void CollectObservations()
{
    AddVectorObs(stateIndex); // stateIndex is the state identifier
}
```
## Vector Actions

Note that when you are programming actions for an agent, it is often helpful to test your action logic using a **Player** brain, which lets you map keyboard commands to actions. See [Brains](Learning-Environment-Design-Brains.md).
The [3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) and [Area](Learning-Environment-Examples.md#push-block) example environments are set up to use either the continuous or the discrete vector action spaces.
The [Reacher example](Learning-Environment-Examples.md#reacher) defines a continuous action space with four control values.
```csharp
public override void AgentAction(float[] act)
{
    float torque_x = Mathf.Clamp(act[0], -1, 1) * 100f;
    float torque_z = Mathf.Clamp(act[1], -1, 1) * 100f;
    rbA.AddTorque(new Vector3(torque_x, 0f, torque_z));

    torque_x = Mathf.Clamp(act[2], -1, 1) * 100f;
    torque_z = Mathf.Clamp(act[3], -1, 1) * 100f;
    rbB.AddTorque(new Vector3(torque_x, 0f, torque_z));
}
```
You should clamp continuous action values to a reasonable value (typically [-1,1]) to avoid introducing instability while training the agent with the PPO algorithm. As shown above, you can scale the control values as needed after clamping them.
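The clamp-then-scale pattern is independent of Unity; a small Python sketch (illustrative, not the ML-Agents API) makes the order of operations explicit:

```python
def clamp(value, low=-1.0, high=1.0):
    """Restrict `value` to the interval [low, high]."""
    return max(low, min(high, value))

# Clamp a raw network output to [-1, 1], then scale it to a torque:
torque = clamp(3.7) * 100.0
print(torque)  # 100.0
```

Clamping first means an out-of-range network output saturates at the maximum torque instead of producing an arbitrarily large force.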

The [Area example](Learning-Environment-Examples.md#push-block) defines five actions for the discrete vector action space: a jump action and one action for each cardinal direction:
```csharp
// Get the action index
int movement = Mathf.FloorToInt(act[0]);

// Look up the index in the action list:
if (movement == 1) { directionX = -1; }
if (movement == 2) { directionX = 1; }
if (movement == 3) { directionZ = -1; }
if (movement == 4) { directionZ = 1; }
if (movement == 5 && GetComponent<Rigidbody>().velocity.y <= 0) { directionY = 1; }

// Apply the action results to move the agent
gameObject.GetComponent<Rigidbody>().AddForce(
    new Vector3(
        directionX * 40f, directionY * 300f, directionZ * 40f));
```
Note that the above code example is a simplified extract from the AreaAgent class, which provides alternate implementations for both the discrete and the continuous action spaces.

**Examples**
You can examine the `AgentAction()` functions defined in the [example environments](Learning-Environment-Examples.md) to see how those projects allocate rewards.
The `GridAgent` class in the [GridWorld example](Learning-Environment-Examples.md#gridworld) uses a very simple reward system:
```csharp
Collider[] hitObjects = Physics.OverlapBox(trueAgent.transform.position,
                                           new Vector3(0.3f, 0.3f, 0.3f));
if (hitObjects.Where(col => col.gameObject.tag == "goal").ToArray().Length == 1)
{
    AddReward(1.0f);
    Done();
}
if (hitObjects.Where(col => col.gameObject.tag == "pit").ToArray().Length == 1)
{
    AddReward(-1f);
    Done();
}
```
In contrast, the `AreaAgent` in the [Area example](Learning-Environment-Examples.md#push-block) gets a small negative reward every step. In order to get the maximum reward, the agent must finish its task of reaching the goal square as quickly as possible:
```csharp
AddReward(-0.005f);
MoveAgent(act);

if (gameObject.transform.position.y < 0.0f ||
    Mathf.Abs(gameObject.transform.position.x - area.transform.position.x) > 8f ||
    Mathf.Abs(gameObject.transform.position.z + 5 - area.transform.position.z) > 8)
{
    Done();
    AddReward(-1f);
}
```
The `Ball3DAgent` in the [3DBall](Learning-Environment-Examples.md#3dball-3d-balance-ball) takes a similar approach, but allocates a small positive reward as long as the agent balances the ball. The agent can maximize its rewards by keeping the ball on the platform:
```csharp
if (IsDone() == false)
{
    SetReward(0.1f);
}

// When ball falls mark agent as done and give a negative penalty
if ((ball.transform.position.y - gameObject.transform.position.y) < -2f ||
    Mathf.Abs(ball.transform.position.x - gameObject.transform.position.x) > 3f ||
    Mathf.Abs(ball.transform.position.z - gameObject.transform.position.z) > 3f)
{
    Done();
    SetReward(-1f);
}
```
The `Ball3DAgent` also assigns a negative penalty when the ball falls off the platform.

To add an Agent to an environment at runtime, use the Unity `GameObject.Instantiate()` function. It is typically easiest to instantiate an agent from a [Prefab](https://docs.unity3d.com/Manual/Prefabs.html) (otherwise, you have to instantiate every GameObject and Component that make up your agent individually). In addition, you must assign a Brain instance to the new Agent and initialize it by calling its `AgentReset()` method. For example, the following function creates a new agent given a Prefab, Brain instance, location, and orientation:
```csharp
private void CreateAgent(GameObject agentPrefab, Brain brain, Vector3 position, Quaternion orientation)
{
    GameObject agentObj = Instantiate(agentPrefab, position, orientation);
    Agent agent = agentObj.GetComponent<Agent>();
    agent.GiveBrain(brain);
    agent.AgentReset();
}
```
## Destroying an Agent

docs/Learning-Environment-Design-Heuristic-Brains.md (14 changes)


When creating your Decision class, extend MonoBehaviour (so you can use the class as a Unity component) and implement the Decision interface.
```csharp
using UnityEngine;

public class HeuristicLogic : MonoBehaviour, Decision
{
    // ...
}
```
The Decision interface defines two methods, `Decide()` and `MakeMemory()`.

docs/Learning-Environment-Design.md (4 changes)


The Brain encapsulates the decision making process. Brain objects must be children of the Academy in the Unity scene hierarchy. Every Agent must be assigned a Brain, but you can use the same Brain with more than one Agent.
Use the Brain class directly, rather than a subclass. Brain behavior is determined by the brain type. During training, set your agent's brain type to **External**. To use the trained model, import the model file into the Unity project and change the brain type to **Internal**. See [Brains](Learning-Environment-Design-Brains.md) for details on using the different types of brains. You can extend the CoreBrain class to create different brain types if the four built-in types don't do what you need.
The Brain class has several important properties that you can set using the Inspector window. These properties must be appropriate for the agents using the brain. For example, the `Vector Observation Space Size` property must match the length of the feature vector created by an agent exactly. See [Agents](Learning-Environment-Design-Agents.md) for information about creating agents and setting up a Brain instance correctly.
See [Brains](Learning-Environment-Design-Brains.md) for a complete list of the Brain properties.
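As a sketch of how the feature vector length and `Vector Observation Space Size` must line up (assuming the `AddVectorObs` API of this release; `rb` is a hypothetical `Rigidbody` field on the agent):

```csharp
// Inside your Agent subclass. `rb` is a hypothetical Rigidbody field.
public override void CollectObservations()
{
    // A Vector3 expands to three floats in the feature vector...
    AddVectorObs(gameObject.transform.position);
    // ...and a single float adds one more, so the Brain's
    // Vector Observation Space Size must be set to exactly 4.
    AddVectorObs(rb.velocity.y);
}
```

If the sizes disagree, the Brain cannot consume the agent's observations, so recount the floats you add whenever you change `CollectObservations()`.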

4 docs/Learning-Environment-Examples.md

This page only overviews the example environments we provide. To learn more
on how to design and build your own environments see our
-[Making a new Learning Environment](Learning-Environment-Create-New.md)
+[Making a New Learning Environment](Learning-Environment-Create-New.md)

-[contribution guidelines](CONTRIBUTING.md) page.
+[contribution guidelines](../CONTRIBUTING.md) page.
## Basic

6 docs/ML-Agents-Overview.md

elements of the environment related to difficulty or complexity to be
dynamically adjusted based on training progress.
-The [Curriculum Learning](Training-Curriculum-Learning.md)
+The [Training with Curriculum Learning](Training-Curriculum-Learning.md)
tutorial covers this training mode with the **Wall Area** sample environment.
### Imitation Learning

will then use these pairs of observations and actions from the human player
to learn a policy.
-The [Imitation Learning](Training-Imitation-Learning.md) tutorial covers this
+The [Training with Imitation Learning](Training-Imitation-Learning.md) tutorial covers this
training mode with the **Banana Collector** sample environment.
## Flexible Training Scenarios

multiple cameras with different viewpoints, or a navigational agent which might
need to integrate aerial and first-person visuals. You can learn more about
adding visual observations to an agent
-[here](Learning-Environment-Design-Agents.md#visual-observations).
+[here](Learning-Environment-Design-Agents.md#multiple-visual-observations).
* **Broadcasting** - As discussed earlier, an External Brain sends the
observations for all its Agents to the Python API by default. This is helpful

12 docs/Training-ML-Agents.md

| lambd | The regularization parameter. | PPO |
| learning_rate | The initial learning rate for gradient descent. | PPO, BC |
| max_steps | The maximum number of simulation steps to run during a training session. | PPO, BC |
-| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md). | PPO, BC |
+| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
-| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md). | PPO, BC |
+| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
-| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md). | PPO, BC |
+| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |

-* [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md)
-* [Imitation Learning](Training-Imitation-Learning.md)
+* [Using Recurrent Neural Networks](Feature-Memory.md)
+* [Training with Imitation Learning](Training-Imitation-Learning.md)
You can also compare the [example environments](Learning-Environment-Examples.md) to the corresponding sections of the `trainer_config.yaml` file for each example to see how the hyperparameters and other configuration variables have been changed from the defaults.
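Putting the recurrent parameters from the table together, a brain-specific override in `trainer_config.yaml` might look like the following (the brain name `MyRecurrentBrain` is a placeholder for your own brain's name):

```yaml
MyRecurrentBrain:
    use_recurrent: true
    sequence_length: 64
    memory_size: 256
```

Any parameter not listed under the brain's name falls back to the `default` section of the file.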

4 docs/Training-PPO.md

See [Training ML-Agents](Training-ML-Agents.md) for instructions on running the training program, `learn.py`.
-If you are using the recurrent neural network (RNN) to utilize memory, see [Using Recurrent Neural Networks in ML-Agents](Feature-Memory.md) for RNN-specific training details.
+If you are using a recurrent neural network (RNN) to utilize memory, see [Using Recurrent Neural Networks](Feature-Memory.md) for RNN-specific training details.
-For information about imitation learning, which uses a different training algorithm, see [Imitation Learning](Training-Imitation-Learning).
+For information about imitation learning, which uses a different training algorithm, see [Training with Imitation Learning](Training-Imitation-Learning.md).
## Best Practices when training with PPO

2 docs/Using-TensorFlow-Sharp-in-Unity.md

Beyond controlling an in-game agent, you can also use TensorFlowSharp for more general computation. The following instructions describe how to embed TensorFlow models in general, without using the ML-Agents framework.
-You must have a TensorFlow graph, such as `your_name_graph.bytes`, made using TensorFlow's `freeze_graph.py`. The process to create such graph is explained in[Using your own trained graphs](#using-your-own-trained-graphs).
+You must have a TensorFlow graph, such as `your_name_graph.bytes`, made using TensorFlow's `freeze_graph.py`. The process to create such a graph is explained in the [Using your own trained graphs](#using-your-own-trained-graphs) section.
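As a rough sketch of what embedding such a graph involves (assuming the TensorFlowSharp API; `graphModel` is a `TextAsset` referencing your imported `.bytes` file, and the node names `input` and `output` are placeholders for whatever your frozen graph actually defines):

```csharp
using TensorFlow;
using UnityEngine;

public class GraphRunner : MonoBehaviour
{
    // Assign the imported your_name_graph.bytes asset in the Inspector.
    public TextAsset graphModel;

    void Start()
    {
        // Load the frozen graph into a TensorFlowSharp graph object.
        var graph = new TFGraph();
        graph.Import(graphModel.bytes);
        var session = new TFSession(graph);

        // Feed a placeholder tensor and fetch an output tensor. The node
        // names "input" and "output" are placeholders; substitute the
        // names from your own frozen graph.
        var runner = session.GetRunner();
        runner.AddInput(graph["input"][0], new float[1, 1] { { 1.0f } });
        runner.Fetch(graph["output"][0]);
        TFTensor[] results = runner.Run();
        Debug.Log(results[0].GetValue());
    }
}
```

The same load-feed-fetch pattern applies regardless of what the graph computes; only the tensor names and shapes change.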
## Inside of Unity
