Reinforcement Learning – Exploring the Unity ML-Agents Toolkit

Last week I discovered and introduced the ML-Agents toolkit, a Unity utility that applies Machine Learning to Unity applications. A first look at the kart racing example project provided insights into this specific field. In summary, an RL agent receives some form of input and improves its behavior based on that input over time until its performance reaches a desired degree of quality. On the one hand, this utility fulfills my initial goal of testing applications using AI; on the other hand, it even surpasses my intentions by providing extra features, such as training NPCs.

In this blog post, the example projects of the ML-Agents toolkit are briefly presented and summarized.


Basic

The "Basic" example is an introductory scenario that showcases simple decision-making. In the example, the agent may decide whether to move left or right. To its left, there is a small goal a few steps away, offering a small reward. To its right, there is a large goal some more steps away, offering a large reward. Chasing the greater reward, the agent will always move to its right, as shown in Media 1.

Media 1 – Basic
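The trade-off behind this decision can be made concrete with a little arithmetic. The following sketch assumes illustrative reward values, a per-step penalty, and step counts; these numbers are assumptions, not the toolkit's actual configuration:

```python
# Illustrative only: reward values and step counts are assumptions,
# not the "Basic" example's actual configuration.
STEP_PENALTY = -0.01                  # assumed small cost per move
SMALL_REWARD, SMALL_STEPS = 0.1, 3    # small goal: a few steps to the left
LARGE_REWARD, LARGE_STEPS = 1.0, 7    # large goal: more steps to the right

def episode_return(goal_reward, steps):
    """Total reward for walking straight to one goal."""
    return goal_reward + steps * STEP_PENALTY

left = episode_return(SMALL_REWARD, SMALL_STEPS)    # 0.1 - 0.03 = 0.07
right = episode_return(LARGE_REWARD, LARGE_STEPS)   # 1.0 - 0.07 = 0.93
```

As long as the per-step cost is small relative to the gap between the two goal rewards, the return-maximizing policy walks to the large goal, which is exactly the behavior the trained agent shows.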


Grid World

Similar to the "Basic" example, this game is about navigating to the right goal. Unlike the basic example, the agent must now also move up and down and reach the goal in as few turns as possible, as shown in Media 2. This example is supported by a game-board representation, which provides a good overview of the locations of the agent and the goals.

Media 2 – Grid World


3D Ball

This game is all about balancing a sphere on a cube. In the example, twelve parallel agents learn and act together, as shown in Media 3. Each agent takes its ball's position and velocity as input and chooses the most appropriate rotation based on them. As long as the ball does not fall, the agent receives a positive reward.

Media 3 – 3D Ball
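As a rough intuition for the input-to-action mapping the agent has to learn, here is what a hand-written baseline for the same task might look like. This is only a sketch: the PD gains and the two-axis tilt action are my assumptions, and the trained policy learns such a mapping instead of being programmed with it.

```python
# A minimal hand-written baseline for the balancing task, assuming the
# observation described above (ball position and velocity relative to the
# platform). The gains kp/kd and the two tilt axes are illustrative.
def tilt_action(ball_pos, ball_vel, kp=1.0, kd=0.5):
    """Tilt the platform against the ball's offset and motion (PD control)."""
    tilt_x = -(kp * ball_pos[0] + kd * ball_vel[0])
    tilt_z = -(kp * ball_pos[2] + kd * ball_vel[2])
    return tilt_x, tilt_z

# Ball drifting towards +x: the platform tilts in -x to roll it back.
tilt_action((0.4, 0.0, 0.0), (0.2, 0.0, 0.0))   # -> (-0.5, -0.0)
```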



Bouncer

In this game, the player must jump/bounce to a specified location. Eighteen agents learn simultaneously how to bounce, as shown in Media 4. The target each agent needs to bounce to is set randomly. The agent determines its bounce direction based on two observations: its own position and the position of its target. If the agent bounces too far or even out of the designated area, it is rewarded negatively. In fact, each bounce is rewarded negatively, meaning that the overall reward is only positive if the target is reached within a small number of bounces.

Media 4 – Bouncer
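The described reward scheme can be sketched as a small function. The numeric values below are assumptions chosen to illustrate the structure, not the project's actual constants:

```python
# Hedged sketch of the Bouncer reward scheme described above; the numeric
# values are assumptions, not the example project's actual constants.
def bounce_reward(reached_target, left_area):
    reward = -0.05        # assumed cost charged for every bounce
    if left_area:
        reward -= 1.0     # assumed penalty for leaving the designated area
    if reached_target:
        reward += 1.0     # assumed reward for hitting the target
    return reward

# Hitting the target within 3 bounces still nets a positive total:
total = 2 * bounce_reward(False, False) + bounce_reward(True, False)
# 2 * (-0.05) + 0.95 = 0.85
```

Because every bounce costs a little, the total only stays positive when the target is reached quickly, matching the behavior described above.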



Crawler

In the crawler example, the agent needs to navigate a four-legged model to a randomly set target. The catch is that the agent must move the legs themselves to move the model, as shown in Media 5. The agent is rewarded for facing and moving towards the target, while facing or moving away from it leads to penalties. Also, the agent needs to hurry, because each frame takes a toll on the reward.

Media 5 – Crawler
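The shaping described above can be sketched with dot products: they are positive when the agent faces or moves towards the target and negative otherwise. The formulation, the weights, and the time penalty are my assumptions, not the project's actual reward code:

```python
# Hedged sketch of the crawler's shaped reward; the dot-product formulation,
# the weights, and the per-frame time penalty are all assumptions.
def crawler_reward(facing_dir, velocity, target_dir, time_penalty=0.001):
    # Dot products are positive when facing/moving towards the target and
    # negative when facing/moving away, matching the described penalties.
    facing = sum(f * t for f, t in zip(facing_dir, target_dir))
    progress = sum(v * t for v, t in zip(velocity, target_dir))
    return 0.5 * facing + 1.0 * progress - time_penalty

# Facing and walking straight at the target:
crawler_reward((1, 0, 0), (2, 0, 0), (1, 0, 0))   # -> 0.5 + 2.0 - 0.001 = 2.499
```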


Food Collector

The food collector game is about collecting certain objects (food) while avoiding others (poison). In this example, the agents learn to pick up the right objects by collecting observations via raycasts, as shown in Media 6. The reward is based on whether the object the agent managed to pick up was food or poison.

Media 6 – Food Collector
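To give an idea of how raycast observations end up as numbers a policy can consume, here is one way a single ray hit could be encoded: a one-hot vector over object tags plus a normalized distance. The exact layout used by the toolkit's ray sensors is an assumption here; this only illustrates the principle.

```python
# Hedged sketch: encoding one raycast hit as a one-hot tag plus a
# normalized distance. The exact layout of the toolkit's ray sensors
# is an assumption; this only illustrates the idea.
TAGS = ["food", "poison", "wall"]

def encode_hit(tag, distance, max_dist=10.0):
    one_hot = [1.0 if t == tag else 0.0 for t in TAGS]
    miss = 0.0 if tag in TAGS else 1.0   # 1.0 when the ray hit nothing known
    return one_hot + [miss, min(distance, max_dist) / max_dist]

encode_hit("poison", 2.5)   # -> [0.0, 1.0, 0.0, 0.0, 0.25]
```

Stacking such blocks for every ray yields a fixed-size observation vector, which is what makes raycast perception convenient for learning.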



Hallway

This game is about entering the right door in a hallway. The right door is indicated by a sign – the agent uses this sign to select which door to navigate towards, as shown in Media 7. If the right door is reached, the agent is rewarded positively.

Media 7 – Hallway


Push Block

In this game, the player is tasked with pushing a block into a designated area, as shown in Media 8. The agent accomplishes this by observing its vicinity for the block and the goal. These observations are used to choose an action, which moves the agent. The reward is determined by how fast the block reaches the goal.

Media 8 – Push Block



Pyramids

This example is about finding a button that unlocks the goal and then reaching that goal, as shown in Media 9. Agents use raycasts to observe their surroundings and decide their movements based on them. Only reaching the goal leads to a positive reward.

Media 9 – Pyramids



Reacher

The Reacher example is about controlling an arm to reach for a moving goal, as shown in Media 10. The actions of the agents are based on a collection of normalized rotations, angular velocities, and velocities of both limbs of the reacher, as well as the relative positions of the target and the hand. A positive reward is given as long as an arm is in contact with the goal.

Media 10 – Reacher



Walker

Like in the crawler example, the agent needs to control a model. However, in this example, the model resembles a human, which requires not only the correct limb movements, but also balancing to avoid tipping over, as shown in Media 11. The agent observes the positions and rotations of its body parts to make the right movement decision. The reward is based on how "correct" the agent's walking is, including its speed and posture.

Media 11 – Walker


Wall Jump

Wall Jump

The Wall Jump example is about reaching a static goal after jumping over a wall, as shown in Media 12. The agent checks the ground around it and moves towards a target. In the best case, the agent chooses the block as a target before the actual goal, so that it may use the block to jump over the wall. Reaching the goal provides the only positive reward, and a time penalty incentivizes fast decisions.

Media 12 – Wall Jump

This example contained scenarios in which the agent got stuck, as shown in Media 13. To avoid this, the scene could be reset after a timeout once no progress is being made.

Media 13 – Wall Jump Failing Scenario
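The timeout-reset idea suggested above could be sketched as a small progress watchdog: the episode ends once the agent has not gotten closer to the goal for a while. The class, its name, and its thresholds are illustrative, not part of the toolkit.

```python
# A minimal sketch of the suggested timeout reset: end the episode when the
# agent stops making progress towards the goal. The class and its default
# thresholds are illustrative, not part of the ML-Agents toolkit.
class StuckDetector:
    def __init__(self, max_idle_steps=500):
        self.max_idle = max_idle_steps
        self.idle = 0
        self.best_dist = float("inf")

    def should_reset(self, dist_to_goal):
        """Return True once no progress was made for max_idle_steps steps."""
        if dist_to_goal < self.best_dist - 1e-3:   # progress made
            self.best_dist = dist_to_goal
            self.idle = 0
        else:
            self.idle += 1
        return self.idle >= self.max_idle
```

Calling `should_reset` once per step and resetting the scene when it returns `True` would cut short exactly the stuck scenarios shown in Media 13.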



Tennis

Finally, the tennis example shows two agents facing off against each other, as shown in Media 14. By observing the ball's position and velocity, the agents can react and navigate their rackets to it. Eventually, an agent scores a point, which leads to a positive reward.

Media 14 – Tennis
