Reinforcement Learning – Climber

In the previous two weeks I covered an introduction to the ML-Agents Toolkit and explained examples that make use of it. Using that information, I applied the toolkit to a final project – a modified version of the Bouncer example, which I named Climber.

In this final blog, I report on my final project.

The project

The Climber project is inspired by the Bouncer example from the previous blog. In the Bouncer example, the RL agent bounces from target to target. In Climber, the agent does the same, but the target’s position gets higher with every pickup, as shown in Media 1. To be precise, the target rises by 2 meters per pickup, while the agent can jump only up to around 7 meters – which means that real learning is required to make it past the 4th pickup, as the target then sits higher than a single jump from the ground can reach. If the agent is unable to reach a target, the scene automatically resets after a certain amount of time.

Media 1 – Basic Climber
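
To make the setup a bit more concrete, here is a minimal sketch of how the pickup logic could be wired up in Unity, assuming the ML-Agents 1.x API (Unity.MLAgents.Agent). The class and field names (ClimberTarget, riseStep, resetDelay) and the concrete values are my own illustration, not taken from the project’s actual code.

using UnityEngine;
using UnityEngine.SceneManagement;
using Unity.MLAgents;

// Hypothetical sketch: each time an agent touches the target, the target
// moves 2 m higher and the reset timer restarts. If no agent reaches the
// target in time, the whole scene is reloaded.
public class ClimberTarget : MonoBehaviour
{
    public float riseStep = 2f;     // how much higher the target moves per pickup
    public float resetDelay = 30f;  // seconds without a pickup before the scene resets

    float timeSinceLastPickup;

    void Update()
    {
        timeSinceLastPickup += Time.deltaTime;
        if (timeSinceLastPickup > resetDelay)
        {
            // Reload the scene (see the Conclusions section for why an
            // in-place reset might be the better choice).
            SceneManager.LoadScene(gameObject.scene.name);
        }
    }

    // Requires a trigger collider on the target and a Rigidbody on the agent.
    void OnTriggerEnter(Collider other)
    {
        var agent = other.GetComponent<Agent>();
        if (agent == null) return;

        transform.position += Vector3.up * riseStep;  // raise the target
        agent.AddReward(1f);                          // reward the pickup
        timeSinceLastPickup = 0f;                     // delay the reset
    }
}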

 

The mechanics

As in the Bouncer example, the agent can do only one thing: jump. The goal of the agent is to master the art of jumping from A to B. To speed this process up, the agent is forced to learn quickly – otherwise it falls victim to the ever-rising lava, as shown in Media 2.

Media 2 – Climber, but with lava
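
The lava can be sketched in the same hedged way: a plane that rises at a constant speed and punishes any agent it touches. The class name, the speed and the penalty value below are assumptions for illustration, not the project’s actual numbers.

using UnityEngine;
using Unity.MLAgents;

// Hypothetical sketch: the lava plane rises at a constant speed and any
// agent it touches is penalised and has its episode ended.
public class RisingLava : MonoBehaviour
{
    public float riseSpeed = 0.5f;  // metres per second (illustrative)

    void Update()
    {
        transform.position += Vector3.up * riseSpeed * Time.deltaTime;
    }

    // Requires a trigger collider on the lava and a Rigidbody on the agent.
    void OnTriggerEnter(Collider other)
    {
        var agent = other.GetComponent<Agent>();
        if (agent == null) return;

        agent.AddReward(-1f);  // punish agents that are too slow
        agent.EndEpisode();
    }
}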

 

Since it is a bit mean to let the agent struggle all by itself, it is only fair to add more agents, as shown in Media 3. Companionship aside, adding more agents also enables parallel learning, which increases the learning speed of each individual agent. In this case, any agent that reaches a target delays the scene reset, sentencing the agents that cannot make the jump to a long, hot bath.

Media 3 – Parallel Learning
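
In ML-Agents, parallel learning simply means placing several copies of the agent (or of the whole training area) in one scene; because every copy uses the same Behavior Name, they all feed experience into the same policy. In the editor this is usually done by duplicating the training area, but a small spawner script makes the idea explicit. The prefab and field names below are illustrative, not the project’s actual setup.

using UnityEngine;

// Hypothetical sketch: spawn several agent copies in one scene. Since all
// copies share the same Behavior Name, their experience trains one policy.
public class AgentSpawner : MonoBehaviour
{
    public GameObject agentPrefab;  // prefab containing the agent component
    public int agentCount = 8;      // illustrative value
    public float spacing = 3f;      // distance between spawned agents

    void Start()
    {
        for (int i = 0; i < agentCount; i++)
        {
            // Spread the agents out along the starting platform.
            var position = transform.position + Vector3.right * (spacing * i);
            Instantiate(agentPrefab, position, Quaternion.identity);
        }
    }
}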

 

Testing

The agent quickly recognizes its target and starts jumping towards it. As expected, it also soon learns that it cannot jump high enough to reach targets at a height of 8 meters.

After a few hours of training, the agents consistently manage to reach targets at a height of 8 meters, as shown in Media 4.

Media 4 – Climber training

 

Conclusions

In this setup, agents rarely jump from platform to platform – they prefer to jump directly to their target’s platform, even when that is impossible. I tried to incentivize jumping between platforms by adding a reward for every platform, but this only caused the agents to jump on the spot after reaching the first platform – a classic cobra effect. As in the other examples, it is important to incentivize actions carefully.
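
As a hedged illustration of what went wrong, the sketch below shows the kind of per-platform reward that can produce this cobra effect, together with one possible guard: paying out the reward at most once per episode, so that hopping in place no longer farms it. The class name, method and reward value are my own assumptions, not the project’s actual code.

using UnityEngine;
using Unity.MLAgents;

// Hypothetical sketch of the per-platform reward and a guard against the
// "jump on the spot" exploit: each platform pays out at most once per episode.
public class PlatformReward : MonoBehaviour
{
    public float reward = 0.1f;   // illustrative value

    bool alreadyRewarded;

    // Needs to be called at the start of every episode, e.g. from the
    // agent's OnEpisodeBegin or from a small episode manager.
    public void ResetForNewEpisode()
    {
        alreadyRewarded = false;
    }

    void OnCollisionEnter(Collision collision)
    {
        var agent = collision.gameObject.GetComponent<Agent>();
        if (agent == null) return;

        // Rewarding every landing lets agents farm the first platform by
        // hopping in place; rewarding only the first landing per episode
        // keeps the incentive to move on to the next platform.
        if (alreadyRewarded) return;
        alreadyRewarded = true;
        agent.AddReward(reward);
    }
}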

Reloading a scene in Unity might also slow down the learning process. The other examples do not reload the scene, but reset it in place, which is probably the preferable option, since a full reload might cost the training setup some of its progress. However, the agents did show performance improvements, meaning that some successful learning still took place.
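
For comparison, an in-place reset in the style of the built-in examples could look roughly like the sketch below: instead of reloading the scene, the agent’s OnEpisodeBegin puts the agent, the target and the lava back to their starting positions. The ClimberAgent class and its fields are illustrative, and observations and action handling are omitted.

using UnityEngine;
using Unity.MLAgents;

// Hypothetical sketch of an in-place episode reset (no scene reload).
// Observations and action handling are omitted for brevity.
public class ClimberAgent : Agent
{
    public Transform target;
    public Transform lava;
    public Rigidbody body;

    Vector3 agentStart;
    Vector3 targetStart;
    Vector3 lavaStart;

    public override void Initialize()
    {
        // Remember the starting layout once so every episode can restore it.
        agentStart = transform.position;
        targetStart = target.position;
        lavaStart = lava.position;
    }

    public override void OnEpisodeBegin()
    {
        // Put everything back instead of reloading the whole scene.
        transform.position = agentStart;
        target.position = targetStart;
        lava.position = lavaStart;
        body.velocity = Vector3.zero;
        body.angularVelocity = Vector3.zero;
    }
}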

Finally, the lava. It was a fun idea, but I am not sure it actually made a difference. Agents showed a clear preference for jumping straight to the target’s platform and rarely tried to jump to another, actually reachable platform, even though they were already collecting lava penalties.
