Reinforcement Learning – Lost Chapters – Designing a Reward Function

I recently read a master's thesis in the field of Reinforcement Learning and realized that there is a lot of important theoretical content I have skipped over so far. This blog post marks the beginning of the "Lost Chapters" series, which will cover topics I might have missed during my journey through the AI jungle.

In this post, I will take a look at the design of reward functions.

Reward signal

A Reinforcement Learning (RL) bot performs an action in an environment based on a policy; the environment reacts by returning the next state and a reward signal. This reward signal is a real number computed by a reward function, which is designed to encourage the RL bot to take actions that fulfill some objective in the environment, e.g. winning a game. The reward signal indicates whether an action was good or bad, and the purpose of the RL algorithm is to optimize its policy so that it takes actions that maximize the cumulative reward over time. (cf. [Sutton 2018])
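As a minimal sketch of this interaction loop (assuming a Gym-style environment with reset() and step() methods and some policy object; all names here are illustrative, not from the original post):

    # Minimal sketch of the agent-environment loop, assuming a Gym-style
    # interface (env.reset(), env.step(action)) and a policy object.
    def run_episode(env, policy):
        state = env.reset()
        total_reward = 0.0
        done = False
        while not done:
            action = policy.select_action(state)           # act according to the policy
            state, reward, done, info = env.step(action)   # environment returns next state and reward
            total_reward += reward                          # the RL algorithm tries to maximize this
        return total_reward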


Designing Reward Functions: The Cobra Effect

Since the reward signal encourages the RL bot to take actions that fulfill an objective, it is important to design the reward function carefully: it should state what we want achieved, not how we want it achieved, and it should not encode prior knowledge that biases the bot toward a particular strategy. (cf. [Sutton 2018])

Incorrect design might lead to the so-called "Cobra Effect": once upon a time, a government tried to incentivize people to help rid an area of cobras. Any citizen who brought in a venomous snake they had killed received a bounty. In an unexpected twist, people started breeding venomous snakes to collect that bounty. The moral is that people will game the system. To avoid the Cobra Effect, we need to keep in mind that we get what we incentivize, not what we intend. (cf. [https://medium.com])

I encountered this effect myself in the Metroid Learning project, when the AI decided to spam-jump in the nearest corner. The AI realized that I had incentivized movement in the reward function and gamed that system.
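To illustrate the failure mode (this is a hypothetical sketch, not the actual Metroid Learning reward), a reward that pays for any positional change can be farmed by jumping on the spot:

    # Hypothetical movement reward that can be gamed.
    # prev_pos and pos are assumed to be (x, y) positions of the player character.
    def movement_reward(prev_pos, pos):
        dx = abs(pos[0] - prev_pos[0])
        dy = abs(pos[1] - prev_pos[1])
        # Rewarding any positional change also rewards jumping in place:
        # dy changes on every jump, so the bot can collect reward in a corner.
        return dx + dy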


Designing Reward Functions: Reward Shaping

Rewards come in two forms: sparse rewards and non-sparse rewards. A sparse reward signal is sent to the RL bot only when it completes a given objective, e.g. winning a game. Sutton and Barto recommend using sparse rewards of +1 for winning, -1 for losing, and 0 for everything in between. This approach directly rewards success and punishes failure without adding any bias.
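A sparse reward in that spirit could look like this (a sketch; the environment is assumed to expose the game outcome, and the "win"/"loss" labels are illustrative):

    # Sketch of a sparse reward: feedback only at the end of the game.
    # 'outcome' is assumed to be "win", "loss", or None while the game is still running.
    def sparse_reward(outcome):
        if outcome == "win":
            return 1.0
        if outcome == "loss":
            return -1.0
        return 0.0  # every non-terminal step gives no signal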

In the early stages of the training phase, positive rewards occur rarely, because the RL bot is still exploring its environment. Rare positive rewards, however, also mean that the RL bot rarely receives helpful feedback, leading to the "plateau problem", where the AI aimlessly tries out different actions without getting any guidance.

One solution to this problem is non-sparse rewards: intermediate reward signals designed to guide the RL bot toward its goal. A non-sparse reward could be implemented in the form of a score for an action (or an achievement, for that matter). There is one possible downside to non-sparse reward signals: they encode knowledge about how to achieve the goal, which might lead to a Cobra Effect.
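A non-sparse variant might add such intermediate signals on top of the outcome, for example for in-game score or reaching a checkpoint (again a hypothetical sketch; the inputs and scaling factors are assumptions):

    # Hypothetical non-sparse reward: intermediate feedback in addition to the outcome.
    # score_delta and reached_checkpoint are assumed to come from the game state.
    def dense_reward(outcome, score_delta, reached_checkpoint):
        reward = 0.01 * score_delta        # small reward for increasing the score
        if reached_checkpoint:
            reward += 0.1                  # bonus for reaching an intermediate goal
        if outcome == "win":
            reward += 1.0
        elif outcome == "loss":
            reward -= 1.0
        return reward                      # the extra signals can bias the bot (Cobra Effect)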

An alternative solution is reward shaping. Similar to Curriculum Learning (see below), reward shaping steadily increases the difficulty of the problem the RL bot is trying to solve: while in Curriculum Learning the RL bot is confronted with more and more difficult tasks, in reward shaping the reward gradually transforms from an intermediate, non-sparse reward into a sparse reward that only reflects the ultimate goal. (cf. [Sutton 2018])
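One way to picture this transition is to blend a non-sparse reward with the sparse one and shift the weight toward the sparse reward over the course of training (a sketch of the idea; the linear schedule is my assumption, not a prescription from the sources):

    # Sketch of reward shaping as a schedule: early in training the non-sparse
    # reward dominates; at the end only the sparse outcome reward remains.
    def shaped_reward(sparse_r, dense_r, step, total_steps):
        alpha = min(step / total_steps, 1.0)   # 0 at the start, 1 at the end of training
        return alpha * sparse_r + (1.0 - alpha) * dense_r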


Curriculum Learning

The machine learning approach "Curriculum Learning" is inspired by a common concept in human learning: learning objectives are presented in a meaningful order of increasing complexity. When training a neural network, instead of sampling training examples of varying difficulty at random, examples can be sampled with increasing difficulty, starting with the simplest ones. This approach has shown good results on a variety of problems in natural language processing and computer vision. (cf. [Goodfellow 2016])
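A very small sketch of this idea, assuming each training example carries a difficulty label (my assumption for illustration), sorts the data and widens the sampling pool as training progresses:

    import random

    # Sketch of curriculum sampling: examples are sorted by an assumed
    # 'difficulty' value, and the pool of candidates grows over time.
    def curriculum_batch(examples, progress, batch_size):
        # progress is the fraction of training completed, in [0, 1]
        ordered = sorted(examples, key=lambda ex: ex["difficulty"])
        cutoff = max(batch_size, int(len(ordered) * progress))
        pool = ordered[:cutoff]            # easiest examples first, pool grows with progress
        return random.sample(pool, min(batch_size, len(pool)))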


Sources:

[Sutton 2018]
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018.

[Goodfellow 2016]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[https://medium.com]
Deep Reinforcement Learning Models: Tips & Tricks for Writing Reward Functions. The Cobra Effect. 2017.
https://medium.com/@BonsaiAI/deep-reinforcement-learning-models-tips-tricks-for-writing-reward-functions-a84fe525e8e0 (25.11.2019)
