Reinforcement Learning – Lost Chapters – Policy Gradient Methods

This blog post serves as an introduction to policy gradient methods.

Reinforcement Learning algorithm types

Broadly, Reinforcement Learning (RL) algorithms fall into three families. First, there are value-based methods such as Q-Learning, which estimate the long-term value of actions instead of chasing only immediate rewards – for example, in Super Mario World, the bot could just rush to the goal (= short-term, minor rewards), or it could take some time to collect power-ups and coins and maximize the game’s score (= long-term, major rewards). The Q-Learning algorithm has already been covered in a previous blog post; a short refresher sketch follows after this overview.

Second, there are policy-based methods such as policy gradient, which directly adjust the RL bot’s policy – its mapping from observed situations to actions – so that the bot learns how to react to the elements of the system, such as collectable items, enemies etc. This type of algorithm is covered in more detail below.

Finally, there are actor-critic methods, which combine the benefits of the other two types of methods. (cf. [https://medium.com])
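
To make the value-based idea concrete, here is a minimal Python sketch of the tabular Q-Learning update covered in the earlier post; the state/action sizes, the environment naming and the hyperparameter values are purely illustrative assumptions, not taken from that post.

```python
import numpy as np

# Minimal tabular Q-Learning update (value-based method).
# Assumptions: a small discrete environment with n_states states and
# n_actions actions; the sizes and hyperparameters below are placeholders.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99  # learning rate and discount factor

Q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state):
    # Move Q(s, a) towards the reward plus the discounted best future value,
    # which is what makes the bot prefer the long-term, major rewards.
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```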

 

Introduction to Policy Gradient methods

Policy gradient methods are RL algorithms that learn a parameterized policy for the given system. That policy has two defining properties: when selecting an action, it depends on 1) the current state and 2) a set of parameters. If the RL bot uses a neural network, those parameters are the weights of the network.
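
The following is a minimal sketch of such a parameterized policy, assuming a discrete action space and a single linear layer standing in for the network; the parameter shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Turn raw scores into action probabilities.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def policy(theta, state):
    # Probability of each action, given the current state and the parameters theta.
    logits = theta @ state
    return softmax(logits)

# Example: 8 state features, 4 possible actions, action sampled from the policy.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 8))   # the policy parameters (here: one weight matrix)
state = rng.normal(size=8)
action_probs = policy(theta, state)
action = rng.choice(4, p=action_probs)
```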

In contrast to other RL algorithms, policy gradient methods do not necessarily need a value function to select actions. However, a value function can still help when updating the policy. Policy gradient methods that use a value function in this way are typically referred to as actor-critic methods.
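
One common way a value function helps is as a baseline: the critic’s value estimate is subtracted from the observed return, and the resulting advantage scales the policy update. The sketch below shows only that advantage computation; the returns and value estimates are made-up example numbers, and the critic itself is assumed to be trained separately.

```python
# Advantage used by actor-critic methods: how much better the observed
# return was than what the critic expected for that state.
def advantages(returns, value_estimates):
    return [g - v for g, v in zip(returns, value_estimates)]

# Example with invented numbers: returns from one episode and the critic's predictions.
print(advantages([1.0, 0.5, 0.2], [0.8, 0.6, 0.1]))  # roughly [0.2, -0.1, 0.1]
```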

To measure the performance of the algorithm during training, an objective is defined that depends on the current policy parameters. To maximize this objective, policy gradient methods use gradient ascent, which steps the parameters towards a local maximum of the objective. (cf. [Sutton 2018])
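
To illustrate the gradient ascent step, here is a sketch of the classic REINFORCE update for a simple linear-softmax policy; REINFORCE is one policy gradient method described in [Sutton 2018], but the function names and the step size below are illustrative assumptions.

```python
import numpy as np

alpha = 0.01  # step size (illustrative value)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def grad_log_policy(theta, state, action):
    # Gradient of log pi(action | state, theta) for a linear-softmax policy:
    # (one_hot(action) - pi(. | state)) outer-product with the state features.
    probs = softmax(theta @ state)
    one_hot = np.zeros(len(probs))
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, state)

def reinforce_update(theta, episode):
    # episode: list of (state, action, G), where G is the return from that step onward.
    # Gradient ascent on the objective: the parameters move in the direction
    # that makes actions followed by high returns more probable.
    for state, action, G in episode:
        theta = theta + alpha * G * grad_log_policy(theta, state, action)
    return theta
```

In practice, a deep learning framework would compute the log-probability gradient automatically; the closed form is only spelled out here because the policy is linear.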

 

 

Sources

[Sutton 2018]
Sutton, Richard; Barto, Andrew: Reinforcement Learning: An Introduction. A Bradford Book, 2018.

[https://medium.com]
Juliani, Arthur: Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C). 2016. https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2 (06/01/2020)
