Reinforcement Learning – Advanced practices

Last week I gave a quick introduction to Q Learning. This week I want to follow up on that topic by taking a closer look at more advanced development practices than those used in MarI/O.

This post covers three useful practices in the field of reinforcement learning. The sources listed at the end provide code examples that will be useful for future experiments.

Deep Q Learning

Q Learning follows the principle of long-term success: the goal is to determine which action “a” the AI should take in any situation “s” (also called ‘state’) in order to get the best results. The learning rule that is finally applied looks like this:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ]

This update rule, which is derived from the so-called “Bellman equation”, might look a bit intimidating, but it is not as bad as it seems. First of all, ‘Q(s, a)’ is the table of all possible states and actions; in Deep Q Learning, a neural network approximates this table. The rule updates the current entry for state ‘s’ and action ‘a’. When focusing on the contents of the square brackets, we see that the update includes three terms. The first term is simple: ‘r’ stands for the reward that is received when action ‘a’ is taken in state ‘s’. The second term is the delayed reward, which is the product of two values. The first value, depicted as the Greek letter gamma (γ), is the impact of the delayed reward (the discount factor), which is a value between 0 and 1. The second value is (simply put) the maximum Q value possible in the next state s'. The third term subtracts the current estimate ‘Q(s, a)’, so the bracket as a whole measures how far the current estimate is from the newly observed target. Finally, this difference is multiplied by the learning rate, depicted as the Greek letter alpha (α), and added to the current value. A successful update should lead to a change in the behavior of the AI: the AI should ‘learn’ and play better as the updates accumulate.
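To make the update rule concrete, here is a minimal sketch of the table-based case in Python. The function name `q_update`, the table layout (a NumPy array indexed by state and action) and the toy numbers are assumptions made for illustration; they are not taken from the sources below. The parameter `gamma` is the impact of the delayed reward and `alpha` is the learning rate.

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q Learning update in place and return the new Q(s, a)."""
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward + gamma * np.max(q_table[next_state])
    # Move the current estimate a fraction alpha towards that target.
    q_table[state, action] += alpha * (target - q_table[state, action])
    return q_table[state, action]

# Hypothetical toy problem: 3 states, 2 actions, table initialised with zeros.
q = np.zeros((3, 2))
q[1] = [0.0, 2.0]  # pretend state 1 already has one known good action

# Taking action 1 in state 0 gives reward 1.0 and leads to state 1.
new_value = q_update(q, state=0, action=1, reward=1.0, next_state=1)
print(new_value)
```

With these numbers the target is 1.0 + 0.9 · 2.0 = 2.8, so the entry moves from 0 to roughly 0.28, which is one tenth of the way towards the target.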

 

Epsilon-greedy policy

A Q Learning AI selects the action that corresponds to the highest ‘Q output’. However, there is one flaw in this approach: the neural network is randomly initialized, and therefore it might fall into sub-optimal behavior patterns without exploring the game and its reward space. Simply put, it might commit to the wrong strategy too early.

The epsilon-greedy policy is all about ‘exploration’ and ‘exploitation’. With probability epsilon, the AI picks a random action (exploration); otherwise it picks the action with the highest Q value (exploitation). The idea is that the AI first explores the game in the hope of finding good strategies, that is, actions that might lead to delayed rewards. Epsilon usually starts high and is lowered during training, so that the AI gradually shifts towards exploiting what it has found.

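Note that this epsilon is not the factor that weights the delayed reward in the Q Learning rule; it is the probability of taking a random action. ‘Greedy’ refers to always picking the action with the highest Q value, and ‘epsilon-greedy’ relaxes this with a small amount of randomness.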
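An epsilon-greedy choice can be sketched in a few lines of Python. In this sketch epsilon is the probability of taking a random action. The function names `choose_action` and `decayed_epsilon`, as well as the exponential decay schedule and its default values, are assumptions for illustration; other schedules (for example a linear one) are just as common.

```python
import math
import random

def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: any action, regardless of its Q value.
        return rng.randrange(len(q_values))
    # Exploit: the index of the highest Q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay=0.001):
    """Shrink epsilon exponentially from start towards end as training proceeds."""
    return end + (start - end) * math.exp(-decay * step)

# Early in training the choice is almost always random, later mostly greedy.
q_values = [0.1, 0.9, 0.3]
for step in (0, 1000, 5000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), choose_action(q_values, eps))
```

With `epsilon=0.0` the function always returns the greedy action, and with `epsilon=1.0` it always returns a random one, so the two extremes correspond to pure exploitation and pure exploration.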

 

Batching

When training a neural network, there is always the risk of overfitting, especially when it comes to the practices described above. Overfitting means that the network adapts too closely to one given scenario, rendering it useless everywhere else.

To avoid overfitting, the AI should have something like a memory: a collection of previously experienced states, actions and rewards. This memory can then be divided into batches, chunks of collected data that are trained on separately from other chunks and are often drawn at random. This makes the training data more diverse, which can make the AI more effective.
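A memory with batching could look roughly like the following Python sketch. The class name `ReplayMemory`, its methods and the fixed capacity are assumptions for illustration; drawing the batch at random is one common choice, not the only possible one.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Draw a random batch without replacement; never more than is stored.
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)

# Hypothetical usage: store five transitions in a memory that holds three.
memory = ReplayMemory(capacity=3)
for i in range(5):
    memory.add(state=i, action=0, reward=0.0, next_state=i + 1)

batch = memory.sample(2)
print(len(memory), batch)
```

In a training loop, each sampled batch would be used for one training step of the network, so that a single step sees experiences from different moments of the game rather than only the most recent one.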

 

Sources:

https://adventuresinmachinelearning.com/reinforcement-learning-tensorflow/

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

 
