So far, the AI learned to play a game by interacting with its environment and maximizing a desired reward. Practically, the AI just repeatedly played the game, getting better with each iteration. To be successful, it needs to act according to a policy. There are several approaches to this, such as the so-called "Deep Q Network" or the "epsilon-greedy policy". I will focus on the former for one main reason: it is compatible with TensorFlow, a Python library I wanted to take a closer look at anyway.
This blog post serves as an introduction to the paradigm of Q-Learning.
Q-Learning is all about making the right decisions. The basic idea is to keep track of successful actions during a playthrough in a table. For example, there could be a sequence in the game where the AI has to make three decisions, each time pressing one of two buttons. The rewards for correct decisions could be represented in a table like this:
|             | Action 1 – Press Button 1 | Action 2 – Press Button 2 |
|-------------|---------------------------|---------------------------|
| Situation 1 | 0                         | 10                        |
| Situation 2 | 10                        | 0                         |
| Situation 3 | 0                         | 10                        |
The idea is that the AI figures out the values in the table while playing. This sequence would be successful if the AI pressed the buttons in the order 2 – 1 – 2. The problem is that real game sequences are never that simple, and plain tables are not enough. Rewards can also be delayed, meaning that the greatest rewards might hide behind minor, seemingly insignificant ones. In practical terms, Mario would need to jump over multiple enemies to reach a desired 1-up. The AI must also decide whether such a delayed reward is actually worth the trouble.
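To make the table idea concrete, here is a minimal sketch in Python of how such a Q-table could look for the three-situation example, and how an agent would greedily pick the best-known action from it. The array values, names, and helper function are purely illustrative, not part of any particular library:

```python
import numpy as np

# Hypothetical Q-table for the three-situation example above:
# rows are situations (states), columns are the two button actions.
q_table = np.array([
    [0.0, 10.0],   # Situation 1: pressing button 2 is rewarded
    [10.0, 0.0],   # Situation 2: pressing button 1 is rewarded
    [0.0, 10.0],   # Situation 3: pressing button 2 is rewarded
])

def pick_action(situation: int) -> int:
    """Greedily pick the action with the highest known value."""
    return int(np.argmax(q_table[situation]))

# Buttons are numbered from 1, so add 1 to the zero-based action index.
print([pick_action(s) + 1 for s in range(3)])  # -> [2, 1, 2]
```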
Q-Learning takes this complexity of games into account, making the AI's decision making more effective. The goal is for the AI to learn about the long-term gains of its actions.
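This long-term view is captured by the Q-Learning update rule: after each action, the table entry is nudged toward the immediate reward plus the discounted value of the best follow-up action, where the discount factor decides how much delayed rewards count. A minimal sketch follows, with illustrative constants for the learning rate and discount factor (the specific values are assumptions for the example, not taken from the text above):

```python
import numpy as np

n_states, n_actions = 3, 2
q_table = np.zeros((n_states, n_actions))  # starts out knowing nothing

alpha = 0.1   # learning rate: how strongly new experience overwrites old values
gamma = 0.9   # discount factor: how much future (delayed) rewards count

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    """One Q-Learning step: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = np.max(q_table[next_state])
    q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])

# Example step: in situation 0 the agent pressed button 2 (index 1),
# received a reward of 10 and ended up in situation 1.
q_update(state=0, action=1, reward=10.0, next_state=1)
```

With enough of these small updates, the values that eventually lead to a big delayed reward (such as the 1-up behind several enemies) propagate backwards through the table, so earlier, seemingly unrewarding actions end up with high Q-values as well.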