Machine learning algorithms, and neural networks in particular, are widely credited with driving a new AI ‘revolution’. In this article I will introduce the concept of reinforcement learning (RL), keeping technical detail to a minimum so that readers from a variety of backgrounds can understand the essence of the technique, its capabilities and its limitations.

At the end of the article, I will provide links to a few resources for implementing RL.

What is Reinforcement Learning?

Broadly speaking, data-driven algorithms can be categorized into three types: Supervised, Unsupervised, and Reinforcement learning.

The first two are generally used for tasks such as image classification and object detection. While their accuracy is remarkable, these tasks differ from what we would expect of an ‘intelligent’ being.

This is where reinforcement learning comes in. The concept itself is very simple, and much like our evolutionary process: the environment rewards the agent for things that it gets right and penalizes it for things that it gets wrong. The main challenge is scale: the agent may face millions of possible situations and ways of acting, and it has to learn which of them pay off.

Q Learning & Deep Q Learning

Q learning is a widely used reinforcement learning algorithm. Without going into the detailed math, the quality of an action depends on the state the agent is in. The agent usually performs the action with the highest quality, that is, the one it expects to yield the maximum reward. The detailed math can be found here.

In this algorithm, the agent learns the quality (Q value) of each action in each state, based on how much reward the environment gave it. The rule the agent uses to choose an action in a given state is called its policy. The Q value of every state-action pair is usually stored in a table. As the agent interacts with the environment, the Q values get updated from their initial random values to values that actually help maximize reward.
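To make the table idea concrete, here is a minimal sketch of a single Q-table update. It assumes the standard Q-learning update rule; the function name `q_update`, the learning rate `alpha`, the discount `gamma` and the 3-state, 2-action table are all illustrative choices, and the table starts at zero rather than at random values only so that the numbers are easy to follow.

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q value."""
    # Best value the agent currently believes it can get from the next state.
    best_next = np.max(q_table[next_state])
    # Reward received now plus the discounted estimate of what follows.
    td_target = reward + gamma * best_next
    # Move the stored value a fraction alpha of the way toward the target.
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table[state, action]

# Hypothetical toy problem: 3 states (rows) and 2 actions (columns).
q_table = np.zeros((3, 2))
new_value = q_update(q_table, state=0, action=1, reward=1.0, next_state=2)
```

Repeating this update over many interactions is what gradually turns the initial values into useful ones.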

Deep Q Learning

The problem with using Q learning with tables is that it doesn’t scale well. If the number of states is too high, the table will not fit in memory. This is where Deep Q learning can be applied. A deep neural network is essentially a general-purpose function approximator that can learn abstract representations of its input. Instead of looking Q values up in a table, the network approximates them, and its weights are adjusted by gradient descent so that its estimates move toward the optimal Q values.
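The sketch below illustrates the idea of replacing the table with a trainable function. To stay short it uses a linear model instead of a deep network, which is a deliberate simplification and not what Deep Q learning actually uses; the names `predict_q` and `gradient_step`, the feature vector and the fixed target of 1.0 are all made up for illustration.

```python
import numpy as np

def predict_q(weights, state_features):
    """Return one estimated Q value per action for a state's feature vector."""
    return state_features @ weights

def gradient_step(weights, state_features, action, target, lr=0.1):
    """One gradient-descent step on 0.5 * (target - Q(s, a))**2; returns the error."""
    error = target - predict_q(weights, state_features)[action]
    # The gradient of the loss w.r.t. this action's weights is -error * features.
    weights[:, action] += lr * error * state_features
    return error

weights = np.zeros((4, 2))  # assumed sizes: 4 state features, 2 actions
features = np.array([1.0, 0.0, 0.5, 0.0])
errors = [abs(gradient_step(weights, features, action=0, target=1.0))
          for _ in range(50)]
```

The memory needed is now set by the number of weights, not by the number of states, which is the property that lets the approach scale.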

Fun Fact: Google has a patent on some elements of Deep Q learning: https://www.google.com/patents/US20150100530

Exploration vs Exploitation

It is often the case that the agent memorizes one path and never tries any other. In general, we would like an agent not only to exploit paths it already knows to be good, but also to sometimes explore new ones. Therefore, a hyper-parameter named ε is used to govern how much to explore new paths versus how much to exploit old ones: the higher ε is, the more often the agent tries a random action instead of the best one it knows.
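A minimal sketch of how ε can drive this choice, assuming the common ε-greedy rule (with probability ε take a random action, otherwise take the action with the highest Q value). The function name `choose_action` and the sample Q values are illustrative.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Pick an action index using the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: ignore the Q values and pick any action at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated Q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.9, 0.3]
greedy_action = choose_action(q_values, epsilon=0.0)    # always exploits
explored_action = choose_action(q_values, epsilon=1.0)  # always explores
```

In practice ε is often started high and lowered over time, so that the agent explores a lot early on and exploits more as its estimates improve.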

Experience Replay

When training a neural network, the balance of the training data plays a very important role. If a model is trained only on interactions as they happen, the data will be imbalanced: consecutive interactions look very much alike, and the most recent play will have far more bearing on the model than older plays.

Therefore, all the states, along with related data such as the action taken, the reward received and the next state, are stored in memory, and the neural network learns from randomly picked batches of these interactions (this makes it very similar to supervised learning).
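A minimal sketch of such a memory, assuming a fixed-size buffer that drops the oldest interactions once it is full. The class name `ReplayMemory`, the tuple layout and the sizes used are illustrative choices.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of past interactions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling mixes old and recent interactions in one batch.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=100)
for step in range(150):  # dummy interactions: state is just the step number
    memory.store(step, 0, 1.0, step + 1, False)
batch = memory.sample(32)
```

Each sampled batch can then be fed to the network like an ordinary supervised mini-batch.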

The Training Framework

This is what the whole framework for deep Q learning looks like. Note the 𝛾. This is the discount factor, a hyperparameter that controls how much weight future rewards have compared with immediate ones. The symbol ˊ denotes ‘next’, e.g. sˊ denotes the next state.
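To see what 𝛾 does in isolation, here is a small sketch that adds up a sequence of rewards, multiplying each later reward by one more factor of 𝛾. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward k steps ahead by gamma ** k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
short_sighted = discounted_return(rewards, gamma=0.0)  # only the first reward counts
balanced = discounted_return(rewards, gamma=0.5)       # 1 + 0.5 + 0.25
far_sighted = discounted_return(rewards, gamma=1.0)    # all rewards count equally
```

A 𝛾 close to 0 makes the agent care almost only about the immediate reward, while a 𝛾 close to 1 makes it weigh distant rewards nearly as much as immediate ones.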