How I built an AI to play Dino Run

Artificial Intelligence faces more problems than it can currently solve, and one such problem is learning to handle an environment for which no data set exists.

An AI playing the Dino Run

Update: After some modifications and a GPU-backed VM, I was able to improve the score to 4000. Please refer to this article for details.

A 2013 publication by DeepMind titled ‘Playing Atari with Deep Reinforcement Learning’ introduced a new deep learning model for reinforcement learning and demonstrated its ability to master difficult control policies for Atari 2600 computer games, using only raw pixels as input. My project was inspired by a few implementations of this paper. I will try to explain the basics of Reinforcement Learning and dive deep into the code snippets for a hands-on understanding.

Before we begin, as a prerequisite, I’m assuming you have basic knowledge of Deep Supervised Learning and Convolutional Neural Networks, which are essential for understanding the project. Feel free to skip to the code section if you’re familiar with Reinforcement Learning and Q-learning.

REINFORCEMENT LEARNING

A child learning to walk

This might be a new term for many, but each and every one of us has learned to walk using the concept of Reinforcement Learning (RL), and this is how our brain still works. A reward system is the basis for any RL algorithm. If we go back to the analogy of a child learning to walk, a positive reward would be a clap from the parents or the ability to reach a candy, and a negative reward would be, say, no candy. The child first learns to stand up before starting to walk. In terms of Artificial Intelligence, the main aim for an agent, in our case the Dino, is to maximize a certain numeric reward by performing a particular sequence of actions in the environment. The biggest challenge in RL is the absence of supervision (labelled data) to guide the agent: it must explore and learn on its own. The agent starts by randomly performing actions, observes the reward each action brings, and learns to predict the best possible action when faced with a similar state of the environment.

A typical reinforcement learning loop. Source: Wikipedia

We use Q-learning, a technique of RL, where we try to approximate a special function that drives the action-selection policy for any sequence of environment states.
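For reference, the function being approximated is the standard optimal action-value function from Q-learning (textbook notation, not anything specific to this project):

Q*(s, a) = r + γ · max over a' of Q*(s', a')

In words: the value of taking action a in state s is the immediate reward plus the discounted value of the best action available in the next state s'.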

Q-learning is a model-free implementation of Reinforcement Learning, where a table of Q-values is maintained for each state, the action taken and the resulting reward. A sample Q-table should give us an idea of how the data is structured. In our case, the states are game screenshots and the actions are jump and do nothing [0, 1].

A sample Q-table
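To make the structure concrete, here is a toy Q-table written as a Python dictionary. The state labels are purely hypothetical; in the real project a state is a stack of screenshots, so a literal table like this would be far too large, which is exactly why we approximate it with a neural network below.

```python
# Toy Q-table: one entry per (state, action) pair holding the learned Q-value.
# State names here are made up for illustration only.
q_table = {
    # (state, action): Q-value      action 0 = do nothing, 1 = jump
    ('cactus_far',  0): 0.8,
    ('cactus_far',  1): 0.2,
    ('cactus_near', 0): -1.0,
    ('cactus_near', 1): 0.9,
}

def table_best_action(state):
    """Look up the action with the highest Q-value for a given state."""
    return max((0, 1), key=lambda a: q_table.get((state, a), 0.0))
```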

We take advantage of deep neural networks to solve this problem through regression and choose the action with the highest predicted Q-value. To know more about Q-learning, please refer to the Reading section at the end.
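The sketch below shows the idea in Keras: a small convolutional network that takes a stack of game frames and outputs one Q-value per action, with the agent picking the argmax. The layer sizes and input shape are illustrative placeholders, not the exact architecture used in the project.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

ACTIONS = 2  # 0: do nothing, 1: jump

# Illustrative network: maps a stack of 4 preprocessed 80x80 frames
# to one Q-value per action (a plain regression, hence the linear output).
model = Sequential([
    Conv2D(32, (8, 8), strides=4, activation='relu', input_shape=(80, 80, 4)),
    Conv2D(64, (4, 4), strides=2, activation='relu'),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(ACTIONS)  # predicted Q-value for each action
])
model.compile(optimizer='adam', loss='mse')

def best_action(state):
    """Pick the action with the highest predicted Q-value."""
    q_values = model.predict(state[np.newaxis, ...])  # add batch dimension
    return int(np.argmax(q_values[0]))
```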

A vanilla Reinforcement Learning implementation has a few problems, so we introduce additional parameters to make learning more stable.

The absence of labelled data makes training with RL very unstable. To create our own data, we let the model play the game randomly for a few thousand steps and record each state, action and reward. We then train our model on batches randomly chosen from these experience replays.
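A minimal sketch of such a replay memory, assuming a fixed capacity and batch size (the numbers are illustrative):

```python
import random
from collections import deque

REPLAY_MEMORY = 50000   # illustrative capacity
BATCH_SIZE = 32

# Each entry stores one observed transition.
replay_buffer = deque(maxlen=REPLAY_MEMORY)

def remember(state, action, reward, next_state, done):
    """Record a transition while the agent plays."""
    replay_buffer.append((state, action, reward, next_state, done))

def sample_batch():
    """Draw a random minibatch of past transitions, which breaks the
    correlation between consecutive frames and stabilizes training."""
    return random.sample(replay_buffer, BATCH_SIZE)
```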

The Exploration vs Exploitation problem arises when our model tends to stick to the same actions while learning; in our case, the model might learn that jumping gives a better reward than doing nothing and end up applying an always-jump policy. However, we would like our model to try out random actions while learning, which may yield a better reward. We introduce ɛ, which decides the randomness of actions. We gradually decay its value to reduce the randomness as we progress, and then exploit rewarding actions.
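A sketch of this ɛ-greedy selection, continuing the earlier network sketch (it reuses `ACTIONS` and `best_action` from above; the starting value, floor and decay schedule for ɛ are assumptions, not the project's exact settings):

```python
import random

EPSILON_START = 0.1      # initial exploration rate (illustrative)
EPSILON_FINAL = 0.0001   # floor after decay
DECAY_STEPS = 100000     # steps over which epsilon is annealed

epsilon = EPSILON_START

def choose_action(state):
    """With probability epsilon explore (random action),
    otherwise exploit the action the network currently rates best."""
    global epsilon
    if random.random() < epsilon:
        action = random.randrange(ACTIONS)   # explore
    else:
        action = best_action(state)          # exploit
    # Linearly decay epsilon towards its final value.
    if epsilon > EPSILON_FINAL:
        epsilon -= (EPSILON_START - EPSILON_FINAL) / DECAY_STEPS
    return action
```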

The Credit Assignment problem can confuse the model when it judges which past action was responsible for the current reward. The Dino cannot jump again while mid-air and might crash into a cactus even though our model predicted a jump; the negative reward was in fact the result of a previously taken wrong jump and not the current action. We introduce the discount factor γ, which decides how far into the future our model looks while taking an action, and thus solves the credit assignment problem indirectly. In our case, with γ = 0.99 the model learned that stray jumps would inhibit its ability to jump in the future.
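In the training loop, γ shows up in the Q-learning target: the target Q-value for the action that was taken is the immediate reward plus γ times the best Q-value of the next state (or just the reward if the run ended). A sketch of one such update, reusing `model` and the replay batch from the earlier sketches:

```python
import numpy as np

GAMMA = 0.99  # discount factor from the article

def train_on_replay_batch(batch):
    """One Q-learning update over a sampled minibatch of transitions."""
    states      = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])

    q_current = model.predict(states)        # current Q estimates
    q_next    = model.predict(next_states)   # Q estimates for the next frame

    for i, (_, action, reward, _, done) in enumerate(batch):
        if done:
            q_current[i][action] = reward    # terminal state: no future reward
        else:
            q_current[i][action] = reward + GAMMA * np.max(q_next[i])

    # Regress the network towards the updated targets.
    model.train_on_batch(states, q_current)
```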

A few additional parameters that we will be using later