Deep Q-Learning (Space Invaders)

Ever since I learned about neural networks playing Atari games I wanted to reimplemnted it and learn how it works. Below you can see an AI playing Space Invaders. I trained it during my batch at Recurse Center on little over 50M frames.

I'm sorry; your browser doesn't support HTML5 video in WebM with VP8 or MP4 with H.264.

It is more awesome if you realize that the AI was trained in a similar way a human would learn: the only inputs are screen and number of gained (or lost) points after each action taken by the AI.

DQN does much better then a best-action strategy (do nothing but shoot) and random strategy.

algorithm points DQN 550 Best action 240 Random 150

You can play with my implementation here: Deep Q-Learning. This is first post on the topic, stay tuned for the next ones!

Average game reward (600 games) after N games played. Blue line is random strategy baseline, red line is best-action strategy baseline.

#Algorithm So what is Deep Q-Learning (DQN)? Below you will find a gentle introduction. I omit certain details for the sake of simplicity and I encourage you to read the original paper.

The task for Neural Network in DQN is to learn which action to take based on the screen and previous experience. Obviously the nueral network should choose the best action but how to learn which one is best?

Turns out your neural network can be pretty simple: the input is game screen and hidden layers consists of 3 convolutional layers and a single fully connected layer. The number of neurons in last layer corresponds to number of actions that can be taken. In the case of Space Invaders there were 4 actions (do nothing, shoot, go left, go right), therefore there were 4 neurons in the output layer.

The tricky and crucial part is the loss function. In ‘normal’ neural networks the loss function is straigtforward as for each training example there is a known expected outcome . Nothing like that is available in our case but we can deal with it thanks to some insights from Q-Learning!

Loss function

The best measure of how good an action is accumulated future reward

where sum is taken over time from until the end of the game and is reward gained at time . There is a couple of problems with that simplified definition and we’ll deal with them one by one.

For one there is no way to calculate that sum as we don’t know the future. With this, we’ll deal at the end though.

Intuitively speaking the immediate reward should be more valuable then a very distant one. After all, future is uncertain and we might never get this distant reward at all.

Let’s introduce discounted accumulated future reward.

Where is between 0 and 1. We’ll set to , though, as the distant rewards are very important.

Finally our game is stochastic (we don’t know when an enemy shoots a laser beam) therefore we should rather think in terms of expected value. Ideally, what we want the neural network to learn is function Q defined as:

where is the input game screen at time , indicates the neuron corresponding with action , is reward obtained after action taken at time . That is what we want each neuron of the output layer to learn.

To do that efficiently we need to realise that where is game screen experienced after taking action after seeing game screen .

Now if is our neural network we can treat as a measure of surprise and therefore a loss function (after squaring).

Note that the loss depends on the neural network itself in an untypical way.

Correlation

Since we play the game online it is tempting to simply update the network after each taken action or in mini-batches of, say, 32 actions taken. Such mini-batches would be highly correlated and any stochastic update algorithm would fail on that.

To deal with that issue we keep previous experiences in memory and after each action taken we draw a mini-batch of experiences from that memory to perform the update step.

Other technicalities

provide only every 4th frame to the neural network

stack 4 frames one on top of the other to make the neural network aware of time. This could be avoided if you used LSTM.

keep a stale network at hand and calculate loss with regards to that stale network

gradient clipping (to avoid blowing up gradients)

DeepMind Rmsprop (instead of normal one) - improved performance by 40% in my case.

I used Arcade Learning Environment to play space invaders.

Lessons Learned

That was my first exposure to training non trivial neural networks so there is plenty of things that I learned.