Let’s get started!

Abstract

As you know, the goal of this research was to train an AI program to play Go at the level of world-class professional human players.

To understand this challenge, let me first talk about something similar done for Chess. In the early 1990s, IBM came out with the Deep Blue computer which defeated the great champion Garry Kasparov in Chess. (He’s also a very cool guy, make sure to read more about him later!) How did Deep Blue play?

Well, it used a very brute force method. At each step of the game, it took a look at all the possible legal moves that could be played, and went ahead to explore each and every move to see what would happen. And it would keep exploring move after move for a while, forming a kind of HUGE decision tree of thousands of moves. And then it would come back along that tree, observing which moves seemed most likely to bring a good result. But, what do we mean by “good result”? Well, Deep Blue had many carefully designed chess strategies built into it by expert chess players to help it make better decisions — for example, how to decide whether to protect the king or get advantage somewhere else? They made a specific “evaluation algorithm” for this purpose, to compare how advantageous or disadvantageous different board positions are (IBM hard-coded expert chess strategies into this evaluation function). And finally it chooses a carefully calculated move. On the next turn, it basically goes through the whole thing again.

As you can see, this means Deep Blue thought about millions of theoretical positions before playing each move. This was not so impressive in terms of the AI software of Deep Blue, but rather in the hardware — IBM claimed it to be one of the most powerful computers available in the market at that time. It could look at 200 million board positions per second.

Now we come to Go. Just believe me that this game is much more open-ended, and if you tried the Deep Blue strategy on Go, you wouldn’t be able to play well. There would be SO MANY positions to look at at each step that it would simply be impractical for a computer to go through that hell. For example, at the opening move in Chess there are 20 possible moves. In Go the first player has 361 possible moves, and this scope of choices stays wide throughout the game.

This is what they mean by “enormous search space.” Moreover, in Go, it’s not so easy to judge how advantageous or disadvantageous a particular board position is at any specific point in the game — you kinda have to play the whole game for a while before you can determine who is winning. But let’s say you magically had a way to do both of these. And that’s where deep learning comes in!

So in this research, DeepMind used neural networks to do both of these tasks (if you have never read about neural networks yet, here’s the link again). They trained a “policy neural network” to decide which are the most sensible moves in a particular board position (so it’s like following an intuitive strategy to pick moves from any position). And they trained a “value neural network” to estimate how advantageous a particular board arrangement is for the player (or in other words, how likely you are to win the game from this position). They trained these neural networks first with human game examples (your good old ordinary supervised learning). After this the AI was able to mimic human playing to a certain degree, so it acted like a weak human player. And then to train the networks even further, they made the AI play against itself millions of times (this is the “reinforcement learning” part). With this, the AI got better because it had more practice.

With these two networks alone, DeepMind’s AI was able to play well against state-of-the-art Go playing programs that other researchers had built before. These other programs had used an already popular pre-existing game playing algorithm, called the “Monte Carlo Tree Search” (MCTS). More about this later.

But guess what, we still haven’t talked about the real deal. DeepMind’s AI isn’t just about the policy and value networks. It doesn’t use these two networks as a replacement of the Monte Carlo Tree Search. Instead, it uses the neural networks to make the MCTS algorithm work better… and it got so much better that it reached superhuman levels. THIS improved variation of MCTS is “AlphaGo”, the AI that beat Lee Sedol and went down in AI history as one of the greatest breakthroughs ever. So essentially, AlphaGo is simply an improved implementation of a very ordinary computer science algorithm. Do you understand now why AI in its current form is absolutely nothing to be scared of?

Wow, we’ve spent a lot of time on the Abstract alone.

Alright — to understand the paper from this point on, first we’ll talk about a gaming strategy called the Monte Carlo Tree Search algorithm. For now, I’ll just explain this algorithm at enough depth to make sense of this essay. But if you want to learn about it in depth, some smart people have also made excellent videos and blog posts on this:

1. A short video series from Udacity

2. Jeff Bradberry’s explanation of MCTS

3. An MCTS tutorial by Fullstack Academy

The following section is long, but easy to understand (I’ll try my best) and VERY important, so stay with me! The rest of the essay will go much quicker.

Let’s talk about the first paragraph of the essay above. Remember what I said about Deep Blue making a huge tree of millions of board positions and moves at each step of the game? You had to do simulations and look at and compare each and every possible move. As I said before, that was a simple approach and very straightforward approach — if the average software engineer had to design a game playing AI, and had all the strongest computers of the world, he or she would probably design a similar solution.

But let’s think about how do humans themselves play chess? Let’s say you’re at a particular board position in the middle of the game. By game rules, you can do a dozen different things — move this pawn here, move the queen two squares here or three squares there, and so on. But do you really make a list of all the possible moves you can make with all your pieces, and then select one move from this long list? No — you “intuitively” narrow down to a few key moves (let’s say you come up with 3 sensible moves) that you think make sense, and then you wonder what will happen in the game if you chose one of these 3 moves. You might spend 15–20 seconds considering each of these 3 moves and their future — and note that during these 15 seconds you don’t have to carefully plan out the future of each move; you can just “roll out” a few mental moves guided by your intuition without TOO much careful thought (well, a good player would think farther and more deeply than an average player). This is because you have limited time, and you can’t accurately predict what your opponent will do at each step in that lovely future you’re cooking up in your brain. So you’ll just have to let your gut feeling guide you. I’ll refer to this part of the thinking process as “rollout”, so take note of it!

So after “rolling out” your few sensible moves, you finally say screw it and just play the move you find best.

Then the opponent makes a move. It might be a move you had already well anticipated, which means you are now pretty confident about what you need to do next. You don’t have to spend too much time on the rollouts again. OR, it could be that your opponent hits you with a pretty cool move that you had not expected, so you have to be even more careful with your next move.

This is how the game carries on, and as it gets closer and closer to the finishing point, it would get easier for you to predict the outcome of your moves — so your rollouts don’t take as much time.

The purpose of this long story is to describe what the MCTS algorithm does on a superficial level — it mimics the above thinking process by building a “search tree” of moves and positions every time. Again, for more details you should check out the links I mentioned earlier. The innovation here is that instead of going through all the possible moves at each position (which Deep Blue did), it instead intelligently selects a small set of sensible moves and explores those instead. To explore them, it “rolls out” the future of each of these moves and compares them based on their imagined outcomes.

(Seriously — this is all I think you need to understand this essay)

Now — coming back to the screenshot from the paper. Go is a “perfect information game” (please read the definition in the link, don’t worry it’s not scary). And theoretically, for such games, no matter which particular position you are at in the game (even if you have just played 1–2 moves), it is possible that you can correctly guess who will win or lose (assuming that both players play “perfectly” from that point on). I have no idea who came up with this theory, but it is a fundamental assumption in this research project and it works.

So that means, given a state of the game s, there is a function v*(s) which can predict the outcome, let’s say probability of you winning this game, from 0 to 1. They call it the “optimal value function”. Because some board positions are more likely to result in you winning than other board positions, they can be considered more “valuable” than the others. Let me say it again: Value = Probability between 0 and 1 of you winning the game.

But wait — say there was a girl named Foma sitting next to you while you play Chess, and she keeps telling you at each step if you’re winning or losing. “You’re winning… You’re losing… Nope, still losing…” I think it wouldn’t help you much in choosing which move you need to make. She would also be quite annoying. What would instead help you is if you drew the whole tree of all the possible moves you can make, and the states that those moves would lead to — and then Foma would tell you for the entire tree which states are winning states and which states are losing states. Then you can choose moves which will keep leading you to winning states. All of a sudden Foma is your partner in crime, not an annoying friend. Here, Foma behaves as your optimal value function v*(s). Earlier, it was believed that it’s not possible to have an accurate value function like Foma for the game of Go, because the games had so much uncertainty.

BUT — even if you had the wonderful Foma, this wonderland strategy of drawing out all the possible positions for Foma to evaluate will not work very well in the real world. In a game like Chess or Go, as we said before, if you try to imagine even 7–8 moves into the future, there can be so many possible positions that you don’t have enough time to check all of them with Foma.

So Foma is not enough. You need to narrow down the list of moves to a few sensible moves that you can roll out into the future. How will your program do that? Enter Lusha. Lusha is a skilled Chess player and enthusiast who has spent decades watching grand masters play Chess against each other. She can look at your board position, look quickly at all the available moves you can make, and tell you how likely it would be that a Chess expert would make any of those moves if they were sitting at your table. So if you have 50 possible moves at a point, Lusha will tell you the probability that each move would be picked by an expert. Of course, a few sensible moves will have a much higher probability and other pointless moves will have very little probability. For example: if in Chess, let’s say your Queen is in danger in one corner of the game, you might still have the option to move a little pawn in another corner of the game She is your policy function, p(a\s). For a given state s, she can give you probabilities for all the possible moves that an expert would make.

Wow — you can take Lusha’s help to guide you in how to select a few sensible moves, and Foma will tell you the likelihood of winning from each of those moves. You can choose the move that both Foma and Lusha approve. Or, if you want to be extra careful, you can roll out the moves selected by Lusha, have Foma evaluate them, pick a few of them to roll out further into the future, and keep letting Foma and Lusha help you predict VERY far into the game’s future — much quicker and more efficient than to go through all the moves at each step into the future. THIS is what they mean by “reducing the search space”. Use a value function (Foma) to predict outcomes, and use a policy function (Lusha) to give you grand-master probabilities to help narrow down the moves you roll out. These are called “Monte Carlo rollouts”. Then while you backtrack from future to present, you can take average values of all the different moves you rolled out, and pick the most suitable action. So far, this has only worked on a weak amateur level in Go, because the policy functions and value functions that they used to guide these rollouts weren’t that great.

Phew.

The first line is self explanatory. In MCTS, you can start with an unskilled Foma and unskilled Lusha. The more you play, the better they get at predicting solid outcomes and moves. “Narrowing the search to a beam of high probability actions” is just a sophisticated way of saying, “Lusha helps you narrow down the moves you need to roll out by assigning them probabilities that an expert would play them”. Prior work has used this technique to achieve strong amateur level AI players, even with simple (or “shallow” as they call it) policy functions.

Yeah, convolutional neural networks are great for image processing. And since a neural network takes a particular input and gives an output, it is essentially a function, right? So you can use a neural network to become a complex function. So you can just pass in an image of the board position and let the neural network figure out by itself what’s going on. This means it’s possible to create neural networks which will behave like VERY accurate policy and value functions. The rest is pretty self explanatory.

Here we discuss how Foma and Lusha were trained. To train the policy network (predicting for a given position which moves experts would pick), you simply use examples of human games and use them as data for good old supervised learning.

And you want to train another slightly different version of this policy network to use for rollouts; this one will be smaller and faster. Let’s just say that since Lusha is so experienced, she takes some time to process each position. She’s good to start the narrowing-down process with, but if you try to make her repeat the process , she’ll still take a little too much time. So you train a *faster policy network* for the rollout process (I’ll call it… Lusha’s younger brother Jerry? I know I know, enough with these names). After that, once you’ve trained both of the slow and fast policy networks enough using human player data, you can try letting Lusha play against herself on a Go board for a few days, and get more practice. This is the reinforcement learning part — making a better version of the policy network.

Then, you train Foma for value prediction: determining the probability of you winning. You let the AI practice through playing itself again and again in a simulated environment, observe the end result each time, and learn from its mistakes to get better and better.

I won’t go into details of how these networks are trained. You can read more technical details in the later section of the paper (‘Methods’) which I haven’t covered here. In fact, the real purpose of this particular paper is not to show how they used reinforcement learning on these neural networks. One of DeepMind’s previous papers, in which they taught AI to play ATARI games, has already discussed some reinforcement learning techniques in depth (And I’ve already written an explanation of that paper here). For this paper, as I lightly mentioned in the Abstract and also underlined in the screenshot above, the biggest innovation was the fact that they used RL with neural networks for improving an already popular game-playing algorithm, MCTS. RL is a cool tool in a toolbox that they used to fine-tune the policy and value function neural networks after the regular supervised training. This research paper is about proving how versatile and excellent this tool it is, not about teaching you how to use it. In television lingo, the Atari paper was a RL infomercial and this AlphaGo paper is a commercial.

Alright we’re finally done with the “introduction” parts. By now you already have a very good feel for what AlphaGo was all about.

Next, we’ll go slightly deeper into each thing we discussed above. You might see some ugly and dangerous looking mathematical equations and expressions, but they’re simple (I explain them all). Relax and take a quick break.

A quick note before you move on (since I know most people won’t read the whole thing).