The Neural Network

ResNet architectures are a popular way to train deep networks, often for image recognition. For AlphaZero’s network, the input is the board state, and there are two outputs:

The estimated value (v) of the position, ranging from 1 (win) to -1 (loss). A vector of prior probabilities (p) for playing each next possible action.

The network is initialized randomly. After some self-play has occurred, training data consists of randomly chosen board positions from played games, in the form of (board state, game result, child visit counts from MCTS).

MCTS

This video does an excellent walk-through of standard MCTS. To choose a move given a board position, we perform 800 iterations of the following algorithm. We build a tree, initially having just one node (representing the current board state).

Start at the root node (current board state). Walk to the child that gives the best tradeoff between exploiting current information and exploring new moves (formalized by the UCB1 formula). Recurse ’til you hit a leaf node. If this is the first visit to that node, perform a rollout: randomly simulate moves until end of game, and then use that game result to update the values of all nodes from the leaf back to the root. If it is the second visit, expand it (i.e., create its children), and visit+rollout one of them. There is no third visit, because it is no longer a leaf node.

In AlphaZero, rollouts are replaced by fetching predictions from the NN, and UCB1 is replaced by PUCT (polynomial upper confidence tree). The algorithm looks like this:

Start at the root. Walk to the child with best PUCT score. Recurse ’til you hit a leaf node. On the first visit, invoke the NN to fetch (1) the estimated game score (or value, v) and (2) the suggested probabilities (p) for visiting each child, to be used in PUCT. Create the children but do not visit them. There is no second visit.

Tying it Together

After 800 iterations of MCTS, a move is chosen by selecting the child that was visited most. Then the other side plays in the same way. This continues until the game is over. At that point, each board position from the game is annotated with the game score and the child visit counts. These samples get added to the training set for the ResNet. As the ResNet trains, it gets used in subsequent MCTS self-play.

This back-and-forth between MCTS self-play and ResNet training is very powerful. The ResNet helps to generalize the learnings from MCTS, and the self-play uses that generalized knowledge to learn the game more deeply. As we saw from the paper, this allows AlphaZero to outperform basically all existing algorithms — the best of which have been painstakingly hand-tuned with the help of masters of their respective games.