TL;DR Introduces temporal-difference learning, TD-Lambda / TD-Gammon, and eligibility traces. Check out the GitHub repo for an implementation of TD-Gammon with TensorFlow.

A few weeks ago AlphaGo won a historic match playing the game of Go against Lee Sedol, one of the top Go players in the world. Many people have compared AlphaGo to Deep Blue, which won a series of famous chess matches against Garry Kasparov, but a closer comparison may be found in the game of backgammon.

Before DeepMind tackled playing Atari games or built AlphaGo, there was TD-Gammon, the first algorithm to reach an expert level of play in backgammon. Gerald Tesauro published a paper in 1992 describing TD-Gammon, a neural network trained with reinforcement learning. It is referenced in both the Atari and AlphaGo papers and helped lay the groundwork for many of the advances made in the last few years.

Temporal-Difference Learning

TD-Gammon consists of a simple three-layer neural network trained using a reinforcement learning technique known as TD-Lambda: temporal-difference learning with a trace-decay parameter lambda (λ). The neural network acts as a “value function,” predicting the value, or expected reward, of a particular board state for the current player.

During training, the neural network iterates over all valid moves for the current player, evaluates the state each move produces, and selects the move with the highest predicted value. Because the network evaluates moves for both players, it is effectively playing against itself. Using TD-Lambda, we want to improve the network so that it can reasonably predict the most likely outcome of a game from a given board state. It does this by learning to reduce the difference between its value for the next state and its value for the current state.
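As a rough sketch, the greedy self-play move selection described above might look like this. The helpers `legal_moves`, `apply_move`, and `value_fn` are hypothetical stand-ins for the board logic and the network, not the actual API from the repo:

```python
# Hypothetical sketch of TD-Gammon's move selection. `value_fn` stands in
# for the value network; `legal_moves` and `apply_move` stand in for the
# board logic. None of these names come from the actual implementation.

def pick_best_move(value_fn, state, legal_moves, apply_move):
    best_move, best_value = None, float("-inf")
    for move in legal_moves(state):
        next_state = apply_move(state, move)
        value = value_fn(next_state)  # predicted value for the current player
        if value > best_value:
            best_move, best_value = move, value
    return best_move
```

Because the same selection routine runs for whichever player is on move, the network effectively plays both sides of every training game.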

Let’s start with a loss function, which describes how well the network is performing for any state at time t:

L_t = (V_{t+1} − V_t)²

Loss function: the squared difference between our neural network’s output V_{t+1} for the next state and its output V_t for the current state.

Here we want to minimize the squared difference between the next prediction and the current prediction. Basically, we want our predictions about the present to match our predictions about the future. This in itself isn’t very useful until we know how the game ends, so for the final step of the game we modify the loss function:

L_f = (z − V_f)²

Same as above, but z represents the true outcome of the game.

Here z is the actual outcome of the game. Together these two loss functions work reasonably well, but on their own the network converges slowly and never reaches a strong level of play.

Temporal Credit Assignment

To make our predictions more useful we need to solve the problem of temporal credit assignment: which actions taken in the past were responsible for the outcome observed in the future? Right now the loss only relates two consecutive steps, and we want to stretch that window back in time.

With the loss function above our parameter updates will look something like this:

θ ← θ + α δ_t ∇_θ V_t

Parameter updates for the loss function L.

Here θ denotes the network’s parameters (weights), α is the learning rate, and δ is the difference we defined above:

δ_t = V_{t+1} − V_t        δ_f = z − V_f

Definition of δ for intermediate and end-game states, where f is the final time-step of the game.
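With toy numbers (assumed for illustration, not taken from the paper), the two cases of δ can be computed directly:

```python
# δ for an intermediate step: the difference between consecutive predictions.
def td_error_intermediate(v_t, v_next):
    return v_next - v_t  # δ_t = V_{t+1} - V_t

# δ for the final time-step f: the last prediction compared against
# the game's true outcome z.
def td_error_final(v_f, z):
    return z - v_f       # δ_f = z - V_f
```

A positive δ means the position turned out better than predicted, so the update nudges the earlier prediction upward; a negative δ nudges it down.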

Now, rather than include only a single gradient, we want to include all past gradients while paying more attention to the most recent ones. This is accomplished by keeping a history of gradients and decaying each one by a power of λ that reflects how old the gradient has become:

θ ← θ + α δ_t Σ_{k=1}^{t} λ^(t−k) ∇_θ V_k

The full definition of TD-Lambda includes a sum over all previous gradients, each decayed by λ.

Eligibility Traces

Keeping a running history of gradients can become memory-intensive, depending on the size of the network and the length of the game. An elegant solution to this problem is something called an “eligibility trace,” which replaces the gradient sum in the parameter update with a single running gradient. The eligibility trace is defined as:

e_t = λ e_{t−1} + ∇_θ V_t,    with e_0 = 0

Definition of an eligibility trace decayed by λ.

Basically, we decay our eligibility trace by λ and then add the new gradient. With this, our parameter update becomes:

θ ← θ + α δ_t e_t

New parameter update for TD-Lambda, using the eligibility trace in place of the gradient sum.

This effectively allows our parameter updates to take past decisions into account: when we backpropagate from the end-game state, the gradients from earlier states in the game still contribute, without our having to keep a complete history of gradients.
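Putting the pieces together, here is a minimal sketch of the TD(λ) update with an eligibility trace, assuming a linear value function V(s) = θ·s so that the gradient with respect to θ is just the state’s feature vector. The trajectory and hyperparameters are toy values, not TD-Gammon’s:

```python
import numpy as np

def td_lambda_updates(theta, states, z, alpha=0.1, lam=0.7):
    """One pass of TD(lambda) over a finished game, assuming linear V(s) = theta @ s."""
    e = np.zeros_like(theta)               # eligibility trace, e_0 = 0
    last = len(states) - 1
    for t, s in enumerate(states):
        v_t = theta @ s
        if t == last:
            target = z                     # final step: compare against the true outcome
        else:
            target = theta @ states[t + 1]
        delta = target - v_t               # TD error (delta_t or delta_f)
        e = lam * e + s                    # decay the trace, add the new gradient
        theta = theta + alpha * delta * e  # update uses the trace, not a full sum
    return theta
```

Each step touches only a single trace vector the size of θ, so memory stays constant no matter how long the game runs.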

Results

At the start of training, each game can take hundreds or even thousands of turns to complete because the untrained network is effectively playing a random strategy. As the network learns, games shorten to around 50–100 turns, and after around 1,000 games (about an hour of training) the network will outperform an opponent making random moves.

The average loss for a game can never really reach zero, because there is more uncertainty at the beginning of a game, but it can still be useful for visualizing convergence:

Average loss for each of 5,000 games.

Conclusion

Hopefully, this post shed some light on a small part of the history of recent deep reinforcement learning papers and the temporal-difference learning algorithm. If you’re interested in learning more about reinforcement learning definitely check out Richard Sutton’s book on the topic. You can also download the code for this implementation of TD-Gammon and play against the pre-trained network included in the repo.

Follow me on Twitter for more posts like these. We also do applied research to solve machine learning challenges.