Reinforcement Learning for the Game of Splendor

By Max Candocia

Previously, I wrote about teaching a computer to play board games and learning strategies from it. The example game, Machi Koro, is relatively simple, and the computer learned to play it very fast, and some easy-to-understand strategies were extracted from it.

Recently, I wrote a similar, more complex, AI that learned to play the game Splendor. The rules of the game itself are not complicated, but the strategy and planning involved are.

Basic Rules

The starting layout for 4 players looks like this:

(the layout is similar for 2 or 3 players, but some of the square cards at the top/gems in the piles on the right are removed)

The basic rules are as follows:

During each players turn that player may take one of the following actions: Take 3 gems from different piles

Take 2 gems from a single pile that has 4 or more in it

Reserve a face-up card and take a yellow (gold) gem that can be used as any other color

Reserve a face-down card on the top of a deck and take a yellow (gold) gem that can be used as any other color

Purchase a face-up card or a card that player reserved using gems and/or the discounts from cards they have If a player has more than 10 gems at the end of a turn, that player must return gems to the piles so that that player only has 10 The cost of a card is indicated by the circles with colors and numbers on the mid-lower left-hand side The cost of a card is reduced by one of each color for each card of those respective colors that player has (this can make some cards free, but the cost can't go below 0). If a card has a number on the upper-left hand corner, that player receives that many points. At the end of your turn, if you have a combination of cards of which the color matches one of the noble (objective) tiles at the top, you choose one of those to take and you get 3 points. e.g., if you have 3 black, 3 red, and 3 white cards (indicated by color of gem in upper right-hand corner), you get the noble on the far left of the top row The final round is the one during which a player reaches 15 or more points Tiebreakers are determined by number of nobles (highest wins), and then number of cards (lowest wins)

For a detailed run-through, see Board Game Geek's tutorial.

Alternately, you can check out a more thorough explanation at BoardGameHalv.com.

General Strategy

When humans play the game, the general strategy is to purchase less expensive cards first so that you can use the discounts they grant as much as you can. Acquiring the nobles at the top by purchasing a certain combination of cards is also a common goal players have when making decisions.

The difficulty of making the decisions involve planning ahead:

Which cards should a player purchase, given that other players may have the opportunity to buy them before that player does?

What gems should a player choose, knowing that their decision may limit another player's ability to acquire gems and purchase specific cards

What cards are worth reserving? This can be done because you want to buy the card later, you want to prevent another player from buying the card, or you need the gold gem that comes along with it to buy something else

What objectives are realistic to strive for given the current board state

Reinforcement Learning Simulations

Neural Network

In some cases, it is obvious what the best move is, but in others, it requires some serious planning. A good player should be able to play efficiently, as well as be able to figure out how to undermine their opponent's strategy.

When simulating the game, I gave each of the four players a different, randomly-initialized neural network to use for decisionmaking. For a simple explanation of neural networks, see my article, A Simple Explanation of how Computers Recognize Images. The inputs I use are shared across all players, and the inputs include reserved cards, number of gems, number of cards owned, points, and turn order. I also include a representation of the game, all of its cards, all of its nobles (objectives), and the turn number. The implementation of the neural network creation can be found in this this source code. Alternatively, I have a visualization of the network below:

Each of the boxes represents numbers that are input to the next layer. The number of inputs are indicated in parentheses. Note that some of the inputs, such as cards on the board and a player's reserved cards, have a blue box below them indicating how many of those layers are used. For example, there are 12 cards on the board, so there are 12 different inputs of game cards that are represented by 15 input values (5 colors for cost, 5 colors for the card color, 3 for tier, 1 for point value, 1 for blank indicator).

Additionally, I connect each player's input to the main dense layer, since a lot of information could be lost when condensing it in the "funnel layers" that I use.

The motivation behind using this type of network is that I can reasonably constrain the values the network learns as to better represent the variables used to predict chances of winning or various score outputs I will describe below.

Decisionmaking and Output Variable

When training the AI, it needs some sort of output to optimize for, as well as a way to act on its predictions of that output. When training, there were 4 values I had it predict:

The estimated score in 1 turn, plus 0.1 times owned cards, plus 0.01 * owned gems The estimated score in 3 turns, plus 0.1 times owned cards The estimated score in 5 turns, plus 0.05 times owned cards The chances of winning

These values are added together in various proportions to yield a target "score". When initially training the network, I gave a much higher weight to the short-term predictions than the long ones. I slowly changed the weighting of these decisions to emphasize long-term predictions more as training progressed. Additionally, the decisions become less random over time, as well, so that decisions that are likely to have better results will be chosen even more frequently than before. Here is a simple flowchart to describe the process:

Essentially, I simulate a few hundred games, then train the AI. After a few rounds of this, I also reset the data and update the decision weights and temperature. The temperature is a parameter inspired by temperature in physics that controls the randomness of the AI's decisions. At high temperatures, all possible decisions are almost equally likely, and at low temperatures, the best decision is almost certainly the only one that will be made.

When determining the probability of each possible action, I use the Boltzmann distribution, with the "energy" being the negative score of each predicted outcome.

First Results

When running the first trials, my goal was to determine effectiveness using two metrics:

Number of turns it takes to complete the game, on average What percentage of games does an AI with N rounds of training win versus 3 other AI with K rounds of training

When the AI plays against other AI with the same amount of training, the number of rounds doesn't affect the expected game length too much. This is likely because the strategy transforms from getting yourself a lot of points to preventing your opponent from doing the same.

For the second test, I chose to pit AI with 2, 6, and 12 rounds of training against each other. Each inter-round pair has 200 samples (50 for each player).

The 2-round AI was clearly inferior to the others, but it is difficult to say what effect 12 rounds had versus 6. The 30.5% win rate of 12 rounds vs 6 is not consistent with the 24.5% rate that a single 6-round AI has versus 3 12-round AIs. It is possible that the 12-round AIs all outsmarted each other, eliminating their advantages.

If one performs a one-sided p-value test, assuming a binomial distribution with n=200 and a probability of winning being 25%, we can determine if the results of 63 wins out of 200 are significant.

Using the R code 1-pbinom(63, 200, 0.25) for the p-value calculation yields a probability of 0.0154, which indicates that it is highly unlikely that the high win rate is purely by chance if the odds of winning were equal among all players. That p-value grows quickly, however, when one tests above 26% for chances of winning.

I will need to make adjustments to the learning process to improve the AI in the future, and more extensive tests will be performed then. Ideally I can get a narrower confidence interval to determine exact performance rates.

Strategy Differences

I decided to look at the differences in what the winning players of games with various amounts of training look like. It turns out the main difference is that the 6 and 12-round AIs prefer to have a second or third objective much more frequently, with an expected value of 1.4 objectives per game, rather than 1.07 objectives. This also translates to one more cheaper card (tier 1) and one fifth fewer expensive cards (tier 3) per game.

Future Work

I will be testing out longer sets of training for future runs, as well as slightly larger networks. I will also try fine-tuning networks with very low temperatures at the end of different phases of training so that the randomness of decisionmaking can be limited. I will also add a command-line interface for human players to interact with the game to get an idea of what it's like to play agains the AI.

The games take about 6-10 seconds to run, so I may need to acquire more RAM in order to run games in parallel.

Sources

GitHub Repository: https://github.com/mcandocia/splendor_ai

IPython Notebooks

For running the code, I used IPython notebooks for their relatively ease of use for testing and running code.

First 6 rounds of training

Next 6 rounds of training.

AI Comparison Notebook