Introduction Pentago is a board game designed by Tomas Flodén and developed and sold by Mindtwister. The rules are given below. Like chess and go, pentago is a two player, deterministic, perfect knowledge, zero sum game: there is no random or hidden state, and the goal of the two players is to make the other player lose (or at least tie). Note that pentago is not related to go; it is much closer to gomoku. Unlike chess and go, pentago is small enough for a computer to play perfectly: with symmetries removed, there are a mere 3,009,081,623,421,558 (3e15) possible positions. Thus, with the help of several hours on 98304 threads of Edison, a Cray supercomputer at NERSC, pentago is now strongly solved. "Strongly" means that perfect play is efficiently computable for any position. For example, the first player wins. This site demonstrates the strong solution. For boards with 17 or fewer stones, the result is read out of a 4 terabyte database. Since the number of positions increases exponentially with the number of stones, the database stops after this, and the result must be computed from scratch. This takes about 15-20 seconds for an 18 stone board if the server is below full load, much less for boards with 19 or more stones.

Pentago rules The rules of pentago can be found on the Mindtwister website, or on Wikipedia. For completeness, we reproduce them here. Pentago is played on a 6 by 6 board, divided into four 3 by 3 quadrants. There are two players, black and white, who alternate turns. The goal of each player is to get five stones of their color in a row, either horizontally, vertically, or diagonally. Each turn, a player places a stone in an empty space in some quadrant, then chooses a possibly different quadrant to rotate 90 degrees left or right. If both players get five in a row at the same time, or the last move is played with no five in a row, the game is a tie. If a player makes five a row by placing a stone, there is no need to rotate a quadrant: the player wins immediately. To use the visualization, click on an empty spot to place a stone, or on an arrow to rotate the board. After a short wait, the results of all moves will show up: green moves win, blue moves tie, and red moves lose. You can back up either by pressing the back button, or using your browser history. If you find an interesting position, you can save it by copying the browser URL, which records both the current position and the history of the game up to this point.

Algorithms: brute force + symmetry I will be writing up more details of the computation in paper form soon, but here is a brief summary. For background on the complexity of pentago and the applicability of a variety of solution algorithms, see Niklas Büscher's 2011 thesis, "On Solving Pentago", which analyzed the game but did not solve it. As mentioned above, pentago has 3,009,081,623,421,558 (3e15) possible positions once symmetries are taken out. The number of positions increases exponentially up to 24 stones (where there are equal numbers of black, white, and empty squares), then falls off: Unfortunately, since each full move involves both placing a stone and choosing a rotation, the branching factor is very high: there are 288 possible first moves including the first rotation. To reduce the branching factor to a manageable level, we perform all computations in terms of rotation-abstracted "superpositions" consisting of all ways of rotating the quadrants of a given board. Since there are four ways to rotate each of four quadrants, one bit per position becomes a 4-dimensional array with 44 = 256 bits total. These 256 bits are packed into two 128-bit SSE registers, allowing us to compute 256 values at a time using SSE bit twiddling. The resulting code is fairly hairy, but is much faster than operating on one position at a time. With the branching factor under control, we solve the game by traversing every possible position, starting at 36 stone positions and moving backwards through the game to the zero stone start. Around the peak at 24 stones, the computation requires 80 TB of RAM even with compression, and was therefore parallelized using MPI and run on 98304 threads of Edison, a Cray supercomputer at NERSC (the National Energy Research Supercomputing Center). Positions with more than 18 stones were discarded after use, and those with 18 or fewer were written to disk using lzma compression to reduce down to 4 TB total. The large scale computation took less than four hours, plus a few more hours on fewer nodes to finish up. In the majority of existing supercomputer codes, all processes operate in lockstep performing the same operation on a subset of the data (SIMD). The pentago computation is in some sense even easier: positions with the same number of stones do not depend on each other, so for each stone count the computation is embarassingly parallel. Unfortunately, this simplicity evaporates when we take memory in account: at the peak of the computation each node is operating near the limit of its available RAM, leaving only a small amount of space left over to store the inputs for required computations (which come from other nodes). Thus, each node must request only a few inputs at a time, compute the result, and possibly send the data off to be stored. The overall control flow becomes asynchronous, with each process dynamically responding to requests for data, replies to its own requests, etc., taking care to never exceed the memory limit. In addition to increasing the complexity, the requirement to only store a few inputs at a time turns an otherwise bandwidth-limited computation into a latency-limited one. Indeed, the computation was significantly slower than expected on the older NERSC supercomputer Hopper due to unexpectedly high latency, possibly exacerbated by the asynchronous structure of the communication. The above discussion applies to the full solve required to compute the 4 TB database, which this website uses to lookup perfect play for positions with 17 or fewer stones. If the site sees a position with 18 or more stones, its children are outside the database, so we perform a much smaller traversal over all downstream positions. The set of downstream positions is much easier to deal with thanks to abstracting away the rotations, as the set of empty spots does not move around. These smaller solves have the same structure as the full solve, but require only 1 GB of RAM and 15-20 seconds on a single thread. As a perk, we can take advantage of parity to store only half of the 256 possible rotations at each step since the rotation parity flips every move.

Server setup The 4 TB data set, this website, and the backend server is hosted on Google Cloud. Prior to 2020 it used hosting generously donated by Rackspace. The data is stored in a Google Cloud Storage bucket, and the backend server uses this data to look up optimal play for positions with 17 or fewer stones. For positions with more stones, the server used to compute the value from scratch, but this functionality has been ported to WebAssembly and moved into the client. The backend server is https://us-central1-naml-148801.cloudfunctions.net/pentago . The API is fairly simple: give it the code for a pentago board and it replies with json containing the values of all child positions. For example, here is the start position. You are welcome to use the server directly for small numbers of queries, but please email me if you want to use it more intensively.

Code and data All code for this project (including this site and its backend) is BSD licensed and hosted on Github at https://github.com/girving/pentago The data set is released into the public domain (CC0). Please email me if you want download access to the raw data.

Usefulness The intrinsic value of solving pentago aside, my hope is that this project provides an example of an MPI calculation with characteristics that are uncommon now but may be significantly more common in future: asynchronous control flow, extreme memory limitations, and heavy dependence on fast compression algorithms and integer performance. Asynchrony in particular makes MPI algorithms far more difficult to develop, debug, and optimize, emphasizing the need for better debugging and tracing tools. Strong asynchrony support in the underlying MPI implementations is a relatively new feature: the recent MPI 3 standard should ease the development of similar applications once it is widely available. Finally, in-memory compression will grow in importance as CPU speed increases relative to memory, memory bandwidth, and disk, a particular problem on platforms such as Blue Gene with poor integer performance.