How to Distribute the Game of Life

Do you have a very large grid that you want to use in a simulation, and the grid doesn't fit into the memory of a single node? If so, learning how to develop a distributed memory version of the Game of Life may be a practical solution.



If you've read my last two posts, you may be asking yourself why you might want to go through all the hassle of developing a distributed memory version of the Game of Life. For me, I find the exercise stimulating. However, a more practical reason is when you have a very large grid that you want to use in the simulation and the grid doesn't fit into the memory of a single node. This is one of the primary reasons that cluster computing was developed. With multiple machines you can put together a very large amount of aggregate memory, but the computation must be amenable to parallelism and to distribution across compute nodes that are only connected via some network.

With regard to coding the Game of Life for execution on a cluster, the first consideration is how to divide up the grid on which the cells will live and die. In the simplest implementation, this will be a finite grid that can be split up in a number of ways, depending upon how many nodes you have in your cluster. The MPI library has functions (MPI_Comm_size and MPI_Comm_rank) that report how many processes are executing your program and which of those processes the caller is. From these you can choose a logical arrangement of the grid onto the nodes and, if you've gone with a straightforward distribution, each process will be able to determine its logical neighbor processes. For example, if you have nine nodes in the system, you can divide the grid into nine rectangular subgrids with the processes arranged in three rows and three columns (like the grid used for tic-tac-toe). If you assign processes by numbering the subgrids in row-major fashion, the one assigned to the center subgrid will be process P4. (I'm identifying processes in my example by MPI rank number, starting with zero.) If we use the points of the compass to describe the placement of adjacent processes, then the neighbors of P4 will be processes P1 (North), P7 (South), P3 (West), and P5 (East).
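The neighbor-rank arithmetic for a row-major process grid can be sketched as plain C. This is an illustrative helper (the names and the -1 sentinel are my assumptions, not from any MPI header); in real MPI code you would use the constant MPI_PROC_NULL in place of -1:

```c
#define NONE (-1)  /* stand-in for MPI_PROC_NULL in this sketch */

typedef struct { int north, south, west, east; } Neighbors;

/* Compute the four neighbor ranks of `rank` in a rows x cols
 * process grid numbered in row-major order starting at zero. */
Neighbors grid_neighbors(int rank, int rows, int cols) {
    Neighbors n;
    int r = rank / cols;   /* this process's row in the process grid */
    int c = rank % cols;   /* this process's column */
    n.north = (r > 0)        ? rank - cols : NONE;
    n.south = (r < rows - 1) ? rank + cols : NONE;
    n.west  = (c > 0)        ? rank - 1    : NONE;
    n.east  = (c < cols - 1) ? rank + 1    : NONE;
    return n;
}
```

For the 3x3 example from the text, rank 4 (the center) gets neighbors 1, 7, 3, and 5, while a corner such as rank 0 gets the sentinel for its missing North and West neighbors.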

How does the initial live cell configuration for each subgrid get placed into the appropriate node's memory for the assigned process? I can think of two ways that this can be done. First, if you have the initial configuration in separate files, one per subgrid, and each node can read from the shared filesystem, the processes simply read in their initial data. This approach ties the solution to a specific grid decomposition and cluster configuration, and requires some preprocessing of the initial data before the simulation even gets started.

The second (and more likely) method would be to have one process (whose node is attached to the I/O system) read in the initial live configuration for each subgrid, in turn, from a file and then send the data to the appropriate process. This reader process can be separate from the computing processes or may itself be one of the team of computing processes. In the former case, each computing process first waits on the initialization data (message) before starting to execute the simulation; in the latter, all but the reader process start by waiting for the initialization message, and the reader does all the input and sends out the initial configuration data messages before reading its own initial configuration data and joining the computation.

Once the initial configuration data has been received by each process, the first derived generation can be computed. The bulk of the computation for each generation is done as if the process were the only process doing a computation on the only grid in the simulation. Like the serial version that I described in a prior post, there is going to be a border around the edges of the rectangular subgrid being simulated by each process. Unlike the serial version, many of these borders will not be edges of the overall grid; these local borders will abut the subgrid held by some other process. In order to know the exact number of live neighbors (out of 8) for those cells sitting on the edges of the local subgrid, the process needs to know the status of the grid points across the logical subgrid boundary. (I can think up some bad ways that this can be done and I won't bother to enumerate them here.) The most common solution to this problem is to use what are known as ghost cells.

Simply put, ghost cells are grid points that are assigned to a different process, but whose data is duplicated in a bordering process. As a concrete example, let me use my previous 3x3 decomposition and the middle process. At each edge of the subgrid assigned to P4, I add another row or column of ghost cells. The top row (northmost) of ghost cells will hold a copy of the data in the southernmost row of P1, the eastmost (furthest right) column of ghost cells holds a copy of the westmost (furthest left) column of P5, and so on around the edges of the subgrid. At the same time, the corresponding extreme edges of the subgrid in P4 will be duplicated in ghost cells held by the neighboring processes. The ghost cells are never updated by a process that holds them; they are simply there to make the computations (and the corresponding coding) at the process easier or more like the serial version.
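With ghost cells in place, the per-generation update looks just like the serial version. Here is a minimal sketch, assuming the local subgrid is stored as a (rows+2) x (cols+2) array of ints (0 = dead, 1 = alive) whose outermost row and column on each side hold the ghost data; the names and fixed MAXDIM dimension are my own, for illustration:

```c
#define MAXDIM 32  /* assumed maximum subgrid width, for the sketch */

/* Compute one Life generation for the interior cells (1..rows, 1..cols)
 * of `cur` into `nxt`.  Ghost cells in row 0, row rows+1, column 0, and
 * column cols+1 are read but never written. */
void next_generation(int rows, int cols,
                     int cur[][MAXDIM], int nxt[][MAXDIM]) {
    for (int i = 1; i <= rows; i++) {
        for (int j = 1; j <= cols; j++) {
            int live = 0;
            for (int di = -1; di <= 1; di++)      /* count the 8 neighbors;   */
                for (int dj = -1; dj <= 1; dj++)  /* ghost cells supply the   */
                    if (di || dj)                 /* values across the border */
                        live += cur[i + di][j + dj];
            /* standard Life rules: birth on 3 neighbors, survival on 2 or 3 */
            nxt[i][j] = (live == 3) || (cur[i][j] && live == 2);
        }
    }
}
```

Because the ghost cells are read-only here, the code never needs to know whether a given border value came from a neighboring process or from the fixed edge of the overall grid.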

After a new generation of living and dying and birthing cells has been computed within a process, the data duplicated in the ghost cells may have changed within the process that owns the actual cells. Because of this, before the computation for the next generation can begin, the processes must exchange ghost cell data for the just-computed generation with all bordering processes. In a simple grid arrangement of processes, then, each process will send four messages containing either a row or a column of the edgemost cells' liveness data and receive four messages, one from each of the bordering processes, to populate the ghost cells for the computation of the next generation. I recommend using asynchronous send and receive functions for these exchanges. This allows each process to send off its messages, set up the receipt of each message from the appropriate neighbors, and then await the actual arrival of the data. You can even post the receives before starting to compute the current generation and then wait for the messages to arrive only when the data is needed before computing the next generation.

Now, if you've been paying attention in the last couple of paragraphs and haven't fallen asleep in your bowl of Cinnamon Life cereal (assuming you're reading this at breakfast), you might want to know where an edge or corner process, like P3 or P8 from my example, would send messages that don't have a recipient process. In my example, only P4 has four neighbors that are other processes, so this problem is of concern to all but one of the processes. In all cases, there will be ghost cells around all edges of the subgrid to be computed. In those processes that are on the edges of the overall grid, you simply initialize those ghost cells as you would in the serial version.

As for exchanging data in a process with fewer than four neighboring processes, you could treat these nodes as special cases in your code. A process would need to determine whether it is on an edge or at a corner of the overall grid and which edges of its subgrid border other processes, and then only send/receive messages to/from those actual processes. This will make the code messy at the point where you are exchanging ghost cell data. There will be code for nine different types of nodes: one for the "middle" processes with four true neighbors, a different set of code for each of the four corner nodes, and a set of code for each of the edge nodes that have only three neighbors (differentiated by which neighbor is not another process). This clutter of code is not something that I would want to maintain at some later date.

Luckily, MPI has a facility for designating that a message is going to, or is being received from, a non-existent process. The MPI constant MPI_PROC_NULL can be used as the destination parameter of a send function or as the source parameter of a receive function. Calling a send or receive function with this constant in the appropriate parameter causes the call to be treated as a NOP (no operation): control returns as soon as possible with no changes to the message buffer.

I use this facility by setting up variables with the process numbers of the destinations and sources of ghost cell exchanges. Each process (middle, edge, or corner) can then create four messages with border cell data and send them to the logical neighbor processes. The communication runtime will take care of those messages that do not go to processes. Similarly, a process can set up four asynchronous receives and, after the messages have been signaled as being received, only move data from the receive buffers for those messages that correspond to actual processes. A code snippet that does some of this set up and part of the exchange is shown below.

// assumes ROW and COL are dimensions of process grid
North = myid - COL;
if (North < 0) North = MPI_PROC_NULL;
East = myid + 1;
if ((myid+1) % COL == 0) East = MPI_PROC_NULL;
. . .
// asynchronous sends of integers with tag value 5
MPI_Isend(northGhostBuffer, numNorthCells, MPI_INT, North, 5,
          MPI_COMM_WORLD, &northRequest);
MPI_Isend(eastGhostBuffer, numEastCells, MPI_INT, East, 5,
          MPI_COMM_WORLD, &eastRequest);
. . .
// asynchronous receives of integers with tag value 5
MPI_Irecv(northGhostBufferIn, numNorthCells, MPI_INT, North, 5,
          MPI_COMM_WORLD, &northReceiveRequest);
MPI_Irecv(eastGhostBufferIn, numEastCells, MPI_INT, East, 5,
          MPI_COMM_WORLD, &eastReceiveRequest);
. . .
if (North != MPI_PROC_NULL) {  // exchange with real process
    MPI_Wait(&northReceiveRequest, &northStatus);
    // Move data from buffer to ghost cell elements
}
if (East != MPI_PROC_NULL) {  // exchange with real process
    MPI_Wait(&eastReceiveRequest, &eastStatus);
    // Move data from buffer to ghost cell elements
}

The asynchronous send and receive functions have one more parameter than their blocking cousins. This is a request object that is used to determine when the operation has completed. For an asynchronous send, completion means the message has been copied out of the buffer to the system; for an asynchronous receive, it means a matching message has arrived in the buffer. Calling MPI_Wait() will block on the given request object until the associated operation has completed. Since there is no message to receive from a non-existent process, you can skip processing ghost cell data in those cases.

Communication within clusters is slower than computation by several orders of magnitude. This is why your distributed memory applications should attempt to minimize the number of messages needed. Sending and receiving messages to/from four processes after every generation is computed will most assuredly drag the performance of the simulation down to a snail's pace. One easy modification is to widen the ghost cell region used along each subgrid border. If you have two rows/columns of ghost cells, you can compute two generations before needing to refresh them; with five rows/columns, five generations are computed before an exchange is needed, and so on.
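The savings are easy to quantify. As a back-of-the-envelope sketch (the function name and the simple four-neighbor model are my assumptions), if each process exchanges four messages per refresh and a halo H cells deep allows H generations between refreshes:

```c
/* Total messages a middle process sends over `generations` generations,
 * assuming one 4-message exchange per `halo_depth` generations
 * (rounded up to cover a final partial interval). */
int messages_sent(int generations, int halo_depth) {
    int exchanges = (generations + halo_depth - 1) / halo_depth;
    return 4 * exchanges;
}
```

For 100 generations, a depth-1 halo costs 400 messages while a depth-5 halo costs 80, a fivefold reduction in message count (at the price of the redundant border computation discussed next).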

In the case of multiple ghost cell rows, you will need to do the update computations on some of the ghost cells, duplicating work already being done in the bordering process. Also, the number of ghost cell rows/columns the process must update decreases by one each generation. For instance, with five rows of ghost cells, the process must update the closest four rows in the first generation, with the outermost row being read-only. The next generation's computation will need to update only the three innermost rows, with the next-to-outermost being read-only, and so on. Even though each process is doing some redundant computation on cells not actually assigned to it, this extra computation pays off by requiring fewer messages to be sent and received in total.

The last bit of business for the distributed simulation is to figure out how to get the current state of cells out of all the processes when the simulation has run through the requisite number of generations. My advice is to do whatever was done to provide the initial live configuration to each process, but in reverse. If one process sent a message to each of the other processes with that initialization, then each process simply puts together a message detailing the aliveness of each cell in the assigned subgrid and sends that back to the one process.
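Whichever direction the data flows, each process needs to separate its real cells from the ghost border when building the final message. A minimal sketch of that packing step, assuming the same (rows+2) x (cols+2) halo layout as before (names and the fixed dimension are illustrative; the resulting buffer would be handed to something like MPI_Send):

```c
#define MAXD 32  /* assumed maximum subgrid width, for the sketch */

/* Copy the interior cells of a halo grid into a flat buffer in
 * row-major order, skipping the ghost rows/columns.  Returns the
 * number of cells packed (rows * cols). */
int pack_subgrid(int rows, int cols, int grid[][MAXD], int *buf) {
    int n = 0;
    for (int i = 1; i <= rows; i++)       /* skip ghost rows 0 and rows+1 */
        for (int j = 1; j <= cols; j++)   /* skip ghost cols 0 and cols+1 */
            buf[n++] = grid[i][j];
    return n;
}
```

The collector process can then reassemble the full grid by writing each subgrid's buffer into the appropriate rectangle of the output, using the same row/column arithmetic that assigned the subgrids in the first place.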

And that's how you can implement a distributed computation to run the Game of Life. Any details left out of this post (such as the parameters and other details of the MPI functions used or shown) are left as an exercise for the interested reader.

These last few posts have been working with the simplest implementation of Game of Life, which uses a two-dimensional array for the simulation. Next, I want to discuss some alternatives to this version and some strategies for parallelizing them.