Game of Life — Using Threads and Lists, Part 3 If I'm programming my list-based version of Game of Life in the real world, how can I parallelize it when I am using a third-party provided data structure?



In Part 1, I described a task decomposition that threads the logical pairs of TraverseList() calls, a choice driven by the dependencies between generations of the simulation and by how the lists are updated and accessed. Part 2 showed how to parallelize the TraverseList() loop over an array implementation of the list data structure. All well and good. The second scheme, however, relies on the programmer having implemented the list data structure and having done it as an array. Because we live in the real world, you are probably asking, "Who implements their own list structure these days, besides students in a Data Structures course?" If you need a common data structure in your algorithm, you are most likely to find one that is part of a standard library with a standard interface and use that. So, if I'm programming my list-based version of Game of Life in the real world, how can I parallelize it when I am using a third-party provided data structure?

Method the Third: Using a Third-Party List Structure

The first thing to ensure is that your chosen list data structure is thread-safe. That is, threads can concurrently insert and delete items while the contents of the list stay correct. To meet this first requirement, I'm going to use the Intel Threading Building Blocks (TBB) concurrent_queue container. Strictly speaking, this is not a list structure, but it will suffice for the computations needed to implement Game of Life, as I'll show over the next few paragraphs.
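To make "thread-safe" a little more concrete, here is a minimal sketch of a mutex-guarded queue built only from the C++ standard library. It is a hypothetical stand-in, not how TBB implements concurrent_queue; the class name LockedQueue and the function concurrentPushCount() are invented for illustration, and only the method names push(), try_pop(), and unsafe_size() are borrowed from the container discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread-safe queue (NOT the TBB implementation).
// Every operation takes the same lock, so concurrent callers cannot corrupt
// the underlying std::queue.
template <typename T>
class LockedQueue {
public:
    void push(const T& item) {
        std::lock_guard<std::mutex> guard(mtx_);
        items_.push(item);
    }

    // Returns false instead of blocking when the queue is empty.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mtx_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

    // The value is only reliable while no other thread is pushing or popping.
    std::size_t unsafe_size() const {
        std::lock_guard<std::mutex> guard(mtx_);
        return items_.size();
    }

private:
    mutable std::mutex mtx_;
    std::queue<T> items_;
};

// Push perThread integers from each of numThreads threads and report how
// many items ended up in the queue once all threads have finished.
std::size_t concurrentPushCount(int numThreads, int perThread) {
    LockedQueue<int> q;
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&q, perThread] {
            for (int i = 0; i < perThread; ++i) q.push(i);
        });
    }
    for (auto& w : workers) w.join();
    return q.unsafe_size();
}
```

With a container that is not thread-safe, the same experiment could lose items or crash; with the lock in place, every push is accounted for.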

The next thing to realize is that the concurrent_queue container does not have a TraverseList() method. This means that I need to implement some way to visit each item on the list and process it with the correct function. My solution is to write a small "helper" function for each of the four processing functions that are executed in a single generation. The concurrent_queue container has an unsafe_size() method, and I can use it to "iterate" through the whole list with a for loop. To access each item in the container, I use the try_pop() method, which I know will always yield a grid cell because the loop pops exactly as many items as the queue held on entry and nothing else removes items in the meantime. As an example, I create the do_AddNeighbors() helper function with the following:

void do_AddNeighbors()
{
    int num_newlive = newlive.unsafe_size();
    for (int i = 0; i < num_newlive; ++i) {
        ListEntry cell;
        newlive.try_pop(cell);
        AddNeighbors(cell);
    }
}

Now, in place of the TraverseList() calls, I put in a call to the appropriate helper function. These changes aren't parallel, yet, but just the prelude to get started with a thread-safe structure.
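Since the four helper functions differ only in which queue they drain and which function they call, one possible variation is a single generic helper. The sketch below is an assumption-laden illustration, not the code from this series: ListEntry is given a made-up layout, SimpleQueue is a single-threaded stand-in that exposes only the two methods the helper needs, and drainQueue() and sumRowsOfThreeCells() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical cell type; the real ListEntry layout is not shown in the article.
struct ListEntry {
    int row;
    int col;
};

// Single-threaded stand-in exposing the two methods the helper relies on.
class SimpleQueue {
public:
    void push(const ListEntry& e) { items_.push_back(e); }

    bool try_pop(ListEntry& out) {
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

    std::size_t unsafe_size() const { return items_.size(); }

private:
    std::deque<ListEntry> items_;
};

// Pop exactly as many items as the queue held on entry and pass each one to
// process. Returns the number of cells actually processed.
template <typename Queue, typename Fn>
std::size_t drainQueue(Queue& q, Fn process) {
    const std::size_t count = q.unsafe_size();
    std::size_t handled = 0;
    for (std::size_t i = 0; i < count; ++i) {
        ListEntry cell;
        if (q.try_pop(cell)) {  // defensive check; expected to succeed every time
            process(cell);
            ++handled;
        }
    }
    return handled;
}

// Example: queue three cells, then sum their row indices while draining.
int sumRowsOfThreeCells() {
    SimpleQueue newlive;
    newlive.push({1, 0});
    newlive.push({2, 0});
    newlive.push({3, 0});

    int rowSum = 0;
    drainQueue(newlive, [&rowSum](const ListEntry& c) { rowSum += c.row; });

    // The queue is empty after draining, so the size term adds nothing.
    return rowSum + static_cast<int>(newlive.unsafe_size());
}
```

Whether one template or four small named helpers reads better is a matter of taste; the named helpers keep each call site self-explanatory.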

If you're familiar with TBB containers, you may realize that rather than a counted for loop, I could have used the iterator to access each element in the structure. I didn't go that route for two reasons. First, the TBB documentation says the iterators are slow and should only be used for debugging, and second, I need to know the number of iterations to be able to make these loops parallel.

Before I make the change to parallel code, I ensure all the functions that process a single cell from a list will remain the same as they were in the second version of my parallelized list-based Game of Life. Specifically, on a Windows system, the updates of the numNeighbors array are protected with Interlocked intrinsics. The same is going to be required for the change (or potential change) to the status of a grid cell being born (ALIVE) or dying (DEAD).
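For readers not on Windows, the same kind of protection can be sketched with std::atomic from the C++ standard library. This is a portable stand-in under stated assumptions, not the Interlocked-based code from Part 2: the grid dimensions, the status encoding, and the function names incrementNeighbor(), tryVivify(), and hammerOneCell() are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical grid dimensions, chosen only for this sketch.
constexpr int kRows = 4;
constexpr int kCols = 4;

// Static storage, so every counter starts at zero.
std::atomic<int> numNeighbors[kRows][kCols];

// Assumed status encoding for a grid cell.
enum CellStatus : int { DEAD = 0, ALIVE = 1 };

// Atomic increment: concurrent callers never lose an update.
void incrementNeighbor(int row, int col) {
    numNeighbors[row][col].fetch_add(1, std::memory_order_relaxed);
}

// Flip a cell from DEAD to ALIVE. Only one caller can win the exchange, so
// the cell would be queued once even if several threads attempt the change.
bool tryVivify(std::atomic<int>& status) {
    int expected = DEAD;
    return status.compare_exchange_strong(expected, ALIVE);
}

// Have several threads increment the same counter and return the final value.
int hammerOneCell(int numThreads, int perThread) {
    numNeighbors[1][1].store(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([perThread] {
            for (int i = 0; i < perThread; ++i) incrementNeighbor(1, 1);
        });
    }
    for (auto& w : workers) w.join();
    return numNeighbors[1][1].load();
}
```

With a plain int in place of std::atomic<int>, the increments from different threads could overwrite one another and the final count would be unpredictable.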

When that is done, I can simply implement parallelism on the loop within each helper function, much like I did with the code for TraverseList(). Since I'm using the TBB concurrent_queue container, I will keep with the TBB theme and add a call to the TBB parallel_for algorithm. Here is an example of that using the lambda variation:

void do_AddNeighbors()
{
    int num = newlive.unsafe_size();
    parallel_for(blocked_range<int>(0, num),
        [&](const blocked_range<int>& r) {
            for (int i = r.begin(); i < r.end(); ++i) {
                ListEntry cell;
                newlive.try_pop(cell);
                AddNeighbors(cell);
            }
        }
    );
}

Simple.

What's the biggest difference between a list and a queue? My answer, or the one that is most relevant to this discussion, is that elements on a list can be reexamined many times without changing the list contents; once you take something off a queue, you have to put it back at the end of the queue if you need to reference it later. And, in order to reference it again, you have to go through all the other elements on the queue (and likely stick them back onto the end of the queue). My point on this final topic is that queues do not make good substitutes for a general list structure. Fortunately, it works perfectly in this instance.
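To put a number on that cost, here is a small sketch using std::queue from the standard library (not the TBB container). The helper peekAt() is hypothetical; it looks at one element in the middle of a queue and, to leave the queue unchanged, has to pop and re-push every element.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <queue>

// Return the value at position index (0 is the head) while leaving the queue
// contents and order unchanged. Every element is popped and pushed back once,
// so a single lookup costs a full pass over the queue.
int peekAt(std::queue<int>& q, std::size_t index) {
    const std::size_t n = q.size();
    int found = -1;  // sentinel returned when index is out of range
    for (std::size_t i = 0; i < n; ++i) {
        int value = q.front();
        q.pop();
        if (i == index) found = value;
        q.push(value);  // back onto the tail to preserve the contents
    }
    return found;
}

// Example: look at the third element, then confirm the head is untouched.
int demoPeek() {
    std::queue<int> q;
    for (int v : {10, 20, 30, 40}) q.push(v);
    int third = peekAt(q, 2);   // the element at index 2
    return third + q.front();   // head is unchanged after the full rotation
}
```

On a list, the same lookup would simply walk to the element and read it, with no modification of the structure at all.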

For my Game of Life implementation with lists, during the computation of each generation, items are added to the list once and then are each processed before the whole contents of the list are cleared. Remember back to the original serial implementation, where each of the TraverseList() calls in the main loop was immediately followed by a call to ClearList(). With a concurrent_queue, try_pop() returns the item at the head of the queue and removes that item. It's like I get the ClearList() operation for free.

There is one point in the code where I must be careful, though. This is during the initialization of the grid with the starting population of cells. In the original serial code, as each new cell is read in, it is added to the newlive list (queue). After that is done, the list is traversed with AddNeighbors() to get the initial neighbor count grid set up, the contents of newlive are copied into the maydie list and the newlive list is cleared. As a queue, the newlive list will be cleared when it is "traversed" with the do_AddNeighbors() helper function. Unfortunately, this leaves nothing to be copied into the maydie list. While I could use the iterator from the concurrent_queue container to traverse the list for setting up the neighbor counts, a better solution is to simply push() each new input grid cell into both newlive and maydie lists.
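The double-push idea can be sketched as follows. This is an illustration under assumptions, not the initialization code from the series: std::queue stands in for concurrent_queue, ListEntry is given a made-up layout, the drain loop stands in for do_AddNeighbors(), and seedCell() and seedAndDrain() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical cell type; the real ListEntry layout is not shown in the article.
struct ListEntry {
    int row;
    int col;
};

// Add one starting cell to both queues, so that draining newlive later does
// not lose the information maydie needs.
void seedCell(std::queue<ListEntry>& newlive, std::queue<ListEntry>& maydie,
              int row, int col) {
    ListEntry cell{row, col};
    newlive.push(cell);
    maydie.push(cell);
}

// Seed the queues, drain newlive (as the neighbor-count pass would), and
// return how many cells are still waiting on maydie.
std::size_t seedAndDrain(const std::vector<std::pair<int, int>>& cells) {
    std::queue<ListEntry> newlive;  // std::queue stands in for concurrent_queue
    std::queue<ListEntry> maydie;

    for (const auto& rc : cells) seedCell(newlive, maydie, rc.first, rc.second);

    // Stand-in for do_AddNeighbors(): every pop removes the cell from newlive.
    while (!newlive.empty()) {
        newlive.pop();
    }

    return maydie.size();
}
```

Seeding three cells leaves newlive empty after the drain while maydie still holds all three, which is the state the original serial code reached by copying and then clearing.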

Let me leave you with one final parallelization idea that I thought of as I was finishing this article. I can combine the do_Vivify() and do_Kill() helper functions and also combine the do_AddNeighbors() and do_SubtractNeighbors() helper functions. The two loops in each of the combined helper functions can both be parallelized with OpenMP, with the nowait clause added to the first loop so that threads finishing it early can move straight on to the second. In both cases, the two loops each pop items from separate lists. The Vivify() and Kill() functions push cells to different lists, too, so there is no dependence between the two loops. Even though AddNeighbors() and SubtractNeighbors() can push things onto both maylive and maydie lists, because I am using a thread-safe container, there is no race condition between threads doing different iterations of either function or between the code from either helper function.

I've not actually implemented or tested this idea, but here is the code I would write for this latter combination of helper functions:

void do_AddAndSubtractNeighbors()
{
    int num_newlive = newlive.unsafe_size();
    int num_newdie = newdie.unsafe_size();

#pragma omp parallel
    {
#pragma omp for nowait
        for (int i = 0; i < num_newlive; ++i) {
            ListEntry cell;
            newlive.try_pop(cell);
            AddNeighbors(cell);
        }

#pragma omp for
        for (int j = 0; j < num_newdie; ++j) {
            ListEntry cell;
            newdie.try_pop(cell);
            SubtractNeighbors(cell);
        }
    } // end omp parallel
}