The Pit and the Parallelism: Memory Models



When I analyze parallel algorithms and how independent threads interact with each other across shared resources, I use the assumption of sequential consistency. That is, I can simulate parallel execution by interleaving the instruction streams of two (or more) threads at programming statement level and assume that the last value written to a shared variable is the one read by another thread that accesses that shared resource. I presume that whatever interleaving I choose selects the next statement from one of the threads and executes it before picking the next statement from either the same thread or the other thread.

In the strictest sense, this can lead to an incorrect result. For example, if the threads are incrementing a shared counter, say x++ , there will be no data race since I've assumed that the read by the second thread to execute the increment will be the value that was written by the first thread. At this point, I realize that even such a simple statement will actually be compiled into three instructions: load x into a register, add one to the register, store the register value back into x . Now, with the interleaving of two threads that were both incrementing the same variable, I can see that there are interleavings at the assembly language level that will yield incorrect results. Being the lazy programmer that I am, this all seems a bit too labor intensive to require me to play compiler for all possible statements in my code just to check if there is some interleaving of thread execution that can lead to a data race.

Another tactic is to not really interleave the execution of two threads as if there was only a single core. (I mean, we've got two or four cores in most every modern device and I know that threads are going to be running on those independent cores, so why not analyze my application under the same assumption?) I can "interleave" the execution of my threads by imagining the execution traces being written side-by-side. Then, if I can match up two statements between the threads that adversely access the same shared resource, like incrementing the same counter, I know there is a potential data race. I can use blank lines in any stream to simulate different processing speeds or other threads using the same core.

The obvious solution to the race condition hinted at above is to put some kind of mutual exclusion synchronization around the increment or to force the operation to be atomic. In every threading library that I've seen, there are several ways to do one or both of these. But that isn't the point of this post.

The use of a sequentially consistent model for reasoning about shared variables is the simplest approach and works great in the abstract. Unfortunately, we have to code in the real world on real hardware with real programming languages. The true villain of this cautionary tale is soon to be revealed as memory models, if you hadn't already guessed that from the title.

Let me show a slightly more complex example to demonstrate one of the "hidden" perils of threaded programs. In one thread, threadZero , I want to update a shared variable; in another thread, threadOne , I want to read that newly updated value. There will need to be some kind of ordering synchronization to ensure the proper execution order. The code fragment below shows one way to do this with a shared flag variable controlling the order of execution.

// Shared declarations
int Done = FALSE;
int N;

void threadZero(void *pArg)
{
    . . .
    N = SomeComputedValue;
    Done = TRUE;
    . . .
}

void threadOne(void *pArg)
{
    . . .
    while (!Done) {} // spin-wait
    SomeLocalValue = N;
    . . .
}

In a sequentially consistent analysis, there is no problem with this code. I can align the two sequences in any order and there is no data race between them. In the arrangement that might be closest to a problem, threadOne will pause on the spin-wait of the while loop until after threadZero has updated the value of N and then set Done to TRUE . But can I count on this happening every time I run the code on some real system? (It's an unfair question since you can see more text below this line, and if the answer were "Yes," none of that would be there.)

Modern architectures use store buffers, visible only to the processor executing the store instruction, to hold writes to memory before those results are actually written to system memory, which is when the updates become visible to other threads. Unfortunately, there are processors in use today that allow those writes to reach system memory in a different order than program statement order prescribes. Thus, even though threadZero expects the update to N to complete before Done is set (and this is what a sequentially consistent execution would guarantee), the update to Done may well be visible to other threads before the new value is stored in N . In that case, threadOne may exit the spin-wait loop and retrieve a stale value of N .

This is not a standard feature of processors, but something that is allowed in the memory model defined for the processor. Even though it is likely a rare occurrence, the right alignment of the stars and planets can lead to incorrect results and will most likely not be caught during routine testing of your application. In the past, processor memory models have been ill-defined, incomplete, and ambiguous enough to stump even experts. It's better today, but there are still outliers that can reach up to bite you in the posterior when least expected.

Another illustrative situation is when I have two threads updating two small, separate — yet adjacent — fields within an object or array. Think of something like byte-sized data fields of an array of char . Clearly there is no data race here because the threads are targeting two different memory locations, right? (There might be some false sharing, but that's a different matter for a different post.) Do I really need to answer that? Actually, I do, because it's an interesting situation.

Consider the scenario where threadOne loads the word containing the byte it is manipulating and updates the register with the appropriate value. If, before threadOne can write out its result, threadZero swoops in to read the same word, update its own byte, and write its result back to memory, what is left after threadOne finally writes back? In a perfect world, we'd have no problem, since different parts of the word were changed and each write only needs to reflect its own modification. Regrettably, the write by threadOne will overwrite the result from threadZero, because memory is typically updated at word (or larger) granularity rather than byte by byte.

A similar case is two threads updating a 64-bit value whose two 32-bit halves are written separately, with those smaller writes interleaving between the threads. (Yes, it's a data race, but rather than getting either one value or the other stored, you end up with some kind of mash-up of the two.) While the architecture and processor memory model share some of the blame for the overwriting problem described above, the memory model of the programming language being used can also be complicit.

Programming languages have their own memory models built in. In the early days, such models were ill-defined, inconsistent, and downright baffling. Recent efforts have clarified the memory models of modern programming languages (e.g., Java, C++11, C11), giving programmers a solid understanding of the interactions between thread execution and memory updates. For example, the split update of a single word described above is disallowed in the new C++11 and C11 standards and has long been banned by the Java standard.

The lesson to be learned from all of this is that you need to understand something about memory models, in both hardware and the programming languages you use. I'm no expert on memory models (and can't even play one convincingly on TV), but I know that they may cause problems if I make assumptions that are not supported. I'll continue to use my sequential consistency approach when thinking about parallel algorithms and showing that they are correct since the new standards for Java, C++11, and C11 promise sequential consistency in programs that do not contain data races. (There are some exceptions to that "promise," but these are the "corner cases" that I'm less likely to encounter.)

The second, and perhaps more important, lesson to take away from all this is to protect every possible data race in your code, even if you can prove that the race is benign. Even when you "know" a race is harmless, the programming language or hardware memory model may reach out and slap you at the most inopportune time. Synchronization, atomicity, and functions that enforce mutually exclusive access, when recognized by the compiler, will set up memory fences to better ensure that memory updates assumed by the programmer are made visible to other threads before those updated values are needed. So, at the very least, be aware. You can't count on a last-second reprieve by General Lasalle from a swinging, descending scimitar; shifting, fiery walls; a yawning gulf; or the perils of misunderstood memory models.