My previous blog post, Understanding the C/C++ memory model, got more (and still receives more) attention than I ever thought it would, so I feel obliged to add a follow up to expand on the topic a little more.

First, a few things that might not have been entirely clear or were missing from the previous post:

The post was focused on acquire and release memory ordering constraints for atomic operations (and memory barriers), while the language model provides various other memory ordering constraint options. While some of these should be reasonably obvious (memory_order_relaxed, which essentially implies no constraints, and memory_order_acq_rel, which combines both acquire and release semantics in a single operation), some others warrant explanation. In case it was not clear, the default memory ordering constraint for the atomic operations provided by the standard library is actually memory_order_seq_cst. Importantly, if this default is used exclusively, some of the issues applicable to the use of acquire/release orderings are no longer of concern. However, this ordering potentially imposes a performance penalty if used when it is not actually required. In some cases acquire/release is not enough and we do need to make use of a stronger ordering constraint.

However, with all that in mind, I mostly want to focus in this post on presenting a more detailed model for understanding exactly what acquire/release entail. In particular, I aim to present a comprehensible model of processor hardware – even if the model is not accurate to any particular hardware – that allows for understanding the C/C++ memory model without having to rely solely on language-defined semantics. I’m not claiming that this model is complete nor even entirely correct, but I believe it may help to understand acquire/release semantics.

A hardware model to explain acquire/release

So, bearing in mind that this is not what real processor architecture necessarily looks like, let’s suppose that each processor (or processor core) is connected directly to a pair of caches (or buffers if you prefer) which are then connected to the main memory of the system:

Note that I’ve opted to call them “buffers” in the diagram, since I’ve learned that describing them as caches might lead to pedants claiming that the model is wrong, although it’s worth bearing in mind that (a) many people will best understand the behaviour of the buffers as being that of a cache and that (b) this model is not supposed to accurately represent any particular hardware.

I hope that the function of the store buffer and read buffer in the diagram is obvious: writes performed by the processor core go through the store buffer before reaching main memory, and reads need to go through the read buffer if the value to be read isn’t otherwise already available in either buffer. Both buffers can hold multiple values, each value corresponding to a different address. At arbitrary times and in unspecified order, values from the read buffer can be discarded, and values from the store buffer can be flushed to main memory.

Note that the buffers allow for different threads to observe reads and writes to be observed in different orders by different threads (cores), but that a thread should always see its own reads and writes in program order – in particular, if a thread writes to an address and then reads from the same address, the read operation should see the same value as was written, unless that address may have been written to by another thread in the meantime.

What may be a little confusing is that the link between the processor core and each buffer is depicted as being two-way. In fact, this is necessary for maintaining consistency between the buffers: if a single core is to store a value to a particular address, and then read that particular address, then it shouldn’t be possible that the read loads a value from main memory before the prior store has actually reached main memory. So in fact, a store operation from the processor both updates an existing value or stores a new value in the store buffer, and updates any existing value (for the corresponding address) in the read buffer.

With all this considered, how do acquire and release operations tie in? Quite simply: an acquire operation forcefully flushes the read buffer (discarding its contents), and a release operation forcefully flushes the store buffer (writing its contents to main memory). Importantly, this means that a value written must be followed by a release in the same thread, and then an acquire – in another thread – before it is read by the 2nd thread. This is what the language model itself requires.

What about atomic operations themselves? Unfortunately they are a bit hard to explain directly in terms of this simple model. We could say that atomic operations simply bypass the buffers and operate directly on main memory; the main issue with this is that it doesn’t allow for re-ordering of atomic operations on different addresses, which the language model does allow (although the orderings must still obey the associated constraints).

What happens if two processor cores have a store in their store buffer to the same address? Well, that’s most likely a race condition; similarly if a store in one processor buffer matches a read from another.

Sequential Consistency

The default ordering constraint for atomic operations provided by the C/C++ standard library is memory_order_seq_cst, where “seq_cst” is a shortening of “sequentially consistent”. Put simply, an atomic operation with sequential consistency implies both acquire semantics (if the operation is a load) and release semantics (if the operation is a store), as well as preventing other atomic operations (of any kind) from being re-ordered with respect to the sequentially consistent operation.

Why is this ever necessary, though? Well, consider the example in the previous post of a mutual exclusion device (mutex) implemented via a spinlock, using a compare-and-exchange operation to acquire the mutex and a store to release it. Previously I suggested that acquiring the mutex could use (naturally enough) acquire semantics and releasing it could use, similarly, release semantics. In the normal case of a single mutex protecting some data structure, this is generally correct.

Sometimes, however, more complex applications can require more than a single mutex to be locked at the same time. With careless design, this can lead to deadlocks: if there are two mutexes, A and B, and one thread locks them in [A,B] order while another does so in [B,A] order, you potentially get the situation where each thread holds one mutex and requires the other (but can never obtain it). A typical approach to avoiding such problems is to impose a consistent ordering of mutex acquisition: the program could be designed, for instance, to always acquire the mutexes in the order [A,B]. Of course it is still possible to acquire either mutex alone without causing deadlock, so long as the B mutex is released before acquiring the A mutex: the important thing is that the mutexes are never acquired in the order [B,A].

In our particular example, even though atomic operations may be re-ordered, the acquisition of the B mutex will not be re-ordered with respect to the acquisition of the A mutex, since no store or load (including via an atomic operation) can move prior to an acquire operation. However, what can happen is that the release of one mutex can be re-ordered with respect to the acquisition of another (by moving it from before to behind), and this is problematic: Suppose that we have the sequence, in one thread:

acquire B release B acquire A release A

This should be ok – it doesn’t acquire the mutex A while B is held – but as has just been suggested, this sequence could be re-ordered (by the compiler or processor) as follows:

acquire B acquire A release B release A

If another thread is obtaining both mutexes in the proscribed order, A and then B, this will lead to deadlock! The solution is to use sequentially consistent ordering, which prevents any atomic operation being re-ordered with respect to the sequentially consistent operation. Note that it’s not necessary to make every mutex operation sequentially consistent; I’ll leave it as an exercise to decide on the best solution.

Edit 3 April: Preshing believes that the problematic re-ordering detailed above isn’t allowed by the C++ language standard (he doesn’t mention C and I’ve not checked whether it has similar wording). I’m not entirely convinced myself either that the paragraph he quotes necessarily means that this is the case, nor that it is intended to do so, though I’ve not given it full consideration yet. Another example of where sequential consistency makes a difference is given in an answer to this stackoverflow question.

Conclusion

The hardware model I’ve presented is hopefully simple enough to understand while providing comprehensible rationale for the particulars of the release/acquire semantics provided by the language model. It may however not be correct in every aspect and is not complete; use it only for what it is. I’ve also discussed sequential consistency and why it is sometimes necessary to avoid deadlocks.

I haven’t discussed the “consume” memory ordering, and may leave that for another time, but I encourage anyone who’s interested to seek out explanations elsewhere. I hope that this post and the one preceding it are enough to give a good grounding in the issues addressed by the memory model.