N1525: Memory-Order Rationale

ISO/IEC JTC1 SC22 WG14 N1525 – 2010-10-11

Paul E. McKenney, paulmck@linux.vnet.ibm.com

Blaine Garst, blaine@apple.com

Introduction

The C-language Committee Draft (n1494) contains no fewer than six memory-order options:

memory_order_relaxed memory_order_consume memory_order_acquire memory_order_release memory_order_acq_rel memory_order_seq_cst

The purpose of this document is to demonstrate the purpose of each of these options by way of “toy” example code fragments, and also show how some varieties of the memory fences may be used. This paper is based on n2153.

memory_order_relaxed

With memory_order_relaxed , the compiler is required to load and store atomically, so that any load sees the results of exactly one store (or the initial value), but there are no ordering constraints. This example uses memory_order_relaxed in conjunction with an explicit fence to process in-shared-memory mailboxes more efficiently:

const int num_mailboxes = 32; _Atomic volatile int mailbox[num_mailboxes]; void MailboxInspection(int my_id, void (*do_work)(int)) { for (int i = 0; i < num_mailboxes; i++) { if (atomic_load_explicit(&mailbox[i], memory_order_relaxed) == my_id) { atomic_thread_fence(memory_order_acquire); // prevent speculative reads in do_work [7.17.4.1] do_work(i); } } }

Here we scan an array of mailboxes, and process only those whose ID indicates that they are intended for us. Although it would be possible to combine the atomic_thread_fence() into the atomic_load_explicit() by using memory_order_acquire() , but doing so would execute needless memory fences in cases where the mailbox is for someone else. In contrast, the above code only executes a memory fence when a message needs to be processed, improving performance on weakly ordered systems (including x86 in the case where certain SSE instructions are used).

Note that sequential consistency would require a heavy-weight fence on each iteration, whether or not the corresponding mailbox was processed.

memory_order_consume

With memory_order_consume , the compiler and CPU are required to order the load in question against only those subsequent loads and stores whose address or value are computed from the value loaded. All mainstream CPUs (other than DEC Alpha) provide the required ordering without the need for any explicit memory-fence instructions, which means that code using memory_order_consume can attain extremely high levels of performance and scalability.

These constraints allow memory_order_consume to be used to implement RCU, which in turn enables high performance and highly scalable access to read-mostly data. RCU is implemented in the Linux kernel and also as a user-space library. In addition, a garbage collector may be used to emulate RCU (in which case rcu_read_lock() and rcu_read_unlock() can be no-ops). The locking primitives can be implemented using any convenient method, including pthread_mutex_t . The following example shows how RCU can be used to support lockless read-side accesses to a hash table:

enum { NUM_BUCKETS = 128; }; struct foo { _Atomic volatile struct foo *next; int key; }; _Atomic volatile struct foo *hashtable[NUM_BUCKETS]; DEFINE_LOCK(foo_lock); int find_foo(int key) { struct foo *p; int retval; rcu_read_lock(); // Linux kernel: p = rcu_dereference(hashtable[foo_hash(key)]); p = atomic_load_explicit(&hashtable[foo_hash(key)], memory_order_consume); while (p != NULL && p-<key < key) { // Linux kernel: p = rcu_dereference(p-<next); p = atomic_load_explicit(&p-<next, memory_order_consume); } retval = p != NULL && p-<key == key; rcu_read_unlock(); return retval; } int insert_foo(int key) { struct foo *newp, *p; _Atomic volatile struct foo **plast; int retval = 0; spin_lock(&foo_lock); // assume acquire-release at minimum plast = &hashtable[foo_hash(key)]; p = *plast; while (p != NULL && p-<key < key) { p = p-<next; plast = &p-<next; } if (p == NULL || p-<key != key) { newp = malloc(sizeof(*newp)); if (newp != NULL) { newp-<key = key; newp-<next = p; // Linux kernel: rcu_assign_pointer(*plast, newp); atomic_store_explicit(*plast, newp, memory_order_release); retval = 1; } } lock_release(&foo_lock); return retval; }

The key point is that the atomic_load_explicit() using memory_order_consume guarantees that subsequent accesses will see any initialization carried out by insert_foo() , even if they are executing concurrently, and without the overhead of explicit memory-fence instructions. In constrast, memory_order_acquire would require explicit memory barriers on weakly ordered systems and would overconstrain compiler optimizations on all systems.

memory_order_acquire

With memory_order_acquire , you can construct a lock-acquisition primitive. A non-production-quality “toy” implementation follows:

typedef atomic_atomic_int lock_t; enum { LOCK_UNLOCKED, LOCK_LOCKED }; void acquire_lock(lock_t &lp) { while (atomic_exchange_explicit(lp, 1, memory_order_acquire) == LOCK_LOCKED) continue }

In principle, a sequentially consistent access could be used, but this would require a full memory fence on (relatively) strongly ordered systems such as x86 or the IBM mainframe. In contrast, use of memory_order_acquire permits such systems to omit explicit memory fences.

memory_order_release

With memory_order_release , you can construct a lock-release primitive, which is a non-production-quality “toy” counterpart to the acquire_lock() function shown above.

void release_lock(lock_t &lp) { atomic_store_explicit(lp, LOCK_UNLOCKED, memory_order_release); }

This approach has the same advantages over sequential consistency noted above for acquire_lock() .

Note further that an explicit memory fence can permit several unlock operations to be protected by a single memory-fence instruction:

void MultipleLockRelease() { extern void do_work_locking_A_and_B(void); do_work_locking_A_and_B(); atomic_thread_fence(memory_order_release); // ensure memory stores are complete atomic_store_explicit(&lock_A, LOCK_UNLOCKED, memory_order_relaxed); atomic_store_explicit(&lock_B, LOCK_UNLOCKED, memory_order_relaxed); }

memory_order_acq_rel

The memory_order_acq_rel permits a single atomic operation and fence to serve for a combined unlock/lock operation. Consider a set of locks implemented as bits in an integral type. A single release_acquire_lock() primitive can release any subset of the locks and acquire any non-held subset. Of course, a failure return is required in order to avoid deadlock. In this (naive) implementation, failure results in the locking state being unchanged.

typedef atomic_int multilock_t; int release_acquire_lock(multilock_t *mlp, multilock_t relmask, multilock_t acqmask) { multilock_t old, new; old = atomic_load_explicit(mlp, memory_order_relaxed); do { new = old & ~relmask; if ((~new & acqmask) != acqmask) return 0; // Failure new = new | acqmask; } while (!atomic_compare_exchange_weak_explicit(&mlp, &old, new, memory_order_acq_rel, memory_order_relaxed)); return 1; // Success }

Note that strongly ordered systems such as x86, SPARC, and the IBM mainframe do not need explicit memory-fence instructions in this case. In contrast, fence instructions would be required for sequential consistency.

memory_order_seq_cst

The quintessential litmus test for sequential consistency is independent reads of independent writes, also known as IRIW. This is shown below, where T1 , T2 , T3 , and T4 run as separate threads.

atomic_int x = ATOMIC_VAR_INIT(0); atomic_int y = ATOMIC_VAR_INIT(0); int r1, r2, r3, r4; void T1(void) { atomic_store(&x, 1); } void T2(void) { atomic_store(&x, 1); } void T3(void) { r1 = atomic_load(&x); r2 = atomic_load(&y); } void T4(void) { r3 = atomic_load(&y); r4 = atomic_load(&x); }

Sequential consistency precludes the outcome r1==1&&r2==0&&r3==1&&r4==0 , which would be permitted by the other memory orders.

Conclusions

These examples show that the five weaker memory-order specifiers have an essential role to play in expert-level performance/scalability-critical low-level code. Usage patterns analogous to each of these five weaker memory-order specifiers really do appear in large shared-memory parallel software artifacts, as exemplified by the Linux kernel. Therefore, each of the five has an important role to play in the upcoming C standard.