Toward generic atomic operations

This article brought to you by LWN subscribers Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

Modern Linux distributions support a number of different computer architectures. Each of these architectures has its own quirks and implementation differences that are largely abstracted by a clever collaboration between the kernel and system libraries, such as the GNU C Library (glibc). However, there are still some ways in which core architecture differences are exposed to higher-level software. One example of this is in the implementation of atomic memory operations.

Atomic operations are necessary to ensure programming correctness in those situations where there are multiple threads of simultaneous execution. (Atomic operations are even necessary on uniprocessor systems, where interrupts and asynchronous scheduling of other threads provide the illusion of multithreading.) Atomicity means that a given operation (such as incrementing a counter variable) takes place in an indivisible fashion; its result is either visible to all CPUs in the system instantaneously, or does not take place at all (it is similar in concept but less fashionable than transactional memory).

Atomic operations are typically fairly small, hand-optimized assembly functions that provide for atomic increment and decrement of counters, acquisition and release of locks, and so on. Since these operations differ from one architecture to another, typically few developers on any given project understand the different implementations in their entirety, and even fewer care to vouch for the code being correct across all supported architectures. Although there are generic implementations available in libraries such as pthreads, not all projects can make use of them, for a variety of reasons, including a desire to be portable to non-Linux platforms; thus a number of projects within the average Linux distribution still contain their own custom implementations of atomic operations.

Atomic operations are particularly useful on modern systems with many CPUs running multi-threaded applications, but even a system with a single CPU (core) has a need for them. After all, the Linux kernel may interrupt a running task thread "A" to service an interrupt routine, and may then schedule a different task thread "B" before returning to the one that was originally interrupted. Without a means to ensure certain operations have taken place atomically, there would be no way to cope with potential interference between task thread B and task thread A (e.g., if both threads race for the same lock or operate on the same variable).

How do atomic operations work?

Atomic operations fundamentally require underlying hardware support. There are, broadly speaking, two popular mechanisms used by CPUs in implementing support for atomic operations in modern computer architectures. The older, more traditional approach involves directly manipulating memory locations, for example, a compare-and-swap (or compare-and-exchange) instruction such as CMPXCHG on Intel's x86 and Itanium architectures. These instructions compare the value of a given memory location with a value supplied as part of the instruction. If the two values are identical, then yet another supplied value is written back to the memory location, while the overall result is signaled in the form of a returned value (almost universally in a register). This whole sequence takes place as a single processor instruction, for example by locking the processor local bus, and disabling external interrupts for its duration.

A more modern alternative to directly acting upon memory locations is to implement a reservation engine within the processor. A reservation engine (as used by modern RISC architectures such as ARM, POWER, etc.) is typically implemented under the control of two special processor instructions. The first instruction, often called load-with-reservation, load-exclusive, or load-link, atomically loads the value of a given memory location into a register and marks that memory location as reserved.

The loaded value can then be manipulated arbitrarily before a second instruction, often called store-exclusive, or store-conditional, atomically stores an updated value from a register back to a given memory location, provided that no modification has been made to that memory location in the interim. The store-exclusive operation returns a value indicating whether the store operation completed successfully or not, which is important because there is an opportunity for external interference between the load, modification, and subsequent store. This means that higher-level atomic operations built using these instructions typically involve a loop, retrying the entire operation until the store is successful.

Reservation engines are slightly more complex to work with in software (requiring two instructions and a comparison in a loop block), but they come with multiple benefits. Although compare-and-exchange appears simpler because it is implemented in a single instruction, it in fact causes poor performance in the CPU pipeline because multiple additional sub-stages are required for the implied memory operations. By contrast, the reservation engine approach explicitly separates memory reads and stores into multiple operations. A reservation engine can be implemented separately from the bulk of the core CPU logic, and can be as complex as desired (including necessary logic to synchronize with other reservation engines).

Some reservation engine implementations handle only a single memory location at a time on a given processor, while others are more complex. In every case, outstanding reservations are invalidated upon a context switch between running tasks (often as a result of a specific invalidation in the context-switch code). The reservation approach can also handle the "ABA problem"—that is, it can detect any changes to the target memory location after the atomic load, even if the original value is written back prior to the store, because the reservation engine is aware of all memory modifications.

The story doesn't quite end there. Some architectures lack full support for certain atomic operations that are required by higher-level software, such as atomic 64-bit (multiple word) load and store operations. In this case, there are workarounds (e.g. the "kuser" helper, a VDSO-like helper on older ARM processors), but that is a topic best saved for another article.

How are atomic operations used?

Atomic-operations libraries typically provide a set of functions that include incrementing and decrementing a memory location, compare-and-swap of a memory location, and higher-level operations built using these functions, such as lock acquisition and release. All of these various operations are built using the fundamental architecture-specific processor instructions of the kind described above. As an example, the OpenMPI message-passing library includes the following inline assembly code to implement an atomic 32-bit addition operation on version 7 of the ARM Architecture:

START_FUNC(opal_atomic_add_32) LSYM(13) ldrex r2, [r0] @ exlusively load address at r0 into r2 add r2, r2, r1 @ increment the value of r2 with value in r1 strex r3, r2, [r0] @ attempt to store the value of r2 at the address in r0 cmp r3, #0 @ r3 contains result from store exclusive, test if successful bne REFLSYM(13) @ repeat entire operation if it was interrupted mov r0, r2 @ return value that was written bx lr END_FUNC(opal_atomic_add_32)

This atomic increment function works by using the special ldrex and strex instructions, which control the CPU's reservation engine, to gain exclusive access to a desired memory location. The example code first loads the contents of a given memory location into a general-purpose register, adds a value to the register, and then tests the result of attempting to exclusively store this change back to memory. If it is successful, the function returns. If it is not successful, the operation is repeated until it completes without interference.

OpenMPI includes a custom atomic-operations library that implements support for 13 different base architectures. Some of those architectures have multiple ways to achieve the same thing, depending on which version is in use. For example, ARM processors have moved away from the deprecated SWP (compare-and-swap) instruction in favor of a reservation-engine-based approach. Both approaches need to be supported if code is to run on newer and older ARM processors. It is unfortunate that projects such as OpenMPI have needed to implement their own atomic-operations libraries, which must be periodically updated for new processors and are hard to maintain because they require special knowledge of multiple underlying architectures.

The C11 memory model

The main culprit for this state of affairs is the venerable C programming language. Traditionally, C had no explicit internal notion of multi-threaded applications, and only a very weakly ordered memory model. That is, it was hard to guarantee that the compiler would not reorder memory operations on shared variables because the language lacked the built-in constructs that are necessary to inform the compiler of such hidden data dependencies. Over the years, independent platform-specific libraries have provided support for general threading abstractions, including atomic operations performed on their own defined types. This is all well and good, but not all projects can rely upon such platform libraries for atomic operations, especially those that want to remain highly portable to non-Linux systems.

This is where C11 comes in. C11 introduces a new memory model explicitly designed with support for threading and atomic operations. It introduces the new standard header stdatomic.h , atomic integer types such as atomic_int (constructed using the _Atomic type qualifier), and a new memory_order enumerated type that defines various levels of memory ordering from the weakest memory_order_relaxed (no specific ordering requirement) through to memory_order_seq (sequentially consistent), the strongest ordering. Using the C11 memory model, the previous inline assembly can be reduced to defining an _Atomic typed variable and using one of the atomic fetch-and-modify generic functions, such as atomic_fetch_add() .

Here is an example of using the C11 defined atomics:

#include <stdatomic.h> _Atomic int magic_number = ATOMIC_VAR_INIT(42); // can also use _Atomic(int) atomic_fetch_add(&magic_number, 5); // make Star Trek fans happy

This defines and initializes a new variable called magic_number to the value 42 (using ATOMIC_VAR_INIT() ) before correcting that value for the true answer to the ultimate question of life, the universe, and everything, which, as everyone knows, Star Trek correctly defined to be 47. Using the new C11 extensions, projects such as OpenMPI do not need to implement their own atomic-operations library, because the underlying language now provides the necessary support, already optimized for each target architecture.

There is, however, at least one little problem with rushing to embrace C11. As of this writing, GCC and glibc do not yet have full support for the new atomic types. This is slated to be added in the GCC 4.8 time frame. (The glibc maintainers are aware of the topic, and plan to incorporate support once it is available in GCC.) Meanwhile, GCC 4.7 gained support for a new set of built-in functions to provide memory-model-aware atomic operations that were designed specifically to meet the requirements of the C11 memory model. The idea is that the higher-level C11 atomic primitives can be easily built using these built-ins in time for GCC 4.8. In the meantime, there are several alternative options. One of these is to use a third-party macro-based implementation of the C11 types (which already use the GCC built-ins), of which several exist.

Another option prior to broader C11 support being available is to use the new GCC built-ins directly. For common operations, such as atomic increment, the GCC 4.7 built-in atomic functions look very similar to those that form part of the broader C11 standard:

__atomic_store_n(&v, 42, __ATOMIC_SEQ_CST); __atomic_add_fetch(&v, 5, __ATOMIC_SEQ_CST);

The OpenMPI atomic-add example code could thus be replaced with a single call to __atomic_add_fetch() , which will atomically fetch a value from a memory location, add a supplied value, and return the result, doing the right thing for every supported architecture. Compiling the example and disassembling it will (unsurprisingly) produce a sequence of operations that appears very similar to the inline assembly it replaces. Of course, one does need to be careful in using the GCC built-ins directly because they do not require the use of variables with an _Atomic type qualifier. This means that it is possible to mix the use of atomic functions with manipulations of regular variables without triggering any compiler warnings. Still, this is no different than existing code failing to use a call to a special inline assembly function, which is also incorrect.

In time, it is the hope of this author that most projects implementing custom inline assembly for atomic operations can move to a standard C11 based implementation using stdatomic.h . That would be both portable to many different platforms, and easier to maintain by distributions and upstream projects themselves, because a specialist knowledge of the architecture specifics can be abstracted by the compiler. (Note, however, that projects may need to continue to support legacy approaches to atomic operations, if they want to continue supporting old compilers.) You can read more about the new C11 atomic operations (including the details of the memory model and its orderings not covered in depth here) in section 7.17.7 of the final draft version of the C11 specification [PDF].

