September 02, 2014

nullprogram.com/blog/2014/09/02/

C11, the latest C standard revision, hasn’t received anywhere near the same amount of fanfare as C++11. I’m not sure why this is. Some of the updates to each language are very similar, such as formal support for threading and atomic object access. Three years have passed and some parts of C11 still haven’t been implemented by any compilers or standard libraries yet. Since there’s not yet a lot of discussion online about C11, I’m basing much of this article on my own understanding of the C11 draft. I may be under-using the _Atomic type specifier and not paying enough attention to memory ordering constraints.

Still, this is a good opportunity to break new ground with a demonstration of C11. I’m going to use the new stdatomic.h portion of C11 to build a lock-free data structure. To compile this code you’ll need a C compiler and C library with support for both C11 and the optional stdatomic.h features. As of this writing, as far as I know only GCC 4.9, released April 2014, supports this. It’s in Debian unstable but not in Wheezy.

If you want to take a look before going further, here’s the source. The test code in the repository uses plain old pthreads because C11 threads haven’t been implemented by anyone yet.

I was originally going to write this article a couple weeks ago, but I was having trouble getting it right. Lock-free data structures are trickier and nastier than I expected, more so than traditional mutex locks. Getting it right requires very specific help from the hardware, too, so it won’t run just anywhere. I’ll discuss all this below. So sorry for the long article. It’s just a lot more complex a topic than I had anticipated!

Lock-free

A lock-free data structure doesn’t require the use of mutex locks. More generally, it’s a data structure that can be accessed from multiple threads without blocking. This is accomplished through the use of atomic operations — transformations that cannot be interrupted. Lock-free data structures will generally provide better throughput than mutex locks. And it’s usually safer, because there’s no risk of getting stuck on a lock that will never be freed, such as a deadlock situation. On the other hand there’s additional risk of starvation (livelock), where a thread is unable to make progress.

As a demonstration, I’ll build up a lock-free stack, a sequence with last-in, first-out (LIFO) behavior. Internally it’s going to be implemented as a linked-list, so pushing and popping is O(1) time, just a matter of consing a new element on the head of the list. It also means there’s only one value to be updated when pushing and popping: the pointer to the head of the list.

Here’s what the API will look like. I’ll define lstack_t shortly. I’m making it an opaque type because its fields should never be accessed directly. The goal is to completely hide the atomic semantics from the users of the stack.

int lstack_init ( lstack_t * lstack , size_t max_size ); void lstack_free ( lstack_t * lstack ); size_t lstack_size ( lstack_t * lstack ); int lstack_push ( lstack_t * lstack , void * value ); void * lstack_pop ( lstack_t * lstack );

Users can push void pointers onto the stack, check the size of the stack, and pop void pointers back off the stack. Except for initialization and destruction, these operations are all safe to use from multiple threads. Two different threads will never receive the same item when popping. No elements will ever be lost if two threads attempt to push at the same time. Most importantly a thread will never block on a lock when accessing the stack.

Notice there’s a maximum size declared at initialization time. While lock-free allocation is possible [PDF], C makes no guarantees that malloc() is lock-free, so being truly lock-free means not calling malloc() . An important secondary benefit to pre-allocating the stack’s memory is that this implementation doesn’t require the use of hazard pointers, which would be far more complicated than the stack itself.

The declared maximum size should actually be the desired maximum size plus the number of threads accessing the stack. This is because a thread might remove a node from the stack and before the node can freed for reuse, another thread attempts a push. This other thread might not find any free nodes, causing it to give up without the stack actually being “full.”

The int return value of lstack_init() and lstack_push() is for error codes, returning 0 for success. The only way these can fail is by running out of memory. This is an issue regardless of being lock-free: systems can simply run out of memory. In the push case it means the stack is full.

Structures

Here’s the definition for a node in the stack. Neither field needs to be accessed atomically, so they’re not special in any way. In fact, the fields are never updated while on the stack and visible to multiple threads, so it’s effectively immutable (outside of reuse). Users never need to touch this structure.

struct lstack_node { void * value ; struct lstack_node * next ; };

Internally a lstack_t is composed of two stacks: the value stack ( head ) and the free node stack ( free ). These will be handled identically by the atomic functions, so it’s really a matter of convention which stack is which. All nodes are initially placed on the free stack and the value stack starts empty. Here’s what an internal stack looks like.

struct lstack_head { uintptr_t aba ; struct lstack_node * node ; };

There’s still no atomic declaration here because the struct is going to be handled as an entire unit. The aba field is critically important for correctness and I’ll go over it shortly. It’s declared as a uintptr_t because it needs to be the same size as a pointer. Now, this is not guaranteed by C11 — it’s only guaranteed to be large enough to hold any valid void * pointer, so it could be even larger — but this will be the case on any system that has the required hardware support for this lock-free stack. This struct is therefore the size of two pointers. If that’s not true for any reason, this code will not link. Users will never directly access or handle this struct either.

Finally, here’s the actual stack structure.

typedef struct { struct lstack_node * node_buffer ; _Atomic struct lstack_head head , free ; _Atomic size_t size ; } lstack_t ;

Notice the use of the new _Atomic qualifier. Atomic values may have different size, representation, and alignment requirements in order to satisfy atomic access. These values should never be accessed directly, even just for reading (use atomic_load() ).

The size field is for convenience to check the number of elements on the stack. It’s accessed separately from the stack nodes themselves, so it’s not safe to read size and use the information to make assumptions about future accesses (e.g. checking if the stack is empty before popping off an element). Since there’s no way to lock the lock-free stack, there’s otherwise no way to estimate the size of the stack during concurrent access without completely disassembling it via lstack_pop() .

There’s no reason to use volatile here. That’s a separate issue from atomic operations. The C11 stdatomic.h macros and functions will ensure atomic values are accessed appropriately.

Stack Functions

As stated before, all nodes are initially placed on the internal free stack. During initialization they’re allocated in one solid chunk, chained together, and pinned on the free pointer. The initial assignments to atomic values are done through ATOMIC_VAR_INIT , which deals with memory access ordering concerns. The aba counters don’t actually need to be initialized. Garbage, indeterminate values are just fine, but not initializing them would probably look like a mistake.

int lstack_init ( lstack_t * lstack , size_t max_size ) { struct lstack_head head_init = { 0 , NULL }; lstack -> head = ATOMIC_VAR_INIT ( head_init ); lstack -> size = ATOMIC_VAR_INIT ( 0 ); /* Pre-allocate all nodes. */ lstack -> node_buffer = malloc ( max_size * sizeof ( struct lstack_node )); if ( lstack -> node_buffer == NULL ) return ENOMEM ; for ( size_t i = 0 ; i < max_size - 1 ; i ++ ) lstack -> node_buffer [ i ]. next = lstack -> node_buffer + i + 1 ; lstack -> node_buffer [ max_size - 1 ]. next = NULL ; struct lstack_head free_init = { 0 , lstack -> node_buffer }; lstack -> free = ATOMIC_VAR_INIT ( free_init ); return 0 ; }

The free nodes will not necessarily be used in the same order that they’re placed on the free stack. Several threads may pop off nodes from the free stack and, as a separate operation, push them onto the value stack in a different order. Over time with multiple threads pushing and popping, the nodes are likely to get shuffled around quite a bit. This is why a linked listed is still necessary even though allocation is contiguous.

The reverse of lstack_init() is simple, and it’s assumed concurrent access has terminated. The stack is no longer valid, at least not until lstack_init() is used again. This one is declared inline and put in the header.

static inline void stack_free ( lstack_t * lstack ) { free ( lstack -> node_buffer ); }

To read an atomic value we need to use atomic_load() . Give it a pointer to an atomic value, it dereferences the pointer and returns the value. This is used in another inline function for reading the size of the stack.

static inline size_t lstack_size ( lstack_t * lstack ) { return atomic_load ( & lstack -> size ); }

Push and Pop

For operating on the two stacks there will be two internal, static functions, push and pop . These deal directly in nodes, accepting and returning them, so they’re not suitable to expose in the API (users aren’t meant to be aware of nodes). This is the most complex part of lock-free stacks. Here’s pop() .

static struct lstack_node * pop ( _Atomic struct lstack_head * head ) { struct lstack_head next , orig = atomic_load ( head ); do { if ( orig . node == NULL ) return NULL ; // empty stack next . aba = orig . aba + 1 ; next . node = orig . node -> next ; } while ( ! atomic_compare_exchange_weak ( head , & orig , next )); return orig . node ; }

It’s centered around the new C11 stdatomic.h function atomic_compare_exchange_weak() . This is an atomic operation more generally called compare-and-swap (CAS). On x86 there’s an instruction specifically for this, cmpxchg . Give it a pointer to the atomic value to be updated ( head ), a pointer to the value it’s expected to be ( orig ), and a desired new value ( next ). If the expected and actual values match, it’s updated to the new value. If not, it reports a failure and updates the expected value to the latest value. In the event of a failure we start all over again, which requires the while loop. This is an optimistic strategy.

The “weak” part means it will sometimes spuriously fail where the “strong” version would otherwise succeed. In exchange for more failures, calling the weak version is faster. Use the weak version when the body of your do ... while loop is fast and the strong version when it’s slow (when trying again is expensive), or if you don’t need a loop at all. You usually want to use weak.

The alternative to CAS is load-link/store-conditional. It’s a stronger primitive that doesn’t suffer from the ABA problem described next, but it’s also not available on x86-64. On other platforms, one or both of atomic_compare_exchange_*() will be implemented using LL/SC, but we still have to code for the worst case (CAS).

The ABA Problem

The aba field is here to solve the ABA problem by counting the number of changes that have been made to the stack. It will be updated atomically alongside the pointer. Reasoning about the ABA problem is where I got stuck last time writing this article.

Suppose aba didn’t exist and it was just a pointer being swapped. Say we have two threads, A and B.

Thread A copies the current head into orig , enters the loop body to update next.node to orig.node->next , then gets preempted before the CAS. The scheduler pauses the thread.

Thread B comes along performs a pop() changing the value pointed to by head . At this point A’s CAS will fail, which is fine. It would reconstruct a new updated value and try again. While A is still asleep, B puts the popped node back on the free node stack.

Some time passes with A still paused. The freed node gets re-used and pushed back on top of the stack, which is likely given that nodes are allocated FIFO. Now head has its original value again, but the head->node->next pointer is pointing somewhere completely new! This is very bad because A’s CAS will now succeed despite next.node having the wrong value.

A wakes up and it’s CAS succeeds. At least one stack value has been lost and at least one node struct was leaked (it will be on neither stack, nor currently being held by a thread). This is the ABA problem.

The core problem is that, unlike integral values, pointers have meaning beyond their intrinsic numeric value. The meaning of a particular pointer changes when the pointer is reused, making it suspect when used in CAS. The unfortunate effect is that, by itself, atomic pointer manipulation is nearly useless. They’ll work with append-only data structures, where pointers are never recycled, but that’s it.

The aba field solves the problem because it’s incremented every time the pointer is updated. Remember that this internal stack struct is two pointers wide? That’s 16 bytes on a 64-bit system. The entire 16 bytes is compared by CAS and they all have to match for it to succeed. Since B, or other threads, will increment aba at least twice (once to remove the node, and once to put it back in place), A will never mistake the recycled pointer for the old one. There’s a special double-width CAS instruction specifically for this purpose, cmpxchg16 . This is generally called DWCAS. It’s available on most x86-64 processors. On Linux you can check /proc/cpuinfo for support. It will be listed as cx16 .

If it’s not available at compile-time this program won’t link. The function that wraps cmpxchg16 won’t be there. You can tell GCC to assume it’s there with the -mcx16 flag. The same rule here applies to C++11’s new std::atomic.

There’s still a tiny, tiny possibility of the ABA problem still cropping up. On 32-bit systems A may get preempted for over 4 billion (2^32) stack operations, such that the ABA counter wraps around to the same value. There’s nothing we can do about this, but if you witness this in the wild you need to immediately stop what you’re doing and go buy a lottery ticket. Also avoid any lightning storms on the way to the store.

Hazard Pointers and Garbage Collection

Another problem in pop() is dereferencing orig.node to access its next field. By the time we get to it, the node pointed to by orig.node may have already been removed from the stack and freed. If the stack was using malloc() and free() for allocations, it may even have had free() called on it. If so, the dereference would be undefined behavior — a segmentation fault, or worse.

There are three ways to deal with this.

Garbage collection. If memory is automatically managed, the node will never be freed as long as we can access it, so this won’t be a problem. However, if we’re interacting with a garbage collector we’re not really lock-free. Hazard pointers. Each thread keeps track of what nodes it’s currently accessing and other threads aren’t allowed to free nodes on this list. This is messy and complicated. Never free nodes. This implementation recycles nodes, but they’re never truly freed until lstack_free() . It’s always safe to dereference a node pointer because there’s always a node behind it. It may point to a node that’s on the free list or one that was even recycled since we got the pointer, but the aba field deals with any of those issues.

Reference counting on the node won’t work here because we can’t get to the counter fast enough (atomically). It too would require dereferencing in order to increment. The reference counter could potentially be packed alongside the pointer and accessed by a DWCAS, but we’re already using those bytes for aba .

Push

Push is a lot like pop.

static void push ( _Atomic struct lstack_head * head , struct lstack_node * node ) { struct lstack_head next , orig = atomic_load ( head ); do { node -> next = orig . node ; next . aba = orig . aba + 1 ; next . node = node ; } while ( ! atomic_compare_exchange_weak ( head , & orig , next )); }

It’s counter-intuitive, but adding a few microseconds of sleep after CAS failures would probably increase throughput. Under high contention, threads wouldn’t take turns clobbering each other as fast as possible. It would be a bit like exponential backoff.

API Push and Pop

The API push and pop functions are built on these internal atomic functions.

int lstack_push ( lstack_t * lstack , void * value ) { struct lstack_node * node = pop ( & lstack -> free ); if ( node == NULL ) return ENOMEM ; node -> value = value ; push ( & lstack -> head , node ); atomic_fetch_add ( & lstack -> size , 1 ); return 0 ; }

Push removes a node from the free stack. If the free stack is empty it reports an out-of-memory error. It assigns the value and pushes it onto the value stack where it will be visible to other threads. Finally, the stack size is incremented atomically. This means there’s an instant where the stack size is listed as one shorter than it actually is. However, since there’s no way to access both the stack size and the stack itself at the same instant, this is fine. The stack size is really only an estimate.

Popping is the same thing in reverse.

void * lstack_pop ( lstack_t * lstack ) { struct lstack_node * node = pop ( & lstack -> head ); if ( node == NULL ) return NULL ; atomic_fetch_sub ( & lstack -> size , 1 ); void * value = node -> value ; push ( & lstack -> free , node ); return value ; }

Remove the top node, subtract the size estimate atomically, put the node on the free list, and return the pointer. It’s really simple with the primitive push and pop.

SHA1 Demo

The lstack repository linked at the top of the article includes a demo that searches for patterns in SHA-1 hashes (sort of like Bitcoin mining). It fires off one worker thread for each core and the results are all collected into the same lock-free stack. It’s not really exercising the library thoroughly because there are no contended pops, but I couldn’t think of a better example at the time.

The next thing to try would be implementing a C11, bounded, lock-free queue. It would also be more generally useful than a stack, particularly for common consumer-producer scenarios.