October, 2004: Lock-Free Data Structures

Andrei Alexandrescu is a graduate student in Computer Science at the University of Washington and author of Modern C++ Design. He can be contacted at [email protected].

After skipping an installment of "Generic<Programming>" (it's naive, I know, to think that grad school asks for anything less than 100 percent of one's time), there has been an embarrassment of riches as far as topic candidates for this article. One topic candidate was a discussion of constructors; in particular, forwarding constructors, handling exceptions, and two-stage object construction. One other topic candidateand another glimpse into the Yaslander technology [2]was creating containers (such as lists, vectors, or maps) of incomplete types, something that is possible with the help of an interesting set of tricks, but not guaranteed by the standard containers.

While both candidates are intriguing, they couldn't stand a chance against lock-free data structures, which are all the rage in the multithreaded programming community. At this year's Programming Language Design and Implementation conference (http://www.cs.umd.edu/~pugh/pldi04/), Michael Maged presented the world's first lock-free memory allocator [7], which surpasses at many tests other more complex, carefully designed lock-based allocators.

This is the most recent of many lock-free data structures and algorithms that have appeared in the recent past.

What Do You Mean, "lock-free?"

That's exactly what I would have asked only a while ago. As the bona-fide mainstream multithreaded programmer that I was, lock-based multithreaded algorithms were familiar to me. In classic lock-based programming, whenever you need to share some data, you need to serialize access to it. The operations that change data must appear as atomic, such that no other thread intervenes to spoil your data's invariant. Even a simple operation such as ++count_ (where count_ is an integral type) must be locked. Incrementing is really a three-step (read, modify, write) operation that isn't necessarily atomic.

In short, with lock-based multithreaded programming, you need to make sure that any operation on shared data that is susceptible to race conditions is made atomic by locking and unlocking a mutex. On the bright side, as long as the mutex is locked, you can perform just about any operation, in confidence that no other thread will trample on your shared state.

It is exactly this "arbitrary"-ness of what you can do while a mutex is locked that's also problematic. You could, for example, read the keyboard or perform some slow I/O operation, which means that you delay any other threads waiting for the same mutex. Worse, you could decide you want access to some other piece of shared data and attempt to lock its mutex. If another thread has already locked that last mutex and wants access to the first mutex that your threads already holds, both processes hang faster than you can say "deadlock."

Enter lock-free programming. In lock-free programming, you can't do just about anything atomically. There is only a precious small set of things that you can do atomically, a limitation that makes lock-free programming way harder. (In fact, there must be around half a dozen lock-free programming experts around the world, and yours truly is not among them. With luck, however, this article will provide you with the basic tools, references, and enthusiasm to help you become one.) The reward of such a scarce framework is that you can provide much better guarantees about thread progress and the interaction between threads. But what's that "small set of things" that you can do atomically in lock-free programming? In fact, what would be the minimal set of atomic primitives that would allow implementing any lock-free algorithmif there's such a set?

If you believe that's a fundamental enough question to award a prize to the answerer, so did others. In 2003, Maurice Herlihy was awarded the Edsger W. Dijkstra Prize in Distributed Computing for his seminal 1991 paper "Wait-Free Synchronization" (see http://www.podc.org/dijkstra/2003.html, which includes a link to the paper, too). In his tour-de-force paper, Herlihy proves which primitives are good and which are bad for building lock-free data structures. That brought some seemingly hot hardware architectures to instant obsolescence, while clarifying what synchronization primitives should be implemented in future hardware.

For example, Herlihy's paper gave impossiblity results, showing that atomic operations such as test-and-set, swap, fetch-and-add, or even atomic queues (!) are insufficient for properly synchronizing more than two threads. (That's surprising because queues with atomic push and pop operations would seem to provide quite a powerful abstraction.) On the bright side, Herlihy also gave universality results, proving that some simple constructs are enough for implementing any lock-free algorithm for any number of threads.

The simplest and most popular universal primitive, and the one that I use throughout, is the compare-and-swap (CAS) operation:

template <class T> bool CAS(T* addr, T expected, T value) { if (*addr == expected) { *addr = value; return true; } return false; }

CAS compares the content of a memory address with an expected value, and if the comparison succeeds, replaces the content with a new value. The entire procedure is atomic. Many modern processors implement CAS or equivalent primitives for different bit lengths (the reason for which we've made it a template, assuming an implementation uses metaprogramming to restrict possible Ts). As a rule of thumb, the more bits a CAS can compare-and-swap atomically, the easier it is to implement lock-free data structures with it. Most of today's 32-bit processors implement 64-bit CAS; for example, Intel's assembler calls it CMPXCHG8 (you gotta love those assembler mnemonics).

A Word of Caution

Usually a C++ article is accompanied by C++ code snippets and examples. Ideally, that code is Standard C++, and "Generic<Programming>" strives to live up to that ideal.

When writing about multithreaded code, however, giving Standard C++ code samples is simply impossible. Threads don't exist in Standard C++, and you can't code something that doesn't exist. Therefore, the code for this article is "pseudocode" and not meant as Standard C++ code for portable compilation. Take memory barriers, for example. Real code would need to be either assembly language translations of the algorithms described herein, or at least sprinkle C++ code with some so-called "memory barriers"processor-dependent magic that forces proper ordering of memory reads and writes. I don't want to spread the discussion too thin by explaining memory barriers in addition to lock-free data structures. If you are interested, refer to Butenhof's excellent book [3] or to a short introduction [6]. For purposes here, I assume that the compiler and the hardware don't introduce funky optimizations (such as eliminating some "redundant" variable reads, a valid optimization under a single-thread assumption). Technically, that's called a "sequentially consistent" model in which reads and writes are performed and seen in the exact order in which the source code does them [8].

Wait-Free and Lock-Free versus Locked

A "wait-free" procedure can complete in a finite number of steps, regardless of the relative speeds of other threads.

A "lock-free" procedure guarantees progress of at least one of the threads executing the procedure. That means some threads can be delayed arbitrarily, but it is guaranteed that at least one thread makes progress at each step. So the system as a whole always makes progress, although some threads might make slower progress than others. Lock-based programs can't provide any of the aforementioned guarantees. If any thread is delayed while holding a lock to a mutex, progress cannot be made by threads that wait for the same mutex; and in the general case, locked algorithms are prey to deadlockeach waits for a mutex locked by the otherand livelockeach tries to dodge the other's locking behavior, just like two dudes in the hallway trying to go past one another but end up doing that social dance of swinging left and right in synchronicity. We humans are pretty good at ending that with a laugh; processors, however, often enjoy doing it until rebooting sets them apart.

Wait-free and lock-free algorithms enjoy more advantages derived from their definitions:

Thread-killing Immunity: Any thread forcefully killed in the system won't delay other threads.

Signal Immunity: The C and C++Standards prohibit signals or asynchronous interrupts from calling many system routines such as malloc . If the interrupt calls malloc at the same time with an interrupted thread, that could cause deadlock. With lock-free routines, there's no such problem anymore: Threads can freely interleave execution.

. If the interrupt calls at the same time with an interrupted thread, that could cause deadlock. With lock-free routines, there's no such problem anymore: Threads can freely interleave execution. Priority Inversion Immunity: Priority inversion occurs when a low-priority thread holds a lock to a mutex needed by a high-priority thread. Such tricky conflicts must be resolved by the OS kernel. Wait-free and lock-free algorithms are immune to such problems.

A Lock-Free WRRM Map

Column writing offers the perk of defining acronyms, so let's define WRRM (Write Rarely Read Many) maps as maps that are read a lot more than they are mutated. Examples include object factories [1], many instances of the Observer design pattern [5], mappings of currency names to exchange rates that are looked up many, many times but are updated only by a comparatively slow stream, and various other look-up tables.

WRRM maps can be implemented via std::map or the post-standard unordered_map (http://www.open-std.org/jtcl/sc22/wg21/docs/papers/2004/n1647.pdf), but as I argue in Modern C++ Design, assoc_vector (a sorted vector or pairs) is a good candidate for WRRM maps because it trades update speed for lookup speed. Whatever structure is used, our lock-free aspect is orthogonal to it; let's just call the back-end Map<Key, Value>. Also, for the purposes of this article, iteration is irrelevantmaps are only tables that provide a means to lookup a key or update a key-value pair.

To recap how a locking implementation would look, let's combine a Map object with a Mutex object like so:

// A locking implementation of WRRMMap template <class K, class V> class WRRMMap { Mutex mtx_; Map<K, V> map_; public: V Lookup(const K& k) { Lock lock(mtx_); return map_[k]; } void Update(const K& k, const V& v) { Lock lock(mtx_); map_[k] = v; } };

To avoid ownership issues and dangling references (that could bite us harder later), Lookup returns its result by value. Rock-solidbut at a cost. Every lookup locks/unlocks the Mutex, although (1) parallel lookups don't need to interlock, and (2) by the spec, Update is much less often called than Lookup. Ouch! Let's now try to provide a better WRRMMap implementation.

Garbage Collector, Where Art Thou?

The first shot at implementing a lock-free WRRMMap rests on this idea:

Reads have no locking at all.

Updates make a copy of the entire map, update the copy, and then try to CAS it with the old map. While the CAS operation does not succeed, the copy/update/ CAS process is tried again in a loop.

it with the old map. While the operation does not succeed, the copy/update/ process is tried again in a loop. Because CAS is limited in how many bytes it can swap, WRRMMap stores the Map as a pointer and not as a direct member of WRRMMap.

// 1st lock-free implementation of WRRMMap // Works only if you have GC template <class K, class V> class WRRMMap { Map<K, V>* pMap_; public: V Lookup (const K& k) { //Look, ma, no lock return (*pMap_) [k]; } void Update(const K& k, const V& v) { Map<K, V>* pNew = 0; do { Map<K, V>* pOld = pMap_; delete pNew; pNew = new Map<K, V>(*pOld); (*pNew) [k] = v; } while (!CAS(&pMap_, pOld, pNew)); // DON'T delete pMap_; } };

It works! In a loop, the Update routine makes a full-blown copy of the map, adds the new entry to it, and then attempts to swap the pointers. It is important to do CAS and not a simple assignment; otherwise, the following sequence of events could corrupt the map:

Thread A copies the map.

copies the map. Thread B copies the map as well and adds an entry.

copies the map as well and adds an entry. Thread A adds some other entry.

adds some other entry. Thread A replaces the map with its version of the mapa version that does not contain whatever B added.

With CAS, things work pretty neatly because each thread says something like, "assuming the map hasn't changed since I last looked at it, copy it. Otherwise, start all over again."

This makes Update lock-free but not wait-free, by my definitions. If many threads call Update concurrently, any particular thread might loop indefinitely, but at all times some thread will be guaranteed to update the structure successfully, thus global progress is being made at each step. Luckily, Lookup is wait-free.

In a garbage-collected environment, we'd be done, and this article would end on an upbeat note. Without garbage collection, however, there is much pain to come. This is because you cannot simply dispose of the old pMap_ willy-nilly; what if, just as you are trying to delete it, many other threads are frantically looking for things inside pMap_ via the Lookup function? You see, a garbage collector would have access to all threads' data and private stacks; it would have a good perspective on when the unused pMap_ pointers aren't perused anymore, and would nicely scavenge them. Without a garbage collector, things get harder. Much harder, actually, and it turns out that deterministic memory freeing is quite a fundamental problem in lock-free data structures.

Write-Locked WRRM Maps

To understand the viciousness of our adversary, it is instructive to try a classic reference-counting implementation and see where it fails. So, think of associating a reference count with the pointer to map, and have WRRMMap store a pointer to the thusly formed structure:

template <class K, class V> class WRRMMap { typedef std::pair<Map<K, V>*, unsigned> Data; Data* pData_; ... };

Sweet. Now, Lookup increments pData_->second, searches through the map all it wants, then decrements pData_->second. When the reference count hits zero, pData_->first can be deleted, and then so can pData_ itself. Sounds foolproof, except...except it's "foolish" (or whatever the antonym to "foolproof" is). Imagine that right at the time some thread notices the refcount is zero and proceeds on deleting pData_, another thread...no, better: A bazillion threads have just loaded the moribund pData_ and are about to read through it! No matter how smart a scheme is, it hits this fundamental Catch-22to read the pointer to the data, you need to increment a reference count; but the counter must be part of the data itself, so it can't be read without accessing the pointer first. It's like an electric fence that has the turn-off button up on top of it: To safely climb the fence you need to disable it first, but to disable it you need to climb it.

So let's think of other ways to delete the old map properly. One solution would be to wait, then delete. You see, the old pMap_ objects will be looked up by less and less threads as processor eons (milliseconds) go by; this is because new lookups use the new maps; as soon as the lookups that were active just before the CAS finish, the pMap_ is ready to go to Hades. Therefore, a solution would be to queue up old pMap_ values to some "boa serpent" thread that, in a loop, sleeps for, say, 200 milliseconds, wakes up and deletes the least recent map, and then goes back to sleep for digestion.

This is not a theoretically safe solution (although it practically could well be within bounds). One nasty thing is that if, for whatever reason, a lookup thread is delayed, the boa serpent thread can delete the map under that thread's feet. This could be solved by always assigning the boa serpent thread a lower priority than any other's, but as a whole, the solution has a stench that is hard to remove. If you agree with me that it's hard to defend this technique with a straight face, let's move on.

Other solutions [4] rely on an extended DCAS atomic instruction, which is able to compare-and-swap two noncontiguous words in memory:

template <class T1, class T2> bool DCAS(T1* p1, T2* p2, T1 e1, T2 e2, T1 v1, T2 v2) { if (*p1 == e1 && *p2 == e2) { *p1 = v1; *p2 = v2; return true; } return false; }

Naturally, the two locations would be the pointer and the reference count itself. DCAS has been implemented (very inefficiently) by the Motorola 68040 processors, but not by other processors. Because of that, DCAS-based solutions are considered of primarily theoretical value.

The first shot at a solution with deterministic destruction is to rely on the less-demanding CAS2. Again, many 32-bit machines implement a 64-bit CAS, often dubbed as CAS2. (Because it only operates on contiguous words, CAS2 is obviously less powerful than DCAS.) For starters, let's store the reference count next to the pointer that it guards:

template <class K, class V> class WRRMMap { typedef std::pair<Map<K, V>*, unsigned> Data; Data data_; ... };

(Notice that this time the count sits next to the pointer that it protects, a setup that eliminates the Catch-22 problem mentioned earlier. You'll see the cost of this setup in a minute.)

Then, let's modify Lookup to increment the reference count before accessing the map, and decrement it after. In the following code snippets, I ignore exception safety issues (which can be taken care of with standard techniques) for the sake of brevity.

V Lookup(const K& k) { Data old; Data fresh; do { old = data_; fresh = old; ++fresh.second; } while (CAS(&data_, old, fresh)); V temp = (*fresh.first)[k]; do { old = data_; fresh = old; --fresh.second; } while (CAS(&data_, old, fresh)); return temp; }

Finally, Update replaces the map with a new onebut only in the window of opportunity when the reference count is 1.

void Update(const K& k, const V& v) { Data old; Data fresh; old.second = 1; fresh.first = 0; fresh.second = 1; Map<K, V>* last = 0; do { old.first = data_.first; if (last != old.first) { delete fresh.first; fresh.first = new Map<K, V>(old.first); fresh.first->insert(make_pair(k, v)); last = old.first; } } while (!CAS(&data_, old, fresh)); delete old.first; // whew }

Here's how Update works. It defines the now-familiar old and fresh variables. But this time old.last (the count) is never assigned from data_.last; it is always 1. This means that Update loops until it has a window of opportunity to replace a pointer with a counter of 1, with another pointer having a counter of 1. In plain English, the loop says "I'll replace the old map with a new, updated one, and I'll be on the lookout for any other updates of the map, but I'll only do the replacement when the reference count of the existing map is one." The variable last and its associated code are only one optimization: Avoid rebuilding the map over and over again if the old map hasn't been replaced (only the count).

Neat, huh? Not that much. Update is now locked: It needs to wait for all Lookups to finish before it has a chance to update the map. Gone with the wind are all the nice properties of lock-free data structures. In particular, it is easy to starve Update to death: Just look up the map at a high-enough rateand the reference count never goes down to 1. So what you really have so far is not a WRRM (Write-Rarely-Read-Many) map, but a WRRMBNTM (Write-Rarely-Read-Many-But-Not-Too-Many) one instead.

Conclusion

Lock-free data structures are promising. They exhibit good properties with regards to thread killing, priority inversion, and signal safety. They never deadlock or livelock. In tests, recent lock-free data structures surpass their locked counterparts by a large margin [9]. However, lock-free programming is tricky, especially with regards to memory deallocation. A garbage-collected environment is a plus because it has the means to stop and inspect all threads, but if you want deterministic destruction, you need special support from the hardware or the memory allocator. In the next installment of "Generic<Programming>," I'll look into ways to optimize WRRMMap such that it stays lock-free while performing deterministic destruction.

And if this installment's garbage-collected map and WRRMBNTM map dissatisfied you, here's a money saving tip: Don't go watch the movie Alien vs. Predator, unless you like "so bad it's funny" movies.

Acknowlegments

David B. Held, Michael Maged, Larry Evans, and Scott Meyers provided very helpful feedback. Also, Michael graciously agreed to coauthor the next "Generic<Programming>" installment, which will greatly improve on our WRRMap implementation.

References

[1] Alexandrescu, Andrei. Modern C++ Design, Addison-Wesley Longman, 2001.

[2] Alexandrescu, Andrei. "Generic<Programming>:yasli::vector Is On the Move," C/C++ Users Journal, June 2004.

[3] Butenhof, D.R. Programming with POSIX Threads, Addison-Wesley, 1997.

[4] Detlefs, David L., Paul A. Martin, Mark Moir, and Guy L. Steele, Jr. "Lock-free Reference Counting," Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing, pages 190-199, ACM Press, 2001. ISBN 1-58113-383-9.

[5] Gamma, Erich, Richard Helm, Ralph E. Johnson, and John Vlissides. Design Patterns: Elements of Resusable Object-Oriented Software, Addison-Wesley, 1995.

[6] Meyers, Scott and Andrei Alexandrescu. "The Perils of Double-Checked Locking." Dr. Dobb's Journal, July 2004.

[7] Maged, Michael M. "Scalable Lock-free Dynamic Memory Allocation," Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35-46. ACM Press, 2004. ISBN 1-58113-807-5.

[8] Robison, Arch. "Memory Consistency & .NET," Dr. Dobb's Journal, April 2003.

[9] Maged, Michael M. "CAS-Based Lock-Free Algorithm for Shared Deques," The Ninth Euro-Par Conference on Parallel Processing, LNCS volume 2790, pages 651-660, August 2003.