I hope that this article will give a good start for a series of notes about lock-free data structures. I would like to share my experience with community, monitoring and thoughts about what lock-free structures are, how to implement them and whether the concepts of Standard Template Library (STL) containers are suitable for the lock-free containers, and when its worth to apply lock-free data structures. It doesn’t make any sense to talk about lock free data structures without covering such topics as atomic operations, memory model in programming languages, safe memory reclamation, compiler and optimizations used by them, modern CPU designs, — all of these topics will be covered more or less in this series.

I take it upon myself to tell about all these things, though I don’t consider myself to be an absolute expert in any of them. On the other hand, if I had no notion of them I wouldn’t be able to create and maintain a libcds library, — open source C++ library of lock-free containers and algorithms of safe memory reclamations. Cds stands for Concurrent Data Structure and prefix “lib”, funny enough, stands for “library”.

I will begin with the library appearing. It happened in far 2006.

I used to work then for quite a big company that created software for a telco operator. We developed quite difficult server applications for the zoo of hardware platforms, in which the performance problem came (and will always come) first. It was solved, among other, by means of paralleling the data handling. As usual, paralleling led to appearance of shared data, access to which had to be synchronized. One day in course of some discussions my colleague asked me: “Have you ever heard anything about lock free queues?” I didn’t know anything about it at the time. But having asked Google about it, I found few articles where a p-code of a lock free queue was given. Having read them several times, I didn’t get anything. Well, I came to the condition “I didn’t get anything” after rolling up my sleeves and telling the world “You fools! I am the clever one here”, I tried to “simplify” algorithms, bringing them in balance with common sense. A month after the fight with segmentation fault my common sense gave up. And that’s exactly when I “didn’t get anything”. I didn’t get how THIS works at all and even if IT somehow works, how it can be carried out at C++. But it has to work somehow, otherwise clever people wouldn’t have written all those articles and other clever people wouldn’t have referred to them (the articles were scientific, and at the end of each of them reference lists wouldn’t having been given). Following these links I read several dozen gigabytes of useful, sometimes not, information, beginning with software developer guide on CPU designs and finishing with reviews on basic methods of lock-free algorithms carrying out. I occasionally wrote something (in C++, of course) on this subject, carrying out some or other primitives. It’s worth noting that at that time (2006-2007) C++ standard was still optimistically called C++0x, there were no atomic primitives in STL, and the interface was just outlined, and compilers sometimes played pranks at my atomic primitives and gave unoperated code at particularly critical areas. By 2008 hazy outlines of libcds library began appearing. The first tests at the platforms gave encouraging, sometimes stunning (“it became 50 times faster!!!”) results, so I soaked myself into the lock-free world. In 2010 I revealed the first (0.5.0) version of the library at the SourceForge. As for today the newest library version is 1.4.0, working on the 1.5.0 version.

Now I’ll move on to the general review of the lock-free data structures. Thus, the main aim of the programmer designing and developing difficult software projects, especially server ones, is to most efficiently use all available resources of the target platform. A modern computer, even a smartphone or a tablet, is a multi-processor device, so the main method of performance tuning is the program paralleling. Parallel threads process some shared data, therefore, our main task is to organize a parallel access to such shared data.

In 1980s the so-called structure programming was popular. It was described as a way to write good programs. The apologist of structure programming was Niklaus Wirth, the author of Pascal language who has written the bestseller “Algorithms + Data Structures = Programs”. It’s interesting that this old equation shows the soft spot of the modern API type threads, Win32 API, estimated by operating systems. The API gives the means for parallel programs building (these means are threads), but gives no means for parallel data structures building, which provide the shared access. Instead of that API provides data security facilities in the form of synchronization primitives. But synchronization is a bottleneck of parallel programs. Synchronization, by definition, is an antipode of parallelism: when paralleling algorithms we work with consecutive data structures, providing their work by synchronization primitives – critical sections, mutexes, condvars. As a result we line up all our threads in queue for the access to data structures, thus killing the parallelism. Synchronization primitives are sometimes OS kernel objects, so calling such objects can be too expensive: context switch may be necessary, switch to kernel execution level, support for access wait queues to the guarded by synchronization primitive data. And all of that can be needed to change the meaning of the only one designator, i.e. to perform one or two assembly function. The burdens can be (and in fact are) too high. Besides, you shouldn’t forget that kernel OS object is a limited in quantity resource.

Another synchronization drawback is a weak scalability. When the number of threads increases the access to data becomes a bottleneck of the program. Often, when the parallelism degree increases instead of getting a proportional increase of capability we get its worsening in response to high connection.

Following Wirth’s equation “Algorithms + Data Structures = Programs”, in my work on the libcds I concentrated on data structures only. In my library you will find no algorithms of parallel sorting or parallel for-each. The library contains competitive data structures only – queue, list, map, set, etc. – and the necessary algorithms of lock-free data support, first of all, algorithms are of safe memory reclamation. Often this or that data structure is carried out in few implementations. It was decided initially: as a rule, there are few interesting algorithms that implement a queue or map, and I don’t know which one is “better” in general case. Firstly, the notion “better” or “worse” is relative and depends on the definite hardware for a definite task, secondly, until you implement the algorithm and compare it with another one, you won’t understand if it’s better. And if the algorithm is implemented and debugged, why not take its place in the library (and let the user make this difficult choice)?

In the educational community the study of implementation methods of the competitive data structures, that provide parallel access to shared data, has led to creation of several major lines:

Lock-free data structures;

Fine-grained algorithms;

Transactional memory

There are no embeddings based on the transactional memory. Transactional memory is a vast area of researching, mainly aimed for future implementations. The algorithms based on the transactional memory implies, roughly speaking, that the memory supports atomic transactions that can be atomic committed or rollbacked. It’s obvious that such a memory should be implemented in the hardware. The present software implementations, admitted by researchers, don’t have the adequate capacity. It is fair to say that Intel processors of Haswell design already have support for transactionality in its instruction code, so heyday for the algorithms, based on the principles of transactional memory, is really near.

Fine-grained algorithms – are devious methods of synchronization, based, as a rule, not on the application of synchronization primitives provided by the OS, nut on the application of “light” atomic primitives, e.g. spin-lock. On the basis of such primitives are built the data structures that allow parallel reading and even concurrent writing, in which synchronization is carried out on the node or page, bucket levels of the data structure and is built in the very operation algorithms for this structure. Often fine-grained containers show capacity compared to lock-free containers capacity under relatively light connection. Libcds library doesn’t loath such data structures.

I will name data structures that don’t require an external synchronization access, the lock-free data structures. It’s an unofficial, purely technical definition which reflects internal build of container and operations on them. The word “external” has been highlighted on purpose: you should understand that you will most likely fail to build the lock-free data structures without any special support from the side of the processor. But support of this sort in the lock-free containers is provided not by synchronization of the access to the container serial methods, but by the atomic modification mechanism, that has been wired in the container method, or by internal synchronization on the level of container compound parts (nodes, buckets, pages). A formal definition of the lock-free object is the following [Her91]: a shared object is lock-free (non-blocking object) if it guarantees that some thread will complete an operation in a finite number of system steps regardless of other threads operation results (even if these other threads failed). A more strict definition of a wait-free object says: an object is wait-free if each thread will complete operation on the object in a finite number of steps. Lock-free condition guarantees forwarding at least one thread while the stronger wait-free condition guarantees successful completion for all threads. There is also a definition of linearizability in the theory of competitive data structures. [Her90]. It plays a key role in the formal proof of lock-free algorithms correctness. Roughly speaking, algorithm is linearizable if its operation result is evident upon the algorithm’s completion. For example, the result of the list insertion will be evident when the insertion function is completed. It sounds idiotically, but it’s impossible to think out an algorithm that makes a list insertion and isn’t linearizable. For example, buffering of all kinds can lead to violation of this feature: we can place a new element into some buffer instead of insertion, give a command to another thread “insert the element from buffer to the list” and not wait until this element will be inserted. Or we will insert only when the buffer will have a fair number of elements. Then, when the insertion function is completed, we can’t guarantee that the element is in the list. All we can guarantee is that this element will certainly be (future!) inserted in the list now or later. These definitions are widely used in research scientific works. My article doesn’t claim to be scientific, so I will use the lock-free term in the “narrow” sense to define the class of competitive containers that are built without application of traditional synchronization templates or those not requiring synchronization.

So what are lock-free algorithms are characterized by? I think the first thing evident is their complexity. Do you know what represents a regular queue implemented on the singly-linked list? A very simple code:

struct Node { Node * m_pNext ; }; class queue { Node * m_pHead ; Node * m_pTail ; public: queue(): m_pHead( NULL ), m_pTail( NULL ) {} void enqueue( Node * p ) { p->m_pNext = m_pTail ; m_pTail = p ; if ( !m_pHead ) m_pHead = p ; } Node * dequeue() { if ( !m_pHead ) return NULL ; Node * p = m_pHead ; m_pHead = p->m_pNext ; if ( !m_pHead ) m_pTail = NULL ; return p ; } };

It can be written even shorter. And that’s how look like enqueue/dequeue methods (synonyms are push/pop) of a classic algorithm implementation of a lock-free Michael&Scott queue. (the code is taken with minimum reductions from the library libcds class cds::intrusive::MSQueue)

bool enqueue( value_type& val ) { node_type * pNew = node_traits::to_node_ptr( val ); typename gc::Guard guard; back_off bkoff; node_type * t; while ( true ) { t = guard.protect( m_pTail, node_to_value() ); node_type * pNext = t->m_pNext.load(memory_model::memory_order_acquire); if ( pNext != null_ptr<node_type *>() ) { // Tail is misplaced, advance it m_pTail.compare_exchange_weak( t, pNext, memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed ); continue; } node_type * tmp = null_ptr<node_type *>() ; if ( t->m_pNext.compare_exchange_strong( tmp, pNew, memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed )) { break; } bkoff(); } ++m_ItemCounter; m_pTail.compare_exchange_strong( t, pNew, memory_model::memory_order_acq_rel, CDS_ATOMIC::memory_order_relaxed ); return true; } value_type * dequeue() { node_type * pNext; back_off bkoff; typename gc::template GuardArray<2> guards; node_type * h; while ( true ) { h = guards.protect( 0, m_pHead, node_to_value() ); pNext = guards.protect( 1, h->m_pNext, node_to_value() ); if ( m_pHead.load(memory_model::memory_order_relaxed) != h ) continue; if ( pNext == null_ptr<node_type *>() ) return NULL; // empty queue node_type * t = m_pTail.load(memory_model::memory_order_acquire); if ( h == t ) { // It is needed to help enqueue m_pTail.compare_exchange_strong( t, pNext, memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed ); continue; } if ( m_pHead.compare_exchange_strong( h, pNext, memory_model::memory_order_release, CDS_ATOMIC::memory_order_relaxed )) { break; } bkoff(); } --m_ItemCounter; dispose_node( h ); return pNext; }

A complex algorithm. The same singly-linked list, but even a simple visual comparison of a general and a lock-free queue says of something. In the lock-free queue we can find the following:

Infinite loop – we are trying to perform an operation, but can’t for now. It’s a typical pattern of implementing an atomic operation compare_exchange;

Local variables security (guards) with the help of one of the safe memory reclamation methods in the lock-free algorithms, in this case – it’s a Hazard Pointers method;

Application of atomic primitives C++11 – load, compare_exchange, memory fences memory_order_xxx;

helping — a widespread method in the lock-free algorithms, when one thread helps another to perform his work;

Application of the back-off strategies (functor bkoff), that are not necessary, but allow offloading the processor in conditions of heavy connection, when many threads call the queue one by one.

I am not going to explain anything now as all of these things are quite vast and require a separate topic. Let’s keep the intrigue. I hope to cover these topics in the next articles.

The next article will concentrate on the fundamental notion on which all lock-free data structures are based – atomicity and atomic primitives.

And finally – useful books (not articles), in which, to my mind, general issues of competitive programming are the most broadly covered.

For the moment I know two good works: 1. Nir Shavit, Maurice Herlihy The Art of Multiprocessor programming. A book by the world-famous lock-free authors describes plenty of parallel algorithms and methods of their building. All examples are in Java so the authors didn’t have to care about memory reclamation, memory models C++ (in Java memory model is stricter) and other things which are applied in Java in the language, and in C++ you have to write everything yourself. Despite the fact, this book is very useful. 2. Anthony Williams C++ Concurrency in Action. A book by the world-famous C++ author covers questions of multithread programming in C++ exactly, popularly describes a new C++ standard and other facilities provided by the standard for parallel algorithms implementation. It’s highly recommended to read this book.

The links: - [Her90] M. Herlihy, J. M. Wing Linearizability: A Correctness Condition for Concurrent Objects, ACM Transactions on Programming Languages and Systems, 1990 - [Her91] M. Herlihy A Methodology for Implementing Highly Concurrent Data Object, 1991 - [Mic96] M.Michael, M.Scott Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms