Lock-free programming for the masses

Efficient concurrent programming libraries are essential for taking advantage of fine-grained parallelism on multicore hardware. In this post, I will introduce reagents, a composable, lock-free concurrency library for expressing fine-grained parallel programs on Multicore OCaml. Reagents offer a high-level DSL for experts to specify efficient concurrency libraries, but also allows the consumers of the libraries to extend them further without knowing the details of the underlying implementation.

Motivation

Designing and implementing scalable concurrency libraries is an enormous undertaking. Decades of research and industrial effort has led to state-of-the-art concurrency libraries such as java.util.concurrent (JUC) for the JVM and System.Collections.Concurrent (SCC) for the .NET framework. These libraries are often written by experts and have subtle invariants, which makes them hard to maintain and improve. Moreover, it is hard for the library user to safely combine multiple atomic operations. For example, while JUC and SCC provide atomic operations on stacks and queues, such atomic operations cannot be combined into larger atomic operations.

On the other hand software transactional memory (STM) offers composability, but STM based data structures are generally less efficient than their lock-free counterparts, especially when there is moderate to high levels of contention. Aaron Turon introduced reagents, an expressive and composable library which retains the performance and scalability of lock-free programming. Reagents allow isolated atomic updates to shared state, as well as message passing communication over channels. Furthermore, reagents provide a set of combinators for sequential composition à la STM, parallel composition à la Join calculus, and selective communication à la Concurrent ML, while being lock-free. Reagents occupy this sweet-spot between expressivity and performance, and we believe it could serve as a great default for writing fine-grained concurrent programs in Multicore OCaml.

Combinators

The basic reagents combinators are presented below.

type ( ' a , ' b ) t (* channel communication *) val swap : ( ' a , ' b ) endpoint -> ( ' a , ' b ) t (* shared memory *) val upd : ' a ref -> ( ' a -> ' b -> ( ' a * ' c ) option ) -> ( ' b , ' c ) t (* sequential composition *) val ( >>> ) : ( ' a , ' b ) t -> ( ' b , ' c ) t -> ( ' a , ' c ) t (* conjunction *) val ( <*> ) : ( ' a , ' b ) t -> ( ' a , ' c ) t -> ( ' a , ' b * ' c ) t (* disjunction *) val ( <+> ) : ( ' a , ' b ) t -> ( ' a , ' b ) t -> ( ' a , ' b ) t val run : ( ' a , ' b ) t -> ' a -> ' b

A reagent value with type ('a,'b) t represents an atomic transaction that takes an input of type 'a and returns a value of type 'b . The basic atomic operations are exchanging message on an endpoint of a channel through swap and updating a shared reference through upd . The swap operation blocks the calling thread until a matching swap operation is available on the dual endpoint.

The atomic reference update operation upd , takes a function which is applied to the current value of the reference (of type 'a ) and the input value (of type 'b ), and is expected to return an optional pair of the new value for the reference and a return value (of type 'c ). If the update function returns None , then the invoking thread blocks until the reference is updated. Reagent implementation takes care of the blocking and signalling necessary for thread wake up.

The most important feature of reagents is that it allows composition of reagent transactions in sequence >>> and in parallel <*> , and also to selectively choose one of the available operations <+> . Furthermore, these combinators being arrows, enable optimisations that cover the common case and help reagents achieve performance commensurate to hand-written implementations. Reagents library also exposes monadic combinators for convenience, at the cost of forgoing optimisation opportunities.

A lock-free stack

The following is a reagent implementation of the Treiber lock-free stack.

module R = Reagent module TreiberStack : sig type ' a t val create : unit -> ' a t val push : ' a t -> ( ' a , unit ) R . t val pop : ' a t -> ( unit , ' a ) R . t val try_pop : ' a t -> ( unit , ' a option ) R . t end = struct type ' a t = ' a list R . ref let create () = R . ref [] let push r = R . upd r ( fun xs x -> Some ( x :: xs , () )) let try_pop r = R . upd r ( fun l () -> match l with | [] -> Some ([] , None ) | x :: xs -> Some ( xs , Some x )) let pop r = Ref . upd r ( fun l () -> match l with | [] -> None | x :: xs -> Some ( xs , x )) end

We utilise a shared reference of type 'a list ref to represent the stack and use the upd operation to perform atomic operations on the stack. The important take away from this snippet is that the code is no more complicated than a sequential stack implementation. The logic for backoff, retry, blocking and signalling are hidden behind the reagents implementation. In particular, the pop operation blocks the calling thread until the stack is non-empty. Thus, the experts can write efficient concurrency libraries using reagents while preserving readability (and as a consequence maintainability) of code.

Furthermore, since the stack interface is exposed as reagents, the individual operations can be further composed. For example, given two Treiber stacks s1 and s2 , pop s1 >>> push s2 transfers elements atomically between the stacks, pop s1 <*> pop s2 consumes elements atomically from both of the stacks, and pop s1 <+> pop s2 consumes an element from either of the stacks. Importantly, the composition preserves the optimisations and blocking/signalling behaviours, allowing the users of the library to arbitrarily combine and extend the functionality without knowing about the underlying implementation.

Feeding the philosophers

The parallel composition combinator provides an elegant way to solve the Dining Philosophers problem. The problem imagines a set of philosophers seated around a circular table, forever alternating between thinking and eating. Forks are placed between adjacent philosophers, and each philosopher can only eat after obtaining both the left and right forks. The goal is to design a solution where no philosopher will starve. The problem highlights the issues of deadlock and fairness in concurrent programming.

One way to solve this problem is to model each fork as a pair of endpoints, one for taking and another for dropping the fork.

type fork = { drop : ( unit , unit ) endpoint ; take : ( unit , unit ) endpoint } let mk_fork () = let drop , take = mk_chan () in { drop ; take } let drop f = swap f . drop let take f = swap f . take

Now, the solution for a single round of eating can be implemented as follows:

let eat l_fork r_fork = ignore @@ run ( take l_fork <*> take r_fork ) () ; (* ... * eat * ... *) spawn @@ run ( drop l_fork ); spawn @@ run ( drop r_fork )

We use take l_fork <*> take r_fork to atomically take both of the forks. Reagents ensure that the protocol does not deadlock. After eating, we release the forks by spawning lightweight threads. In the next round, the philosophers race for the available forks. If the thread scheduler is fair, then the protocol provides fairness among the philosophers. The complete solution is available here.

Implementation

The key idea behind the implementation is that the reagent transaction executes in two phases. The first phase involves collecting all the compare-and-swap (CAS) operations necessary for the transaction, and the second phase is invoking a k-CAS operation (emulated in software). The failure to gather all the available CASes constitutes a permanent failure, causing the thread to explore other alternatives in the case of a selective communication or block otherwise. The failure in the second phase means that there is active interference from other concurrent threads, in which case the transaction is retried.

Performance of the Reagents depends critically on having fine-grained control over threads and schedulers for implementing backoff loops, blocking and signalling. However, one of the main ideas of multicore OCaml is not to bake in the thread scheduler into the compiler but rather describe them as libraries. To this end, the reagents library is functorized over the following generic scheduler interface:

module type Scheduler = sig (* continuation *) type ' a cont effect Suspend : ( ' a cont -> ' a option ) -> ' a effect Resume : ' a cont * ' a -> unit end

The interface itself only describes the scheduler’s effects, whose behaviour is defined by the handlers. perform (Suspend f) applies f to the current continuation, and allows the Reagent library to stash the thread on the unavailable resource’s wait queue. The return type of f is an option to handle the case when the resource might have become available while suspending. If f returns None , then the control returns to the scheduler. Once the resource becomes available, the reagent library performs the Resume effect to resume the suspended thread.

Comparison to STM

Reagents are less expressive than STM, which provides serializability. But in return, Reagents provide stronger progress guarantee (lock-freedom) over STM (obstruction-freedom). A reagent transaction operating more than once on the same memory location will fail at runtime. Abstractly, this behaviour is disallowed since it cannot be represented as a k-CAS operation. Due to this restriction, the transaction pop s1 >>> push s1 always fails, and prohibits important patterns such as atomically pushing or popping multiple values from the same stack. I am currently working on extending the reagents semantics to relax this invariant. The resultant behaviour will be similar to a version of snapshot isolation. While this is weaker than serializability semantics offered by the STM, we will retain the benefit of lock-freedom.

Contribute!

Using the reagents library, we have implemented a collection of composable concurrent data and synchronization structures such as stacks, queues, countdown latches, reader-writer locks, condition variables, exchangers, atomic counters, etc. There is great opportunity here to build a standard library for fine-grained parallelism for Multicore OCaml, incorporating the latest developments in lock-free data structures. There is still work to be done optimising the implementation to remove allocations in the fast path, and fine-tuning the reagents core.

Contributions to the library are most welcome, and is a great way to contribute to the Multicore OCaml effort. Please do file those issues and submit pull-requests.