Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures – David et al. 2015

Linked Lists, Hash Tables, Skip Lists, Binary Search Trees… these data structures are core to many programs. This paper studies such search data structures, supporting search, insert, and remove operations. In particular, the authors look at concurrent versions of these structures (CSDS) and ask how well they scale with multiple cores. From this analysis are derived four principles for scalability. Existing algorithms are shown to be improved by application of the patterns, and a new Hash Table implementation built from scratch according to them outperforms the next best by 23%.

The motivation of this work is to ask whether we can determine characteristics of CSDS algorithms that favor implementations which achieve what we call portable scalability, namely that scale across various platforms, workloads, and performance metrics.

To get a practical measure of portable scalability, the authors use as a baseline the performance of a simple sequential implementation and allow it to be accessed by multiple threads. This of course does not produce correct behaviour, but it does give a good indication of the theoretical maximum as it represents the case where there is no overhead from introducing concurrency controls. Such an execution is termed an asynchronized execution.

We thus consider that a CSDS achieves portable scalability when its scalability closely matches that of asynchronized executions (a) across different types of hardware, including single and multi-sockets, (b) for various classes of workloads, such as read-dominated and read-write, and (c) according to different performance metrics. In the CSDS context, aspect (c) means that as we increase the number of threads, we want to remain as close as possible to the asynchronized execution in terms of throughput and latency, without sacriﬁcing energy (i.e., without consuming more power than implementations that do not scale as well in terms of throughput or latency).

Scalability is measured on four dimensions: throughput, latency, latency distribution, and power consumption.

we perform an exhaustive evaluation of CSDSs on six different processors: a 20-core and a 40-core Intel Xeon, a 48-core AMD Opteron, a 32-core Oracle SPARC T4-4, a 36-core Tilera TILE-Gx36, and a 4-core Intel Haswell.

To perform the evaluation, a CSDS library called ASCYLIB was implemented containing 34 highly optimized and portable implementations of linked lists, hash tables, skip lists, and binary search trees. It also includes two novel CSDS algorithms designed from scratch, and ten re-engineered versions of existing algorithms to improve their scalability.

For full details of the algorithms studied with references, see table 1 in the paper. Briefly these are:

Linked List: async, coupling, pugh, lazy, copy, harris, and michael

Hash Table: async, coupling, pugh, lazy, copy, urcu, java, tbb, harris

Skip List: async, pugh, herlihy, fraser

BST: async-int, async-ext, bronson, drachsler, ellen, howley, natarajan

(where async represents the asynchronized execution baseline for comparison).

The key findings from this evaluation include the following:

“For each data structure and regardless of the platform and workload, there is at least one concurrent algorithm that performs and scales close to the asynchronized implementation (async) of the data structure. On average, the best concurrent implementations are 10% slower than their asynchronized counterparts and exhibit similar scalability trends”

“On average and low-contention levels, algorithms exhibit good scalability in terms of throughput: in the experiments with 20 threads, the average scalability (ratio) of the best performing CSDS algorithm (per data structure) is 16.2 in the low contention case, whereas for the average and high-contention levels, this value is 14.1 and 9.8, respectively. While trends are generally valid across platforms, we do notice a certain variability: for each workload, the standard deviation of the average scalability values of different plat-forms is ∼2. “

“Overall, we see that, per data structure, both lock-based and lock-free algorithms are close in terms of performance. Lock-freedom is more important when we employ more threads than hardware contexts (not shown in the graphs). In these deployments, lock-freedom provides better scalability than lock-based designs.”

“Cache coherence is the source of the most signiﬁcant scalability bottleneck for concurrent algorithms on multi-cores, because the number of cache-line transfers tends to increase with the number of threads. Hence, it is essential for a CSDS algorithm to limit the amount of cache trafﬁc it performs during each operation, which is directly linked to the number of stores on shared data. Stores cause cache-line invalidations, which in turn generate cache misses of future accesses.”

The algorithms that performed the best seem to have something in common:

We notice that the CSDS algorithms that tend to scale and perform the best also tend to have an average number of loads, stores, and branches closer to the asynchronized implementations than the alternatives. This is generally valid for all the operations and phases. Thus, we make the observation that the more an algorithm’s memory access pattern resembles that of the asynchronized execution, the better it scales regardless of the platform and workload.

By exploring this connection more deeply, the authors derive four principles or patterns:

The search operation should not involve any stores, waiting, or retries The parse phase of an update operation should not perform any stores other than for cleaning-up purposes and should not involve waiting, or retries. An update operation whose parse phase is unsuccessful should not perform any stores, besides those used for cleaning up the parse phase. The number and region of memory stores in a successful update should be close to those of a standard sequential implemenation.

None of these patterns is fundamentally counter-intuitive and each of them has already been identiﬁed as important in some form or another. We ﬁnd that the existing algorithms that scale the best already apply some of these patterns. To our knowledge however, they have never been put in a coherent form and collectively applied and evaluated. We refer to these patterns as asynchronized concurrency (ASCY), for together they indeed call for the design of concurrent algorithms to resemble that of their sequential counterparts in terms of access to shared state.

Applying pattern 1 to the harris lock-free list implementation yields a latency improvement on the order of 10-30%. Applying pattern 2 to the fraser skip list yields up to 8% better throughput and 5% lower latency. Applying pattern 3 to the java hash-table algorithm yielded a 12.5% throughput improvement. The natarajan tree, which already uses pattern 4, is the best performing of the BST algorithms under evaluation.

Finally the team design two new CSDS data structures from scratch according to the four principles: a cache-line hash table (CLHT), and a BST-Ticket tree. CLHT outperforms pugh (the best of the existing hash table algorithms) by 23% on average using a lock-based design, and 13% on average with a lock-free design. The lock-free implementation fares better as the number of threads increasing (and is ahead at 40 threads).

CLHT uses cache-line-sized buckets and, of course, follows the four ASCY patterns. As a cache-line block is the granularity of the cache-coherence protocols, CLHT ensures that most operations are completed with at most one cache-line transfer. CLHT uses the 8 words of a cache line as: [concurrency, k1, k2, k3, v1, v2, v3]. The ﬁrst word is used for concurrency-control; the next six are the key/value pairs; the last is a pointer that can be used to link buckets. Updates synchronize based on the concurrency word and do in-place modiﬁcations of the key/value pairs of the bucket. To support in-place updates, the basic idea behind CLHT is that a search/parse does not simply traverse the keys, but obtains an atomic snapshot of each key/value pair.

BST-Ticket ends up performing very similarly to natarajan.

With increasing cores, and NVMM on the horizon, these kind of highly concurrent in-memory basic data structures are sure to get plenty of attention.