Concurrent code and expensive instructions

Benefits for LWN subscribers The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

Introduction

Symmetric multiprocessing (SMP) code often requires expensive instructions, including atomic operations and memory barriers, and often causes expensive cache misses. Yet some SMP code can be extremely cheap and fast, using no expensive instructions at all. Examples of cheap SMP code include per-CPU counters and RCU read-side critical sections. So why can't all SMP code be cheap? Is it just that we aren't smart enough to spot clever ways of implementing other algorithms, for example, concurrent stacks and queues? Is it that we might be able to implement concurrent stacks and queues without expensive instructions, but only at the cost of mind-bending complexity? Or is it simply impossible to implement concurrent stacks and queues without using expensive instructions?

My traditional approach has been to place my faith in two observations: (1) if you beat your head against a wall long enough, one of two things is bound to happen, and (2) I have a hard head. Although this approach has worked well, something less painful would be quite welcome. And so it was with great interest that I read a paper entitled "Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated" by Attiya et al., with the “et al.” including Maged Michael, whom I have had the privilege of working with for quite some time.

It is important to note that the title overstates the paper's case somewhat. Yes, the paper does present some laws requiring that many concurrent algorithms use expensive instructions, however, all laws have their loopholes, including the Laws of Order. So while we do need to understand the Laws of Order, we most especially need to understand how to fully exploit their loopholes.

To arrive at the Laws of Order, this paper first expands the definition of commutativity to include sequential composition, which in the C language can best be thought of as the “;” operator. In this case, commutativity depends not just on the operator, but on the operands, which for our purposes can be thought of as calls to arbitrary C functions. For example, the statements:

atomic_inc(&x); atomic_inc(&y);

are commutative: the values of x and y are the same regardless of the order of execution. In contrast:

atomic_set(&x, 1); atomic_set(&x, 2);

are non-commutative: the value of x will be either 1 or 2, depending on which executes first.

These examples execute sequentially, but the paper considers concurrent execution. To see this, consider a concurrent set that has these operations:

a set-member-addition function (call it set_add() ) that returns an indication of whether the element to be added was already in the set,

) that returns an indication of whether the element to be added was already in the set, a set-member-test function (call it set_member() ), and

), and a set-member-removal function (call it set_remove() ) that returns an indication of whether anything had actually been removed.

Then concurrently testing two distinct members is commutative: the order in which set_member(&s, 1) and set_member(&s, 2) execute will not affect the return values from either, and the final value of set s will be the same in either case. Therefore, it is not necessary for the two invocations to coordinate with each other. The fact that coordination is not required means that there is some hope that expensive instructions are not needed to implement set_member() .

In contrast, concurrent invocation of set_add(&s, 1) and set_member(&s, 1) would not be commutative if the set s initially did not contain the value 1. The set_member() invocation would return true only if it executed after the set_add() . Some coordination between the two functions is clearly required.

The most important results of the paper rely on a strong form of non-commutativity, which is strangely enough termed “strong non-commutativity”, which we can abbreviate to “SNC”. The example in the previous paragraph is not SNC because, while set_add() can affect set_member() , the reverse is not the case. In contrast, an SNC pair of functions would each affect the other's result. For example, consider set_add(&s, 1) and set_remove(&s, 1) , where the set s is initially empty. If set_add(&s, 1) executes first, then both functions will indicate success, and set s will be empty. On the other hand, if set_remove(&s, 1) executes first, then only the set_add(&s, 1) will indicate success and the set s will contain 1. In this case, the return value of set_remove() is affected by the order of execution. On the other hand, if the set s initially contains 1, it will be set_add(&s, 1) whose return value is affected. Therefore, the order of execution can affect the return value of both functions, and these functions are therefore SNC.

Quick Quiz 1: Is atomic_add_return() SNC? In other words, are multiple concurrent calls to this function SNC?

Answer

The key result of the paper is that under certain conditions, the implementation of a pair of SNC functions must contain a heavyweight instruction, where a “heavyweight instruction” can either be an atomic read-modify-write instruction or a heavyweight memory barrier. In the Linux kernel, only smp_mb() qualifies as a heavyweight memory barrier.

The “certain conditions” are:

Both functions in the pair must be deterministic, in other words, the final state (including return values) must be a strict function of the initial state and order of execution. The functions must be linearizable.

Interestingly enough, although the paper requires that the implementation of an SNC, deterministic, and linearizable pair of functions each contain at least one heavyweight instruction, it doesrequire that this instruction be executed on each invocation.

Quick Quiz 2: Imagine an increment function that is not permitted to lose counts even when multiple invocations execute concurrently, and that does not return the value of the counter. Must the implementation of such a function contain an atomic read-modify-write instruction or a heavyweight memory barrier?

Answer

So if we want our code to run fast, we have four ways to avoid heavyweight instructions:

Formulate the API to be non-SNC. Design the implementation so that any required heavyweight instructions almost never need to actually be executed. Accept non-determinism. Accept non-linearizability. The paper ignores this possibility, possibly due to the common academic view that non-linearizable algorithms are by definition faulty.

Interestingly enough, relativistic programming has long suggested use of several of these approaches to attain good performance and scalability. The “Laws of Order” therefore provides a good theoretical basis for understanding why relativistic programming is both desirable and necessary.

Let's take a look at some examples, starting with a memory allocator. Given that concurrent calls to kmalloc() are not supposed to return a pointer to the same block of memory, we have to conclude that kmalloc() is SNC and thus might need heavyweight instructions in its implementation.

Quick Quiz 3: How can we avoid the use of heavyweight instructions in the implementation of kmalloc ? If it turns out to be impossible to completely avoid their use, how can we reduce the frequency of their execution?

Answer

The second example is of course RCU. Let's focus on the rcu_read_lock() , rcu_read_unlock() , synchronize_rcu() , rcu_dereference() and rcu_assign_pointer() API members. The rcu_read_lock() function is unaffected by any of the other members, so any pair that includes rcu_read_lock() is non-SNC, which is why this function need not include any heavyweight instructions. The same is true of rcu_read_unlock() .

Interestingly enough, synchronize_rcu() is affected by both rcu_read_lock() and rcu_read_unlock() , in that the former can prevent synchronize_rcu() from returning and the latter can enable it to return. However, neither rcu_read_lock() nor rcu_read_unlock() is affected by synchronize_rcu() . This means that synchronize_rcu() is non-SNC and might therefore have an implementation that does not use heavyweight instructions. However, such an implementation seems quite implausible if you include the actions of the updater both before and after the call to synchronize_rcu() in conjunction with the RCU read-side critical section. The paper, though, considers only data flowing via those function's arguments and return value. It would be interesting to see a generalization of this work that includes side effects.

My guess is that for a given code fragment to be non-SNC, any conceivable API would need to be non-SNC. If my guess is correct, then the full RCU update is non-SNC with respect to any RCU read-side critical section containing rcu_dereference() . The reasoning is that the return value from rcu_dereference() can be affected by the RCU update, and the duration for which synchronize_rcu() blocks can be affected by rcu_read_lock() and rcu_read_unlock() .

Quick Quiz 4: Are there any conditions in which rcu_read_unlock() will be SNC with respect to synchronize_rcu() ?

Answer

Finally, let us look at the set implementation that includes set_add() , set_member() , and set_remove() . We saw that set_add() and set_remove() were SNC.

Quick Quiz 5: Is there any way to implement set_add() and set_remove() without using heavyweight instructions?

Answer

Of course, this paper does have a few shortcomings, many of which fall under the rubric of “future work”:

The paper describes the theoretical limitations at great length, but does not describe many ways of avoiding them. However, I am quite confident that the Linux kernel community will be more than able to produce good software engineering solutions that work around these limitations. In fact, there is a lot to be said for letting the theoreticians worry about limitations and letting us hackers worry about solving problems in spite of those limitations. The paper focuses almost exclusively on reordering carried out by the CPU. It turns out that reordering due to compiler optimizations can be at least as “interesting” as CPU reordering. These sorts of compiler optimizations are allowed by the current C-language standard, which permits the compiler to assume that there is only one thread in the address space. Within the Linux kernel, the barrier() directive restricts the compiler's ability to move code, and this directive (or its open-coded equivalent) is used in locking primitives, atomic operations, and memory barriers. There is some uncertainty about exactly what properties of code must be SNC for this paper's results to hold. The paper focuses almost exclusively on function arguments and return values, but my guess is that the list of properties is quite general. For example, an unconditional lock-acquisition primitive certainly seems like it should be covered by this paper's result, but such primitives do not return a value. Can the fact that the second of two concurrent acquisitions simply fails to return be considered to be evidence of the SNC nature of lock acquisition? If not, exactly why not? If so, exactly what is the set of effects that must be taken into account when judging whether or not this code fragment is SNC? This seems to be a future-work topic. A bit of thought about the results of this paper give clear reasons why it is often so hard to parallelize existing sequential code. Sequential code inflicts no penalties for the use of SNC APIs, so SNC APIs can be expected to appear in sequential code even when a non-SNC API might have served just as well. After all, what programmer could resist the temptation to make set_add() return an indication of whether the element was already in the set? The paper would have done well to state this point clearly. The paper fails to call out non-linearizability as a valid loophole to its laws of order. An interesting open question: What are the consequences of using one of the loopholes of the laws of order? In my limited personal experience, leveraging non-linearizability and privatization permits full generality (for example, parallel memory allocators), while leveraging non-SNC and non-determinism results in specialized algorithms (for example, RCU). It would be quite interesting to better understand any theoretical and software-engineering limitations imposed by these loopholes. The paper overreaches a bit when it states that: For synthesis and verification of concurrent algorithms, our result is potentially useful in the sense that a synthesizer or a verifier need not generate or attempt to verify algorithms that do not use RAW [ smp_mb() ] and AWAR [atomic read-modify-write operations] for they are certainly incorrect. As we have seen, it is perfectly legal for a concurrent algorithm to avoid use of these operations as long as that algorithm is either: (1) non-SNC, (2) non-deterministic, or (3) non-linearizable. There are a few other places where the limitations on the main result are not stated as carefully as they should be. Given that the rest of the paper seems quite accurate and on-point, I would guess that this sentence is simply an honest error that slipped through the peer-review process. We all make mistakes.

Although I hope that these shortcomings will be addressed, I hasten to add that they are insignificant compared to the huge step forward that this paper represents.

In summary, the “Laws of Order” paper shines some much-needed light on the question of whether heavyweight instructions are needed to implement a given concurrent algorithm. Although I am not going to say that this paper fully captures my parallel-programming intuition, I am quite happy that it does land within a timezone or two, which represents a great improvement over previous academic papers. But the really good news is that the limitations called out in this paper have some interesting loopholes that can be exploited in many cases. If the Linux kernel community pays careful attention to both the limitations and the loopholes called out in this paper, I am confident that the community's already-impressive parallel-programming capabilities will become even more formidable.

I owe thanks to Maged Michael, Josh Triplett, and Jon Walpole for illuminating discussions and for their review of this paper, and to Jim Wasko for his support of this effort.

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Quick Quiz 1: Is atomic_add_return() SNC? In other words, are multiple concurrent calls to this function SNC?

Answer: Yes. Suppose that an atomic_t variable named a is initially zero and that a pair of concurrent atomic_add_return(1, &a) functions execute. The first one to execute will return zero, and the second one will return one. Each instance's return value is therefore affected by the order of execution, which indicates strong non-commutativity.

This may seem strange, given that addition is commutative. And in fact the final value of a will be two regardless of order of execution.

To see the reasoning behind the definition of SNC, consider atomic_inc(&a) , which also adds one to a but does not return the initial value. In this case, because there are no return values, the invocations of atomic_inc(&a) cannot possibly affect each others' return values.

Therefore, atomic_inc(&a) is non-SNC.

It is interesting to note that the designers of the Linux kernel's suite of atomic operations had an intuitive understanding of the results of this paper. The atomic operations that return a value (and thus are more likely to be SNC) are the ones that are required to provide full memory ordering.

Back to Quick Quiz 1.

Quick Quiz 2: Imagine an increment function that is not permitted to lose counts even when multiple invocations execute concurrently, and that does not return the value of the counter. Must the implementation of such a function contain an atomic read-modify-write instruction or a heavyweight memory barrier?

Answer: No. Although this function is deterministic and linearizable, it is non-SNC. And in fact such a function could be implemented via a “split counter” that uses per-CPU non-atomic variables. Because each CPU increments only its own variable, counts are never lost. To get the aggregate value of the counter, simply sum up the individual per-CPU variables.

Of course, it might be necessary to disable preemption and/or interrupts across the increments, but such disabling requires neither atomic read-modify-write instructions nor heavyweight memory barriers.

However, the linearizability of this function depends on the counter always being incremented by the value 1. To see this, imagine a counter with an initial value of zero to which three CPUs are concurrently adding the values 3, 5, and 7, and that meanwhile three other CPUs are reading out the counter's value. Because there are no ordering guarantees, these three other CPUs might see the additions in any order. One of these CPUs might add the per-CPU variables and obtain a sum of 3, another might obtain a sum of 5, and the third might obtain a sum of 7. These three results are not consistent with any possible ordering of the additions, so this counter is not linearizable.

However, for a great many uses, this lack of linearizability is not a problem.

Back to Quick Quiz 2.

Quick Quiz 3: How can we avoid the use of heavyweight instructions in the implementation of kmalloc() ? If it turns out to be impossible to completely avoid their use, how can we reduce the frequency of their execution?

Answer: The usual approach is to observe that a given pair of invocations of kmalloc() invocations will be SNC only if it is possible for them to be satisfied by the same block of memory. The usual way to greatly reduce the probability of a pair of kmalloc() invocations fighting over the same block of memory is to maintain per-CPU pools of memory blocks, which is what the Linux kernel's implementation of kmalloc() actually does. Heavyweight instructions are executed only if a given CPU's pool either becomes exhausted or overflows. This approach is related to the paper's suggestion of using “single-owner” algorithms.

It might be possible to avoid heavyweight instructions by introducing non-determinism, for example, by making kmalloc() randomly fail. This can certainly be accomplished by making kmalloc() unconditionally return NULL if the CPU's pool was exhausted, but such an implementation might not prove to be fully satisfactory to its users. Coming up with a reasonable implementation that uses non-determinism to avoid heavyweight instructions is left as an exercise for the adventurous reader.

Similarly, eliminating heavyweight instructions by introducing non-linearizability is left as an exercise for the adventurous reader.

Back to Quick Quiz 3.

Quick Quiz 4: Are there any conditions in which rcu_read_unlock() will be SNC with respect to synchronize_rcu() ?

Answer: In some implementations of RCU, synchronize_rcu() can interact directly with rcu_read_unlock() when the grace period has extended too long, either via force_quiescent_state() machinations our via RCU priority boosting. In these implementations, rcu_read_unlock() will be SNC with respect to synchronize_rcu() . The Linux kernel's Preemptible Tree RCU is an example of such an implementation, as can be seen by examining the rcu_read_unlock_special() function in kernel/rcutree_plugin.h . This code executes rarely, thus using the second loophole called out above (“Design the implementation so that any required heavyweight instructions almost never need to actually be executed”).

Back to Quick Quiz 4.

Quick Quiz 5: Is there any way to implement set_add() and set_remove() without using heavyweight instructions?

Answer: This can be done easily for sets containing small integers if there is no linearizability requirement. The set is represented as a dense array of bytes so that each potential member of the set maps to a specific byte. The set_add() function would set the corresponding byte to one, the set_remove() function would clear the corresponding byte to zero, and the set_member() function would test the corresponding byte for non-zero.

This implementation is non-linearizable because different CPUs might well disagree on the order that members were added to and removed from the set.

Back to Quick Quiz 5.

