Improving ticket spinlocks

This article brought to you by LWN subscribers Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

Spinlocks, being the lowest-level synchronization mechanism in the kernel, are the target of seemingly endless attempts at performance enhancement. The ticket spinlock mechanism used in the mainline has resisted such attempts for a few years. Now, though, some developers have identified a performance bottleneck associated with these locks and are busily trying to come up with an improved version.

A spinlock is so-named because a CPU waiting for a contended lock will "spin" in a tight loop, repeatedly querying the lock until it becomes available. Ticket spinlocks adjust this algorithm by having each waiting CPU take a "ticket" so that each CPU obtains the lock in the order in which it arrived. These locks thus resemble the "take a number" mechanisms found at deli counters or motor vehicle division offices worldwide — though, with luck, the wait is rather shorter than is required to renew a driver's license in your editor's part of the world. Without the ticket mechanism, which was added for the 2.6.25 release, the kernel's spinlocks were unfair; in some situations, some waiters could be starved for an extended period of time.

It has long been understood that lock contention reduces system performance considerably. The simple act of spinning for a lock clearly is not going to be good for performance, but there are also caching issues to take into account. If two CPUs are repeatedly acquiring a spinlock, the memory location representing that lock will bounce back and forth between those CPUs' caches. Even if neither CPU ever has to wait for the lock, the process of moving it between caches will slow things down considerably. For that reason, interest in lockless algorithms has been growing for many years.

In the case of a contended lock, though, cache contention would appear to be less of an issue. A CPU spinning on a lock will cache its contents in a shared mode; no cache bouncing should occur until the CPU owning the lock releases it. Releasing the lock (and its acquisition by another CPU) requires writing to the lock, and that requires exclusive cache access. The cache line movement at that time hurts, but probably not as much as waiting for the lock in the first place. So it would seem that trying to optimize cache behavior in the contended case is not likely to produce much in the way of useful results.

That picture is not complete, though; one must take a couple of other facts into account. Processors do not cache a single value; they cache a "line" of (typically) 128 consecutive bytes as a single unit. In other words, the cache lines in any contemporary processor are almost certainly significantly larger than what is required to hold a spinlock. So when a CPU needs exclusive access to a spinlock's cache line, it also gains exclusive access to a significant chunk of surrounding data. And that is where the other important detail comes into play: spinlocks tend to be embedded within the data structures that they protect, so that surrounding data is typically data of immediate interest to the CPU holding the lock.

Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. If the lock is uncontended, that access is not a problem; the CPU owning the lock probably owns the cache line as well. But if the lock is contended, there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.

How badly can throughput be impacted? In the description of his patch adding proportional backoff to ticket spinlocks, Rik van Riel describes a microbenchmark that is slowed by a factor of two when there is a single contending CPU, and by as much as a factor of ten with many CPUs in the mix. That is not just a slowdown; that is a catastrophic loss of performance. Needless to say, that is not the sort of behavior that kernel developers like to see.

Rik's solution is simple enough. Rather than spinning tightly and querying a contended lock's status, a waiting CPU should wait a bit more patiently, only querying the lock occasionally. So his patch causes a waiting CPU to loop a number of times doing nothing at all before it gets impatient and checks the lock again. It goes without saying that picking that "number of times" correctly is the key to good performance with this algorithm. While a CPU is looping without querying the lock it cannot be bouncing cache lines around, so the lock holder should be able to make faster progress. But too much looping will cause the lock to sit idle before the owner of the next ticket notices that its turn has come; that, too, will hurt performance.

The first step in Rik's patch series calculates how many CPUs must release the lock before the current CPU can claim it (by subtracting the current CPU's ticket number from the number currently being served) and loops 50 times for every CPU that is ahead in the queue. That is where the "proportional backoff" comes in; the further back in line the CPU is, the longer it will wait between queries of the lock. The result should be a minimizing of idle looping while also minimizing cache traffic.

The number 50 was determined empirically, but it seems unlikely that it will be optimal for all situations. So the final part of Rik's patch set attempts to tune that number dynamically. The dynamic delay factor is increased when the lock is found to be unavailable and decreased when the lock is obtained. The goal is to have a CPU query the lock an average of 2.7 times before obtaining it. The number 2.7, once again, was obtained by running lots of tests and seeing what worked best; subsequent versions of the patch have tweaked this heuristic somewhat. Details aside, the core idea is that the delay factor (a per-CPU value that applies to all contended locks equally) will increase for workloads experiencing more contention, tuning the system appropriately.

That said, the notion of a single delay for all locks is likely to be causing a severe case of raised eyebrows for some readers, and, indeed, it turned out to be inadequate; some locks are rather more contended than others, after all. So the January 3 version of Rik's patch keeps a hashed list (based on the spinlock address) of delay values instead.

Michel Lespinasse ran some experiments of his own to see how well the proportional backoff algorithm worked. In particular, he wanted to figure out whether it was truly necessary to calculate a dynamic delay factor, or whether an optimal static value could be found. His conclusion was that, in fact, a static value is good enough; it might be possible to do a little better with a dynamic value, he said, but the improvement is not enough to justify the added complexity of the tuning mechanism. There is just one little difficulty:

Of course, one major downside in my proposal is that I haven't figured out an automatic way to find the most appropriate spinlock_delay system tunable. So there is clearly some more experimentation needed there. However, IMO the important result here is that our goal of avoiding performance cliffs seems to be reachable without the complexity (and IMO, risk) of per-spinlock tuning values.

If these results stand, and an appropriate way of picking the static value can be found, then there is probably not a case for adding dynamic backoff to the kernel's spinlock implementation. But the backoff idea in general would appear to be a significant improvement for some workloads. So the chances are good that we will see it added in some form in an upcoming development cycle.

