Lock elision in the GNU C library

Benefits for LWN subscribers The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

elision : the act or an instance of omitting something : omission

Contended locks are a common software scaling problem on modern multi-core systems. As more cores are added the synchronization overhead increases, potentially exceeding the speed-up provided by the extra processors. Writing good, scalable, fine-grained locking is difficult for programmers. An alternative to locking is to use memory transactions.

A memory transaction buffers all of the memory side effects from a code region and makes the memory changes visible in an atomic manner, but only if the transaction commits. If, instead, there is a conflict — memory used by one thread in the transaction is modified by another thread at the same time — the transaction is aborted and rolled back, and its side effects are discarded. If the transaction commits, all memory changes become visible to other threads atomically.

Traditionally, memory transactions have been implemented in software and were too slow for most practical uses. In order to make memory transactions usable in real-world situations, Intel has announced "Intel Transactional Synchronization Extension" (Intel TSX) for the upcoming Haswell CPU core. TSX supports memory transactions in hardware to improve their performance and make them practical. IBM has also announced similar extensions.

Of course very few programs today are written to use memory transactions directly. Hardware memory transactions are a best-effort model intended for fast paths: there is no guarantee that any hardware transaction will be successful (but most reasonable ones should be). Any transaction implemented with TSX needs a fallback path. The simplest fallback path is to use a lock when the transaction doesn't succeed. This technique is called "lock elision"; it executes a lock-protected region speculatively as a transaction, and only falls back to blocking on the lock if the transaction does not commit.

Lock elision uses the same programming model as normal locks, so it can be directly applied to existing programs. The programmer keeps using locks, but the locks are faster as they can use hardware transactional memory internally for more parallelism. Lock elision uses memory transactions as a fast path, while the slow path is still a normal lock. Deadlocks and other classic locking problems are still possible, because the transactions may fall back to a real lock at any time. Critical sections that do not conflict in their memory accesses with other threads and are otherwise transaction-friendly will execute in parallel, and not block each other out. The lock elision fast path makes the lock lock-less (non-blocking) if the conditions are right.

Usually, getting good performance with locks requires a large programming effort to do locking in a sufficiently fine-grained manner that individual locks are not overly contended. This work complicates the program. With lock elision it is possible to stay with coarser-grained locks and let the hardware discover the underlying parallelism opportunities instead, saving programmer time.

Intel TSX implements two instruction-set interfaces: HLE (Hardware Lock Elision) and RTM (Restricted Transactional Memory). HLE can be used to add transaction hints to existing code paths, while RTM allows control of the transactions directly and the use of software abort handlers. HLE refers to the specific instruction prefixes, while lock elision is a more general concept that can be also implemented with RTM. RTM requires new code paths as new instructions are used.

The lock library needs to be modified to enable elision. Obvious candidates are the glibc POSIX thread (pthread) mutex and rwlock primitives. Existing binaries can be dynamically linked to the new glibc pthreads library. Programs using pthread locking can then directly elide their locks without modifications. Programs that implement their own locking library need some changes to their lock code paths to enable lock elision.

glibc pthreads elision implementation

An implementation of RTM lock elision for glibc has been posted to the libc-alpha mailing list. POSIX mutexes support different locking types and attributes. Glibc already has a range of lock types, like timed, adaptive, recursive; elided locks simply become a new lock attribute and type. The elision can be controlled by the administrator or the programmer, or it can be automatically controlled by the pthreads library. The current interface is documented in the glibc manual draft. This is a preliminary draft that may change, as the code reviewers are expressing concerns about the API design.

The pthread lock functions like pthread_mutex_lock() dispatch on the different lock types. A new elision path has been added there for timed and adaptive mutexes. The glibc implementation uses RTM because it is more flexible. Inside the locking code, elision is implemented as a wrapper around the existing lock. The wrapper first tries to run in a transaction, and only if it's not successful does it call the underlying lock for blocking as a fallback path.

A simple RTM elision wrapper looks like this (the actual glibc implementation is somewhat more complex, since it implements retries and adaptation):

void elided_lock_wrapper(lock) { if (_xbegin() == _XBEGIN_STARTED) { /* Start transaction */ if (lock is free) /* Check lock and put into read-set */ return; /* Execute lock region in transaction */ _xabort(0xff); /* Abort transaction as lock is busy */ } /* Abort comes here */ take fallback lock } void elided_unlock_wrapper(lock) { if (lock is free) _xend(); /* Commit transaction */ else unlock lock; }

The _xbegin() call is a wrapper around a special instruction that begins the transaction. It can be seen as being similar to setjmp() in that it may return a second time if the transaction is aborted partway through. Unlike the setjmp() situation, though, a transaction abort will also unwind any other changes made since the transaction began. If the transaction begins successfully, _XBEGIN_STARTED will be returned. Should the transaction be aborted, _xbegin() will return with an error code after the transaction's side effects have been unwound; that will cause the code to take the lock instead.

Either way, the lock itself will be put into the transaction's "read set"; a change to memory in the read set by another thread will cause the transaction to abort immediately. Other memory locations accessed during the transaction are treated in the same way; the processor will watch for conflicting accesses and abort the transaction if need be. So, as soon as one thread generates a conflicting memory access, one or more threads will take the lock, causing all ongoing transactions tied to this lock to abort and fall back to normal locking.

In its full form, the glibc elision wrapper uses an adaptive elision algorithm. When a lock region frequently aborts its transactions it may slow down the program due to mis-speculation (executing transactions that do not commit). Not all mis-speculations necessarily cost time because they often happen in the time the thread would have blocked anyway. But transactions that run for a long time before aborting may slow down the execution of the program as a whole. The adaptive algorithm reacts to an aborted transaction by disabling elision for the lock for some time, before retrying. This avoids the need of manually annotating every lock in programs for elision.

It is also possible for the programmer to manually opt in or out, per-lock, using a per-lock annotation interface. That capability may be especially useful for rwlocks, since adaptive elision is not currently implemented for them. See the manual for more details.

Writing to a cache line used in a transaction will cause that transaction to abort; thus, code in critical sections protected by elided locks cannot write to the cache lines (regions of memory stored as a unit in the CPU's cache) containing those locks. One of the big advantages of elision is that it can keep the lock cache line in the "shared" state, avoiding the infamous "lock cache line bouncing" communication overhead that often slows down programs with fine-grained locking. But this requires not writing to the lock or any memory location in the same cache line. Normal glibc mutexes have owner fields for debugging purposes which they always write. The elision code path disables changing these owner fields, to keep the mutex structure read-only in the locking fast path. The only side effect is that pthread_mutex_destroy() will not return an error when called on a locked lock, as the implementation can no longer detect this case.

Code using RTM-elided locks do not know inside the critical section that a lock is locked (the lock appears to be free). POSIX pthreads locking does not export an operation to query a lock's state directly, but it can be queried indirectly by calling pthread_mutex_trylock() recursively on a lock already locked in the same thread. With RTM elision this call will succeed (HLE has special hardware support to handle this case). Essentially, elided locks behave like recursive locks. There were some concerns among code reviewers that this behavior violates the POSIX standard ("shall fail" instead of "may fail"). The latest version aborts nested trylocks by default, unless elision is explicitly forced for the mutex. This prevents some elision opportunities, especially for programs that use pthread_mutex_trylock() as their main locking primitive; this is usually done in an attempt to do a homebrew spin-then-sleep lock, instead of using glibc's adaptive mutexes.

/* Assuming lock is not recursive */ pthread_mutex_lock(&lock); if (pthread_mutex_trylock(&lock) == 0) /* lock with elision forced will come here * else /* non elided lock will come here, after aborting */

pthread_mutex_unlock() detects whether the current lock is executed transactionally by checking if the lock is free. If it is free it commits the transaction, otherwise the lock is unlocked normally. This implies that if a broken program unlocks a free lock, it may attempt to commit outside a transaction, an error which causes a fault in RTM. In POSIX, unlocking a free lock is undefined (so any behavior, including starting World War 3 is acceptable). It is possible to detect this situation by adding an additional check in the unlock path. The current glibc implementation does not do this, but if this programming mistake is common, the implementation may add this check in the future.

There are some other minor semantic changes that are not expected to affect programs. See the glibc manual for a detailed list.

Tuning and Profiling TSX

Existing programs that use pthread locking can run unchanged with the eliding pthreads library. Many programs have some unnecessary conflicts in common lock regions that can be fixed relatively easily, with a corresponding improvement in performance with elision. These changes typically also improve performance without elision, because they eliminate unnecessary bouncing of cache lines between CPU cores.

Memory transactions involve speculative execution, so generally a profiler is needed to understand their performance implications. A patch kit for perf to enable profiling on an Intel Haswell CPU is currently pending on linux-kernel. This patch kit is relatively large because it includes special support for TSX profiling. A normal cycle profiler would cause additional aborts with its profiling interrupts, so special TSX events need to be used to profile aborts.

Basic profiling for TSX with the perf patch kit is done with:

perf stat -T ....

to measure basic transaction success.

When the abort rate is high, aborts can be profiled with:

perf record -g -e tx-aborts .... perf report

This profiles RTM abort locations in the source (for HLE aborts, " -e el-aborts " should be used instead). It is important to keep in mind that the call graph displayed with -g is after the abort, so it points to the lock library, not the abort location inside the lock region. The code address reported at the beginning of the call graph is inside the transaction though. Essentially, the call graph is not continuous.

Additional information on the abort can be recorded with:

perf record -g -W --transaction -e tx-aborts ... perf report --sort comm,vdso,symbol,weight,transaction

The transaction flag allows the categorization of aborts into different classes: caused by the current thread (SYNC), caused by another thread (CONFLICT), and others. weight represents the number of cycles wasted in the transaction before they are aborted; that is, how costly the abort is. Due to limitations in the perf infrastructure and, contrary to what the option name suggests, these fields do not actually sort currently.

Conflict abort profiles show only the target of the conflict abort, not the cause. There are some common program patterns that are vulnerable to conflicts and can be relatively easily fixed when detected:

Global statistic counters (disable or make per thread)

False sharing in data structures or variables

Unnecessary changes to variables that are already the expected value (add a test)

Lock elision in other projects

Other projects that implement their own locking would need their own TSX-enabled lock library. The kernel has its own locking library, but, in a sense, lock elision is less important for the kernel because it already has highly tuned fine-grained locks, unlike many user-level programs. So glibc elision is more important than kernel elision. But even a highly tuned code base like the kernel can benefit from locking improvements (and not all locks in the kernel are highly tuned for every workload).

The kernel has a large number of locks. Benchmarking each lock individually with different workloads and annotating all locks is impractical. One approach is to enable lock elision for all locks and rely on automatic adaption algorithms (and some manual overrides) to disable elision for elision-unfriendly locks.

The kernel locking primitives (spinlocks, rw spinlocks, mutexes, rw semaphores, bit spinlocks) can all be elided with a similar RTM wrapping pattern as used for the glibc mutexes. The kernel has an "is-locked" primitive, which can be handled with RTM locks only with aborting. This can be avoided with some changes to the callers in the kernel.

Some more details are available in this Linux plumbers talk [PDF]. The kernel elision implementation is still in development.

Next steps

The next step is to pass glibc code review and integrate the lock elision implementation into the glibc mainline. Then testing with a large range of applications is useful to evaluate performance and potential problems. This will require volunteers with programs that use pthread locking and the willingness to test them. When no TSX-enabled system is available the SDE emulator can be used for functional testing.

Credits

The glibc elision implementation has been implemented by Andi Kleen and Hongjiu Lu.

Author

Index entries for this article GuestArticles Kleen, Andi

Andi Kleen is a software engineer working on Linux in the Intel Open Source Technology Center. This article is the author's opinion and he is not speaking for Intel. Log in to post comments)