membarrier system call performance and the future of Userspace RCU on Linux

A few months ago on the lttng-dev mailing list, Milian Wolff reported that his test application was seeing noticeable startup delays because it was linked against liblttng-ust. He didn't start a tracing session and his test program didn't call any code in liblttng-ust. In fact, the test program contained fewer lines of code than a standard Hello, World! application:

int main() { return 0; }

Yet linking with the library was enough to make his application run roughly 640 times slower. Things became even worse when running within a tracing session, resulting in a roughly one-second delay when launching the program.

This result was entirely unexpected. Sure, linking against liblttng-ust does incur a bit of run time overhead, but it shouldn't be anywhere near one second.

Tracing the first second of the application's run time with perf trace revealed that a large percentage of the time was spent executing the membarrier system call:

$ perf trace --duration 1 ./a.out

 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, msg: 0x7fbe2fbb1070) = 1
 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1) = 0

The membarrier system call is a relatively new addition to Linux, so here's a quick overview of why it was created, how liblttng-ust uses it, and how it caused Milian's performance issue.

What is the Linux membarrier system call?

The membarrier system call is used for synchronizing access to data structures shared by multiple threads. Its main use case is implementing synchronization primitives that can be split into fast and slow paths, for example the read-copy-update (RCU) algorithm. As the name implies, it's used to implement memory barrier semantics without actually using memory barrier instructions.

Traditionally, synchronization algorithms are implemented using pairs of memory barriers, either explicitly for RCU, or implicitly through lock-prefixed atomic instructions for reader-writer locks on Intel. For cases where explicit memory barriers are used, and where the algorithm clearly defines fast paths and slow paths, the membarrier system call can be used as an optimization: it speeds up the fast path at the expense of adding overhead to the slow path, which results in an overall speedup when the fast path runs far more often.

Since algorithms such as RCU are designed to be used by code that reads shared data much more often than it writes to it, this optimization is still a good trade-off for that specific scenario. When the membarrier system call was merged into Linux 4.3, improving the speed of Userspace RCU was the rationale mentioned in the patch message.
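The fast-path/slow-path split described above can be sketched as follows. This is only an illustration, not liburcu's actual code: the names sys_membarrier(), barrier_init(), fast_path_barrier() and slow_path_barrier() are hypothetical, and the sketch assumes barrier_init() runs once before any other thread uses the barriers. When the kernel supports MEMBARRIER_CMD_SHARED, the fast path needs only a compiler barrier; otherwise both sides fall back to real memory fences.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has traditionally shipped no membarrier() wrapper, so use syscall(2). */
static int sys_membarrier(int cmd, unsigned int flags)
{
    return (int)syscall(SYS_membarrier, cmd, flags);
}

static int barrier_initialized; /* set by barrier_init() */
static int use_membarrier;      /* 1 if MEMBARRIER_CMD_SHARED is usable */

/* Hypothetical one-time setup: ask the kernel which commands it supports. */
static void barrier_init(void)
{
    int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

    /* A negative result means the system call itself is unavailable. */
    use_membarrier = mask > 0 && (mask & MEMBARRIER_CMD_SHARED) != 0;
    barrier_initialized = 1;
}

/* Fast path (frequent, e.g. readers): no fence instruction if possible. */
static inline void fast_path_barrier(void)
{
    if (use_membarrier)
        atomic_signal_fence(memory_order_seq_cst); /* compiler barrier only */
    else
        atomic_thread_fence(memory_order_seq_cst); /* real memory fence */
}

/* Slow path (rare, e.g. updaters): pays for the barrier on everyone's behalf. */
static void slow_path_barrier(void)
{
    assert(barrier_initialized);

    if (!use_membarrier) {
        atomic_thread_fence(memory_order_seq_cst);
        return;
    }
    /* The fast path relies on this call, so failing here is fatal. */
    if (sys_membarrier(MEMBARRIER_CMD_SHARED, 0) != 0)
        abort();
}
```

The design choice is that both sides must agree on one scheme: a compiler-only barrier on the fast path is safe only when the slow path really issues the system call, which is why the decision is made once in barrier_init() rather than per call.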

Internally, the membarrier system call uses the kernel's synchronize_sched() function to wait for a "grace period" to elapse. In other words, it blocks the calling thread until all running threads on the system have gone through a context switch.

And it's this blocking behaviour that led to Milian's application startup performance issue.
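One way to observe this blocking behaviour directly is to time a single MEMBARRIER_CMD_SHARED call. The sketch below is a rough measurement aid under stated assumptions, not a benchmark: the function name time_shared_membarrier_ms() is hypothetical, and it returns -1.0 when the kernel rejects the command (for example, when membarrier is not available).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* glibc has traditionally shipped no membarrier() wrapper, so use syscall(2). */
static int sys_membarrier(int cmd, unsigned int flags)
{
    return (int)syscall(SYS_membarrier, cmd, flags);
}

/*
 * Time one MEMBARRIER_CMD_SHARED call.
 * Returns the elapsed wall-clock time in milliseconds,
 * or -1.0 if the system call failed.
 */
static double time_shared_membarrier_ms(void)
{
    struct timespec start, end;
    int rc, ret;

    rc = clock_gettime(CLOCK_MONOTONIC, &start);
    assert(rc == 0);

    ret = sys_membarrier(MEMBARRIER_CMD_SHARED, 0);

    rc = clock_gettime(CLOCK_MONOTONIC, &end);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is defined */

    if (ret != 0)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec) * 1e3 +
           (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Calling this in a loop from a small test program and printing the results should give a feel for how long each grace period takes on a given machine; the exact figures depend on the kernel configuration and system load.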

How liblttng-ust uses the membarrier system call

liblttng-ust uses the Userspace RCU library, liburcu, to synchronize updates to the tracing state. The RCU read side scales very well on modern multi-core machines, which keeps LTTng's overhead low enough that it can be used in production to analyze race conditions and other timing-critical bugs.

Userspace RCU started using membarrier(2) in 0.9.0. At run time, liburcu detects whether the membarrier system call is available on the running system and, if available, uses it to implement synchronize_rcu(), which blocks the calling thread until it's safe to modify shared data. synchronize_rcu() does not batch memory reclamation, and each call waits for the grace period.

Note: liburcu does offer an alternative function that batches memory reclamation, call_rcu(). However, using it requires a separate thread to asynchronously handle the callback and we wanted to keep the number of threads needed by LTTng to a minimum to reduce the complexity of tracing applications.

liblttng-ust uses GCC's constructor attribute to run an initialization function (lttng_ust_init()) as soon as the liblttng-ust shared library is loaded and before the application starts. Code running during this initialization phase calls synchronize_rcu(). Given that a single call to membarrier(2) can take tens of milliseconds, and that each enabled event triggers a membarrier(2) call, it's no wonder that Milian saw delays when starting his application.
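To illustrate the mechanism, here is a minimal sketch of a constructor function. It is not liblttng-ust's code: example_lib_init(), example_lib_trace(), tracer_ready and events_recorded are hypothetical names standing in for a library's real initialization and state.

```c
#include <assert.h>

static int tracer_ready;     /* hypothetical library state */
static int events_recorded;  /* hypothetical event counter */

/*
 * GCC and Clang run functions marked with the constructor attribute when
 * the object is loaded, before main() is entered. Anything slow done here
 * (such as repeated grace-period waits) is paid at every program start-up.
 */
__attribute__((constructor))
static void example_lib_init(void)
{
    tracer_ready = 1;
}

/* Hypothetical library entry point that depends on the initialization. */
static void example_lib_trace(void)
{
    assert(tracer_ready); /* the constructor has already run by now */
    events_recorded++;
}
```

Because the constructor runs even when the application never calls into the library, an empty main() still pays the full initialization cost, which matches what Milian observed.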

Because liblttng-ust does not batch memory reclamation, it calls synchronize_rcu() repeatedly, and each of those calls waits for its own grace period.

Commit 6447802 (Fix: don't use membarrier SHARED syscall command in liburcu-bp) to liburcu shows how we addressed the problem: avoid the membarrier system call entirely.

We still think it makes sense to use membarrier(2) to implement Userspace RCU, though not with the MEMBARRIER_CMD_SHARED command.

The real fix? The MEMBARRIER_CMD_PRIVATE_EXPEDITED command

Fortunately, a new MEMBARRIER_CMD_PRIVATE_EXPEDITED command for membarrier(2) has been merged into Linux 4.14. It uses inter-processor interrupts (IPIs) to implement the memory barrier semantics, and communicates only with threads in the calling thread's process. Best of all, it executes much faster than the MEMBARRIER_CMD_SHARED command and never blocks the calling thread.

Now that MEMBARRIER_CMD_PRIVATE_EXPEDITED has landed in mainline Linux, liburcu 0.11 will include patches that use it if available (it requires at least Linux 4.14 and the CONFIG_MEMBARRIER build option to be enabled). If you compile liburcu using the --disable-sys-membarrier-fallback configuration option, liburcu will now abort if the MEMBARRIER_CMD_PRIVATE_EXPEDITED command is not supported by the running kernel: it's a useful check to ensure you're running the most optimized version of the code.
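As a rough sketch of what using the new command can look like from an application, the code below queries the kernel, registers the process, and then issues the expedited barrier. It is not liburcu's implementation: expedited_init() and slow_path_barrier() are hypothetical names. It relies on the documented requirement in membarrier(2) that a process first register with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, and it assumes kernel headers from Linux 4.14 or later.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has traditionally shipped no membarrier() wrapper, so use syscall(2). */
static int sys_membarrier(int cmd, unsigned int flags)
{
    return (int)syscall(SYS_membarrier, cmd, flags);
}

static int init_done;                /* set by expedited_init() */
static int private_expedited_ready;  /* 1 once registration succeeded */

/*
 * Hypothetical one-time setup, run before other threads depend on the
 * barrier. Returns 1 if the expedited command can be used, 0 otherwise.
 */
static int expedited_init(void)
{
    int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

    init_done = 1;
    if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return 0; /* old kernel, or CONFIG_MEMBARRIER disabled */

    /* The process must register its intent before using the command. */
    if (sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) != 0)
        return 0;

    private_expedited_ready = 1;
    return 1;
}

/* Slow-path barrier covering only the threads of the calling process. */
static void slow_path_barrier(void)
{
    assert(init_done);

    if (!private_expedited_ready) {
        /* Fallback: only correct if the fast path also uses real fences. */
        atomic_thread_fence(memory_order_seq_cst);
        return;
    }
    if (sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) != 0)
        abort(); /* the fast path depends on this call succeeding */
}
```

Aborting when the expedited call fails mirrors, in spirit, the strict behaviour the --disable-sys-membarrier-fallback option selects; a more forgiving program would decide between the two schemes once at start-up and keep its fast path consistent with that choice.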
