Full Stack System Call Latency Profiling

Finding the exact origin of a kernel-related latency can be very difficult: we usually end up recording too much or not enough information, and it becomes even more complex if the problem is sporadic or hard to reproduce.

In this blog post, we present a demo of a new feature to extract as much relevant information as possible around an unusual system call latency. Depending on the configuration, during a long system call, we can extract:

- multiple kernel-space stacks of the process, to identify where we are waiting;
- the user-space stack, to identify the origin of the system call;
- an LTTng snapshot of the user and/or kernel-space trace, to give some background information about what led us here and what was happening on the system at that exact moment.

All of this information is tracked at run-time with low overhead and recorded only when some pre-defined conditions are matched (PID, current state and latency threshold).

Combining all of this data gives us enough information to accurately determine where the system call was launched in an application, what it was waiting for in the kernel, and what system-wide events (kernel and/or user-space) could explain this unexpected latency.

Examples

Let’s look at a few sample use cases for the system call latency tracker. You may find all scripts and applications presented here on GitHub.

First, we want to get detailed information about the long sleeps made by sleep-demo, an application which calls libc's usleep() function with random durations in a loop.

# Generate events for system calls taking more than 2 seconds.
echo 2000000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold

# Launch the application with liblttng-profile in order to collect userspace stacks.
LD_PRELOAD=liblttng-profile.so src/sleep-demo

# Record a trace.
scripts/trace.sh

# Generate a flame graph.
./syscall_latency_flame_graph.py [path-to-trace] | ./flamegraph.pl > sleep.svg

Here is the result (zoom).

The flame graph shows the user-space and kernel stacks of system calls that exceeded the specified threshold. The recorded trace is small because it only contains the information we are really interested in, and the LTTng ring-buffer is configured to store only 1.5 MB of trace per CPU.

In order to grasp the true potential of a low-overhead latency tracker, let's have a look at more complex examples. The following flame graph shows system calls issued by MongoDB which took more than 100 ms (zoom).

There are a lot of long system calls in MongoDB, but we don’t care about most of them because they just correspond to worker threads waiting for new work to do. Fortunately, we know that external commands are processed by the execCommand() method. We can run grep execCommand on the output of syscall_latency_flame_graph.py to get a flame graph of the latencies that directly affect the database's response time (zoom):
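The filtering works because syscall_latency_flame_graph.py emits folded stacks, the line-oriented format flamegraph.pl consumes: one stack per line, frames separated by semicolons, ending with a sample count. As a self-contained illustration (the frame names and the /tmp path below are made up, not actual MongoDB output):

```shell
# Hypothetical folded-stack output: one stack per line, ';'-separated
# frames, trailing sample count.
cat > /tmp/folded.txt <<'EOF'
main;worker_loop;sys_futex 42
main;execCommand;runQuery;sys_pread64 7
main;execCommand;insert;sys_write 3
EOF

# Keep only the stacks that go through execCommand(); the filtered lines
# can then be piped into flamegraph.pl to render the flame graph.
grep execCommand /tmp/folded.txt
```

Because each folded stack is a single line, a plain grep is enough to isolate the call paths we care about before rendering.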

Having kernel stacks is also very helpful. In a recent blog post, Fabien Reumont-Locke explained how he dug into the source code of the kernel to find a semaphore that slowed down parallel calls to mmap(). Executing the system call latency tracker while running the mmap-demo app provides the same information without having to look at the kernel code (zoom):

Sometimes, full stacks are not enough to understand why a system call took more time than expected. In such a case, the detailed information provided by an LTTng kernel trace can be the key to understanding what is happening on the system. LTTng's snapshot mode allows collecting a trace in a circular memory buffer and saving it to a file on demand. The system call latency tracker can be used in combination with that mode to take a snapshot when a long system call occurs within a monitored application. This way, we can obtain extremely detailed information about what is happening on a system when rare latencies occur, without having to manage huge trace files.
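As a rough sketch of what such a snapshot session looks like (the actual commands live in scripts/trace-io-snapshot.sh; the session name and channel sizes here are illustrative):

```shell
# Create a session in snapshot mode: events accumulate in an in-memory
# ring-buffer and are only written out when a snapshot is recorded.
lttng create demo-snapshot --snapshot

# Enable kernel events in a small per-CPU ring-buffer.
lttng enable-channel --kernel --subbuf-size=512k --num-subbuf=4 chan
lttng enable-event --kernel --channel=chan --all
lttng start

# Later, when the latency tracker reports a long system call:
lttng snapshot record
```

Each `lttng snapshot record` dumps the current contents of the ring-buffer, so only the moments surrounding a detected latency end up on disk.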

To show this feature, we will use the io-demo application which continuously creates new files on disk.
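io-demo itself is a small C program available in the repository; a rough, bounded shell analog of its behavior (the /tmp file names and sizes are arbitrary) would be:

```shell
# Create a handful of new files and write 64 KB of zeros into each,
# mimicking io-demo's "continuously create new files on disk" workload.
for i in 0 1 2 3; do
    dd if=/dev/zero of=/tmp/io-demo.$i bs=4096 count=16 2>/dev/null
done
ls -l /tmp/io-demo.*
```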

# Generate events for system calls taking more than 2 seconds.
echo 2000000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold

# Launch io-demo with liblttng-profile in order to collect userspace stacks.
LD_PRELOAD=liblttng-profile.so src/io-demo

# Configure the LTTng session and save snapshots when a high latency occurs.
scripts/trace-io-snapshot.sh

If we generate a flame graph from the resulting trace, we see that a latency occurred within an open() system call (zoom):

The extracted snapshot covers a period of 9 seconds. If we run the lttng-iousagetop analysis on the trace (see this blog post), we see that io-demo wrote a lot of data:

Per-process I/O Write
###############################################################################
██████████████████████████████████████████████████  13.50 MB io-demo (5786)             13.50 MB file       0 B net     4.00 KB unknown
                                                     73.83 KB chrome (5434)                  0 B file       0 B net    73.83 KB unknown
                                                     60.46 KB gnome-terminal (3720)          0 B file       0 B net    60.46 KB unknown
                                                     50.95 KB Xorg (1980)                    0 B file       0 B net    50.95 KB unknown
                                                     36.88 KB chrome (5713)                  0 B file       0 B net    36.88 KB unknown

And that this data was actually committed to the disk, which explains the latency of the open() system call:

Block I/O Write
###############################################################################
██████████████████████████████████████████████████   8.51 MB swapper (pid=0)
██████████████████████████████████████████████        8.00 MB kworker/u17:0 (pid=68)
██████████████████████████                            4.45 MB io-demo (pid=5786)
                                                     20.00 KB gnome-terminal (pid=3720)

Overhead

The overhead of this tool on real applications has been assessed by running the MongoDB MultiUpdate.v1.Contended.Doc.Seq.Indexed benchmark with 1 thread. When system calls lasting more than 200 ms are tracked, the request throughput is reduced by 15%. If user-space symbols are resolved offline (only addresses are included in the trace), the throughput is reduced by only 6%. During the execution of this benchmark, mongod generated an average of 12885 scheduler events per second; capturing call stacks for each of these events would have produced a gigantic trace. Instead, our tool captured stacks only for system calls longer than 200 ms, an event that occurred about 5 times per second.

In terms of scalability, we benchmarked the latency_tracker module analyzing wake-up latencies of a highly concurrent application (hackbench) with 1 to 32 CPU cores enabled, and measured an overhead growing linearly from 0.3% (1 core) to 22% (32 cores). The latency_tracker data structures are currently being profiled to select the best ones for production use without major impact (we are comparing 3 hashtable implementations, 2 linked lists and a wait-free queue with various locking schemes); more on that later.

Usage

An install script is provided here. Please use it as an install guide instead of running it blindly.

This solution is composed of 3 individual components:

- the latency_tracker for the run-time system call latency detection;
- lttng-profile to perform the stack unwinding, with an optimized version of libunwind, when it receives a signal from the latency_tracker;
- and of course LTTng to collect the user and kernel-space trace in a memory ring-buffer. We provide a patch for LTTng-modules to connect to the tracepoints emitted by the latency_tracker because they are not ready for upstream inclusion yet.

Once everything is installed, configure the LTTng session and wait for events to trigger the snapshots:

# ./examples/syscalls-trace-snapshot.sh

To launch an application where you want the full user-space stack:

$ LD_PRELOAD=liblttng-profile.so ./myapplication

Wait for the latency to appear; the syscalls-trace-snapshot.sh script will inform you when a snapshot is recorded. You can then use your favorite CTF trace parser/analyser tool to process the trace.

The threshold parameter can be configured at run-time in /sys/module/latency_tracker_syscalls/parameters or when loading the module (see modinfo latency_tracker_syscalls.ko for the other tunable parameters).
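For example, to set the threshold to 100 ms (only usec_threshold is shown here; other parameters are listed by modinfo):

```shell
# Change the threshold at run-time (value in microseconds).
echo 100000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold

# Or set it when loading the module.
insmod latency_tracker_syscalls.ko usec_threshold=100000
```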

Under the hood

The latency_tracker is a module designed for efficient and scalable tracking of events in the kernel: it matches an event entry to an event exit based on a shared key. If the delay between these two events is higher than a pre-defined, and dynamically configurable, threshold, it executes a user-provided callback.

If we want the user-space stack of a specific application, it must be started with the LD_PRELOAD of a library provided by lttng-profile. This library registers the process (and its threads) with the latency_tracker_syscalls module to notify it that they can receive profiling signals if needed.

We used the infrastructure of the latency_tracker to build a system call latency tracking module. It hooks onto the system call entry and exit tracepoints and uses the PID as the key to match the entry and exit of each system call. In addition, if enabled, when a system call takes longer than the threshold we extract the kernel stack of the process every time it gets some CPU time (on each sched_switch event), so we get an accurate view of where it is waiting in the kernel. When the system call finally completes, we send the SIGPROF signal to the program (if registered) to ask it to extract its own stack.

All of the stacks and related information are extracted through static tracepoints to which LTTng can attach. When a latency is detected, latency_tracker_syscalls wakes up a user-space process that records an LTTng snapshot, writing the current buffers to disk or sending them over the network. Of course, LTTng can also be configured to simultaneously record everything else happening on the system.

The kernel module works independently from the user-space library, so if you only want kernel events and/or the kernel stack of all system calls taking longer than a threshold, you can simply load the module with the watch_all parameter set to 1. If you are not interested in the user-space trace, the tracepoints also work fine with ftrace.
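A kernel-only setup could look like the following; note that the exact tracepoint path under the ftrace events directory is an assumption, so check /sys/kernel/debug/tracing/events after loading the module:

```shell
# Track all processes, not only those registered through liblttng-profile.
insmod latency_tracker_syscalls.ko watch_all=1 usec_threshold=100000

# Without LTTng, the module's tracepoints can be consumed through ftrace
# (hypothetical event group name; verify it on your system).
echo 1 > /sys/kernel/debug/tracing/events/latency_tracker/enable
cat /sys/kernel/debug/tracing/trace_pipe
```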

Finally, if we need to generate an event when a specific function in the kernel is hit during a long system call, it is easy to add a kprobe anywhere and have it check the state of the current syscall to decide what to do. This gives us the opportunity to create "stateful tracing events" with native code flexibility and performance which we believe will be really useful in the near future.

Conclusion

To summarize, we are using static kernel tracepoints to dynamically decide what to record and when, capturing only the information relevant to our use-case in order to find the needle in the haystack. The module can be configured to follow only registered processes or the whole system, and more advanced filtering options are coming in the near future.

Please remember that this is a work in progress: the overall user experience still needs some polish to make it easier to deploy and use, but we are presenting it today to start gathering early feedback and improvement suggestions. Feel free to let us know what you think, the use-cases you foresee, and any other ideas you may have.

We hope you enjoy it! Keep in mind that the links presented here might change in the future, so follow the main repos for more updates.
