In our previous parts we focused on tracing the Linux kernel, for which the eBPF-based projects are, in our humble opinion, the most safe, widely available and useful methods (eBPF is fully upstreamed in Linux, guarantees a stable ABI, comes enabled by default in almost all distributions and integrates with all other tracing mechanisms). It has really become a no-brainer choice for kernel work. However, up until now, talking in-depth about userspace tracing was deliberately avoided because it merits special treatment, hence this full part 5 article devoted to it.

First we'll look at the why of it, then we'll group and examine eBPF user tracing in two categories: static and dynamic.

Why even use eBPF for userspace?

The most important user question is why use eBPF at all to trace userspace processes when there are so many other debuggers/profilers/tracers, so many language-specific or OS-portable tools developed for the same task? The answer is not simple and, depending on the use case, eBPF might not be the best solution; there is no one-size-fits-all debug/tracing project for the vast userspace ecosystem.

eBPF has some advantages:

It provides a unified tracing interface for both kernel and userspace, compatible with mechanisms used by other tools ([k,u]probes, (dtrace)tracepoints and so on). The 2015 article choosing a linux tracer is somewhat dated, but provides good insights on how difficult and how much effort one has to be spend to use all the different tools. Having one unified, powerful, safe and widely-available framework for the majority of tracing use-cases can be very valuable. Some of the higher level tools like Perf/SystemTap/DTrace are being rewritten on top of eBPF (becoming eBPF front-ends), so knowing eBPF helps when using them.

eBPF is fully programmable. Perf/ftrace and others dump data which needs to be processed after the fact, while eBPF runs your custom high level native-compiled C/Python/Go instrumentation code directly in the kernel/application. It can store data between multiple eBPF event runs to compute (for example) per-function call statistics based on function state/arguments.

eBPF can trace everything in a system, it is not limited to a specific application. For example uprobes can be set on a shared library and trace all processes linking and calling it.

A lot of debuggers need to stop a program to observe its state or decrease runtime performance making it hard to do real-time analysis, especially on production workloads. Because eBPF attaches just-in-time native compiled instrumentation code, its performance impact is minimal and doesn't require suspending execution for long periods of time.

Some disadvantages:

eBPF is not as portable as other tracers. The VM is primarily developed in the Linux kernel (there is a work-in-progress BSD port) and the tooling around it is developed for Linux.

eBPF requires a fairly recent kernel. For example MIPS support was added in v4.13, but the vast majority of MIPS devices in the wild run kernels older than v4.13.

In general, eBPF can't easily give as many insights as a language or application-specific userspace debugger. For example, Emacs at its core is an ELISP interpreter written in C: eBPF can be used to trace/debug ELISP programs by hooking the Emacs runtime C function calls, but it has no knowledge of the higher-level ELISP language implementation, so using the special ELISP language-specific tracers and debuggers provided by Emacs becomes much more useful. Another example is debugging JavaScript applications running in a web browser engine.

Because "normal eBPF" runs in the Linux kernel, a kernel-user context switch happens every time eBPF instruments a user process. This can be expensive for debugging performance critical userspace code (perhaps the userspace eBPF VM project can be used to avoid this switch cost?). This context switch is much cheaper than what a normal debugger incurs (or tools like strace), so it's usually negligible, but a tracer capable of fully running in userspace like LTTng may be preferable in this case.

Static tracepoints (USDT probes)

Static tracepoints, also known in userspace as USDT (User Statically Defined Trace) probes, are specific locations of interest in an application where tracers can be hooked to inspect code execution and data. They are explicitly defined by developers in source code and usually enabled at compile time with flags like "--enable-dtrace". The allure of static tracepoints is that they don't frequently change: developers usually keep stable static trace ABIs, so tracing tools work between different application versions which is useful, for example, when upgrading a PostgreSQL installation and encountering a performance regression.

Predefined tracepoints

BCC-tools contains a lot of useful well-tested tools to interact with tracepoints defined by specific applications or language runtimes. For our example we'll want to trace python applications. Make sure you have python3 built with "--enable-dtrace" and run tplist on either the python binary or libpython (depending how you've built it) to confirm tracepoints are enabled:

$ tplist -l /usr/lib/libpython3.7m.so b'/usr/lib/libpython3.7m.so' b'python':b'import__find__load__start' b'/usr/lib/libpython3.7m.so' b'python':b'import__find__load__done' b'/usr/lib/libpython3.7m.so' b'python':b'gc__start' b'/usr/lib/libpython3.7m.so' b'python':b'gc__done' b'/usr/lib/libpython3.7m.so' b'python':b'line' b'/usr/lib/libpython3.7m.so' b'python':b'function__return' b'/usr/lib/libpython3.7m.so' b'python':b'function__entry'

First let's use one of the cool tracer tools provided by BCC, uflow, to trace the execution flow of python's simple http server. The trace should be self explanatory, the arrows and indentation indicate function enter/exit. What we see in this trace is how a worker thread exits on CPU 3 and the main thread is ready to service other incoming http requests on CPU 0.

$ python -m http.server >/dev/null & sudo ./uflow -l python $! [4] 11727 Tracing method calls in python process 11727... Ctrl-C to quit. CPU PID TID TIME(us) METHOD 3 11740 11757 7.034 /usr/lib/python3.7/_weakrefset.py._remove 3 11740 11757 7.034 /usr/lib/python3.7/threading.py._acquire_restore 0 11740 11740 7.034 /usr/lib/python3.7/threading.py.__exit__ 0 11740 11740 7.034 /usr/lib/python3.7/socketserver.py.service_actions 0 11740 11740 7.034 /usr/lib/python3.7/selectors.py.select 0 11740 11740 7.532 /usr/lib/python3.7/socketserver.py.service_actions 0 11740 11740 7.532

Next we want to run our own custom code when a tracepoint is hit, so we don't fully rely on whatever tools BCC provides. The following example hooks itself to python's function__entry tracepoint (see the python instrumentation documentation) and notifies us when someone is downloading a file:

#!/usr/bin/env python from bcc import BPF, USDT import sys bpf = """ #include <uapi/linux/ptrace.h> static int strncmp(char *s1, char *s2, int size) { for (int i = 0; i < size; ++i) if (s1[i] != s2[i]) return 1; return 0; } int trace_file_transfers(struct pt_regs *ctx) { uint64_t fnameptr; char fname[128]={0}, searchname[9]="copyfile"; bpf_usdt_readarg(2, ctx, &fnameptr); bpf_probe_read(&fname, sizeof(fname), (void *)fnameptr); if (!strncmp(fname, searchname, sizeof(searchname))) bpf_trace_printk("Someone is transferring a file!\

"); return 0; }; """ u = USDT(pid=int(sys.argv[1])) u.enable_probe(probe="function__entry", fn_name="trace_file_transfers") b = BPF(text=bpf, usdt_contexts=[u]) while 1: try: (_, _, _, _, ts, msg) = b.trace_fields() except ValueError: continue print("%-18.9f %s" % (ts, msg))

We test by attaching again to the simple http-server:

$ python -m http.server >/dev/null & sudo ./trace_simplehttp.py $! [14] 28682 34677.450520000 b'Someone is transferring a file!'

The above example tells us when someone is downloading a file but it can't be any more specific than that, to provide more detailed information about who is downloading, what file and so on. This is because python only comes with a few very generic tracepoints enabled by default (module load, function entry/exit etc.). To get additional info we'll have to define our own tracepoints at places of interest so we can extract the relevant data.

Defining our own tracepoints

Until now we have only worked with tracepoints defined by others, but if our application doesn't have any tracepoints at all, or we need to expose more information than possible with the existing tracepoints, then we'll have to add our own tracepoints.

There are multiple ways of adding tracepoints, for example python-core uses the systemtap development package "systemtap-sdt-dev" via pydtrace.h and pydtrace.d, but we'll take another route and use libstapsdt because it has a simpler API, is more lightweight (only depends on libelf) and has multiple language bindings. For consistency we focus again on python, but tracepoints can be added in other languages, here's a C example.

First we patch the simple http server to expose a tracepoint. The code should be self explanatory: notice the tracepoint name file_transfer and its arguments which are big enough to store two string pointers and a 32bit unsigned integer representing the client IP address, the file path and file size.

diff --git a/usr/lib/python3.7/http/server.py b/usr/lib/python3.7/http/server.py index ca2dd50..af08e10 100644 --- a/usr/lib/python3.7/http/server.py +++ b/usr/lib/python3.7/http/server.py @@ -107,6 +107,13 @@ from functools import partial from http import HTTPStatus +import stapsdt +provider = stapsdt.Provider("simplehttp") +probe = provider.add_probe("file_transfer", + stapsdt.ArgTypes.uint64, + stapsdt.ArgTypes.uint64, + stapsdt.ArgTypes.uint32) +provider.load() # Default error message template DEFAULT_ERROR_MESSAGE = """\ @@ -650,6 +657,8 @@ class SimpleHTTPRequestHandler(BaseHTTPRequestHandler): f = self.send_head() if f: try: + path = self.translate_path(self.path) + probe.fire(self.address_string(), path, os.path.getsize(path)) self.copyfile(f, self.wfile) finally: f.close()

Running the patched server, we can verify our file_transfer tracepoint is present at runtime using tplist:

$ python -m http.server >/dev/null 2>&1 & tplist -p $! [1] 13297 b'/tmp/simplehttp-Q6SJDX.so' b'simplehttp':b'file_transfer' b'/usr/lib/libpython3.7m.so.1.0' b'python':b'import__find__load__start' b'/usr/lib/libpython3.7m.so.1.0' b'python':b'import__find__load__done'

The most significant modifications we'll do below to our tracer example code from the above example are:

It hooks its logic to our custom file_transfer tracepoint.

tracepoint. It uses PERF EVENTS to store data which can pass arbitrary structures to userspace, as opposed to our previous use of the ftrace ringbuffer which could transfer only individual strings.

It does not use bpf_usdt_readarg to get USDT-provided pointers, instead it declares them directly in the handler function signature. This is a significant quality of life improvement which can be used in all handlers.

use to get USDT-provided pointers, instead it declares them directly in the handler function signature. This is a significant quality of life improvement which can be used in all handlers. This tracer explicitly uses python2, even though all our examples up until now (including the above python http.server patch) use python3. Hopefully in the future all BCC APIs and documentation get ported to python 3.

#!/usr/bin/env python2 from bcc import BPF, USDT import sys bpf = """ #include <uapi/linux/ptrace.h> BPF_PERF_OUTPUT(events); struct file_transf { char client_ip_str[20]; char file_path[300]; u32 file_size; u64 timestamp; }; int trace_file_transfers(struct pt_regs *ctx, char *ipstrptr, char *pathptr, u32 file_size) { struct file_transf ft = {0}; ft.file_size = file_size; ft.timestamp = bpf_ktime_get_ns(); bpf_probe_read(&ft.client_ip_str, sizeof(ft.client_ip_str), (void *)ipstrptr); bpf_probe_read(&ft.file_path, sizeof(ft.file_path), (void *)pathptr); events.perf_submit(ctx, &ft, sizeof(ft)); return 0; }; """ def print_event(cpu, data, size): event = b["events"].event(data) print("{0}: {1} is downloding file {2} ({3} bytes)".format( event.timestamp, event.client_ip_str, event.file_path, event.file_size)) u = USDT(pid=int(sys.argv[1])) u.enable_probe(probe="file_transfer", fn_name="trace_file_transfers") b = BPF(text=bpf, usdt_contexts=[u]) b["events"].open_perf_buffer(print_event) while 1: try: b.perf_buffer_poll() except KeyboardInterrupt: exit()

Tracing the patched server yields:

$ python -m http.server >/dev/null 2>&1 & sudo ./trace_stapsdt.py $! [1] 5613 325540469950102: 127.0.0.1 is downloading file /home/adi/ (4096 bytes) 325543319100447: 127.0.0.1 is downloading file /home/adi/.bashrc (827 bytes) 325552448306918: 127.0.0.1 is downloading file /home/adi/workspace/ (4096 bytes) 325563646387008: 127.0.0.1 is downloading file /home/adi/workspace/work.tar (112640 bytes) (...)

The above custom file_transfer tracepoint can look trivial (a direct python print or a logger call could have the same effect), but the mechanism it presents is very powerful: well placed tracepoints guaranteeing ABI stability, giving the ability to dynamically run safe, native-fast, programmable logic can be very helpful in quickly analyzing and fixing all kinds of issues without restarting the offending application (reproducing an issue might take a long time).

Dynamic probes (uprobes)

The problem with static tracepoints, as exemplified above, is that they need to be explicitly defined in source code and require rebuilding applications when modifying the tracepoints. Guaranteeing ABI stability for existing tracepoints imposes limits on how maintainers can rearchitect/rewrite the code which exposes the tracepoint data. Therefore in some cases, fully runtime dynamic userspace probes (aka uprobes) are preferred: these poke directly at a running application's memory in an ad-hoc manner without requiring any special source code definition. Dynamic probes can easily break between application versions, but even so they can be useful for live debugging running instances.

While static tracepoints are useful for tracing applications written in high level languages like Python or Java, uprobes are less useful for this because they're lower level and have no knowledge of the language runtime implementations (static tracepoints work because developers take it upon themselves to expose the relevant data of the high level application). Dynamic probes are useful however for debugging the language implementations / engines themselves or applications written in languages with no runtimes like C.

Uprobes can be added to stripped binaries, but the user has to manually compute the in-process memory offset location where the uprobe should attach via tools like objdump and /proc//maps (see example) which is painful and unportable. Since most distributions provide debug symbols packages (or a quick way to build with debug symbols) and BCC makes it trivial to use uprobes with symbol name resolution, the vast majority of dynamic probe uses happen this way.

The gethostlatency BCC tool prints DNS lookup latencies by attaching uprobes to gethostbyname and friends in libc. To verify that libc is not stripped so the tool can be run (otherwise it throws a sybol-not-found error):

$ file /usr/lib/libc-2.28.so /usr/lib/libc-2.28.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, (...), not stripped $ nm -na /usr/lib/libc-2.28.so | grep -i -e getaddrinfo 0000000000000000 a getaddrinfo.c

The gethostlatency code is very similar (and in some places identical, it also uses BPF_PERF_OUTPUT) to the tracepoint example we examined above, so we won't be posting it fully here. The most relevant difference is the use of the BCC uprobe API:

b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry", pid=args.pid) b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return", pid=args.pid)

The key idea to understand and remember here is that with small changes to our BCC eBPF programs we can trace very different applications, libraries or even the kernel via static and dynamic probes. Before we were statically tracing a Python application and now we're dynamically measuring libc hostname resolution latency. With similar modifications to these small (less than 150 LOC, lots of it boilerplate) examples anything can be traced in a running system, very safely with no risk of crashes or problems induced by other tools like debugger app pauses/halts.

Summary

In this fifth part we looked at how to use eBPF programs to trace userspace applications. The biggest advantage of using eBPF for this task is that it provides a unified interface to safely and efficiently trace the entire system: a bug can be reproduced in an application, then traced further into a library or the kernel, providing full system visibility via a unified programming framework / interface. eBPF is not a silver bullet though, especially when debugging applications written in higher-level languages, where language specific tools can do a better job of providing insight, or for those running old versions of Linux kernels, or requiring non-Linux systems.