bpftrace is a new eBPF-based tracing tool that was first included in Fedora 28. It was developed by Brendan Gregg, Alastair Robertson and Matheus Marchini with the help of a loosely-knit team of hackers across the Net. A tracing tool lets you analyze what a system is doing behind the curtain. It tells you which functions in code are being called, with which arguments, how many times, and so on.

This article covers some basics about bpftrace, and how it works. Read on for more information and some useful examples.

eBPF (extended Berkeley Packet Filter)

eBPF is a tiny virtual machine, or a virtual CPU to be more precise, in the Linux Kernel. The eBPF can load and run small programs in a safe and controlled way in kernel space. This makes it safer to use, even in production systems. This virtual machine has its own instruction set architecture (ISA) resembling a subset of modern processor architectures. The ISA makes it easy to translate those programs to the real hardware. The kernel performs just-in-time translation to native code for main architectures to improve the performance.

The eBPF virtual machine allows the kernel to be extended programmatically. Nowadays several kernel subsystems take advantage of this new powerful Linux Kernel capability. Examples include networking, seccomp, tracing, and more. The main idea is to attach eBPF programs into specific code points, and thereby extend the original kernel behavior.

eBPF machine language is very powerful. But writing code directly in it is extremely painful, because it’s a low level language. This is where bpftrace comes in. It provides a high-level language to write eBPF tracing scripts. The tool then translates these scripts to eBPF with the help of clang/LLVM libraries, and then attached to the specified code points.



Installation and quick start

To install bpftrace, run the following command in a terminal using sudo:

$ sudo dnf install bpftrace

Try it out with a “hello world” example:

$ sudo bpftrace -e 'BEGIN { printf("hello world

"); }'

Note that you must run bpftrace as root due to the privileges required. Use the -e option to specify a program, and to construct the so-called “one-liners.” This example only prints hello world, and then waits for you to press Ctrl+C.

BEGIN is a special probe name that fires only once at the beginning of execution. Every action inside the curly braces { } fires whenever the probe is hit — in this case, it’s just a printf.

Let’s jump now to a more useful example:

$ sudo bpftrace -e 't:syscalls:sys_enter_execve { printf("%s called %s

", comm, str(args->filename)); }'

This example prints the parent process name (comm) and the name of every new process being created in the system. t:syscalls:sys_enter_execve is a kernel tracepoint. It’s a shorthand for tracepoint:syscalls:sys_enter_execve, but both forms can be used. The next section shows you how to list all available tracepoints.

comm is a bpftrace builtin that represents the process name. filename is a field of the t:syscalls:sys_enter_execve tracepoint. You can access these fields through the args builtin.

All available fields of the tracepoint can be listed with this command:

bpftrace -lv "t:syscalls:sys_enter_execve"

Example usage

Listing probes

A central concept for bpftrace are probe points. Probe points are instrumentation points in code (kernel or userspace) where eBPF programs can be attached. They fit into the following categories:

kprobe – kernel function start

kretprobe – kernel function return

uprobe – user-level function start

uretprobe – user-level function return

tracepoint – kernel static tracepoints

usdt – user-level static tracepoints

profile – timed sampling

interval – timed output

software – kernel software events

hardware – processor-level events

All available kprobe/kretprobe, tracepoints, software and hardware probes can be listed with this command:

$ sudo bpftrace -l

The uprobe/uretprobe and usdt probes are userspace probes specific to a given executable. To use them, use the special syntax shown later in this article.

The profile and interval probes fire at fixed time intervals. Fixed time intervals are not covered in this article.

Counting system calls

Maps are special BPF data types that store counts, statistics, and histograms. You can use maps to summarize how many times each syscall is being called:

$ sudo bpftrace -e 't:syscalls:sys_enter_* { @[probe] = count(); }'

Some probe types allow wildcards to match multiple probes. You can also specify multiple attach points for an action block using a comma separated list. In this example, the action block attaches to all tracepoints whose name starts with t:syscalls:sys_enter_, which means all available syscalls.

The bpftrace builtin function count() counts the number of times this function is called. @[] represents a map (an associative array). The key of this map is probe, which is another bpftrace builtin that represents the full probe name.

Here, the same action block is attached to every syscall. Then, each time a syscall is called the map will be updated, and the entry is incremented in the map relative to this same syscall. When the program terminates, it automatically prints out all declared maps.

This example counts the syscalls called globally, it’s also possible to filter for a specific process by PID using the bpftrace filter syntax:

$ sudo bpftrace -e 't:syscalls:sys_enter_* / pid == 1234 / { @[probe] = count(); }'

Write bytes by process

Using these concepts, let’s analyze how many bytes each process is writing:

$ sudo bpftrace -e 't:syscalls:sys_exit_write /args->ret > 0/ { @[comm] = sum(args->ret); }'

bpftrace attaches the action block to the write syscall return probe (t:syscalls:sys_exit_write). Then, it uses a filter to discard the negative values, which are error codes (/args->ret > 0/).

The map key comm represents the process name that called the syscall. The sum() builtin function accumulates the number of bytes written for each map entry or process. args is a bpftrace builtin to access tracepoint’s arguments and return values. Finally, if successful, the write syscall returns the number of written bytes. args->ret provides access to the bytes.

Read size distribution by process (histogram):

bpftrace supports the creation of histograms. Let’s analyze one example that creates a histogram of the read size distribution by process:

$ sudo bpftrace -e 't:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'

Histograms are BPF maps, so they must always be attributed to a map (@). In this example, the map key is comm.

The example makes bpftrace generate one histogram for every process that calls the read syscall. To generate just one global histogram, attribute the hist() function just to ‘@’ (without any key).

bpftrace automatically prints out declared histograms when the program terminates. The value used as base for the histogram creation is the number of read bytes, found through args->ret.

Tracing userspace programs

You can also trace userspace programs with uprobes/uretprobes and USDT (User-level Statically Defined Tracing). The next example uses a uretprobe, which probes to the end of a user-level function. It gets the command lines issued in every bash running in the system:

$ sudo bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: \"%s\"

", str(retval)); }'

To list all available uprobes/uretprobes of the bash executable, run this command:

$ sudo bpftrace -l "uprobe:/bin/bash"

uprobe instruments the beginning of a user-level function’s execution, and uretprobe instruments the end (its return). readline() is a function of /bin/bash, and it returns the typed command line. retval is the return value for the instrumented function, and can only be accessed on uretprobe.

When using uprobes, you can access arguments with arg0..argN. A str() call is necessary to turn the char * pointer to a string.

Shipped Scripts

There are many useful scripts shipped with bpftrace package. You can find them in the /usr/share/bpftrace/tools/ directory.

Among them, you can find:

killsnoop.bt – Trace signals issued by the kill() syscall.

tcpconnect.bt – Trace all TCP network connections.

pidpersec.bt – Count new procesess (via fork) per second.

opensnoop.bt – Trace open() syscalls.

vfsstat.bt – Count some VFS calls, with per-second summaries.

You can directly use the scripts. For example:

$ sudo /usr/share/bpftrace/tools/killsnoop.bt

You can also study these scripts as you create new tools.

Links

bpftrace reference guide – https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md

bpftrace (DTrace 2.0) for Linux 2018 – http://www.brendangregg.com/blog/2018-10-08/dtrace-for-linux-2018.html

BPF: the universal in-kernel virtual machine – https://lwn.net/Articles/599755/

Linux Extended BPF (eBPF) Tracing Tools – http://www.brendangregg.com/ebpf.html

Dive into BPF: a list of reading material – https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf

Photo by Roman Romashov on Unsplash.