
bpftrace is a high-level tracing language for Linux eBPF that lets you pull information that tends to be quite handy in performance and security investigations. For more information on bpftrace, check the official reference guide.

TL;DR:

Here's how to run bpftrace straight out of an Alpine container image:

docker run --rm -t -v /sys/kernel/debug:/sys/kernel/debug:ro --cap-add=SYS_ADMIN --security-opt no-new-privileges paulinhu/bpftrace:alpine sh -c "/init.sh && bpftrace -e 'kprobe:do_sys_open { printf(\"%s: %s\n\", comm, str(arg1)) } interval:s:2 { exit(); }'"

If the above does not work for you, either your kernel has lockdown enabled or your kernel's major version is not compatible with the one I built the Docker image against (5.x). Troubleshooting tips are provided at the bottom of this page.
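Both conditions can be checked quickly before trying the container. The snippet below is a minimal sketch, not part of the image: the helper names `kernel_major` and `lockdown_mode` are hypothetical, and it assumes your kernel exposes the lockdown state at /sys/kernel/security/lockdown (kernels without the lockdown LSM will not have that file).

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of the image or of bpftrace.

# Print the major version from a kernel release string such as "5.0.0-32-generic".
kernel_major() {
  printf '%s\n' "$1" | cut -d. -f1
}

# Print the active mode from the lockdown file's contents,
# e.g. "[none] integrity confidentiality" (the active mode is in brackets).
lockdown_mode() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

release=$(uname -r)
echo "kernel release: $release (major $(kernel_major "$release"))"

# Assumption: kernels built with the lockdown LSM expose this file; others do not.
lockdown_file=/sys/kernel/security/lockdown
if [ -r "$lockdown_file" ]; then
  echo "lockdown mode: $(lockdown_mode "$(cat "$lockdown_file")")"
else
  echo "lockdown mode: unknown ($lockdown_file not readable)"
fi
```

A major version other than 5, or a lockdown mode other than none, would point to one of the two causes mentioned above.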

Running that container should produce a list of the commands currently executing on the host machine and the files each one has opened:

Note that the only folder mounted into the container is read-only. Also, the example above does not use a custom seccomp profile. Further down this post I show one and how to use it, assuming you are that way inclined. :)

The long story…

I could not find any Alpine image with bpftrace that I could simply download and use. I found that a bit strange, but once I tried to build my own I realised why. Here are the basic challenges in this endeavour:

1. No stable bpftrace package for alpine

Currently, the only bpftrace package available for Alpine is in the testing channel. I prefer not to rely on testing packages, as they may change and break your docker build command.

2. Dependencies not easily available for alpine

If the focus were Ubuntu, you could generate an image with bpftrace using a single RUN command:

FROM ubuntu:disco as build
RUN apt update && apt install -y bpftrace linux-headers-$(uname -r)

The story is not as simple for "alpiners".

3. bpftrace depends on the host machine's kernel

At runtime, bpftrace looks up kernel headers using the kernel release name of the target host machine. That means that if you build an image on a machine running 5.0.0-32-generic and try to run that image on a server running 5.0.0-1020-gcp, it will fail to run.

Building a light bpftrace image

Most of the problems above can be resolved by building all dependencies from source. Using a multi-stage build, all the source code can be downloaded, together with its dependencies and build dependencies. Then, at the end, just the required static files are copied into an alpine:latest image. The resulting Dockerfile looks like this:

Note that to resolve the run-time dependency on the kernel release name, I create an init.sh script that symlinks the host machine's kernel release name to the one created at image build time. Unless the two kernels are vastly different, bpftrace should not complain. Obviously, types change across kernel releases, so you should strive to build for the target machine. With this approach, however, you should get away with quite some drift.
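To make the symlink idea concrete, here is a rough sketch of what such a step could look like. It is not the actual init.sh from the image: the function name `link_kernel_release` is hypothetical, and it assumes the headers live under a per-release directory such as /lib/modules/<release>. The demo runs against a scratch directory so nothing on your system is touched.

```shell
#!/bin/sh
# Hypothetical sketch of an init.sh-style step; the real script may differ.

# link_kernel_release MODULES_DIR BUILT_RELEASE HOST_RELEASE
# Makes MODULES_DIR/HOST_RELEASE point at the headers built for BUILT_RELEASE.
link_kernel_release() {
  modules_dir=$1
  built=$2
  host=$3
  # Nothing to do when the host already matches, or an entry already exists.
  if [ "$built" = "$host" ] || [ -e "$modules_dir/$host" ]; then
    return 0
  fi
  ln -s "$modules_dir/$built" "$modules_dir/$host"
}

# Demo against a scratch directory, standing in for /lib/modules.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/5.0.0-32-generic/build"
link_kernel_release "$demo_dir" "5.0.0-32-generic" "5.0.0-1020-gcp"
ls -l "$demo_dir"
```

Inside a container, the equivalent call would pass the real modules directory, the release the image was built for, and $(uname -r) for the host.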

Using the Dockerfile above, and assuming the target machine is similar to your local one, you can build the image using:

docker build -t my-bpftrace:alpine \
  --build-arg KERNEL_VERSION=$(uname -r | awk -F- '{ print $1 }') \
  --build-arg KERNEL_RELEASE=$(uname -r) \
  .
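To see what the two build arguments end up holding, the short sketch below applies the same awk split to a fixed example release string (the one used earlier in this post) instead of the live output of uname -r.

```shell
#!/bin/sh
# Example value; on a real host this would be: release=$(uname -r)
release="5.0.0-32-generic"

# Same split as in the build command: keep everything before the first "-".
version=$(printf '%s\n' "$release" | awk -F- '{ print $1 }')

echo "KERNEL_RELEASE=$release"   # prints KERNEL_RELEASE=5.0.0-32-generic
echo "KERNEL_VERSION=$version"   # prints KERNEL_VERSION=5.0.0
```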

Once built, you can run it on the same machine with:

docker run --rm -t -v /sys/kernel/debug:/sys/kernel/debug:ro --cap-add=SYS_ADMIN --security-opt no-new-privileges my-bpftrace:alpine bpftrace -e 'kprobe:do_sys_open { printf("%s: %s\n", comm, str(arg1)) } interval:s:2 { exit(); }'

If you need to run on a different machine that uses a different kernel release name, make sure you execute /init.sh first. In that case, also make sure to escape any double quotes:

docker run --rm -t -v /sys/kernel/debug:/sys/kernel/debug:ro --cap-add=SYS_ADMIN --security-opt no-new-privileges my-bpftrace:alpine sh -c "/init.sh && bpftrace -e 'kprobe:do_sys_open { printf(\"%s: %s\n\", comm, str(arg1)) } interval:s:2 { exit(); }'"
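If the nested quoting gets unwieldy, one possible alternative is to keep the program in a file and mount it into the container, since bpftrace also accepts a script path as its argument. This is a sketch under assumptions: the file name opens.bt and the mount point /opens.bt are hypothetical, and the docker command is left commented out so the snippet is safe to paste.

```shell
#!/bin/sh
# Write the program to a file. The quoted 'EOF' stops the shell from
# expanding anything, so no escaping is needed inside the program.
cat > opens.bt <<'EOF'
kprobe:do_sys_open { printf("%s: %s\n", comm, str(arg1)); }
interval:s:2 { exit(); }
EOF

# Then mount the file read-only and point bpftrace at it (uncomment to run):
# docker run --rm -t \
#   -v /sys/kernel/debug:/sys/kernel/debug:ro \
#   -v "$PWD/opens.bt:/opens.bt:ro" \
#   --cap-add=SYS_ADMIN --security-opt no-new-privileges \
#   my-bpftrace:alpine sh -c "/init.sh && bpftrace /opens.bt"
```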

The result is an 80 MB container image:

Which becomes 27 MB once compressed:

That's it! At this point you have a light and simple image to run your bpftrace commands. The bits below are optional if you don't care about seccomp and did not have any issues along the way. :)

For the full code to build the image, check out this repo.

Restrict the container image with seccomp

bpftrace must be executed with the CAP_SYS_ADMIN capability and also needs (read-only) access to the /sys/kernel/debug folder. Using a custom seccomp profile helps to reduce the attack surface, which is not small when running with such a capability.

Here's the profile I have been using:

The seccomp profile can be referenced by adding --security-opt seccomp=profile-name.json. The final command should look like:

docker run --rm -t -v /sys/kernel/debug:/sys/kernel/debug:ro --cap-add=SYS_ADMIN --security-opt seccomp=bpftrace.json --security-opt no-new-privileges paulinhu/bpftrace:alpine sh -c "/init.sh && bpftrace -e 'kprobe:do_sys_open { printf(\"%s: %s\n\", comm, str(arg1)) } interval:s:2 { exit(); }'"

With the seccomp profile applied, the container is restricted to the system calls that I found were required to run bpftrace commands, and cannot use any others.

Playing around with bpftrace

There are loads of one-liners that can give you deeper insight into what's going on in your servers, which is beyond the scope of this post. But here's an extract from the examples in the bpftrace repo:

# Syscall count by program

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'



# Read bytes by process:

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'



# Show per-second syscall rates:

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ = count(); } interval:s:1 { print(@); clear(@); }'



# Trace disk size by process

bpftrace -e 'tracepoint:block:block_rq_issue { printf("%d %s %d\n", pid, comm, args->bytes); }'



# Count page faults by process

bpftrace -e 'software:faults:1 { @[comm] = count(); }'



# Files opened by process

bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'