In part 1 and part 2 of this series, we took a condensed, in-depth look at the eBPF VM. Reading those parts is not mandatory for understanding this third part, though a good grasp of the low-level basics does help in understanding the higher-level tools. To understand how these tools work, let's define the high-level components of an eBPF program:

The backend: This is the eBPF bytecode loaded and running in the kernel. It writes data to the kernel map and ringbuffer data structures.

The loader: This loads the bytecode backend into the kernel. Usually the bytecode gets automatically unloaded by the kernel when its loader process terminates.

The frontend: This reads data (written by the backend) from the data structures and shows it to the user.

The data structures: These are the means of communication between backends and frontends. They are maps and ringbuffers managed by the kernel, accessible via file descriptors and created before a backend gets loaded. They continue to exist until no more backends or frontends read or write to them.

In the sock_example.c studied in parts 1 and 2, all the components are squashed in a single C source file and all actions are done by a single user process:

lines 40-45 create the map data structure

lines 47-61 define the backend

lines 63-76 load the backend in the kernel

lines 78-91 are the frontend, which prints data read from the map file descriptor to the user.

eBPF programs can be much more complex: multiple backends can be loaded by a single (or by multiple separate!) loader process, writing to multiple data structures which then get read by multiple frontend processes. All of this can happen in a single big eBPF application spanning multiple user processes.

Level one: Easier backend writing: The LLVM eBPF compiler

We saw in the preceding article how writing raw eBPF bytecode on top of the kernel is hard and unproductive, very much like writing in a processor assembly language, so naturally an LLVM backend capable of compiling the LLVM intermediate representation to eBPF was developed and released starting with LLVM v3.7 in 2015 (GCC still doesn't support eBPF as of this writing). This allows subsets of multiple higher-level languages like C, Go or Rust to be compiled to eBPF. The most developed and popular combination is based on C, as the kernel is also written in C, making it easier to reuse existing kernel headers.

LLVM compiles a "restricted C" language (remember: no unbounded loops, a maximum of 4096 instructions and so on, from part 1) to ELF object files containing special sections, which get loaded into the kernel using libraries like libbpf, built on top of the bpf() syscall. This design effectively splits the backend definition from the loader and frontend, because the eBPF bytecode lives in its own ELF file.

The kernel also provides examples using this pattern under samples/bpf/: the *_kern.c files are compiled to *_kern.o (this is the backend code) which get loaded by *_user.c (the loader and frontend).

Converting the sock_example.c raw bytecode from parts 1 and 2 of this series to "restricted C" yields sockex1_kern.c, which is much easier to understand and modify than raw bytecode:

    #include <uapi/linux/bpf.h>
    #include <uapi/linux/if_ether.h>
    #include <uapi/linux/if_packet.h>
    #include <uapi/linux/ip.h>
    #include "bpf_helpers.h"

    struct bpf_map_def SEC("maps") my_map = {
        .type = BPF_MAP_TYPE_ARRAY,
        .key_size = sizeof(u32),
        .value_size = sizeof(long),
        .max_entries = 256,
    };

    SEC("socket1")
    int bpf_prog1(struct __sk_buff *skb)
    {
        int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        long *value;

        value = bpf_map_lookup_elem(&my_map, &index);
        if (value)
            __sync_fetch_and_add(value, skb->len);

        return 0;
    }

    char _license[] SEC("license") = "GPL";

The produced eBPF ELF object, sockex1_kern.o, now contains both the separated backend and the data structure definitions. The loader and frontend, sockex1_user.c, parses the ELF file, creates the required map, loads the bytecode function bpf_prog1() into the kernel and then proceeds to run the frontend as before.
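To make the loader's job concrete, here is a minimal sketch of loading such an ELF object with the modern libbpf API. This is an illustration, not the actual sockex1_user.c (which uses the kernel samples' own helper code); the object, program and map names are taken from the example above, and the program assumes libbpf is installed and is run as root:

```c
/* Hypothetical minimal loader sketch using libbpf (link with -lbpf).
 * Not the actual sockex1_user.c from the kernel samples. */
#include <stdio.h>
#include <bpf/libbpf.h>

int main(void)
{
    struct bpf_object *obj;
    struct bpf_program *prog;
    int prog_fd, map_fd;

    /* Parse the ELF object produced by LLVM */
    obj = bpf_object__open_file("sockex1_kern.o", NULL);
    if (!obj)
        return 1;

    /* Create the maps and load all programs into the kernel */
    if (bpf_object__load(obj))
        return 1;

    prog = bpf_object__find_program_by_name(obj, "bpf_prog1");
    prog_fd = bpf_program__fd(prog);

    /* File descriptor the frontend will read counters from */
    map_fd = bpf_object__find_map_fd_by_name(obj, "my_map");

    printf("loaded prog fd=%d, map fd=%d\n", prog_fd, map_fd);
    /* ... attach prog_fd to a socket, then poll the map ... */
    return 0;
}
```

Notice how even this minimal loader must know about ELF parsing, map creation and program loading; this is exactly the complexity the next level of tooling hides.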

The trade-off made by introducing this "restricted C" abstraction layer is all about making the eBPF backend code easier to write in a higher-level language, at the expense of increased complexity in the loader (which now needs to parse ELF objects), while the frontend is mostly unaffected.

Level two: Automating backend/loader/frontend interactions: The BPF Compiler Collection

Not everyone has kernel sources at hand, especially in production, and it's also a bad idea in general to tie eBPF-based tools to a specific kernel source revision. Designing and implementing the interactions between an eBPF program's backends, frontends, loaders and data structures can be very complex, error-prone and time-consuming, especially in C, which is considered a dangerous low-level language. In addition to these risks, developers are also in constant danger of re-inventing the wheel for common problems, with endless design variations and implementations. To alleviate all these pains is why the BCC project exists: it provides an easy-to-use framework for writing, loading and running eBPF programs, by writing simple Python or Lua scripts in addition to the "restricted C" exemplified above.

The BCC project has two parts:

The compiler collection (BCC proper): This is the framework used for writing BCC tools and the focus of our article. Read on.

BCC-tools: A constantly growing set of well tested eBPF-based programs ready for use with examples and manpages. More info in this tutorial.

The BCC install footprint is big: it depends on LLVM/Clang to compile "restricted C" to eBPF and on Python/Lua, and it also contains library implementations like libbcc (written in C++), libbpf and so on. Parts of the kernel tree are also copied into the BCC source so it doesn't require building against a full kernel source tree (only headers). It can easily take hundreds of MB of space, which is not very good for small embedded devices that could also benefit from eBPF powers. Finding solutions to this embedded device size constraint will be our focus in part 4.

BCC arranges eBPF program components like this:

Backends and data structures: Written in "restricted C". They can be in separate files or stored as multiline strings directly inside the loader/frontend scripts for convenience. Language reference.


Loaders and frontends: Written in very simple high-level python/lua scripts. Language reference.

Because the main purpose of BCC is to simplify eBPF program writing, it standardizes and automates as much as possible: compiling the "restricted C" backend via LLVM happens automatically in the background and results in a standard ELF object format, allowing the loader to be implemented just once for all BCC programs and reduced to a minimal API (2 lines of Python). It also standardizes the data structure APIs for easy access from the frontend. In a nutshell, it focuses developer attention on writing frontends, without having to worry about lower-level details.

To best illustrate how it works, let's look at a simple concrete example: a full re-implementation from scratch of the sock_example.c from our previous articles. The program counts how many TCP, UDP and ICMP packets are received on the loopback interface:
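The original article shows the script inline; as a sketch of what such a BCC re-implementation could look like (the map and function names here are our own, and running it requires the bcc Python package plus root privileges), consider:

```python
#!/usr/bin/env python
# Hypothetical BCC re-implementation sketch of sock_example.c:
# count TCP/UDP/ICMP traffic received on the loopback interface.
# Requires the bcc package and root privileges to run.
import ctypes as ct
import socket
import time

from bcc import BPF

# Backend + data structure, written in "restricted C"
bpf_text = """
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>

BPF_ARRAY(countmap, long, 256);

int count_packets(struct __sk_buff *skb) {
    int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    long *value = countmap.lookup(&index);
    if (value)
        __sync_fetch_and_add(value, skb->len);
    return 0;
}
"""

# Loader: compilation via LLVM and loading happen right here
b = BPF(text=bpf_text)
func = b.load_func("count_packets", BPF.SOCKET_FILTER)
BPF.attach_raw_socket(func, "lo")

# Frontend: read the per-protocol byte counters from the map
countmap = b["countmap"]
for _ in range(5):
    time.sleep(1)
    tcp = countmap[ct.c_int(socket.IPPROTO_TCP)].value
    udp = countmap[ct.c_int(socket.IPPROTO_UDP)].value
    icmp = countmap[ct.c_int(socket.IPPROTO_ICMP)].value
    print("TCP %d UDP %d ICMP %d bytes" % (tcp, udp, icmp))
```

The backend is the same logic as sockex1_kern.c, but the loader has shrunk to two lines of Python and the frontend reads the map through a dictionary-like API.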

Some advantages of implementing the above with BCC as opposed to writing directly in C as we did previously:

Forget about raw bytecode: you write all the backend in the more convenient "restricted C".

No need to maintain any LLVM "restricted C" build logic: the code is compiled and loaded directly on script execution by BCC.

No dangerous C code: Python is a safer language for writing frontends and loaders, avoiding errors like null pointer dereferences.

The code is more concise and you can focus on the logic of your application instead of machine specifics.

The script can be copied and run anywhere (assuming BCC is installed), it is not tied to the kernel source directory.

and so on.

In the above example we used the BPF.SOCKET_FILTER program type, which resulted in our hooked C function getting a network packet buffer as its context argument. We can also use the BPF.KPROBE type to peek into arbitrary kernel functions. Let's do it, but instead of using the same interface as above, we'll use the special kprobe__* function name prefix to illustrate an even higher-level BCC API:
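The script is shown inline in the original article; its core, lightly condensed here from bcc/examples/tracing/bitehist.py (and requiring BCC plus root privileges to run), looks roughly like this:

```python
#!/usr/bin/env python
# Condensed sketch of bcc/examples/tracing/bitehist.py:
# a histogram of block I/O sizes. Requires bcc and root.
from time import sleep

from bcc import BPF

# The kprobe__ prefix makes BCC attach this function to a kprobe
# on blk_account_io_completion() automatically -- no explicit loader.
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);

int kprobe__blk_account_io_completion(struct pt_regs *ctx,
                                      struct request *req)
{
    dist.increment(bpf_log2l(req->__data_len / 1024));
    return 0;
}
""")

print("Tracing... Hit Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()

# Frontend: print a log2 histogram of the collected sizes
b["dist"].print_log2_hist("kbytes")
```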

This example was taken from bcc/examples/tracing/bitehist.py. It prints a histogram of block I/O sizes by hooking the blk_account_io_completion() kernel function.

Notice how the eBPF loading happens automatically (the loader is implicit) based on the kprobe__blk_account_io_completion() function name! We have come quite far since writing and loading bytecode in C with libbpf.

Level three: Python is too low-level: BPFtrace

In some use cases BCC is still too low-level, for example when inspecting a system during incident response, where time is of the essence, decisions need to be made fast and writing Python / "restricted C" can take too long. BPFtrace was therefore built on top of BCC, providing an even higher abstraction level via a domain-specific language inspired by AWK and C. The language is similar to the one provided by DTrace, according to the announcement post, which calls it DTrace 2.0 and provides a good introduction and examples.

What BPFtrace does by abstracting so much logic in a powerful and safe (but still limited compared to BCC) language is quite amazing. This shell one-liner counts how many syscalls each user process makes (visit the built-in vars, map functions and count() documentation for more info):

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[pid, comm] = count(); }'

BPFtrace is still a work in progress in some areas. For example, at this point in time there is no easy way to define and run a socket filter to implement tools like our previously examined sock_example. It could probably be approximated in BPFtrace with a kprobe:netif_receive_skb hook, but BCC is still a better tool for socket filtering. In any case, even in its current state, BPFtrace is still very useful for quick analysis/debugging before dropping down to the full power of BCC.
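As a rough sketch of that kprobe-based approximation (run as root, with bpftrace installed; this counts packets entering the network stack per process name rather than per IP protocol, so it is not a true equivalent of the socket filter):

```shell
# Count how many times netif_receive_skb() fires, keyed by the
# name of the process current when the packet is received.
bpftrace -e 'kprobe:netif_receive_skb { @[comm] = count(); }'
```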

Level four: eBPF in the cloud: IOVisor

IOVisor is a Linux Foundation collaborative project built around the eBPF VM and tools presented in this article series. It uses some very high-level buzzword-heavy concepts like "Universal Input/Output" focused on marketing the eBPF technology to Cloud / Data Center developers and users:

The in-kernel eBPF VM becomes the "IO Visor Runtime Engine"

The compiler backends become "IO Visor Compiler backends"

eBPF programs in general are renamed to "IO modules"

Specific eBPF programs implementing packet filters become "IO data-plane modules/components"

and so on.

Considering that the original name, extended Berkeley Packet Filter, doesn't mean much, maybe all this renaming is welcome and valuable, especially if it enables more industries to tap into eBPF powers.

The IOVisor project created the Hover framework, also called the "IO Modules Manager", which is a userspace daemon for managing eBPF programs (or IO Modules), capable of pushing and pulling IO modules to and from the cloud, similar to how the Docker daemon publishes/fetches images. It provides a CLI and a web/REST interface, and also has a fancy web UI. Significant parts of Hover are written in Go, so in addition to the normal BCC dependencies it also depends on a Go installation, making it big and unsuitable for the small embedded devices we eventually want to target in part 4.

Summary

In this part we examined the userspace ecosystem built on top of the eBPF VM to increase developer productivity and ease deployment of eBPF programs. These tools make it so easy to work with eBPF that a user can just "apt-get install bpftrace" and run one-liners, or use the Hover daemon to deploy an eBPF program (IO module) to 1000 machines. For all the power they give developers and users, however, these tools have significant disk footprints or may not even run on 32-bit ARM systems, making them unsuitable for small embedded devices. This is why in part 4 we'll explore other projects that try to ease running eBPF programs, targeting the embedded device ecosystem.

Continue reading (An eBPF overview, part 4: Working with embedded systems)…