~ Updated 2019-01-10 ~

What is BPF?

BPF, as in Berkeley Packet Filter, was initially conceived in 1992 to provide a way to filter packets and to avoid useless packet copies from kernel to userspace. It consisted of a simple bytecode injected from userspace into the kernel, where it is checked by a verifier—to prevent kernel crashes or security issues—and attached to a socket, then run on each received packet. It was ported to Linux a couple of years later and used for a small number of applications (tcpdump for example). The simplicity of the language, as well as the existence of an in-kernel Just-In-Time (JIT) compiler for BPF, were factors in the excellent performance of this tool.
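To make that classic mechanism concrete, here is a minimal sketch (in Python, using ctypes) of attaching a one-instruction cBPF filter to a socket with setsockopt(SO_ATTACH_FILTER). The structure layouts mirror struct sock_filter and struct sock_fprog from <linux/filter.h>; the "accept all" program is purely illustrative, and the whole thing is Linux-specific.

```python
import ctypes
import socket
import struct

# Mirrors struct sock_filter (8 bytes) from <linux/filter.h>:
# a cBPF instruction is an opcode, two jump offsets, and an operand.
class SockFilter(ctypes.Structure):
    _fields_ = [
        ("code", ctypes.c_uint16),  # instruction opcode
        ("jt", ctypes.c_uint8),     # jump offset if true
        ("jf", ctypes.c_uint8),     # jump offset if false
        ("k", ctypes.c_uint32),     # generic operand
    ]

# Mirrors struct sock_fprog: instruction count plus pointer to the array.
class SockFprog(ctypes.Structure):
    _fields_ = [
        ("len", ctypes.c_ushort),
        ("filter", ctypes.POINTER(SockFilter)),
    ]

SO_ATTACH_FILTER = 26   # from <asm-generic/socket.h>
BPF_RET, BPF_K = 0x06, 0x00

# One-instruction program: "return 0xffff", i.e. accept the packet and
# pass up to 0xffff bytes of it to the socket.
insns = (SockFilter * 1)(SockFilter(BPF_RET | BPF_K, 0, 0, 0xFFFF))
prog = SockFprog(1, ctypes.cast(insns, ctypes.POINTER(SockFilter)))

def attach_accept_all(sock):
    """Attach the filter; this works without privileges on Linux."""
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, bytes(prog))

if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        attach_accept_all(s)
        print("cBPF filter attached")
```

Incidentally, running tcpdump with the -d flag on a filter expression prints the cBPF instructions it would attach, in exactly this instruction format.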

Then in 2013, Alexei Starovoitov completely reshaped it, starting to add new functionalities and to improve BPF's performance. This new version is designated as eBPF (for “extended BPF”), while the former became cBPF (“classic” BPF). New features such as maps and tail calls appeared. The JIT machines were rewritten. The new language is even closer to native machine language than cBPF was. New attach points in the kernel have also been created.
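As an illustration of how regular this new instruction set is, each eBPF instruction occupies a fixed 64-bit slot: opcode, destination and source registers, offset, immediate. The short Python sketch below encodes the smallest valid program, "r0 = 0; exit", with opcode values taken from the kernel's <linux/bpf.h>; it only demonstrates the layout and does not load anything into the kernel.

```python
import struct

# Layout of struct bpf_insn (8 bytes) from <linux/bpf.h>:
#   u8 opcode; u8 dst_reg:4, src_reg:4; s16 offset; s32 immediate
def bpf_insn(opcode, dst=0, src=0, off=0, imm=0):
    return struct.pack("=BBhi", opcode, (src << 4) | dst, off, imm)

BPF_MOV64_IMM = 0xb7  # BPF_ALU64 | BPF_MOV | BPF_K
BPF_EXIT = 0x95       # BPF_JMP | BPF_EXIT

# Smallest valid eBPF program: "r0 = 0; exit" (return 0).
prog = bpf_insn(BPF_MOV64_IMM, dst=0, imm=0) + bpf_insn(BPF_EXIT)
```

The fixed eight-byte encoding and the larger register set (eleven 64-bit registers) are part of why eBPF maps so directly onto native machine instructions.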

Thanks to those new hooks, eBPF programs can be designed for a variety of use cases, which divide into two fields of application. One of them is the domain of kernel tracing and event monitoring. BPF programs can be attached to kprobes, and they compare favorably with other tracing methods, with many advantages (and sometimes some drawbacks).

The other application domain remains network programming. In addition to socket filters, eBPF programs can be attached to tc (Linux traffic control tool) ingress or egress interfaces and perform a variety of packet processing tasks in an efficient way. This opens new perspectives in the domain.

And eBPF performance is further leveraged through the technologies developed for the IO Visor project: new hooks have also been added for XDP (“eXpress Data Path”), a new fast path recently added to the kernel. XDP works in conjunction with the Linux stack and relies on BPF to perform very fast packet processing.

Even some projects, such as P4 or Open vSwitch, consider BPF or have started to approach it. Others, such as CETH or Cilium, are entirely based on it. BPF is buzzing, so we can expect a lot of tools and projects to orbit around it soon…

Dive into the bytecode

As for me: some of my work (including for BEBA) is closely related to eBPF, and several future articles on this site will focus on this topic. Logically, I wanted to somehow introduce BPF on this blog before going into the details—I mean, a real introduction, more developed on BPF functionalities than the brief abstract provided in the first section: What are BPF maps? Tail calls? What do the internals look like? And so on. But there are a lot of presentations on this topic available on the web already, and I do not wish to create “yet another BPF introduction” that would come as a duplicate of existing documents.

So instead, here is what we will do. After all, I spent some time reading and learning about BPF, and while doing so, I gathered a fair amount of material: introductions, documentation, but also tutorials and examples. There is a lot to read, but in order to read it, one has to find it first. Therefore, as an attempt to help people who wish to learn and use BPF, the present article introduces a list of resources. These are various kinds of readings that will hopefully help you dive into the mechanics of this kernel bytecode.

Resources

Generic presentations

The documents linked below provide a generic overview of BPF, or of some closely related topics. If you are very new to BPF, you can try picking a couple of presentations among the first ones and reading the ones you like most. If you already know eBPF, you probably want to target specific topics instead, lower down in the list.

About BPF

Generic presentations about eBPF:

BPF internals:

The IO Visor blog has some interesting technical articles about BPF. Some of them contain a bit of marketing talk.

As of early 2019, more and more presentations are being given on multiple aspects of BPF. One nice example is the BPF track that was held in parallel to the Linux Plumbers Conference in late 2018 (and should be held again in coming years), where lots of topics related to eBPF development or use cases were presented.

Kernel tracing: summing up all existing methods, including BPF:

Regarding event tracing and monitoring, Brendan Gregg uses eBPF a lot and does an excellent job at documenting some of his use cases. If you are into kernel tracing, you should see his blog articles related to eBPF or to flame graphs. Most of them are accessible from this article or by browsing his blog.

Introducing BPF, but also presenting generic concepts of Linux networking:

Hardware offload:

eBPF with tc or XDP supports hardware offload, starting with Linux kernel version 4.9 and introduced by Netronome. Here is a presentation about this feature:

eBPF/XDP hardware offload to SmartNICs (Jakub Kicinski and Nic Viljoen, netdev 1.2, Tokyo, October 2016)

An updated version was presented one year later:

Comprehensive XDP offload—Handling the edge cases (Jakub Kicinski and Nic Viljoen, netdev 2.2, Seoul, November 2017)

I presented a shorter but updated version at FOSDEM 2018:

The Challenges of XDP Hardware Offload (Quentin Monnet, FOSDEM’18, Brussels, February 2018)

About cBPF:

About XDP

About other components related or based on eBPF

Documentation

Once you have managed to get a broad idea of what BPF is, you can put aside generic presentations and start diving into the documentation. Below are the most complete documents about BPF specifications and functioning. Pick the ones you need and read them carefully!

About BPF

About tc

When using BPF for networking purposes in conjunction with tc, the Linux tool for traffic control, one may wish to gather information about tc’s generic functioning. Here are a couple of resources about it.

About XDP

Some work-in-progress documentation (including specifications) for XDP was started by Jesper Dangaard Brouer, but it is meant to be a collaborative work. Still in progress (as of September 2016): you should expect it to change, and it may be moved at some point (Jesper called for contributions, if you feel like improving it).

The BPF and XDP Reference Guide from Cilium project… Well, the name says it all.

About flow dissectors

LWN has an excellent article about Writing network flow dissectors in BPF, contributed by Marta Rybczyńska in September 2018.

About P4 and BPF

P4 is a language used to specify the behavior of a switch. It can be compiled for a number of hardware or software targets. As you may have guessed, one of these targets is BPF… The support is only partial: some P4 features cannot be translated to BPF, and in a similar way there are things that BPF can do that would not be possible to express with P4. Anyway, the documentation related to using P4 with BPF used to be hidden in the bcc repository. This changed with the P4_16 version: the p4c reference compiler now includes a backend for eBPF.

There is also an interesting presentation from Jamal Hadi Salim, presenting a number of points from tc from which P4 could maybe get some inspiration: What P4 Can Learn From Linux Traffic Control Architecture.

Tutorials

Brendan Gregg has initiated excellent tutorials intended for people who want to use bcc tools for tracing and monitoring events in the kernel. The first tutorial, about using bcc itself, comes with many steps to understand how to use the existing tools, while the one intended for Python developers focuses on developing new tools, across seventeen “lessons”.

Lorenzo Fontana has made a tutorial to explain how to Load XDP programs using the ip (iproute2) command.

If you are unfamiliar with kernel compiling, Diego Pino García has a blog entry on How to build a kernel with [AF-]XDP support.

Sasha Goldshtein also has some Linux Tracing Workshops Materials involving the use of several BPF tools for tracing.

Another post by Jean-Tiare Le Bigot provides a detailed (and instructive!) example of using perf and eBPF to setup a low-level tracer for ping requests and replies.

Few tutorials exist for network-related eBPF use cases. There are some interesting documents, including an eBPF Offload Starting Guide, on the Open NFP platform operated by Netronome. Other than these, the talks from Jesper and Andy, XDP for the Rest of Us (and its second edition), are probably among the best ways to get started with XDP.

If you really focus on hardware offload for eBPF, Netronome (my employer as I edit this text) is the only vendor to propose it at the moment. Besides their Open-NFP platform, the best source of information is their support platform: https://help.netronome.com. There you will find video tutorials from David Beckett explaining how to run and offload XDP programs, user guides, and other materials… including the firmware for the Agilio SmartNICs required to perform eBPF offload!

Examples

It is always nice to have examples, to see how things really work. But BPF program samples are scattered across several projects, so I have listed all the ones I know of. The examples do not always use the same helpers (for instance, tc and bcc both have their own sets of helpers to make it easier to write BPF programs in C).

From the kernel

The kernel contains examples for most types of program: filters to bind to sockets or to tc interfaces, event tracing/monitoring, and even XDP. You can find these examples under the linux/samples/bpf/ directory.

Nowadays, most examples are added under linux/tools/testing/selftests/bpf as unit tests. This includes tests for hardware offload or for libbpf.

Some additional tests regarding BPF with tc can be found in the kernel suite of tests for tc itself, under linux/tools/testing/selftests/tc-tests.

Jesper Dangaard Brouer also maintains a specific set of samples in his prototype-kernel repository. They are very similar to those from the kernel, but can be compiled outside of the kernel infrastructure (Makefiles and headers).

Also do not forget to have a look at the logs related to the (git) commits that introduced a particular feature: they may contain detailed examples of the feature.

From package iproute2

The iproute2 package provides several examples as well. They are obviously oriented towards network programming, since the programs are meant to be attached to tc ingress or egress interfaces. The examples dwell under the iproute2/examples/bpf/ directory.

Many examples are provided with bcc:

Some are networking example programs, under the associated directory. They include socket filters, tc filters, and an XDP program.

The tracing directory includes a lot of example tracing programs. The tutorials mentioned earlier are based on these. These programs cover a wide range of event monitoring functions, and some of them are production-oriented. Note that on certain Linux distributions (at least Debian, Ubuntu, Fedora, and Arch Linux), these programs have been packaged and can be “easily” installed by typing e.g. # apt install bcc-tools, but as of this writing (and except for Arch Linux), this first requires setting up IO Visor’s own package repository.

There are also some examples using Lua as a different BPF back-end (that is, BPF programs are written in Lua instead of a subset of C, allowing the use of the same language for the front-end and the back-end), in the third directory.

Of course, bcc tools themselves are interesting example use cases for eBPF programs.

Other examples

Some other BPF programs are emerging here and there. Have a look at the different projects based on or using eBPF, mentioned above, and search their code to find how they inject programs into the kernel.

Netronome also has a GitHub repository with some sample XDP demo applications, some of them for hardware offload only, others for both driver and offloaded XDP.

Manual pages

While bcc is generally the easiest way to inject and run a BPF program in the kernel, attaching programs to tc interfaces can also be performed by the tc tool itself. So if you intend to use BPF with tc, you can find some example invocations in the tc-bpf(8) manual page.

The code

Sometimes, BPF documentation or examples are not enough, and you may have no other solution than to display the code in your favorite text editor (which should be Vim, of course) and read it. Or you may want to hack into the code to patch or add features to the machine. So here are a few pointers to the relevant files; finding the functions you want is up to you!

BPF code in the kernel

The file linux/include/linux/bpf.h and its counterpart linux/include/uapi/linux/bpf.h contain definitions related to eBPF, to be used respectively in the kernel and to interface with userspace programs.

On the same pattern, files linux/include/linux/filter.h and linux/include/uapi/linux/filter.h contain information used to run the BPF programs.

The main pieces of code related to BPF are under the linux/kernel/bpf/ directory. The different operations permitted by the system call, such as program loading or map management, are implemented in file syscall.c, while core.c contains the interpreter. The other files have self-explanatory names: verifier.c contains the verifier (no kidding), arraymap.c the code used to interact with maps of type array, and so on.

Several functions, as well as the helpers related to networking (with tc, XDP…) and available to the user, are implemented in linux/net/core/filter.c. It also contains the code to migrate cBPF bytecode to eBPF (since all cBPF programs are now translated to eBPF in the kernel before being run).

Functions and helpers related to event tracing are in linux/kernel/trace/bpf_trace.c instead.

The JIT compilers are under the directories of their respective architectures, such as file linux/arch/x86/net/bpf_jit_comp.c for x86. An exception is made for the JIT compilers used for hardware offload: they sit in their drivers; see for instance linux/drivers/net/ethernet/netronome/nfp/bpf/jit.c for Netronome NFP cards.

You will find the code related to the BPF components of tc in the linux/net/sched/ directory, and in particular in files act_bpf.c (action) and cls_bpf.c (filter).

I have not used seccomp-BPF much, but you should find the code in linux/kernel/seccomp.c, and some example use cases can be found in linux/tools/testing/selftests/seccomp/seccomp_bpf.c.

XDP hooks code

Once loaded into the in-kernel BPF virtual machine, XDP programs are hooked from userspace into the kernel network path thanks to a Netlink command. On reception, the function dev_change_xdp_fd() in file linux/net/core/dev.c is called and sets an XDP hook. Such hooks are located in the drivers of supported NICs. For example, the nfp driver used for Netronome hardware has hooks implemented in files under the drivers/net/ethernet/netronome/nfp/ directory. File nfp_net_common.c receives Netlink commands and calls nfp_net_xdp_setup(), which in turn calls for instance nfp_net_xdp_setup_drv() to install the program.

BPF logic in bcc

One can find the code for the bcc set of tools on the bcc GitHub repository. The Python code, including the BPF class, is initiated in file bcc/src/python/bcc/__init__.py. But most of the interesting stuff—in my opinion—such as loading the BPF program into the kernel, happens in the libbcc C library.
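For reference, the canonical usage pattern of that BPF class looks like the short sketch below, adapted from bcc's own “hello world” examples. Loading requires root privileges and an installed bcc package, so both the import and the load step are deferred; the kprobe__ naming convention in the C snippet is what tells bcc to attach the function to a kprobe.

```python
# The BPF program text is plain C; bcc compiles it with LLVM at runtime.
# The kprobe__ prefix makes bcc attach the function to a kprobe on
# sys_clone when the program is loaded.
BPF_TEXT = r"""
int kprobe__sys_clone(void *ctx)
{
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

def load_and_trace():
    # Deferred import: bcc is an external package, and loading the
    # program into the kernel requires root privileges.
    from bcc import BPF
    b = BPF(text=BPF_TEXT)  # compile, verify and load via bpf(2)
    b.trace_print()         # print messages from the kernel trace pipe

if __name__ == "__main__":
    load_and_trace()
```

Under the hood, the BPF() constructor hands the C text to libbcc, which drives clang/LLVM and ends up calling the bpf(2) system call for you.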

Code to manage BPF with tc

The code related to BPF in tc comes with the iproute2 package, of course. Some of it is under the iproute2/tc/ directory. The files f_bpf.c and m_bpf.c (and e_bpf.c) are used respectively to handle BPF filters and actions (and the tc exec command, whatever this may be). File q_clsact.c defines the clsact qdisc, especially created for BPF. But most of the BPF userspace logic is implemented in the iproute2/lib/bpf.c library, so this is probably where you should head if you want to mess with BPF and tc (it was moved from file iproute2/tc/tc_bpf.c, where you may find the same code in older versions of the package).

BPF utilities

The kernel also ships the sources of three tools (bpf_asm.c, bpf_dbg.c, bpf_jit_disasm.c) related to BPF, under the linux/tools/net/ (until Linux 4.14) or linux/tools/bpf/ directory depending on your version:

bpf_asm is a minimal cBPF assembler.

bpf_dbg is a small debugger for cBPF programs.

bpf_jit_disasm is generic for both BPF flavors and could be highly useful for JIT debugging.

bpftool is a generic utility written by Jakub Kicinski that can be used to interact with eBPF programs and maps from userspace, for example to show, dump, load, or pin programs, or to show, create, pin, update, or delete maps. It can also attach and detach programs to cgroups, and has JSON support. It keeps getting more and more features, and is expected to become the go-to tool for eBPF introspection and simple management.

Read the comments at the top of the source files to get an overview of their usage.

Other essential files for working with eBPF are the two userspace libraries from the kernel tree, which can be used to manage eBPF programs or maps from external programs. The functions are accessible through the headers bpf.h and libbpf.h (higher level) from directory linux/tools/lib/bpf/. The tool bpftool heavily relies on those libraries, for example.

Other interesting chunks

If you are interested in the use of less common languages with BPF, bcc contains a P4 compiler for BPF targets as well as a Lua front-end that can be used as an alternative to the C subset and (in the case of Lua) to the Python tools.

LLVM backend

The BPF backend used by clang / LLVM for compiling C into eBPF was added to the LLVM sources in this commit (and can also be accessed on the GitHub mirror).

Running in userspace

As far as I know there are at least two eBPF userspace implementations. The first one, uBPF, is written in C. It contains an interpreter, a JIT compiler for the x86_64 architecture, an assembler and a disassembler.

The code of uBPF seems to have been reused to produce a generic implementation that claims to support the FreeBSD kernel, FreeBSD userspace, the Linux kernel, Linux userspace and macOS userspace. It is used for the BPF extension module of the VALE switch.

The other userspace implementation is my own work: rbpf, based on uBPF, but written in Rust. The interpreter and JIT compiler work (both under Linux, only the interpreter on macOS and Windows); there may be more in the future.

Commit logs

As stated earlier, do not hesitate to have a look at the commit log that introduced a particular BPF feature if you want to have more information about it. You can search the logs in many places, such as on git.kernel.org, on GitHub, or on your local repository if you have cloned it. If you are not familiar with git, try things like git blame <file> to see what commit introduced a particular line of code, then git show <commit> to have details (or search by keyword in git log results, but this may be tedious). See also the list of eBPF features per kernel version on bcc repository, that links to relevant commits.

Troubleshooting

The enthusiasm about eBPF is quite recent, and so far I have not found a lot of resources intended to help with troubleshooting. So here are the few I have, augmented with my own recollection of pitfalls encountered while working with BPF.

Errors at compilation time

Make sure you have a recent enough version of the Linux kernel (see also this document).

If you compiled the kernel yourself: make sure you installed correctly all components, including kernel image, headers and libc.

When using the bcc shell function provided by the tc-bpf man page (to compile C code into BPF): I once had to add includes to the header for the clang call (this seems fixed as of today):

__bcc() {
        clang -O2 \
              -I "/usr/src/linux-headers-$(uname -r)/include/" \
              -I "/usr/src/linux-headers-$(uname -r)/arch/x86/include/" \
              -emit-llvm -c $1 -o - | \
        llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
}

For other problems with bcc, do not forget to have a look at the FAQ of the tool set.

If you downloaded the examples from the iproute2 package in a version that does not exactly match your kernel, some errors can be triggered by the headers included in the files. The example snippets indeed assume that the same versions of the iproute2 package and kernel headers are installed on the system. If this is not the case, download the correct version of iproute2, or edit the path of the included files in the examples to point to the headers shipped with iproute2 (some problems may or may not occur at runtime, depending on the features in use).

Errors at load and run time

To load a program with tc, make sure you use a tc binary coming from an iproute2 version equivalent to that of the kernel in use.

To load a program with bcc, make sure you have bcc installed on the system (just downloading the sources to run the Python script is not enough).

With tc, if the BPF program does not return the expected values, check that you called it in the correct fashion: filter, or action, or filter with “direct-action” mode.

With tc still, note that actions cannot be attached directly to qdiscs or interfaces without the use of a filter.

The errors thrown by the in-kernel verifier may be hard to interpret. The kernel documentation may help, as may the reference guide or, as a last resort, the source code (see above) (good luck!). For this kind of error it is also important to keep in mind that the verifier does not run the program. If you get an error about an invalid memory access or about uninitialized data, it does not mean that these problems actually occurred (or sometimes, that they could possibly occur at all). It means that your program is written in such a way that the verifier estimates such errors could happen, and it therefore rejects the program.

Note that the tc tool has a verbose mode, and that it works well with BPF: try appending verbose at the end of your command line.

bcc also has verbose options: the BPF class has a debug argument that can take any combination of the three flags DEBUG_LLVM_IR, DEBUG_BPF and DEBUG_PREPROCESSOR (see details in the source file). It even embeds some facilities to print output messages for debugging the code.

LLVM v4.0+ embeds a disassembler for eBPF programs. So if you compile your program with clang, adding the -g flag for compiling enables you to later dump your program in the rather human-friendly format used by the kernel verifier. To proceed with the dump, use: $ llvm-objdump -S -no-show-raw-insn bpf_program.o

Working with maps? You want to have a look at bpf-map, a very useful tool in Go created for the Cilium project, which can be used to dump the contents of kernel eBPF maps. There also exists a clone in Rust.

There is an old bpf tag on StackOverflow, but as of this writing it has hardly ever been used (and there is nearly nothing related to the new eBPF version). If you are a reader from the Future, though, you may want to check whether there has been more activity on this side.

And still more!

And come back to this blog from time to time to see if there are new articles about BPF!

Special thanks to Daniel Borkmann for the numerous additional documents he pointed to me so that I could complete this collection.