A JIT for packet filters


The Berkeley Packet Filter (BPF) is a mechanism for the fast filtering of network packets on their way to an application. It has its roots in BSD in the very early 1990s, a history that was not enough to prevent the SCO Group from claiming ownership of it. Happily, that claim proved to be as valid as the rest of SCO's assertions, so BPF remains a part of the Linux networking stack. A recent patch from Eric Dumazet may make BPF faster, at least on 64-bit x86 systems.

The purpose behind BPF is to let an application specify a filtering function to select only the network packets that it wants to see. An early BPF user was tcpdump, which used BPF to implement the filtering behind its complex command-line syntax. Other packet capture programs also make use of it. On Linux, there is another interesting application of BPF: the "socket filter" mechanism allows an application to filter incoming packets on any type of socket with BPF. In this mode, it can function as a sort of per-application firewall, eliminating packets before the application ever sees them.

The original BPF distribution came in the form of a user-space library, but the BPF interface quickly found its way into the kernel. When network traffic is high, there is a lot of value in filtering unwanted packets before they are copied into user space. Obviously, it is also important that BPF filters run quickly; every bit of per-packet overhead is going to hurt in a high-traffic situation. BPF was designed to allow a wide variety of filters while keeping speed in mind, but that does not mean that it cannot be made faster.

BPF defines a virtual machine which is almost Turing-machine-like in its simplicity. There are two registers: an accumulator and an index register. The machine also has a small scratch memory area, an implicit array containing the packet in question, and a small set of arithmetic, logical, and jump instructions. The accumulator is used for arithmetic operations, while the index register provides offsets into the packet or into the scratch memory areas. A very simple BPF program (taken from the 1993 USENIX paper [PDF]) might be:

    ldh [12]
    jeq #ETHERTYPE_IP, l1, l2
    l1: ret #TRUE
    l2: ret #0

The first instruction loads a 16-bit quantity from offset 12 in the packet to the accumulator; that value is the Ethernet protocol type field. It then compares the value to see if the packet is an IP packet or not; IP packets are accepted, while anything else is rejected. Naturally, filter programs get more complicated quickly. Header length can vary, so the program will have to calculate the offsets of (for example) TCP header values; that is where the index register comes into play. Scratch memory (which is the only place a BPF program can store to) is used when intermediate results must be kept.

The Linux BPF implementation can be found in net/core/filter.c; it provides "standard" BPF along with a number of Linux-specific ancillary instructions which can test whether a packet is marked, which CPU the filter is running on, which interface the packet arrived on, and more. It is, at its core, a long switch statement designed to run the BPF instructions quickly. This code has seen a number of enhancements and speed improvements over the years, but there has not been any fundamental change for a long time.

Eric Dumazet's patch is a fundamental change: it puts a just-in-time compiler into the kernel to translate BPF code directly into the host system's assembly code. The simplicity of the BPF machine makes the JIT translation relatively simple; every BPF instruction maps to a straightforward x86 instruction sequence. A few assembly language helpers implement the virtual machine's semantics; the accumulator and index are simply kept in processor registers. The resulting program is placed in a bit of vmalloc() space and run directly when a packet is to be tested. A simple benchmark shows a 50ns savings for each invocation of a simple filter - that may seem small, but, when multiplied by the number of packets going through a system, that difference can add up quickly.

The current implementation is limited to the x86-64 architecture; indeed, that architecture is wired deeply into the code, which is littered with hard-coded x86 instruction opcodes. Should anybody want to add a second architecture, they will be faced with the choice of simply replicating the whole thing (it is not huge) or trying to add a generalized opcode generator to the existing JIT code.

An obvious question is: can this same approach be applied to iptables, which is more heavily used than BPF? The answer may be "yes," but it might also make more sense to bring back the nftables idea, which is built on a BPF-like virtual machine of its own. Given that there has been some talk of using nftables in other contexts (internal packet classification for packet scheduling, for example), the value of a JIT-translated nftables could be even higher. Nftables is a job for another day, though; meanwhile, we have a proof of the concept for BPF that appears to get the job done nicely.

