Using eBPF and XDP in Suricata


Much software that uses the Linux kernel does so at a comparative arm's length: when it needs the kernel, perhaps for a read or write, it performs a system call, then (at least from its point of view) continues operation later, with whatever the kernel chooses to give it in reply. Some software, however, gets pretty intimately involved with the kernel as part of its normal operation, for example by using eBPF for low-level packet processing. Suricata is such a program; Eric Leblond spoke about it at Kernel Recipes 2017 in a talk entitled "eBPF and XDP seen from the eyes of a meerkat".

Suricata is a network Intrusion Detection System (IDS), released under the GPLv2 (Suricata is also the genus of the meerkat, hence the title of the talk). An IDS is a system designed to sit parallel to a router, examining all the traffic that the router is being asked to pass, to decide if any of it resembles the patterns of any known malware and to alert if so. This means it has to do something considerably more memory- and CPU-intensive than the router is doing, so efficient performance is crucial. Making a decision about the danger represented by a packet involves extracting the TCP segments from those packets, reassembling the data stream (or flow) from the TCP segments, and considering that stream from a protocol-aware standpoint; that work is performed by what Suricata refers to as its worker threads.

Suricata makes decisions about the traffic at all layers of this analysis, and is currently capable of doing this at 10Gbps, or sometimes even faster. It is also capable of operating as an Intrusion Prevention System, or IPS, where it is actually routing the network traffic (or choosing not to route it, depending on the result of the analysis). But Leblond made it clear that it works best when analyzing traffic, then reporting to a greater infrastructure that, in turn, advises the actual router on when to stop routing packets from a given stream.

Suricata starts by capturing network packets using an AF_PACKET socket, which is the method introduced in the 2.2 kernel for getting raw packets off the wire. It is used in fanout mode, which allows each successive packet from a single interface to be sent to one of a set of sockets. The policy for choosing which socket to select is configured when the mode is engaged.

By running multiple worker threads, each on its own CPU and each listening to one of those sockets, Suricata is able to parallelize the work of packet processing, flow reconstruction, and consequent analysis. This parallelization is crucial to Suricata's ability to process data at high speed, but it only works if packets from a given stream are always fed to the same worker thread. Fanout's default mode guarantees this by using a hash of the packet's network parameters to choose which socket to feed with any given packet. This hash is supposed to be constant for all packets in any given flow, but should generally be different for packets in different data flows.

Unfortunately, unrelated changes in kernel 4.4 broke the symmetry of the hash function used by the kernel to make this decision. After that change, packets with source address S and destination address D all returned one hash, H1, but those with source D and destination S returned a different hash, H2. The effect of this was that traffic from client to server tended to end up in a different worker thread than responses from server to client. If the second worker thread was less heavily loaded than the first, responses could end up being considered by Suricata before it had even seen the traffic that prompted them. The effect of this, said Leblond, was interesting: the count of processed packets remained high, but the count of detected attacks fell off dramatically. So, he said, "you are in a total mess, and it's not working", noting with Gallic understatement that "users did start to complain". David S. Miller fixed this in 4.7, with the fix also appearing in 4.4.16 and 4.6.5.

A similar complication came from Receive-Side Scaling (RSS), where network interface cards (NICs) that implement it try to do something similar themselves. As the Suricata documentation notes:

Receive Side Scaling is a technique used by network cards to distribute incoming traffic over various queues on the NIC. This is meant to improve performance but it is important to realize that it was designed for normal traffic, not for the IDS packet capture scenario. RSS uses a hash algorithm to distribute the incoming traffic over the various queues. This hash is normally not symmetrical. This means that when receiving both sides of a flow, each side may end up in a different queue. [...] By having both sides of the traffic in different queues, the order of processing of packets becomes unpredictable. Timing differences on the NIC, the driver, the kernel and in Suricata will lead to a high chance of packets coming in at a different order than on the wire.

This leads to a similar scenario to the issue above. Leblond said it was known to be a problem with the Intel XL510; its brother card, the XL710, is capable of using a symmetric hash, but the Linux driver does not currently allow telling it to do so. A patch by Victor Julien to enable this was initially rejected; Leblond is keen that it should make it in, eventually.

This discussion brought Leblond to the extended Berkeley Packet Filter, or eBPF. As noted earlier, this provides an environment where programs, written in the eBPF virtual-machine language, can be attached from user space to the kernel for various purposes. Since 4.3, Leblond said, one such purpose is providing the hash function that allows fanout mode to decide to which socket to send any given packet. He showed a slide with an example of such an eBPF program, which in 18 lines extracted source and destination IPv4 addresses (treating each as a 32-bit binary number) and returned a simple hash that was the sum of the two.

When using eBPF you have to parse the packet yourself to extract the information you want but, he said, the advantages significantly outweigh this cost. One such advantage was in dealing with data tunneled through protocols such as L2TP (for VPNs) and GTP (a 4G protocol). Because these protocols are not known to the kernel or NIC, if you ask either of those to do the hashing, all the data through a given tunnel will go into a single worker even though it probably represents multiple flows. That can overload the worker. With eBPF one could strip the tunnel headers and load-balance on the inner packets, distributing them much more fairly.

Because one worker thread can handle at most some 500-1000Mbps, it's a problem if a single worker gets overloaded. Once that happens, the kernel's ring buffers fill up, packets start to get dropped, and you're no longer monitoring as comprehensively as you should be. The avoidance of this is therefore a priority for Suricata development. As we've seen above, many elegant techniques can be brought to bear to prevent this happening by accident. Sometimes, however, it's going to be unavoidable: when your traffic is completely dominated by a single flow, it's all going to go into a single worker thread because it needs to. Not only will that worker thread struggle to analyze this "big flow", but any other flows that by accident of hashing get sent to the same worker may also not get properly inspected.

To mitigate this, the developers introduced the concept of bypass, which relies on the observation that in most cases attacks are done at the start of a TCP session; for many protocols, multiple requests on the same TCP session are not even possible. Looking only at the start of a flow thus gets you 99% of the coverage you need. Suricata can detect big flows with a simple counter, then either drop the packets from the worker queues (local bypass), or instruct the kernel not to bother capturing them (capture bypass). Suricata can be configured to reassemble a flow for deep protocol analysis only until it reaches a certain (configurable) length, after which that flow, too, will be bypassed. It can also be configured to bypass intensive but likely-safe traffic, such as that coming from Netflix's servers.

To support capture bypass, Suricata needs the ability to maintain an in-kernel list of flows not to be captured. This can't be done using nftables, because AF_PACKET capture comes before nftables processing. Unsurprisingly, it turns out that here, again, there are eBPF hooks in the kernel. Leblond showed a 16-line eBPF program that did a lookup against an eBPF map of flows to be bypassed, and returned a capture decision based on the flow's presence in, or absence from, that map. Code for maintaining the map and timing out old entries was also presented.

Suricata's capture bypass testing was done with live data at 1-2Gbps, for an hour; as Leblond noted, that's good, because it's a real-world test, but it's bad because it's not reproducible. Nevertheless, he presented the test results, and big flows could clearly be seen being bypassed, with commensurate lowering of system load.

eBPF has given us the ability to drop packets earlier in the capture process than would otherwise be possible, but the kernel still has to spend some time and memory on each packet before a drop decision is made. Suricata would like to reduce this work as much as possible, and the project is looking at XDP (eXpress Data Path, or eXtreme Data Path, depending on who you ask, he said) to do this. With XDP, an eBPF program can be run to make a decision about each packet inside the NIC driver's "receive" code; available decisions include an early (and therefore low-cost) drop, passing the packet up to the kernel's network stack (with or without modification), immediate retransmission out from the receiving NIC (again, with or without modification), and redirection for transmission from another NIC (the choice of NIC being made with more help from eBPF). Leblond is interested in the use of that last capability to make Suricata's IPS mode closer in speed to its IDS mode, by replacing the decision to bypass a packet with the decision to fast-route it into the enterprise network.

The downside to XDP is that it requires drivers that support it, which few currently do. Although he hasn't been able to test this himself, he reported results from others who have; using a single CPU and a Mellanox NIC running at 40Gbps, they were able to drop packets at 28 million packets per second (Mpps), or modify packets and retransmit them from the receiving NIC at 10Mpps. Leblond outlined the possibility of implementing not just packet bypass but packet capture via XDP, with consequent performance improvements.

With lunch fast approaching, Leblond wound up his talk, which was a fascinating exposition of how much more can be achieved in user space if it is willing to work more tightly with the kernel than is usual.

[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]

