In a previous post we discussed the performance limitations of the Linux kernel network stack. We detailed the available kernel bypass techniques allowing user space programs to receive packets with high throughput. Unfortunately, none of the discussed open source solutions supported our needs. To improve the situation we decided to contribute to the Netmap project. In this blog post we'll describe our proposed changes.



CC BY-SA 2.0 image by Binary Koala

Our needs

At CloudFlare we are constantly dealing with large packet floods. Our network receives a large volume of packets, often coming from many simultaneous attacks. In fact, it is entirely possible that the server which just served you this blog post is dealing with a flood of many millions of packets per second right now.

Since the Linux kernel network stack can't keep up with packet rates this high, we need to work around it. During packet floods we offload selected network flows (those belonging to a flood) to a user space application. This application filters the packets at very high speed. Most of the packets are dropped, as they belong to a flood. The small number of "valid" packets are injected back into the kernel and handled in the same way as usual traffic.

It’s important to emphasize that the kernel bypass is enabled only for selected flows, which means that all other packets go to the kernel as usual.

This setup works perfectly on our servers with Solarflare network cards, where we can use the ef_vi API to achieve the kernel bypass. Unfortunately, we don’t have this functionality on our servers with Intel IXGBE NICs.

This is where Netmap comes in.

Netmap

Over the last few months we’ve been thinking hard about how to achieve bypass for selected flows (aka: bifurcated driver) on non-Solarflare network cards.

We’ve considered PF_RING, DPDK and other custom solutions, but sadly all of them take over the whole network card. Eventually we decided that the best way would be to patch Netmap with the functionality we need.

We chose Netmap because:

It’s fully open source and released under a BSD license.

It has a great NIC-agnostic API.

It’s very fast: it can reach line rate easily.

The project is well maintained and reasonably mature.

The code is very high quality.

The driver-specific modifications are trivial: most of the magic happens in the shared Netmap module. It’s easy to add support for new hardware.

Introducing the single RX queue mode

Usually, when a network card goes into the Netmap mode, all the RX queues get disconnected from the kernel and are available to the Netmap applications.

We don't want that. We want to keep most of the RX queues attached to the kernel, and enable Netmap mode only on selected RX queues. We call this functionality "single RX queue mode".

The intention was to expose a minimal API which could:

Open a network interface in "single RX queue mode". This would allow Netmap applications to receive packets from that specific RX queue, while leaving all the other queues attached to the host network stack.

Add RX queues to, or remove them from, "single RX queue mode" on demand.

Eventually remove the interface from the Netmap mode and reattach the RX queues to the host stack.

The patch to Netmap is awaiting code review and is available here:

The minimal program receiving packets from eth3 RX queue #4 would look like:

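    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>

    int main(void) {
        /* "~4" is the new syntax: attach only RX queue #4 of eth3 */
        struct nm_desc *d = nm_open("netmap:eth3~4", NULL, 0, 0);
        if (d == NULL) return 1;

        while (1) {
            struct pollfd fds = { .fd = d->fd, .events = POLLIN };
            poll(&fds, 1, -1);   /* block until packets arrive */

            struct netmap_ring *ring = NETMAP_RXRING(d->nifp, 4);
            while (!nm_ring_empty(ring)) {
                uint32_t i   = ring->cur;
                char *buf    = NETMAP_BUF(ring, ring->slot[i].buf_idx);
                uint16_t len = ring->slot[i].len;
                // process(buf, len)
                ring->head = ring->cur = nm_ring_next(ring, i);  /* release the slot */
            }
        }
    }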

This code is very close to a Netmap example program. Indeed the only difference is the nm_open() call, which uses the new syntax netmap:ifname~queue_number.

Once again, when running this code only packets arriving on the RX queue #4 will go to the netmap program. All other RX and TX queues will be handled by the Linux kernel network stack.

You can find a more complete example here:

Isolating a queue

In multiqueue network cards, any packet can end up in almost any RX queue due to RSS. This is why, before enabling the single RX queue mode, it is necessary to make sure that only the selected flows go to the Netmap queue.

To do so it is necessary to:

Modify the indirection table to ensure no new RSS-hashed packets will go to the isolated queue.

Use flow steering to specifically direct some flows to the isolated queue.

Work around RFS: make sure no other application is running on the CPU Netmap will run on.

For example:

    $ ethtool -X eth3 weight 1 1 1 1 0 1 1 1 1 1
    $ ethtool -K eth3 ntuple on
    $ ethtool -N eth3 flow-type udp4 dst-port 53 action 4

Here we are setting the indirection table to prevent traffic from going to RX queue #4. Then we are enabling flow steering to enqueue all UDP traffic with destination port 53 into queue #4.

Trying it out

Here's how to run it with the IXGBE NIC. First grab the sources:

    $ git clone https://github.com/jibi/netmap.git
    $ cd netmap
    $ git checkout -B single-rx-queue-mode
    $ ./configure --drivers=ixgbe --kernel-sources=/path/to/kernel

Load the netmap-patched modules and set up the interface:

    $ insmod ./LINUX/netmap.ko
    $ insmod ./LINUX/ixgbe/ixgbe.ko
    $ # Distribute the interrupts:
    $ (let CPU=0; cd /sys/class/net/eth3/device/msi_irqs/;
       for IRQ in *; do
          echo $CPU > /proc/irq/$IRQ/smp_affinity_list; let CPU+=1
       done)
    $ # Enable flow steering (ntuple filters):
    $ ethtool -K eth3 ntuple on

At this point we started flooding the interface with about 6M short UDP packets per second. htop shows the server being totally busy with handling the flood:

To counter the flood we started Netmap. First, we needed to edit the indirection table, to isolate the RX queue #4:

    $ ethtool -X eth3 weight 1 1 1 1 0 1 1 1 1 1
    $ ethtool -N eth3 flow-type udp4 dst-port 53 action 4

This caused all the flood packets to go to RX queue #4.

Before putting an interface in Netmap mode it is necessary to turn off hardware offload features:

$ ethtool -K eth3 lro off gro off

Finally we launched the netmap offload:

    $ sudo taskset -c 15 ./nm_offload eth3 4
    [+] starting test02 on interface eth3 ring 4
    [+] UDP pps: 5844714
    [+] UDP pps: 5996166
    [+] UDP pps: 5863214
    [+] UDP pps: 5986365
    [+] UDP pps: 5867302
    [+] UDP pps: 5964911
    [+] UDP pps: 5909715
    [+] UDP pps: 5865769
    [+] UDP pps: 5906668
    [+] UDP pps: 5875486

As you can see, the Netmap program on a single RX queue was able to receive between 5.8M and 6.0M packets per second.

For completeness, here's an htop showing only a single core being busy with Netmap:

Thanks

We would like to thank Pavel Odintsov, who suggested the possibility of using Netmap this way. He even prepared the initial hack we based our work on.

We would also like to thank Luigi Rizzo, for his Netmap work and great feedback on our patches.

Final words

At CloudFlare our application stack is based on open source software. We’re grateful to so many open source programmers for their awesome work. Whenever we can, we try to contribute back to the community, and we hope "the single RX Netmap mode" will be useful to others.

You can find more CloudFlare open source here.