This article shows how to receive 10 million packets per second without libraries such as Netmap, PF_RING, DPDK and others. We are going to do this with Linux kernel version 3.16 and some code in C and C++.

To begin with, I would like to say a few words about how pcap (a well-known method for packet capture) works. pcap is used in such popular utilities as iftop, tcpdump and arpwatch. It is also known for putting a very high load on the processor.

So, you have opened the interface with pcap and are now waiting for packets from it using the usual bind/recv approach. The kernel, in its turn, receives data from the network card and stores it in kernel space. When the user calls recv, passing the address of a buffer where the data should be stored, the kernel finds out that the user wants this data in user space and dutifully copies it (for the second time!). Quite complicated, right? And these are not all of pcap’s problems.
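To make the "one recv per packet" pattern concrete, here is a minimal sketch in C of what such a capture loop looks like at the socket level. It is an illustration under assumptions, not pcap's actual implementation: the helper names fill_bind_address and capture_with_recv are hypothetical, and opening the socket needs root privileges (CAP_NET_RAW).

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: fill the link-layer address that binds a packet
 * socket to a single interface, for all protocols. */
static void fill_bind_address(struct sockaddr_ll *addr, unsigned int ifindex)
{
    memset(addr, 0, sizeof(*addr));
    addr->sll_family = AF_PACKET;
    addr->sll_protocol = htons(ETH_P_ALL);
    addr->sll_ifindex = (int)ifindex;
}

/* Hypothetical helper: receive up to max_packets packets, issuing one recv()
 * system call (and one kernel-to-user copy) per packet.
 * Returns the number of packets received, or -1 if setup fails. */
static long capture_with_recv(const char *ifname, long max_packets)
{
    unsigned int ifindex = if_nametoindex(ifname);
    if (ifindex == 0)
        return -1;

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    struct sockaddr_ll addr;
    fill_bind_address(&addr, ifindex);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    unsigned char buffer[2048];
    long received = 0;
    while (received < max_packets) {
        /* The kernel copies the packet into our buffer here. */
        ssize_t len = recv(fd, buffer, sizeof(buffer), 0);
        if (len < 0)
            break;
        received++;
    }
    close(fd);
    return received;
}
```

Calling capture_with_recv("eth6", 1000000) as root would pull a million packets through this path, one system call at a time.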

In addition, we should keep in mind that recv is a system call, and we call it for every new packet arriving at the interface. System calls are usually very fast, but at the speeds of modern 10GE interfaces (up to 14.6 million calls per second) even a simple call becomes expensive for the system simply because of how often it is made.
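A quick back-of-the-envelope calculation shows how tight the budget is. The sketch below assumes a 3.6 GHz clock, which I believe is the base frequency of the i7 3820 used later in this article; treat that number, and the function names, as assumptions for illustration.

```c
/* Nanoseconds available to handle one packet at the given packet rate. */
static double ns_per_packet(double packets_per_second)
{
    return 1e9 / packets_per_second;
}

/* CPU cycles available per packet if a single core has to keep up
 * with the whole stream. cpu_hz is an assumed clock frequency. */
static double cycles_per_packet(double packets_per_second, double cpu_hz)
{
    return cpu_hz / packets_per_second;
}
```

At 14.6 million packets per second, ns_per_packet(14.6e6) gives roughly 68.5 ns per packet, and cycles_per_packet(14.6e6, 3.6e9) gives roughly 247 cycles on one core. A system call plus a memory copy has to fit into that.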

It is also worth noting that a server usually has more than two logical cores, and data can arrive on any of them! The application that accepts data by means of pcap, however, uses only one core. That is where kernel locks come into play, and they slow the capture down significantly. Now we are not only copying memory and processing packets, but also waiting for locks held by other cores to be released. Trust me, locks can take up to 90 percent of the CPU resources of the entire server.

Quite a big list of problems? Now, we’ll heroically try to solve them!

To be specific, let’s assume that we are working with mirror ports, which means that we receive a copy of all the traffic of a specific server from somewhere outside the network. The ports, in their turn, receive a SYN flood of minimum-size packets at a rate of 14.6 mpps/7.6GE.
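As a sanity check on these figures, the theoretical packet rate of a 10 Gbit/s link can be derived from Ethernet framing. The sketch assumes 64-byte minimum frames and 20 extra bytes on the wire per frame (7 bytes of preamble, 1 byte of start-of-frame delimiter, 12 bytes of inter-frame gap); the function names are hypothetical.

```c
/* Maximum packets per second a link can carry for a given frame size.
 * Every frame costs frame_bytes plus 20 bytes of preamble, start-of-frame
 * delimiter and inter-frame gap on the wire. */
static double max_pps(double link_bits_per_second, unsigned frame_bytes)
{
    const unsigned wire_overhead_bytes = 20;
    return link_bits_per_second / (8.0 * (frame_bytes + wire_overhead_bytes));
}

/* Gigabits per second carried by the frames themselves (without the
 * on-wire overhead) at a given packet rate. */
static double frame_gbps(double packets_per_second, unsigned frame_bytes)
{
    return packets_per_second * frame_bytes * 8.0 / 1e9;
}
```

Under these assumptions, max_pps(10e9, 64) is about 14.88 million packets per second, slightly above the 14.6 mpps seen here, and the frames carry about 7.6 Gbit/s at that rate.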

Network: ixgbe, driver version 4.1.1 from SourceForge; OS: Debian 8 Jessie. Module configuration: modprobe ixgbe RSS=8,8 (it’s important!). My CPU is an i7 3820 with 8 logical cores. That’s why I use 8 everywhere, in the code as well. You should use the number of cores you have.

Let’s Distribute Interrupts to Available Cores

Please note that our port receives packets whose destination MAC addresses do not match the MAC address of our network card. Otherwise, the Linux TCP/IP stack would kick in and the machine would choke on the traffic. This is really important. We are talking only about capturing third-party traffic, not about processing it (although my approach is perfect for this).

Let’s see how much traffic we can accept when listening to the entire traffic.

Enable promisc mode on the network card:

ifconfig eth6 promisc

After that, we will see quite an unpleasant thing in htop: a complete overload of one of the cores:

1 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
2 [ 0.0%]
3 [ 0.0%]
4 [ 0.0%]
5 [ 0.0%]
6 [ 0.0%]
7 [ 0.0%]
8 [ 0.0%]

To determine the speed on the interface, we will use a special script pps.sh: gist.github.com/pavel-odintsov/bc287860335e872db9a5

#!/bin/bash
INTERVAL="1" # update interval in seconds
if [ -z "$1" ]; then
    echo
    echo usage: $0 [network-interface]
    echo
    echo e.g. $0 eth0
    echo
    echo shows packets-per-second
    exit
fi
IF=$1
while true
do
    R1=`cat /sys/class/net/$1/statistics/rx_packets`
    T1=`cat /sys/class/net/$1/statistics/tx_packets`
    sleep $INTERVAL
    R2=`cat /sys/class/net/$1/statistics/rx_packets`
    T2=`cat /sys/class/net/$1/statistics/tx_packets`
    TXPPS=`expr $T2 - $T1`
    RXPPS=`expr $R2 - $R1`
    echo "TX $1: $TXPPS pkts/s RX $1: $RXPPS pkts/s"
done

The speed on the interface is quite low, about 4 million packets per second:

bash /root/pps.sh eth6
TX eth6: 0 pkts/s RX eth6: 3882721 pkts/s
TX eth6: 0 pkts/s RX eth6: 3745027 pkts/s

To solve this problem and distribute the load to all logical cores (I’ve got 8), we should run the following script: gist.github.com/pavel-odintsov/9b065f96900da40c5301. It will distribute interrupts from all 8 queues of the network card to all available logical cores.

#!/bin/bash
ncpus=`grep -ciw ^processor /proc/cpuinfo`
test "$ncpus" -gt 1 || exit 1
n=0
for irq in `cat /proc/interrupts | grep eth | awk '{print $1}' | sed s/\://g`
do
    f="/proc/irq/$irq/smp_affinity"
    test -r "$f" || continue
    cpu=$[$ncpus - ($n % $ncpus) - 1]
    if [ $cpu -ge 0 ]
    then
        mask=`printf %x $[2 ** $cpu]`
        echo "Assign SMP affinity: eth queue $n, irq $irq, cpu $cpu, mask 0x$mask"
        echo "$mask" > "$f"
        let n+=1
    fi
done
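The arithmetic in this script is easy to misread, so here is the same mapping from queue number to CPU and affinity mask as a small C sketch. The function names are hypothetical; the formulas are taken directly from the script.

```c
#include <stdio.h>

/* CPU chosen for the n-th queue: the script counts down from the last CPU
 * and wraps around once every CPU has been used. */
static int cpu_for_queue(int queue, int ncpus)
{
    return ncpus - (queue % ncpus) - 1;
}

/* Bit mask with only the chosen CPU's bit set (2 ** cpu in the script). */
static unsigned long affinity_mask(int cpu)
{
    return 1UL << cpu;
}

/* Format the mask as hexadecimal text, the way the script writes it
 * to /proc/irq/<irq>/smp_affinity. */
static void format_mask(char *out, size_t size, int queue, int ncpus)
{
    snprintf(out, size, "%lx", affinity_mask(cpu_for_queue(queue, ncpus)));
}
```

With 8 CPUs, queue 0 lands on CPU 7 (mask 80), queue 7 lands on CPU 0 (mask 1), and a ninth queue would wrap around to CPU 7 again.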

Great! The speed has increased to 12 mpps (but this is not capture yet; it only shows that we can read traffic from the network at this speed):

bash /root/pps.sh eth6
TX eth6: 0 pkts/s RX eth6: 12528942 pkts/s
TX eth6: 0 pkts/s RX eth6: 12491898 pkts/s
TX eth6: 0 pkts/s RX eth6: 12554312 pkts/s

Load on cores has stabilized:

1 [||||| 7.4%]
2 [||||||| 9.7%]
3 [|||||| 8.9%]
4 [|| 2.8%]
5 [||| 4.1%]
6 [||| 3.9%]
7 [||| 4.1%]
8 [||||| 7.8%]

Please note that there will be two examples of code used in the text. Here they are: AF_PACKET, AF_PACKET + FANOUT: gist.github.com/pavel-odintsov/c2154f7799325aed46ae

AF_PACKET RX_RING, AF_PACKET + RX_RING + FANOUT: gist.github.com/pavel-odintsov/15b7435e484134650f20.

These are complete applications with the maximum level of optimization. I am not providing intermediate, obviously slower versions of the code, but every optimization is controlled by a flag that is highlighted and declared in the code as a bool, so you can easily reproduce each step in your own variant.

The First Attempt to Launch AF_PACKET Capture without Optimizations

So, let’s run the application for traffic capture by means of AF_PACKET:

We process: 222048 pps
We process: 186315 pps

The load is way too high:

1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.1%]
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||| 84.1%]
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||| 79.8%]
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 88.3%]
5 [||||||||||||||||||||||||||||||||||||||||||||||||||||||| 83.7%]
6 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.7%]
7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 89.8%]
8 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||90.9%]

The reason is that the kernel is drowning in locks, which eat all the CPU time:

Samples: 303K of event 'cpu-clock', Event count (approx.): 53015222600
59.57% [kernel] [k] _raw_spin_lock
9.13% [kernel] [k] packet_rcv
7.23% [ixgbe] [k] ixgbe_clean_rx_irq
3.35% [kernel] [k] pvclock_clocksource_read
2.76% [kernel] [k] __netif_receive_skb_core
2.00% [kernel] [k] dev_gro_receive
1.98% [kernel] [k] consume_skb
1.94% [kernel] [k] build_skb
1.42% [kernel] [k] kmem_cache_alloc
1.39% [kernel] [k] kmem_cache_free
0.93% [kernel] [k] inet_gro_receive
0.89% [kernel] [k] __netdev_alloc_frag
0.79% [kernel] [k] tcp_gro_receive

The Optimization of AF_PACKET Capture Using FANOUT

So, what do we do? Let’s think a bit :) Locks occur when several processors try to use the same resource. In our case, this happens because we have a single socket served by a single application, so all 8 logical processors contend for it instead of doing useful work.

The great FANOUT feature will come to our rescue. With it, we can run several processes for AF_PACKET. It goes without saying that the optimal number of processes in our case equals the number of logical cores. In addition, we can choose the algorithm by which data is distributed among these sockets. I have chosen PACKET_FANOUT_CPU mode because the data is already evenly distributed among the queues of the network card. In my opinion, it’s the least resource-intensive balancing option (I’m not sure, though; take a look at the kernel code).
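For reference, joining a fanout group is a single setsockopt call on an already opened AF_PACKET socket. The sketch below shows the per-CPU mode (PACKET_FANOUT_CPU in linux/if_packet.h); the helper names are hypothetical and the code is an illustration, not an excerpt from the gist.

```c
#include <linux/if_packet.h>  /* PACKET_FANOUT, PACKET_FANOUT_CPU */
#include <stdint.h>
#include <sys/socket.h>       /* setsockopt, SOL_PACKET */

/* Pack the fanout option value: the group id goes into the low 16 bits,
 * the balancing mode into the high 16 bits. */
static int make_fanout_arg(uint16_t group_id, uint16_t mode)
{
    return (int)((uint32_t)group_id | ((uint32_t)mode << 16));
}

/* Add an AF_PACKET socket to a fanout group that balances by CPU.
 * Every socket that should share the traffic must use the same group id.
 * Returns 0 on success and -1 on error, like setsockopt. */
static int join_fanout_group(int packet_fd, uint16_t group_id)
{
    int arg = make_fanout_arg(group_id, PACKET_FANOUT_CPU);
    return setsockopt(packet_fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
}
```

Each worker would open and bind its own socket and then call join_fanout_group with a shared id; the kernel documentation example derives that id from getpid() & 0xffff.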

In the code sample, set bool use_multiple_fanout_processes = true;

Then run the application again.

It’s magic! We’ve got a tenfold speed-up:

We process: 2250709 pps
We process: 2234301 pps
We process: 2266138 pps

CPUs are still loaded to the maximum:

1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||92.6%]
2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.1%]
3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.2%]
4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.3%]
5 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.1%]
6 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.7%]
7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.7%]
8 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.2%]

However, the perf top picture looks completely different, as there are no locks anymore:

Samples: 1M of event 'cpu-clock', Event count (approx.): 110166379815
17.22% [ixgbe] [k] ixgbe_clean_rx_irq
7.07% [kernel] [k] pvclock_clocksource_read
6.04% [kernel] [k] __netif_receive_skb_core
4.88% [kernel] [k] build_skb
4.76% [kernel] [k] dev_gro_receive
4.28% [kernel] [k] kmem_cache_free
3.95% [kernel] [k] kmem_cache_alloc
3.04% [kernel] [k] packet_rcv
2.47% [kernel] [k] __netdev_alloc_frag
2.39% [kernel] [k] inet_gro_receive
2.29% [kernel] [k] copy_user_generic_string
2.11% [kernel] [k] tcp_gro_receive
2.03% [kernel] [k] _raw_spin_unlock_irqrestore

In addition, sockets (I’m not sure whether this applies to AF_PACKET) allow setting the receive buffer size, SO_RCVBUF, but it gave no results on my test bench.
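If you want to repeat this experiment, the option is set with an ordinary setsockopt call. This is a generic socket sketch with hypothetical helper names; whether it has any effect on an AF_PACKET socket is exactly the open question above.

```c
#include <sys/socket.h>

/* Request a receive buffer of the given size in bytes.
 * Returns 0 on success and -1 on error, like setsockopt. */
static int set_receive_buffer(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

/* Read back the size the kernel actually applied, or -1 on error. */
static int get_receive_buffer(int fd)
{
    int bytes = 0;
    socklen_t len = sizeof(bytes);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, &len) < 0)
        return -1;
    return bytes;
}
```

On Linux, the value read back is normally not the value requested: the kernel doubles it to leave room for bookkeeping and caps the request at net.core.rmem_max, so it is worth checking with get_receive_buffer what you really got.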

The Optimization of AF_PACKET Capture Using RX_RING Circular Buffer

What should we do now, and why is everything still so slow? The answer lies in the build_skb function: it means that two memory copy operations are still performed within the kernel!

Now, let’s try to deal with memory allocation by means of RX_RING.

Yay! We’ve got 4 MPPS!!!

We process: 3582498 pps
We process: 3757254 pps
We process: 3669876 pps
We process: 3757254 pps
We process: 3815506 pps
We process: 3873758 pps

This speed-up comes from the fact that memory is now copied from the network card buffer only once. No additional copying is performed during the transfer from kernel space to user space. This is achieved through a shared buffer that is allocated in the kernel and mapped into user space.

The way we work has also changed. We no longer have to wait for each individual packet to arrive (remember, that’s overhead). By calling poll, we can wait for a signal that an entire block has been filled, and then start processing it.
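In code, the block-at-a-time approach looks roughly like this. The sketch again assumes the TPACKET_V3 layout from linux/if_packet.h and only sums up the captured bytes; the function names are hypothetical, and real processing would go where the byte counter is updated.

```c
#include <linux/if_packet.h>  /* tpacket_block_desc, tpacket3_hdr, TP_STATUS_* */
#include <poll.h>
#include <stdint.h>

/* Walk every packet header inside one block and return the total number
 * of captured bytes. No system call is needed per packet. */
static uint64_t sum_block_bytes(const struct tpacket_block_desc *block)
{
    uint64_t bytes = 0;
    const uint8_t *cursor =
        (const uint8_t *)block + block->hdr.bh1.offset_to_first_pkt;

    for (uint32_t i = 0; i < block->hdr.bh1.num_pkts; i++) {
        const struct tpacket3_hdr *pkt = (const struct tpacket3_hdr *)cursor;
        bytes += pkt->tp_snaplen;       /* packet data starts at cursor + tp_mac */
        cursor += pkt->tp_next_offset;  /* jump to the next packet in the block */
    }
    return bytes;
}

/* Sleep in poll() until the kernel has handed the block over to user space,
 * process it, then give it back to the kernel for reuse. */
static uint64_t consume_block(int packet_fd, struct tpacket_block_desc *block)
{
    while ((block->hdr.bh1.block_status & TP_STATUS_USER) == 0) {
        struct pollfd pfd = { .fd = packet_fd, .events = POLLIN | POLLERR };
        if (poll(&pfd, 1, -1) < 0)
            return 0;
    }
    uint64_t bytes = sum_block_bytes(block);
    block->hdr.bh1.block_status = TP_STATUS_KERNEL;
    return bytes;
}
```

A capture loop would call consume_block for block 0, 1, 2 and so on, wrapping around at the end of the ring, so that one poll wake-up covers a whole block of packets.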

The Optimization of AF_PACKET Using RX_RING by Means of FANOUT

We still have problems with locks! We can solve this using the good old method: enable FANOUT and allocate a block of memory for each handler thread.

Samples: 778K of event 'cpu-clock', Event count (approx.): 87039903833
74.26% [kernel] [k] _raw_spin_lock
4.55% [ixgbe] [k] ixgbe_clean_rx_irq
3.18% [kernel] [k] tpacket_rcv
2.50% [kernel] [k] pvclock_clocksource_read
1.78% [kernel] [k] __netif_receive_skb_core
1.55% [kernel] [k] sock_def_readable
1.20% [kernel] [k] build_skb
1.19% [kernel] [k] dev_gro_receive
0.95% [kernel] [k] kmem_cache_free
0.93% [kernel] [k] kmem_cache_alloc
0.60% [kernel] [k] inet_gro_receive
0.57% [kernel] [k] kfree_skb
0.52% [kernel] [k] tcp_gro_receive
0.52% [kernel] [k] __netdev_alloc_frag

Now, enable FANOUT mode for the RX_RING version!

Hurray! HIGH SCORE!!! 9 MPPS!!!

We process: 9611580 pps
We process: 8912556 pps
We process: 8941682 pps
We process: 8854304 pps
We process: 8912556 pps
We process: 8941682 pps
We process: 8883430 pps
We process: 8825178 pps

perf top:

Samples: 224K of event 'cpu-clock', Event count (approx.): 42501395417
21.79% [ixgbe] [k] ixgbe_clean_rx_irq
9.96% [kernel] [k] tpacket_rcv
6.58% [kernel] [k] pvclock_clocksource_read
5.88% [kernel] [k] __netif_receive_skb_core
4.99% [kernel] [k] memcpy
4.91% [kernel] [k] dev_gro_receive
4.55% [kernel] [k] build_skb
3.10% [kernel] [k] kmem_cache_alloc
3.09% [kernel] [k] kmem_cache_free
2.63% [kernel] [k] prb_fill_curr_block.isra.57

I should also mention that updating the kernel to 4.0.0 did not increase the speed. It stayed within the same limits, but the load on the cores went down significantly!

1 [||||||||||||||||||||||||||||||||||||| 55.1%]
2 [||||||||||||||||||||||||||||||||||| 52.5%]
3 [|||||||||||||||||||||||||||||||||||||||||| 62.5%]
4 [|||||||||||||||||||||||||||||||||||||||||| 62.5%]
5 [||||||||||||||||||||||||||||||||||||||| 57.7%]
6 [|||||||||||||||||||||||||||||||| 47.7%]
7 [||||||||||||||||||||||||||||||||||||||| 55.9%]
8 [||||||||||||||||||||||||||||||||||||||||| 61.4%]

To sum up, Linux is an absolutely amazing platform for traffic analysis, even in environments where you can’t build a specialized kernel module. This is very good news. There’s hope that upcoming kernel versions will let us process 10GE at the full wire speed of 14.6 million packets per second using a 1800 MHz processor :)

Recommended reading material:

www.kernel.org/doc/Documentation/networking/packet_mmap.txt

man7.org/linux/man-pages/man7/packet.7.html