I just love Shodan. Image from the System Shock wikia, not mine.

In the previous article I went over some potential bottlenecks related to the NIC itself. In this article, we’ll be looking at some useful kernel tweaks and the effects they can have on network traffic.

The series is divided into (likely) these four parts:

Part 1: The NIC

Part 2: The Kernel (this article)

Part 3: Interrupts

Part 4: Going further

In the first part we saw what the NIC ring buffer was and how it could be tweaked. In our case, those tweaks solved only the first part of our issue, and once fixed they exposed another bottleneck further up the OSI model. We still had a bottleneck somewhere, so the next question was: what happens to our packets after the NIC?

The observation

After the NIC issues that we looked at in the previous article, the next observation was a correlation between the connection drop-off cliff during peak time and an increase in packets reported as being “collapsed” and “pruned.” The collapsing usually started increasing an hour or so before the pruning. Tracked in Grafana, it looked something like the following for collapsed packets:

Increase in Collapsed packets.

And for pruned packets:

Increase in Pruned packets.

The above two graphs are from different days and show different time scales; they’re not meant to show a correlation between the two, only that both indicated an increase at some point in time.

If you google the concepts of collapsing and pruning of packets, you might end up on the Wikipedia page about congestion control, which (to me) didn’t make much sense. Below I’ll describe them according to my own understanding.

The Kernel Socket Buffer

The next place the packets get stored is the kernel’s socket buffer. Roughly, it works like this:

1. The NIC puts a newly received packet in its own ring buffer.

2. The kernel triggers an “interrupt” (the topic of the next part in this series), which is sort of the kernel asking the NIC if there are any unhandled packets in the NIC ring buffer.

3. If there are, the kernel copies the packet data from the NIC ring buffer to the part of the machine’s RAM allocated for the kernel, into something called the “socket receive buffer.”

4. The NIC finally removes the packet from its own ring buffer to free up space for future packets.

The details likely depend on which system you’re using, but that is basically what happens.

There are actually two socket buffers: one receive buffer and one write buffer. A simple visualization of two different machines talking to each other can be seen here:

The size of these two buffers can be modified, but we can’t just set the buffer size to a huge number, since that could make the “garbage collection” take long enough to cause other issues, such as random latency spikes.

If the buffer is totally full when new data comes in, and no more GC can be performed, new data will be dropped.

Collapsing

When the kernel socket buffer is nearing its max size, a procedure called “collapsing” is performed, which is sort of like garbage collection. The kernel tries to identify segments in the buffer that have identical metadata and combine them, so that identical metadata doesn’t fill up the buffer. The buffer should be large enough that collapsing doesn’t happen too often, but small enough that when it does happen it won’t block other operations for too long.

Pruning

When no more collapsing can happen, the “pruning” process starts. Pruning is the act of dropping new packets, since they can’t fit in the buffer.

Red Hat’s Performance Tuning Guide describes it like this:

This is a kind of house-keeping where the kernel will try to free space in the receive queue by reducing overhead. However, this operation comes at a CPU cost. If collapsing fails to free sufficient space for additional traffic, then data is “pruned”, meaning the data is dropped from memory and the packet is lost. Therefore, it is best to tune around this condition and avoid the buffer collapsing and pruning altogether. The first step is to identify whether buffer collapsing and pruning is occurring.

Monitoring collapsing and pruning

There’s a great deal of information to be gleaned from the /proc/ file system when troubleshooting Linux machines. When investigating the network, two nice tools to use are netstat and ethtool. To check whether we actually are experiencing collapsing/pruning:

[root@host ~]# netstat -s | egrep "(collapse|prune)"

10051 packets pruned from receive queue because of socket buffer overrun

343734 packets collapsed in receive queue due to low socket buffer

These numbers tell you how many packets were collapsed and pruned since the system was restarted or the driver was reloaded. Wait for a while and run the command again. If the values are increasing, something might not be working as intended (collapsing is sort of okay; pruning we don’t like).

A tip is to set up monitoring of these numbers and send them to a dashboard.
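To make the trend explicit, you can diff two samples of the counters. A minimal sketch follows; the parsing assumes the exact netstat -s phrasing shown above, and the two samples are hard-coded here for illustration (in practice you’d pipe netstat -s into the helper instead):

```shell
# Extract the "pruned" counter from netstat -s style output.
pruned() { awk '/pruned/ {print $1}'; }

# First sample (normally: before=$(netstat -s | pruned)).
before=$(pruned <<'EOF'
10051 packets pruned from receive queue because of socket buffer overrun
343734 packets collapsed in receive queue due to low socket buffer
EOF
)

# Second sample, taken some time later (values made up for illustration).
after=$(pruned <<'EOF'
10072 packets pruned from receive queue because of socket buffer overrun
344120 packets collapsed in receive queue due to low socket buffer
EOF
)

# A growing delta means packets are actively being dropped.
echo "pruned since last sample: $((after - before))"
```

The same helper with `/collapsed/` instead of `/pruned/` gives you the collapse trend.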

Tweaking how collapsing and pruning happens

One thing that can in some cases be a bottleneck is the size of the socket buffer. Normally the size might be enough for your application, but a sudden surge in traffic might make it overflow briefly. On the other hand, setting it too high might cause the collapsing procedure to take too long to finish, causing other pieces of the machinery to time out.

Check the current sizes of the read/write buffers:

[root@host ~]# sysctl -a | egrep "tcp_(r|w)mem"

net.ipv4.tcp_rmem = 4096 1048576 4194304

net.ipv4.tcp_wmem = 4096 1048576 4194304

These numbers tell the kernel to allocate between 4 KiB and 4 MiB for each socket’s receive buffer, with a default starting size of 1 MiB. A great in-depth report from the folks at Cloudflare shows how they used tools such as stap to analytically show how changes like these made a difference. They concluded their discussion on this specific setting with:

Since the receive buffer sizes are fairly large, garbage collection could take a long time. To test this we reduced the max rmem size to 2MiB and repeated the latency measurements. […] Now, these numbers are so much better. With the changed settings the tcp_collapse never took more than 3ms!

The takeaway here is to not blindly copy numbers that worked for someone else, but instead first measure your current situation to have a baseline, and validate any changes you make against that.
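As a concrete sketch of that workflow: the values below mirror the 2 MiB cap from Cloudflare’s experiment and are purely illustrative, not a recommendation, and the actual sysctl -w needs root, so it is left commented out.

```shell
# Candidate tcp_rmem values: min, default, max (in bytes). Illustrative only.
min=4096 def=1048576 max=2097152

# Sanity check before applying anything: min <= default <= max.
if [ "$min" -le "$def" ] && [ "$def" -le "$max" ]; then
    echo "net.ipv4.tcp_rmem = $min $def $max"
    # Record a baseline first, then apply (as root) and re-measure:
    #   netstat -s | egrep "(collapse|prune)" > /tmp/baseline.txt
    #   sysctl -w net.ipv4.tcp_rmem="$min $def $max"
fi
```

Re-run the baseline measurement under comparable load after the change, and compare against what you recorded before touching anything.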

Connection tracking

Each connection is “tracked” by the kernel, even for a while after it finishes. On a server handling a lot of traffic, this tracking table can outgrow its assigned memory. The size of the table can be changed on the fly, though. To check the current values, use sysctl:

[root@host ~]# sysctl -a | egrep "netfilter.nf_conntrack_(max|cou)"

net.netfilter.nf_conntrack_count = 46853

net.netfilter.nf_conntrack_max = 2097152

The count shows how many entries are currently in the tracking table, while the max shows the currently set upper limit. If the count is approaching the max, it might be worth increasing the limit. Open /etc/sysctl.conf and add this line (find a suitable value for your use case; double the current max might be a good starting point):

net.netfilter.nf_conntrack_max = 2097152

If you use the sysctl command to set the value, it won’t survive a restart, so add it to the configuration file (/etc/sysctl.conf) as well.
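For example, to double the max from the values shown above (run as root; 4194304 is simply twice the example max, not a universal value):

```shell
# Takes effect immediately, but is lost on reboot:
sysctl -w net.netfilter.nf_conntrack_max=4194304

# Persist it across reboots as well:
echo "net.netfilter.nf_conntrack_max = 4194304" >> /etc/sysctl.conf
```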

Another thing we have to do when changing the max size is to change the hash size as well. The hash size should be proportional to the max value, and can be calculated like this:

hashsize = nf_conntrack_max / 8

This cannot be set in sysctl.conf, but instead has to be set as a parameter of the nf_conntrack kernel module:

echo 262144 > /sys/module/nf_conntrack/parameters/hashsize

And add it to the /etc/modprobe.conf file (or a file under /etc/modprobe.d/) as well, to make it permanent across reboots:

options nf_conntrack hashsize=262144
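Putting the arithmetic together, here is a small sketch that derives the hashsize from the max. On a live machine you’d read the max with `sysctl -n net.netfilter.nf_conntrack_max`; it is hard-coded here to keep the example self-contained.

```shell
# The max from the example above.
conntrack_max=2097152

# Rule of thumb from above: hashsize = nf_conntrack_max / 8.
hashsize=$((conntrack_max / 8))
echo "$hashsize"   # 262144

# To apply on a running system (as root):
#   echo "$hashsize" > /sys/module/nf_conntrack/parameters/hashsize
```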

Conclusion

In the next part we’ll look at interrupts, SoftIRQs and HardIRQs, which are how packets get moved from the NIC ring buffer to the kernel socket buffer, and from the kernel socket buffer to the application they are intended for. Finally we’ll look at how to monitor and configure these processes.

Thanks for reading, and feel free to comment and ask questions!