For $work I recently came across a fun situation involving TCP/IP sockets that I hadn’t encountered before. In summary, an upstream service would become upset should our service accept a connection, but disconnect it before servicing any request, which appeared to be happening quite frequently while rolling out upgrades.

For reasons beyond our team’s remit, it was impossible to teach the upstream service not to contact us during the upgrade, and due to the nature of the application, it was impossible to use the ancient UNIX trick of handing the old program’s file descriptors over to the new program, since the new program was almost certainly started on a different host.

The goal was to ensure that the upstream service received either a “connection refused” error or a connected socket that would service its commands. That left investigating why the erroneous behavior was occurring, which almost immediately led to a surprising discovery regarding BSD sockets: it is impossible to shut down a TCP listener without a race between the program’s last accept() and its call to close() on the listener socket.

As a quick refresher, TCP servers are usually implemented something like this:

```python
s = socket()             # Create the socket.
s.bind(('0.0.0.0', 80))  # Set address.
s.listen(5)              # Set backlog size and enter LISTEN state.
while not stopped:
    c, _ = s.accept()    # Wait for and accept a new client.
    start_client(c)      # Start a task to handle the client.
s.close()
```

Behind the scenes, the operating system works asynchronously to translate between network frames and kernel-side socket state:

(Picture courtesy of LWN)

Our problem occurs because between accept() and close(), the kernel is still busily accepting new connections on our program’s behalf, and if our program isn’t currently blocked in accept() and ready to receive them, the kernel places those connections into a queue that userspace has almost no control over: the socket’s backlog.

If the kernel accepts a new connection and places it in our backlog between our last accept() and the time we call close(), the kernel is now in possession of an established connection that userspace knows nothing about, along with an instruction from userspace to cease accepting connections. In short, it has no choice but to drop the connection on the floor.
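The race is easy to reproduce on loopback. In the sketch below (hypothetical standalone code, not our actual service), the client’s handshake completes entirely in the kernel, the server closes the listener without ever calling accept(), and the client finds its established connection reset:

```python
import socket
import time

# A listener whose backlog the kernel fills on our behalf.
srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(5)

cli = socket.socket()
cli.connect(srv.getsockname())  # handshake completes inside the kernel

srv.close()      # the un-accepted connection is dropped on the floor
time.sleep(0.1)  # allow the RST to reach the client

err = None
try:
    cli.send(b'hello')
    cli.recv(100)
except OSError as exc:
    err = exc
print(err)  # typically [Errno 104] Connection reset by peer
```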

There is no standard way to prevent this race from occurring. In a sensible design, traffic would not be directed at a listening port while it is being torn down, but we don’t live in a sensible world.

After some chats on Freenode’s #posix channel, it became clear the only solution was to firewall the port during shutdown, allowing time for the kernel to empty the backlog while preventing it from filling again. This approach sucked, not least because it involved changing system-global firewall state, but also because our application ran beneath a distributed job scheduler that started our server as non-root.

Linux and BPF to the rescue

It is rare for decades of undisciplined tinkering with Linux esoterica to pay off, but this was such an occasion. Unlike on BSD, where Berkeley Packet Filter is implemented as a root-only device that attaches to entire network interfaces, on Linux it is implemented as a socket option that usually attaches to AF_PACKET or SOCK_RAW sockets. It is a little-known fact, however, that such filters can also be attached to AF_INET sockets, and better yet, doing so does not require root. Essentially, Linux allows non-root programs to configure their own little private firewall.

Creating a filter

BPF filters are passed to the kernel as an array of structures containing 4 integers:

code: bitfield containing the opcode, source operand, and instruction class.
jt: positive jump offset if the condition is true; 0 means program counter + 1, and so on.
jf: positive jump offset if the condition is false.
k: extra value used by some instructions.

While it’s possible to build this array by hand or using bpf_asm, it’s far more convenient to ask tcpdump to dump the compile result for a filter expression:

```
# tcpdump -d 'proto ipv6'
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 4
(002) ldb      [23]
...
(010) ret      #262144
(011) ret      #0

# tcpdump -dd 'proto ipv6'
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 2, 0x00000800 },
{ 0x30, 0, 0, 0x00000017 },
...
{ 0x6, 0, 0, 0x00040000 },
{ 0x6, 0, 0, 0x00000000 },
```
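The `-dd` output can be pasted straight into a program and attached with setsockopt(). Below is a minimal sketch assuming Linux: attach_filter is a hypothetical helper, and the SO_ATTACH_FILTER constant (26 in the kernel headers) is not exposed by Python’s socket module:

```python
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26  # from <asm-generic/socket.h>; absent from Python's socket module

def attach_filter(sock, instructions):
    """Attach a classic BPF program, given as (code, jt, jf, k) tuples
    pasted from `tcpdump -dd`, to an already-open socket."""
    # struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
    blob = b''.join(struct.pack('HBBI', *insn) for insn in instructions)
    buf = ctypes.create_string_buffer(blob)
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    prog = struct.pack('HL', len(instructions), ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, prog)
```

The kernel copies the program during setsockopt(), so the Python-side buffer need not outlive the call, and SO_DETACH_FILTER removes the filter again if required.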

One caveat is that unlike a firewall rule, BPF filters attached to an AF_INET SOCK_STREAM socket do not have complete visibility of the incoming network frame, a fact that does not appear to be documented anywhere except the kernel source. In this case, only the TCP headers and data (if any) are visible within the filter. While this is sufficient for our needs, it makes it slightly more difficult to use tcpdump as a compiler, since programs it outputs expect offset zero of the filter buffer to point to an Ethernet frame.

This is easily worked around by referring to the TCP header via the ether array, which always points to offset 0. For example, to match destination port 8080 (bytes 2..4 in the TCP header), we write ether[2:2] == 8080 .
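As a quick sanity check that the offset and byte order line up: BPF’s ldh loads 16 bits in network byte order, and 8080 in network byte order is 0x1f90, which is the constant the compiled comparison will contain:

```python
import struct

# The TCP destination port occupies bytes 2..4 of the header, big-endian,
# so 'ether[2:2] == 8080' compiles to a comparison against 0x1f90.
print(struct.pack('!H', 8080).hex())  # 1f90
```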

Dropping SYN frames

By attaching a filter that drops incoming frames with the SYN bit set, we can ensure the kernel will accept no new incoming connection handshakes that will end up in the socket backlog. TCP flags are stored in byte 13 of the TCP header, so to allow everything except SYN frames we write:

```
# tcpdump -dd 'ether[13] != 0x02'
{ 0x30, 0, 0, 0x0000000d },
{ 0x15, 0, 1, 0x00000002 },
{ 0x6, 0, 0, 0x00000000 },
{ 0x6, 0, 0, 0x00040000 },
```
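To convince ourselves these four instructions do what we want, we can step through them with a tiny interpreter for the three opcodes involved (a sketch; the opcode values come from the classic BPF instruction set):

```python
def run_bpf(prog, pkt):
    """Interpret the three classic BPF opcodes the filter uses:
    ldb (0x30), jeq (0x15) and ret (0x06)."""
    a = 0   # accumulator
    pc = 0  # program counter
    while True:
        code, jt, jf, k = prog[pc]
        if code == 0x30:    # ldb: load the byte at absolute offset k
            a = pkt[k]
        elif code == 0x15:  # jeq: skip jt instructions if a == k, else jf
            pc += jt if a == k else jf
        elif code == 0x06:  # ret: accept k bytes of the frame (0 = drop)
            return k
        pc += 1

# The four instructions emitted by tcpdump above.
prog = [
    (0x30, 0, 0, 0x0000000d),
    (0x15, 0, 1, 0x00000002),
    (0x06, 0, 0, 0x00000000),
    (0x06, 0, 0, 0x00040000),
]

syn = bytes(13) + bytes([0x02])  # fake TCP header: only SYN set
ack = bytes(13) + bytes([0x10])  # only ACK set
assert run_bpf(prog, syn) == 0         # SYN frames are dropped
assert run_bpf(prog, ack) == 0x40000   # everything else passes
```

Note that the expression compares the whole flags byte against 0x02, so it matches only frames with exactly SYN set; a SYN carrying ECN bits, for example, would slip through, and the tcpdump expression can be adjusted if that matters in your environment.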

Cleaning up

After installing the filter, we must briefly wait for outstanding SYN+ACK, ACK exchanges to complete, then drain the listening socket’s backlog by polling for it to become unreadable and calling accept() as necessary to ensure all established connections have been seen.
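Draining can be implemented with an ordinary readability poll. A sketch, where drain_backlog is a hypothetical helper and the timeout doubles as the grace period for in-flight handshakes:

```python
import select
import socket

def drain_backlog(listener, timeout=1.0):
    """accept() until the listener has been unreadable for `timeout`
    seconds, returning every connection found in the backlog."""
    drained = []
    while True:
        readable, _, _ = select.select([listener], [], [], timeout)
        if not readable:
            return drained  # backlog is verifiably empty
        conn, _ = listener.accept()
        drained.append(conn)
```

After installing the filter, the server would call drain_backlog(s), hand the returned connections to the normal client-handling path, and only then close() the listener.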

Notice that while we avoided a race with the established connection backlog, it does not seem possible to block remaining SYN+ACK, ACK frames using BPF, or ask the kernel how many incomplete connection handshakes exist. We have traded one unpredictable race for another that is much easier to manage, particularly within the confines of a datacenter where latencies are rarely excessive.

Clients connecting to the listener while the filter is installed will retry sending their SYN until the listener is closed, which destroys its filter and thus allows the kernel’s default behaviour of responding with an RST, since a listening socket no longer exists on that port. The net effect is that a client receives either an established connection we know about, or a “connection refused” error, perhaps after a short delay.

Proof of concept

client.py and server.py are a minimal example of the problem, and how the BPF filter solves it.

The server runs in a loop accepting connections until a file named stop exists. If the file contains the word graceful , then instead of simply calling close() , it installs a filter before sitting in a loop until the backlog is verifiably empty.

The client runs in a loop connecting to the server as often as possible, stopping only when connections are being refused, and printing a message any time an error occurs on an established connection.

Stopping the server ungracefully, we see the client prints:

```
$ python client.py
1: [Errno 104] Connection reset by peer
2: [Errno 104] Connection reset by peer
4: [Errno 104] Connection reset by peer
5: [Errno 104] Connection reset by peer
$
```

However on stopping it gracefully, we see it completes without error:

```
$ python client.py
$
```

On the server side when gracefully stopping, numerous connections are handled that would otherwise have been dropped on the floor:

```
$ python server.py
backlog!
backlog!
backlog!
backlog!
backlog!
backlog!
$
```