TCP is supposed to guarantee that all bytes sent by one endpoint of a connection will be received, in the same order, by the other endpoint. In this article we’ll identify and demonstrate a wrinkle in the Linux implementation of TCP SYN cookies. The client can connect and send two packets, but the server’s TCP stack delivers the data in the second packet to the application, unaware that it is not the first packet in the stream.

Summary

The problem of silently-dropped data occurs when all of the following conditions are true:

SYN cookies are enabled, and the listening port is under SYN flood conditions.

The client application speaks first.

The client’s first ACK, sent as part of the three-way handshake to establish the connection, is not delivered.

The client first sends a small data packet (three bytes or fewer), which is also not delivered.

Then, before any retransmission of the first data packet and without waiting for the server to reply, the client sends a second packet, which is delivered.

The effect is as shown in this diagram. The session shown here, with this sequence of system calls and packets, is the example we’ll use throughout this article.

The result is that although the client sent “dog” and “cat”, the server only received “cat”. Neither endpoint had any idea anything went missing. In this article we’ll look at how and why this can happen, with regard to the Linux implementation of SYN cookies. We’ll also provide C code which reproduces the problem.

The symptom was originally noticed last week by our QA team. A Kognitio server was intermittently behaving oddly when multiple clients connected at once. Occasionally one of the clients would be kicked off, because the server didn’t understand its initial login message. After much investigation, we found that the client was sending the first two bytes of its login message in a separate send() call for uninteresting reasons, and that the server wasn’t receiving these first two bytes.

Background reading

For the rest of this article we’ll assume the reader is familiar with TCP – specifically the three-way handshake by which a connection is established, and the purpose of sequence numbers and acknowledgement numbers. Here’s a reminder of the most important points…

TCP basics

Every TCP packet contains a sequence number, which allows the TCP stack to recognise repeated packets, missing packets, and packets arriving out of order, so that the application sees only a stream of bytes in the correct order. Each endpoint chooses an initial, arbitrary value for its sequence number. An endpoint’s sequence number is incremented for each data byte sent.

Most TCP packets also contain an acknowledgement number, by which we tell the other endpoint that we have received all their data up to this sequence number. If data goes unacknowledged for more than a reasonable length of time, it’s up to the sender to re-send the data.

A TCP connection is established as follows:

The client sends a SYN packet with the client’s initial sequence number.

The server replies with a SYN+ACK packet acknowledging the SYN and containing the server’s initial sequence number.

The client sends an ACK packet acknowledging the SYN+ACK, and the connection is established.

One more TCP concept playing an important role in this issue is the maximum segment size, or MSS. This is the maximum number of data bytes a host is prepared to receive per TCP packet. The default MSS for a TCP connection is 536 bytes, but a higher value may be – and usually is – specified in the client’s initial SYN packet.

SYN cookies

You’ll need to know what SYN cookies are and what they’re for. If you don’t, the Wikipedia article on SYN cookies is worth a read, but note that the specific calculation illustrated there is different from how Linux does it.

In one sentence, a SYN cookie is an initial sequence number specially chosen by the server to encode information which allows it to forget about a partly-set-up connection until the client replies with the ACK to complete the connection establishment.

Application C code

Where we refer to application code, we’ll be talking about the Linux system calls in C used for network sockets, such as connect(), accept(), send() and recv().

What’s your end’s sequence number again? Oh, that must be it

Under SYN cookie conditions, when the server receives the client’s first ACK to complete the establishment of the connection, the whole point is that the server has no record of the half-set-up connection. It must recreate the connection state – that is, both endpoints’ addresses and port numbers, the client’s initial sequence number, its own initial sequence number, and the maximum segment size – using only the information in the client’s ACK packet.

The server knows the source and destination port and IP address – these are in every TCP packet.

The server can infer what its own initial sequence number was (the sequence number of the SYN+ACK). It’s the acknowledgement number of the client’s ACK packet minus one, as required by TCP. The server chose this specially, so it also has encoded within it the (approximate) MSS and a low-resolution timestamp (to protect against replay attacks). So we can reconstruct those.

The server can also infer what the client’s initial sequence number was – it’s the sequence number of the ACK packet, minus one.

Or is it?

Dropped and reordered packets

What if the client’s ACK packet never got delivered? And furthermore, what if the client’s first data packet, containing the 3-byte message “dog”, also went missing? The server won’t notice anything is amiss. Until it receives something from the client, it doesn’t even have any record of the connection.

Normally the client would notice that “dog” wasn’t acknowledged by the server in a reasonable time, and resend it. But let’s say that long before this happens, the client application makes the second send() call, to send “cat”. The client sends it in a new packet.

The ACK flag is set on this packet, because it contains an acknowledgement number. This acknowledgement number is the server’s initial sequence number plus one. The server thinks this must be an ACK to establish a connection, subtracts one from the acknowledgment number and checks that it looks like a valid SYN cookie for this connection, and it does.

The server then assumes that the client’s initial sequence number is one less than the sequence number of this packet. This is only a correct assumption if the client hasn’t previously sent us any data. In fact, in our case the client’s initial sequence number is four less than the sequence number of this “cat” packet, but the server doesn’t know this. The effect is that the server happily serves “cat” up to the application, assuming it’s the first data packet. That’s how we get the situation shown in the diagram from earlier.

The server also sends the client an ACK to acknowledge everything up to the sequence number corresponding to the end of the “cat” message. The client thinks the server is acknowledging six bytes, so it thinks everything arrived. The server only intended to acknowledge three bytes, but the server and client have different ideas of what the client’s initial sequence number was. The client application sent “dog” and “cat”, and the server application received only “cat”.

SYN cookies on Linux

The problem could be avoided if the client’s initial sequence number were somehow encoded in the SYN cookie. Then, when the server received the ACK, it could reliably check that this really was the client’s first ACK and not some other packet from the middle of the data stream.

The Linux TCP implementation does in fact use the client’s initial sequence number in the calculation of the SYN cookie. However, a self-confessed “extra hack” means this isn’t quite as effective as it perhaps ought to be.

Generating the SYN cookie

The relevant function in the Linux kernel is secure_tcp_syn_cookie():

static __u32 secure_tcp_syn_cookie(__be32 saddr, __be32 daddr, __be16 sport,
                                   __be16 dport, __u32 sseq, __u32 data)
{
        /*
         * Compute the secure sequence number.
         * The output should be:
         *    HASH(sec1,saddr,sport,daddr,dport,sec1) + sseq + (count * 2^24)
         *    + (HASH(sec2,saddr,sport,daddr,dport,count,sec2) % 2^24).
         * Where sseq is their sequence number and count increases every
         * minute by 1.
         * As an extra hack, we add a small "data" value that encodes the
         * MSS into the second hash value.
         */
        u32 count = tcp_cookie_time();
        return (cookie_hash(saddr, daddr, sport, dport, 0, 0) +
                sseq + (count << COOKIEBITS) +
                ((cookie_hash(saddr, daddr, sport, dport, count, 1) + data)
                 & COOKIEMASK));
}

On Linux, the SYN cookie is formed by putting the bottom eight bits of the timestamp in the top eight bits of the cookie (count << COOKIEBITS), then adding a hash value (constant for the connection), the lower 24 bits of another hash value (which depends on the timestamp), and the client’s initial sequence number (sseq). Finally it adds data, a number between 0 and 3 which tells us which value from msstab is to be used as the maximum segment size (MSS). It’s this attempt to squeeze data into the cookie, alongside the client’s initial sequence number, that causes our problem.

Unpicking and verifying the SYN cookie

When the server gets the first ACK from the client, it has to check that the acknowledgement number minus 1 looks like a SYN cookie. The function it uses is check_tcp_syn_cookie():

/*
 * This retrieves the small "data" value from the syncookie.
 * If the syncookie is bad, the data returned will be out of
 * range. This must be checked by the caller.
 *
 * The count value used to generate the cookie must be less than
 * MAX_SYNCOOKIE_AGE minutes in the past.
 * The return value (__u32)-1 if this test fails.
 */
static __u32 check_tcp_syn_cookie(__u32 cookie, __be32 saddr, __be32 daddr,
                                  __be16 sport, __be16 dport, __u32 sseq)
{
        u32 diff, count = tcp_cookie_time();

        /* Strip away the layers from the cookie */
        cookie -= cookie_hash(saddr, daddr, sport, dport, 0, 0) + sseq;

        /* Cookie is now reduced to (count * 2^24) ^ (hash % 2^24) */
        diff = (count - (cookie >> COOKIEBITS)) & ((__u32) -1 >> COOKIEBITS);
        if (diff >= MAX_SYNCOOKIE_AGE)
                return (__u32)-1;

        return (cookie -
                cookie_hash(saddr, daddr, sport, dport, count - diff, 1))
                & COOKIEMASK;   /* Leaving the data behind */
}

This function takes our alleged SYN cookie, subtracts the first hash value and what it assumes to be the client’s initial sequence number (sseq, which was derived by subtracting 1 from this packet’s sequence number), to leave the timestamp in the top eight bits and the rest in the lower 24 bits. It checks that the timestamp is within a small delta of the current timestamp. If so, it considers only the lower 24 bits and subtracts the second hash, to leave the MSS index value (data), which was between 0 and 3. This value is usually 3, which represents the highest MSS in the msstab table (1460).

It’s up to the caller to check that the value returned is a valid index into msstab, and if it isn’t, then it’s not considered to be a valid SYN cookie and we reset the connection. Certainly if the sequence number of this packet is nowhere near the real initial sequence number, we won’t return a number between 0 and 3, and this will cause the server to send an RST packet to reject the connection.

The problem comes if the sequence number sseq is wrong, but by less than or equal to the original value of data. Let’s revisit our example where the client sends two packets in succession to see how that can happen.

SYN cookies ate my dog, then told me I never had a dog, just a low MSS

The client sends a SYN, and the server sends a SYN+ACK with a SYN cookie. The ACK from the three-way handshake gets lost, and the first data packet (“dog”) also gets lost. The client’s second data packet (“cat”) is the first packet since the SYN to be delivered to the server. It has a sequence number of clients_initial_sequence_number + 4 (1 from the ACK’s sequence number being 1 more than the SYN, and 3 from the three bytes that have already been sent).

In this case, sseq is three more than we expect. When check_tcp_syn_cookie() subtracts it from cookie , cookie ends up being three less than it should be. And instead of returning 3, it returns 0. That’s still a valid index into msstab , so the connection is accepted with a lower-than-necessary MSS, the first three bytes have gone missing from the stream, and no system call has indicated an error.

In summary, the discrepancy in the packet’s sequence number compared to what we expect it to be, which should tip us off that this isn’t the first data packet, is mistaken for a smaller-than-usual MSS value.

Proof of concept

We can demonstrate the problem using a pair of simple C programs, dogclient and dogserver. Here we reproduce the problem using the loopback interface, but the problem is also reproducible over a real network interface.

dogservcli.zip contains the C code, makefile and README, for compiling and running on Linux.

Server program

dogserver creates a socket, binds it to a port, and calls listen() to make it a listening socket. For this call it uses a backlog of 1, to make SYN flood conditions more likely and increase the chance of reproducing the problem. It then runs in a loop, accepting connections with accept() and handing each connection off to a newly-created thread.

Each connection thread reads three bytes from the client using recv(). If these three bytes are “dog” then everything worked as expected. We reply with a zero byte to the client, close the connection, and finish the thread. If these three bytes are anything else, we print a diagnostic to standard output, reply with -1 to the client and close the connection.

Client program

dogclient creates a number of threads (10, by default). Each thread runs in a loop. On each iteration it connects to the server, calls send() once to send “dog”, calls it a second time to send “cat”, reads the one-byte response, and if it’s anything other than zero, cries foul. Then we close the connection. We keep looping until something interesting happens, such as a system call failure or the server sending us something other than zero, or until we’ve iterated a certain number of times (default 100).

Expectation v Reality

Clearly if all is working as expected, the server will only ever receive “dog” followed by “cat”, and the client will only ever receive a zero byte in response. We might reasonably expect one of the socket calls to fail in unusual circumstances, but we’d never expect the server to receive “cat” without “dog”.

What we see is that the server application occasionally receives “cat” and not “dog”, and no system call fails.

Here’s the output from the client…

$ ./dogclient localhost 12345
9 threads remaining
8 threads remaining
7 threads remaining
./dogclient: received 0xff from server - SYN cookies ate my dog!
6 threads remaining
./dogclient: received 0xff from server - SYN cookies ate my dog!
5 threads remaining
4 threads remaining
./dogclient: received 0xff from server - SYN cookies ate my dog!
3 threads remaining
./dogclient: received 0xff from server - SYN cookies ate my dog!
2 threads remaining
<...snip more progress output...>
thread 0: stopped, 4/100 attempts
thread 1: stopped (server received incorrect data), 2/100 attempts
thread 2: stopped (server received incorrect data), 3/100 attempts
thread 3: finished, 100/100 attempts
thread 4: stopped (server received incorrect data), 1/100 attempts
thread 5: stopped, 2/100 attempts
thread 6: stopped (server received incorrect data), 1/100 attempts
thread 7: stopped, 2/100 attempts
thread 8: finished, 100/100 attempts
thread 9: finished, 100/100 attempts

And here’s the output from the server…

$ ./dogserver -p 12345
fd 5: received 3 bytes, first three bytes are "cat", expected "dog". I'm 127.0.0.1:12345, they're 127.0.0.1:36548.
fd 6: received 3 bytes, first three bytes are "cat", expected "dog". I'm 127.0.0.1:12345, they're 127.0.0.1:36536.
fd 5: received 3 bytes, first three bytes are "cat", expected "dog". I'm 127.0.0.1:12345, they're 127.0.0.1:36550.
fd 4: received 3 bytes, first three bytes are "cat", expected "dog". I'm 127.0.0.1:12345, they're 127.0.0.1:36546.

In this case, four sessions out of several hundred reproduced the problem. Here is the tcpdump output for one of those sessions (client port 36548, server port 12345). The client sends two three-byte packets (seq 1:4 and seq 4:7), and the server acknowledges up to the end of the second packet (ack 7).

10:39:33.279385 IP 127.0.0.1.36548 > 127.0.0.1.12345: Flags [S], seq 3197089423, win 43690, options [mss 65495,sackOK,TS val 39007549 ecr 0,nop,wscale 7], length 0
10:39:33.279389 IP 127.0.0.1.12345 > 127.0.0.1.36548: Flags [S.], seq 2614047420, ack 3197089424, win 43690, options [mss 65495,sackOK,TS val 39007511 ecr 39007549,nop,wscale 7], length 0
10:39:33.279394 IP 127.0.0.1.36548 > 127.0.0.1.12345: Flags [.], ack 1, win 342, options [nop,nop,TS val 39007549 ecr 39007511], length 0
10:39:33.279400 IP 127.0.0.1.36548 > 127.0.0.1.12345: Flags [P.], seq 1:4, ack 1, win 342, options [nop,nop,TS val 39007549 ecr 39007511], length 3
10:39:33.485759 IP 127.0.0.1.36548 > 127.0.0.1.12345: Flags [P.], seq 4:7, ack 1, win 342, options [nop,nop,TS val 39007601 ecr 39007511], length 3
10:39:33.485782 IP 127.0.0.1.12345 > 127.0.0.1.36548: Flags [.], ack 7, win 84, options [nop,nop,TS val 39007601 ecr 39007601], length 0
10:39:33.485901 IP 127.0.0.1.12345 > 127.0.0.1.36548: Flags [P.], seq 1:2, ack 7, win 84, options [nop,nop,TS val 39007601 ecr 39007601], length 1
10:39:33.485914 IP 127.0.0.1.12345 > 127.0.0.1.36548: Flags [F.], seq 2, ack 7, win 84, options [nop,nop,TS val 39007601 ecr 39007601], length 0
10:39:33.485954 IP 127.0.0.1.36548 > 127.0.0.1.12345: Flags [.], ack 2, win 342, options [nop,nop,TS val 39007601 ecr 39007601], length 0
10:39:33.485989 IP 127.0.0.1.36548 > 127.0.0.1.12345: Flags [F.], seq 7, ack 3, win 342, options [nop,nop,TS val 39007601 ecr 39007601], length 0
10:39:33.485994 IP 127.0.0.1.12345 > 127.0.0.1.36548: Flags [.], ack 8, win 84, options [nop,nop,TS val 39007601 ecr 39007601], length 0

Connection reset if the first, missing, data packet is >3 bytes

If our initial packet is four bytes rather than three (we send “dogs”), we do not reproduce the missing-data problem. This is because check_tcp_syn_cookie() would return -1 (modulo 2^24) in this case, which isn’t a valid index into msstab. Instead we see the occasional “Connection reset by peer”, caused by the server failing (narrowly) to verify that the sequence number of the second packet corresponds to a valid SYN cookie. (Could these be the “magic resets” that SYN cookies never cause?)

The server application produces no output, because it either sees a connection which works as expected or sees no connection at all, because it was never correctly established.

The client application has some of its sessions reset by the server:

$ ./dogclient -m 4 localhost 12345
9 threads remaining
8 threads remaining
7 threads remaining
./dogclient: recv (local port 37356): Connection reset by peer
./dogclient: recv (local port 37358): Connection reset by peer
6 threads remaining
./dogclient: recv (local port 37360): Connection reset by peer
4 threads remaining
./dogclient: recv (local port 37362): Connection reset by peer
./dogclient: recv (local port 37480): Connection reset by peer
./dogclient: recv (local port 37458): Connection reset by peer
1 threads remaining
0 threads remaining
thread 0: stopped, 12/100 attempts
thread 1: stopped, 1/100 attempts
thread 2: stopped, 4/100 attempts
thread 3: finished, 100/100 attempts
thread 4: stopped, 20/100 attempts
thread 5: stopped, 12/100 attempts
thread 6: stopped, 21/100 attempts
thread 7: finished, 100/100 attempts
thread 8: finished, 100/100 attempts
thread 9: stopped, 19/100 attempts

And here is the tcpdump log for one of those sessions (port 37356), showing the server resetting the connection.

10:47:01.242234 IP 127.0.0.1.37356 > 127.0.0.1.12345: Flags [S], seq 2359901025, win 43690, options [mss 65495,sackOK,TS val 39119540 ecr 0,nop,wscale 7], length 0
10:47:01.242239 IP 127.0.0.1.12345 > 127.0.0.1.37356: Flags [S.], seq 405802755, ack 2359901026, win 43690, options [mss 65495,sackOK,TS val 39119511 ecr 39119540,nop,wscale 7], length 0
10:47:01.242244 IP 127.0.0.1.37356 > 127.0.0.1.12345: Flags [.], ack 1, win 342, options [nop,nop,TS val 39119540 ecr 39119511], length 0
10:47:01.242251 IP 127.0.0.1.37356 > 127.0.0.1.12345: Flags [P.], seq 1:5, ack 1, win 342, options [nop,nop,TS val 39119540 ecr 39119511], length 4
10:47:01.449736 IP 127.0.0.1.37356 > 127.0.0.1.12345: Flags [P.], seq 5:8, ack 1, win 342, options [nop,nop,TS val 39119592 ecr 39119511], length 3
10:47:01.449764 IP 127.0.0.1.12345 > 127.0.0.1.37356: Flags [R], seq 405802756, win 0, length 0

Mitigation

The problem of missing data can potentially affect any client-server TCP application in which SYN cookies are enabled on the server, the client speaks first after establishing a connection, the first packet it sends to the server is three bytes or smaller, and further data is sent before the server is expected to reply and before the client TCP stack resends the first packet.

On Linux, if the client’s first packet is more than three bytes and a second packet is sent, this will not cause data to go silently missing from the TCP stream but it may cause the connection to be reset.

One workaround would be for the client to send its entire initial message in one packet if possible. This can be done by building up the message in a user buffer, or using the TCP_CORK option, until the first message is ready to send.

The likelihood of the problem occurring on the server can be reduced by ensuring the listening socket has a reasonably-sized backlog queue. This reduces the chance of SYN flood conditions arising, and hence of SYN cookies being used, in normal operation.

Acknowledgements

Jonathan Oddy helped to identify the behaviour of the Linux SYN cookie implementation when packets are lost, and worked through the arithmetic regarding Linux’s calculation of the cookie. We can also blame him for coming up with the title for this article.