[ 2015-October-05 16:15 ]

At Twitter, a team had an unusual failure where corrupt data ended up in memcache. The root cause appears to have been a switch that was corrupting packets. Most packets were being dropped and the throughput was much lower than normal, but some were still making it through. The hypothesis is that occasionally the corrupt packets had valid TCP and Ethernet checksums. One "lucky" packet stored corrupt data in memcache. Even after the switch was replaced, the errors continued until the cache was cleared. [Update 2016-02-12: Root cause found: this also involved a kernel bug!]

I was very excited to hear about this error, because it is a real-world example of something I wrote about seven years ago: The TCP checksum is weak. However, the Ethernet CRC is strong, so how could a corrupt packet pass both checks? The answer is that the Ethernet CRC is recalculated by switches. If the switch corrupts the packet and it has the same TCP checksum, the hardware blindly recalculates a new, valid Ethernet CRC when it goes out.

As Mark Callaghan pointed out, this is a very rare scenario and you should never blame the network without strong evidence. However, it isn't impossible and others have written about similar incidents. My conclusion is that if you are creating a new network protocol, please append a 4 byte CRC (I suggest CRC32C, implemented in hardware on recent Intel, AMD, and ARM CPUs). An alternative is to use an encryption protocol (e.g. TLS), since they include cryptographic hashes (which fixed a similar incident).

The rest of this article describes the details about how this is possible, mostly so I don't forget them.

Properties of the TCP checksum

The TCP checksum is two bytes long, and can detect any burst error of 15 bits, and most burst errors of 16 bits (excluding switching 0x0000 and 0xffff). This means that to keep the same checksum, a packet must be corrupted in at least two locations, at least 2 bytes apart. If the corruption is purely random, we should expect approximately 1 in 2^16 (65,536, or approximately 0.0015%) of corrupt packets to not be detected. This seems small, but on one Gigabit Ethernet connection, that could be as many as 15 packets per second. For details about how to compute the TCP checksum and its error properties, see RFC 1071.
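As a rough illustration of this weakness, the following Python sketch computes a 16-bit one's-complement checksum in the style of RFC 1071 and shows two compensating changes cancelling out. The function name internet_checksum and the memcache-like sample payload are made up for this example, and a real TCP checksum also covers a pseudo-header and the TCP header, which are left out here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Hypothetical payload, chosen only for illustration.
original = b"set key 0 0 5\r\nhello\r\n"

# Corrupt two different 16-bit words so that the changes cancel:
# add 1 to the low byte of word 0 and subtract 1 from the low byte of word 2.
damaged = bytearray(original)
damaged[1] += 1
damaged[5] -= 1
corrupt = bytes(damaged)

print(original != corrupt)
print(internet_checksum(original) == internet_checksum(corrupt))
```

Both comparisons come out true: the payload changed, but the sum of its 16-bit words did not, so a receiver relying only on this checksum would accept the corrupt data.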

Properties of the Ethernet CRC

The Ethernet CRC is substantially stronger, partly because it is twice as long (4 bytes), and partly because CRCs have "good" mathematical properties, such as detecting all 3 bit errors in 1500 byte Ethernet packets (understanding this is beyond my math skills). It appears that most switches discard packets with invalid CRCs when they are received, and recalculate the CRC when the packet goes back out. This means the CRC really only protects against corruption on the wire, and not inside the switch. Why not just re-send the existing CRC? Modern switch chips have features that modify packets, such as VLANs or explicit congestion notification. Hence, it is simpler to always recompute the CRC. For a detailed description, see Denton Gentry's description of how the Ethernet CRC doesn't protect very much.
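To make the difference between corruption on the wire and corruption inside the switch concrete, here is a small Python sketch. It uses zlib.crc32 as a stand-in for the Ethernet frame check sequence (it uses the same CRC-32 polynomial, although real hardware details such as bit ordering on the wire are ignored). The helpers add_fcs, fcs_ok, and flip_bit are invented for this example, and the "switch" is only a simulation of the behaviour described above.

```python
import zlib


def add_fcs(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload, like a sender or switch would."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def fcs_ok(frame: bytes) -> bool:
    """Check whether the trailing 4 bytes match the CRC-32 of the rest."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == fcs


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


payload = b"set key 0 0 5\r\nhello\r\n"
frame = add_fcs(payload)

# Case 1: a bit flips on the wire. The CRC no longer matches.
wire_damaged = flip_bit(frame, 3)

# Case 2: a (simulated) faulty switch accepts the frame, flips a bit
# in its own memory, then recomputes the CRC on the way out.
switch_damaged = add_fcs(flip_bit(frame[:-4], 3))

print(fcs_ok(frame), fcs_ok(wire_damaged), fcs_ok(switch_damaged))
```

The first and third frames pass the check and the second fails, even though the third frame carries corrupt data. The recomputed CRC vouches only for what the switch sent, not for what the original sender wrote.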

There is one small complication that does not change this cause of failure, but does change how you might detect it. Some switches support cut-through switching, where packets begin being forwarded as soon as the destination address is read, without waiting for the entire packet. In this case, the switch is already sending the packet before it can validate it, so it absolutely cannot recalculate the CRC. These switches typically support something called "CRC stomping" to ensure the outgoing CRC is invalid, so the ultimate receiver will eventually discard it. This gets more complicated when a destination port is already in use when a new packet arrives. In this case, cut-through switches must buffer packets, and then act like a store-and-forward switch. Hence, cut-through switching does not prevent switches from corrupting packets and appending a valid Ethernet CRC. See Cisco's white paper on cut-through switching and Cut-through, corruption and CRC-stomping for more details.

Conclusion: Use CRCs

The conclusion is that when transmitting or storing data, you should always include strong CRCs that protect the data all the way from the sender to the final receiver. Please don't invent new network protocols without them.
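As a minimal sketch of what that might look like, the following Python code appends a 4-byte CRC to each message and verifies it on receipt. Python's standard library has no CRC32C, so zlib.crc32 (plain CRC-32) is used here as a stand-in. The functions encode_message and decode_message and the message layout are assumptions made for this example, not part of any existing protocol.

```python
import zlib


def encode_message(payload: bytes) -> bytes:
    """Append a 4-byte CRC computed by the original sender."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def decode_message(message: bytes) -> bytes:
    """Verify the trailing CRC and return the payload, or raise ValueError."""
    if len(message) < 4:
        raise ValueError("message too short to contain a CRC")
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("CRC mismatch")
    return payload


message = encode_message(b"set key 0 0 5\r\nhello\r\n")
print(decode_message(message))

# Apply two compensating byte changes of the kind that leave a 16-bit
# one's-complement sum unchanged; the end-to-end CRC still catches them.
damaged = bytearray(message)
damaged[1] += 1
damaged[5] -= 1
try:
    decode_message(bytes(damaged))
except ValueError as err:
    print("rejected:", err)
```

Because the CRC is computed by the original sender and checked by the final receiver, nothing in between has the opportunity to recompute it over corrupt data.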