Approximately two years ago I tried to figure out whether aggressive marketing of deep buffer data center switches makes sense, recorded a few podcasts on the topic and organized a webinar with JR Rivers.

Not surprisingly, the question keeps popping up, so it seems it’s time for another series of TL&DR articles. Let’s start with the basics:

Every network will eventually experience congestion. What matters is how the network deals with the congestion.

Network congestion results in either packet drops or increased latency, and you can’t avoid both of them at the same time without alleviating congestion.

Don’t believe in the magic powers of QoS - it’s a zero-sum game.

The only way to alleviate network congestion is to reduce the amount of data injected into the network;

You can control the network load by dropping excess traffic before it enters the network (ingress policing), or by persuading the senders to reduce the transmission rate.

If you’re worried about the impact of packet drops you might want to avoid policing (= packet drops) at the network edge and focus on adjusting the senders’ transmission rate. You could use static mechanisms (traffic shaping) or try to guesstimate how much traffic the network can carry. Time for another set of facts:

Early networking technologies implemented a plethora of backpressure mechanisms that a network node could use to reduce the ingress traffic. These technologies include per-link flow control (X.25 and Fibre Channel buffer-to-buffer credits) and stop-and-start mechanisms (XON/XOFF, PAUSE frames, Priority Flow Control).

Hop-by-hop backpressure mechanisms usually aim to implement lossless transport network, resulting in suboptimal performance or deep buffer requirements on links with large round-trip-times (RTT). In-depth analysis of this claim is left as an exercise for the reader.

There are tons of other problems with hop-by-hop backpressure mechanisms including increased amount of network state and head-of-line blocking. Yet again, we won’t go into details (but feel free to figure them out).

Designers of IP decided not to deal with this particular can of worms:

IP networks have no hop-by-hop backpressure mechanism apart from what the underlying layer-2 technology might have;

Commonly used layer-2 technologies (Ethernet, PPP, or SONET/SDH links) have no backpressure, the only exceptions being lossless Ethernet and reliable PPP;

The domain of a potential backpressure mechanism is limited to a single layer-2 domain (until the first moment a packet is routed). That’s why it’s a total nonsense to talk about lossless packet forwarding in routed networks including VXLAN-based layer-2 networks.

The transport protocol in the sending node has to guestimate the state of the network (end-to-end bandwidth and current congestion);

UDP is unreliable transport protocol and therefore (A) does not care whether the packets sent into the network are delivered or not and (B) does not adjust the sending rate.

TCP (as a reliable transport protocol) has to estimate available network resources to minimize retransmissions and optimize goodput.

Finally we’re getting somewhere. How does TCP do it?

Lacking explicit backpressure mechanism in IP networks (with ECN being somewhat of an exception) the only signals available to TCP are packet drops and increased latency.

Algorithms responding to increased latency are always starved when competing with algorithms responding to packet loss as output queues in network devices cause increased latency way before network devices start dropping packets.

Most TCP congestion avoidance algorithms therefore respond to packet drops not latency increase. The usual response to a packet drop is to halve the sending rate.

Unless you know better (see also: Understanding Your Environment) it’s better to have reasonable amount of packet drop instead of increased latency. Not surprisingly, not everyone got that memo (see also: Bufferbloat).

As always, there are tons of exceptions:

BBR congestion control uses more complex algorithm that includes estimated round-trip-time;

Network devices could set TCP ECN bits to indicate impending congestion. Unfortunately most traditional TCP implementations treat ECN bits in the same way as packet drops;

Data Center TCP modifies response to ECN bits resulting in gradual sending rate reduction;

The negative impact of deep buffers can be alleviated with advanced Active Queue Management (AQM) algorithms like CoDel. I haven’t seen a hardware implementation of CoDel yet, and would love to understand how much buffering CoDel really needs.

Now for the crucial questions:

Are packet drops bad? According to engineers familiar with modern TCP implementations (see below for details) the answer is NO (due to selective retransmission).

Do they affect performance? Of course (see above), but then it’s unrealistic to expect unlimited performance;

Can they reduce the throughout of a single TCP session below the available network capacity? Yes, assuming that they happen so frequently that the flow of data is interrupted because the receiver cannot send selective ACKs soon enough. Welcome to the Bandwidth-Delay Product twilight zone.

Can they cause underutilization of networking infrastructure? Yes, assuming your network carries a small number of high-bandwidth TCP sessions. In more mainstream environments you’d see tens of thousands of TCP sessions on every high speed link and the statistics would take care of the problem.

Speaking of small number of TCP sessions… it’s worth mentioning that measurements on low-speed links show a clear advantage of delaying TCP traffic (traffic shaping) versus dropping excess traffic. However, you should also keep in mind that we’re expecting TCP to deal with drastically different environments, from 2400-baud modems and WiFi networks to 100 Gbps data center links, and you cannot expect measurements done at one extreme set of conditions to be relevant across the whole range of environments.

To summarize (again, based on what people who should know better are telling me):

Drops are not a big deal in low-latency environments;

Shallow buffers (and corresponding drops) might even be beneficial because they keep latency low;

Congested low-speed links require smarter AQM algorithms like CoDel - not a problem because in those environments you can do packet scheduling in software;

We can do better than respond to drops, but there’s no simple solution;

Mobile networks that perform buffering and retransmission in the radio network are a totally different story and require extensive optimizations.

Have I missed anything? Please write a comment!

Further information

Want to know more? Start with:

Finally, check out the How Networks Really Work series starting on June 18th. You could join the live session for free if you send me a really good question.