Sometimes we hear that crazy developer talk about some magical thing you can do that will increase performance everywhere by 30% (feel free to replace that percentage with whatever sits right for you).

In the past week or so I have been playing the role of “that guy”: the ranting lunatic. Sometimes this crazy guy throws all sorts of other terms around that make him sound even more crazy, like TCP, or Slow Start, or latency… and so on.

So we ignore that guy. He is clearly crazy.

## Why is the web slow?

Turns out that when it comes to a fast web, the odds were always stacked against us. The root of the problem is that TCP, and in particular the congestion control algorithm we all use, Slow Start, happens to be very problematic in the context of the web and HTTP.

Whenever I download a web page from a site there are a series of underlying events that need to happen.

1. A connection needs to be established to the web server (1 round trip).
2. A request needs to be transmitted to the server.
3. The server needs to send us the data.

Simple.

I am able to download stuff at about 1 megabyte a second. It follows that if I need to download a 30KB web page I only need two round trips: the first to establish a connection, the second to ask for the data and get it. Since my connection is so fast I can grab the data at lightning speed, even if my latency is bad.

My round trip to New York (from Sydney, Australia) takes about 310ms (give or take a few ms):

```
Pinging stackoverflow.com [64.34.119.12] with 32 bytes of data:
Reply from 64.34.119.12: bytes=32 time=316ms TTL=43
```

It may get a bit faster as routers are upgraded and new fibre is laid; however, it is governed by the speed of light. Sydney to New York is 15,988km. The speed of light is approximately 299,792km per second, so the fastest possible round trip to New York would be about 106ms. At least until superluminal communication becomes reality.
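Since we are doing back-of-the-envelope physics, the bound is easy to check (the distance and speed-of-light figures are the ones above):

```python
# Lower bound on Sydney <-> New York round-trip time,
# using great-circle distance and the speed of light in vacuum.
DISTANCE_KM = 15_988        # Sydney to New York, one way
LIGHT_KM_PER_S = 299_792    # speed of light in vacuum

rtt_ms = 2 * DISTANCE_KM / LIGHT_KM_PER_S * 1000
print(f"theoretical minimum RTT: {rtt_ms:.1f}ms")
```

In practice light in fibre is slower than light in vacuum, and packets do not travel on a great circle, which is why the measured 316ms is roughly triple the theoretical floor.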

Back to reality: two round trips to grab a 30KB page is not that bad. However, once you start measuring… the results do not agree with this naive theory.

The reality is that downloading 34k of data often takes upwards of a second. What is going on? Am I on dial-up? Is my Internet broken? Is Australia broken?

Nope.

The reality is that to reach my maximal transfer speed, TCP needs to ramp up the number of segments that are allowed to be in transit unacknowledged, a.k.a. the congestion window. RFC 5681 says that when a connection starts up you are allowed at most 4 segments initially in transit and unacknowledged (3 for a typical Ethernet-sized MSS). Once they are acknowledged the window grows exponentially, roughly doubling every round trip during slow start. In practice the initial congestion window (IW) on Linux and Windows is set to 2 or 3 depending on various factors. The algorithm used to grow the congestion window afterwards may also differ (Vegas vs. CUBIC, etc.) but usually follows the pattern of exponential growth, compensating for certain factors.

Say you have an initial congestion window of 2 and you can fit 1452 bytes of data in a segment. Assuming you have an established connection, infinite bandwidth and 0% packet loss, it takes:

1 round trip to get 2904 bytes, Initial Window (IW) = 2

2 round trips to get 8712 bytes, Congestion Window (CW)=4

3 round trips to get 20328 bytes, CW = 8

4 round trips to get 43560 bytes, CW = 16
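The doubling above is mechanical enough to sketch in a few lines. This is a simplified model (my own helper, not a real TCP implementation): no packet loss, no delayed ACKs, and the window doubles on every round trip.

```python
def round_trips(total_bytes, iw=2, mss=1452):
    """Round trips needed to receive total_bytes on a fresh
    connection, assuming the congestion window starts at iw
    segments and doubles each round trip, with no loss."""
    cwnd, received, trips = iw, 0, 0
    while received < total_bytes:
        received += cwnd * mss   # one window's worth per round trip
        cwnd *= 2                # slow start: double the window
        trips += 1
    return trips

print(round_trips(34 * 1024, iw=2))   # 4 round trips
print(round_trips(34 * 1024, iw=10))  # 2 round trips
```

Running it with an initial window of 2 reproduces the 4-round-trip figure for a 34KB page, and previews the improvement an initial window of 10 buys.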

In reality we do get packet loss, and receivers often acknowledge only every second segment (delayed ACKs), so the real numbers may be worse.

Transferring 34KB of data from NY to Sydney takes 4 round trips with an initial window of 2, which explains the image above. It makes sense that I would be waiting over a second for 34KB.

You may think that HTTP keep-alive helps a lot, but it does not: the congestion window is reset to its initial value quite aggressively when the connection sits idle.

TCP Slow Start is there to protect us from a flooded Internet. However, all the parameters were defined decades ago in a totally different context, way before broadband and HTTP were pervasive.

Recently, Google has been pushing a change that would allow us to increase this number to 10, and it looks very likely that this change will be ratified.

This change drastically cuts down the number of round trips you need to transfer data:

1 round trip to get 14520 bytes, IW = 10

2 round trips to get 43560 bytes, CW = 20

In concrete terms, the same page that took 1.3 seconds to download could take 650ms. Furthermore, we will have a much larger amount of useful data after the first round trip.

That is not the only issue causing the web to be slow; SPDY tries to solve some of the others, such as poor connection utilization and the inability to perform multiple concurrent requests over a single connection (like HTTP pipelining without FIFO ordering), and so on.

Unfortunately, even if SPDY is adopted we are still going to be stuck with 2 round trips for a single page. In some theoretical magical world we could get page transfer over SCTP, which would allow us to cut down on a connection round trip (and probably introduce another 99 problems).

## Show me some pretty pictures

Enough with the theory; I went ahead and set up a small demonstration of this phenomenon.

I host my blog on a VM, which I updated to the 3.2.0 Linux kernel using Debian backports. I happen to have a second VM running on the same metal, which runs a Windows server release.

I created a simple web page that allows me to simulate the effect of round trips:

```html
<!DOCTYPE html>
<html>
<head>
<title>35k Page</title>
<style type="text/css">
div {display: block; width: 7px; height: 12px; background-color: #aaa; float: left; border-bottom: 14px solid #ddd;}
div.cp {background-color:#777;clear:both;}
</style>
</head>
<body><div class='cp'></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div> ... etc
```

The cp class is repeated approximately every 1452 bytes to help approximate segments.
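A page like that is easy to generate. Here is a sketch of the sort of script that produces it (a reconstruction, not my exact original; the only important property is the cp marker landing roughly every 1452 bytes):

```python
# Generate a ~35KB test page; a darker "checkpoint" div every
# ~1452 bytes approximates TCP segment boundaries.
HEAD = ('<!DOCTYPE html><html><head><title>35k Page</title>'
        '<style type="text/css">'
        'div {display: block; width: 7px; height: 12px; '
        'background-color: #aaa; float: left; '
        'border-bottom: 14px solid #ddd;} '
        'div.cp {background-color:#777;clear:both;}'
        '</style></head><body>')
TAIL = '</body></html>'

SEGMENT = 1452
FILLER = '<div></div>'              # 11 bytes of plain grey block
MARKER = "<div class='cp'></div>"   # 22 bytes: darker, clears the row

# One marker, then enough plain divs to fill one segment exactly.
per_segment = MARKER + FILLER * ((SEGMENT - len(MARKER)) // len(FILLER))
page = HEAD + per_segment * 24 + TAIL   # 24 segments, ~35KB total
print(len(page))
```

Each cp div starts a new visual row, so in a screenshot you can literally count roughly how many segments have arrived.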

Then I used the awesome webpagetest.org to test downloading the page. The results speak louder than anything else I have written here (you can view them here):

Both of the requests start off the same way: one round trip to set up the TCP connection, and a second before any data appears. This places us at T+700ms. Then things diverge. The fast Linux kernel (top row) is able to get significantly more data through in the first pass. The Windows box delivers 2 rows (approximately 2 segments), the Linux one about 6.

At 1.1 seconds the Windows box catches up temporarily, but then at 1.3 seconds the Linux box delivers the second chunk of packets.

At 1.5 seconds the Linux box is done and the Windows box is only half way through.

At 1.9 seconds the Windows box is done.

Translated into figures, the Linux box with the IW of 10 is 21 percent faster when you look at total time. If you discount the connection round trip it is about 25 percent faster. All of this without a single change to the application.
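For the sceptical, the arithmetic behind those percentages (figures read off the waterfall above; the 0.31s connection round trip is the Sydney to New York ping time):

```python
# Timings from the webpagetest waterfall.
windows_total = 1.9    # seconds until the Windows box is done
linux_total = 1.5      # seconds until the Linux box is done
connection_rtt = 0.31  # Sydney <-> New York round trip

saving = windows_total - linux_total
print(f"total time: {saving / windows_total:.0%} faster")
print(f"excluding connection setup: "
      f"{saving / (windows_total - connection_rtt):.0%} faster")
```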

## What does this mean to me?

Enterprise Linux distributions are slow to adopt the latest stable kernel. Enterprisey software likes to play it safe, for very good reasons. SUSE Linux Enterprise is possibly the first enterprise distro to ship a 3.0 kernel; Debian, CentOS, Red Hat and so on are all still on 2.6 kernels. This leaves you with a few options:

1. Play the waiting game: wait till your enterprise distro backports the changes into the 2.6 line or upgrades to a 3.0 kernel.
2. Install a backported kernel.
3. Install a separate machine running, say, nginx and a 3.0 kernel, and have it proxy your web traffic.

## What about Windows?

Provided you are on Windows 2008 R2 and have http://support.microsoft.com/kb/2472264 installed, you can update your initial congestion window using the commands:

```
c:\> netsh interface tcp set supplemental template=custom icw=10
c:\> netsh interface tcp set supplemental template=custom
```

See Andy’s blog for detailed instructions.

## Summary

At Stack Overflow, we see truly global traffic, with huge numbers of visits coming from India, China and Australia, places that are geographically very far from New York. We need to cut down on round trips if we want to perform well.

Sure, CDNs help, but our core dynamic content is yet to be CDN accelerated. This simple change can give us a 20-30% performance edge.

It is often quoted that every 100ms of latency costs you 1% of your sales. Here is a free way to get a very significant speed boost.