EDIT 2/20/14: Updated to reflect correct response time metric

In part 1 of our post, one of the items we discussed was our issues with using DNS as a load balancing solution. To recap, at the end of our last post we were still set up with Dyn’s load balancing solution, and our servers were receiving a disproportionate amount of traffic. Server failures were still not as seamless as we wanted because DNS TTLs are not always obeyed, and our response times were a lot higher than we wanted them to be, hovering around 200-250ms.

In part 2 of this post, I’ll cover the following:

How we improved our issues with server failure and response time by using Amazon’s ELB service

The performance gains we saw from enabling HTTP keepalives

Future steps for performance improvements

But first, before I dive into the ELB, there’s one topic I left out of my last post that I want to mention.

TCP Congestion Window (cwnd)

In TCP congestion control, there exists a variable called the “congestion window”, commonly referred to as cwnd. The initial value of cwnd is often referred to as “initcwnd”. Once the initial TCP handshake is done and we begin to send data, the cwnd determines how many bytes we can send before we have to wait for an ACK from the client. Let’s look at a graphic of how different initcwnd values affect TCP latency from a paper Google released.

At Chartbeat, we’re currently running Ubuntu 10.04 LTS (I know, I know, we’re in the process of upgrading to 12.04 as this is being written), which ships with kernel 2.6.32. Starting in kernel 2.6.39, thanks to some research from Google, the default initcwnd was changed from 3 to 10. If you are serving up content greater than 4380 bytes (3 * 1460), you will benefit from increasing your initcwnd due to the ability to have more data in flight (BDP, or bandwidth delay product) before having to wait for an ACK. The average response size from ping.chartbeat.net is way under that, at around 43 bytes, so this change had no benefit to us at the time when the servers were not behind the ELB. We’ll see why increasing the initcwnd helped us later in the post when we discuss HTTP keepalives.
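To make the 4380 byte threshold concrete, here is a rough sketch that counts how many round trips an idealized slow start needs to deliver a response. It assumes a 1460 byte segment size (the figure used above), no packet loss, a window that doubles every round trip, and a receiver window that never limits us. The function name is made up for illustration and real TCP stacks are more nuanced than this.

```python
import math

MSS_BYTES = 1460  # segment size used in the 3 * 1460 = 4380 figure above


def round_trips_to_send(response_bytes: int, initcwnd: int, mss: int = MSS_BYTES) -> int:
    """Count round trips to deliver a response under idealized slow start.

    Assumptions (simplifications, not real kernel behavior): no loss, the
    congestion window doubles each round trip, and the receiver's window
    is never the bottleneck.
    """
    segments_left = math.ceil(response_bytes / mss)
    cwnd = initcwnd
    trips = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window, then wait for ACKs
        cwnd *= 2              # slow start doubles the window
        trips += 1
    return trips


if __name__ == "__main__":
    for size in (43, 4380, 14600, 100_000):
        print(size, round_trips_to_send(size, 3), round_trips_to_send(size, 10))
```

Under these assumptions a 43 byte response takes a single round trip with either setting, which matches why the change did nothing for us on its own, while larger payloads need fewer round trips with an initcwnd of 10.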

ELB (Elastic Load Balancer)

The options for load balancing traffic on AWS are fairly limited. Your choices are:

An ELB

A DNS load balancing service such as Dyn

A homegrown solution using HAProxy, nginx or <insert favorite load balancing software here>

Each of these solutions has its limitations, and depending on your requirements, some may not be suitable for you at all. I won’t go into all of the pros and cons of each solution here since there are plenty of articles on the web discussing these already. I’ll just go over a few that directly affected our choice.

In choosing a homegrown solution, support for high availability and scalability is difficult. Currently with AWS, there’s no support for gratuitous ARP, which is traditionally used to handle failovers in both software and hardware load balancers. To work around this issue, you can use Elastic IPs and homegrown scripts that move the Elastic IP between instances when they detect a failure. In our experience we’ve seen lag times from 30 seconds to a few minutes when moving an Elastic IP. During this time, you would be down hard and not serving up any traffic. The above solution also only works when all your traffic can be handled by one host and you can accept the small period of downtime during failover.

But how would you handle a situation where your traffic was too high for one host? You could launch multiple instances of your homegrown solution, but you would then need to balance the traffic between these instances. We already discussed in part 1 the issue we had with using DNS to handle the balancing of traffic. The only other solution would be to actually use an ELB in front of these instances. If we went with this solution, it meant adding another layer of latency to the request. Did we really need to do something like this?

Most people end up going with a solution like HAProxy because they have more advanced load balancing requirements. ELB only supports round robin request balancing and sticky sessions. Some folks require the ability to do request routing based on URI, weight-based routing or any of the other various algorithms that HAProxy supports. Our requirements for a load balancing solution were fairly straightforward:

Evenly distribute traffic (better than our current DNS solution)

Highly available

Handle our current traffic peak (200k req/sec) and scale beyond that

End-to-End SSL support

ELB best met all these requirements for us. A homegrown solution would have been overkill for our needs. We didn’t need any of the advanced load balancing features. SSL is currently only supported in HAProxy’s development branch (1.5.x), and the stable branch (1.4.x) requires stunnel or nginx for SSL support. We also didn’t need to add any additional layers that would increase our latency even further.

Moving to ELB

The move to using an ELB was fairly straightforward. We contacted Amazon support and our technical account manager to coordinate pre-warming the ELB. According to the ELB best practices guide, ELBs scale gradually as your traffic grows (they should handle a 50% increase in traffic every 5 minutes), but if we suddenly switched 100% of our traffic to the ELB, it would not be able to scale quickly enough and would start throwing errors. We weren’t planning on doing a cutover in that fashion anyway, but to be safe we wanted to ensure the ELB was pre-warmed ahead of time even as we slowly moved over traffic. We added all the servers into the ELB and then did a slow rollout using Dyn’s traffic director solution, which allowed us to weight DNS records. We were able to raise the weight of the ELB record and slowly remove the individual servers’ IPs from ping.chartbeat.net to control the amount of traffic flowing through the ELB.
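To get a feel for why pre-warming mattered, here is a back-of-the-envelope sketch of how long an ELB would take to grow into a target load if it really does absorb a 50% increase every 5 minutes. The starting capacity of 10k req/sec is purely an assumption for illustration (we don’t know what a fresh ELB can handle), and the function name is hypothetical.

```python
def minutes_to_reach(start_rps: float, target_rps: float,
                     growth_per_step: float = 0.5, step_minutes: int = 5) -> int:
    """Estimate minutes until capacity reaches target_rps.

    Models the "50% more traffic every 5 minutes" rule of thumb as simple
    compounding. This is a rough model, not documented ELB behavior.
    """
    if start_rps <= 0 or growth_per_step <= 0:
        raise ValueError("start_rps and growth_per_step must be positive")
    capacity = start_rps
    minutes = 0
    while capacity < target_rps:
        capacity *= 1 + growth_per_step  # one scaling step
        minutes += step_minutes
    return minutes


if __name__ == "__main__":
    # Assumed starting point of 10k req/sec, growing to our 200k req/sec peak.
    print(minutes_to_reach(10_000, 200_000))
```

With that assumed starting point, growing to our 200k req/sec peak would take eight steps, or about 40 minutes, which is far too slow for a sudden cutover.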

Performance gains

We saw large, immediate improvements in our performance with the cutover to the ELB. We saw fewer TCP timeouts and a decrease in our average response time.

We went from roughly 200 ms average response times to 30 ms response times. That’s an 85% decrease in response time! (EDIT 2/20/2014) Thanks to Disqus commenter Mxx for pointing out that we incorrectly measured the response time here. Moving behind the ELB changed the metric from a measure of response time between our servers and clients to a measure of response time between the ELB and our servers. Comparing external data from Pingdom, we still saw a decrease in response time of about 20% at peak traffic times, going from 270ms to 212ms. Apologies for the earlier incorrect statement.
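For anyone who wants to check the arithmetic behind both figures, here is a tiny sketch. The helper name is made up and the inputs are the numbers quoted above.

```python
def percent_decrease(before_ms: float, after_ms: float) -> float:
    """Return the relative decrease from before_ms to after_ms, in percent."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms * 100


if __name__ == "__main__":
    # Original (incorrect) server-side measurement: 200 ms to 30 ms.
    print(round(percent_decrease(200, 30), 1))
    # External Pingdom measurement at peak: 270 ms to 212 ms.
    print(round(percent_decrease(270, 212), 1))
```

The Pingdom numbers work out to a bit over 21%, which is where the "about 20%" comes from.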

Our traffic was now more evenly distributed than with our previous DNS-based solution. We were able to further distribute our traffic shortly after, when Amazon released “Cross-Zone load balancing”.

Enabling cross-zone load balancing got our request count distribution extremely well balanced. The max difference in requests between hosts currently sits around 13k requests over a minute.

KeepAlives

With our servers now behind the ELB, we had one last performance tweak we wanted to enable: HTTP keepalives between our servers and the ELB. Keepalives work by allowing multiple requests over a single connection. In cases where users are loading many objects off your site, this can greatly reduce latency by removing the overhead of re-establishing a connection for each object. CPU savings are seen on the server side since less time is spent opening and closing connections. All this sounds pretty great, so why didn’t we have it enabled beforehand?

There are a few cases where you may not want keepalives enabled on your web server. If you’re only serving up one object from your domain, it doesn’t make much sense to keep a connection hanging around for more requests. Each connection uses up a small amount of RAM. If your web servers don’t have a large amount of RAM and you have a lot of traffic, enabling keepalives could get you into a situation where you consume all the RAM on the server, especially with a high default timeout for the keepalive connection. For Chartbeat, our data comes from clients every 15 seconds, so holding a connection open just to get a small amount of data every 15 seconds would be a waste of resources for us. Fortunately we were able to offload that to the ELB, which enables keepalive connections by default for any HTTP 1.1 client.

With our servers no longer directly exposed to the clients, we could revisit enabling keepalives. We are handling a high number of requests between the ELB and our servers, with the connections coming from a limited set of servers on Amazon’s end. We want the ELBs to be able to proxy as much information as possible to us over one connection and keep that connection open for as long as possible. This is where having a larger initcwnd comes into play. Having a larger initcwnd lowers our latency and gets our bandwidth up to full speed between the servers and the ELB. We expected to see a drop off in the amount of traffic going through the servers as well as some CPU savings.

To ensure there were no issues, we did a “canary” test, putting one server with keepalives enabled into production. The results were not at all what we expected. Traffic to the canary server became extremely spiky and its average response time increased a bit. After talking to Amazon about the issue, we learned that the ELB was favoring the host with keepalives enabled. More traffic was being sent to that host, causing its latency to increase. When the latency increased, the ELB would then send less traffic to the host and the cycle would start over again.

Once we confirmed what the issue was, we proceeded with the keepalive rollout and the traffic went back to being evenly distributed. The number of sockets we had sitting in TIME_WAIT went from around 200k to 15k after enabling keepalives, and CPU utilization dropped by about 20%.
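If you want to watch a TIME_WAIT count like the one above on your own Linux servers, one option is to tally the state column of /proc/net/tcp. The sketch below is only an illustration of that idea, not the tooling we used: the function name is made up, the sample text is hand-written and abbreviated, and only three of the kernel’s state codes are mapped to names.

```python
from collections import Counter

# A few of the TCP state codes found (in hex) in the "st" column of
# /proc/net/tcp on Linux. Unmapped codes are counted under their raw value.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Tally sockets by TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


# Hand-written, abbreviated sample rows for illustration only.
SAMPLE_PROC_NET_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 00000000:0050 00000000:0000 0A 00000000:00000000
   1: 0100007F:0050 0100007F:C350 06 00000000:00000000
   2: 0100007F:0050 0100007F:C351 06 00000000:00000000
   3: 0100007F:0050 0100007F:C352 01 00000000:00000000
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE_PROC_NET_TCP))
```

On a real host you would pass in the contents of /proc/net/tcp (for example, `open("/proc/net/tcp").read()`) instead of the sample text.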

Keepalives and Timeouts

There are a few important things to be aware of when configuring keepalives with your ELB with regard to timeouts. Unfortunately there’s a lack of official documentation on ELB keepalive configuration and behavior, so the information below could only be found through various posts on the official AWS forums.

The default keepalive idle connection timeout is 60 seconds

The keepalive idle connection timeout can be changed to values as low as 1 second and as high as 17 minutes with a support ticket

The keepalive timeout value on your backend server must be higher than that of your ELB connection timeout. If it is lower, the ELB will re-use the idle connection when your server has already dropped the connection, resulting in the client being served up a blank response. The default nginx keepalive_timeout value is safe at 75 seconds with the default ELB timeout of 60 seconds.
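The timeout rule above is easy to turn into a sanity check in a deploy script. Here is a minimal sketch using the two defaults mentioned in this section (60 seconds for the ELB, 75 seconds for nginx); the constant and function names are hypothetical.

```python
ELB_DEFAULT_IDLE_TIMEOUT_S = 60         # default ELB idle timeout noted above
NGINX_DEFAULT_KEEPALIVE_TIMEOUT_S = 75  # default nginx keepalive_timeout noted above


def backend_timeout_is_safe(backend_keepalive_s: int,
                            elb_idle_timeout_s: int = ELB_DEFAULT_IDLE_TIMEOUT_S) -> bool:
    """Return True if the backend keeps idle connections open longer than the ELB.

    The backend value must be strictly higher; otherwise the ELB may reuse a
    connection the backend has already closed.
    """
    return backend_keepalive_s > elb_idle_timeout_s


if __name__ == "__main__":
    print(backend_timeout_is_safe(NGINX_DEFAULT_KEEPALIVE_TIMEOUT_S))  # nginx default
    print(backend_timeout_is_safe(30))  # a backend timeout shorter than the ELB's
```

If you have the ELB timeout raised through a support ticket, the same check applies with the new value passed as the second argument.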

Downsides

While the ELB has worked out great for us and we’ve seen huge performance improvements from switching to using one in front of our servers, there are a few issues we’d love to see addressed in future roll-outs of the ELB:

Lack of bandwidth graphs in CloudWatch. I’m surprised the ELB has been around for this long without this CloudWatch metric. You get charged per GB processed through the ELB, yet there’s no way to see, from Amazon’s view, how much bandwidth is going through your ELB. This could also help identify DoS attacks that don’t involve making actual requests to the ELB.

No ability to pre-warm an ELB without going through support. Right now it’s a process of having to contact Amazon support to get an ELB pre-warmed, and answering a bunch of questions related to your traffic. Even if this process was moved to a web form like how requests for service limit increases are done, it would be better than the current method.

No ability to clone an ELB. Why would you want that? If you have an ELB that is handling a large amount of traffic and you are experiencing issues with it, you cannot easily replace the faulty ELB in a hurry due to the need for new ELBs to scale up slowly. It would be extremely useful to clone an existing one, capturing its fully warmed configuration, and then be able to flip traffic over to it. Right now if there’s an issue, AWS support needs to get involved, and unless you are paying for higher end support, you may not get a fast enough response from support.

No access to the raw logs. A feature to send the ELB logs to an S3 bucket would be very valuable. This would open up a bunch of doors with the ability to set up AWS Data Pipeline to fire off an EMR job or move data into Redshift. Currently all that must be done on the servers behind the ELB.

No official documentation on keepalive configuration or behavior. The ability to change the default keepalive timeout value is not exposed through the API and requires a support ticket.

Conclusions

We learned an important lesson by not monitoring some key metrics on our servers that were having an effect on our performance and reliability. With increasing traffic it’s important to re-evaluate your settings periodically to see if they still make sense for the level of traffic you are receiving. The default TCP sysctl settings will work just fine for a majority of workloads, but when you begin to push your server resources to their limits, you can see big performance increases by making some adjustments to variables in sysctl. Through TCP tuning and utilizing AWS Elastic Load Balancer we were able to:

Decrease our traffic response time by 20%

Decrease our server footprint by 20% on our front end servers

Have failed servers removed from service within seconds

Eliminate dropped packets due to listen queue socket overflows

Next Steps

Since the writing of this article, we’ve done some testing with Amazon’s new C3 instance types and are planning to move from the m1.large instance type to the c3.large. The c3.large is almost 50% cheaper and gives us more compute units which in turn yields slightly better response times.

Our traffic is very cyclical, which lends itself perfectly to taking advantage of Amazon’s auto scaling feature. Take a look at a graph from a week’s worth of traffic.

In the middle of the night (EDT), we see half of what our peak traffic was earlier in the day, and on weekends we see about 1/3 less traffic than on a weekday.

In the coming months we’ll be looking to implement auto scaling to achieve additional cost savings and better handle large, unexpected spikes of traffic.
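As a rough illustration of the savings on the table, the sketch below estimates how many instances each traffic level would need. The per-instance capacity of 10k req/sec and the 20% headroom are assumptions picked only to make the arithmetic easy, not measurements from our servers, and the function name is hypothetical. The traffic levels come from the 200k req/sec peak and the ratios described above.

```python
def instances_needed(traffic_rps: int, per_instance_rps: int, headroom_pct: int = 20) -> int:
    """Return the instance count needed for traffic_rps with spare headroom.

    Uses integer ceiling division so the result is exact. Always returns at
    least one instance.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    required = traffic_rps * (100 + headroom_pct)
    capacity = per_instance_rps * 100
    return max(1, -(-required // capacity))  # ceiling division


PEAK_RPS = 200_000                     # weekday peak from this post
NIGHT_RPS = PEAK_RPS // 2              # about half of peak overnight
WEEKEND_PEAK_RPS = PEAK_RPS * 2 // 3   # about 1/3 less than a weekday
ASSUMED_PER_INSTANCE_RPS = 10_000      # hypothetical capacity, not a measurement

if __name__ == "__main__":
    for label, rps in (("weekday peak", PEAK_RPS),
                       ("overnight", NIGHT_RPS),
                       ("weekend peak", WEEKEND_PEAK_RPS)):
        print(label, instances_needed(rps, ASSUMED_PER_INSTANCE_RPS))
```

Whatever the real per-instance number turns out to be, the ratio is the interesting part: overnight traffic needs roughly half the fleet that the weekday peak does.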

Additional resources:

Special thanks to the following folks for feedback and guidance on this post