[ 2017-July-22 12:34 ]

Recently at Bluecore I ran into a familiar situation: during brief periods of overload, all requests to a server were timing out. It was a catastrophic failure, where the server completely stopped serving. We added more capacity to prevent overload, but we also wanted to fix the server. Instead of exploding, it should answer as many requests as possible, while rejecting the rest.

Catastrophic failures like this usually happen because the server is processing too many requests at the same time. Typically, the server runs out of memory or requests take too long, causing whatever is making the request to give up. The solution is to limit the maximum number of concurrent requests, so some fraction of requests fail, but the others are processed. The challenge is finding the "right" limit. It can vary 1000× depending on the software and the hardware. It also changes as the system evolves, so today's correct value is tomorrow's performance bottleneck.

I've added these limits to many systems and I'm tired of it. Surely there is some "automatic" solution that we could use by default on all servers? Unfortunately, it turns out that it is hard. I think there is an opportunity for some academic research to figure it out, which I am not about to do. Instead, this article shares what I learned while investigating it, so the next time I am fixing a server that falls over when overloaded, I'll know what to do. Hopefully you will too!

The summary: Limit the number of concurrent requests in all servers to avoid cascading failures. You need the limit to be well below the "catastrophic" failure point for your service, but setting it too low will limit the throughput of your system. If I'm forced to guess, I'd start with a limit set to some multiple of the number of CPUs. E.g. maybe 3× the number if you assume that your requests spend 33% of their time executing code, and 67% of their time waiting (e.g. for other network requests or on disk). However, it really is best to determine the limits by forcing your service to explode, because the "right" limit can vary enormously. If you want to optimize things, you can get lower "tail" latency by adding a queue and using a lower concurrent request limit. However, this requires more code and tuning, so I think a simple concurrent limit is good enough in most cases. If you really want the lowest latency, using a FIFO queue with a "drop head" policy should decrease latency in overload scenarios.
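As a rough illustration of the "multiple of the number of CPUs" guess, here is a minimal Python sketch. The function name guess_concurrency_limit and its parameters are made up for this example, and it assumes every request splits its time between computing and waiting in the same proportion, which real workloads rarely do.

```python
import math
import os


def guess_concurrency_limit(cpu_count, compute_ms, wait_ms):
    """Starting guess: enough requests in flight to keep every CPU busy.

    compute_ms and wait_ms are assumed per-request averages: time spent
    executing code versus time spent blocked on the network or disk.
    """
    if cpu_count < 1 or compute_ms <= 0 or wait_ms < 0:
        raise ValueError("need cpu_count >= 1, compute_ms > 0, wait_ms >= 0")
    # While one request waits, the CPU is free to run other requests.
    return math.ceil(cpu_count * (compute_ms + wait_ms) / compute_ms)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    # One third of the time computing, two thirds waiting: 3x the CPU count.
    print(guess_concurrency_limit(cpus, compute_ms=1, wait_ms=2))
```

This only gives a first value to try. It says nothing about where the catastrophic failure point is, which still has to be found by experiment.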

The rest of the article attempts to describe why servers fail when they process too many requests, and why limiting the concurrency helps. This is less polished than I would like. I had some benchmarks and model results I wanted to include but it turns out I've answered the question enough to satisfy my curiosity. Given that this is a fairly niche topic, I'm not sure there is much value in polishing it. I figure it is better to publish this as-is than have it never leave my computer.

Why limit requests?

To start, let's discuss the simplest scenario I can think of, since it helps understand the basics. Let's imagine we have a server with a single CPU, that processes requests by doing 10 ms of computation. Ideally, this server can process 100 requests per second. If requests arrive at a fixed rate less than or equal to 100 requests per second, then the server responds to all requests after 10 ms. However, what happens if requests start arriving at 101 requests per second? Well, after one second, the server could only process 100 requests, so it will be processing 2 requests at the same time. The operating system will attempt to share the CPU, so now each request takes 20 ms. The server still responds to 100 requests per second, but the latency has increased. As the overload continues, the server begins to process more and more concurrent requests, which increases the latency. Eventually there are an infinite number of requests in the server, and it takes an infinite amount of time to respond.
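The arithmetic in this scenario can be written as a very crude model. The Python sketch below assumes a single CPU shared evenly between all in-flight requests, perfectly regular arrivals, and no context-switching overhead. The function name overload_model is invented for this example.

```python
def overload_model(arrival_rps, service_ms, seconds):
    """Return (requests in flight, latency in ms) after `seconds` of load.

    Assumes one CPU shared evenly, so latency grows linearly with the
    number of requests being processed at the same time.
    """
    capacity_rps = 1000 / service_ms            # 10 ms of work -> 100 requests/s
    excess_rps = max(0.0, arrival_rps - capacity_rps)
    in_flight = 1 + excess_rps * seconds        # the backlog only grows under overload
    latency_ms = in_flight * service_ms
    return in_flight, latency_ms


if __name__ == "__main__":
    for seconds in (1, 10, 60, 600):
        in_flight, latency_ms = overload_model(101, 10, seconds)
        print(f"after {seconds:4d} s: {in_flight:.0f} in flight, {latency_ms:.0f} ms latency")
```

Even with only 1% more load than the server can handle, the latency in this model keeps climbing for as long as the overload lasts.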

This is why servers without a limit on the number of requests they process concurrently tend to fail catastrophically: the real world doesn't work with infinite latency. It can also cause clients to retry, which adds even more overload to the system. In the worst cases, the server will hit some resource limit. A very common one is running out of memory, which causes it to get killed and restart. This causes load to shift to other servers, causing them to run out of memory, which takes out the entire service that was carefully replicated for "reliability."

The solution seems easy: Limit the number of requests being processed at the same time, and reject the others. Rejecting a request should be much cheaper than processing a request, so the server should be able to respond to 10-1000× more rejections per second than "real" requests. It is probably still possible to cause a catastrophic failure, but for most applications, the rate of rejections per second it can serve is so far beyond what the system normally processes that it is effectively infinite. The Google SRE book has a chapter about server overload and cascading failures, which I highly recommend if you want to learn more.
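A minimal sketch of this idea in Python, using a non-blocking semaphore from the standard library. The class name ConcurrencyLimiter and the ("rejected", None) return convention are assumptions for this example. A real server would return whatever error its protocol uses for overload.

```python
import threading


class ConcurrencyLimiter:
    """Allow at most max_in_flight handlers to run at once; reject the rest."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_handle(self, handler, *args):
        # Never wait for a slot: rejecting must stay much cheaper than processing.
        if not self._slots.acquire(blocking=False):
            return ("rejected", None)
        try:
            return ("ok", handler(*args))
        finally:
            # Free the slot even if the handler raised an exception.
            self._slots.release()


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_in_flight=1)
    print(limiter.try_handle(lambda name: f"hello {name}", "world"))
```

The important design choice is that a rejected request does no work beyond the failed acquire, so the server can keep answering even when far more requests arrive than it can process.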

Setting concurrent request limits

Now that we know why we need a concurrent request limit, how do we determine the specific value to use? The CPU-bound server we have been talking about is relatively straightforward: If we limit it to one concurrent request, then each request gets processed in exactly 10 ms, and the server can process 100 requests per second if they arrive perfectly spaced out. In reality, though, requests might arrive in clumps. In this case, the server will reject requests that arrive close together, and will be idle while waiting for the next request. On the other hand, if we allow more concurrent requests, then in "busy" periods, processing time goes up. In the worst case, if we accept too many requests, we run out of memory or cause a catastrophic failure again.

The real world is much more complex. Requests vary enormously in their resource consumption. For example, some might have cached results, others may read data from disk, and some might use multiple CPUs. The "right" number of requests in flight to maximize throughput and minimize latency depends on the mix of requests coming in. The bottlenecks also change along with the code, data and the physical systems running the software. An "ideal" solution probably requires monitoring critical resources, and admitting requests only when there is excess capacity. However, setting a fixed limit per server is far simpler, and gets most of the benefit since most applications tend to have pretty "regular" workloads. The critical part is that you must pick a limit that is well below the "catastrophic" limit for your server. The best way of figuring that out is to make it fail in a controlled experiment, although you can also "guess and check" by observing your system in production, if you are willing for your server to break a few times when you get it wrong. Otherwise, in terms of getting good performance, it is probably best to set the limit fairly high. Most systems do a reasonable job of sharing resources between concurrent requests, so this will maximize throughput, at the expense of worst case latency. In many systems, rejecting requests is much worse than a bit of extra latency.

Queuing versus concurrent requests

Now that we have limited the number of requests in flight, what should we do when we reach the limit? The simplest answer is to reject the requests, returning some sort of error. However, for the systems where efficiency is important, we can do better. The key observation is that as long as the server can respond to the request in an "acceptable" amount of time, it should process it. The challenge is that the definition of "acceptable" can vary widely. For batch jobs, "acceptable" might be within an hour. For a request a human is waiting on, a few seconds is probably the absolute maximum. For a request that is part of a much larger distributed system, it might need to respond within 50 ms for the overall task to complete within a reasonable time. Letting requests wait in a queue allows the server to accept bursts of work and keep it busy during future "quiet" periods. This increases utilization and decreases the probability of rejecting a request, at the expense of some increase in latency.

We can achieve a similar effect by just allowing the server to process more requests concurrently. However, this leads to worse latency than a correctly tuned request limit and queue. Consider our single CPU server that processes requests in 10 ms. Imagine a burst of 5 requests arrives at the same time. If we process all 5 at the same time, then the entire burst completes after 50 ms. If instead we accept one request at a time and queue the rest, then the requests finish after 10, 20, 30, 40, and 50 ms. The worst case is the same in both cases, but the distribution is much better with the queue. My conclusion is that an ideal server will have a concurrent request limit that maximizes throughput, then add a small queue to absorb "bursts" of requests.
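The burst example can be checked with a few lines of Python. This sketch assumes a single CPU that is shared perfectly evenly when requests run concurrently. The function names are made up for the example.

```python
from statistics import mean


def all_at_once(burst_size, service_ms):
    """Latencies when the whole burst shares one CPU: everything finishes together."""
    return [burst_size * service_ms] * burst_size


def one_at_a_time(burst_size, service_ms):
    """Latencies when one request runs at a time and the rest wait in a queue."""
    return [service_ms * position for position in range(1, burst_size + 1)]


if __name__ == "__main__":
    for label, latencies in [
        ("concurrent", all_at_once(5, 10)),
        ("queued", one_at_a_time(5, 10)),
    ]:
        print(f"{label:10s} {latencies} mean={mean(latencies)} max={max(latencies)}")
```

Both approaches finish the burst at 50 ms, but the mean latency drops from 50 ms to 30 ms with the queue.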

Queue policies

The most common queue policy seems to be "first in first out" (FIFO), where arriving requests are dropped when the queue is full (sometimes called drop tail). This is probably not the best policy for overloaded servers. Imagine our server is processing a burst of requests, with a request arriving every 5 milliseconds. It has a queue of 3 messages. At steady state, 50% of the requests are dropped immediately, and the others are processed with 40 ms latency (30 ms queue, 10 ms processing). If instead, we drop the oldest request in the queue (the one at the head of the queue instead of the tail), then 50% of requests are dropped after waiting 15 ms in the queue, but the remainder are processed after only 25 ms (15 ms queue, 10 ms processing), since the oldest message was dropped. This provides better latency for the successfully processed requests, at the cost of rejecting requests later. In my opinion, this is probably the better policy for most applications. However, implementing this is a bit more complicated, since you need the ability to reject a request after it has been queued.

The table below shows the arrival time, completion time, and latency for the first 11 requests that arrive. The server has a processing time of 10 ms, and permits 3 requests to be queued. Requests arrive every 5 ms. The servers have reached steady-state at the end of this trace, and each request will either be dropped or processed with the same latencies shown here.

Table: FIFO drop head versus drop tail

All times are in ms. "drop (t)" in a Completion Time column means the request was dropped at time t.

              FIFO drop tail              FIFO drop head
Arrival Time  Completion Time  Latency    Completion Time  Latency
0             10               10         10               10
5             20               15         20               15
10            30               20         30               20
15            40               25         40               25
20            50               30         drop (35)        drop (15)
25            60               35         50               25
30            70               40         drop (45)        drop (15)
35            drop (35)        drop (0)   60               25
40            80               40         drop (55)        drop (15)
45            drop (45)        drop (0)   70               25
50            90               40         drop (65)        drop (15)

The previous discussion compared the policy about which request to drop when a new one arrives. The other policy choice is which request to process when the server is ready. Instead of FIFO, we can process the request that we most recently added to the queue, also called "last in first out" or LIFO. This effectively treats the queue like a stack. We still have two choices of what request to drop when the queue is full: we can drop the arriving request (which I will call drop tail, even though that is a bit confusing), or the oldest request (drop head). During overload, LIFO has even lower latencies. With the scenario above, it processes messages with 15 ms latency (5 ms queue, 10 ms processing). With the drop head policy, messages are dropped after 25 ms. With the drop tail policy, messages are dropped immediately.
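The queue policies discussed so far can be compared with a small simulation. The Python sketch below models a single worker with a fixed processing time. It identifies each request by its arrival time, so arrival times must be unique. It also assumes that when a completion and an arrival happen at the same instant, the completion is handled first. The exact numbers depend on that tie-break. The function name simulate and its parameters are invented for this example.

```python
from collections import deque


def simulate(arrivals, service_ms=10, queue_size=3, lifo=False, drop_head=False):
    """Return {arrival_time: (outcome, end_time)}; outcome is "done" or "drop"."""
    results = {}
    queue = deque()      # oldest request on the left, newest on the right
    busy_until = None    # completion time of the request in service, if any

    def start_next(now):
        nonlocal busy_until
        if queue:
            request = queue.pop() if lifo else queue.popleft()
            busy_until = now + service_ms
            results[request] = ("done", busy_until)
        else:
            busy_until = None

    for now in sorted(arrivals):
        # Handle every completion at or before this arrival (completions first).
        while busy_until is not None and busy_until <= now:
            start_next(busy_until)
        if busy_until is None:
            busy_until = now + service_ms          # idle server: start immediately
            results[now] = ("done", busy_until)
        elif len(queue) < queue_size:
            queue.append(now)
        elif drop_head:
            results[queue.popleft()] = ("drop", now)   # evict the oldest request
            queue.append(now)
        else:
            results[now] = ("drop", now)               # reject the arriving request

    while busy_until is not None:   # drain what is left after the last arrival
        start_next(busy_until)
    return results


if __name__ == "__main__":
    arrivals = range(0, 105, 5)   # keep arriving past 50 ms so the queue stays full
    policies = [
        ("FIFO drop tail", {}),
        ("FIFO drop head", {"drop_head": True}),
        ("LIFO drop head", {"lifo": True, "drop_head": True}),
    ]
    for name, options in policies:
        results = simulate(arrivals, **options)
        print(name)
        for arrival in list(arrivals)[:11]:
            outcome, end = results[arrival]
            print(f"  arrival={arrival:3d} {outcome:4s} at={end:3d} latency={end - arrival}")
```

Under these assumptions, the FIFO runs reproduce the steady-state figures quoted earlier (40 ms latency with immediate drops for drop tail, 25 ms latency with drops after 15 ms for drop head), and the LIFO drop head run settles at 15 ms latency with drops after 25 ms.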

The downside of LIFO is that in bursty scenarios with timeouts, it can fail to process messages successfully that would be okay with FIFO. For example, consider if our server is idle, then gets a burst of 4 messages at once, then is processing a constant rate of 1 message every 10 ms. With FIFO, each message gets processed after 40 ms with no errors. With LIFO, the messages in the queue are "stuck" and will probably time out, while the arriving messages get processed after 15 ms. In my opinion, in most systems handling timeouts and errors is pretty expensive, typically causing retries or user-visible errors. I think this probably means that FIFO is a better policy in most cases, since servers are unlikely to spend much time in the overloaded state where LIFO is better.

Similarities to networking

The work on congestion control in networking is extremely similar. Determining the ideal number of concurrent requests in flight in order to maximize throughput while minimizing delay seems like it is the same problem as determining the optimal number of packets in flight in a network. As a result, we should be able to apply something like the BBR algorithm for TCP sender rate control to servers. However, there are important differences. For example, in networks, packet transmission time depends on the byte length, which is easily observed, so a rate controller has an accurate estimate of the amount of "work" in flight. For servers, the "work" per request can vary from microseconds to minutes, which can make it harder to estimate values like the minimum processing time. However, I still think it is probably possible to apply a similar control system to have a "good enough" estimate without any tuning. I'm not aware of any work on this specific problem.

Once you have a limit, you will likely end up with a queue. CoDel for tuning router queues is very similar to tuning the queue length for a server. Facebook's Wangle C++ networking library provides a queue policy inspired by CoDel, as described in more detail in an ACM Queue article. They claim that their tuning parameters (a target queue delay of 5 milliseconds and an "interval" time of 100 ms) are applicable across servers, which effectively makes this require no tuning. That article also describes switching between FIFO and LIFO during periods of overload for the latency advantages I described earlier.
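A heavily simplified Python sketch of how a CoDel-inspired policy might use those two parameters. It assumes the rule is: if the queue has not been empty at any point during the last interval, a standing queue has formed, so new requests may only wait for the short target delay; otherwise they may wait for the full interval. This illustrates the idea only. It is not Wangle's code or API, and the class and method names are hypothetical.

```python
class AdaptiveQueueTimeout:
    """Choose how long a newly queued request is allowed to wait."""

    def __init__(self, target_ms=5, interval_ms=100):
        self.target_ms = target_ms        # allowed wait while overloaded
        self.interval_ms = interval_ms    # allowed wait in normal operation
        self._last_empty_ms = 0           # last time the queue was seen empty

    def on_queue_empty(self, now_ms):
        # The caller reports whenever the worker finds the queue empty.
        self._last_empty_ms = now_ms

    def timeout_ms(self, now_ms):
        # No empty queue for a whole interval means a standing queue:
        # shed load quickly instead of letting latency build up.
        if now_ms - self._last_empty_ms > self.interval_ms:
            return self.target_ms
        return self.interval_ms


if __name__ == "__main__":
    policy = AdaptiveQueueTimeout()
    policy.on_queue_empty(now_ms=0)
    print(policy.timeout_ms(now_ms=50))    # queue was empty recently
    print(policy.timeout_ms(now_ms=150))   # queue has been non-empty for over 100 ms
```

Timestamps are passed in explicitly so the policy can be tested without a real clock. A server would pass values from a monotonic clock and reject any queued request whose wait exceeds the returned timeout.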

Someone please solve this once and for all

This feels to me like the kind of problem that some PhD student somewhere should go off and solve. I'm not sure exactly what form it would take, but the hard part seems to be figuring out how many concurrent requests should be processed, across a wide variety of systems. It seems like it should be possible to design something that is "good enough" so that it is better than not doing anything at all. I would love some system that I could just include by default in every server, and never worry about tuning concurrency limits again.