Software developers make all sorts of simplifying assumptions about user interactions, especially about system load. One of the more pernicious (and sometimes fatal) simplifications is “I have lots of users all over the world. For simplicity, I’m going to assume their load will be evenly distributed.”

To be sure, this often turns out to be close enough to true to be useful. The problem is that it’s a steady state or static assumption. It presupposes that things don’t vary much over time. That’s where things start to go off the rails.

Consider this very common pattern: Suppose you’ve written a mobile app that periodically fetches information from your backend. Because the information isn’t super time sensitive, you write the client to sync every 15 minutes. Of course, you don’t want a momentary hiccup in network coverage to force you to wait an extra 15 minutes for the information, so you also write your app to retry every 60 seconds in the event of an error.

Because you're an experienced and careful software developer, your system consistently maintains 99.9% availability. For most systems that’s perfectly acceptable performance but it also means in any given 30-day month your system can be unavailable for up to 43.2 minutes.

So. Let’s talk about what happens when that’s the case. What happens if your system is unavailable for just one minute?

When your backends come back online you get (a) the traffic you would normally expect for the current minute, plus (b) any traffic from the one-minute retry interval. In other words, you now have 2X your expected traffic. Worse still, your load is no longer evenly distributed because 2/15ths of your users are now locked together into the same sync schedule. Thus, in this state, for any given 15-minute period you'll experience normal load for 13 minutes, no load for one minute and 2X load for one minute.

Of course, service disruptions usually last longer than just one minute. If you experience a 15-minute error (still well within your 99.9% availability) then all of your load will be locked together until after your backends recover. You'll need to provision at least 15X of your normal capacity to keep from falling over. Retries will also often “stack” at your load balancers and your backends will respond more slowly to each request as their load increases. As a result, you might easily see 20X your normal traffic (or more) while your backends heal. In the worst case, the increased load might cause your servers to run out of memory or other resources and crash again.

Congratulations, you’ve been DDoS’d by your own app!

The great thing about known problems is that they usually have known solutions. Here are three things you can do to avoid this trap.

#1 Try exponential backoff

When you use a fixed retry interval (in this case, one minute) you pretty well guarantee that you'll stack retry requests at your load balancer and cause your backends to become overloaded once they come back up. One of the best ways around this is to use exponential backoff.

In its most common form, exponential backoff simply means that you double the retry interval up to a certain limit to lower the number of overall requests queued up for your backends. In our example, after the first one-minute retry fails, wait two minutes. If that fails, wait four minutes and keep doubling that interval until you get to whatever you’ve decided is a reasonable cap (since the normal sync interval is 15 minutes you might decide to cap the retry backoff at 16 minutes).

Of course, backing off of retries will help your overall load at recovery but won’t do much to keep your clients from retrying in sync. To solve that problem, you need jitter.





#2 Add a little jitter

Jitter is the random interval you add (or subtract) to the next retry interval to prevent clients from locking together during a prolonged outage. The usual pattern is to pick a random number between +/- a fixed percentage, say 30%, and add it to the next retry interval.

In our example, if the next backoff interval is supposed to be 4 minutes, then wait between +/- 30% of that interval. Thirty percent of 4 minutes is 1.2 minutes, so select a random value between 2.8 minutes and 5.2 minutes to wait.

Here at Google we’ve observed the impact of a lack of jitter in our own services. We once built a system where clients started off polling at random times but we later observed that they had a strong tendency to become more synchronized during short service outages or degraded operation.

Eventually we saw very uneven load across a poll interval — with most clients polling the service at the same time — resulting in peak load that was easily 3X the average. Here’s a graph from the postmortem from an outage in the aforementioned system. In this case the clients were polling at a fixed 5-minute interval, but over many months became synchronized: