An unusual database problem at the giant social networking site could only be cured by taking the sort of action you normally take with a misbehaving PC

How Facebook fixed the site: they turned it off and on again. Literally




Ever been on the phone to IT support, been told to turn it off and then on again, and found that it sorted the problem out?

Facebook last night had that sort of problem. So they turned the site off and on again. And it fixed their problem. Literally.

As Robert Johnson, its director of software engineering, explained in a slightly shamefaced blogpost, the site was offline for about two-and-a-half hours – its worst outage in four years – due to some technical changes that Facebook had made.

It wasn't only the site itself which went belly-up; the Like buttons (which connect back to Facebook) vanished on 350,000 sites too, and the API which powers its OpenGraph system had serious problems.

The logistics of running a vast network like Facebook mean that you don't stick all your servers in a single place, of course. Facebook runs a big caching operation, so that lots of servers replicate its content. The cache gets updated periodically; it sits on a network called tfbnw.net (for "the Facebook network"), which in effect sits like a ring around the "central" Facebook site. You can see tfbnw.net in a traceroute to Facebook, which shows the intermediate networks between one site and Facebook.

Sometimes, things go wrong in the cache as values go out of date; but that's no problem, usually, because you can overwrite them with correct values from the centre. At least, you would like to.

Here's how Johnson explained it:

"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

"The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store is invalid."

In other words: something went wrong inside the circle. And that wrong value got passed out to all the tfbnw.net servers that would normally serve up Facebook pages.
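To make the check-and-repair idea concrete, here is a minimal Python sketch. It is not Facebook's code: the `is_valid` rule, the `repair` function and the `timeout` key are all hypothetical, chosen only to show why the approach copes with a bad cache entry but not with a bad value in the persistent store.

```python
def is_valid(value):
    # Hypothetical validity rule: a config value must be a positive integer.
    return isinstance(value, int) and value > 0


def repair(cache, store, key):
    """Run one check-and-repair pass; return (value, queries_sent_to_store)."""
    value = cache.get(key)
    if is_valid(value):
        return value, 0
    # The cached value looks wrong, so refetch it from the persistent store.
    value = store[key]
    cache[key] = value
    return value, 1


# Transient cache problem: one query to the store fixes it for good.
store = {"timeout": 30}
cache = {"timeout": -1}
first = repair(cache, store, "timeout")
second = repair(cache, store, "timeout")

# Invalid persistent store: every pass "repairs" the cache with the same
# bad value, so every pass sends another query.
bad_store = {"timeout": -1}
bad_cache = {"timeout": -1}
bad_queries = sum(repair(bad_cache, bad_store, "timeout")[1] for _ in range(5))
```

In the first case the second pass sends no query at all; in the second case all five passes hit the store, which is the pattern that, multiplied across every client, swamped the databases.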

Back to Johnson:

"Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second."

Basically, tfbnw.net's servers started querying the central system all at once, which overwhelmed it.

"To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover."

And now we come to the "oh my god, we're really going to have to do that?" moment:

"The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site."
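The loop, and why cutting traffic breaks it, can be shown with a toy model. This is a deliberately simplified sketch with made-up numbers, assuming a single shared cache key and a database that can answer a fixed number of queries per tick; the `simulate` function and its parameters are hypothetical, not anything Facebook has described.

```python
def simulate(clients_per_tick, capacity, key_cached=False):
    """Return the database query load for each tick.

    clients_per_tick: number of active clients in each tick.
    capacity: queries the database can answer per tick (assumed fixed).
    """
    loads = []
    for clients in clients_per_tick:
        if key_cached:
            # The cache holds a good value, so nobody queries the database.
            loads.append(0)
            continue
        # The key is missing: every active client queries the database.
        queries = clients
        loads.append(queries)
        failed = max(0, queries - capacity)
        # Any failed query is misread as an invalid value and deletes the
        # key, so the key only stays cached if every query succeeded.
        key_cached = queries > 0 and failed == 0
    return loads


# Full traffic against an overloaded database: the load never drops.
stuck = simulate([1000] * 5, capacity=100)

# Turn the site off (0 clients), let a few people back, then reopen.
recovered = simulate([1000, 1000, 0, 50, 1000, 1000], capacity=100)
```

With full traffic the load stays pinned at 1,000 queries a tick. Once traffic is stopped and then readmitted below the database's capacity, the key is cached successfully and the load falls to zero even when everyone returns.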

And the result?

"This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes."
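Johnson does not say what the new design will look like. One common way to handle this class of problem, sketched below purely as an illustration, is to treat a database error differently from an invalid value and to space out retries with randomised exponential backoff. The names `StoreError`, `refresh` and `backoff_delays` are hypothetical.

```python
import random


class StoreError(Exception):
    """Hypothetical error raised when the persistent store cannot answer."""


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return randomised retry delays (seconds) that grow exponentially."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def refresh(cache, key, fetch, is_valid):
    """Try to refresh one cache key; return True if the cache was updated."""
    try:
        value = fetch(key)
    except StoreError:
        # An error is not an invalid value: keep the cached copy and let
        # the caller retry later instead of deleting the key.
        return False
    if not is_valid(value):
        # The store itself holds a bad value; refetching will not help.
        return False
    cache[key] = value
    return True


def overloaded(key):
    # Stand-in for a database that is failing under load.
    raise StoreError("database overloaded")


cache = {"timeout": 30}
kept = refresh(cache, "timeout", overloaded, lambda v: v > 0)
updated = refresh(cache, "timeout", lambda k: 45, lambda v: v > 0)
delays = backoff_delays(5, rng=random.Random(1))
```

The point of the sketch is that a failed query leaves the cache alone, so errors no longer generate more queries, and the random delays stop every client from retrying at the same instant.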

That means there may be times over the next few days when you can't reach Facebook in particular places, or when unusual things happen.

"We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously."

Well, of course: if the site's down, it can't sell ads, and if it can't sell ads, how is Mark Zuckerberg going to justify his enormous Forbes valuation?