UPDATE: Google has now posted an official incident report explaining why the outage occurred. Here is the key takeaway:

Between 8:45 AM PT and 9:13 AM PT, a routine update to Google’s load balancing software was rolled out to production. A bug in the software update caused it to incorrectly interpret a portion of Google data centers as being unavailable. The Google load balancers have a failsafe mechanism to prevent this type of failure from causing Google-wide service degradation, and they continued to route user traffic. As a result, most Google services, such as Google Search, Maps, and AdWords, were unaffected. However, some services, including Gmail, that require specific data center information to efficiently route users’ requests, experienced a partial outage.
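The failsafe Google describes amounts to distrusting a health report that suddenly claims an implausible share of the fleet is down. The report doesn't say how this is implemented; the following is a minimal sketch under that assumption, with hypothetical names and a hypothetical threshold:

```python
# Illustrative sketch (not Google's actual code): if a health report claims
# an implausibly large fraction of data centers is unavailable, assume the
# report -- not the fleet -- is broken, and keep every data center in rotation.

def routable_datacenters(health_report, distrust_threshold=0.5):
    """Return the data centers a load balancer should route to.

    health_report: dict mapping data center name -> True if reported healthy.
    If more than distrust_threshold of the fleet is reported down, the report
    is treated as faulty and all data centers stay in rotation.
    """
    total = len(health_report)
    down = sum(1 for healthy in health_report.values() if not healthy)
    if total and down / total > distrust_threshold:
        # Failsafe: a buggy update is likelier than a simultaneous mass outage.
        return sorted(health_report)
    return sorted(dc for dc, healthy in health_report.items() if healthy)
```

Under such a failsafe, generic routing keeps working, which matches the report's observation that services needing data-center-specific information (like Gmail) were the ones that degraded.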

The percentage of Gmail users hit with slow performance, server error messages, or timeouts during the 18-minute outage ranged from 8 percent to 40 percent. A smaller percentage of users received errors from applications including Google Drive, Chat, Calendar, Google Play, and Chrome Sync. Google has corrected the problem in the load balancing software, and will change the release process for load balancing software to push changes "in one location before proceeding with a general rollout." The load balancing software is particularly important as it "routes the millions of users’ requests to Google data centers around the world for processing and serving content, such as search results and email."
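Pushing changes "in one location before proceeding with a general rollout" is a canary-release pattern. As a rough sketch of the idea (the function names and health check here are hypothetical, not Google's release tooling):

```python
# Illustrative canary rollout: apply a change to one location first, verify
# it there, and only then push to the remaining locations.

def staged_rollout(locations, push, healthy):
    """Push a change to locations[0] first; continue only if it stays healthy.

    push(location) applies the change; healthy(location) verifies it.
    Returns the list of locations that actually received the change.
    """
    if not locations:
        return []
    canary, rest = locations[0], locations[1:]
    push(canary)
    if not healthy(canary):
        # Stop here: a bad change never reaches the rest of the fleet.
        return [canary]
    for loc in rest:
        push(loc)
    return [canary] + list(rest)
```

With this pattern, a bug like the one in the incident report would have degraded a single location instead of every data center at once.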

Original story follows:

Portions of the Internet panicked yesterday when Gmail was hit by an outage that lasted for an agonizing 18 minutes. The outage coincided with reports of Google's Chrome browser crashing. It turns out the culprit was a faulty load balancing change that affected products including Chrome's sync service, which allows users to sync bookmarks and other browser settings across multiple computers and mobile devices.

Ultimately, it was human error. Google engineer Tim Steele explained the problem's origins in a developer forum:

Chrome Sync Server relies on a backend infrastructure component to enforce quotas on per-datatype sync traffic.

That quota service experienced traffic problems today due to a faulty load balancing configuration change.

That change was to a core piece of infrastructure that many services at Google depend on. This means other services may have been affected at the same time, leading to the confounding original title of this bug [which referred to Gmail].

Because of the quota service failure, Chrome Sync Servers reacted too conservatively by telling clients to throttle "all" data types, without accounting for the fact that not all client versions support all data types. The crash is due to faulty logic responsible for handling "throttled" data types on the client when the data types are unrecognized.
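Based on Steele's description, the client bug boils down to handling a throttle instruction for a data type the client doesn't recognize. A minimal sketch of that pitfall, with hypothetical data type names and a simplified string-based state registry:

```python
# Illustrative sketch of the client-side pitfall Steele describes: the server
# throttles "all" data types, but an older client only knows some of them.

def handle_throttle_buggy(throttled_types, registry):
    """Faulty logic: assumes every throttled type is locally registered."""
    for t in throttled_types:
        if registry[t] != "throttled":   # KeyError on an unrecognized type
            registry[t] = "throttled"
    return registry

def handle_throttle_safe(throttled_types, registry):
    """Safe variant: skip data types this client version doesn't support."""
    for t in throttled_types:
        if t in registry:
            registry[t] = "throttled"
    return registry
```

The safe variant simply ignores unknown types, which is the kind of defensive handling the quoted post says the buggy client code lacked.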

It turns out that if the Chrome sync service had gone down entirely, the browser crashes would not have occurred. "In fact this crash would *not* happen if the sync server itself was unreachable," Steele wrote. "It's due to a backend service that sync servers depend on becoming overwhelmed, and sync servers responding to that by telling all clients to throttle all data types (including data types that the client may not understand yet)."

An outage like this often leads to grand pronouncements about the viability of "cloud computing." What it really shows is that cloud services can be brought down by human error, just as IT services always have been. It also highlights the danger of single points of failure that can ripple across multiple services, especially in infrastructure as large and widely used as Google's.

As noted in the developer forum, preventing this problem from recurring requires changes both in Google's servers and in the Chrome application on users' computers. Google's Apps Status Dashboard promised that "we are confident we have established the root cause of the event and corrected it."

Google has promised a more thorough explanation of the root cause will come later today.

The headline on this article was changed to clarify that the problem originated with a load balancing configuration change.