Reddit, Foursquare, and Quora were among the many sites that went down recently due to a prolonged outage of Amazon's cloud services. On Thursday, April 21, Amazon Elastic Block Store (EBS) went offline, leaving the many Web and database servers that depend on that storage broken. Not until Easter Sunday (April 24) was service restored to all users. Amazon has now published a lengthy account of what went wrong, and why the failure was both so catastrophic and so lengthy.

Amazon has cloud computing data centers in five regions around the world: Virginia, Northern California, Ireland, Singapore, and Tokyo. Within each region, services are divided into what the company calls Availability Zones: physically and logically separate groups of computers. This design allows customers to pick the level of redundancy that they feel is most appropriate; hosting in multiple regions provides the most robustness, but at the highest cost. Hosting in multiple Availability Zones within the same region is cheaper, and guards against problems affecting any one zone.

Elastic Block Store is one of Amazon's many cloud computing services. EBS provides mountable disk volumes to virtual machines using the company's Elastic Compute Cloud (EC2) facility, giving those machines access to large amounts of reliable storage that can be used for database hosting and similar tasks. This can be used both directly, from EC2, and indirectly, via Amazon's Relational Database Service (RDS), which uses EBS as its data store. To ensure high availability, EBS replicates data between multiple systems. To ensure adequate manageability, this process is highly automated. If an EBS node becomes disconnected from its replica (due to failure of that replica, for example), it will actively seek out new storage within the same zone so that it can re-establish its connectivity.

On April 21, Amazon engineers were performing maintenance on one of the Availability Zones in the US East region, in Virginia. A network configuration change (possibly a router firmware update) was being made to the zone. Per Amazon's procedures, making a change to a router requires traffic to be moved off that router onto a backup, to avoid interruptions to service. However, the traffic shift was performed incorrectly. Instead of moving traffic onto a backup router, it was moved onto a secondary, low-capacity network. This network was designed for internode communication, not large-scale node replication or data transfer to EC2 and RDS systems. Moving all the traffic onto the secondary network caused the network to become saturated, and its performance tanked.

Disconnecting the primary network and crushing the secondary network under a load it was never designed to handle had the effect of stripping the EBS nodes of their replicas. Unable to reach any other node, each system believed that its data was at risk, and so its sole priority was finding another node with free space available, and replicating to it. Amazon swiftly reconnected the primary network, which initially allowed nodes to find new storage and create new replicas. However, so many nodes were attempting to do this that they used all the space available to the cluster. The remaining nodes were left frantically searching the network for a new place to copy their data to, but the network had nothing to offer them.
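
The dynamic described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of how EBS is implemented: the function name, the idea of one "slot" per replica, and the numbers are all hypothetical. It shows how, once free space runs out, the leftover nodes stay stuck and keep generating search traffic every round.

```python
def simulate_remirroring(nodes_seeking, free_slots, rounds):
    """Toy model (hypothetical): each node needs one free slot for a new replica.

    Returns a list of (stuck_nodes, search_requests) tuples, one per round.
    Every node still without a replica issues one search request per round.
    """
    history = []
    stuck = nodes_seeking
    for _ in range(rounds):
        requests = stuck                 # every stuck node searches again
        placed = min(stuck, free_slots)  # only as many succeed as there is space
        free_slots -= placed
        stuck -= placed
        history.append((stuck, requests))
    return history


# Assumed numbers: 1,000 nodes lose their replicas, but only 600 slots are free.
print(simulate_remirroring(1000, 600, 3))
# prints [(400, 1000), (400, 400), (400, 400)]
```

In this model the 400 nodes that miss out never recover on their own; only adding capacity changes the outcome, which matches the fix Amazon eventually applied.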

This in turn had negative consequences for the control system used to manage EBS. The flood of replication traffic and large number of nodes unable to find replicas at all meant that the control system became extremely slow to respond to requests to, for example, create a new volume. This poor performance caused it to become backed up with requests and run out of resources; at this point, it started rejecting requests with failures. Unlike the EBS clusters themselves, the control systems are not restricted to individual Availability Zones. Rather, they support the entire region. This meant that the control system backlog and failure messages started to affect users not just of the broken Availability Zone, but every other Availability Zone within the US East region.

Eventually Amazon brought things back under control. New storage was added to the cluster, allowing the nodes that were stuck looking for replicas to finally find new disk space and become operational once more. This in turn allowed the control system backlogs to be cleared. By Sunday night, almost all the nodes had returned to normal status, and almost all (99.93 percent) of EBS volumes were consistent.

Buyer beware

As major vendors continue to push for greater use of cloud computing, incidents such as this are sure to raise many concerns. This is not the first time Amazon has suffered a substantial outage (an uncorrected transmission error caused several hours of downtime in 2008, for example), but it was particularly severe, with prolonged unavailability and a small amount of data loss. The disruption to services that depended on the stricken Availability Zone was substantial.

With high availability one of the key selling points of cloud systems, this is a big problem. Some companies did avoid the problems; EBS customers who used multiple Availability Zones, or better yet, multiple regions, saw much less disruption to their service. But that's a move that incurs both extra costs and extra complexity, and certainly isn't something Amazon talks about when it describes its 99.95 percent availability target. With the Easter downtime, and assuming no more failures in the future, it will be more than 15 years before Amazon can boast of an average availability that high.
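
The "more than 15 years" figure can be checked with a little arithmetic. The sketch below assumes the outage lasted roughly three days (April 21 to April 24) for the worst-affected users; that duration, and the function name, are assumptions made for illustration rather than figures from Amazon.

```python
HOURS_PER_YEAR = 365.25 * 24


def years_to_reach_target(downtime_hours, target):
    """Years of otherwise flawless service needed for the long-run average
    availability to reach `target`, given `downtime_hours` of outage."""
    allowed_fraction = 1.0 - target          # 0.05 percent for a 99.95 target
    total_hours = downtime_hours / allowed_fraction
    return total_hours / HOURS_PER_YEAR


outage_hours = 3 * 24  # assumption: about three days of downtime
print(round(years_to_reach_target(outage_hours, 0.9995), 1))
# prints 16.4
```

Under that assumption, 72 hours of downtime has to be spread over about 144,000 hours of service, or a little over 16 years, before the average climbs back to 99.95 percent.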

The problem also raises broader issues. One factor contributing to the problems was that when nodes could not find any further storage to replicate onto, they kept searching, over and over again. Though they were designed to search less frequently in the face of persistent problems, they were still far too aggressive. This resulted, effectively, in Amazon performing a denial-of-service attack against its own systems and services. The company says that it has adjusted its software to back off in this situation, in an attempt to prevent similar issues in the future. But the proof of the pudding is in the eating—the company won't know for certain if the problem is solved unless it suffers a similar failure in the future, and even if this particular problem is solved there may well be similar issues lying latent. Amazon's description of the 2008 downtime had a similar characteristic: far more network traffic than expected was generated as a result of an error, and this flood of traffic caused significant and unforeseen problems.
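
One common way to make retries less aggressive is exponential backoff with random jitter. The sketch below is a generic illustration of that technique, not Amazon's actual change; the function name and the `base`, `cap`, and `attempts` parameters are hypothetical.

```python
import random


def backoff_delays(base=1.0, cap=300.0, attempts=8, rng=None):
    """Return a list of wait times (seconds) before each successive retry.

    The ceiling doubles with every failed attempt, up to `cap`, and the actual
    delay is drawn uniformly below that ceiling so that many nodes retrying at
    once do not all hit the network at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


# Example schedule for a node that keeps failing to find free storage.
for attempt, delay in enumerate(backoff_delays(rng=random.Random(42)), start=1):
    print(f"retry {attempt}: wait {delay:.1f}s")
```

With a fixed short retry interval, thousands of stuck nodes produce a constant flood of requests; with a schedule like this one, the aggregate request rate falls off quickly as the failure persists.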

Such issues are the nature of the beast. Due to their scale, cloud systems must be designed to be in many ways self-monitoring and self-repairing. During normal circumstances, this is a good thing—an EBS disk might fail, but the node will automatically ensure that it's properly replicated onto a new system so that data integrity is not jeopardized—but the behavior when things go wrong can be hard to predict, and in this case, detrimental to the overall health and stability of the platform. Testing the correct handling of failures is notoriously difficult, but as this problem shows, it's absolutely essential to the reliable running of cloud systems.

This complexity is only ever going to increase, as providers develop richer capabilities and users place greater demands on cloud systems. Managing it—and more importantly, proving that it has been managed—will be a key challenge faced by cloud providers. Until that happens, doubts about the availability and reliability of cloud services will continue to be a major influence in the thinking of IT departments and CTOs everywhere.