Renting a server from Amazon is no substitute for a disaster recovery plan.

If you run your own servers, you need backups. If you can't afford to go down, you also need offsite replication. But if you lease servers in the cloud, how can you protect against problems like this week's Amazon outage?

Keep reading for a timeline of the outage, plus a list of recovery strategies and the minimum downtime that each would have incurred.

A timeline of the Amazon outage

Here's a timeline of what went wrong, and when it was fixed. Note, in particular, the window from roughly 1:00 AM to 1:48 PM PDT when several of Amazon's availability zones were partially unavailable. (For a glossary of Amazon Web Services terminology, see the bottom of this post.)

I've also included Heroku's status reports on this timeline.

21 April 2011

1:15 AM PDT Heroku begins investigating high error rates.

1:41 AM PDT Amazon admits they are seeing problems with EBS volumes and EC2 instances in US East 1. The outage affects multiple availability zones. Amazon later described the problem as follows:

A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

1:52 AM PDT Heroku reports that applications and tools are functioning intermittently.

3:05 AM PDT Amazon reports that RDS databases replicated across multiple Availability Zones are not failing over as expected. This is a big deal, because these multi-AZ RDS databases are intended to be an expensive, highly reliable option for storing data.

1:48 PM PDT EBS volumes and EC2 instances are now working correctly in all but one availability zone.

2:15 PM PDT Heroku reports that they can now launch new EBS-backed instances.

2:35 PM PDT Amazon restores access to a "majority" of multi-AZ RDS databases. (There's nothing in the Amazon timeline to indicate when all of the multi-AZ RDS databases came back online.)

3:07 PM PDT Heroku brings core services back online, and restores service to many applications.

4:15 PM PDT Heroku reports: "In some cases the process of bringing many applications online simultaneously has created intermittent availability and elevated error rates."

8:27 PM PDT Heroku finishes restoring API services.

22 April 2011

2:19 AM PDT Heroku reports that all dedicated databases are back online.

6:25 AM PDT Heroku reports that new application creation is enabled.

1:30 PM PDT Amazon reports that a "majority" of EBS volumes in the affected zone have been recovered. The remaining volumes will require a more time-consuming recovery process.

9:11 PM PDT Amazon reports that "control plane" congestion is limiting the speed at which they can recover the remaining volumes.

23 April 2011

11:54 AM PDT Amazon is still wrestling with control plane congestion.

Quick update. We've tried a couple of ideas to remove the bottleneck in opening up the APIs, each time we've learned more but haven't yet solved the problem. We are making progress, but much more slowly than we'd hoped. Right now we're setting up more control plane components that should be capable of working through the backlog of attach/detach state changes for EBS volumes. These are coming online, and we've been seeing progress on the backlog, but it's still too early to tell how much this will accelerate the process for us.

8:39 PM PDT Amazon finishes re-enabling their APIs for all recovered volumes in the affected zone. Not all EBS volumes have been recovered yet, however.

We continue to see stability in the service and are confident now that that the service is operating normally for all API calls and all restored EBS volumes.

8:39 PM PDT Heroku reports that all applications are back online, though a few still cannot deploy new code via git.

24 April 2011

3:26 AM PDT Amazon re-enables RDS APIs in the affected zone, but not all databases have been recovered:

The RDS APIs for the affected Availability Zone have now been restored. We will continue monitoring the service very closely, but at this time RDS is operating normally in all Availability Zones for all APIs and restored Database Instances. Recovery is still underway for a small number of Database Instances in the affected Availability Zone.

5:21 AM PDT Heroku reports that all functionality is fully restored, including deploying new applications.

7:35 PM PDT Amazon reports that all EBS volumes are back online.

7:39 PM PDT Amazon reports that all RDS databases are back online.

Strategies for surviving a major cloud outage, and associated downtime

1. Rely on a single EBS volume with no snapshots. If you relied on a single EBS volume with no snapshots, there's a chance that your site would have been offline for over 3.5 days after the initial outage. There's also at least a 0.1% to 0.5% annual chance of losing your EBS volume entirely. This is not a recommended approach.

2. Deploy into a single availability zone, with EBS snapshots. In this scenario, if an availability zone goes down, you can theoretically restore from backup into another availability zone. During this recent outage, your site might have remained offline for over 12 hours, and you might have lost any changes since your last backup (unless you reintegrated them manually). Given Amazon's record during 2009 and 2010, this could still give you 99.95% uptime if no other EBS volume failures occurred. Despite the recent events, this may still be a viable strategy for many smaller, lower-revenue sites.

3. Rely on multi-AZ RDS databases to fail over to another availability zone. This approach should have lower downtime than relying on EBS snapshots, but in this case, the multi-AZ RDS failover mechanisms took longer than 14 hours for some users.

4. Run in 3 AZs, at no more than 60% capacity in each. This is the approach taken by Netflix, which sailed through this outage with no known downtime. If a single AZ fails, then the remaining two zones will be at 90% capacity. And because the extra capacity is running at all times, Netflix doesn't need to launch new instances in the middle of a "bank run" (see below).

5. Replicate data to another AWS region or cloud provider. This is still the gold standard for sites which require high uptime guarantees. Unfortunately, it requires transmitting large amounts of data over the public internet, which is both expensive and slow. In this case, downtime is a function of external systems and how quickly they can fail over to the replicated database.

There are some other approaches, such as writing backups and transaction logs to S3, where they are likely to remain available even in the case of severe outages.
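The capacity and uptime figures in strategies 2 and 4 are simple arithmetic, and it can help to redo them for your own numbers. The following is a minimal sketch, not a capacity-planning tool: the function names are made up for illustration, and it assumes load from a failed zone is spread evenly across the survivors.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def utilization_after_failure(zones: int, per_zone_utilization: float,
                              failed: int = 1) -> float:
    """Utilization of each surviving zone after `failed` zones go away.

    Assumes the total load is unchanged and is spread evenly over the
    remaining zones (an assumption, not something AWS guarantees).
    """
    survivors = zones - failed
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    return per_zone_utilization * zones / survivors


def max_safe_utilization(zones: int, failed: int = 1) -> float:
    """Highest per-zone utilization that keeps survivors at or below 100%."""
    survivors = zones - failed
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    return survivors / zones


def uptime_percent(downtime_hours: float, period_years: float) -> float:
    """Uptime as a percentage, averaged over `period_years`."""
    total_hours = period_years * HOURS_PER_YEAR
    return 100.0 * (1.0 - downtime_hours / total_hours)
```

With three zones at 60% each, `utilization_after_failure(3, 0.60)` comes out at about 0.90, matching the 90% figure in strategy 4, and `max_safe_utilization(3)` shows that anything above roughly 67% per zone would not survive the loss of one zone. For strategy 2, `uptime_percent(12, 1)` is about 99.86%, and the 99.95% figure is only reached when the 12 hours are averaged over roughly three years. That averaging window is an assumption here, since the figure above is not tied to an explicit period.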

Lessons learned

For some excellent post-mortems, see:

Today’s EC2 / EBS Outage: Lessons learned. A good overall analysis, with recommendations.

On Cascading Failures and Amazon’s Elastic Block Store. How emergency fail-over code can actually make an outage worse.

Update: Amazon EC2 outage: summary and lessons learned. RightScale has posted an excellent post-mortem. They note that the outage actually spread to more EBS volumes over time, and link to a long list of related posts. (They also claim that the other AZs were functioning again after 4 hours, which doesn't match either Amazon's public claims or the experiences of people I've spoken to.)

Here are some of the most important points:

1. The biggest danger in a well-engineered cloud system is a “run on the bank”, where initial failures trigger error-recovery code, which in turn may drive the load far beyond normal limits. According to Amazon, an initial network problem triggered an EBS re-mirroring, which overloaded their management plane. This, in turn, triggered emergency recovery scripts written by AWS customers, forcing the total load even higher. To stabilize the situation, Amazon was forced to disable API access to multiple zones. Just as in 1933, the easiest solution to a bank run is a bank holiday.

2. Availability Zone failures are correlated. Even though Amazon claims that multiple availability zones should not fail at the same time, it's clear that all the availability zones within a region share a management plane. This means that a large enough failure can overload the shared management plane.

3. EBS remains the weakest link. Recent months have seen widespread complaints about EBS, and Netflix has published an article on working around those limitations.

4. Few cloud providers publish their disaster recovery plans, making it hard to estimate downtime. If you were a Heroku customer last week, you had no way to evaluate how Heroku would respond to a major outage, or their plans for keeping your site on the air. As it turns out, they had widespread dependencies on EBS, and no plan for getting Heroku-based sites back on the air if an availability zone failed.

5. Test your disaster recovery plan. If you haven't tested your disaster recovery plan, then you have no idea how long it will take you to get back on the air.
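The "run on the bank" in point 1 is made worse when every customer's recovery script retries immediately and in lockstep. One common mitigation, which is a suggestion here and not something the post-mortems above prescribe, is to retry with exponential backoff and random jitter. The sketch below is a minimal illustration: `call_with_backoff`, its parameters, and the default limits are all hypothetical, and `action` stands in for whatever recovery call your script makes.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Pick a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def call_with_backoff(action: Callable[[], T], max_attempts: int = 8,
                      base: float = 1.0, cap: float = 300.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Call `action` until it succeeds or `max_attempts` is exhausted.

    Between failed attempts, sleep for a jittered, exponentially growing
    delay so that many clients do not retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap, rng))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without waiting. In a real script you would also want to retry only on errors that are plausibly transient, instead of on every exception as this sketch does.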

Amazon Web Services glossary

Here are definitions of the Amazon-specific terminology used in this post.

AWS Regions. Amazon divides their data centers into 5 independent regions: US East, US West, EU, Asia Pacific (Singapore), Asia Pacific (Tokyo). These regions appear to be almost totally independent: They offer slightly different feature sets, they communicate over the public internet, and there are no special features for moving data between regions.

AWS Availability Zones. Each AWS region is divided into a number of separate data centers, connected by high-speed links. According to Amazon, "Availability Zones... are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region."

For example, US East is divided into the zones us-east-1a, us-east-1b, us-east-1c and us-east-1d. (Note that your us-east-1a isn't necessarily the same as mine: Amazon randomly shuffles the zone names for each customer.) In theory, splitting your application across multiple availability zones should give you all the advantages of offsite replication with none of the drawbacks. In practice, it's not so simple.

EBS Volumes. An EBS volume is a networked block storage device. Each EBS volume is tied to a specific availability zone. EBS volumes have a number of notorious limitations, but some organizations work with them successfully.

EBS Snapshots. A snapshot is a backup of an EBS volume. Each snapshot is tied to a specific AWS region, but you can restore it to any availability zone within that region. Note that there is no way to recover data from a snapshot without creating an EBS volume.

RDS. Amazon's Relational Database Service, which runs databases using EBS volumes. It comes with an optional multi-AZ failover service, which replicates all database updates to a standby server.
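To make the snapshot definition above concrete: recovering from a snapshot means creating a new EBS volume from it in whichever zone of the region is still healthy. The sketch below illustrates that step under some assumptions. The helper names are hypothetical, and the `ec2` argument is assumed to behave like a boto3 EC2 client, whose `create_volume` call accepts `SnapshotId` and `AvailabilityZone` and returns a dictionary containing `VolumeId`.

```python
from typing import Any, Sequence


def pick_restore_zone(zones: Sequence[str], failed_zone: str) -> str:
    """Return the first zone in `zones` that is not the failed one."""
    for zone in zones:
        if zone != failed_zone:
            return zone
    raise ValueError("no healthy availability zone to restore into")


def restore_snapshot(ec2: Any, snapshot_id: str, zone: str) -> str:
    """Create a new EBS volume from `snapshot_id` in `zone`; return its ID.

    `ec2` is assumed to expose a boto3-style create_volume method.
    The snapshot must live in the same region as `zone`.
    """
    response = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    return response["VolumeId"]
```

In real use you might pass `boto3.client("ec2", region_name="us-east-1")` as `ec2`. Keep in mind that this call goes through the same EBS control plane that was overloaded during the outage, so a restore like this may not be available exactly when you need it.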