

When a busy cloud computing platform crashes, the impact is felt widely. That's the case with today's extended outage for Amazon Web Services, which is battling latency issues at one of its northern Virginia data centers. The problems are rippling through to customers, causing downtime for many companies that use Amazon's cloud to run their web services.

The sites knocked offline by Amazon's problems include social media hub Reddit, the HootSuite link-sharing tool, the popular question-and-answer service Quora, and even a Facebook app for Microsoft (see a full list of affected sites).

The issues began at about 1 a.m. Pacific time and are continuing as of 2:30 p.m. Pacific, with Amazon saying it still cannot predict when services will be fully recovered. By mid-afternoon, Amazon said it had limited the problems to a single availability zone in the Eastern U.S., and was attempting to route around the affected infrastructure. The AWS status dashboard shows that the services experiencing problems include Elastic Compute Cloud (EC2), Amazon Relational Database Service, and Amazon Elastic MapReduce, and that the problems are concentrated in the US-East-1 region.

Networking Event Triggers Problems

The problems are focused on Elastic Block Store (EBS), which provides block-level storage volumes for use with Amazon EC2 instances. Latency problems at EBS were cited by Reddit when the site experienced major downtime in March.

"A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1," Amazon said in a status update just before 9 a.m. Pacific time. "This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances.

"We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue," Amazon continued. "We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them."

UPDATE: At 10:30 a.m. Pacific, Amazon said it was making "significant progress in stabilizing the affected EBS control plane service," which was now seeing lower failure rates. "We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery."

UPDATE 2: At 1:48 p.m., Amazon said a single Availability Zone in the US-EAST-1 region continues to experience problems launching EBS-backed instances or creating volumes. "All other Availability Zones are operating normally," Amazon said. "Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests."
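Amazon's advice not to target a specific zone can be illustrated with a small sketch. This is not Amazon's placement logic, which the company has not published; it is a minimal illustration under assumed names. The zone labels (`zone-a`, `zone-b`, `zone-c`) and the `choose_zone` function are hypothetical. The point is only that an untargeted request can be steered away from an impaired zone, while a targeted request cannot.

```python
from typing import Optional, Sequence, Set

# Hypothetical zone labels for illustration; not real AWS zone names.
ZONES = ("zone-a", "zone-b", "zone-c")


def choose_zone(
    zones: Sequence[str],
    impaired: Set[str],
    requested: Optional[str] = None,
) -> str:
    """Return the zone in which a new instance would be placed."""
    if requested is not None:
        # A targeted request names its zone, so the platform has no
        # freedom to place it elsewhere, even if that zone is impaired.
        return requested

    # An untargeted request can go to any zone not marked impaired.
    healthy = [zone for zone in zones if zone not in impaired]
    if not healthy:
        raise RuntimeError("no healthy zone available")
    return healthy[0]


if __name__ == "__main__":
    impaired_zones = {"zone-a"}
    # Untargeted: steered around the impaired zone.
    print(choose_zone(ZONES, impaired_zones))
    # Targeted: still lands in the impaired zone.
    print(choose_zone(ZONES, impaired_zones, requested="zone-a"))
```

In this sketch the customer who leaves the zone unspecified gets a working placement, while the customer who pins the launch to the impaired zone does not.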

The outage has even affected a Microsoft initiative, according to a Facebook post by the company. "For those of you trying to enter our 'Big Box of Awesome' sweepstakes...the entry site is currently down, related to a broader problem impacting a number of sites across the internet today," Microsoft told its Facebook followers. "We'll let you know when it's back up." Microsoft has its own data center infrastructure, but some business units use third-party services. The Big Box of Awesome Facebook app is hosted on EC2.

Multi-Region Failover Option

The outage appears to affect many, but not all, customers using the US-East-1 region. Amazon operates multiple regions, allowing users to add redundancy to their applications by hosting them in several regions. In a multi-region setup, when one region experiences performance problems, customers can shift workloads to an unaffected region.
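A rough sketch of that failover decision follows. It is an illustration under stated assumptions, not a description of any AWS feature. The `pick_region` function, the 500 ms latency limit, and the sample latency figures are all hypothetical, and `us-west-1` stands in for whatever second region a customer has deployed to. A real setup would also need data replicated to the standby region and a way to redirect traffic, such as DNS changes.

```python
from typing import Mapping, Sequence

# Assumed threshold: a region slower than this is treated as unhealthy.
LATENCY_LIMIT_MS = 500.0


def pick_region(
    preferred: Sequence[str],
    latency_ms: Mapping[str, float],
    limit_ms: float = LATENCY_LIMIT_MS,
) -> str:
    """Return the first region, in preference order, that looks healthy.

    A region counts as healthy when a latency measurement exists for it
    and that measurement is at or below the limit.
    """
    for region in preferred:
        observed = latency_ms.get(region)
        if observed is not None and observed <= limit_ms:
            return region
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    # Made-up measurements: the primary region is degraded.
    measurements = {"us-east-1": 4200.0, "us-west-1": 85.0}
    print(pick_region(["us-east-1", "us-west-1"], measurements))
```

The preference order keeps traffic in the primary region whenever it is healthy, and shifts it only when the primary crosses the limit.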

Whenever Amazon Web Services experiences outages and performance problems, it typically highlights the multi-region option, which allows customers to avoid having their cloud assets constitute a "single point of failure." Today's outage is likely to prompt some customers that rely on Amazon to examine adding regions to their deployments, along with other strategies to work around EC2 outages.

The outage is also likely to prompt discussion of the reliability of cloud computing. Is it a fair question to raise? Today's outage has affected many customers, highlighting the vulnerability of a single service hosting many popular sites.

This has also been true of earlier outages at dedicated hosting providers like The Planet or data center hubs like Fisher Plaza. Companies relying upon those facilities could avoid outages by adding backup installations at other data centers, which is essentially the same principle as adding zones at Amazon.

Stuff happens. We write about outages all the time. But real-world downtime is particularly problematic in the context of claims that the cloud "never goes down." Cloud infrastructure can also fail. The difference is that cloud deployments offer new options for managing redundancy and routing around failures when they happen.