Why Amazon's cloud Titanic went down





NEW YORK (CNNMoney) -- This was never supposed to happen.

Amazon Web Services is the Titanic of cloud hosting, designed with backups to the backups' backups that prevent hosted websites and applications from failing.

Yet, like the famous ocean liner, Amazon's cloud crashed this week, taking with it Reddit, Quora, FourSquare, Hootsuite, parts of the New York Times, ProPublica and about 70 other sites. The massive outage raised questions about the reliability of AWS and the cloud itself.

It was supposed to work like this: Thousands of companies use AWS to run their websites through a service called Elastic Compute Cloud, or EC2. Rather than hosting their sites on their own servers, these customers turn to Amazon, which essentially rents out its unused -- and highly intricate -- server capacity.

EC2 is hosted in five regions across the globe: Northern Virginia, Northern California, Ireland, Tokyo and Singapore. Within each region are multiple "availability zones," and within each availability zone are multiple "locations" or data centers.

In its AWS marketing pitch, Amazon touts the way it links together many different data centers to protect customers from isolated failures. It promises to keep customers' sites up and running 99.95% of the year, or it will shave 10% off customers' monthly bills.

That allows for downtime of just 4.4 hours a year. Some sites have been down for nearly 36 hours now.
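The arithmetic behind that figure can be checked with a short sketch. It assumes a 365-day year and that the 99.95% promise is measured over a full year, as described above; the function name and constants are illustrative and are not taken from Amazon's service agreement.

```python
# Assumption: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted by an uptime promise over a period."""
    return period_hours * (100 - uptime_percent) / 100


budget = allowed_downtime_hours(99.95)
outage_hours = 36  # the "nearly 36 hours" some sites were reported down

print(f"Allowed downtime per year: {budget:.2f} hours")
print(f"A 36-hour outage is about {outage_hours / budget:.1f} times that budget")
```

Under those assumptions the yearly allowance comes to about 4.38 hours, which rounds to the 4.4 hours cited, and a 36-hour outage uses up roughly eight years' worth of it.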

So what went wrong exactly?

Amazon (AMZN, Fortune 500) has been tight-lipped about the incident, and the company said it won't be able to fully comment on the situation until it does a "post-mortem." So it's not clear yet exactly how the problem occurred.

But bits and pieces of information from Amazon, its customers and cloud experts help to explain what happened.

Thursday's crash happened at Amazon's northern Virginia data center, located in one of its East Coast availability zones. In its status log, Amazon said that a "networking event" caused a domino effect across other availability zones in that region, in which many of its storage volumes created new backups of themselves. That filled up Amazon's available storage capacity and prevented some sites from accessing their data.

Amazon didn't say what that "networking event" was.

Doug Willoughby, director of cloud strategy at Compuware, theorized that it could be a wiring problem or a connectivity issue that brought down AWS' so-called "Elastic Block Store." EBS is essentially a network-based hard drive that allows customers to store between 1 gigabyte and 1 terabyte of data per volume.

Reddit, one of the better-known sites to go down due to the error, said it has 700 EBS volumes with Amazon. Both Reddit and Amazon are working to "re-mirror," or copy, the volumes to a data center in another availability zone. But the painstaking process of moving the data and the sheer number of volumes are making the fix very slow.

"We always store data in multiple zones to avoid this problem," said Jeremy Edberg, senior product developer at Reddit. "The reason it went down is that it failed in multiple zones."

Sites like Quora and Reddit were able to come back online in "read-only" mode, but users couldn't post new content for many hours. Reddit only recently began inviting handfuls of random users to create new posts again.

Many experts blamed the sites themselves for crashing, saying they should have been spread out among multiple geographical regions to take full advantage of Amazon's backup systems.

"Amazon's products are only as good as the people putting the architecture up," said Michael Kirven, co-founder of cloud services provider Bluewolf. "If you put all of your eggs in one basket, you put yourself at risk."

EC2 is so simple to use -- a credit card and a few keystrokes literally get your business into the cloud -- that some experts say it can give a false sense of security. They see in Amazon customers a naive belief that nothing could possibly go wrong.

Of course, things go wrong and systems fail. Other cloud-hosted products like Google's (GOOG, Fortune 500) Gmail have gone down from time to time.

But sites like Reddit and others that crashed were simply following the instructions Amazon laid out in its service agreement, which says hosting in one region should be sufficient. Some smaller sites can't afford the engineering and financial resources it takes to duplicate their infrastructure in data centers all over the world.

Some sites like FourSquare took the outage in stride. The check-in service blogged that its "usually-amazing data center hosts, Amazon EC2, are having a few hiccups."

But others weren't as forgiving. BigDoor CEO Keith Smith wrote in a blog post: "If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner."

Reddit encountered a similar hiccup for about six hours last month, prompting the company to begin migrating away from the particular product that went down on Thursday. Reddit's Edberg said the company is sticking with Amazon for now, but "we always have our eyes open for something that's superior."