Like many other startups, we use Amazon Web Services (AWS) to run Fitocracy’s website and iPhone app. Our main stack consists of half a dozen Elastic Compute Cloud (EC2) web instances, an EC2 Redis instance, and a Relational Database Service (RDS) instance, all located in us-east-1 (Northern Virginia). So, when the colossal electrical storm hit the area last night and brought that entire AWS region’s EC2, RDS, and Elastic Block Storage (EBS) down, we went down with it. It took us 12 hours to get fully operational again. Here’s why:



Going down

At 11:05pm, Pingdom emails that the site is down. Reports start coming in from users on Twitter that our iPhone app is failing to connect. We can’t access or ssh into any of our instances, and the AWS console and API are unresponsive. All requests to fitocracy.com return a 503 (Service Unavailable) error and a blank page. The official AWS status page says there is connection trouble with EC2 and RDS instances in the us-east-1 region.



Maintenance mode

The immediate goal after going down was to get a static maintenance page up on the website and a maintenance status message returning to the iPhone app instead of the “unable to connect” error. We already had these configs and static files in our git repo, ready to deploy for scheduled downtimes, but we couldn’t deploy them to unresponsive instances. So we started looking for alternative places to host the static files. Our senior engineer, Adam, started booting a micro instance in us-west-1 (AWS’s west coast data center), and I quickly copied the maintenance files to a Linode account in case we needed to serve files from there, too.

However, re-hosting the files alone wasn’t going to work, because switching the domain posed a problem. Our DNS record sets for fitocracy.com had a TTL of 14400 (4 hours), so pointing the domain at another location could take up to 4 hours to propagate, and we wanted the switch to happen ASAP.

Luckily, we were using Route53 to manage our domain record sets, which lets you define a Zone Apex Alias for load balancers; these start redirecting immediately (as far as we can tell) and have no TTL. The only catch with these aliases is that you can only assign Elastic Load Balancers (ELBs) to them, which in turn can only have AWS instances under them. So, we created an ELB in us-west-1 and stuck the static maintenance micro instance under it.
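For the curious, the alias switch itself is just a Route53 ChangeResourceRecordSets call. Here is a minimal sketch of the change batch you would send (via boto or the AWS CLI); the ELB DNS name and hosted zone ID are hypothetical placeholders, and the UPSERT action is the modern API convenience (the same effect is possible with DELETE + CREATE):

```python
def apex_alias_change(domain, elb_dns_name, elb_hosted_zone_id):
    """Build a Route53 change batch that aliases the zone apex to an ELB.

    Note there is no TTL field: alias records inherit resolution from the
    ELB itself, which is why the switch takes effect almost immediately.
    """
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                "AliasTarget": {
                    # The ELB's hosted zone ID, not your own zone's ID.
                    "HostedZoneId": elb_hosted_zone_id,
                    "DNSName": elb_dns_name,
                    "EvaluateTargetHealth": False,
                },
            },
        }]
    }

# Hypothetical maintenance ELB in us-west-1:
batch = apex_alias_change(
    "fitocracy.com.",
    "maintenance-lb-1234.us-west-1.elb.amazonaws.com.",
    "Z368ELLRRE2KJ0",  # placeholder ELB hosted zone ID
)
```

This payload would then be passed to the ChangeResourceRecordSets endpoint for your hosted zone.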

It took about 20 minutes for the micro instance to boot and configure, another 20 minutes for the ELB to launch, and another 20 minutes for the ELB to register the instance under it. At the time we were launching the maintenance instance and load balancer in us-west-1, Netflix, Instagram, Pinterest, and other big sites were also down, so I suspect everyone was scrambling to get new instances up at alternative locations.

After finally getting the micro instance registered under the ELB, we switched the zone apex for fitocracy.com to the new ELB in us-west-1, and the maintenance page immediately started being served from fitocracy.com! Hooray! We had been down about an hour at this point without a maintenance page, but at least now web and iPhone users wouldn’t be in the dark.



ANNOTATION: I realized later on that since we were already using a zone apex alias for fitocracy.com, we could probably have gotten away with switching it to a regular A record pointing at a static IP on an outside server with a low TTL (<300), but we weren’t sure how long that change would take to propagate away from the zone apex. I tried this later on another domain, and it turns out it’s pretty fast. We could have started pointing it at an outside server in 10-20 minutes instead of the 40 minutes it took us to launch and register the ELB. But I think we made the safest choice in continuing to map our domain with zone apex aliases, since we were unsure about propagation times for leaving and re-mapping back to them. If this ever happens again, we won’t hesitate so much to switch off of the zone apex aliases.



Recovery

After we got maintenance mode serving correctly, we began concentrating on getting the full website back up and running. By this point, the official AWS status page said that EBS and EC2 were starting to be brought back online, so we knew it would likely be several hours before our instances were accessible again.

We had snapshots enabled on RDS, which supposedly meant a full snapshot every night plus logs of changes every 5 minutes after that. The idea behind this was that we could restore to within 5 minutes of when the database stopped logging. However, both the snapshot and the logs were inaccessible while EBS/EC2/RDS were down. We also have nightly database dumps that are transferred to a third-party location, but those would obviously be missing all of Friday’s data.
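In other words, point-in-time restore works by replaying the 5-minute change logs on top of the nightly snapshot, so the worst case is losing just under 5 minutes of writes. A toy illustration of that restore-window arithmetic (the timestamp is hypothetical):

```python
from datetime import datetime

def latest_restorable(last_log_time):
    """Round the last logged moment down to the most recent completed
    5-minute log boundary — the latest point we could restore to."""
    minutes = (last_log_time.minute // 5) * 5
    return last_log_time.replace(minute=minutes, second=0, microsecond=0)

# e.g. if the database stopped logging at 11:03:40pm,
# the restore point would be 11:00pm — under 4 minutes of lost writes.
restore_point = latest_restorable(datetime(2012, 6, 29, 23, 3, 40))
```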

So we had a choice: either try to re-create the stack in another AWS region using the previous night’s database dump as the restore point, or wait until we got access to our RDS instance and restore from there.

Since we were a few hours into the downtime at this point, and AWS was giving status updates that it was beginning to restore EBS volumes (where RDS snapshots are stored), we decided to wait it out and try to get access to the RDS snapshot and log. We thought waiting a few extra hours to try and retrieve Friday’s data was worth it.

It would have taken us 4-6 hours to recreate the stack from scratch in another region anyway, because of how we deploy instances. We use Amazon Machine Images (AMIs) to launch pre-configured images of our web and caching servers. We start with a base image (stock Ubuntu, for example), then manually add the programs and configs that we need to run the instance, then make an AMI of the instance, then launch as many instances of that AMI as we need. When we need to do a third-party software update (i.e. not in the Fitocracy repo), we simply update one of the instances, make another AMI, launch all the instances we need from that new AMI, and kill off the old instances.

It makes for super fast and seamless updates, but it obviously constrains you to the limits of AMIs. AMIs are basically EBS volumes that you can clone and launch, but since EBS was down, we couldn’t launch any new instances from our AMIs. Also, even if EBS hadn’t been down, you can’t launch AMIs in other regions, so we’d have had to make new AMIs from scratch anyway. So getting up and running in another region would have required a manual build from the ground up. Not impossible, since our stack is relatively simple, but not very quick to roll out, either.

So given that it would take 4+ hours to roll out in another region, with a day’s loss of data, we opted to wait until we could confirm a total loss on the RDS instance recovery. Unfortunately, that wait took quite a bit longer than expected: 4 hours.

During that time, the AWS console and API were giving conflicting reports about instance status. We noticed that our RDS instances were marked as available when they actually weren’t. After things get back to normal for them, I’d be interested in hearing AWS’s explanation of why these indicators didn’t match reality for several hours.

Eventually, after 4 hours (about 5am), our EBS volumes began to come back online. Luckily, the volume backing our main web AMI was one of them, so we started spinning up instances from that AMI and terminating old ones. The AWS API was still pretty spotty on whether it would return correct data, but we managed to manually ssh into all the new instances and get them ready to go. However, the RDS instance was still unavailable.

Then, after another 2 hours (about 7am), a restore point from 11pm the previous night appeared for our RDS instance, which is exactly what we were waiting for. We started launching a new RDS instance with that recovery point. It ended up taking about 2 hours to launch because it had to clone the snapshot (taken at 3am the previous day) and roll it forward using the logs.

Turning back on the lights

Once the RDS instance was restored, the individual web instances started responding correctly to requests, and it was time to bring the site out of maintenance mode! However, the load balancers in that region were still having trouble registering instances and showing the correct instances under them. We tried editing the instances under our current ELBs and then tried creating new ELBs, but neither would register the EC2 instances.

So we started trying to improvise a load balancer with nginx on one of our current EC2 instances. It wouldn’t be an efficient use of resources, but we just wanted the site back up (it had been 11 hours at this point). However, it took us about 45 minutes to figure out how to terminate and redirect HTTPS connections (needed for login) the same way the ELBs did. Right about the time we figured it out, our ELBs started registering instances. Once that happened and our ELBs in us-east-1 were responding to requests correctly, we switched the zone apex alias from the us-west-1 ELB back to our main us-east-1 ELB. The site was live again, with no loss of data!
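For reference, an improvised ELB stand-in of this sort boils down to an nginx config along these lines (a sketch with hypothetical paths and internal addresses, not our exact config). The two key pieces are terminating SSL at the balancer and passing the original scheme along in a header, the way ELB does, so the app can still tell HTTPS logins apart from plain HTTP:

```nginx
upstream app_servers {
    server 10.0.0.11:8000;   # hypothetical internal web instances
    server 10.0.0.12:8000;
}

server {
    listen 443 ssl;
    server_name fitocracy.com;
    ssl_certificate     /etc/nginx/ssl/site.crt;   # hypothetical cert paths
    ssl_certificate_key /etc/nginx/ssl/site.key;

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # ELB sets this header when it terminates HTTPS; mimic it so the
        # app's "is this request secure?" check keeps working.
        proxy_set_header X-Forwarded-Proto https;
    }
}
```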

Reflection

So after 12 hours of downtime, I’m wondering how we could have done better. There were definitely some things we could have improved on our end to speed up reaction and recovery.

First, we could have switched to a maintenance page faster if we had immediately pointed our domain at an outside static IP with a maintenance page on it. We should explore how we could do this in one click or automatically. iPhone users especially get confused and frustrated when they are left in the dark about what’s going on (since there’s no maintenance page you can redirect them to).
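A one-click switch could be as small as a script that upserts a short-TTL A record over the apex, pointing at the outside maintenance box. A sketch of the change batch it would build (the IP is a documentation-range placeholder, and UPSERT is the modern Route53 action):

```python
def maintenance_record(domain, maintenance_ip, ttl=300):
    """Build a Route53 change batch that replaces the apex alias with a
    plain A record (TTL <= 300) pointing at an outside maintenance server."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": maintenance_ip}],
            },
        }]
    }

# Hypothetical outside server hosting the static maintenance page:
rec = maintenance_record("fitocracy.com.", "203.0.113.10")
```

Switching back out of maintenance mode would be the reverse: upserting the zone apex alias to the production ELB again.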

Second, we should have a script that can build a web instance from a stock image quickly. If us-east-1 had completely tanked and been unrecoverable, we should have been relatively ready to roll out to another region. We had the database backup ready to go, but the web instances were not. We should be able to roll out in another region in the amount of time it takes to restore from a nightly backup.
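One way to do that is to keep a user-data bootstrap script in the repo and launch stock images with it, instead of depending on region-locked pre-baked AMIs. A sketch of what the RunInstances parameters could look like (the AMI ID, repo URL, and package list are all hypothetical):

```python
# Hypothetical bootstrap passed as EC2 user-data; runs on first boot.
USER_DATA = """#!/bin/bash
apt-get update && apt-get install -y nginx python-pip git
git clone https://example.com/fitocracy.git /srv/app   # placeholder repo URL
pip install -r /srv/app/requirements.txt
"""

def run_instances_params(stock_ami_id, count):
    """Build the arguments for an EC2 RunInstances call (boto-style) that
    launches web instances from a stock Ubuntu AMI in the target region."""
    return {
        "ImageId": stock_ami_id,  # stock AMI IDs differ per region
        "MinCount": count,
        "MaxCount": count,
        "InstanceType": "m1.small",  # placeholder size
        "UserData": USER_DATA,
    }

params = run_instances_params("ami-12345678", 6)
```

Because the only region-specific input is the stock AMI ID, the same script works for a failover rollout anywhere.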

Third, we might have had a faster recovery if we had used EC2/EBS to roll our own database server instead of using RDS, which was several hours slower to recover snapshots. However, that would mean either rolling our own 5-minute restore point logging or going without. About half of our EBS volumes were still unreadable when we terminated them in favor of new instances, so it seems pretty risky to pin a day’s worth of data on your EBS volume being recovered.

Fourth, we could have had a replica of the database and AMIs of our instances pre-set in another region for failover. This would have been the ideal solution for minimum downtime, but maintaining that option is unfortunately way out of our price range (double the database instance costs and even more data transfer costs). Eventually, we’ll get big enough to make multi-region databases worth it, but not yet.

In the end, we mostly ended up waiting on AWS to recover the things we needed. Even with perfect preparation for rolling out to another region, our choices still would have been either 12 hours down with no data loss or 4 hours down with a day’s loss of data. Personally, I’d still take the 12 hours of downtime.

Even so, we need to be more prepared to roll out to another region from a backup if needed. So among the immediate changes, one thing we will definitely review more heavily is rollout time to another region or set of servers.



TL;DR: Earth’s Electrification Errors Eastern EC2, Sleepy Sysadmins Sail Suspended Servers, Fitocracy Finally Fixes Functionality Faults

-Daniel Roesler, CTO, Fitocracy

UPDATE: To address some criticisms that if you’re not fully distributed in the cloud, “you’re doing it wrong”: yes, I know that one of the major advantages of using the cloud is distributing resources, but going multi-region more than doubles your costs because you have to maintain parity between databases across regions. I wish we could be multi-region, but unfortunately, we can’t afford that yet. Someday! For now, we just take advantage of some of the other benefits of the cloud, such as being able to seamlessly add and remove instances to match traffic, and being able to instantly throw up experimental instances to play around with code/infrastructure setups. As for the downtime, we just try our best to make it as transparent (lots of tweets) and upbeat (funny maintenance pages) as possible.