At its core is what we call the AdStage unified image. This machine image is used to create instances for all services in all environments, from development and test to staging and production. On it are copies of all our repos and the dependencies needed to run them. Depending on the values of a few instance tags, an instance can come up in different modes to reflect its usage.

When an instance comes up in “review” mode, for example, all the services and their dependent databases run together on that instance and talk to each other. This lets engineers doing development and QA access an isolated version of our full stack running any arbitrary version of our code. Whatever they do on these review boxes doesn’t affect staging or production and doesn’t interact with other review boxes, completely eliminating our old QA/staging constraint. And as an added bonus, as soon as a review box passes QA, it can be imaged and that image can be deployed into production.

That works because when an instance starts in “staging” or “production” mode, it’s also told which service it should run. This is determined by tags the instance inherits from its autoscaling group, letting us bring up fleets of instances running the same code to spread out the load from our customers. Autoscaling groups serving web requests are connected to Elastic Load Balancers that distribute requests evenly among our servers. The load balancers give us a fixed point under which we can smoothly swap out instances, enabling zero-downtime deployments, and they make rollbacks as easy as keeping old versions of the unified image on standby, ready to swap in.
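The tag-to-mode decision can be sketched in a few lines of Ruby. The tag names (`mode`, `service`) and the service list here are illustrative assumptions, not our actual tag schema:

```ruby
# Placeholder service names; the real services and tag schema differ.
ALL_SERVICES = %w[web importer reporter].freeze

# Given an instance's tags, decide what to run at boot. The tag names
# ('mode' and 'service') are assumptions for illustration.
def boot_plan(tags)
  case tags['mode']
  when 'review'
    # Review boxes run every service plus their databases locally.
    { services: ALL_SERVICES, databases: true }
  when 'staging', 'production'
    # Staging/production instances run only the service named in their
    # tags, which are inherited from the autoscaling group.
    { services: [tags.fetch('service')], databases: false }
  else
    raise ArgumentError, "unknown mode: #{tags['mode'].inspect}"
  end
end
```

At boot, the real instance would look up its own tags (for example via the EC2 metadata service and `DescribeTags`) and hand the result to logic like this.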

The AWS resources we use don’t fully coordinate themselves, though, so we wrote a Ruby Thor app that uses the AWS Ruby SDK to do that. It takes care of starting review boxes, building images, and then deploying those images into staging and production environments. It automatically verifies that deploys are working before switching the load balancers over to new versions, and it will suggest a rollback if it detects an issue after a deploy completes. It also uses some clever tricks to coordinate multiple deploys and lock key resources so that engineers can’t corrupt each other’s deploys: anyone can start a deploy, but it will be stopped if it would cause a conflict.
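To give a flavor of the locking, here is a minimal in-memory sketch of compare-and-set resource locking. This illustrates the idea only; our actual mechanism coordinates across machines, not within one process:

```ruby
# Hedged sketch: a deploy either acquires every resource it needs
# (load balancers, autoscaling groups, ...) or acquires nothing, so two
# engineers can never hold overlapping pieces of the same deploy.
class DeployLock
  def initialize
    @locks = {}        # resource name => owning deploy
    @mutex = Mutex.new # Mutex is a Ruby core class; no require needed
  end

  # Returns true if `owner` acquired all `resources`; returns false and
  # acquires nothing if any resource is held by a different owner.
  def acquire(owner, resources)
    @mutex.synchronize do
      return false if resources.any? { |r| @locks.key?(r) && @locks[r] != owner }
      resources.each { |r| @locks[r] = owner }
      true
    end
  end

  # Frees every resource held by `owner`, e.g. when a deploy finishes.
  def release(owner)
    @mutex.synchronize { @locks.delete_if { |_, o| o == owner } }
  end
end
```

The all-or-nothing acquire is what lets anyone start a deploy safely: a conflicting deploy simply fails to get its locks and stops before touching anything.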

That covered all our desiderata: imaging instances allowed easy system administration, the setup was boring and widely used, there was zero downtime inherent in the deployment process, deployment was automated with support for rollbacks, and it wasn’t very complex at less than 1500 lines of code all in. And since it solved the QA constraint and by our estimates would save over $10k in operating expenses, all that remained was to plan the live migration from Heroku to AWS.

A Live Migration

July of 2016 was typical for San Francisco. Most days the fog and frigid air kept me working inside while, across the street from our office, unprepared tourists shivered in shorts as they snapped selfies at Dragon’s Gate. It was just as well, because everything was set to migrate from Heroku to AWS, and we had a helluva lot of work to do.

Our customers depend on us to manage their ad campaigns, automate their ad spend, and report on their ad performance. When we’re down they get thrown back into the dark ages of manually creating and updating ads directly through the networks’ interfaces. They couldn’t afford for us to go offline while we switched over to AWS, so we were going to have to do the migration live. Or at least as live as reasonable.

We implemented a 1-week code freeze and found a 1-hour window on a Saturday morning when AdStage would go into maintenance mode while I switched databases and other services that couldn’t easily be moved while running. In preparation we had already performed migrations of our staging systems and written a playbook that I would use to cut over production. I used the code freeze to spend a week tweaking the AWS deployment to match the Heroku deployment. All seemed fine on Saturday morning. We went down, I cut over the databases, and then brought AdStage back up. I spent the day watching monitors and staying close to the keyboard in case anything went wrong, but nothing did. I went to sleep that night thinking everything was in a good state.

After a lazy Sunday morning, I started to get alerts in the afternoon that our importers were backing up. As we looked into it, the problem quickly became apparent: we somehow had less CPU on AWS than on Heroku, despite nominally having more compute resources. As a result we couldn’t keep up, and every hour we fell further behind. We had to decrease the frequency of our imports just to keep the queues from overflowing, and we ultimately had to switch our Heroku apps back on to run alongside AWS to keep up with the workload. It was the opposite of saving money.

What we figured out was that Heroku had been telling us a happy lie. We were officially only getting about 2 ECUs per dyno, but the reality was that we were getting something closer to 6 since our neighbors on Heroku were not using their full share. This meant that our fleet of AWS instances was 3 times too small, and in spite of everything Heroku was actually cheaper! If only there were a way to pay less for more instances…

That’s when we hit upon using spot instances. We’d thought about using them because they’re about 1/10th the price of on-demand instances, but they come with some risk: they can be terminated at any time if the spot market price rises above your bid. Luckily, this doesn’t happen very often, and autoscaling groups otherwise take care of the complexity of managing spot instances for you. Plus it was easy to have backup autoscaling groups of on-demand instances sitting in the wings, ready to be scaled up with a single command if we temporarily couldn’t get enough spot instances to meet our needs. We were ultimately able to convert about 80% of our fleet to spot instances, getting our costs back within expected targets despite using 3 times more resources than originally planned.
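Back-of-the-envelope, the math works out. Assuming an illustrative on-demand price of $0.10/hr (real EC2 rates vary by instance type and region) and the roughly 1/10th spot discount, a fleet 3 times larger at 80% spot still costs less than the original all-on-demand plan:

```ruby
# Illustrative prices only, not actual EC2 rates.
ON_DEMAND_HOURLY = 0.10                  # assumed on-demand price, USD/hr
SPOT_HOURLY      = ON_DEMAND_HOURLY / 10 # "about 1/10th the price"

# Hourly cost of a fleet where `spot_fraction` of instances run on spot
# and the remainder run on-demand.
def fleet_hourly_cost(total_instances, spot_fraction)
  spot      = (total_instances * spot_fraction).round
  on_demand = total_instances - spot
  spot * SPOT_HOURLY + on_demand * ON_DEMAND_HOURLY
end

original_plan = fleet_hourly_cost(10, 0.0) # 10 on-demand instances: $1.00/hr
actual_fleet  = fleet_hourly_cost(30, 0.8) # 3x the instances, 80% spot: $0.84/hr
```

Triple the capacity comes out about 16% cheaper than the original plan, which is how the 80% spot conversion brought costs back within target.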

Conclusions

Aside from our surprise underestimation of capacity, switching from Heroku to AWS went smoothly. Don’t get me wrong, though: it was worth doing only because we had reached a scale where the economics of bringing some of our infrastructure operations in-house made sense. If we weren’t spending at least one engineer’s salary worth of opex that could be saved by switching to AWS, and if infrastructure hadn’t become a core competency, we would have stuck with Heroku and had that person (me!) work on things more essential to our business. It was only because of changing economics and processes that migrating from Heroku to AWS became part of our story.

Interested in AdStage becoming part of your story? We’re hiring!

Update 2017–07–27

There is now a part 2.