At the beginning of 2016, Nextdoor production releases took about an hour. That is to say, once all the code in the new release was tested and verified, it would still take an entire hour before our users saw those changes. Packaging the software took about 25 minutes and then actually deploying that package to production would take another 30 minutes.

“That’s madness,” you say. “Why on earth did things take so long?”

Let’s dive deep into what made this so slow, and how we made a 10x improvement.

Snail-paced staging

Before we release code to production, our engineers perform acceptance testing for their changes on a production-like environment, which we call staging. Product managers, designers, and engineers can use staging to help verify new features or behaviors before the code is released to all of our users.

Staging has a similar dataset to production, and runs all the services in the same way — the only difference is how many EC2 instances are powering it. Our staging deploy pipeline used the same architecture as production; hence, staging was similarly slow to update. This meant that we were heavily limited in the number of changes we could test per day on staging. If you’re an engineer working on a complicated new feature and you only have 2–3 opportunities per day to test it in a full production-like environment, that severely reduces your speed of iteration. Slow staging and production turnaround was hurting our engineering team’s ability to move fast and get new features out quickly.

Too much toil and trouble

Not only was this process slow, it was also far more manual than we wished. Each release required running several scripts and inputting values by hand, which introduced the potential for operator error. No part of the release process should have required manual intervention — we knew there should just be a single “deploy button” to push.

Rollbacks are wrong

Finally, a frightening implication of all this lag was that if we needed to push an emergency fix, we’d be looking at a minimum one-hour turnaround — and potentially almost double that if we needed to test it on staging first. This meant that when things went wrong, we tended to roll back the entire release rather than roll forward with a patch. Rolling forward is preferable to rolling back, since important features and fixes aren’t reverted due to an unrelated bug.

It was obvious to us all that the slowness in our build and deployment pipeline was becoming a major drag on productivity. We needed to dig in and solve this.

Enter the Dev Tools Team

One of the non-technical reasons why things were so slow was that the various components of the build and deploy processes didn’t have clear team ownership, and had been neglected. The innards of the build & release systems had grown organically over the years and consisted of spaghetti code with various fixes slapped on top. We’d never taken a step back and thought holistically about the entire system and its architecture.

We decided we needed to create a new team which could focus on repairing our build and release pipelines. Staffed with 3–4 engineers, the high-level goal of the Dev Tools Team has been to maintain an engineering platform that supports rapid development and release of high quality code. Our philosophy is that our engineers are customers, and our job is to make their experience developing, testing, and deploying their work pleasant and efficient.

Our plan was simple and consisted of 3 steps:

Understand the existing build and deploy pipeline.

Automate it.

Optimize it.

Understanding unhurriedness

25 minutes “build” for Python app — WAT

Our primary service is a large Django/PostgreSQL monolith which runs under nginx/uWSGI. Python is an interpreted language, so it doesn’t require compilation. This might make you think that there’s nothing to build, but that turned out to be not entirely true.

We distributed our release to our servers as a self-contained binary Debian package. This Debian package contained all the application source code, the Python virtual env, uWSGI and various other binary requirements.

In order to build Debian packages in a clean, reproducible way, you need to use a tool such as sbuild or cowbuilder, which effectively builds them in an empty chroot environment. This is great — except for the fact that you need to populate the contents of the chroot with system libraries. Anyone who has used a Debianoid Linux is familiar with waiting for apt-get install to complete.
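To make the shape of this concrete, here is a hypothetical sketch of the kind of chroot-based invocation we ran. The tool name (sbuild) is real, but the flags and package path are illustrative, not our actual build configuration:

```python
# Illustrative sketch: building the sbuild invocation for a clean chroot
# build. Flags and the .dsc path are examples, not our real configuration.

def sbuild_command(dsc_path, dist="trusty", arch="amd64"):
    """Assemble an sbuild command line.

    Every build starts from an empty chroot, so system dependencies are
    apt-get installed from scratch each time -- the main source of our
    10-minute overhead.
    """
    return [
        "sbuild",
        "--dist", dist,
        "--arch", arch,
        dsc_path,
    ]

print(sbuild_command("nextdoor_1.0.dsc"))
```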

We also needed to compile frontend assets (using Webpack and associated tooling), and push those to the CDN before starting the deploy.

Finally, the Debian package itself needed to be uploaded to our Debian package host — which could take several minutes in practice for bizarre reasons which I’m not going to delve into here.

Since we had not architected any of this for end-to-end performance, we ended up with a breakdown like the following:

apt-get install system packages into chroot (10 minutes)

pip install Python packages (5 minutes)

Compile frontend assets & upload to CDN (5 minutes)

Upload resulting Debian package to hosting provider (5 minutes)

This added up to around 25 minutes.

30 minutes to deploy the Debian package — WAT

So we have a binary Debian package. It should just take a few minutes to push that out to our servers, right? Not so fast. We had architected our deploy to use a Red/Black process inspired by Netflix. We would boot an entirely new set of servers to run the new release, install the Debian package onto those and then switch traffic at the load balancer to point to the new servers. In case of a rollback, we still had the old release running, so it would simply be a matter of pointing the load balancer back at the old servers.

This architecture has a number of nice properties — such as exercising our ability to replace machines frequently. However, the cost was that it was very slow:

Bidding, booting, and configuring the new EC2 instances (25 minutes)

Wait for new version of services to become healthy and join load balancer (5 minutes)

This added up to around 30 minutes. Staging used the same Red/Black architecture, and so had a similar performance profile.

One final note on this architecture is that you end up booting tons of new instances. Each time you do this, there is a chance of a failure. EC2 itself can have outages, apt repos can go down, etc. At a certain scale, we began having quite frequent failures in bringing up new instances — which very often required painful manual intervention to work around. This kind of operational toil was a drag on our team.

Automation and the lack of it

There is an old rule-of-thumb in software engineering that, if you find yourself performing the same manual task three times, it is worth automating it. Our release engineers were doing a lot of repetitive manual tasks, and we had tons of opportunities to apply this adage.

Here are some of the most onerous examples of manual tasks in our staging pipeline:

If an engineer wanted to get a commit into the existing release branch, they would manually add a comment with their SHA to a special JIRA ticket.

The release engineer would periodically run a “git cherrypick” script by hand to scan the JIRA ticket for new SHAs and apply those to the branch in git.

The release engineer would be the first to encounter any merge conflicts, even though they had no context on the changes.

The release engineer would manually trigger a new build, wait, and then babysit the deployment to staging.

We knew that none of this was necessary. With better tooling, engineers could be responsible for performing their own cherrypicks and resolving any merge conflicts. Builds could be automatically triggered whenever new commits landed on the branch. A new build could be automatically deployed to staging and a Slack notification sent to inform the developer that their change is ready for acceptance testing.
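As a sketch of what that last step can look like, here is an illustrative notification helper, assuming Slack’s incoming-webhook API. The function names and webhook URL are hypothetical, not our actual tooling:

```python
# Illustrative sketch of notifying a developer that their commit is live
# on staging. Assumes a Slack incoming webhook; names are hypothetical.
import json
import urllib.request

def slack_payload(author, sha, env="staging"):
    """Build the message sent once a commit is deployed to staging."""
    return {
        "text": (
            f"@{author}: your commit {sha[:8]} is deployed to {env} "
            "and ready for acceptance testing."
        )
    }

def notify(webhook_url, payload):
    """POST the payload to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# On a push to the release branch, CI would build, deploy to staging,
# then call notify(WEBHOOK_URL, slack_payload(author, sha)).
```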

Our first major achievement was automation of the existing staging pipeline, which freed us from operational overhead enough to work on building the new architecture. As we gradually dug ourselves out of the hole of manual operational toil, we continued to invest in improving automation and reliability.

Migrating from Debian/Amazon EC2 to Docker/Amazon ECS

We already had some organizational experience with both Docker and ECS: our dev boxes were built on Docker Compose, and we had numerous microservices in production deployed on ECS. But this alone wasn’t the driving reason for our migration.

Win 1: Huge build speed-ups with Docker Layer Caching

As described above, we specifically wanted to increase the speed at which we could build and release new code. On the build side, we realized that we could gain huge speed-ups through caching infrequently-performed but expensive operations. Perfect examples are installation of system packages — which rarely change — and Python requirements which change more frequently, but still not that often.

While we could have built caching into the legacy Debian packaging system, we felt that it would be at least as much work as moving to Docker. Debian packaging tools are byzantine and nobody on our team had a good understanding of them. Furthermore, Docker has quickly emerged as a standard packaging and runtime format. If you package your application as a Docker image, you can build and run it in tons of different environments. Since we had adopted Docker already for our microservices and development environments, and therefore already had a fair amount of in-house familiarity with it, it was a natural fit.

In re-architecting our package build process under Docker, we were obsessive in optimizing our use of the provided layer caching. By fanatically avoiding doing more work during builds than we needed to do, we were able to get the build time from ~25 minutes to ~2–3 minutes on a machine with a warm cache. Since our cache hit rate is very high (system packages and Python dependencies change very rarely), this easily provided a 10x improvement in average build time.
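The key to exploiting layer caching is ordering build steps from least- to most-frequently changing. A minimal Dockerfile sketch of that ordering (base image, package names, and paths are illustrative, not our actual build file):

```dockerfile
# Illustrative layer ordering, not our actual Dockerfile.
FROM ubuntu:16.04

# 1. System packages: change rarely, so this layer almost always hits cache.
RUN apt-get update && apt-get install -y \
    build-essential libpq-dev

# 2. Python dependencies: copy only requirements.txt first, so editing
#    application code does not invalidate this layer.
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt

# 3. Application source: changes on every commit; only the layers from
#    here down are rebuilt on a typical build.
COPY . /app/
```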

Win 2: Reusing machines with ECS Docker Host Clusters

It was fantastic to decrease our build times. However, we still had an unacceptably slow deployment speed. This was mainly because we would bring up a whole new set of machines on each deploy. This is inherently time-consuming and error-prone. What if we instead maintained an elastic pool of servers which were always ready? That’s essentially what Amazon ECS offers. You run a cluster of Container Instances (which can be scaled easily), and then ECS takes care of deploying and scheduling your application containers on that cluster.

Since we had migrated our monolithic Django app to Docker, it was fairly straightforward to see how this would work by migrating our staging system to ECS. We quickly saw huge drops in deployment times — to an average of about 2–3 minutes on staging. Production takes a little longer because there are so many more container instances, but is on the order of 7–8 minutes. This could be sped up further by adding ECS host capacity.
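A rough sketch of what an ECS deploy reduces to, using boto3 calls that exist (`update_service` and the `services_stable` waiter). Cluster and service names would be supplied by the pipeline; our real tooling wraps this with more safety checks:

```python
# Sketch of an ECS rolling deploy. ECS reuses the existing container
# instances, so there is no fleet boot time on each release.

def deploy(cluster, service, task_definition):
    """Point the ECS service at a new task definition and wait for the
    rollout to stabilize."""
    import boto3  # imported here so the sketch loads without AWS credentials
    ecs = boto3.client("ecs")
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=task_definition,
    )
    waiter = ecs.get_waiter("services_stable")
    waiter.wait(cluster=cluster, services=[service])
```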

Production cutover — with zero downtime

It’s one thing to migrate your staging environment to a new architecture, but it’s another to migrate the full production workloads and release process. Not only did we need to come up with a plan to perform the migration with zero downtime to users, we also needed to have some way to do this gradually and roll back in case we found issues.

Like many real-world monoliths, the Nextdoor Django app in fact runs as seven different “service flavors”. These are instances of the exact same code, running in a slightly different mode and serving different sets of traffic. For example, there is an “api” service flavor for mobile clients and a “taskworker” service flavor for asynchronous job execution. Each service flavor runs under a dedicated Amazon ELB. These service flavors provided a natural grouping upon which to perform the migration. We could migrate production one service flavor at a time, using the DNS on the ELB to roll back to the EC2-based system if necessary.

We started with one of the low-traffic service flavors, and worked our way gradually to migrating the highest-traffic ones as we gained confidence. The entire migration took about 3 weeks, mainly due to two tricky issues which took quite a lot of time to debug: mysteriously dying processes and occasionally hung asynchronous taskworkers.

Issue 1: OOM kills under Amazon ECS

There isn’t a direct way to translate from a system where the minimum unit of scaling is an entire EC2 instance to a system where the unit of scaling is a single container. We had previously thought only in terms of scaling coarsely — by booting or terminating EC2 instances. We hadn’t thought of more fine-grained scaling — i.e. by X units of CPU and Y units of memory. We needed to think deeply about our application’s resource requirements and tune the number of worker processes given the amount of memory we were allocating. We couldn’t simply say “let’s just boot another X machines and let the application have as much memory as it wants”. ECS requires you to set clear limits, and if your containers exceed those limits they will be killed unceremoniously.
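The tuning above is essentially a budgeting exercise: given a hard container memory limit, how many uWSGI workers fit without risking an OOM kill? A back-of-the-envelope sketch, with illustrative numbers rather than our production figures:

```python
# Illustrative worker sizing under a hard ECS memory limit.
# All figures are examples, not our production values.

def max_workers(container_mem_mb, worker_rss_mb, overhead_mb=256, headroom=0.2):
    """Number of workers that fit in the container without risking OOM.

    headroom reserves a fraction of the limit for memory spikes, since
    ECS kills the container outright when the limit is exceeded.
    """
    budget = container_mem_mb * (1 - headroom) - overhead_mb
    return max(1, int(budget // worker_rss_mb))

# e.g. a 4 GiB task with ~300 MB resident per worker
print(max_workers(4096, 300))  # → 10
```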

In an environment like Python, where the interpreter seemingly never returns memory to the OS once it has requested it, your processes can appear to grow in memory forever. This manifested as us initially setting memory limits too low and suffering OOM kills of important processes as a result. It took a week of careful watching and tuning to get things to a point we were happy with. We found the ECS memory usage graphs problematic: since metrics are collected only once per minute, they could easily miss spikes.
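Because once-a-minute metrics miss spikes, one approach is an in-process sampler that tracks peak RSS at a much finer grain. A stdlib-only sketch (the reporting hook is left out as it would be deployment-specific):

```python
# Sketch of a fine-grained peak-RSS tracker, to catch memory spikes that
# once-a-minute metrics would miss. Stdlib only; reporting is omitted.
import resource
import sys
import threading

class PeakRssSampler:
    """Sample max RSS every `interval` seconds and remember the peak."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self.peak_kb = 0
        self._stop = threading.Event()

    def _rss_kb(self):
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in KB on Linux but bytes on macOS
        return rss // 1024 if sys.platform == "darwin" else rss

    def sample_once(self):
        self.peak_kb = max(self.peak_kb, self._rss_kb())
        return self.peak_kb

    def run(self):
        # Run in a background thread; call self._stop.set() to stop.
        while not self._stop.wait(self.interval):
            self.sample_once()

sampler = PeakRssSampler()
print(sampler.sample_once())  # current peak RSS in KB
```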

Issue 2: Mysterious hanging taskworkers

One of our largest service flavors is our asynchronous job processing system, which we call “taskworker”. We run a big pool of taskworker processes which pull tasks from Amazon SQS queues.

We migrated a portion of this service to Docker/ECS and started to notice some very strange behaviors. In certain cases, the Docker/ECS taskworkers were getting stuck and hanging forever. However the legacy Debian/EC2 taskworkers didn’t have this problem. Doing some deep system call-level debugging, we managed to track down the source of the hang. Only under Docker/ECS, our taskworkers would read from a PostgreSQL connection without a timeout — and for some reason the result would never arrive nor would any error occur on the socket — thus hanging the process. We came up with various crazy theories about why this was happening such as Docker networking issues, kernel bugs, and more. We also started to work on handling stuck taskworkers by doing operations in a child process, and handling a timeout in the parent process which would notice the child was stuck and restart it.
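The child-process mitigation described above can be sketched with the stdlib’s multiprocessing module; the task function here is a stand-in for real task execution:

```python
# Sketch of the stuck-worker mitigation: run each task in a child process
# and have the parent kill and replace it if it exceeds a deadline.
import multiprocessing

def run_with_timeout(task, args=(), timeout=30):
    """Execute task(*args) in a child; terminate it if it hangs.

    Returns True if the task finished, False if it was killed.
    """
    child = multiprocessing.Process(target=task, args=args)
    child.start()
    child.join(timeout)
    if child.is_alive():
        child.terminate()   # the hung socket read never returns, so kill it
        child.join()
        return False
    return True

def quick_task():
    pass  # stand-in for real task execution

print(run_with_timeout(quick_task, timeout=5))  # → True
```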

However, the Infrastructure team found the underlying issue independently — in porting the Debian package build to Docker, we had inadvertently upgraded the PostgreSQL C library “libpq” from 9.1 to 9.6. This newer version of libpq seemed to have some subtle backwards-compatibility issues communicating with our 9.4-version database servers. It would issue a read on a socket without a timeout and no error would ever occur — hanging the process forever.

When we pinned the version of libpq back to 9.1 — where it had been in our Debian build — this problem disappeared.
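This is the kind of regression a startup guard can catch early: psycopg2 exposes the compiled libpq version as an integer (e.g. 90104 for 9.1.4) via `psycopg2.__libpq_version__`. A sketch of such a check — the expected-version pin is illustrative:

```python
# Sketch of a startup guard against an inadvertent libpq upgrade.
# The expected version pin is an example.

def libpq_version_str(v):
    """Decode a psycopg2.__libpq_version__ integer (pre-10 releases)."""
    return f"{v // 10000}.{v // 100 % 100}.{v % 100}"

def check_libpq(actual, expected_major_minor="9.1"):
    """Fail fast if the container was built with the wrong libpq."""
    version = libpq_version_str(actual)
    if not version.startswith(expected_major_minor + "."):
        raise RuntimeError(f"libpq {version} != expected {expected_major_minor}.x")

# At startup you would call: check_libpq(psycopg2.__libpq_version__)
print(libpq_version_str(90104))  # → 9.1.4
```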

Conclusion

It took our team around 5 months to re-architect the build and deployment system to run under Docker/ECS and completely overhaul all our automation. This investment has paid off enormously in terms of the following benefits:

Build & deployment down from about an hour to 5–6 minutes (ECS capacity willing).

Fully automated staging environment. Developers push commits onto a release branch and they are live on staging for verification within 5 minutes — no manual operation required.

Vastly simplified build and release process for production. An operator simply types “!release” in a Slack channel and a button appears to confirm the deployment. One click, and the production deployment begins.

Massively improved release reliability. ECS and Docker have helped us simplify our systems so that they have many fewer moving parts and points of failure. Not having to worry about bringing up large numbers of new EC2 instances on every release drastically reduces our exposure to random failures.

However, it’s not all rainbows and unicorns. We have run into numerous issues with both Docker and Amazon ECS, which we plan to go into in future blog posts. Key learnings:

Docker layer caching works, but it is coarse-grained. Furthermore, due to changes in how Docker stores images, the cache cannot be populated by a docker pull; it must be built locally. Expect to invest quite a lot of time in optimization if you have a complex build and you want to make it very fast.

Amazon ECS works reasonably well, although it has rough edges. In particular, you cannot differentiate between a failing health check and a startup grace period. If your application takes a while to become healthy on startup, you must set a long health check duration. You cannot say “wait 10 minutes for my application to pass its health check on startup, then ratchet down the threshold”.

To migrate to Docker, be prepared to think deeply about your resource utilization in terms of CPU and memory required per container.

What’s your deployment process like? We’d love to hear about your experiences making builds and deployments fast!