Trains go through three phases. First, Delivery: the train is built and deployed to our Staging environment.

Then, Verification. This includes both automated verification, such as unit test runs and smoke tests, and human verification. Human verification is tracked by polling the state of tickets (we use JIRA). Tickets are automatically created for each engineer with non-dark-launched changes on the train once it is delivered to Staging.

Finally, Deploy. Once all of the verifications are complete, meaning all the automated tests have passed and the engineers have closed any manual verification tickets, the train auto-deploys to production.
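As a rough illustration of the three phases and the auto-deploy gate, here is a minimal Python sketch. It is not Conductor's actual code: the `Train` class, its fields, and the check and ticket names are all hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Phase(Enum):
    DELIVERY = "delivery"
    VERIFICATION = "verification"
    DEPLOY = "deploy"


@dataclass
class Train:
    # Hypothetical model; the field names are illustrative only.
    phase: Phase = Phase.DELIVERY
    # Automated check name -> True once it has passed.
    automated_checks: Dict[str, bool] = field(default_factory=dict)
    # Ids of manual verification tickets that are still open.
    open_tickets: Set[str] = field(default_factory=set)

    def ready_to_deploy(self) -> bool:
        """True when every automated check passed and no ticket is open."""
        return (
            self.phase is Phase.VERIFICATION
            and all(self.automated_checks.values())
            and not self.open_tickets
        )

    def advance(self) -> Phase:
        """Move to the next phase if its precondition holds."""
        if self.phase is Phase.DELIVERY:
            # Assumes build and Staging deployment have finished.
            self.phase = Phase.VERIFICATION
        elif self.ready_to_deploy():
            self.phase = Phase.DEPLOY
        return self.phase
```

A train whose tickets are all closed but whose smoke test has not yet passed stays in the verification phase, and `advance()` only moves it to deploy once the last check turns green.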

Figure: A train in the verification phase. Human verification is complete; the train is waiting on automated tests.

Once a train starts to deploy to production, a new train is implicitly created as soon as there are any queued changes. To maximize throughput, we don’t wait for deployment to complete before starting on the next train.

If there is a problem with a train, any engineer can manually extend it by pressing a button in the UI, for example to include an urgent fix or revert commit. Train extension simply pulls in all queued commits, up to HEAD of master. This slows down the current train, and therefore everyone else on it, so it should be used sparingly. Ideally, automated test coverage is good enough that you don’t find problems at this stage and there is no need to extend for a revert or fix.
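The queueing rules described here (a new train starts as soon as the previous one begins deploying and changes are queued, and an extension pulls every queued commit onto the current train) could be sketched as follows. This is a simplified model under our own assumptions; `Scheduler`, `land`, `extend`, and `start_deploy` are hypothetical names, not Conductor's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Train:
    commits: List[str] = field(default_factory=list)
    deploying: bool = False


@dataclass
class Scheduler:
    # Commits landed on master that are not yet on any train.
    queue: List[str] = field(default_factory=list)
    current: Optional[Train] = None

    def land(self, commit: str) -> None:
        """Record a commit landing on master."""
        self.queue.append(commit)
        self._maybe_start_train()

    def _maybe_start_train(self) -> None:
        # A new train is created only when nothing is being verified,
        # i.e. there is no train yet or the current one is deploying.
        if self.queue and (self.current is None or self.current.deploying):
            self.current = Train(commits=list(self.queue))
            self.queue.clear()

    def extend(self) -> None:
        """Manual extension: pull all queued commits, up to HEAD."""
        if self.current is None or self.current.deploying:
            raise RuntimeError("no train in verification to extend")
        self.current.commits.extend(self.queue)
        self.queue.clear()

    def start_deploy(self) -> Train:
        """Begin the production deploy without blocking the next train."""
        if self.current is None:
            raise RuntimeError("no train to deploy")
        train = self.current
        train.deploying = True
        self._maybe_start_train()
        return train
```

Note that `extend` takes everything in the queue rather than a single commit, which is why an extension delays every other change already on the train.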

One of the advantages of the way we have implemented trains is that releases are typically quite small. The smaller your releases, the better, since the likelihood of a problem increases with every additional change. Furthermore, when something does go wrong, it is much easier to find the offending change in a small batch than in a large one.
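To make the batch-size argument concrete, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that each change independently has the same chance of introducing a problem; the 2% figure is a made-up number, not a measurement from our releases.

```python
def problem_probability(n_changes: int, p_per_change: float) -> float:
    """Chance that at least one of n independent changes has a problem."""
    return 1.0 - (1.0 - p_per_change) ** n_changes


# Hypothetical per-change problem rate of 2%, for illustration only.
for n in (5, 20, 50):
    print(n, round(problem_probability(n, 0.02), 3))
```

Under these assumptions, the risk for the release as a whole grows quickly with the number of changes even though each individual change is unlikely to cause trouble.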

Engineers are encouraged to put their changes behind a feature flag and we provide a framework called Feature Config which powers this. If a change is behind a feature config, we let engineers bypass the manual verification process. The rationale here is that the feature can be manually verified at any time, potentially with a small number of users initially, and if it has a bug, it can be switched off instantly.
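The sketch below shows one common way such a flag can work: a kill switch plus a percentage rollout that buckets users deterministically. It is not the Feature Config API; the class, its fields, and the hashing scheme are assumptions made for the sake of the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    # Hypothetical flag model, not the real Feature Config framework.
    name: str
    enabled: bool = False      # kill switch: False turns the feature off for everyone
    rollout_percent: int = 0   # share of users (0 to 100) who see the feature

    def is_on_for(self, user_id: str) -> bool:
        """Return True if this user falls inside the rollout."""
        if not self.enabled:
            return False
        # Hash the flag name and user id so each user lands in a stable
        # bucket from 0 to 99, independent of other flags.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent
```

Because the new code path is off by default, it can ship on a train without affecting production, then be enabled for a small percentage of users and switched off again by flipping `enabled`.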

3) Kill release teams — democratize deployment workflows instead

If you’re an engineer who has just finished a change, you have a strong incentive to see it released as soon as possible. Similarly, if you as an individual engineer introduce a defect you have a very strong desire to fix it quickly and get that fix live on production immediately. You’ll be much happier and more productive if you can drive this workflow yourself.

If the workflow is instead driven and controlled by a dedicated team (e.g. a “release team”), the incentives aren’t necessarily the same, and the people best placed to fix a problem may not even see it. For example, if a bad data migration is in a release, it’s best to give the people who had changes in that release visibility so that they can quickly debug and fix it.

Align Incentives

One of the philosophies at Nextdoor is to align incentives at the right levels to boost productivity as much as possible. Toward this goal, we pick a random individual who has a change on the train to be the “train engineer”. The train engineer is responsible for frontline triage of issues. They act as a release engineer but for a very small release. For example, if an automated test fails during the verification phase of their train, they will be sent a Slack message telling them to triage the issue and resolve the problem. They typically do this by reading the test failure output, identifying a likely candidate change on the train, and then coordinating with that change’s author to get a fix or a revert onto the train to get it moving again.

In practice, the train engineer system works well. If a particular train is delayed or has issues, there is clear individual responsibility for tracking down verifiers, performing a rollback or extending the train with a fix. Since the train engineer by definition has a change on the train themselves, they also have a strong incentive to resolve problems quickly. By selecting the train engineer at random, we gradually spread release process knowledge and load throughout the organization.
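Picking the train engineer can be as simple as a random choice among the authors on the train. The sketch below is one way to do it, not necessarily how Conductor does; the commit dictionaries and the decision to weight every author equally, regardless of how many commits they have, are assumptions.

```python
import random
from typing import Dict, Iterable


def pick_train_engineer(commits: Iterable[Dict[str, str]], rng=random) -> str:
    """Choose one author at random from the commits on a train.

    Each commit is a hypothetical dict with an "author" key. Authors are
    de-duplicated first so that landing many commits does not make
    someone more likely to be picked.
    """
    authors = sorted({commit["author"] for commit in commits})
    if not authors:
        raise ValueError("a train needs at least one commit")
    return rng.choice(authors)


train = [
    {"sha": "a1", "author": "alice"},
    {"sha": "b2", "author": "bob"},
    {"sha": "c3", "author": "alice"},
]
print(pick_train_engineer(train))
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible in tests.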

Get Humans Out Of The Way

Another observation was that human verification can be extremely slow. For example, perhaps an engineer has a change on the train and then they go into an interview for an hour or a series of meetings — meanwhile everyone else is held up. To tackle this problem, we have created a culture where it’s agreed that excessively slow verifications aren’t acceptable behavior. It’s considered as serious a breach of etiquette as breaking tests in master. Other people on the train will go and track down the person with some urgency, or possibly find someone else to verify it — or in extreme cases revert the offending commit entirely.

While this culture helps to a significant degree, humans will make mistakes. The best way to avoid having a human forget to do their verification is to make it unnecessary for them to verify in the first place. To aid in this, we have introduced additional tools and workflows that let you skip verifying your change on Staging entirely and therefore not risk holding everybody up:

Pre-land verification environments, called Preview Environments. These are unique, per-code-review, fully-isolated Staging-like environments which engineers can use to verify their changes before they land them on master. If they use this capability, they bypass Staging and its manual verification requirements entirely. Thus there is no human verification requirement for their change during the release process.

Feature Config. If your change is behind a feature config, you don’t need to verify manually on Staging since your change has no impact on Production by default. It can be rolled out to a limited audience and quickly turned off if it has bugs or issues.
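Taken together, the two bypasses determine who still gets a manual verification ticket when a train reaches Staging. A minimal sketch of that rule follows; the `Change` fields and the function name are hypothetical and only encode the two exemptions described above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    # Hypothetical record of one change on a train.
    sha: str
    author: str
    behind_feature_config: bool = False  # dark launched behind a flag
    verified_in_preview: bool = False    # verified pre-land in a Preview Environment


def engineers_needing_tickets(changes: Iterable[Change]) -> List[str]:
    """Authors who must manually verify at least one change on Staging."""
    return sorted(
        {
            change.author
            for change in changes
            if not (change.behind_feature_config or change.verified_in_preview)
        }
    )
```

If every change on a train qualifies for one of the two bypasses, the list is empty and the train waits only on automated tests.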

Conclusion

At Nextdoor, we have been continuously deploying dozens of releases per day of our large monolith for about 10 months. We have 70+ engineers working on this codebase. The benefits of Continuous Deployment to the speed of our product development organization have been significant.

However, the unique challenges posed by both a monolithic application and a rapid pace of change have required us to invest heavily in workflow optimization and tooling to achieve this. It took a team of 3 engineers about 9 months to figure out all of the intricacies of the workflow and the developer user experience, and then to build the associated tooling. At the core is the Conductor microservice, which glues together Jenkins, Slack, GitHub and JIRA to present a coherent user interface and enforce rules. Since every engineer at Nextdoor interacts with Conductor to ship their code, we wanted to make it as pleasurable as possible to use.

We plan to open source Conductor, which was designed to be modular with pluggable support for third-party services, in the near future.

I’d like to give a huge shout-out to the engineers on the Dev Tools Team who have contributed to this project and this blog post: Steve Mostovoy, Rob Mackenzie, Mikhail Simin and Alex Karweit.

Find this sort of stuff cool? The Nextdoor engineering team is always looking for motivated and talented engineers.