Background

At Nextdoor, we run a lot of scheduled jobs for various important purposes, such as sending tens of millions of digest emails to our users daily, generating internal reports on our growth, and performing operational tasks. Like many other internet companies (e.g., Airbnb and Quora), we started with Cron and ended up building our own Cron replacement, which we called Nextdoor Scheduler.

We’ve been using Nextdoor Scheduler for over 18 months and we are extremely happy with it.

So, what’s the problem with Cron?

There are four main problems with Cron.

First, the way we used Cron was not scalable. We ran all Cron jobs on a beefy scheduler machine (c3.8xlarge). As we gained traction, Cron jobs pushed the machine to the limit of its compute resources.

Second, editing a plain-text crontab to manage jobs (adding, deleting, or pausing them) is error-prone. For instance, an extra asterisk prevented all production jobs from running the other day:

1 * * * * * /opt/nextdoor/some_job.sh
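A small lint step can catch this class of mistake before a crontab is deployed. The following is an illustrative sketch only, not part of Nextdoor Scheduler: the function name `find_suspicious_lines`, the path `good_job.sh`, and the heuristic itself are assumptions. It flags any entry whose sixth token still looks like a schedule field, which usually means an extra field slipped in.

```python
import re

# Characters that can appear in a numeric cron schedule field (e.g. "*/5", "1-3", "0,30").
SCHEDULE_FIELD = re.compile(r"^[\d*/,\-]+$")


def find_suspicious_lines(crontab_text):
    """Return (line_number, line) pairs whose command position looks like a schedule field.

    Hypothetical helper: assumes the user crontab format of five schedule
    fields followed by the command. Comments, blank lines, variable
    assignments, and @-shortcuts (e.g. @daily) are skipped.
    """
    suspicious = []
    for number, line in enumerate(crontab_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "@")):
            continue
        tokens = stripped.split()
        if "=" in tokens[0]:  # Environment assignment such as MAILTO=...
            continue
        # A valid entry has five schedule fields, then the command as token six.
        if len(tokens) < 6 or SCHEDULE_FIELD.match(tokens[5]):
            suspicious.append((number, line))
    return suspicious


crontab = """\
# m h dom mon dow command
1 * * * * /opt/nextdoor/good_job.sh
1 * * * * * /opt/nextdoor/some_job.sh
"""
print(find_suspicious_lines(crontab))
```

Only the third line is reported, because its sixth token is another asterisk rather than a command. A check like this is a heuristic and cannot prove a schedule is the intended one.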

Third, we incurred a lot of operational overhead with Cron. We have nearly two hundred production jobs that run thousands of times a day at different frequencies (e.g., every minute, hourly, weekly). Job failure is common. The oncall person had to manually restart failed jobs several times a day, sometimes after midnight. Here is an example of a typical oncall experience: 1) get paged with the command line of the failed job; 2) ssh into the scheduler machine; 3) copy & paste the command line to rerun the failed job. This is certainly not good for engineering happiness, and yes, we do care about the happiness of our employees!

Fourth, we had little visibility into production jobs at runtime. There was no easy way to know which jobs were running or whether they had succeeded.

Decision

Enough is enough. We decided to build a Cron replacement. But why didn’t we use open source solutions? We just couldn’t find a suitable one. We speak Python. We wanted to leverage existing infrastructure components in the company. We wanted to build it, understand it, and own it.

To address the scalability issue, we made each job an async task that can run on a cluster of Taskworker machines. We can easily configure jobs that run on Taskworker to retry automatically when they fail, which requires only a single-line code change. Our oncall engineers love this auto-retry feature!
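Taskworker is an internal system, so the snippet below is only a hedged sketch of what a one-line retry opt-in could look like. The decorator name `auto_retry`, its parameters `max_attempts` and `delay_seconds`, and the example job `flaky_job` are hypothetical and are not Taskworker's real API.

```python
import functools
import time


def auto_retry(max_attempts=3, delay_seconds=0.0):
    """Hypothetical decorator: rerun a task until it succeeds or attempts run out."""
    def decorator(task):
        @functools.wraps(task)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return task(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # Out of attempts: surface the failure to the caller.
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


calls = {"count": 0}


@auto_retry(max_attempts=3)  # The one-line change that opts a job into retries.
def flaky_job():
    """Illustrative job that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


print(flaky_job(), "after", calls["count"], "attempts")
```

With retries handled this way, a transient failure no longer has to page anyone; only a job that exhausts its attempts surfaces as a real failure.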

To replace Cron, we used the excellent Python module APScheduler to schedule jobs, which enabled us to manage jobs programmatically: we built REST APIs, command line tools, and a human-friendly web UI.

Architecture

The following picture shows the architecture of our Scheduler system.

Nextdoor Scheduler is implemented with Python / Tornado. It runs as a single daemon process (the Scheduler Process) on a single machine and consists of three components.

Scheduler (or Core Scheduler). It replaces Cron and schedules jobs to run. When a job is triggered to run, the Scheduler Process simply publishes a message for the job to Amazon SQS. A cluster of Taskworker machines grabs messages from Amazon SQS and runs the corresponding jobs. As mentioned above, we use APScheduler to implement the core scheduler.

Scheduler API. It provides a REST interface to manage jobs, e.g., adding jobs, pausing/resuming a job, removing jobs, modifying jobs, and manually kicking off a job. We’ve built command line tools on top of the Scheduler API to make operations easy, for example, pausing a group of jobs all at once.

Web UI. It is a single page app that talks to the Scheduler API. We used Backbone.js and Bootstrap to implement the Web UI. Human operators primarily use the Web UI to interact with Nextdoor Scheduler.

Information about all jobs and job executions is stored in a data store. We primarily use Postgres here at Nextdoor.
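The publish-and-consume flow in this architecture (the Scheduler Process only enqueues a message, and a worker picks it up and runs the job) can be sketched in a few lines. This is an illustration under stated assumptions, not the production code: the standard-library `queue.Queue` stands in for Amazon SQS, and the names `JOB_REGISTRY`, `publish_job`, `worker_loop`, and `send_digest` are hypothetical.

```python
import json
import queue
import threading

# Stand-in for Amazon SQS in this sketch; the real system uses an SQS queue.
job_queue = queue.Queue()
results = []

# Hypothetical registry mapping job names to callables that workers can run.
JOB_REGISTRY = {
    "send_digest": lambda city: results.append("digest sent for " + city),
}


def publish_job(name, **kwargs):
    """Scheduler side: do no work here, only enqueue a message describing the job."""
    job_queue.put(json.dumps({"job": name, "kwargs": kwargs}))


def worker_loop():
    """Worker side: pull messages and run the matching job until told to stop."""
    while True:
        message = job_queue.get()
        if message is None:  # Sentinel used by this sketch to shut the worker down.
            break
        payload = json.loads(message)
        JOB_REGISTRY[payload["job"]](**payload["kwargs"])


worker = threading.Thread(target=worker_loop)
worker.start()
publish_job("send_digest", city="San Francisco")
job_queue.put(None)
worker.join()
print(results)
```

Because the scheduler side does nothing heavier than serializing a small message, the machine that hosts it stays lightly loaded regardless of how expensive the jobs themselves are.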

Web UI

Engineers love the Web UI of Nextdoor Scheduler, which provides an intuitive way to manage jobs, in contrast to the black box that Cron was in the old days.

Jobs Page

On this page, we can see what jobs we have and when each will run next. We can also click “Custom Run” to manually kick off a job.

Editing a Job

We can easily edit a job, e.g., change its schedule or pause it with one button click! This is way better than modifying a plain-text crontab in the old days.

Executions Page

Finally, we have great visibility for what jobs are running and whether they succeed or not.

Rolling out

Writing code is easy. Productionization is hard. By the time we finished the implementation of Nextdoor Scheduler, we had close to 200 production Cron jobs that needed to be migrated to the new system.

We applied what we’ve learned from the Taskworker project to roll out the Nextdoor Scheduler system. Four steps:

1) We dark launched Nextdoor Scheduler to production; no production jobs were running with the new system yet.
2) We added a feature switch to the base class of all jobs.
3) We slowly and carefully turned on feature switches for each job over two weeks.
4) We shut down the old beefy scheduler machine that ran Cron.
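The feature switch in the base class is what allows jobs to move one at a time. The post does not show the real base class, so the following is a hedged sketch: the class names, the `FEATURE_SWITCHES` dictionary, and the `trigger` method are hypothetical. It also assumes that during the migration both Cron and the new scheduler trigger every job, and the switch decides which trigger actually does the work.

```python
# Hypothetical per-job switches: True means the new scheduler owns the job.
FEATURE_SWITCHES = {"DigestEmailJob": True, "GrowthReportJob": False}


class BaseJob:
    """Hypothetical base class shared by all jobs during the migration."""

    def use_new_scheduler(self):
        """Look up this job's switch; jobs default to the old Cron path."""
        return FEATURE_SWITCHES.get(type(self).__name__, False)

    def trigger(self, source):
        """Run the job only if `source` ('cron' or 'scheduler') currently owns it."""
        owner = "scheduler" if self.use_new_scheduler() else "cron"
        if source != owner:
            return "skipped"
        return self.run()

    def run(self):
        raise NotImplementedError


class DigestEmailJob(BaseJob):
    def run(self):
        return "digest emails sent"


class GrowthReportJob(BaseJob):
    def run(self):
        return "growth report generated"


for job in (DigestEmailJob(), GrowthReportJob()):
    print(type(job).__name__, job.trigger("cron"), "|", job.trigger("scheduler"))
```

Flipping a single entry in the switch table moves one job to the new system, and flipping it back is an equally cheap rollback.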

Happy Ending

With the new Nextdoor Scheduler, we are able to run a much cheaper scheduler EC2 instance (c3.2xlarge) than before (c3.8xlarge), while keeping the load super low as we offload jobs to run on distributed Taskworker machines.

Here’s the CPU usage comparison between the old scheduler machine (top graph) and the new scheduler machine (bottom graph):

We’ve been using Scheduler jobs for over 18 months. We are happy so far. If you’re interested in working on these kinds of problems and other interesting infrastructure challenges, we’re hiring!