DAGs are blooming

As people who work with data begin to automate their processes, they inevitably write batch jobs. These jobs need to run on a schedule, typically have dependencies on other existing datasets, and have other jobs that depend on them. Put a few data workers together for even a short amount of time and you quickly get a growing, complex graph of batch computation jobs. Let a fast-paced, medium-sized data team loose on an evolving data infrastructure for a few years and you have a massively complex network of computation jobs on your hands. This complexity can become a significant burden for data teams to manage, or even comprehend.

These networks of jobs are typically DAGs (directed acyclic graphs) and have the following properties:

- Scheduled: each job should run at a certain scheduled interval
- Mission critical: if some of the jobs aren’t running, we are in trouble
- Evolving: as the company and the data team mature, so does the data processing
- Heterogeneous: the stack for modern analytics is changing quickly, and most companies run multiple systems that need to be glued together
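To make the DAG idea concrete: a network of batch jobs can be modeled as a plain dependency mapping, and any scheduler must derive a valid execution order from it before firing jobs. A minimal sketch using hypothetical job names (Kahn's topological sort; not Airflow's actual scheduler code):

```python
from collections import deque

# Hypothetical batch jobs mapped to their upstream dependencies.
deps = {
    "clean_events": [],
    "sessionize": ["clean_events"],
    "compute_metrics": ["sessionize"],
    "email_report": ["compute_metrics", "sessionize"],
}

def execution_order(deps):
    """Kahn's algorithm: return jobs in an order that respects dependencies."""
    remaining = {job: set(ups) for job, ups in deps.items()}
    ready = deque(sorted(j for j, ups in remaining.items() if not ups))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        # Unblock any job whose last remaining dependency just finished.
        for other, ups in remaining.items():
            ups.discard(job)
            if not ups and other not in order and other not in ready:
                ready.append(other)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return order

print(execution_order(deps))
```

The acyclicity requirement is what makes this tractable: a cycle would mean some job can never have all its inputs ready.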

Every company has one (or many)

Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. There’s always the good old cron scheduler to get started, and many vendor packages ship with scheduling capabilities. The next step forward is to have scripts call other scripts, and that can work for a short period of time. Eventually simple frameworks emerge to solve problems like storing the status of jobs and dependencies.

Typically these solutions grow reactively in response to the increasing need to schedule individual jobs, and usually because the current incarnation of the system doesn’t allow for simple scaling. Also note that the people who write data pipelines are typically not software engineers; their mission and competencies are centered around processing and analyzing data, not building workflow management systems.

Considering that internally grown workflow management systems are often at least one generation behind the company’s need, the friction around authoring, scheduling and troubleshooting jobs creates massive inefficiencies and frustrations that divert data workers off of their productive path.

Airflow

After reviewing the open source solutions, and leveraging Airbnb employees’ insight about systems they had used in the past, we came to the conclusion that there wasn’t anything in the market that met our current and future needs. We decided to build a modern system to solve this problem properly. As the project progressed in development, we realized that we had an amazing opportunity to give back to the open source community that we rely so heavily upon. Therefore, we have decided to open source the project under the Apache license.

Here are some of the processes fueled by Airflow at Airbnb:

- Data warehousing: cleanse, organize, data-quality check, and publish data into our growing data warehouse
- Growth analytics: compute metrics around guest and host engagement as well as growth accounting
- Experimentation: compute our A/B testing experimentation frameworks’ logic and aggregates
- Email targeting: apply rules to target and engage our users through email campaigns
- Sessionization: compute clickstream and time-spent datasets
- Search: compute search-ranking-related metrics
- Data infrastructure maintenance: database scrapes, folder cleanup, applying data retention policies, …

Architecture

Much like English is the language of business, Python has firmly established itself as the language of data. Airflow is written in Python from the ground up. The code base is extensible, documented, consistent, linted and has broad unit test coverage.

Pipeline authoring is also done in Python, which means dynamic pipeline generation from configuration files or any other source of metadata comes naturally. “Configuration as code” is a principle we stand by for this purpose. While YAML or JSON job configuration would allow pipelines to be generated from any language, we felt that some fluidity gets lost in the translation. Being able to introspect code (IPython!, IDEs), subclass, metaprogram and import libraries to help write pipelines adds tremendous value. Note that it is still possible to author jobs in any language or markup, as long as you write the Python that interprets these configurations.
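The payoff of Python-based authoring is that pipelines can be generated in a loop from plain metadata instead of being written out by hand. A minimal sketch of the idea, using a hypothetical stand-in `Task` class rather than Airflow's real operators:

```python
class Task:
    """Stand-in for an Airflow operator: a name plus upstream links."""
    def __init__(self, name):
        self.name = name
        self.upstream = []

    def set_upstream(self, other):
        self.upstream.append(other)

# This metadata could just as easily come from a config file,
# a database query, or an API call.
tables = ["users", "listings", "bookings"]

extract = {t: Task(f"extract_{t}") for t in tables}
load = {t: Task(f"load_{t}") for t in tables}
publish = Task("publish_warehouse")

# Dynamic pipeline generation: one extract -> load chain per table,
# all feeding a single publish step.
for t in tables:
    load[t].set_upstream(extract[t])
    publish.set_upstream(load[t])

print([u.name for u in publish.upstream])
```

Adding a new table to the warehouse becomes a one-line change to the metadata, not a copy-pasted block of pipeline definition.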

While you can get up and running with Airflow in just a few commands, the complete architecture has the following components:

- The job definitions, in source control.
- A rich CLI (command line interface) to test, run, backfill, describe and clear parts of your DAGs.
- A web application, to explore your DAGs’ definitions, their dependencies, progress, metadata and logs. The web server is packaged with Airflow and is built on top of the Flask Python web framework.
- A metadata repository, typically a MySQL or Postgres database that Airflow uses to keep track of task job statuses and other persistent information.
- An array of workers, running the jobs’ task instances in a distributed fashion.
- Scheduler processes, that fire up the task instances that are ready to run.
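At its core, the metadata repository is a small relational schema for task state. A toy sketch of the idea with SQLite (the real Airflow schema is richer; the table and column names here are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE task_instance (
           dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT,
           PRIMARY KEY (dag_id, task_id, execution_date))"""
)

def set_state(dag_id, task_id, execution_date, state):
    conn.execute(
        "INSERT OR REPLACE INTO task_instance VALUES (?, ?, ?, ?)",
        (dag_id, task_id, execution_date, state),
    )

def get_state(dag_id, task_id, execution_date):
    row = conn.execute(
        "SELECT state FROM task_instance "
        "WHERE dag_id=? AND task_id=? AND execution_date=?",
        (dag_id, task_id, execution_date),
    ).fetchone()
    return row[0] if row else None

# A scheduler marks a task running; a worker later marks it successful.
set_state("warehouse", "load_users", "2015-06-01", "running")
set_state("warehouse", "load_users", "2015-06-01", "success")
```

Because every component reads and writes this shared store, the CLI, scheduler, workers and web UI all agree on what has run and what is still pending.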

Extensibility

While Airflow comes fully loaded with ways to interact with commonly used systems like Hive, Presto, MySQL, HDFS, Postgres and S3, and allows you to trigger arbitrary scripts, the base modules have been designed to be extended very easily.

Hooks are abstractions of external systems and share a homogeneous interface. Hooks use a centralized vault that abstracts host/port/login/password information and exposes methods to interact with these systems.
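A sketch of the hook pattern: connection details live in one central registry, and each hook exposes system-specific methods on top of them. The class and connection names below are illustrative, not Airflow's actual API:

```python
# Centralized "vault" of connection credentials, keyed by connection id.
# (In Airflow these live in the metadata database, not in code.)
CONNECTIONS = {
    "analytics_db": {"host": "db.internal", "port": 5432,
                     "login": "etl", "password": "secret"},
}

class BaseHook:
    """Shared interface: every hook resolves its credentials the same way."""
    def __init__(self, conn_id):
        self.conn = CONNECTIONS[conn_id]

class PostgresHookSketch(BaseHook):
    def get_records(self, sql):
        # A real hook would open a database connection using self.conn;
        # here we just report what would be executed.
        return f"running {sql!r} on {self.conn['host']}:{self.conn['port']}"

hook = PostgresHookSketch("analytics_db")
print(hook.get_records("SELECT 1"))
```

The design keeps credentials out of pipeline code: a job only names a connection id, and the vault supplies the rest.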

Operators leverage hooks to generate a certain type of task that become nodes in workflows when instantiated. All operators derive from BaseOperator and inherit a rich set of attributes and methods. There are 3 main types of operators:

- Operators that perform an action, or tell another system to perform an action
- Transfer operators that move data from one system to another
- Sensors, a certain type of operator that will keep running until a certain criterion is met
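The three operator types above can be sketched as a small class hierarchy. This is an illustrative stand-in, not Airflow's actual `BaseOperator`:

```python
import time

class BaseOperatorSketch:
    """Illustrative stand-in for a base operator class."""
    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self):
        raise NotImplementedError

class ActionOperator(BaseOperatorSketch):
    """Performs (or tells another system to perform) an action."""
    def execute(self):
        return f"{self.task_id}: action done"

class TransferOperator(BaseOperatorSketch):
    """Moves data from one system to another."""
    def __init__(self, task_id, source, target):
        super().__init__(task_id)
        self.source, self.target = source, target

    def execute(self):
        return f"{self.task_id}: copied {self.source} -> {self.target}"

class Sensor(BaseOperatorSketch):
    """Keeps checking until a criterion is met."""
    def __init__(self, task_id, poke, interval=0.01):
        super().__init__(task_id)
        self.poke, self.interval = poke, interval

    def execute(self):
        while not self.poke():
            time.sleep(self.interval)
        return f"{self.task_id}: criterion met"

# A sensor whose criterion becomes true on the third check.
attempts = iter([False, False, True])
sensor = Sensor("wait_for_partition", poke=lambda: next(attempts))
print(sensor.execute())
```

Because every operator shares the same `execute` contract, the scheduler and workers can treat actions, transfers and sensors uniformly.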

Executors implement an interface that allows Airflow components (CLI, scheduler, web server) to run jobs remotely. Airflow currently ships with a SequentialExecutor (for testing purposes), a threaded LocalExecutor, and a CeleryExecutor that leverages Celery, an excellent asynchronous task queue based on distributed message passing. We are also planning on sharing a YarnExecutor in the near future.
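The executor contract boils down to "take runnable task instances and run them somewhere." A toy sketch in the spirit of a sequential executor (the real interface has more methods; names here are illustrative):

```python
class ExecutorSketch:
    """Minimal executor contract: queue work, then drain it."""
    def __init__(self):
        self.queue = []
        self.results = {}

    def queue_task(self, task_id, fn):
        self.queue.append((task_id, fn))

    def heartbeat(self):
        # A SequentialExecutor-style drain: run tasks one by one in-process.
        # A Celery-backed executor would instead hand each task to a pool
        # of remote workers and collect statuses asynchronously.
        while self.queue:
            task_id, fn = self.queue.pop(0)
            self.results[task_id] = fn()

executor = ExecutorSketch()
executor.queue_task("extract", lambda: "extracted 42 rows")
executor.queue_task("load", lambda: "loaded")
executor.heartbeat()
print(executor.results)
```

Swapping executors changes where and how tasks run without touching pipeline definitions, which is what lets the same DAG scale from a laptop to a worker fleet.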

A Shiny UI

While Airflow exposes a rich command line interface, the best way to monitor and interact with workflows is through the web user interface. You can easily visualize your pipelines’ dependencies, see how they progress, get easy access to logs, view the related code, trigger tasks, fix false positives/negatives, analyze where time is spent, and get a comprehensive view of what time of day different tasks usually finish. The UI is also a place where some administrative functions are exposed: managing connections, pools and pausing progress on specific DAGs.