(2) Sensors, Hooks and Operators — Find your fit

Depending on your use case, you may need particular sensors, hooks, or operators. Airflow has decent support for the most common operators and good support for Google Cloud. For a less common use case, you will probably need to check the list of user-contributed operators or develop your own.

Understanding how to use operators in your particular company setup is also important. Some people take a radical stance on which operators to use, but in practice the choice has to be made in the context of your company.

Does your company have an engineering bias that supports the use of Kubernetes or other container style instances?

Is your company's use of Airflow driven more by your data science department, with little engineering support? In that case, it might make more sense to use a Python operator or the still-pending R operator.

Is your company only planning to use Airflow to run data transfers (SFTP/S3 …) and SQL queries to maintain a data warehouse? In that case, using K8s or any container instances would be overkill. This is, for example, the approach taken at Fetchr, where most of the processing is done in EMR/Presto.

Selecting your operator setup is not one-size-fits-all.

(3) DAGS — Keep them simple

There are quite a few ways to architect your DAGs in Airflow, but as a general rule it is good to keep them simple. Keep within a DAG only the tasks that truly depend on each other; when dealing with dependencies across multiple DAGs, abstract them into another DAG and file.

When dealing with many data sources and interdependencies, things can get messy. Setting up DAGs as self-contained files, kept as simple as possible, goes a long way toward keeping your code maintainable. The external task sensor helps separate DAGs and their dependencies into multiple self-contained DAGs.

As in most distributed systems, it is important to make operations as idempotent as possible, at least within a DAG run. Certain operations between DAG runs may rely on the depends_on_past setting.
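One common way to get idempotency within a DAG run is to key every write on the run's logical date and overwrite rather than append. The following is a minimal sketch in plain Python, not Airflow-specific code. The `load_partition` helper, the directory layout, and the sample rows are assumptions made for illustration; `ds` mirrors the date string Airflow passes to tasks.

```python
import json
import tempfile
from pathlib import Path


def load_partition(base_dir: Path, ds: str, rows: list) -> Path:
    """Write rows for one logical date, replacing any earlier output for that date."""
    partition = base_dir / f"ds={ds}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "data.json"
    # Write to a temporary file first, then swap it in, so a failed run
    # does not leave a half-written file at the target path.
    tmp = partition / "data.json.tmp"
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)
    return target


def demo() -> list:
    """Run the same (hypothetical) load twice and return what ends up on disk."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        base = Path(tmp_dir)
        rows = [{"order_id": 1, "amount": 10}]
        load_partition(base, "2019-06-01", rows)
        # Re-running the task for the same date overwrites instead of appending.
        target = load_partition(base, "2019-06-01", rows)
        return json.loads(target.read_text())
```

Because the output location depends only on the logical date, retrying or clearing a task produces the same result as running it once.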

Sub-DAGs should be used sparingly, for the same reason of code maintainability. For me, one of the only valid reasons to use Sub-DAGs is the creation of dynamic DAGs.

Communication between tasks, although possible with XCom, should be minimized in favor of self-contained functions/operators. This makes the code more legible and stateless. Unless you need to be able to re-run only one part of an operation, splitting it into tasks that communicate through XCom is not justified. Dynamic DAGs are one of the notable exceptions to this.

(4) Templates and Macros — Legible Code

Airflow leverages Jinja for templating. Commands such as Bash or SQL commands can easily be templated for execution, with variables set or computed from the context. Templates can provide a more readable alternative to direct string manipulation in Python (e.g., through a format command). Jinja is the default templating engine for most Flask developers, and can also provide a good bridge for Python web developers getting into data.

Macros provide a way to take further advantage of templating by exposing objects and functions to the templating engine. Users can leverage a set of default macros, or customize theirs at a global or DAG level.

Using templated code does, however, take you away from vanilla Python and adds one more layer of complexity for engineers who typically already need to work with quite a large array of technologies and APIs.

Whether or not you choose to leverage templates is a team/personal choice. There are more traditional ways to obtain the same results, for example wrapping the same logic in Python format commands, but templates can make the code more legible.
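To make the trade-off concrete, here is a small side-by-side sketch of the two styles. The `events` table, its `event_date` column, and the `build_sql` helper are hypothetical. `ds` and `macros.ds_add` are built-in Airflow template names; the Jinja string is only defined here, since Airflow would render it when the task runs.

```python
from datetime import date, timedelta

# Jinja version: passed as-is to a templated operator field and rendered
# by Airflow at run time. `ds` is the logical date as YYYY-MM-DD.
TEMPLATED_SQL = (
    "SELECT * FROM events "
    "WHERE event_date BETWEEN '{{ macros.ds_add(ds, -7) }}' AND '{{ ds }}'"
)


def build_sql(ds: str) -> str:
    """Plain-Python equivalent: compute the date window, then format the string."""
    end = date.fromisoformat(ds)
    start = end - timedelta(days=7)
    return (
        "SELECT * FROM events "
        "WHERE event_date BETWEEN '{start}' AND '{end}'"
    ).format(start=start.isoformat(), end=end.isoformat())
```

The templated version keeps the query readable in one place, while the plain-Python version stays testable without any templating engine.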

(5) Event-Driven — REST API for building Data Products

Airflow's REST API allows for the creation of event-driven workflows. The key feature of the API is to let you trigger DAG runs with a specific configuration.
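As a rough illustration, triggering a run with a configuration payload could look like the sketch below. It assumes the stable REST API of Airflow 2.x (`POST /api/v1/dags/{dag_id}/dagRuns`); older releases exposed a different, experimental endpoint. The host, the `process_upload` DAG id, and the payload are hypothetical, and the request is only built, not sent, because authentication depends on your deployment.

```python
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> urllib.request.Request:
    """Build (but do not send) a request that triggers one DAG run with `conf`."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, DAG id, and configuration payload.
request = build_trigger_request(
    "http://localhost:8080",
    "process_upload",
    {"path": "s3://my-bucket/file.csv"},
)
# Sending it would also need credentials suited to your setup:
# urllib.request.urlopen(request)
```

Inside the triggered DAG, the payload is then available to tasks through `dag_run.conf`.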

The REST API allows for building data product applications on top of Airflow, with use cases such as:

Spinning up clusters and processing based on an HTTP request

Setting up a workflow based on a message appearing in a message topic or a file appearing in blob storage

Building fully fledged machine learning platforms.

Leveraging the REST API allows for the construction of complex asynchronous processing patterns while reusing the same architecture, platform, and possibly code used for more traditional data processing.