Apache Airflow is a data pipeline orchestration tool. It helps run periodic jobs written in Python, monitor their progress and outcome, retry failed jobs and convey events in a colourful and concise Web UI. Jobs, known as DAGs, have one or more tasks. Tasks can be any sort of action, such as downloading a file, converting an Excel file to CSV or launching a Spark job on a Hadoop cluster. Airflow isolates the logs created during each task and presents them when the status box for the respective task is clicked. This makes investigating failures much quicker than searching endlessly through lengthy, aggregated log files.
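The tasks-and-dependencies model above can be sketched without Airflow at all: a DAG is just tasks plus dependency edges, and the scheduler's core job is to run tasks in an order that respects those edges. Below is a minimal, illustrative sketch in plain Python (the task names are hypothetical, mirroring a typical ETL chain; this is not Airflow's actual scheduling code):

```python
from collections import deque

# A toy "DAG": each task maps to the set of tasks it depends on.
deps = {
    'download_file':  set(),
    'convert_to_csv': {'download_file'},
    'load_to_db':     {'convert_to_csv'},
    'notify':         {'load_to_db'},
}

def execution_order(deps):
    """Kahn's algorithm: repeatedly run tasks whose dependencies are done."""
    remaining = {task: set(d) for task, d in deps.items()}
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for other, pending in remaining.items():
            if task in pending:
                pending.remove(task)
                if not pending:
                    ready.append(other)
    if len(order) != len(deps):
        raise ValueError('cycle detected - not a DAG')
    return order

print(execution_order(deps))
# ['download_file', 'convert_to_csv', 'load_to_db', 'notify']
```

The cycle check is what puts the "A" in DAG: Airflow similarly refuses to schedule a job whose tasks depend on each other circularly.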

Over the past 18 months nearly every platform engineering job specification I've come across has mentioned the need for Airflow expertise. Airflow's README file lists over 170 firms that have deployed Airflow in their enterprise. Even Google has taken notice: it recently announced a managed Airflow service on Google Cloud Platform called Composer. This offering uses Kubernetes to execute jobs in their own, isolated Compute Engine instances, and the resulting logs are stored on Google's Cloud Storage service.

Orchestration tools are nothing new. When Airflow began its life in 2014 there were already a number of other tools providing functionality that overlaps with Airflow's offerings. But Airflow was unique in that it's a reasonably small Python application with a great-looking Web UI. Installation of Airflow is concise and mirrors the process for most Python-based web applications. Jobs are written in Python and are easy for newcomers to put together. I believe these are the key reasons for Airflow's popularity and dominance in the orchestration space.

Airflow has a healthy developer community behind it. As of this writing, 486 engineers from around the world have contributed code to the project. These engineers include staff from data-centric firms such as Airbnb, Bloomberg, Japan's NTT (the world's fourth-largest telecoms firm) and Lyft. The project has a strong enough developer community around it that the original author, Maxime Beauchemin, has been able to step back from making major code contributions over the past two years without a significant impact on development velocity.

I've already covered performing a basic Airflow installation as a part of a Hadoop cluster and put together a walk-through on creating a Foreign Exchange Rate Airflow job. In this blog post I'm going to walk through six features that should prove helpful in customising your Airflow installations.

Installing Prerequisites

I recommend installing Airflow on a system with at least 8 GB of RAM and 100 GB of disk capacity. I try to ensure jobs don't leave files behind on the drive Airflow runs on, but if that does happen, it's good to have a 100 GB buffer so these sorts of issues can be spotted before the drive fills up.

The following will install Python, PostgreSQL, git and the screen utility. This was run on a fresh installation of Ubuntu 16.04.2 LTS.

```bash
$ sudo apt install \
    git \
    postgresql \
    python \
    python-pip \
    python-virtualenv \
    screen
```

I'll create a Python virtual environment, activate it and install a few Python-based packages that will be used throughout this blog post.

```bash
$ virtualenv ~/.air_venv
$ source ~/.air_venv/bin/activate

$ pip install \
    'apache-airflow==1.9.0' \
    boto3 \
    cryptography \
    dask \
    'distributed>=1.17.1, <2' \
    flask_bcrypt \
    psycopg2-binary
```

Using PostgreSQL for Metadata

In other Airflow posts I've used MySQL and SQLite to store Airflow's Metadata, but over the past year or so, when deploying Airflow into production, I've been using PostgreSQL. I've found PostgreSQL good for concurrency, for storing time zone information in timestamps and for having great defaults in its command line tools. Airflow uses SQLite by default, but I recommend switching to PostgreSQL for the reasons mentioned above.

Below I'll create a PostgreSQL user account that matches my UNIX username, as well as a new PostgreSQL database for Airflow.

```bash
$ sudo -u postgres \
    bash -c "psql -c \"CREATE USER mark WITH PASSWORD 'airflow_password' SUPERUSER;\""
$ createdb airflow
```

I'll initialise Airflow and edit its configuration file, changing the Metastore's connection details to point at the PostgreSQL installation above.

```bash
$ cd ~
$ airflow initdb
$ vi ~/airflow/airflow.cfg
```

```ini
[core]
sql_alchemy_conn = postgresql+psycopg2://mark:airflow_password@localhost/airflow
```

I'll run initdb again to set up the tables in PostgreSQL and I'll upgrade those tables with the latest schema migrations.

```bash
$ airflow initdb
$ airflow upgradedb
```
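SQLAlchemy connection strings follow a `dialect+driver://user:password@host/database` pattern. A quick way to sanity check a DSN before pasting it into airflow.cfg is to pull it apart with the standard library (the values below are the ones from the example above):

```python
from urllib.parse import urlsplit

dsn = 'postgresql+psycopg2://mark:airflow_password@localhost/airflow'
parts = urlsplit(dsn)

# dialect+driver, credentials, host and database name, pulled apart
print(parts.scheme)    # postgresql+psycopg2
print(parts.username)  # mark
print(parts.hostname)  # localhost
print(parts.path)      # /airflow  (leading slash, then the database name)
```

A typo in the password or database name here is far easier to spot than the same typo buried in a SQLAlchemy stack trace when Airflow first tries to connect.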

Storing Logs on AWS S3

By default, Airflow stores log files locally without compression. If you are running a lot of jobs, or even a small number of jobs frequently, disk space can get eaten up pretty fast. Storing logs on S3 not only frees you from disk maintenance considerations, the durability of these files is also better guaranteed by AWS S3's 11 9s of durability than by what most disk manufacturers and file systems can offer.

The following will create a connection to an S3 account. Fill in the access and secret keys with an account of your own. I recommend the credentials only have read/write access to a dedicated S3 bucket.

```bash
$ airflow connections \
    --add \
    --conn_id AWSS3LogStorage \
    --conn_type s3 \
    --conn_extra '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
```

The following will create a logging configuration. Note the S3_LOG_FOLDER setting at the top of the file. This should be replaced with the location where you want to store your log files on S3. If it isn't replaced with a valid location, Airflow will produce a lot of error messages.

```bash
$ mkdir -p ~/airflow/config
$ touch ~/airflow/config/__init__.py
$ vi ~/airflow/config/log_config.py
```

```python
import os

from airflow import configuration as conf

S3_LOG_FOLDER = 's3://<your s3 bucket>/airflow-logs/'

LOG_LEVEL = conf.get('core', 'LOGGING_LEVEL').upper()
LOG_FORMAT = conf.get('core', 'log_format')

BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
PROCESSOR_LOG_FOLDER = conf.get('scheduler', 'child_process_log_directory')

FILENAME_TEMPLATE = \
    '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'airflow.task': {
            'format': LOG_FORMAT,
        },
        'airflow.processor': {
            'format': LOG_FORMAT,
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'airflow.task',
            'stream': 'ext://sys.stdout'
        },
        'file.task': {
            'class': 'airflow.utils.log.file_task_handler.FileTaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            'filename_template': FILENAME_TEMPLATE,
        },
        'file.processor': {
            'class': 'airflow.utils.log.file_processor_handler.FileProcessorHandler',
            'formatter': 'airflow.processor',
            'base_log_folder': os.path.expanduser(PROCESSOR_LOG_FOLDER),
            'filename_template': PROCESSOR_FILENAME_TEMPLATE,
        },
        's3.task': {
            'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
            'formatter': 'airflow.task',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            's3_log_folder': S3_LOG_FOLDER,
            'filename_template': FILENAME_TEMPLATE,
        },
    },
    'loggers': {
        '': {
            'handlers': ['console'],
            'level': LOG_LEVEL,
        },
        'airflow': {
            'handlers': ['console'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.processor': {
            'handlers': ['file.processor'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
        'airflow.task': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': False,
        },
        'airflow.task_runner': {
            'handlers': ['s3.task'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
    }
}
```

The following will modify Airflow's settings so that it uses the logging configuration above. The three configuration settings live under the [core] section, but boilerplate settings for each item are spread throughout that section, so take your time editing each line of configuration individually.

```bash
$ vi ~/airflow/airflow.cfg
```

```ini
[core]
task_log_reader = s3.task
logging_config_class = log_config.LOGGING_CONFIG
remote_log_conn_id = AWSS3LogStorage
```

Setup User Accounts

Airflow supports login and password authentication. Airflow doesn't ship with a command line tool for creating accounts, but if you launch Python's REPL you can create user accounts programmatically.

The following configuration changes are needed to enable password authentication. Make sure these changes go under the [webserver] section and not the [api] section (which also has an auth_backend directive).

```bash
$ vi ~/airflow/airflow.cfg
```

```ini
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
```

Then launch Python's REPL.

```bash
$ python
```

The code below creates a user account. The password will be stored using a cryptographic hash.

```python
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

user = PasswordUser(models.User())
user.username = 'username'
user.email = 'user@example.com'
user.password = 'password'

session = settings.Session()
session.add(user)
session.commit()
session.close()
```
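The PasswordUser class hashes the password (via flask_bcrypt, one of the packages installed earlier) before it reaches the database. For illustration only, the salted-hash pattern it relies on can be sketched with the standard library; this uses PBKDF2 rather than bcrypt, and is not Airflow's actual code:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, rounds=100_000):
    """Return (salt, digest); store both, never the plaintext password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, rounds)
    return salt, digest

def check_password(password, salt, digest, rounds=100_000):
    """Re-derive the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, rounds)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password('password')
print(check_password('password', salt, digest))  # True
print(check_password('guess', salt, digest))     # False
```

The random per-user salt is what prevents two accounts with the same password from producing the same stored digest.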

Run Jobs with Dask.distributed

Dask is a parallel computing library that includes a lightweight task scheduler called Dask.distributed. In the past I've used Celery and RabbitMQ to run Airflow jobs, but recently I've been experimenting with using Dask instead. Dask is trivial to set up and, compared to Celery, has less overhead and much lower latency. In February 2017, Jeremiah Lowin contributed a DaskExecutor to the Airflow project. Below I'll walk through setting it up.

First, I'll change the executor setting to DaskExecutor.

```bash
$ vi ~/airflow/airflow.cfg
```

```ini
[core]
executor = DaskExecutor
```

Before I can launch any services, the ~/airflow/config folder needs to be on Python's path in order for the logging configuration to get picked up by the rest of Airflow.

```bash
$ export PYTHONPATH=$HOME/airflow/config:$PYTHONPATH
```

In production I run Airflow's services via Supervisor, but if you want to test out the above in a development environment I recommend using screen to keep Airflow's services running in the background. Below I'll launch the Dask.distributed scheduler in a screen.

```bash
$ screen
$ dask-scheduler
```

Type CTRL-a and then CTRL-d to detach from the screen, leaving it running in the background. I'll then launch a Dask.distributed worker in another screen.

```bash
$ screen
$ dask-worker 127.0.0.1:8786
```

Note that if you set up the Python virtual environment on other machines you can launch workers on them as well, creating a cluster for Airflow to run its jobs on.

The following will launch Airflow's Web UI service on TCP port 8080.

```bash
$ screen
$ airflow webserver -p 8080
```

The following will launch Airflow's scheduler.

```bash
$ screen
$ airflow scheduler
```

If you open Airflow's Web UI you can "unpause" the "example_bash_operator" job and manually trigger it by clicking the play button in the controls section on the right. Log files read via the Web UI should state they're being read off of S3. If you don't see this message, it could be that the logs haven't yet finished being uploaded.

```
*** Reading remote log from s3://<your s3 bucket>/airflow-logs/example_bash_operator/run_this_last/2018-06-21T14:52:34.689800/1.log.
```
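Conceptually, the executor submits each task's run command to the Dask scheduler and watches the returned futures. Dask's Client is modelled on the standard concurrent.futures interface, so the pattern can be sketched without a cluster at all; the task names and run_task function below are hypothetical stand-ins, not Airflow's internals:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(task_id):
    # Stand-in for shelling out to "airflow run <dag> <task> <date>".
    return task_id, 'success'

# In a real DaskExecutor deployment, distributed.Client('127.0.0.1:8786')
# plays the role of this local executor.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(run_task, t)
               for t in ['task_a', 'task_b', 'task_c']]
    results = dict(f.result() for f in as_completed(futures))

print(results)  # each task maps to 'success'; key order varies with completion
```

Because the interfaces match, swapping a local thread pool for a Dask cluster changes where the work runs without changing the submit-and-collect logic.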

Airflow Maintenance DAGs

Robert Sanders of Clairvoyant has a repository containing three Airflow jobs that help keep Airflow operating smoothly. The db-cleanup job clears out old entries in six of Airflow's database tables. The log-cleanup job removes log files stored in ~/airflow/logs that are older than 30 days (note this does not affect logs stored on S3). Finally, kill-halted-tasks kills lingering processes left running in the background after you've killed off a running job in Airflow's Web UI.

Below I'll create a folder for Airflow's jobs and clone the repository.

```bash
$ mkdir -p ~/airflow/dags
$ git clone https://github.com/teamclairvoyant/airflow-maintenance-dags.git \
    ~/airflow/dags/maintenance
```

If Airflow's scheduler is running, you'll need to restart it in order for it to pick up these new jobs. Below I'll "unpause" the three jobs so they'll start running on a scheduled basis.

```bash
$ for DAG in airflow-db-cleanup \
             airflow-kill-halted-tasks \
             airflow-log-cleanup; do
      airflow unpause $DAG
  done
```
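The heart of a log-cleanup job like the one above is a retention sweep: walk the log directory and flag anything whose modification time falls outside the retention window. A stdlib sketch of that idea (the 30-day default comes from the job's description; this is not the repository's actual code, and you'd review the list before deleting anything):

```python
import os
import time

RETENTION_DAYS = 30

def expired_logs(base_dir, retention_days=RETENTION_DAYS, now=None):
    """Yield paths of files older than the retention window."""
    now = now if now is not None else time.time()
    cutoff = now - retention_days * 86400
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                yield path

# Usage sketch:
# for path in expired_logs(os.path.expanduser('~/airflow/logs')):
#     os.remove(path)
```

Running this on a schedule is exactly the sort of small, periodic chore Airflow itself is good at, which is why it ships as a DAG rather than a cron entry.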