[1] Switch to Airflow’s KubernetesExecutor ☸️

Airflow v1.10.0 introduced the KubernetesExecutor, an evolution of the CeleryExecutor designed to enhance Airflow’s horizontal scalability. Importantly, we are referring to the worker nodes of Airflow, which undertake the DAG tasks (see the suggested reading for intros to Airflow).

As the name suggests, the KubernetesExecutor adopts the hugely popular container-orchestration system Kubernetes, removing the need for a backend queue system (required for Celery deployments) and aligning Airflow with the primary deployment method at Flock.

“The Kubernetes executor will create a new pod for every task instance.” (docs)

The KubernetesExecutor elastically allocates resources when processing DAGs by creating temporary Kubernetes Pods. This distributes the CPU and memory needs across a cluster and decouples Workers from the core Airflow Webserver and Scheduler components — a real plus for resilience and scalability of the deployment.
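To make the pod-per-task model concrete, here is a rough sketch of the kind of Pod manifest the executor submits for each task instance. This is illustrative only: the pod naming scheme, container name and command are simplified stand-ins, not Airflow’s exact internals, though the environment variables match the config covered below.

```python
import os

def build_worker_pod_spec(dag_id, task_id, try_number):
    """Rough sketch of the Pod manifest the KubernetesExecutor submits per
    task instance. Field names follow the Kubernetes Pod API, but the pod
    naming and command here are simplified stand-ins."""
    image = "{}:{}".format(
        os.environ.get("AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY", "my-airflow"),
        os.environ.get("AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG", "latest"),
    )
    # Kubernetes object names must be DNS-safe: lower-case, no underscores
    name = "{}-{}-{}".format(dag_id, task_id, try_number).lower().replace("_", "-")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": os.environ.get("AIRFLOW__KUBERNETES__NAMESPACE", "default"),
        },
        "spec": {
            "restartPolicy": "Never",  # retries are handled by the Airflow scheduler, not Kubernetes
            "containers": [{
                "name": "base",
                "image": image,
                "imagePullPolicy": os.environ.get(
                    "AIRFLOW__KUBERNETES__WORKER_CONTAINER_IMAGE_PULL_POLICY", "Always"
                ),
                # the worker container runs exactly one task instance, then exits
                "args": ["airflow", "run", dag_id, task_id],
            }],
        },
    }

pod = build_worker_pod_spec("my_dag", "extract_data", 1)
print(pod["metadata"]["name"])  # my-dag-extract-data-1
```

Because each pod exits once its single task instance finishes, failed tasks are simply re-submitted as fresh pods by the scheduler.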

[Image: Airflow KubernetesExecutor creating temporary Pods]

[1a] Config 📂

At Flock, we were upgrading from a rather vanilla, dockerised Airflow deployment which used the SequentialExecutor. To begin this upgrade we required a whole new set of configuration variables.

Now is probably a good point to discuss config variables and how/where we inject them, as you will likely have your own solution. Below are the environment variables we declare either in our Dockerfile at build time or in our deployment YAML for Kubernetes (see later in this post).
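For reference, Airflow maps environment variables onto airflow.cfg entries using the convention AIRFLOW__{SECTION}__{KEY} (double underscores around the upper-cased section name), which is what all the variable names below follow. A tiny sketch of the mapping:

```python
def env_var_for(section, key):
    """Return the environment variable Airflow reads for a given
    [section] key pair in airflow.cfg: AIRFLOW__{SECTION}__{KEY}."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())

print(env_var_for("core", "executor"))         # AIRFLOW__CORE__EXECUTOR
print(env_var_for("kubernetes", "namespace"))  # AIRFLOW__KUBERNETES__NAMESPACE
```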

----#### CORE ####----
ENV AIRFLOW__CORE__EXECUTOR KubernetesExecutor
# Needed to overwrite the default of SequentialExecutor

ENV AIRFLOW__CORE__LOAD_EXAMPLES False
# This reduces the unnecessary DAGs in the UI

ENV AIRFLOW__WEBSERVER__NAVBAR_COLOR #1956A1
# Because who doesn't want a company coloured GUI 😃

----#### KUBERNETES ####----
ENV AIRFLOW__KUBERNETES__DAGS_IN_IMAGE True
# Depending on your design, your images may or may not contain the DAGs locally

ENV AIRFLOW__KUBERNETES__NAMESPACE <namespace>
# If you namespace your Kubernetes cluster then declare that here

ENV AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG latest
# Depending on your image tagging strategy you may not use latest

ENV AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY <repository>
# The address of your local or remote image registry i.e. ECR

ENV AIRFLOW__KUBERNETES__WORKER_CONTAINER_IMAGE_PULL_POLICY Always
# Ensures your temporary pods have the latest code

ENV AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME <service name>
# Provide appropriate permissions for your deployment
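As a belt-and-braces measure, a small startup check can confirm the variables above actually made it into the container. This helper is our own illustration, not part of Airflow, and the list of required names is simply the variables from the config above that have no sensible default:

```python
import os

# Variables from the config above with no sensible default
REQUIRED = [
    "AIRFLOW__CORE__EXECUTOR",
    "AIRFLOW__KUBERNETES__NAMESPACE",
    "AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY",
    "AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME",
]

def missing_config(environ=None):
    """Return the required variable names absent (or empty) in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED if not environ.get(name)]

# Example: only the executor is set, so the other three are reported missing
print(missing_config({"AIRFLOW__CORE__EXECUTOR": "KubernetesExecutor"}))
```

Failing fast at container start is far cheaper than debugging a scheduler that silently fell back to a default.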

If you already have an existing Airflow deployment, installing the kubernetes additional package alongside the above configuration should be all you need to upgrade to the KubernetesExecutor*.

*Remote logging, image pulling and deployment into Kubernetes are still to be covered in this blog but the above are the only Airflow config changes.

[1b] Quickstart & Tips ✅

If you are starting your Airflow journey from scratch today, then please see our reading list of other blogs that utilise either the well-supported puckel/docker-airflow project or alternative stable helm charts. They are almost certainly the fastest way to go from zero to Airflow!

Otherwise, we have three tips we would like to share:

Temporary Pods are, by their nature, temporary. In practice, this means you will not be able to $ kubectl your way into the logs after they have been deleted. For short tasks and DAGs, this may be too quick for you to grab the logs. The first piece of config below forces the pods to persist and is great for local dev work 👍

Secondly, logging levels should be heightened for local development, especially when upgrading the executor; this can be achieved by overriding the Airflow config file, as the second and third lines below show.

ENV AIRFLOW__KUBERNETES__DELETE_WORKER_PODS False
ENV AIRFLOW__CORE__LOGGING_LEVEL DEBUG
ENV AIRFLOW__CORE__FAB_LOGGING_LEVEL DEBUG

Finally, Google has since launched Skaffold, a CLI tool that “facilitates continuous development for Kubernetes applications”, which would certainly improve local development. We recommend you take a look!
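The logging override in the tips above can be sanity-checked with plain Python before touching Airflow itself; this snippet just mirrors the env variable locally and is not Airflow code:

```python
import logging
import os

# Mirror the override from the tips above for a quick local check
os.environ["AIRFLOW__CORE__LOGGING_LEVEL"] = "DEBUG"

level_name = os.environ["AIRFLOW__CORE__LOGGING_LEVEL"]
level = getattr(logging, level_name)  # "DEBUG" -> logging.DEBUG (10)
logging.basicConfig(level=level)
logging.getLogger("airflow.local-dev").debug("debug output is now visible")
print(level_name, level)  # DEBUG 10
```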

[1c] Summary 🗒

In summary, the upgrade to the KubernetesExecutor has left us with an Airflow deployment that scales horizontally and is more stable with our larger, more intensive DAGs.

Section [1a] contains all of the environment variables we needed to change — feel free to copy these. If you install a bare-bones version of Airflow then this upgrade will require some additional packages.

Section [1b] contains tips for debugging local development of the KubernetesExecutor and some pointers for useful config variables.

This opening section covers a lot of the heavy lifting, but we still need to cover how to set up remote logging, image pulling and the deployment into Kubernetes. 👀

Suggested reading: