Installing and managing Apache Airflow on an RHEL environment

Airflow is an amazing tool by Airbnb and is the de facto standard for ETL deployments in the data engineering domain nowadays. At the same time, you can also use Airflow to schedule ML pipelines and automate the whole ML workflow (almost).

Apache Airflow

This is my attempt to install and set up a fairly robust Apache Airflow deployment for my needs. I am pretty sure there are better ways of doing it, or enhancements that could be added. Any comments or suggestions are highly appreciated!

Let’s get started!!

The assumptions: a Linux machine (using RHEL, because I have come across very few RHEL examples) with some version of Python and pip installed.

01) First things first: Install security updates.

No matter what Linux you use, please always install the updates — at least the security ones.

yum -y update --security

02) Install Airflow dependencies

These are some dependencies that I found useful for making the pip installs run smoothly without any hiccups.

yum -y install gcc gcc-c++ postgresql-devel libffi-devel

03) Install Postgres database

By default, Airflow comes with a SQLite backend. This is only good if you are testing or prototyping. For a robust Airflow setup, it is very important to have a database backend of either Postgres or MySQL.

Why, you ask?

SQLite doesn’t support concurrent connections, so we are forced to use the SequentialExecutor. As the name suggests, this executor runs only one task instance at a time, which is pretty handy when it comes to testing and debugging our workflow. But it is not advised for a robust production setup, as it can lead to data loss in multiple scenarios.

So having an external database backend enables us to run Airflow with the LocalExecutor (for a single-node setup) or the CeleryExecutor/KubernetesExecutor (for a multi-node cluster setup).

Installing the database on the same machine as Airflow somewhat defeats the purpose, as it creates a single point of failure. But for this article, we will assume that Postgres is installed on the same machine. In my opinion, the ideal setup would be to use an AWS RDS instance (or your favorite cloud's equivalent) as the database backend, or at least to keep the database isolated from the Airflow installation.

I am using Postgres 9.6 here (just because I am more familiar with it, and I don't think I need any Postgres 10/11 features for this setup).



# Add the PostgreSQL 9.6 repository
yum install https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm -y

# Install PostgreSQL 9.6

yum install postgresql96 postgresql96-server postgresql96-contrib postgresql96-libs -y

Initialize the Postgres DB,

/usr/pgsql-9.6/bin/postgresql96-setup initdb

Enable Postgres to start automatically on boot, then start the service.

systemctl enable postgresql-9.6.service

systemctl start postgresql-9.6.service

Verify your Postgres installation/service status

systemctl status postgresql-9.6.service

04) Configure Postgres DB

Based on your requirements and security constraints, there are a couple of tweaks you need to make to the Postgres DB.

pg_hba.conf: This is a configuration file that controls client authentication and decides who can access the database. Based on your environment you need this to be as strict as possible. It is generally found at the location /var/lib/pgsql/9.6/data/pg_hba.conf but can vary.

For Airflow (or you) to access the DB, a couple of tweaks are needed, since we use password-based authentication for Airflow.

You might want to change this line accordingly. Something like this should work,

host all all 127.0.0.1/32 password

Basically, this allows any host within the given CIDR range that has the credentials for our DB to access the database.
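For context, after this change the relevant section of pg_hba.conf might look roughly like the following (the local and IPv6 lines are left at their RHEL defaults here; adjust them to your environment):

# TYPE  DATABASE  USER  ADDRESS        METHOD
local   all       all                  peer
host    all       all   127.0.0.1/32   password
host    all       all   ::1/128        ident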

I am not a DB expert, but please be cautious about what you open up here, as it can leave your database wide open.

Once we make these changes, we have to restart the DB for them to take effect.

systemctl restart postgresql-9.6.service

Additionally, depending on your requirements, you might have to tweak postgresql.conf as well, in particular the listen_addresses setting, which you will have to change if Airflow and Postgres are installed on different machines. By default, it is listen_addresses = 'localhost'.
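For example, if Postgres needs to accept connections from other machines, the relevant line in /var/lib/pgsql/9.6/data/postgresql.conf (path may vary) could be changed to listen on all interfaces:

listen_addresses = '*' # listen on all interfaces instead of only localhost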

05) Create Airflow DB and user

CREATE USER airflow_user PASSWORD 'p@ssw0rd';
CREATE DATABASE airflow_db;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow_user;
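If Postgres lives on the same box, one way to run these statements is through psql as the postgres OS user (connecting to airflow_db for the GRANT so it applies to the right database):

sudo -u postgres psql -c "CREATE USER airflow_user PASSWORD 'p@ssw0rd';"
sudo -u postgres psql -c "CREATE DATABASE airflow_db;"
sudo -u postgres psql -d airflow_db -c "GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow_user;"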

06) Install Airflow

pip install 'apache-airflow[postgres,s3,aws,slack]==1.10.9'

Ideally, you would want to create a conda env or virtualenv and install Airflow inside it. I am continuing under the assumption that we are using the conda base env.
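If you do go the environment route, a minimal sketch might look like this (the env name airflow_env and the Python version are arbitrary choices for illustration):

conda create -n airflow_env python=3.7 -y
conda activate airflow_env
pip install 'apache-airflow[postgres,s3,aws,slack]==1.10.9'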

07) Configure Airflow

Update airflow.cfg so that Airflow uses the Postgres backend. The sql_alchemy_conn setting should point at the newly created Postgres DB (or your AWS RDS connection string).

# The SqlAlchemy connection string to the metadata database.

sql_alchemy_conn = postgresql+psycopg2://airflow_user:p%40ssw0rd@localhost:5432/airflow_db

Note that special characters in the password (like the @ in p@ssw0rd) have to be URL-encoded in the connection string, which is why it appears as %40 above.
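It is also worth confirming that the credentials in the connection string actually work before starting Airflow, for example:

psql -h 127.0.0.1 -U airflow_user -d airflow_db -c 'SELECT 1;'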

The other thing to update is the executor you want to use.

executor = LocalExecutor
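With the LocalExecutor, how many task instances run in parallel is governed by settings in the [core] section of airflow.cfg; the defaults are roughly as below and can be tuned to your hardware:

parallelism = 32
dag_concurrency = 16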

08) Finally, initialize the database and run the scheduler!

airflow initdb

Start the webserver using airflow webserver

Start the scheduler using airflow scheduler
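Putting it together, and if you prefer to daemonize both processes with the -D flag, the sequence looks something like this:

airflow initdb # create the Airflow metadata tables in the Postgres backend
airflow webserver -p 8080 -D # start the web UI on port 8080 as a daemon
airflow scheduler -D # start the scheduler as a daemon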

Please note that Airflow also integrates with systemd-based systems, so you can configure the airflow webserver and scheduler as services on your Linux machine. More on this here.
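As a rough sketch (the user, paths, and AIRFLOW_HOME below are placeholder assumptions; the Airflow repository ships reference unit files you can adapt), a webserver unit might look like this, with a near-identical one for the scheduler:

# /etc/systemd/system/airflow-webserver.service (sketch; adjust user and paths)
[Unit]
Description=Airflow webserver
After=network.target postgresql-9.6.service

[Service]
User=airflow
Environment=AIRFLOW_HOME=/home/airflow/airflow
ExecStart=/usr/local/bin/airflow webserver
Restart=on-failure

[Install]
WantedBy=multi-user.target

After dropping the file in place, run systemctl daemon-reload and then enable and start the service, exactly as we did for Postgres above.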

Thanks for reading! I would love to hear your thoughts or comments. Please do share the article, if you liked it. Check out my other articles here.