Part two — run Airflow

We have just separated our notebooks so they run inside a virtualized environment and can be parametrized. Now let us launch Apache Airflow, enable it to run them, and pass data between tasks properly.

#1. Run docker-compose with Airflow

We will be using the dockerized Apache Airflow image by puckel.

First, download docker-compose-CeleryExecutor.yml from https://github.com/puckel/docker-airflow and rename it to docker-compose.yml.
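For example (assuming the file still sits at the root of the repository's master branch):

curl -o docker-compose.yml https://raw.githubusercontent.com/puckel/docker-airflow/master/docker-compose-CeleryExecutor.yml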

Then create a separate virtualenv (which will be used in the IDE to develop DAGs and not clutter our Jupyter one):

mkvirtualenv airflow_dag
export AIRFLOW_GPL_UNIDECODE=yes
pip install apache-airflow

Mount the ./dags directory inside docker-compose.yml for the scheduler, webserver and worker:

volumes:
    - ./dags:/usr/local/airflow/dags

Then run everything with docker-compose up and add a sample DAG, ./dags/pipeline.py:
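The exact pipeline.py lives in the repository; a minimal sketch of such a DAG (task names and values here are illustrative, not the ones from the repo) could look like this:

# dags/pipeline.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def generate_value(**context):
    # whatever a callable returns is pushed to XCom automatically
    return 42


def use_value(**context):
    # the downstream task pulls it back via xcom_pull
    value = context["ti"].xcom_pull(task_ids="generate_value")
    print("received:", value)


dag = DAG("pipeline", start_date=datetime(2019, 1, 1), schedule_interval=None)

t1 = PythonOperator(task_id="generate_value", python_callable=generate_value,
                    provide_context=True, dag=dag)
t2 = PythonOperator(task_id="use_value", python_callable=use_value,
                    provide_context=True, dag=dag)

t1 >> t2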

Go to http://localhost:8080/admin/ and trigger it.

Should all go well, a (pretty dumb) DAG will run. We have also shown how one could pass results between dependent tasks (the XCom push/pull mechanism). This will be useful later on, but let's leave it for now.

Our scheduling system is ready; our tasks, however, are not. Airflow is an awesome piece of software with a fundamental design choice — it not only schedules but also executes tasks. There is a great article describing the issue.

The article mentioned solves that by running the KubernetesOperator. This is probably one of the best solutions, but also one requiring a fair amount of DevOps work. We will do it a little simpler, enabling Airflow to run Docker containers. This will separate the workers from the actual tasks, as their only job will be spinning up the containers and waiting until they finish.

#2. Mount docker.sock and rewrite launch_docker_container

Airflow must be able to use the docker command (as a result, the workers, dockerized themselves, will launch Docker containers on the Airflow host machine — in this case on the same OS that runs Airflow).

We have to tweak the puckel/airflow image so that, inside it, the airflow user has full permission to use the docker command. Create a Dockerfile extending the base image with the following lines and then build it:

Ensure that --gid 999 matches the id of the host's docker group. If you are on macOS, please proceed anyway, as you will inevitably hit a wall soon — there is no docker group there. We will handle it differently though.

FROM puckel/docker-airflow:1.10.2

USER root
RUN groupadd --gid 999 docker \
    && usermod -aG docker airflow
USER airflow

Then build the image with the tag puckel-airflow-with-docker-inside and, inside docker-compose.yml, replace puckel/docker-airflow:1.10.2 with puckel-airflow-with-docker-inside:latest.
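For reference, the build step could be as simple as this (the tag comes from the text above; the build context is assumed to be the directory holding the new Dockerfile):

docker build -t puckel-airflow-with-docker-inside:latest .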

Next, create requirements.txt containing docker-py and mount it:

volumes:
    - ./requirements.txt:/requirements.txt

Mount the docker socket for the worker:

volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro

Add another task to pipeline.py:

import logging

import docker


def do_test_docker():
    # docker-py client; from_env() picks up DOCKER_HOST / the mounted socket
    client = docker.from_env()
    for image in client.images():
        logging.info(str(image))

and wire it into the DAG:

t1_5 = PythonOperator(
    task_id="test_docker",
    python_callable=do_test_docker
)

# ...

t1 >> t1_5 >> [t2_1, t2_2] >> t3

Running docker-compose up and triggering the DAG should result in a working solution… on Linux. On macOS, however:

# logs of test_docker task
# ...
  File "/usr/local/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/usr/local/airflow/.local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 33, in connect
    sock.connect(self.unix_socket)
PermissionError: [Errno 13] Permission denied

We will use a pretty neat solution by mingheng posted here. Modify docker-compose.yml:
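The exact snippet is in the linked post and in the repo; as a rough sketch, one common shape of this workaround is a small socat proxy service that exposes the socket over TCP, with the worker pointed at it via DOCKER_HOST (the service name and port below are assumptions):

docker-socket-proxy:
    image: bobrik/socat
    command: TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
    volumes:
        - /var/run/docker.sock:/var/run/docker.sock

worker:
    environment:
        - DOCKER_HOST=tcp://docker-socket-proxy:2375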

In the meantime, create another task in the /jupyter/task2/ directory; this time, let it just sleep for 20 seconds. Build the image with the tag task2.

Lastly, rewrite the launch_docker_container method inside launcher.py to actually run the containers:
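The full version is in the repo; a rough sketch, assuming the image name is handed to the callable through op_kwargs and the docker-py low-level client is used, might look like this:

import logging

import docker


def launch_docker_container(image_name, **context):
    # spin up a container from the given image and wait for it to finish
    client = docker.from_env()
    container = client.create_container(image=image_name)
    container_id = container["Id"]
    client.start(container_id)
    logging.info("Started container %s from image %s", container_id, image_name)

    # stream the container output into the Airflow task log
    for line in client.logs(container_id, stream=True, stdout=True, stderr=True):
        logging.info(line.decode("utf-8").rstrip())

    exit_code = client.wait(container_id)
    if exit_code != 0:
        raise RuntimeError("Container exited with code {}".format(exit_code))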

If you run the DAG now and wait until do_task_one and do_task_two run, you can use docker ps to see the Docker containers actually being launched:

This is how it looks in the UI:

You are also able to read the logs directly from Jupyter:

Neat!

(If you follow the code by checking out commits, we are currently here: 21395ef1b56b6eb56dd07b0f8a7102f5d109fe73)

#3. Rewrite task2 to save its result to a tar file

code.ipynb should contain one cell:
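The cell itself is in the repo; roughly, it could look like this (the result keys and the result.json name are assumptions used throughout these sketches):

import json
import tarfile
import time

time.sleep(20)

result = {"message": "task2 finished"}
with open("/tmp/result.json", "w") as f:
    json.dump(result, f)

# pack the result so the launcher can fetch it via the Docker API
with tarfile.open("/tmp/result.tgz", "w:gz") as tar:
    tar.add("/tmp/result.json", arcname="result.json")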

This is pretty basic — we save our result to /tmp/result.tgz and will retrieve it using the Docker API. You could, of course, save the JSON to a database or S3 instead.

#4. Push && pull results automatically

In launcher.py, add some more methods required to push and pull XComs between tasks and to load result.tgz:
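The actual methods are in the repo; a sketch of what they could look like, keeping the result.json-inside-result.tgz layout assumed above:

import json
import tarfile
from io import BytesIO


def _get_result_from_container(client, container_id):
    # copy /tmp/result.tgz out of the finished container via the Docker API
    archive, _ = client.get_archive(container_id, "/tmp/result.tgz")
    outer_bytes = BytesIO(b"".join(archive))
    # the Docker API wraps the file in an (uncompressed) tar of its own
    with tarfile.open(fileobj=outer_bytes) as outer:
        inner_bytes = BytesIO(outer.extractfile("result.tgz").read())
        with tarfile.open(fileobj=inner_bytes, mode="r:gz") as inner:
            return json.loads(inner.extractfile("result.json").read())


def _push_result(context, result):
    # expose the result to downstream tasks via XCom
    context["ti"].xcom_push(key="result", value=result)


def _pull_params(context):
    # merge the results of all upstream tasks into one parameter dict
    upstream_ids = list(context["task"].upstream_task_ids)
    results = context["ti"].xcom_pull(task_ids=upstream_ids, key="result")
    params = {}
    for result in results:
        if result:
            params.update(result)
    return params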

then tweak the launch_docker_container method to use them:
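Again as a sketch, using the hypothetical helpers above:

def launch_docker_container(image_name, **context):
    client = docker.from_env()
    container = client.create_container(image=image_name)
    container_id = container["Id"]
    client.start(container_id)
    client.wait(container_id)

    # fetch /tmp/result.tgz from the finished container and hand it to downstream tasks
    result = _get_result_from_container(client, container_id)
    _push_result(context, result)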

#5. Replace run.sh with run.py and push the params inside the container

Remove run.sh, replacing it with run.py, and change the Dockerfile:

COPY run.py ./notebook/run.py

ENTRYPOINT ["python", "run.py"]

and add run.py:
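run.py is in the repo as well; roughly, it executes the notebook through papermill with whatever parameters the launcher passed in (here assumed to arrive as a JSON string in a PARAMS environment variable):

import json
import os

import papermill as pm

# parameters injected by the launcher (the variable name is an assumption)
params = json.loads(os.environ.get("PARAMS", "{}"))

pm.execute_notebook(
    "code.ipynb",
    "output.ipynb",
    parameters=params,
)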

Push the params inside the container:
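In launcher.py, this boils down to serializing the pulled parameters into the container's environment when it is created (again a sketch, matching the PARAMS assumption above):

params = _pull_params(context)
container = client.create_container(
    image=image_name,
    environment={"PARAMS": json.dumps(params)},
)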

#6. Change tasks so that there is some kind of dependency

Just pass one parameter from one task to another and use it. Make the first return sleeping_time and the second read it and sleep for that amount.
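For instance (a sketch; how the parameter names travel from XCom through PARAMS into papermill depends on the actual launcher code):

# first task's code.ipynb: produce the parameter for the next task
result = {"sleeping_time": 12}
# ... pack into /tmp/result.tgz as in step #3

# second task's code.ipynb: papermill overwrites the parameters cell at runtime
sleeping_time = 0
import time
time.sleep(sleeping_time)
result = {"sleeping_time": sleeping_time}   # re-sent as this task's own result
# ... pack into /tmp/result.tgz again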

Copy-paste (for now) each Dockerfile and run.py and rebuild each container.

We are at 86b0697cf2831c8d2f25f45d5643aef653e30a6e if you want to check it out.

After all those steps, rebuild the images and run the DAG. You should see that the task i_require_data_from_previous_task has indeed correctly received the parameter from generate_data_for_next_task and slept for 12 seconds (and then re-sent the value later as its own result).