Workflow Tools for ML Pipelines

Chapter 5 excerpt of “Data Science in Production”

Airflow is becoming the industry standard for authoring data engineering and model pipeline workflows. This chapter of my book explores the process of moving a simple pipeline that runs on a single EC2 instance to a fully-managed Kubernetes ecosystem responsible for scheduling tasks. This post omits the sections on the fully-managed solutions with GKE and Cloud Composer.

Model pipelines are usually part of a broader data platform that provides data sources, such as lakes and warehouses, and data stores, such as an application database. When building a pipeline, it’s useful to be able to schedule a task to run, ensure that any dependencies for the pipeline have already completed, and to backfill historic data if needed. While it’s possible to perform these types of tasks manually, there are a variety of tools that have been developed to improve the management of data science workflows.

In this chapter, we’ll explore a batch model pipeline that performs a sequence of tasks in order to train and store results for a propensity model. This is a different type of task than the deployments we’ve explored so far, which have focused on serving real-time model predictions as a web endpoint. In a batch process, you perform a set of operations that store model results that are later served by a different application. For example, a batch model pipeline may predict which users in a game are likely to churn, and a game server fetches predictions for each user that starts a session and provides personalized offers.

When building batch model pipelines for production systems, it’s important to make sure that issues with the pipeline are quickly resolved. For example, if the model pipeline is unable to fetch the most recent data for a set of users due to an upstream failure with a database, it’s useful to have a system in place that can send alerts to the team that owns the pipeline and that can rerun portions of the model pipeline in order to resolve any issues with the prerequisite data or model outputs.

Workflow tools provide a solution for managing these types of problems in model pipelines. With a workflow tool, you specify the operations that need to be completed, identify dependencies between the operations, and then schedule the operations to be performed by the tool. A workflow tool is responsible for running tasks, provisioning resources, and monitoring the status of tasks. There are a number of open source tools for building workflows, including Airflow, Luigi, MLflow, and Pentaho Kettle. We'll focus on Airflow, because it is being widely adopted across companies, and cloud platforms are also providing fully-managed versions of Airflow.

In this chapter, we’ll build a batch model pipeline that runs as a Docker container. Next, we’ll schedule the task to run on an EC2 instance using cron, and then explore a managed version of cron using Kubernetes. In the third section, we’ll use Airflow to define a graph of operations to perform in order to run our model pipeline, and explore a cloud offering of Airflow.

5.1 SKLearn Workflow

A common workflow for batch model pipelines is to extract data from a data lake or data warehouse, train a model on historic user behavior, predict future user behavior for more recent data, and then save the results to a data warehouse or application database. In the gaming industry, this is a workflow I’ve seen used for building likelihood to purchase and likelihood to churn models, where the game servers use these predictions to provide different treatments to users based on the model predictions. Usually libraries like sklearn are used to develop models, and languages such as PySpark are used to scale up to the full player base.

It is typical for model pipelines to require other ETLs to run in a data platform before the pipeline can run on the most recent data. For example, there may be an upstream step in the data platform that translates json strings into schematized events that are used as input for a model. In this situation, it might be necessary to rerun the pipeline on a day that issues occurred with the json transformation process. For this section, we’ll avoid this complication by using a static input data source, but the tools that we’ll explore provide the functionality needed to handle these issues.

There are typically two types of batch model pipelines that I’ve seen deployed in the gaming industry:

Persistent: A separate workflow is used to train models from the one used to build predictions. The model is persisted between training runs and loaded in the serving workflow.

Transient: The same workflow is used for training and serving predictions; instead of saving the model as a file, the model is rebuilt for each run.
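The distinction can be sketched in a few lines of Python; the train, persistent_train, persistent_serve, and transient_serve helpers below are hypothetical names used for illustration, not code from the pipeline we'll build:

```python
import joblib
from sklearn.linear_model import LogisticRegression

def train(x, y):
    # fit a simple propensity model (stand-in for a real training step)
    model = LogisticRegression()
    model.fit(x, y)
    return model

# Persistent: a training workflow saves the model to disk, and a
# separate serving workflow loads it between training runs.
def persistent_train(x, y, path="model.joblib"):
    joblib.dump(train(x, y), path)

def persistent_serve(x, path="model.joblib"):
    model = joblib.load(path)
    return model.predict_proba(x)[:, 1]

# Transient: the model is rebuilt inside every serving run, so no
# model file needs to be stored or versioned.
def transient_serve(x_train, y_train, x_test):
    return train(x_train, y_train).predict_proba(x_test)[:, 1]
```

With a heavyweight training step, the persistent approach pays the training cost once per training run rather than on every serving run, at the cost of managing model files.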

In this section we’ll build a transient batch pipeline, where a new model is retrained with each run. This approach generally results in more compute resources being used if the training process is heavyweight, but it helps avoid issues with model drift, which we’ll discuss in Chapter 11. We’ll author a pipeline that performs the following steps:

Fetches a dataset from GitHub
Trains a logistic regression model
Applies the regression model
Saves the results to BigQuery

The pipeline will execute as a single Python script that performs all of these steps. For situations where you want to use intermediate outputs from steps across multiple tasks, it’s useful to decompose the pipeline into multiple processes that are integrated through a workflow tool such as Airflow.

We’ll build this script by first writing a Python script that runs on an EC2 instance, and then Dockerize the script so that we can use the container in workflows. To get started, we need to install a library for writing a Pandas data frame to BigQuery:

pip install --user pandas_gbq

Next, we’ll create a file called pipeline.py that performs the four pipeline steps identified above. The script shown below performs these steps: it loads the necessary libraries, fetches the CSV file from GitHub into a Pandas data frame, splits the data frame into train and test groups to simulate historic and more recent users, builds a logistic regression model using the training data set, creates predictions for the test data set, and saves the resulting data frame to BigQuery.

import pandas as pd
import numpy as np
from google.oauth2 import service_account
from sklearn.linear_model import LogisticRegression
from datetime import datetime
import pandas_gbq

# fetch the data set and add IDs
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/"
                      "master/Recommendations/games-expand.csv")
gamesDF['User_ID'] = gamesDF.index
gamesDF['New_User'] = np.floor(np.random.randint(0, 10,
                                   gamesDF.shape[0])/9)

# train and test groups
train = gamesDF[gamesDF['New_User'] == 0]
x_train = train.iloc[:,0:10]
y_train = train['label']
test = gamesDF[gamesDF['New_User'] == 1]
x_test = test.iloc[:,0:10]

# build a model
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict_proba(x_test)[:, 1]

# build a predictions data frame
resultDF = pd.DataFrame({'User_ID':test['User_ID'], 'Pred':y_pred})
resultDF['time'] = str(datetime.now())

# save predictions to BigQuery
table_id = "dsp_demo.user_scores"
project_id = "gameanalytics-123"
credentials = service_account.Credentials.from_service_account_file(
    'dsdemo.json')
pandas_gbq.to_gbq(resultDF, table_id, project_id=project_id,
    if_exists='replace', credentials=credentials)

To simulate a real-world data set, the script assigns a User_ID attribute to each record, which represents a unique ID to track different users in a system. The script also splits users into historic and recent groups by assigning a New_User attribute. After building predictions for each of the recent users, we create a results data frame with the user ID, the model prediction, and a timestamp. It’s useful to apply timestamps to predictions in order to determine if the pipeline has completed successfully. To test the model pipeline, run the following statements on the command line:

export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/dsdemo.json
python3 pipeline.py

If successful, the script should create a new data set on BigQuery called dsp_demo, create a new table called user_scores, and fill the table with user predictions. To test if data was actually populated in BigQuery, run the following commands in Jupyter:

from google.cloud import bigquery
client = bigquery.Client()
sql = "select * from dsp_demo.user_scores"
client.query(sql).to_dataframe().head()

This script will set up a client for connecting to BigQuery and then display the result set of the query submitted to BigQuery. You can also browse to the BigQuery web UI to inspect the results of the pipeline, as shown in Figure 5.1. We now have a script that can fetch data, apply a machine learning model, and save the results as a single process.

FIGURE 5.1: Querying the uploaded predictions in BigQuery.
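Since each run stamps a time column on the predictions, freshness can also be checked programmatically; below is a minimal sketch, assuming the same dsp_demo.user_scores table and configured GCP credentials (is_fresh and pipeline_is_fresh are hypothetical helper names, not part of the pipeline script):

```python
from datetime import datetime, timedelta

def is_fresh(latest_str, now, max_age_hours=24):
    """Check whether the newest prediction timestamp is recent enough.

    The pipeline writes the time column as str(datetime.now()),
    which formats as e.g. '2019-11-01 10:00:00.123456'.
    """
    latest = datetime.strptime(latest_str, "%Y-%m-%d %H:%M:%S.%f")
    return now - latest < timedelta(hours=max_age_hours)

def pipeline_is_fresh(max_age_hours=24):
    """Query BigQuery for the newest timestamp and check its age."""
    # requires GCP credentials, as with the pipeline script itself
    from google.cloud import bigquery
    client = bigquery.Client()
    sql = "select max(time) as latest from dsp_demo.user_scores"
    latest = client.query(sql).to_dataframe()["latest"][0]
    return is_fresh(latest, datetime.now(), max_age_hours)
```

A check like this can be scheduled alongside the pipeline to alert when an expected run did not complete.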

With many workflow tools, you can run Python code or bash scripts directly, but it’s good to set up isolated environments for executing scripts in order to avoid dependency conflicts for different libraries and runtimes. Luckily, we explored a tool for this in Chapter 4 and can use Docker with workflow tools. It’s useful to wrap Python scripts in Docker for workflow tools, because you can add libraries that may not be installed on the system responsible for scheduling, you can avoid issues with Python version conflicts, and containers are becoming a common way of defining tasks in workflow tools.

To containerize our workflow, we need to define a Dockerfile, as shown below. Since we are building out a new Python environment from scratch, we’ll need to install Pandas, sklearn, and the BigQuery library. We also need to copy credentials from the EC2 instance into the container so that we can run the export command for authenticating with GCP. This works for short term deployments, but for longer running containers it’s better to run the export in the instantiated container rather than copying static credentials into images. The Dockerfile lists out the Python libraries needed to run the script, copies in the local files needed for execution, exports credentials, and specifies the script to run.

FROM ubuntu:latest
MAINTAINER Ben Weber

RUN apt-get update \
  && apt-get install -y python3-pip python3-dev \
  && cd /usr/local/bin \
  && ln -s /usr/bin/python3 python \
  && pip3 install pandas \
  && pip3 install sklearn \
  && pip3 install pandas_gbq

COPY pipeline.py pipeline.py
COPY /home/ec2-user/dsdemo.json dsdemo.json
RUN export GOOGLE_APPLICATION_CREDENTIALS=/dsdemo.json

ENTRYPOINT ["python3","pipeline.py"]

Before deploying this script to production, we need to build an image from the script and test a sample run. The commands below show how to build an image from the Dockerfile, list the Docker images, and run an instance of the model pipeline image.

sudo docker image build -t "sklearn_pipeline" .

sudo docker images

sudo docker run sklearn_pipeline

After running the last command, the containerized pipeline should update the model predictions in BigQuery. We now have a model pipeline that we can run as a single bash command, which we now need to schedule to run at a specific frequency. For testing purposes, we’ll run the script every minute, but in practice models are typically executed hourly, daily, or weekly.

5.2 Cron

A common requirement for model pipelines is running a task at a regular frequency, such as every day or every hour. Cron is a utility that provides scheduling functionality for machines running the Linux operating system. You can set up a scheduled task using the crontab utility and assign a cron expression that defines how frequently to run the command. Cron jobs run directly on the machine where cron is utilized, and can make use of the runtimes and libraries installed on the system.

There are a number of challenges with using cron in production-grade systems, but it’s a great way to get started with scheduling a small number of tasks and it’s good to learn the cron expression syntax that is used in many scheduling systems. The main issue with the cron utility is that it runs on a single machine, and does not natively integrate with tools such as version control. If your machine goes down, then you’ll need to recreate your environment and update your cron table on a new machine.

A cron expression defines how frequently to run a command. It is a sequence of five fields that define when to execute at different time granularities (minute, hour, day of month, month, and day of week), and it can include wildcards to always run for certain time periods. A few sample expressions are shown in the snippet below:

# run every minute
* * * * *

# run at 10:00 UTC every day
0 10 * * *

# run at 04:15 on Saturday
15 4 * * 6

When getting started with cron, it’s good to use tools to validate your expressions. Cron expressions are used in Airflow and many other scheduling systems.
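As a rough illustration of the syntax, a minimal validator for the basic five-field form might look like the sketch below. It only handles wildcards and plain numbers; real schedulers also accept steps (*/5), ranges (1-5), and lists (1,3,5), which this sketch skips:

```python
# allowed numeric ranges for minute, hour, day of month, month,
# and day of week (0 or 7 both mean Sunday in most cron variants)
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]

def is_valid_cron(expression):
    """Validate a simple five-field cron expression."""
    fields = expression.split()
    if len(fields) != 5:
        return False
    for field, (low, high) in zip(fields, FIELD_RANGES):
        if field == "*":
            continue
        if not field.isdigit() or not low <= int(field) <= high:
            return False
    return True
```

For example, is_valid_cron("0 10 * * *") accepts the daily 10:00 UTC schedule above, while a minute field of 60 is rejected.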

We can use cron to schedule our model pipeline to run on a regular frequency. To schedule a command to run, run the following command on the console:

crontab -e

This command will open up the cron table file for editing in vi. To schedule the pipeline to run every minute, add the following commands to the file and save.

# run every minute

* * * * * sudo docker run sklearn_pipeline

After exiting the editor, the cron table will be updated with the new command to run. The second part of the cron statement is the command to run; when defining the command, it’s useful to include full file paths. With Docker, we just need to define the image to run. To check that the script is actually executing, browse to the BigQuery UI and check the time column on the user_scores model output table.

We now have a utility for scheduling our model pipeline on a regular schedule. However, if the machine goes down then our pipeline will fail to execute. To handle this situation, it’s good to explore cloud offerings with cron scheduling capabilities.

5.3 Workflow Tools

Cron is useful for simple pipelines, but runs into challenges when tasks have dependencies on other tasks which can fail. To help resolve this issue, where tasks have dependencies and only portions of a pipeline need to be rerun, we can leverage workflow tools. Apache Airflow is currently the most popular tool, but other open source projects are available and provide similar functionality including Luigi and MLflow.

There are a few situations where workflow tools provide benefits over using cron directly:

Dependencies: Workflow tools define graphs of operations, which makes dependencies explicit.

Backfills: It may be necessary to run an ETL on old data, for a range of different dates.

Versioning: Most workflow tools integrate with version control systems to manage graphs.

Alerting: These tools can send out emails or generate PagerDuty alerts when failures occur.

Workflow tools are particularly useful in environments where different teams are scheduling tasks. For example, many game companies have data scientists that schedule model pipelines which are dependent on ETLs scheduled by a separate engineering team.

In this section, we’ll schedule our task to run on an EC2 instance using hosted Airflow, and then explore a fully-managed version of Airflow on GCP.

5.3.1 Apache Airflow

Airflow is an open source workflow tool that was originally developed by Airbnb and publicly released in 2015. It helps solve a challenge that many companies face, which is scheduling tasks that have many dependencies. One of the core concepts in this tool is a graph that defines the tasks to perform and the relationships between these tasks.

In Airflow, a graph is referred to as a DAG, which is an acronym for directed acyclic graph. A DAG is a set of tasks to perform, where each task has zero or more upstream dependencies. One of the constraints is that cycles are not allowed, where two tasks have upstream dependencies on each other.
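Airflow performs this cycle check internally, but the idea can be illustrated with a short sketch using Kahn's topological sort, where edges are (upstream, downstream) task pairs (has_cycle is a hypothetical helper for illustration, not an Airflow API):

```python
from collections import defaultdict, deque

def has_cycle(edges):
    """Detect cycles in a task graph given (upstream, downstream) pairs.

    Kahn's algorithm repeatedly removes tasks with no remaining
    upstream dependencies; if any tasks are left unvisited, the
    graph contains a cycle and is not a valid DAG.
    """
    indegree = defaultdict(int)
    downstream = defaultdict(list)
    tasks = set()
    for upstream, task in edges:
        downstream[upstream].append(task)
        indegree[task] += 1
        tasks.update([upstream, task])
    ready = deque(t for t in tasks if indegree[t] == 0)
    visited = 0
    while ready:
        task = ready.popleft()
        visited += 1
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return visited != len(tasks)
```

For instance, an ETL task feeding a pipeline task is a valid DAG, while two tasks that each depend on the other would be rejected.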

DAGs are set up using Python code, which is one of the differences from other workflow tools such as Pentaho Kettle which is GUI focused. The Airflow approach is called “configuration as code”, because a Python script defines the operations to perform within a workflow graph. Using code instead of a GUI to configure workflows is useful because it makes it much easier to integrate with version control tools such as GitHub.

To get started with Airflow, we need to install the library, initialize the service, and run the scheduler. To perform these steps, run the following commands on an EC2 instance or your local machine:

export AIRFLOW_HOME=~/airflow

pip install --user apache-airflow

airflow initdb

airflow scheduler

Airflow also provides a web frontend for managing DAGs that have been scheduled. To start this service, run the following command in a new terminal on the same machine.

airflow webserver -p 8080

This command tells Airflow to start the web service on port 8080. You can open a web browser at this port on your machine to view the web frontend for Airflow, as shown in Figure 5.3.

FIGURE 5.3: The Airflow web app running on an EC2 instance.

Airflow comes preloaded with a number of example DAGs. For our model pipeline we’ll create a new DAG and then notify Airflow of the update. We’ll create a file called sklearn.py with the following DAG definition:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'email': 'bgweber@gmail.com',
    'start_date': datetime(2019, 11, 1),
    'email_on_failure': True,
}

dag = DAG('games', default_args=default_args,
          schedule_interval="* * * * *")

t1 = BashOperator(
    task_id='sklearn_pipeline',
    bash_command='sudo docker run sklearn_pipeline',
    dag=dag)

There are a few steps in this Python script to call out. The script uses a Bash operator to define the action to perform; the operator is defined as the last step in the script and specifies the command to run. The DAG is instantiated with a number of input arguments that define the workflow settings, such as who to email when the task fails. A cron expression is passed to the DAG object to define the schedule for the task, and the DAG object is passed to the Bash operator to associate the task with this graph of operations.

Before adding the DAG to Airflow, it’s useful to check for syntax errors in your code. We can run the following command from the terminal to check for issues with the DAG:

python3 sklearn.py

This command will not run the DAG, but will flag any syntax errors present in the script. To update Airflow with the new DAG file, run the following command:

airflow list_dags

-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
games

This command will add the DAG to the list of workflows in Airflow. To view the list of DAGs, navigate to the Airflow web server, as shown in Figure 5.4. The web server will show the schedule of the DAG, and provide a history of past runs of the workflow. To check that the DAG is actually working, browse to the BigQuery UI and check for fresh model outputs.

FIGURE 5.4: The sklearn DAG scheduled on Airflow.

We now have an Airflow service up and running that we can use to monitor the execution of our workflows. This setup enables us to track the execution of workflows, backfill any gaps in data sets, and enable alerting for critical workflows.

Airflow supports a variety of operators, and many companies author custom operators for internal usage. In our first DAG, we used the Bash operator to define the task to execute, but other options are available for running Docker images, including the Docker operator. The code snippet below shows how to change our DAG to use the Docker operator instead of the Bash operator.

from airflow.operators.docker_operator import DockerOperator

t1 = DockerOperator(
    task_id='sklearn_pipeline',
    image='sklearn_pipeline',
    dag=dag)

The DAG we defined does not have any dependencies, since the container performs all of the steps in the model pipeline. If we had a dependency, such as running a sklearn_etl container before running the model pipeline, we can use the set_upstream command as shown below. This configuration sets up two tasks, where the pipeline task will execute after the etl task completes.

t1 = BashOperator(
    task_id='sklearn_etl',
    bash_command='sudo docker run sklearn_etl',
    dag=dag)

t2 = BashOperator(
    task_id='sklearn_pipeline',
    bash_command='sudo docker run sklearn_pipeline',
    dag=dag)

t2.set_upstream(t1)

Airflow provides a rich set of functionality and we’ve only touched the surface of what the tool provides. While we were already able to schedule the model pipeline with hosted and managed cloud offerings, it’s useful to schedule the task through Airflow for improved monitoring and versioning. The landscape of workflow tools will change over time, but many of the concepts of Airflow will translate to these new tools.

5.4 Conclusion

In this chapter we explored a batch model pipeline for applying a machine learning model to a set of users and storing the results to BigQuery. To make the pipeline portable, so that we can execute it in different environments, we created a Docker image to define the required libraries and credentials for the pipeline. We then ran the pipeline on an EC2 instance using batch commands, cron, and Airflow. We also used GKE and Cloud Composer to run the container via Kubernetes.

Workflow tools can be tedious to set up, especially when installing a cluster deployment, but they provide a number of benefits over manual approaches. One of the key benefits is the ability to handle DAG configuration as code, which enables code reviews and version control for workflows. It’s useful to get experience with configuration as code, because it is an introduction to another concept called “infra as code” that we’ll explore in Chapter 10.