Scaling scikit-learn with Apache Beam

Chapter 7 excerpt from “Data Science in Production”

Apache Beam is an open-source project that enables data scientists to author machine learning pipelines that can scale to massive data sets. This chapter of my book focuses on building batch model pipelines with Beam that can run on a cluster using Cloud Dataflow. This excerpt shows how to perform distributed model application using BigQuery as a data source and sink, and the full source code for this chapter is available on GitHub. Details on setting up the JSON credentials file are available in the GCP docs and my book sample.

Dataflow is a tool for building data pipelines that can run locally, or scale up to large clusters in a managed environment. While Cloud Dataflow was initially incubated at Google as a GCP-specific tool, it now builds upon the open-source Apache Beam library, making it usable in other cloud environments. The tool provides input connectors to different data sources, such as BigQuery and files on Cloud Storage, operators for transforming and aggregating data, and output connectors to systems such as Cloud Datastore and BigQuery.

In this chapter, we’ll build a pipeline with Dataflow that reads in data from BigQuery, applies a sklearn model to create predictions, and then writes the predictions to BigQuery and Cloud Datastore. We’ll start by running the pipeline locally on a subset of data and then scale up to a larger data set using GCP.

Dataflow is designed to enable highly scalable data pipelines, such as performing ETL work where you need to move data between different systems in your cloud deployment. It’s also been extended to work well for building ML pipelines, and there’s built-in support for TensorFlow and other machine learning methods. The result is that Dataflow enables data scientists to build large-scale pipelines without needing the support of an engineering team to scale things up for production.

The core component in Dataflow is a pipeline, which defines the operations to perform as part of a workflow. A workflow in Dataflow is a DAG that includes data sources, data sinks, and data transformations. Here are some of the key components:

Pipeline: Defines the set of operations to perform as part of a job.

Collection: The interface between different stages in a workflow. The input to any step in a workflow is a collection of objects and the output is a new collection of objects.

DoFn: An operation to perform on each element in a collection, resulting in a new collection.

Transform: An operation to perform on sets of elements in a collection, such as an aggregation.

Dataflow works with multiple languages, but we’ll focus on the Python implementation for this book. There are some caveats with the Python version, because worker nodes may need to compile libraries from source, but it does provide a good introduction to the different components in Apache Beam. To create a workflow with Beam, you use the pipe syntax in Python to chain different steps together. The result is a DAG of operations to perform that can be distributed across machines in a cluster.
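As a minimal sketch of this syntax, the snippet below chains a few steps together with the pipe operator; the step labels and values are arbitrary and chosen just for illustration. Running it with the Direct Runner prints the squared values to the console.

import apache_beam as beam

# each step has a string label and is chained with `|` to form the DAG
with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([1, 2, 3])
     | 'Square' >> beam.Map(lambda x: x * x)
     | 'Print' >> beam.Map(print))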

The two ways of transforming data in a Dataflow pipeline are DoFn and Transform steps. A DoFn step defines an operation to perform on each object in a collection. For example, we’ll query the Natality public data set and the resulting collection will contain dictionary objects. We’ll define a DoFn operation that uses sklearn to create a prediction for each of these dictionary objects and output a new dictionary object. A Transform defines an operation to perform on a set of objects, such as performing feature generation to aggregate raw tracking events into user-level summaries. These types of operations are typically used in combination with a partition transform step to divide up a collection of objects into a manageable size. We won’t explore this process in this book, but a transform could be used to apply Featuretools to perform automated feature engineering as part of a Dataflow pipeline.
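As a rough sketch of a grouping transform, the hypothetical example below aggregates raw (user, event) pairs into user-level event counts; the data and step names are illustrative only, not taken from the chapter's pipeline.

import apache_beam as beam

# aggregate raw tracking events into user-level summaries (illustrative)
events = [('u1', 'click'), ('u1', 'view'), ('u2', 'click')]

with beam.Pipeline() as p:
    (p
     | 'Events' >> beam.Create(events)
     | 'Group' >> beam.GroupByKey()                     # ('u1', ['click', 'view'])
     | 'Summarize' >> beam.Map(lambda kv: (kv[0], len(list(kv[1]))))
     | 'Print' >> beam.Map(print))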

In this chapter we’ll get hands-on with building Dataflow pipelines that can run locally and in a fully-managed GCP cluster. We’ll start by building a simple pipeline that works with text data, and then build a pipeline that applies a sklearn model in a distributed workflow.

Apache Beam is an open-source library for building data processing workflows using Java, Python, and Go. Beam workflows can be executed across several execution engines, including Spark, Dataflow, and MapReduce. With Beam, you can test workflows locally using the Direct Runner, and then deploy the workflow in GCP using the Dataflow Runner. Beam pipelines can be batch, where a workflow is executed until it is completed, or streaming, where the pipeline runs continuously and operations are performed in near real-time as data is received. We’ll focus on batch pipelines in this chapter and cover streaming pipelines in the next chapter.
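To make the runner distinction concrete, the sketch below shows how the same pipeline definition can target either runner through its options; the project and bucket names are placeholders, not values from this chapter.

from apache_beam.options.pipeline_options import PipelineOptions

# test locally with the Direct Runner
local_options = PipelineOptions(['--runner=DirectRunner'])

# deploy to GCP with the Dataflow Runner (placeholder project and bucket)
cloud_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=your_project_name',
    '--temp_location=gs://your_bucket/tmp/',
])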

The first thing we’ll need to do in order to get up and running is install the Apache Beam library. Run the commands shown below from the command line to install the library and run a test pipeline locally. The pip command includes the [gcp] extra to specify that the Dataflow modules should also be installed. If the last step is successful, the pipeline will output the word counts for Shakespeare’s King Lear.

# install Apache Beam
pip install --user apache-beam[gcp]

# run the word count example
python3 -m apache_beam.examples.wordcount --output outputs

The example pipeline performs a number of different steps to implement this counting logic. First, the pipeline reads in the play as a collection of string objects, where each line from the play is a string. Next, the pipeline splits each line into a collection of words, which are then passed to map and group transforms that count the occurrence of each word. The map and group operations are built-in Beam transform operations. The last step is writing the collection of word counts to the console.
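A simplified sketch of these steps is shown below, assuming a local text file as input rather than the data set the bundled example downloads; the file path is a placeholder.

import apache_beam as beam

# read, split, count, and print word counts (illustrative sketch)
with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('king_lear.txt')
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'Pair' >> beam.Map(lambda word: (word, 1))
     | 'Count' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
     | 'Print' >> beam.Map(print))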

7.2 Batch Model Pipeline

Cloud Dataflow provides a useful framework for scaling up sklearn models to massive data sets. Instead of loading all of the input data into a single dataframe, we can evaluate each record individually in the process function and use Apache Beam to stream these outputs to a data sink, such as BigQuery. As long as we have a way of distributing our model across the worker nodes, we can use Dataflow to perform distributed model application. This can be achieved by passing model objects as side inputs to operators or by reading the model from persistent storage such as Cloud Storage. In this section we’ll first train a linear regression model using a Jupyter environment, and then store the resulting model on Cloud Storage so that we can run the model on a large data set and save the predictions to BigQuery and Cloud Datastore.
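Before walking through the Cloud Storage approach used in this chapter, here is a rough sketch of the side-input alternative mentioned above, where a fitted model object is broadcast to the workers as a singleton side input; the toy model and data are purely illustrative.

import apache_beam as beam
from sklearn.linear_model import LinearRegression

# fit a toy model to broadcast to workers (illustrative data)
model = LinearRegression().fit([[1.0], [2.0]], [1.0, 2.0])

def apply_model(element, m):
    # element is a single feature value in this toy example
    return {'x': element, 'pred': float(m.predict([[element]])[0])}

with beam.Pipeline() as p:
    # wrap the model in a PCollection and pass it as a singleton side input
    model_side = beam.pvalue.AsSingleton(
        p | 'Model' >> beam.Create([model]))
    (p
     | 'Data' >> beam.Create([3.0, 4.0])
     | 'Apply' >> beam.Map(apply_model, m=model_side)
     | 'Print' >> beam.Map(print))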

7.2.1 Model Training

The modeling task that we’ll be performing is predicting the birth weight of a child given a number of factors, using the Natality public data set. To build a model with sklearn, we can sample the data set before loading it into a Pandas dataframe and fitting the model. The code snippet below shows how to sample the data set from a Jupyter notebook and visualize a subset of records, as shown in Figure 1.3.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT year, plurality, apgar_5min,
    mother_age, father_age,
    gestation_weeks, ever_born
    ,case when mother_married = true
        then 1 else 0 end as mother_married
    ,weight_pounds as weight
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 10000
"""

natalityDF = client.query(sql).to_dataframe().fillna(0)
natalityDF.head()

FIGURE 1.3: The sampled Natality data set for training.

Once we have the data to train on, we can use the LinearRegression class in sklearn to fit a model. We’ll use the full sampled data set for fitting, since the holdout data is the rest of the data set that was not included in the sample. Once trained, we can use pickle to serialize the model and save it to disk. The last step is to move the model file from local storage to Cloud Storage, as shown below. We now have a trained model that can be used as part of a distributed model application workflow.

from sklearn.linear_model import LinearRegression
import pickle
from google.cloud import storage

# fit and pickle a model
model = LinearRegression()
model.fit(natalityDF.iloc[:, 1:8], natalityDF['weight'])
pickle.dump(model, open("natality.pkl", 'wb'))

# save to GCS
bucket = storage.Client().get_bucket('dsp_model_store')
blob = bucket.blob('natality/sklearn-linear')
blob.upload_from_filename('natality.pkl')

7.2.2 BigQuery Publish

We’ll start by building a Beam pipeline that reads in data from BigQuery, applies a model, and then writes the results to BigQuery. In the next section, we’ll add Cloud Datastore as an additional data sink for the pipeline. This pipeline will be a bit more complex than the prior example, because we need to use multiple Python modules in the process function, which requires a bit more setup.

We’ll walk through the different parts of the pipeline this time, to provide additional details about each step. The first task is to import the libraries needed to build and execute the pipeline. We also import the json module, because we need it to create the schema object that specifies the structure of the output BigQuery table. As in the previous section, we are still sampling the data set to make sure our pipeline works before ramping up to the complete data set. Once we’re confident in our pipeline, we can remove the limit command and autoscale a cluster to complete the workload.

import apache_beam as beam
import argparse
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.io.gcp.bigquery import parse_table_schema_from_json
import json

query = """
SELECT year, plurality, apgar_5min,
    mother_age, father_age,
    gestation_weeks, ever_born
    ,case when mother_married = true
        then 1 else 0 end as mother_married
    ,weight_pounds as weight
    ,current_timestamp as time
    ,GENERATE_UUID() as guid
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 100
"""

Next, we’ll define a DoFn class that implements the process function and applies the sklearn model to individual records in the Natality data set. One of the changes from before is that we now have an __init__ function, which we use to instantiate a set of fields. In order to have references to the modules that we need to use in the process function, we need to assign these as fields in the class; otherwise, the references will be undefined when running the function on distributed worker nodes. For example, we use self._pd to refer to the Pandas module instead of pd. For the model, we’ll use lazy initialization to fetch the model from Cloud Storage once it’s needed. While it’s possible to implement the setup function defined by the DoFn interface to load the model, there are limitations on which runners call this function.

class ApplyDoFn(beam.DoFn):

    def __init__(self):
        self._model = None
        # assign modules as fields so they are defined on worker nodes
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        # lazily load the model from Cloud Storage on first use
        if self._model is None:
            bucket = self._storage.Client().get_bucket(
                'dsp_model_store')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())

        # convert the input record to a dataframe and apply the model
        new_x = self._pd.DataFrame.from_dict(element,
            orient="index").transpose().fillna(0)
        weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'], 'weight': weight,
                 'time': str(element['time'])}]

Once the model object has been lazily loaded in the process function, it can be used to apply the linear regression model to the input record. In Dataflow, records retrieved from BigQuery are returned as a collection of dictionary objects and our process function is responsible for operating on each of these dictionaries independently. We first convert the dictionary to a Pandas dataframe and then pass it to the model to get a predicted weight. The process function returns a list of dictionary objects, which describe the results to write to BigQuery. A list is returned instead of a dictionary, because a process function in Beam can return zero, one, or multiple objects.
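As an illustrative aside, the hypothetical DoFn below emits a variable number of outputs per input element, which is why Beam expects a list or generator rather than a single object; the class name and logic are placeholders, not part of the chapter's pipeline.

import apache_beam as beam

class SplitWords(beam.DoFn):
    def process(self, element):
        # yields zero or more outputs for each input element
        for word in element.split():
            yield word

# used as: words = lines | 'Split' >> beam.ParDo(SplitWords())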

An example element object passed to the process function is shown in the listing below. The object is a dictionary, where the keys are the column names of the query result and the values are the record values.

{'year': 2001, 'plurality': 1, 'apgar_5min': 99, 'mother_age': 33,
 'father_age': 40, 'gestation_weeks': 38, 'ever_born': 8,
 'mother_married': 1, 'weight': 6.8122838958,
 'time': '2019-12-14 23:51:42.560931 UTC',
 'guid': 'b281c5e8-85b2-4cbd-a2d8-e501ca816363'}

To save the predictions to BigQuery, we need to define a schema that specifies the structure of the predictions table. We can do this using a utility function that converts a JSON description of the table schema into the schema object required by the Beam BigQuery writer. To simplify the process, I first created a Python dictionary object and used the json.dumps function to generate the JSON.

schema = parse_table_schema_from_json(json.dumps({'fields':
    [{'name': 'guid', 'type': 'STRING'},
     {'name': 'weight', 'type': 'FLOAT64'},
     {'name': 'time', 'type': 'STRING'}]}))

The next step is to create the pipeline and define a DAG of Beam operations. This time we are not providing input or output arguments to the pipeline, and instead we are passing the input and output destinations to the BigQuery operators. The pipeline has three steps: read from BigQuery, apply the model, and write to BigQuery. To read from BigQuery, we pass in the query and specify that we are using standard SQL. To apply the model, we use our custom class for building predictions. To write the results, we pass the schema and table name to the BigQuery writer, and specify that a new table should be created if necessary and that data should be appended to the table if data already exists.

# set up pipeline options
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)

# define the pipeline steps
p = beam.Pipeline(options=pipeline_options)

data = p | 'Read from BigQuery' >> beam.io.Read(
    beam.io.BigQuerySource(query=query, use_standard_sql=True))

scored = data | 'Apply Model' >> beam.ParDo(ApplyDoFn())

scored | 'Save to BigQuery' >> beam.io.Write(beam.io.BigQuerySink(
    'weight_preds', 'dsp_demo', schema=schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

The last step in the script is running the pipeline. While it is possible to run this complete code listing from Jupyter, the pipeline will not be able to complete because the project parameter needs to be passed as a command line argument to the pipeline.

# run the pipeline
result = p.run()
result.wait_until_finish()

FIGURE 1.4: The Natality predictions table on BigQuery.

Before running the pipeline on Dataflow, it’s a best practice to run the pipeline locally with a subset of data. In order to run the pipeline locally, it’s necessary to specify the GCP project as a command line argument, as shown below. The project parameter is needed to read and write data with BigQuery. After running the pipeline, you can validate that the workflow was successful by navigating to the BigQuery UI and checking for data in the destination table, as shown in Figure 1.4.

To run the pipeline on Cloud Dataflow, we need to pass a parameter that identifies the Dataflow Runner as the execution engine. We also need to pass the project name and a staging location on Cloud Storage. We now pass in a requirements file that identifies the google-cloud-storage library as a dependency, and set a cluster size limit using the max workers parameter. Once submitted, you can view the progress of the job by navigating to the Dataflow UI in the GCP console, as shown in Figure 1.5.

# running locally
python3 apply.py --project your_project_name

# running on GCP
echo $'google-cloud-storage==1.19.0' > reqs.txt
python3 apply.py \
    --runner DataflowRunner \
    --project your_project_name \
    --temp_location gs://dsp_model_store/tmp/ \
    --requirements_file reqs.txt \
    --max_num_workers 5

FIGURE 1.5: Running the managed pipeline with autoscaling.

We can now remove the limit command from the query in the pipeline and scale the workload to the full data set. When running the full-scale pipeline, it’s useful to keep an eye on the job to make sure that the cluster size does not scale beyond expectations. Setting the maximum worker count helps avoid issues, but if you forget to set this parameter then the cluster size can quickly scale up and result in a costly pipeline run.

One of the potential issues with using Python for Dataflow pipelines is that it can take a while to initialize a cluster, because each worker node installs the required libraries for the job from source, which can take a significant amount of time for libraries such as Pandas. To avoid lengthy startup delays, it’s helpful to avoid including libraries in the requirements file that are already included in the Dataflow SDK. For example, Pandas 0.24.2 is included with SDK version 2.16.0, which is a recent enough version for this pipeline.

One of the useful aspects of Cloud Dataflow is that it is fully managed, which means that it handles provisioning hardware, recovers from failures when issues occur, and can autoscale to match demand. Apache Beam is a great framework for data scientists, because it enables using the same tool for local testing and cloud deployments.