The schema of the data is currently as shown below.

name               type    mode
=================================
battery_status     STRING  REQUIRED
bluetooth_status   STRING  REQUIRED
cell_id            STRING  REQUIRED
cell_strength      STRING  REQUIRED
gps_status         STRING  REQUIRED
last_app           STRING  REQUIRED
location_gps       STRING  REQUIRED
location_net       STRING  REQUIRED
location_accuracy  STRING  REQUIRED
altitude           STRING  REQUIRED
speed              STRING  REQUIRED
location_seconds   STRING  REQUIRED
timestamp          STRING  REQUIRED

OK, let’s clean that up so we can work with it! While I could do this with SQL, let’s assume we are in an enterprise environment and want a solid data preparation pipeline defined. The fact that I only have about 34,000 rows also suggests that a simple Python script would probably suffice. But let’s do as the Germans say and “shoot at sparrows with cannons”.

DataFlow Pipeline

To get set up, I first create a folder in which to keep the source code. Creating a virtualenv and installing apache-beam[gcp] with pip does the trick for preparation. Looking at the example code, I come up with the following pipeline code. Let’s first look at the “boilerplate” part:
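The original gist (which the line numbers mentioned below refer to) isn’t embedded here, so the following is only a minimal sketch of what that boilerplate can look like with a recent apache-beam[gcp] release. The project, bucket and table names as well as the schema entries are placeholders for illustration, not the ones from the actual pipeline.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions)

# Pipeline configuration. All names below are placeholders.
options = PipelineOptions()
gcloud = options.view_as(GoogleCloudOptions)
gcloud.project = 'my-project'
gcloud.job_name = 'cleanup-pipeline'
gcloud.staging_location = 'gs://my-bucket/staging'
gcloud.temp_location = 'gs://my-bucket/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

# Source and sink tables in BigQuery (placeholder names).
SOURCE_TABLE = 'my-project:tracking.raw_points'
TARGET_TABLE = 'my-project:tracking.clean_points'

# Target schema: note the wrapping object with a "fields" key.
TARGET_SCHEMA = {
    'fields': [
        {'name': 'battery_status', 'type': 'INTEGER', 'mode': 'NULLABLE'},
        {'name': 'speed', 'type': 'FLOAT', 'mode': 'NULLABLE'},
        {'name': 'timestamp', 'type': 'TIMESTAMP', 'mode': 'REQUIRED'},
        # ... one entry per column of the cleaned table ...
    ]
}
```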

There are several things to learn here:

- The imports are exclusively from the apache-beam package. It may be called DataFlow by Google, but from a code perspective it is a 100% Apache project.
- All configuration happens using the PipelineOptions (docs) object in lines 14–21.
- Lines 23 to the end define the BigQuery project/table structure, the target schema of the new table, and the source and sinks.
- When defining the schema as JSON (line 50), be sure to have an object with the key fields containing the array of fields. This goes against the docs, but code doesn’t lie.

Now let’s look at the actual processing bit.

The first few lines describe the high-level pipeline. Very simple in our case: read, clean, write. The clean step is performed by a dedicated DoFn, a Python object that gets sent to every worker that is spun up. Lines 14–65 describe this unit of code. It took me a while to figure out how to write this without it failing due to “foo not found” errors in DataFlow. Basically, RTFM applies: the “function object must be serializable”.
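Since that gist isn’t reproduced here either, below is a rough sketch of the read, clean, write part, reusing the placeholder names from the boilerplate sketch above; CleanRowFn is a hypothetical DoFn, sketched after the next paragraph. This uses a recent Beam release (older releases used BigQuerySource/BigQuerySink instead).

```python
def run():
    # Read the raw table, clean each row and write the result to the new table.
    with beam.Pipeline(options=options) as p:
        (p
         | 'read' >> beam.io.ReadFromBigQuery(table=SOURCE_TABLE)
         | 'clean' >> beam.ParDo(CleanRowFn())
         | 'write' >> beam.io.WriteToBigQuery(
             TARGET_TABLE,
             schema=TARGET_SCHEMA,
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```

Because the DoFn instance is pickled and shipped to the workers, it has to live at module level and must not close over anything that cannot be serialized.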

All of the code between lines 21–65 is my domain-specific transform code. I basically define a set of functions to apply to each of the columns in a row. As the docs mention, each row is a Python dictionary with the column names as the keys.
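The concrete cleaning rules are specific to my data, so the DoFn below is only an illustrative sketch: the helper functions and the choice of which columns become integers or floats are made up, but it shows the shape of a serializable DoFn working on row dictionaries.

```python
class CleanRowFn(beam.DoFn):
    """Cleans one raw row (a dict keyed by column name) at a time."""

    @staticmethod
    def _to_int(value):
        # Illustrative rule: empty strings become None, everything else an int.
        return int(value) if value not in ('', None) else None

    @staticmethod
    def _to_float(value):
        return float(value) if value not in ('', None) else None

    def process(self, row):
        yield {
            'battery_status': self._to_int(row.get('battery_status')),
            'speed': self._to_float(row.get('speed')),
            'timestamp': row.get('timestamp'),  # kept as a string here; parse if needed
            # ... handle the remaining columns with matching helpers ...
        }
```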

Testing the code

The above code is easily unit-testable. Searching the docs for testing instructions didn’t turn up much, but the wordcount example mentions logging and testing.

For a simple “beam-like test”, see the gist below. It makes use of the TestPipeline class provided by Beam, as well as the assert_that and equal_to helpers.
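The gist itself isn’t embedded here; a minimal test along those lines might look like the sketch below, using the hypothetical CleanRowFn from above and made-up input values.

```python
import unittest

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

# In a real test file, CleanRowFn would be imported from the pipeline module.


class CleanRowFnTest(unittest.TestCase):

    def test_clean_row(self):
        # Made-up raw row and the cleaned row we expect back.
        raw = [{'battery_status': '87', 'speed': '1.5',
                'timestamp': '2021-01-01 12:00:00'}]
        expected = [{'battery_status': 87, 'speed': 1.5,
                     'timestamp': '2021-01-01 12:00:00'}]

        with TestPipeline() as p:
            cleaned = (p
                       | beam.Create(raw)
                       | beam.ParDo(CleanRowFn()))
            assert_that(cleaned, equal_to(expected))


if __name__ == '__main__':
    unittest.main()
```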

Running it in the cloud

Running this is as simple as calling python cleanup_pipeline.py. It then outputs a bunch of info on the console, but you can also check on your job in the GCP web interface.