A section with lots of buzzwords

A common pattern that we like to use for our batch ingestion pipelines into BigQuery, and which is completely serverless, goes like this:

A neat little serverless conga line for ingesting files into BigQuery

A file is uploaded to a bucket in GCS, which has a simple Cloud Function listening to it. The Cloud Function is triggered and executes a templated Dataflow pipeline. The Dataflow pipeline processes the file(s), does some transformations etc., and then writes the results to BigQuery. The data is now in BigQuery, and everyone is happy. High-fives, people!

Yes, I know: you could have the Cloud Function trigger a BigQuery load job directly and skip the Dataflow step altogether. This would be quicker, and it would save you some dollarydoos too, because you wouldn’t have to pay for Dataflow.

“So, why wouldn’t you do it this way then, you fool!”, I hear you cry.

Well, there are a few reasons why not:

- All BigQuery load jobs are async, so you’ll need to poll for the load job status from your Cloud Function — yuk!
- If the load job fails due to some transient error, e.g. a network issue, you need to have remembered to include exponential back-off-and-retry logic in your Cloud Function.
- Cloud Functions have a max execution time of 540 seconds, otherwise known as 9 minutes. If you hit that max duration and the load job hasn’t finished yet — which is highly likely when loading very large files into a multi-tenanted platform like BigQuery — then things will get ugly real quickly.
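To see why the second point is a pain, here’s the sort of back-off helper you’d end up hand-rolling inside your Cloud Function. This is just a sketch of the idea (the function name and defaults are mine, not from any library):

```javascript
// Sketch: exponential back-off delay you'd have to hand-roll when
// polling a BigQuery load job from a Cloud Function. Illustrative only.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  // Delay doubles with each attempt (1s, 2s, 4s, ...) up to a cap.
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

And remember, all that polling and retrying has to fit inside the 9-minute limit — which is exactly the bookkeeping Dataflow takes off your hands.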

By using Dataflow to process the file instead, you don’t have to worry about any of this. For example, it has exponential back-off-and-retry already baked in, and there’s no max execution time either.

Of course, the Dataflow pipeline could fail too, but just hook it up to Stackdriver monitoring/alerting and Bob’s your uncle. Whenever a pipeline fails, annoy the team with pesky automated emails or Slackbot notifications.

It worked on my machine

This is not my code. It’s far too clean.

Next up, you need to deploy the pipeline. Now, I know you’re not thinking of doing that from your own development machine, right?

Instead let’s use something to automate this process, and wire up a nice CI/CD pipeline to deploy everything for you.

Enter Cloud Build.

Previously known as Container Builder, this is Google’s fully managed CI/CD service that executes your builds on their infrastructure. Cloud Build hooks up to your source repo, executes the build to your specifications, and produces artifacts such as Docker containers or Java archives.

Many people think that Cloud Build is just for building container images, but it’s not. Cloud Build executes your build as a series of build steps, where each step runs in its own container. A build step can do anything that can be done from a container, irrespective of the environment. So, in fact, you can build and produce whatever you like.
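To make that concrete, a build step can name any image at all — this (deliberately contrived) config is a perfectly valid build:

```yaml
steps:
# Any image works as a build step, not just builders
- name: 'ubuntu'
  args: ['bash', '-c', 'echo "a build step can run anything a container can"']
```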

Got it? Great, let’s move on.

Going into the weeds

First up, we’ll need to create a build trigger that runs whenever a change is committed to a GitHub repo. You can do that by following the instructions here. It should look something like this when you’re finished:

Here’s one I made earlier

When Cloud Build is triggered, what actually happens is that the service spins up an n1-standard-1 instance under the hood (the size of the instance is configurable, by the way) and executes your build on that machine. Once it’s finished, it’s all thrown away. That’s known as a disposable environment, or what the fancy people like to call ephemeral — a word I can never pronounce.

Note: in case you missed it, all the code you’re about to see is located here.

Next, we need a build configuration file written in YAML that defines the build we want to execute. This file needs to be called cloudbuild.yaml and looks like this:

cloudbuild.yaml

Step one tells Cloud Build we need a builder/container for using Git, and to pull the repo from the URL specified. By default, Cloud Build uses a directory named /workspace as its working directory. If your config file has more than one build step, the assets produced by one step can be passed to the next one via the persistence of the /workspace directory, which lets you set up a pipeline of build steps that share assets. If you set the dir field in a build step, the working directory becomes /workspace/<dir>.
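The steps described in this post can be sketched roughly as follows. This is not the actual cloudbuild.yaml from the repo — the repo URL, directory names, bucket names and function name are all illustrative:

```yaml
steps:
# Step 1: pull the repo (URL is illustrative)
- name: 'gcr.io/cloud-builders/git'
  args: ['clone', 'https://github.com/my-org/my-repo.git']
# Step 2: create the GCS bucket with Terraform
- name: 'hashicorp/terraform'
  dir: 'terraform'
  args: ['init']
- name: 'hashicorp/terraform'
  dir: 'terraform'
  args: ['apply', '-auto-approve']
# Step 3: build the Java app and stage the Dataflow template on GCS
- name: 'gradle'
  dir: 'dataflow'
  args: ['gradle', 'run']
# Step 4: test, then deploy, the Cloud Function
- name: 'gcr.io/cloud-builders/npm'
  dir: 'function'
  args: ['test']
- name: 'gcr.io/cloud-builders/gcloud'
  dir: 'function'
  args: ['functions', 'deploy', 'triggerDataflow', '--trigger-bucket=my-bucket']
# Step 5: copy a seed file into the bucket to kick off the pipeline
- name: 'gcr.io/cloud-builders/gsutil'
  args: ['cp', 'gs://my-seed-bucket/data.csv', 'gs://my-bucket/']
# Finally: upload build artifacts to GCS
artifacts:
  objects:
    location: 'gs://my-bucket/artifacts/'
    paths: ['dataflow/build/libs/*.jar']
```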

Now that we’ve got some code on the machine, the next thing we need is a GCS bucket where the Cloud Function and Dataflow template will be uploaded to. There are a few ways of doing this (you could use the gsutil or gcloud builders), but I’m trying to pretend that I’m smart, so I get Cloud Build to pull the public hashicorp/terraform image from Docker Hub and run a TF script found in the terraform dir.

infra.tf
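The original infra.tf isn’t reproduced here, but for a single bucket it only needs to declare something along these lines — the project, bucket name and region below are all illustrative:

```hcl
# Hypothetical sketch of infra.tf: one GCS bucket for staging the
# Cloud Function and Dataflow template. Names/regions are made up.
provider "google" {
  project = "my-project"
  region  = "australia-southeast1"
}

resource "google_storage_bucket" "staging" {
  name          = "my-project-dataflow-staging"
  location      = "australia-southeast1"
  storage_class = "REGIONAL"
}
```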

Once the ridiculously overkill Terraform steps are finished, we need to build and deploy the templated Dataflow pipeline. I’m using the Java SDK (I’m sorry), so step 3 grabs a Gradle container, builds the Java app (my Dataflow pipeline), and then runs it. Running it stages the Dataflow pipeline on GCS as a template, ready for execution by the Cloud Function, which will also pass it the name of the file to start processing.

Here’s the Dataflow template:

Ugly looking Java. Don’t judge me, please!
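The author’s actual Java isn’t shown here, but a templated Beam pipeline along these lines might look like the sketch below. The class name, the CSV layout, the output table and the option name are all assumptions; the key templating ingredient is the ValueProvider-backed option, which lets the Cloud Function pass the file name at launch time rather than at build time:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class IngestPipeline {

  public interface Options extends PipelineOptions {
    @Description("File to process, passed by the Cloud Function at launch time")
    ValueProvider<String> getInputFile();

    void setInputFile(ValueProvider<String> value);
  }

  // Assumed CSV layout "id,value" — purely illustrative.
  static class ToTableRow extends DoFn<String, TableRow> {
    @ProcessElement
    public void process(ProcessContext c) {
      String[] parts = c.element().split(",", 2);
      c.output(new TableRow().set("id", parts[0]).set("value", parts[1]));
    }
  }

  public static void main(String[] args) {
    Options options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline p = Pipeline.create(options);
    p.apply("ReadFile", TextIO.read().from(options.getInputFile()))
        .apply("ParseLines", ParDo.of(new ToTableRow()))
        .apply(
            "WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // illustrative table
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    p.run();
  }
}
```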

Next, we’ll run some unit tests for our Cloud Function using the npm container, and then deploy the Cloud Function using the gcloud build step.

The Cloud Function is easy:

index.js

Step 5 simply copies a file that I already have in GCS to trigger Dataflow. I use this for live demos, but you could use something like this for integration testing I suppose.

Lastly, and just to show it’s possible, I upload some artifacts from the build process to the GCS bucket. This would be useful if you needed to deploy applications into a GKE cluster using Spinnaker for example. Because this is deploying a serverless data ingestion pipeline, we don’t actually need this step.

This is what it looks like all wired up:

Putting it all together

Shake it [out] baby

With everything in place and wired up, it’s time to shake it all out and test that it works. Changing a file in the repo and pushing it kicks off the build:

Ohh, we’ve got stuff working!

Let’s have a look at the build itself:

Green ticks give me a fuzzy feeling

The build has succeeded and during the process it’s done the following:

- Cloned the GitHub repo.
- Ran some Terraform nonsense to create a GCS bucket.
- Deployed a templated Dataflow pipeline using a Gradle builder/container.
- Ran some NPM tests, and deployed a Cloud Function.
- Copied a file into the bucket which is being monitored by the Cloud Function.
- Uploaded artifacts to GCS.

You’ll notice you get all the logs too (and can download them or open them in Stackdriver):

Logs. Glorious logs!

Finally, let’s have a look at the Dataflow pipeline and make sure that it triggered and ran successfully. Remember, we copied a file into the bucket to trigger it.

Et voilà:

25 million elements written to BigQuery in a few minutes

Houston, we have data!

The mandatory wrap up

I couldn’t find a picture for the conclusion, so here’s one of my dog instead. Her name is Nikko.

You’ve now hopefully seen how you can use Cloud Build for something other than just building and pushing containers to GCR. Instead, we were able to use it to automate the deployment of a serverless batch ingestion pipeline on GCP.

Cloud Build isn’t perfect. It doesn’t have all the bells and whistles of something like CircleCI, nor does it yet support other source repos like GitLab. I’m not going to go into the detailed differences, because that’s already been done, and the author did a great job of it.

That said, it’s pretty darn nifty nevertheless if you ask me. A fully managed CI/CD service, running on the same pipes/infrastructure as all my other tools and services? Yes please! And it’s only going to get better with time as Google keeps improving and innovating on it.

https://www.servian.com/gcp/