Building an ETL pipeline with load monitoring and error handling.

Welcome to Part 3 of the tutorial series “Build a Data Warehouse in the Cloud using BigQuery”. This tutorial demonstrates how to build a real-time (or close to real-time) analytics ETL pipeline using Cloud Functions. In Part 2 we wrote a Cloud function to load data from Google Storage into BigQuery.

Our Cloud function was built on top of the hybrid solution that we completed in Part 1.

In Part 1 we created a streaming lambda to transfer our data files from AWS S3 to Google Storage. In Part 2 we wrote a Cloud function to load different types of data into BigQuery. We will use it as the foundation for our ETL pipeline, which will be responsible for loading, load monitoring, and error handling.

Before trying this article, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries.

Project layout:

Part 1: Sync only new/changed files from AWS S3 to GCP Cloud Storage.

Part 2: Load data into BigQuery. Create tables using schemas. We’ll use yaml to create the config file.

You are here >>> Part 3: Create a streaming ETL pipeline with load monitoring and error handling.

Prerequisites:

Google developer account.

Google Command Line Interface. Follow this article to install it: https://cloud.google.com/sdk/ .

Finally, we’ll need Python to run our core application. The easiest way to get it is to install the Anaconda distribution.

Before starting, let’s think about the architecture of our ETL pipeline.

1. Files of mixed types and formats are uploaded to the PROJECT_STAGING_FILES Cloud Storage bucket.

2. This event triggers the streaming Cloud function we created in Part 2.

3. The Cloud function checks the file name; if it matches a table name in schemas.yaml, the data is parsed and inserted into the relevant BigQuery table.

4. We log the result into Cloud Firestore and Stackdriver Logging.

5. The Cloud function publishes a success/error message to one of the following Cloud Pub/Sub topics:

- streaming_success_topic_staging

- streaming_error_topic_staging

6. Files are moved from the source bucket to one of the following buckets:

- PROJECT_STAGING_FILES_ERROR

- PROJECT_STAGING_FILES_SUCCESS
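The matching step (file name to table) can be sketched in Python. This is an illustration only, not the actual code from the repository: the SCHEMAS dict stands in for the parsed schemas.yaml, and the file-name pattern is an assumption based on the test files used later in this tutorial.

```python
import os
import re

# Stand-in for the parsed contents of schemas.yaml: one entry per target table.
SCHEMAS = {
    "table_5": {"format": "NEWLINE_DELIMITED_JSON"},
    "table_7": {"format": "NEWLINE_DELIMITED_JSON"},
}

def table_name_from_file(file_name):
    """Extract a table name such as 'table_7' from a file name like
    'table-7_data_new_line_delimited_json.json'; return None if no match."""
    match = re.match(r"(table[-_]\d+)", os.path.basename(file_name))
    return match.group(1).replace("-", "_") if match else None

def route_file(file_name):
    """Return the target table name if schemas.yaml knows it, otherwise None."""
    table = table_name_from_file(file_name)
    return table if table in SCHEMAS else None
```

A file that matches goes to BigQuery; anything else ends up in the error flow.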

The next diagram demonstrates the ETL pipeline flow and its components. The solution is very similar to the one suggested by Google in this article.

The difference is in the details. Part 3 adds extra information about how to ingest different data types and formats.

We will be using schemas.yaml to create table definitions, so once our Cloud function is deployed we can safely forget about it. If we need to add a new table we’ll just add a new record to the schemas file.

Another addition is a few helper functions to read the ingestion result logs from the Firestore collections.

We will be using BigQuery Python API in order to process and load files. You can find usage guides here — https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries.

We will also need a Google service account in order to use Google account credentials. We already did that in Part 1, but in case you are only interested in this particular tutorial, this page contains the official instructions for Google-managed accounts: How to create a service account. After this you will be able to download your credentials and use the Google API programmatically with Python. Your credentials file will look like this (example):

./your-google-project-12345.json

OK, let’s do it !

Clone Starter Repository

Run the following commands to clone the starter project and install the dependencies:

The master branch includes base directory template for Part 1, with dependencies declared in package.json.

After running git checkout part3 you will be on the branch for this part with all the Python code we need. It is slightly different from Part 2, as we added the topics, Firestore, and a ‘move’ files function.

Create source and destination buckets using Google command line interface

Use your Google account credentials (replace the file and path to match yours) and in command line run this:

$ export GOOGLE_APPLICATION_CREDENTIALS="./client-50b5355d654a.json"

Then in your command line run these two commands which will create two buckets:
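A sketch of those two commands, assuming the source and success bucket names used later in this tutorial and the europe-west2 region (adjust both to your project; the error bucket can be created the same way):

```shell
# Create the source bucket that triggers the streaming Cloud function
gsutil mb -l europe-west2 gs://project_staging_files
# Create the destination bucket for successfully processed files
gsutil mb -l europe-west2 gs://project_staging_files_success
```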

If you don’t have gsutil installed on your machine you could try using Cloud Shell.

Great! Now if you go to Google web console you should be able to see your new buckets.

Create Cloud Pub/Sub topics using CLI

Now let’s create the topics we need to publish ingestion results into. They will be used to trigger two other Cloud functions which will be moving processed files into success/error bucket. Run this in your command line:
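Something like this, assuming the topic names from the architecture above:

```shell
gcloud pubsub topics create streaming_success_topic_staging
gcloud pubsub topics create streaming_error_topic_staging
```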

Run this to list your topics:
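For example:

```shell
gcloud pubsub topics list
```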

Great! The topics are there now.

Set up your Cloud Firestore database

In the ideal scenario we would like to have a record of each data file ingestion:

We want our Cloud Function to log each file name it processed into a daily Firestore collection.

We also would like to log the ingestion status (success/error) and the error message. This will be a big help in case we need to fix corrupted data.

Duplication attempts might generate inaccurate reports. We want to track those as well.

Let’s create your Cloud Firestore instance, follow these steps:

1. In the GCP console, go to Cloud Firestore.

2. In the Choose a Cloud Firestore mode window, click Select Native Mode.

3. In the Select a location list, select a region, and then click Create Database.

4. Wait for the Cloud Firestore initialization to finish. It usually takes a few minutes.

Now let’s deploy our Cloud functions: streaming_staging, which we wrote in Part 2, plus two extra functions responsible for handling successful (streaming_success_staging) and failed (streaming_error_staging) data ingestion. They will move each file into the relevant bucket so we can investigate it later.

Deploy Cloud Functions

We will create a shell script for this. This is how our Cloud Function streaming_staging looks:

./main.py

Looks huge but we only added a few functions. Check Part 2 to see the difference.

main.py contains 6 different functions to process the data we have in our source files. Each of them handles one of the most popular scenarios for processing data in modern data lakes. We described the source data file types at the beginning of the Part 2 article, but essentially they are different flavours of JSON and CSV. Have a look at the ./test_files folder. Each file with a table_ prefix represents a use case and has its own data structure. Take your time to familiarise yourself with the code, or just keep reading and I will talk you through each step.

Now let’s create a shell script to deploy it ./deploy.sh.

It is slightly different from the Part 2 deploy.sh: we’ll add an ENVIRONMENT variable and labels to simplify future deployment and monitoring.

Have a look at this file. The last two commands set the labels and the ‘staging’ environment variable. For example, the ‘staging’ environment variable will be used to create a Firestore collection called streaming_collection_staging. Replace “your-project” with your project name and we are ready to go.
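For reference, a minimal sketch of what ./deploy.sh might contain, collapsed into a single deploy command for brevity; the runtime, region, entry point, and trigger bucket are assumptions — check the actual file in the repository:

```shell
#!/bin/bash
# Hypothetical sketch of ./deploy.sh — usage: ./deploy.sh staging
ENVIRONMENT=$1

gcloud functions deploy streaming_${ENVIRONMENT} \
  --source ./ \
  --entry-point streaming \
  --runtime python37 \
  --region europe-west2 \
  --trigger-bucket project_${ENVIRONMENT}_files \
  --set-env-vars ENVIRONMENT=${ENVIRONMENT} \
  --update-labels env=${ENVIRONMENT}
```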

Now in your command line run ./deploy.sh staging. After that you should see a JSON response with the deployment progress saying that your Cloud function has been successfully deployed.

Now let’s deploy two functions which will move ingested file to success/error bucket depending on the result.

Essentially it is the same function, ./functions/move_file/main.py, but we will name the deployments differently and each will have its own trigger topic.

./functions/move_file/main.py:

To deploy the first function run this in your command line (or Cloud Shell):
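A sketch of that deployment command; the entry point name and the bucket environment variables are assumptions based on the architecture above — check the repository for the exact values:

```shell
gcloud functions deploy streaming_error_staging \
  --source ./functions/move_file \
  --entry-point move_file \
  --runtime python37 \
  --region europe-west2 \
  --trigger-topic streaming_error_topic_staging \
  --set-env-vars SOURCE_BUCKET=project_staging_files,DESTINATION_BUCKET=project_staging_files_error
```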

Change the staging suffix to whatever environment you use. This is just a shell command, and we can always wrap it all up into one script with variables at the end.

Notice that we have just set up a trigger and two environment variables for that Cloud function:

Now run this in your command line to check function status:
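For example (adjust the region to yours):

```shell
gcloud functions describe streaming_error_staging --region europe-west2
```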

Great. If you see this then your function has been successfully deployed:

Now let’s deploy the second function to move file to success bucket:

It is almost the same. We just changed the trigger and destination bucket:
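Something like this — again, the entry point and bucket names are assumptions:

```shell
gcloud functions deploy streaming_success_staging \
  --source ./functions/move_file \
  --entry-point move_file \
  --runtime python37 \
  --region europe-west2 \
  --trigger-topic streaming_success_topic_staging \
  --set-env-vars SOURCE_BUCKET=project_staging_files,DESTINATION_BUCKET=project_staging_files_success
```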

Let’s test our Cloud function.

All we need to do in order to run a test is to copy a file to our source bucket called project_staging_files. In your command line run this:
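For example, assuming one of the test files from ./test_files (the exact file name is an assumption):

```shell
gsutil cp ./test_files/table-7_data_new_line_delimited_json.json gs://project_staging_files
```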

Now go to Google web console and run a query: select * from `your-project.staging.table_7`:

Great! Records from your file have been successfully loaded into BigQuery. Let’s see what we have in Firestore. Go to Google web console and select Firestore:

Now let’s go to our files success bucket. We should find our file successfully moved from project_staging_files to project_staging_files_success:

Try running the cp command again on that file and you will see a duplication attempt record in Firestore, which means everything works as expected:

Now let’s test the error scenario. Run this in your command line:
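For example (the source file name is an assumption; the destination name is the deliberately wrong one):

```shell
# Copy the table 7 test file under a table 5 name to provoke an ingestion error
gsutil cp ./test_files/table-7_data_new_line_delimited_json.json \
  gs://project_staging_files/table-5_data_new_line_delimited_json.json
```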

We have just tried to copy the same file to a wrong table. We named our file, meant for table 7, /table-5_data_new_line_delimited_json.json.

Our Cloud function read the file name and decided to insert its contents into table 5, obviously using a method which is not suitable for that data.

Here we received an error, and the file has been left in the project_staging_files bucket for further investigation.

Nice! The data load pipeline seems to be working fine. Time to do some load monitoring and set up alerts.

Alerts and Load monitoring in BigQuery

The Firestore database we set up earlier makes it easier to find and fix data ingestion errors. We can filter them using the standard Python API for Cloud Firestore.

The query request looks like this:

firestore/show_streaming_errors.py
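A sketch of what that query might look like, assuming a collection named streaming_collection_staging and documents carrying status and error_message fields (the field names are assumptions — check the file in the repository):

```python
from google.cloud import firestore

def show_streaming_errors(collection="streaming_collection_staging"):
    """Print every document whose ingestion status is 'error'."""
    db = firestore.Client()
    query = db.collection(collection).where("status", "==", "error")
    for doc in query.stream():
        record = doc.to_dict()
        print(doc.id, record.get("error_message"))

if __name__ == "__main__":
    show_streaming_errors()
```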

To run this query on your machine do the following:

1. Create a virtual environment in your firestore folder.

2. Install the Python Cloud Firestore module in your virtual environment.

$ pip install google-cloud-firestore

3. Query the database to get error records:

$ python ./functions/firestore/show_streaming_errors.py

./functions/firestore/show_streaming_errors.py will run the query and pull the results out:

https://gist.github.com/mshakhomirov/242618dce3ad112b863c9757237dc87c

Before it shows the results it may return an error asking you to provide Google credentials. You know what to do, right?

How to get the list of all files from Firestore for today.

In order to get a filtered list of all records since midnight we will use compound filtering like so:
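A sketch of the idea: compute today’s midnight as the lower bound, then chain two where() clauses on the timestamp field (the collection and field names are assumptions):

```python
import datetime

def midnight_utc(now=None):
    """Return a datetime for today's midnight (00:00 UTC), the lower bound of the filter."""
    now = now or datetime.datetime.utcnow()
    return datetime.datetime(now.year, now.month, now.day)

# The compound Firestore filter would then look like this:
#
#   query = (db.collection("streaming_collection_staging")
#              .where("timestamp", ">=", midnight_utc())
#              .where("timestamp", "<=", datetime.datetime.utcnow()))
```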

We have also adjusted the final output function to display all ingestion results:

Check the full file ./functions/firestore/show_only_today.py

In order to execute it, in your command line run this:

$ python ./functions/firestore/show_only_today.py

Delete old documents from Firestore

./functions/firestore/delete_where.py

Now just enter the date like below and run

$ python ./functions/firestore/delete.py
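The core of that script can be sketched as follows: parse the cutoff date, then delete every document older than it (the collection and field names are assumptions — check delete_where.py in the repository):

```python
import datetime

def parse_cutoff(date_string):
    """Parse a 'YYYY-MM-DD' string into the cutoff datetime for deletion."""
    return datetime.datetime.strptime(date_string, "%Y-%m-%d")

# The deletion itself would iterate over the matching documents:
#
#   query = db.collection("streaming_collection_staging").where("timestamp", "<", parse_cutoff("2020-01-01"))
#   for doc in query.stream():
#       doc.reference.delete()
```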

Delete old documents from Firestore by name

Check file delete_by_name.py:

Success!

Run deactivate to deactivate your environment.

Set up notification alerts

In our production environment we would like to keep everything under control and be notified when data ingestion goes wrong.

Our Cloud function generates logs in Stackdriver Logging, so we will create a custom metric for the ingestion error.

Then we will create alerting policy to notify us by email whenever data load fails.

1. Go to the Logs-based metrics page in the Google web console.

2. Click Create Metric.

3. In the Filter list, select Convert to advanced filter.

4. In the advanced filter, paste the following configuration:

resource.type="cloud_function"
resource.labels.function_name="streaming_staging"
resource.labels.region="europe-west2"
"Error streaming file "

5. In the Metric Editor, fill in the following fields and then click Create Metric:

- In the Name field, enter streaming_staging-error.

- In the Labels section, enter payload_error in the Name field.

- In the Label type list, select String.

- In the Field name list, select textPayload.

- In the Extraction regular expression field, enter (Error streaming file ‘.*’.)

- In the Type list, select Counter.

6. Now go to the Create a Policy page.

7. In the Create a Policy window, complete the following steps.

Click Add Condition.

Complete the following fields, and then click Save:

- In the Title field, enter streaming-staging-error-condition.

- In the Metric field, enter logging/user/streaming_staging-error.

- In the Condition trigger If list, select Any time series violates.

- In the Condition list, select is above.

- In the Threshold field, enter 0.

- In the For list, select 1 minute.

In the Notification Channel Type list, select Email, enter your email address, and then click Add Notification Channel.

In the Name this policy field, enter streaming-error-alert, and then click Save.

Great! After saving, Stackdriver will start sending notifications every time a data load fails.

Part 3 is done

Don’t forget to clean up and delete everything to avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial.

We have deployed a Cloud Function to load data from Google Storage into BigQuery.

We’ve created a streaming ETL pipeline to load data stored in Google Storage in different formats.

We’ve set up data ingestion status logging with Firestore.

We’ve also set up Stackdriver Alerts to send notifications by email in case something goes wrong.

In Part 4 we’ll schedule a few SQL scripts and set up a bunch of nice-looking reports made with Google Data Studio.

Thanks for reading!

Stay tuned! I post weekly.