Its been a while since my last blog, although I had a good excuse as I’ve moved house into the Copenhagen suburbs, Brønshøj. But in the meantime, I have had the chance to find my new favourite tool for data engineering - Google Cloud Functions in Python.

This post will look at how to use them to work with the GA360 BigQuery export tables.

The code is available in this GitHub repo for useful cloud functions with Google Analytics

Another advantage is that event based workflows are asynchronous in the respect they can trigger in parallel. Going back to the GA360 example, a scheduled script may loop through each BigQuery table whilst doing its export - however an event based trigger will launch all the copy jobs at once.

An event based trigger just waits until it sees the data, then launches the code.

For example, the GA360 exports to BigQuery usually happen before 0900 for my clients, but that is not guaranteed. If you have all your data processing set for a schedule of 0910 say, its going to break if those exports are late.

One major benefit is that it helps move you from a schedule based mindset (such as running scripts on a schedule/cron) to an event based one.

Up until recently they have only been available in Node.js, but the new Python 3 runtime is now in beta . Since I have much more experience with the latter, I took that as an excuse to try it out, and found it solved a lot of little data engineering tasks.

Cloud Functions are the highest level of abstraction on the serverless stack of the GCP, and offer you a way to just upload code, and then let Google take care of all the dev-ops underneath for serving the results of your function. They have preconfigured triggers, which you can use to run your code under certain circumstances, such as if a new object arrives in Google Cloud Storage; if you make a request to a HTTP endpoint; or if it receives a message from Pub/Sub, Google’s message broker service.

Example: Copy GA360 exports to a partitioned table

As a demonstration, the script below shows how to create a cloud function that will copy a BigQuery GA360 data export to another table. You may want to do this to aggregate your datasets. However the use case here is to create a partitioned table version of the GA360 exports, to make them easier to work with.

Why partitioned tables? Partitioned tables are a more modern take on organising date based BigQuery tables than the older sharded format. I’m pretty sure if the BigQuery exports were redone today they would be in partitioned tables, but the GA360 exports predate them.

Limitations You can’t copy tables outside the region they are in (EU or US for example), for that check out the Cloudy McCloudface master Graham Polley with his article going through Cloud Dataflow. Cloud functions only work on the same project they are deployed to, so you will need to put it in the same GCP project as the BigQuery exports. Don’t make large heavy calculations in your cloud function. They should be light-weight “glue” that trigger larger operations, not calculating big transformations themselves.

1. Setup destination dataset and table First create the dataset where you want the data to be copied to (such as ga360 ) in the same region of your existing BigQuery exports. By default this is US, but you may have set it up to be EU. Create an empty partitioned table within the dataset called ga_sessions . You don’t have to specify a schema as that will be created when the first data hits it. You could choose to require a partition filter if you have a lot of data, to make sure people don’t query a lot of data by mistake.

2. Setup the logging export and pub/sub trigger First I was disappointed that there wasn’t a “BigQuery table ready” event for Cloud Functions… @felipehoffa a trigger every time BigQuery datasets get a new table would be cool, to feed into cloud functions/pubsub etc — Mark Edmondson (@HoloMarkeD) September 26, 2018 …but then I realised that you can create almost any trigger you want using the cloud log viewer within the GCP console. With its use of advanced filters, any event thats happening in your GCP can be narrowed down, and then by using its “Create Export” button you can turn that filter into a Pub/Sub trigger, which Cloud Functions can use. Here is the filter I use. It uses the analytics-processing-dev@system.gserviceaccount.com email you give to enable GA360 BigQuery exports to narrow down, and then prevents the ga_sessions_intraday tables from appearing, as they are not final data. Put this in your logging filter: resource.type="bigquery_resource" protoPayload.methodName="jobservice.jobcompleted" protoPayload.serviceData.jobCompletedEvent.eventName="load_job_completed" protoPayload.authenticationInfo.principalEmail="analytics-processing-dev@system.gserviceaccount.com" NOT protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.tableId:"ga_sessions_intraday" Now use “Create Export” and select the “Cloud Pub/Sub” Sink Service. This is the Pub/sub topic that the cloud function will subscribe too, and will update every new logging entry. It should look like this:

3. Write the code The code is available in this GitHub repo for useful cloud functions with Google Analytics The code is adapted from the Cloud Pub/Sub tutorial for Cloud Functions. You will need to specify the pip requirements for your code in a requirements.txt file, but you don’t have to actually upload the libraries as you do for say App Engine. Cloud Functions will download them for you when you deploy. This requirements.txt file only imports the BigQuery Python client: google-cloud-bigquery==1.5.1 The full Python code is below, with some walkthrough afterwards: import logging import base64 import json from google.cloud import bigquery import re # replace with your dataset DEST_DATASET = 'REPLACE_DATASET' def make_partition_tbl_name(table_id): t_split = table_id.split('_20') name = t_split[0] suffix = ''.join(re.findall("\d\d", table_id)[0:4]) name = name + '$' + suffix logging.info('partition table name: {}'.format(name)) return name def copy_bq(dataset_id, table_id): client = bigquery.Client() dest_dataset = DEST_DATASET dest_table = make_partition_tbl_name(table_id) source_table_ref = client.dataset(dataset_id).table(table_id) dest_table_ref = client.dataset(dest_dataset).table(dest_table) job = client.copy_table( source_table_ref, dest_table_ref, location = 'EU') # API request logging.info('Copy job: dataset {}: tableId {} -> dataset {}: tableId {} - ' 'check BigQuery logs of job_id: {} for status'.format( dataset_id, table_id, dest_dataset, dest_table, job.job_id)) def extract_data(data): """Gets the tableId, datasetId from pub/sub data""" data = json.loads(data) table_info = data['protoPayload']['serviceData']['jobCompletedEvent']['job']['jobConfiguration']['load']['destinationTable'] logging.info('Found data: {}'.format(json.dumps(table_info))) return table_info def bq_to_bq(data, context): if 'data' in data: table_info = extract_data(base64.b64decode(data['data']).decode('utf-8')) copy_bq(dataset_id=table_info['datasetId'], table_id=table_info['tableId']) else: raise ValueError('No data found in pub-sub')

Handling pub/sub functions The functions below take care of processing the pub/sub data. All data arrives as a base64 encoded string, which is extracted. This extraction is in the same format as the logs - the extract_data() function parses out what is needed for the BigQuery copy job, namely the table_id . You can adapt this as needed if you need other information for your code. As detailed in the tutorial, the Pub/sub trigger needs to be processed by a function expecting the data and context arguments. That is the bq_to_bq(data, content) function below. def extract_data(data): """Gets the tableId, datasetId from pub/sub data""" data = json.loads(data) table_info = data['protoPayload']['serviceData']['jobCompletedEvent']['job']['jobConfiguration']['load']['destinationTable'] logging.info('Found data: {}'.format(json.dumps(table_info))) return table_info def bq_to_bq(data, context): if 'data' in data: table_info = extract_data(base64.b64decode(data['data']).decode('utf-8')) copy_bq(dataset_id=table_info['datasetId'], table_id=table_info['tableId']) else: raise ValueError('No data found in pub-sub')

Copy the BigQuery table functions The functions below then create the table_id you want to write to. This basically means taking the table name ga_sessions_20171011 and turning it into ga_sessions$20171011 , so that the partitioned table is written to the same date partition as the export. In the example below, adapted from here, it then creates the BigQuery job to copy. Note here we do not wait for the job to finish - we are in an event style mindset, so just want to trigger and finish. Instead we just return the jobId , so that you can examine it the BigQuery logs if you need to. def make_partition_tbl_name(table_id): t_split = table_id.split('_20') name = t_split[0] suffix = ''.join(re.findall("\d\d", table_id)[0:4]) name = name + '$' + suffix logging.info('partition table name: {}'.format(name)) return name def copy_bq(dataset_id, table_id): client = bigquery.Client() dest_dataset = DEST_DATASET dest_table = make_partition_tbl_name(table_id) source_table_ref = client.dataset(dataset_id).table(table_id) dest_table_ref = client.dataset(dest_dataset).table(dest_table) job = client.copy_table( source_table_ref, dest_table_ref, location = 'EU') # API request logging.info('Copy job: dataset {}: tableId {} -> dataset {}: tableId {} - ' 'check BigQuery logs of job_id: {} for status'.format( dataset_id, table_id, dest_dataset, dest_table, job.job_id))