Automation

The extraction phase

The main pipeline is:

Cloud Scheduler > Cloud Pub/Sub > Cloud Functions

As we discussed at the start, this architecture is admittedly overengineered for this specific use case: Cloud Pub/Sub is not strictly necessary, and a simpler, more compact solution would have worked. But we wanted a pipeline that would let us trigger the report update from a very broad range of scenarios and that would scale very easily.

Cloud Pub/Sub is GCP's global event ingestion and messaging service, with fully automated scaling and provisioning. It is very helpful for us because it makes it easy to integrate the report automation into the wider workflow of any application. If you need to streamline any kind of real-time analytics, it is worth a look.
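
As an illustration of that flexibility, here is a minimal sketch of how any application could trigger the same report refresh simply by publishing a message to the topic with the google-cloud-pubsub client. The project and topic names are placeholders, not the ones used in this project:

```python
# Minimal sketch: trigger the report refresh by publishing to the Pub/Sub topic.
# "my-gcp-project" and "trends-report-update" are placeholder names.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "trends-report-update")

# The payload is irrelevant here; the function only reacts to the message itself.
future = publisher.publish(topic_path, data=b"refresh")
print(f"Published message {future.result()}")
```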

The other key element in the pipeline is Cloud Functions. In this particular use case, Cloud Functions executes the Python code that runs PyTrends and updates the data on Google Cloud Storage, as we saw in the previous section. Cloud Functions is GCP's event-driven serverless compute platform. I'm sure you can clearly see the trend by now: we are aiming for a completely serverless, event-driven architecture.
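
To give a sense of the shape of that code, the following is a simplified sketch of a Pub/Sub-triggered function; the keyword list, bucket name and file name are hypothetical placeholders, and the real script is the one described in the previous section:

```python
# Simplified sketch of a Pub/Sub-triggered Cloud Function (main.py).
# Keyword list, bucket name and file name are hypothetical placeholders.
from google.cloud import storage
from pytrends.request import TrendReq


def main(event, context):
    """Background function invoked whenever a message hits the Pub/Sub topic."""
    pytrends = TrendReq(hl="es-ES", tz=60)
    pytrends.build_payload(["google", "bing"], timeframe="today 12-m", geo="ES")
    df = pytrends.interest_over_time()

    # Overwrite the CSV in the Cloud Storage bucket that Dataprep reads from.
    bucket = storage.Client().bucket("my-trends-bucket")
    bucket.blob("trends_data.csv").upload_from_string(
        df.to_csv(), content_type="text/csv"
    )
```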

These are the step-by-step instructions to set up this phase; it is extremely easy to do:

Create a job on Cloud Scheduler:

Creating a job on Cloud Scheduler

There are only two things to keep in mind here: 1) for the frequency you need to use the classic unix-cron format (for example, 5 8 * * 1 runs every Monday at 08:05); 2) remember to specify Pub/Sub as your target.

Go to Pub/Sub and create a new topic, matching the topic name you specified on Cloud Scheduler:

Creating a topic on Pub/Sub

This topic will be triggered by Cloud Scheduler and will in turn trigger the Cloud Function.
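
If you prefer scripting this step over using the console, the topic can also be created with the same client library. The call below follows the google-cloud-pubsub 2.x API (older versions accept the topic path directly), and the names are, again, placeholders:

```python
# Create the Pub/Sub topic programmatically (google-cloud-pubsub 2.x API).
# Project and topic names are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "trends-report-update")
topic = publisher.create_topic(request={"name": topic_path})
print(f"Created topic {topic.name}")
```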

The only thing left to do is to set up the Cloud Function. Cloud Functions lets us execute code in Python, Node.js or Go. In our case we are using a Python script, so we need to set up the options accordingly:

Creating the cloud function

You will need to specify Pub/Sub as the trigger. Choose the topic you created previously on Pub/Sub, upload the zip file with your Python code and select main as the function to execute. Remember also to point to the correct stage bucket on GCS that you want to use for staging.

Remember also that the zip file has to include all the scripts in its root folder (no extra folders!), including the requirements.txt file, and that main.py is your entry-point script.
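
As a rough example, the contents of the zip could look like this; the exact dependencies in requirements.txt depend on your script, and the ones listed here are only plausible guesses for a PyTrends-to-GCS job:

```
main.py            # exposes main(event, context), the function selected above
requirements.txt   # third-party dependencies, for example:
                   #   pytrends
                   #   pandas
                   #   google-cloud-storage
```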

As a side note, it is important to mention that Cloud Functions has limitations in line with the use cases it is designed for. For example, the maximum timeout is 540 seconds. If you need to execute code for longer periods, Cloud Functions is not your ideal tool; it may make sense to look into App Engine instead (perhaps triggered by a Cloud Function if you need that flexibility).

You can see the advanced options here:

Cloud Functions advanced options

As you can see, you can change the timeout, adjust the network options, add environment variables, and so on. For this specific use case, the default options work well for us.

Transforming and loading data for visualization

With the data extracted, updated and stored where we want it, we can move to the next phase: getting that data ready for reporting.

We will pull the data from Cloud Storage with Dataprep, schedule its publication on BigQuery and generate our Data Studio report from BigQuery.

Dataprep is a data preparation solution based on Trifacta Wrangler and fully integrated into GCP. It is very visual and easy to use. If you prefer to work directly with SQL recipes and have complex data transformation needs that you want automated, we highly recommend dbt. We use it internally to transform Google Ads data for use in BigQuery and it's great.

But for this specific use case, Dataprep will serve us well.

Start by creating a new flow:

Creating a flow in Dataprep

Then, import the data to work with:

Importing data on Dataprep

You can import data directly from your computer, from Cloud Storage or from BigQuery.

For our use case, we will choose the file in the GCS bucket that our Python script (executed by Cloud Functions) generates.

This will open a new flow in Dataprep. We will see a small preview and start to add our recipe:

Imported data on Dataprep

Once we begin to edit the recipe, we can start working on the data. We will see the different columns of our CSV with some descriptive information, and we can transform the data as we wish. This generates the recipe (the full sequence of steps we have taken to transform the data).

A CSV on Dataprep

When we are ready, we can hit Run Job, which will execute our instructions on Dataflow.

By default, it will generate a CSV, but we want the result in BigQuery. To do so, we just need to select Add Publishing Action and choose a dataset on BigQuery.

We can also select create/replace, append or truncate, as applicable. Since we want to continuously update the same dataset on BigQuery, we will schedule the job and choose append.

We have now automated the updating of the dataset on BigQuery that we will use as the basis for our Data Studio report. The only thing remaining is to add the data source on Data Studio (Resource menu, then manage data sources):

Adding a new data source on Data Studio

The advantage of using BigQuery as the data source for Data Studio is that we can use BigQuery BI Engine, which is an in-memory analysis service. It gives extremely fast query response and, most importantly for a reporting application, it allows high concurrency.

You can check the final report here: 2020–2019 Search Engine Index Comparison — Spain

On the first page, you have a brief selection of relevant Spanish companies and can filter by company. On the second page, you can do the same using a selection of keywords representing different sectors. The goal is basically to give a quick glimpse of the Covid-19 effect on search habits across different sectors. The report updates automatically every Monday at 08:05 Spanish time (GMT+1). As mentioned in the PyTrends section, keep in mind the limitations of Google Trends data, especially in terms of consistency. It is only designed to give a very broad overview of the main tendencies, and in that respect it's great.

Conclusions

We hope you have found this little exercise useful and informative. Our main goal was to showcase the possibilities that GCP offers for report automation and how easy it is to set up a fully automated, serverless and scalable architecture that can be customized for a wide variety of use cases and data sources.

Needless to say, there are many different paths or tools you could use to accomplish a similar task, many of them less complex in terms of architecture. But in our opinion, this exercise serves as a good starting point to explore on your own and design your ideal solution.