What is this about?:

The purpose of this 2-part article series is to give a complete overview of how to investigate (part 1) and optimize (part 2) the costs of your BigQuery setup in a data warehousing / ETL pipeline context, starting at ingestion time and going all the way to the use of modern ‘Direct-Query’ BI tools. This is the 1st part of the series.

To clarify: with ‘Direct-Query’ BI tools I mean tools that fully rely on the underlying database as the query execution engine / data cache. As such they don't have an integrated caching layer of their own (as, for example, Tableau Server or Business Objects do).

This article especially mentions Looker as that BI tool, as that was the end-user tool I used during my time at spilgames. On top of that, it has some special ‘tricks up its sleeve’ that can help you optimize. All that later!





Why optimize costs?:

BigQuery, being a full Datawarehouse-as-a-Service (DWAAS) solution, is super-awesome; I wrote about that last year here…

However, there is a potential caveat in working with it: you really have to get used to the ‘pay for what you use’ model it employs. This is especially relevant when you are storing multiple Terabytes of data and you decide to just ‘blindly’ query all of it repeatedly.

Though the system is powerful enough to query 10 TB of data and return results within a minute, this is NOT really cost effective when reading a TB of data costs you 5 USD (https://cloud.google.com/bigquery/pricing) and you decide to hammer the system with such a query every x minutes...
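To make that concrete with some made-up but realistic numbers: a 10 TB query fired every 15 minutes runs 96 times a day, which is 96 x 10 TB x 5 USD = 4,800 USD per day, or roughly 144,000 USD per month, for one single recurring query.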

So: why throw away money? Yet that IS exactly what is happening whenever the effort/cost of optimizing would be lower than the potential cost savings...





Where to optimize: collect data first!

When you want to start optimizing your costs, you need to know where to start. For that, you first need detailed insight into how and where you are spending your valuable dollars... If you don't start there, chances are high you will be ‘just firing blind’…

Basically, costs in BigQuery are incurred in 3 different ways:

Storage costs: 20 USD per TB / month, or 10 USD per TB / month for a table or table partition (see later) that did not have any INSERT / UPDATE / DELETE in the last 90 days

Query costs: as mentioned in the Why section, 5 USD per TB read

Streaming insert costs: BigQuery supports streaming data into tables (for example via Dataflow). If you use that functionality, you pay 50 USD per INSERTED TB

It is absolutely vital to realize that BigQuery’s cost calculation works with RAW / UNCOMPRESSED data!

So, even though the system stores its data in a compressed columnar format, as a user you are charged for the raw stuff… This has some consequences for the optimization techniques that will be mentioned in the 2nd part of this article.

Realizing this is also important when you compare storage costs between BigQuery and ‘other’ cloud data warehouses: in other systems you usually pay for the actual disk storage (which in effect means COMPRESSED data).
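As an illustration (the compression ratio here is an assumption and varies a lot per dataset): event data often compresses 4:1 or better in a columnar format, so a table that occupies 1 TB of ‘paid’ disk in such a system can easily show up as 4+ TB of billed storage in BigQuery.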

Monitoring BigQuery is the key to finding out where the biggest costs are. This can be done in multiple ways and at various levels of detail:

1. At the Project / Product / SKU level

As of Q4 2018, the easiest way to do this is through the Google Cloud Console, using billing reports. These reports provide an overview in which you can look at the costs over various time periods and at various levels of detail, as shown here: https://cloud.google.com/billing/docs/how-to/reports. Using this information, you will gain insights like:

The ratio between storage and usage costs: what costs to address first?

By looking at your usage patterns visually you can spot sudden changes or a gradual trend.
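As a side note: if you also have the billing export to BigQuery enabled (not required for this step; the table name below is just a placeholder for your own export table), you can pull the same storage vs. query split with a query along these lines:

SELECT
  sku.description AS sku_name,          -- e.g. the query SKU vs. the storage SKUs
  ROUND(SUM(cost), 2) AS total_cost
FROM `your-project.billing.gcp_billing_export_v1_XXXXXX`  -- placeholder: your own billing export table
WHERE service.description = 'BigQuery'
  AND usage_start_time >= TIMESTAMP('2018-11-01')
GROUP BY sku_name
ORDER BY total_cost DESC;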

2. At the User / Query (Grouping) level (query costs)

To investigate usage costs at a lower level of detail, I have created a BigQuery monitoring view on top of the BigQuery audit log. The added columns in this view should enable you to investigate costs at ‘just the right level of detail’, being:

NOT at the project level (not enough detail)

NOT at the individual query level (too much detail)

BUT at the ‘query type / grouping’ level.

The full SQL code (in BigQuery standard SQL) plus how to set up query logging in BigQuery can be found on my RogerData github account: AuditlogAggregation.sql

Using the deployed monitoring view, you can now run the following example query that uses this view to investigate the Query usage costs: AuditlogAggregationExample.sql

The results are aggregated per bucket of tables used in the SELECT / JOIN statements of the query, with detail rows on the WHERE/AND/OR columns used in that bucket.

Using this, learning what to optimize now comes down to investigating the top x rows of the returned dataset and asking yourself:

Which tables are used / affected?

What is the query pattern / how come queries like this are so big or run so often?

Is it a (user) query or ETL job that generates these costs?
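As a side note: if you want a quick first impression of the query costs per user before (or next to) deploying the full view, a simplified aggregation straight on the raw audit log export already gets you quite far. A sketch below; the exact field paths depend on the audit log export format you configured, here I assume the ‘classic’ BigQuery AuditData export:

SELECT
  user_email,
  COUNT(*) AS queries,
  ROUND(SUM(billed_bytes) / POW(2, 40), 2) AS billed_tb,
  ROUND(SUM(billed_bytes) / POW(2, 40) * 5, 2) AS approx_cost_usd   -- 5 USD per TB read
FROM (
  SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes AS billed_bytes
  FROM `your-project.your_auditlog_dataset.cloudaudit_googleapis_com_data_access_*`  -- placeholder: your audit log export tables
  WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
)
GROUP BY user_email
ORDER BY approx_cost_usd DESC;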

3. At the Table level (storage costs)

While in my experience the biggest costs end up being the query costs, storing data can also become a factor. This is especially true when you have a lot of event / logging data and have not properly addressed the topic of data lifecycle management... But surely you don't forget about that nowadays, with GDPR, right…?!

The following post shows several ways to look at the individual table sizes: https://stackoverflow.com/questions/52870629/how-to-list-all-table-sizes-in-a-project
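For a single dataset, a query along these lines does the trick (project and dataset names are placeholders; the post above shows how to extend it to all datasets in a project):

SELECT
  dataset_id,
  table_id,
  ROUND(size_bytes / POW(2, 40), 3) AS size_tb,
  ROUND(size_bytes / POW(2, 40) * 20, 2) AS approx_monthly_storage_usd,  -- rough estimate, ignores the long-term (10 USD) rate
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `your-project.your_dataset.__TABLES__`   -- placeholder: repeat / UNION ALL per dataset
ORDER BY size_bytes DESC;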

Armed with the insights on where the biggest costs are made we can now continue...

HOW to optimize, and in which order? I will explain that extensively in part 2 of this series... (coming soon!)

In the meantime, this page contains Google’s basic information on that topic…