Partition pruning with subqueries

BigQuery supports partitioned tables where data is segmented into partitions based on a specific time or integer column (or a pseudo column representing ingestion time). When users query a partitioned table by filtering on the partition column, BigQuery will retrieve data from relevant partitions without a full scan, thereby improving query performance and lowering query cost. Always partition tables in BigQuery!

A common pitfall in partition tables is partition pruning using subqueries. As of today, if users filter a partition column based on the results of a subquery, BigQuery will still fully scan the table. For example, given a 9.7GB table partitioned on a date column _extract_d with 363 partitions, the below query getting the most recent batch of data will lead to a full scan, processing all 9.7GB of data:

A full-scan query that prunes partition column with a subquery.

We have an easy solution to avoid the full scan: storing the result of the subquery in a variable and using the variable to filter the partition column:

A SQL script defining a variable to avoid directly using a subquery to filter partitions.

As a result, BigQuery will scan only the latest partition to retrieve the query results, processing 28MB of data. In addition to the 28MB shown in query estimation, calculating MAX(_extract_dt) costs a bit extra data here, approximately the size of the _extract_dt field.

Schema evolution

As source data changes all the time, it is critical for any data warehouse to be able to handle schema evolution. Currently, BigQuery supports two types of automatic schema revolution: adding new columns and changing required columns to nullable . The addition of columns is the most useful one.

To enable auto schema evolution, users need to set the schema_update_option parameter in API calls. Assuming there is an old user table with id and name columns and a new user_new table with id , email , and name columns. The CLI command below can merge user_new table into user table and automatically add email columns to user table.

bq query --use_legacy_sql=false --append_table=true --schema_update_option 'ALLOW_FIELD_ADDITION' --destination_table 'workspace_xinran.user' 'SELECT id, email, name FROM workspace_xinran.user_new'

However, BigQuery does not support auto schema evolution in SQL:

The auto column addition is not currently supported in BigQuery.

Version control UDFs

BigQuery users can create and store persistent User Defined Functions (UDF) under datasets. Though a lot of people tend to treat UDFs casually, it is as important to keep track of UDF definitions as it is for queries and table schemas because UDFs can help improve the consistency of business definitions in the data warehouse.

Since UDFs are defined using SQL statements, users can version control UDF definitions like any other SQL queries using tools like Git. You can also build a data pipeline using Cloud Composer dedicated to updating UDFs.

Automate view creation

In BigQuery, creating views is very similar to creating tables except the access control part is more complicated. Unlike tables, views do not contain physical data since it is essentially a saved query. Therefore, when a view querying dataset A is created under a dataset B, users must grant the view read access to dataset A, or nobody can actually use the view.

Therefore, if there is a data pipeline that automates view creation, it is recommended to automate granting views access to dependent datasets as part of the same pipeline, instead of manually doing so in BigQuery UI. Check out some examples using CLI and Python here.

Note: As of today, BigQuery only supports non-materialized views.

GDrive tables

One of the many attractive features of BigQuery is it allows users to create external tables on top of Google Drive files, which my team finds very convenient since some business partners are more comfortable with data in Google Sheets. However, in order for a user, service account or an application to read GDrive tables, admins need to perform the following setups:

For service accounts with access to any GDrive tables, GSuite admin needs to enable domain-wide delegation for them.

For GDrive source files, if they are not sensitive, make them viewable for everyone under the company domain. As for sensitive files, give “Can View” access to specific users and service accounts.

For applications accessing GDrive tables, set the following scopes:

Since it is not scalable to maintain two layers of access control, plus one of which requires manual access granting, try to avoid GDrive tables if possible.