BigQuery is a leader in the market of fully managed, cloud-native data warehouses, and we at Slalom Build have witnessed how its ease of use, intuitive web interface and low cost have truly democratized data across organizations. While data democratization is fundamentally positive, the exposure of massive data sets to analysts at various levels of SQL familiarity might introduce a new set of challenges. BigQuery’s optimizer and architecture differ from those of on-premise RDBMSs and MPPs, such as SQL Server and Netezza. This difference can exacerbate the issue of poorly written SQL when developers apply tuning methods that may not work.

BigQuery performance tuning often includes selecting the right table structures, such as denormalization, clustering and partitioning. While designing the right data model for your use cases is always the best approach, it’s not always possible. For example, a business need for new use cases or query patterns might arise that weren’t originally considered in the data model design. In such situations, Google has published a thorough set of SQL best practices. The goal here is that, by understanding the theory underscoring those best practices, we can identify why they help us get good performance from our existing table structures, and why certain SQL no-nos grind performance to a halt.

Architecture of BigQuery: A Logical Storage System

Like other cloud-native data warehouses, BigQuery separates storage from compute, allowing both to scale independently. BigQuery is based on Google’s Dremel query engine, which uses the concept of distributed computation against parallel splits made popular by Hadoop. Dremel essentially builds aggregator trees out of thousands of compute nodes, allowing for real-time processing where Hadoop excels at batch processing.

Figure 1- Components of BigQuery Architecture

In BigQuery, compute resources are provided by Dremel. When a query is received by the engine, clusters of slots (Units of CPU, RAM, and IO) extract data from storage (leaf nodes) and perform aggregations (mixers), working against parallel splits of the data to execute the SQL query. Movement of data between slots is called shuffle, and is facilitated by Jupiter, the 1 petabit/sec network through which storage and compute talk. Colossus is the storage system underpinning BigQuery and other GCP managed services.

In execution of any query, leaf nodes read the data from storage and perform computations required by the query. Mixers perform aggregations, shuffling the data up the tree to eventually return a query response.

Shuffles are an in-memory operation, and shuffle quotas are linked to the number of slots allocated to the account. If the shuffle quota is exceeded, shuffles can page to disk and cause performance to slow dramatically.

Dremel dynamically allocates slots as needed, allowing queries to scale up to 1000s of CPUs without any additional action required from the developer. High throughput is obtained through repeating the cycle of parallel processing, shuffling in-memory, and parallel processing.

Anatomy of Execution Plans

To understand BigQuery tuning, we first need to be fluent in reading BigQuery execution plans. In the UI, the execution plan can be found under the query window by clicking “execution details.” A more detailed execution plan is available through the API, but we’ll focus our discussion on the Web UI.

SQL is a functional language, and an execution plan in any database technology shows the set of steps, or procedure, that the database’s optimizer engine has created for executing the query. Tuning is the act of helping the optimizer engine make better decisions in creating the procedure. Given the managed nature of BigQuery, not all details in the execution plan can be actioned by the user, and indeed, the tuning opportunities in BigQuery are limited to just a few key points.

In BigQuery, execution plans are divided into stages corresponding to the leaf nodes (stage 0) and mixer nodes (stages 1 to n). Each stage has 4 components, as shown in Figure 2 below.

Figure 2: Stages of a BigQuery execution plan

The most important phase in any stage of a BigQuery execution plan is the compute phase, where the actual processing takes place, such as evaluating SQL functions or expressions. As the name implies, optimizer engine is waiting in the wait phase. It is waiting for either the slots to become available or for a previous stage to start writing results that it can begin consuming. In the read phase, the slot is reading data either from Colossus (in the case of leaf nodes) or from a previous stage (in the case of mixer nodes). The final stage is the write phase, where data is written, either to the next stage, or the master node, which is the output returned to the developer.

Timings in an execution plan are not absolute, rather they are relative to the timings of other phases. Since there are multiple parallel workers processing the data, the execution plan provides both the average and maximum times across all workers. A well-tuned query typically spends most of its time in the compute phase, and an average compute time close to max compute time indicates an even distribution of data coming out of the previous stage.

Parallel Splits: a Simple Example

To understand execution plans, let’s start by interpreting the execution plan of a simple count(*) query, running against a 10-billion-row table, and then against a 100-billion-row table. The source tables in all examples are taken from Google’s bigquery-public-data dataset. The following example was introduced at Google Next 2017.

SELECT count(*) FROM `bigquery-samples.wikipedia_benchmark.Wiki10B` WHERE title like ‘%A’

Conceptually here are the steps executed by this query.

Figure 3: Stages in a simple query

In Stage 0, leaf nodes extract data from storage in multiple parallel splits, applying the filter at extraction and aggregating the data within the split. Data is then shuffled to the first level of mixer slots for further aggregation.

In Stage 1, mixers aggregate the data, calculating the “count(*)” across parallel splits fed in from leaf nodes. Data is then shuffled through the aggregation tree to the master node.

Looking at the execution plan for the 10-billion row table in figure 4, we see the 2 stages.

Figure 4: Execution plan of COUNT query against 10 billion rows

Against 100 billion rows, we notice some key differences. See figure 5.

SELECT count(*) FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` WHERE title like ‘%A’

Figure 5: execution plan of COUNT query against 100 billion rows

The number of slots required increases tenfold as expected. Wait times increase since the engine had to wait before enough slots were available. This causes compute time to decrease since a larger proportion of time is spent waiting.

Now that we’re familiar with reading execution plans, we can examine some common performance pitfalls in BigQuery.

NOTE: all execution plans were taken at the time of writing. Being a managed service, improvements are constantly being made to the query engine, and execution plans and associated timings are bound to change. While the exact execution plan or timings may differ, the principles illustrated are not dependent on them.

Parallel Splits with Shuffle: GROUP BY and ORDER BY

For queries using a GROUP BY and ORDER BY, grouping and sorting begins in the leaf node. Like values are then shuffled to the same mixer node, as shown in this example from Google:

SELECT foo, count(*) FROM . . . GROUP BY foo ORDER BY count(*)

Figure 6: Grouping and Ordering

Performance Issues with GROUP BY and SORT: Repartitions and Resources Exceeded

The BigQuery engine dynamically decides the number of parallel splits when the query starts based on factors such as number of query operators and data volume. If the engine guesses incorrectly, it must partition a second time in an activity called Repartition. If data volumes being shuffled are high, repartitioning may spill to disk and impact performance.

Another problem we may encounter when sorting large data sets is a “resources exceeded” error. This happens when a node becomes overloaded while sorting all the values, and is dependent on the quotas of your BigQuery account. The following query uses a 500-million-row table, grouping and sorting by 2 high-cardinality columns, and we see in the execution plan in figure 7 that 7 repartitions occur before the query fails (in my case with a “resources exceeded” error).

select count(*), date, labels from `gdelt-bq.gdeltv2.cloudvision` group by date,labels order by count(*) desc

Figure 7: Grouping and partitioning on high-cardinality columns causes multiple repartitions

A Solution: LIMIT Clause

Google recommends resolving this by adding a LIMIT clause to the query. Now BigQuery can limit the amount of data after the first shuffle, reducing the size of the data sent to the master node.

SELECT foo, count(*) FROM . . . GROUP BY foo ORDER BY count(*) LIMIT 1

Figure 8- Grouping and ordering with LIMIT clause

Rerunning the earlier query with a limit clause:

select count(*), date, labels from `gdelt-bq.gdeltv2.cloudvision` group by date,labels order by count(*) desc limit 1000000

The execution plan shows the LIMIT being applied at stage 5. However, only 4 repartitions are needed, wait times are much lower, and shuffle size is reduced. This indicates that perhaps the limit is being applied earlier, under the hood. The query now completes in ~7 min.

Much lower wait times, and fewer repartitions needed, when LIMIT clause is used.

Understanding Joins

Joining is a very common, yet resource intensive SQL function. When joining large tables in BigQuery, the join keys are shuffled independently to line up on the same slot, allowing the join to be performed locally on each slot.

If the data is skewed, one slot gets far more data than other slots, causing the max compute time to greatly exceed average compute time. This can happen if the data being queried is inherently skewed (such as grouping by a field where most values are set to NULL or a default), or can be the result of a join. This is why Google recommends filtering out skewed values early, or materializing results prior to joining. By reducing data skew when leaf nodes feed mixers (or mixers feed other mixers), we reduce the chance of overloading nodes and spilling to disk.

Select * Queries

There are 2 reasons why select * queries are slow and costly in BigQuery. For one thing, selecting unnecessary columns increases the amount of data shuffled during joins and aggregations. However, BigQuery is a column-based datastore, meaning each column is stored in its own set of files. Doing a select * query, therefore requires leaf nodes to read, and then shuffle data from all files for the table.

To demonstrate this effect, here is a self-join on a 124-million-row table. The first query does a simple count(*), which does not extract any columns, while the second does a select *, pulling back all the columns. The queries have identical join conditions, but we can see the stark difference in execution time.

Here is our “count(*)” query :

SELECT count(*) FROM `1000_genomes_phase_3_variants_20150220` a left join `1000_genomes_phase_3_variants_20150220` b on a.start_position = b.start_position

We can see that its elapsed time is 18.9 sec.

Figure 10 –Summary section of execution plan for self-join with count(*)

By contrast, when we select all columns from the same query:

SELECT a.* FROM `1000_genomes_phase_3_variants_20150220` a left join `1000_genomes_phase_3_variants_20150220` b on a.start_position = b.start_position

We see that elapsed time jumps to 43 min.

Figure 11 — Summary section of execution plan for self join that selects all columns

Summary

We’ve seen that minimizing shuffle and honoring the principles of column-based storage are key to optimizing BigQuery execution plans. By keeping these principles in mind, we can avoid legacy tuning practices that are better suited to architectures such as MPPs or RDBMSs.

By preventing slots from becoming overloaded, we reduce the chances of in-memory operations spilling to disk.

By selecting only the columns we need, we prevent BigQuery from unnecessarily extracting and processing data from multiple files.

By reducing data skew when leaf nodes feed mixers (or mixers feed other mixers), we reduce the chance of overloading nodes and spilling to disk.

References: