TL;DR BigQuery materialized views are great. You should use them!

In the database world, a materialized view is a database object that contains the results of a pre-computed query. Materialized views have been around for quite some time, first appearing with Oracle 8i Database and then proliferating across most proprietary and open source databases in some for or another. Materialized views are often used to support BI and OLAP workloads as part of a IO reduction strategy. By pre-computing common queries and caching the results, the database optimizer can re-write inflight queries to use the materialized view to reduce CPU and IO and reduce runtime.

A good example for using a materialized view is retail transaction data. Most business users don’t look at the granular transaction data itself. Instead, they want to look at the transaction data rolled up to particular dimension, like by store, product, date or all 3. When you’re working with billions of rows, doing that type computation for hundreds or thousands of queries can be extremely expensive from a compute and IO perspective. A materialized view can pre-aggregate those common business queries and cache the result. In doing so, a simple total sales by store by date doesn’t need to scan all the raw data to compute the result set. Instead, it uses the materialized view to retrieve the pre-computed data it needs to satisfy the query and this can dramatically reduce the amount of data processed by any given query.

In the latest release of BigQuery(April 2020), Materialized views are now in Beta. This article will examine using BigQuery materialized views and the power they can have on accelerating your analytic workloads.

Using BigQuery Materialized Views

In the following example, I’m using the TPCH orders table to demonstrate the power that materialized views can have on BI workloads. The orders table below is 1.5TB in size and is unpartitioned and unclustered. The simple query below attempts to look at the total sales by day , which is a pretty common question many businesses ask on a daily basis.

From the BigQuery console, we can see that the query scanned 436GB of data and took 8.4 seconds to run. Not bad….lets take a look a the job execution.

The majority of time spend on the query is the initial scan of the 29 billion rows and initial aggregation. That first step is taking the majority of the time…let’s see the impact that a materialized view can have on this query.

We take the original query and create a materialized view from it. Next we re-run the original query, still pointing at the original base table. Remember, a key tenant of a materialized view is the optimizer re-writing the query to take advantage of it. We can optionally point at the materialized view if we need to but the optimizer should select it most cases.

The original query now runs in 2.3 seconds and only processes 2.5MB of data! Compare that to the original query needing to scan nearly 500GB. The job execution metrics look even better.

The first input stage for compute and write IO have been virtually eliminated. Now the bulk of the query is spent actually writing the results to the temp table! Pretty nice improvements. The best part of all is you don’t have to refresh the data when using materialized views with BigQuery! BigQuery does that heavy lifting for you automatically. As your underlying data changes, BigQuery refreshes your data on an automated basis so you don’t have to manually trigger refreshes (though you can if you want!).

Understand that this is a very simplistic example but it showcases the power that a BigQuery materialized views can have on performance as well as making your queries much more efficient.

When Should I Use BigQuery Materialized Views?

In order to get the most out of materialized views, a thorough analysis needs to be conducted on query access patterns by your users in order to determine the best metrics to materialize to optimize your workloads. Usually large fact tables or other large tables that are frequently accessed with aggregate functions like avg, sum, min, max or tables with really complex computation are great candidates to investigate for accelerating with materialized views. ETL workloads that aggregate data are also good candidates as well as some of their logic could be augmented by a materialized view.

For more information on BigQuery Materialized Views