TL;DR: BigQuery materialized views and streaming data can be used together to build cost-effective, near real-time dashboards.

In the previous post, we explored the basics behind using BigQuery materialized views. In this post we’ll explore how you can leverage materialized views with streaming data.

Streaming Data into BigQuery

Batch loading into BigQuery is sufficient for many use cases, but some require near real-time ingest. For those specialty use cases, BigQuery has a streaming API that can be used to ingest near real-time data into BigQuery without going through an intermediary system like Pub/Sub, Dataflow, or Kafka. Data pushed to the API is immediately written to a write buffer before being committed to BigQuery storage, which accelerates ingest and keeps latency minimal. Even while the data sits in the transitory write buffer, it's still available for access: any query on a table loaded by the streaming API will scan the buffered records.

To use the streaming API, all you need is the BigQuery client. This is part of the Google Cloud SDK and supports a variety of languages. For this example we'll use the Python client to publish data to the BigQuery streaming API.

# Import the client library.
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Set table_id to the ID of the table to write to.
table_id = "[project name here].demo_data.sales_data"

I have a simple data generator application (not seen here) that generates sales transaction data. This randomly generated data is micro-batched 500 records at a time and published to the streaming API. The streaming API can accept one record at a time, but in practice that isn't optimal. Micro-batching records yields greater throughput, at the expense of latency. Google recommends 500 records per micro-batch as a good balance of throughput and latency for streaming into BigQuery. Every use case is different, but this is a good starting point for designing your application.

data_generator = POSDataGenerator()

# Look up the destination table once, rather than on every iteration.
table = client.get_table(table_id)  # API request.

for x in range(100000):  # how many micro-batches to send
    batch = []
    for y in range(500):  # records per micro-batch, per the guidance above
        record = data_generator.gen_row()
        batch.append(record)  # add to the batch
    errors = client.insert_rows_json(table, batch)  # API request.
    if errors == []:
        print("New rows have been added.")

Once the data is written to the buffer it's immediately ready for queries. A quick test of this can be done by issuing consecutive count(*) queries. You should see the count increment with each query execution. Any data that's in the buffer is merged with data already written to BigQuery storage to deliver the final result set.
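As a quick sketch of that check (the helper name here is mine, not from the original code), you can build the count query and run it repeatedly while the generator streams:

```python
# Hypothetical helper: build a row-count query for the streamed table.
def make_count_query(table_id: str) -> str:
    # Backticks keep hyphenated project IDs valid in standard SQL.
    return f"SELECT COUNT(*) AS row_count FROM `{table_id}`"

# Running it via the client (requires credentials) would look like:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   rows = client.query(make_count_query("[project name here].demo_data.sales_data")).result()
# Re-running while the generator is streaming should show the count climbing,
# because buffered rows are scanned alongside rows already in storage.
```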

Adding Materialized Views to the Mix

Now that we have data streaming into BigQuery, we can layer on a materialized view to accelerate analytics on the data. In many companies, data streamed into BigQuery feeds dashboards that refresh quickly, giving teams access to critical metrics and KPIs within moments of the data arriving. Materialized views let developers create pre-computed result sets for commonly accessed metrics, speeding up analytic processing by reducing the amount of IO and CPU required to render a result set. But how does a materialized view mesh with the incoming streaming data?

Materialized views are automatically refreshed as the base table changes, but no more frequently than once every 30 minutes. This default refresh interval inevitably leads to questions. What about that 30 minute gap? Won't my materialized view be stale? How does the streaming data get factored in? In this example scenario, the changes to the base table are all inserts, so BigQuery doesn't have to recompute anything already in the materialized view. Instead, it treats the new data not yet reflected in the view separately when executing a query: it runs the query against the newly arrived data, then merges those results with the data in the materialized view to render the final result set. A quick look at the execution plan for a query confirms this. You can see that both the base table (sales_data) and the materialized view (sales_data_rollup) are read to generate the result set.
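The post assumes the sales_data_rollup view already exists. As a hedged sketch of how such a rollup might be defined, here is a helper that builds the DDL; the grouped columns (store_id, sale_amount) are assumptions, since the base table's schema isn't shown:

```python
# Hypothetical helper: build CREATE MATERIALIZED VIEW DDL for a sales rollup.
# Column names store_id and sale_amount are assumed, not from the post.
def make_rollup_ddl(base_table: str, view_name: str) -> str:
    return (
        f"CREATE MATERIALIZED VIEW `{view_name}` AS\n"
        f"SELECT store_id,\n"
        f"       SUM(sale_amount) AS total_sales,\n"
        f"       COUNT(*) AS txn_count\n"
        f"FROM `{base_table}`\n"
        f"GROUP BY store_id"
    )

# Executed once via client.query(make_rollup_ddl(...)), this would create the
# view that subsequent queries against the aggregates can transparently use.
```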

This fills the gap left by the materialized view not updating as rapidly as the underlying base table, while still leveraging the materialized view for strong performance and efficiency.

If the default view refresh interval doesn’t fulfill your needs, you can always manually refresh the materialized view by calling this specialized procedure.

CALL BQ.REFRESH_MATERIALIZED_VIEW('project-id.my_dataset.my_mv_table');

Also, if the materialized view refresh interval is too frequent or too infrequent for your needs, you can adjust the view's settings to an interval more in line with your requirements.

ALTER MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
SET OPTIONS (refresh_interval_minutes = 60);

You can view these details in the BigQuery console by selecting the materialized view object from the project object tree.

However, be advised that refreshing materialized views consumes additional slots and scans additional data. Try to find a sweet spot that delivers the performance you need with the fewest maintenance operations. At some point, the resources spent on maintenance yield diminishing returns in reduced query runtimes for your applications.

For more information on the streaming API and materialized views, please visit the following links.