The concept of incremental processing can have a major impact on the design of data analytics pipelines. Processing large amounts of data in increments introduces resource efficiencies, faster processing time and inherently reduced processing cost. However, not all analytical functions have incremental properties. A common example is the count distinct function. In this post, I would like to talk about incremental count distinct processing using BigQuery’s HyperLogLog++ functions and how they provide fast, scalable, incremental processing properties.

I’ve always been fascinated with how algebraic concepts can greatly impact how we process data at scale. For example, commutative and associative properties have always been fun to study. In short, these properties allow us to perform computation in parallel and in increments. For instance, SUM() is a commutative and associative function where I can compute 1 + 2 + 3 separately in different locations (threads, VMs, etc) and the result is always 6. For example, I can compute (2+3) in one VM and add the result to (1) in another VM and the result is the same as computing (1+2) in one VM and adding the result to 3 in a separate VM.

Besides parallel processing, commutative and associative functions allow us to apply higher level functions such as SUM() and COUNT() incrementally. Let me demonstrate the power of incremental processing in the example below. I’ll also provide an example where incremental processing is not an option and how BigQuery’s HyperLogLog++ can help. For readers to be able to follow the example, I’m using a BigQuery public dataset: bigquery-samples.wikipedia_benchmark.Wiki100B:

Let’s assume we’re asked to compute the total daily views on Wikimedia’s dataset using the query below.

SELECT year,month,day, SUM(views) day_total_views FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY 1,2,3 ORDER BY 1,2,3

Result:

Execution time: ~37 sec

However, running that query takes about 37 seconds to complete and we would like it to be faster. We want our report to be interactive, with a response time in single digit seconds. To speed up our query, we’ll create a pre-aggregated table similar to the following example and write our query on top of the pre-aggregated table:

CREATE OR REPLACE TABLE `your_project.demo.wiki_daily_agg` AS SELECT year,month,day, SUM(views) daily_total_views FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY 1,2,3 SELECT year, month, day, daily_total_views FROM `your_project.demo.wiki_daily_agg` ORDER BY 1,2,3

Result:

Execution time: ~1 sec

This approach reduces our query time to about 1 second (takes about 30 sec to create the pre-aggregate table which can be automated using BigQuery scheduled queries or Cloud Composer)

Now, what if instead of the daily total views, we want to see the monthly total views? Interestingly, the same pre-aggregated table can answer different aggregation questions. For example, to generate total monthly views we can simply run the following query on our pre-aggregated table:

SELECT year, month, SUM(daily_total_views) month_total_count FROM `your_project.demo.wiki_daily_agg`GROUP BY 1,2 ORDER BY 1,2

Result:

Execution time: ~1 sec

What allows us to compute the monthly number of views from the pre-aggregated table is the incremental processing power of SUM() thanks to associative and commutative properties. As you can imagine, this is a powerful concept that can be applied to a number of analytical queries. But, not all functions provide incremental processing characteristics. For example, what if we want to find the number of daily DISTINCT titles? In order to accomplish that we can tweak our pre-aggregated table similar to this:

CREATE OR REPLACE TABLE `your_project.demo.wiki_daily_agg_distinct` AS SELECT year,month,day, SUM(views) daily_total_views, COUNT(DISTINCT title) unique_titles FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY 1,2,3 SELECT year, month, day, daily_total_views, unique_titles FROM `your_project.demo.wiki_daily_agg_distinct`ORDER BY 1,2,3

Result:

Execution time: ~10 min

While this query generates the daily DISTINCT titles, it introduces two challenges:

By adding the DISTINCT clause, generating pre-aggregated table takes considerably longer (~10 min), because the DISTINCT function requires higher memory resources. We no longer can compute monthly unique titles using the same pre-aggregated table. In other words, the following query will not work (the query will run but it will show incorrect results):

SELECT year,month, SUM(unique_titles) FROM `your_project.demo.wiki_daily_agg_distinct` GROUP BY 1,2

Unlike SUM() and COUNT(), the DISTINCT() function does not have incremental properties, thus we cannot incrementally SUM the result of our daily DISTINCTS to get the monthly numbers. To calculate the monthly DISTINCT titles, we have to run our query on the entire dataset. This is inefficient, both in terms of performance and cost. Specifically, running DISTINCT on the entire dataset requires memory proportional to the cardinality of the dataset. When dealing with a large dataset, this could result in higher memory usage. For BigQuery this could mean a higher number of slots required to compute the results. In addition to a higher number of slots, we’re required to scan the entire table, resulting in higher query cost.

This is where BigQuery’s HyperLogLog++ functions enter the picture. Without getting into the depth of how HLL functions work, for the sake of this post, all we need to know is that HLL is a probabilistic count distinct algorithm that uses less memory, but provides only an approximate answer. Because of a smaller memory footprint, the computation of DISTINCT values is much faster. Another great advantage of the HLL algorithm is its incremental properties, allowing us to incrementally add/merge two individual values, as in the SUM() example above.

Let’s apply HLL to the scenario above:

First we’ll create a similar pre-aggregated table, but this time we’ll use BigQuery’s HLL_COUNT.INIT() function. This function creates an output generally referred to as HLL Sketch, a binary output that keeps track of our DISTINCT values.

CREATE OR REPLACE TABLE `your_project.demo.wiki_daily_agg_hll` AS SELECT year,month,day, SUM(views) daily_total_views, HLL_COUNT.INIT(title) unique_titles_sketch FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY 1,2,3

Execution time: ~ 2 min

Now, to get the approximate values of DISTINCT titles in a given day, we’ll use HLL_COUNT.EXTRACT() function to get a scalar value:

SELECT year,month,day,daily_total_views, HLL_COUNT.EXTRACT(unique_titles_sketch) unique_titles_apporx FROM `your_project.demo.wiki_daily_agg_hll` ORDER BY 1,2,3

Execution time: ~ 2 sec

And what if we want to get the count of monthly DISTINCT titles? We can simply add the daily HLL Sketches using HLL_COUNT.MERGE() function:

SELECT year,month, HLL_COUNT.MERGE(unique_titles_sketch) unique_titles_approx FROM `your_project.demo.wiki_daily_agg_hll` GROUP BY 1,2 ORDER BY 1,2

Execution time: ~ 2 sec

Using BigQuery’s HLL functions we were able to achieve two things:

Better query performance in terms of query execution of the pre-aggregated table (~10 min down to ~2 min) Incremental processing feature of HLL functions to avoid reprocessing of data, reducing the cost of the query

One thing to keep in mind is that HLL is a probabilistic function. In other words the result of our DISTINCT views using HLL functions above is not 100% accurate. By default, the accuracy of BigQuery HLL functions is ±0.86% at the 99% confidence interval (see the table here). If you’re looking for a higher level of accuracy, you can decrease the error rate by passing a higher precision number to HLL_COUNT.INIT() function. For example: HLL_COUNT.INIT(title, 20) will have ±0.20% error at 99% confidence interval.

As demonstrated above, HyperLogLog functions are more performant and cost effective way to compute distinct values in a given dataset. The performance and cost benefits are as the result of smaller memory footprint and also incremental features of HLL functions. So if your use-case involves finding distinct values and can tolerate a small degree of inaccuracy, give