A New Hope — HyperLogLog.

HyperLogLog (HLL) is an algorithm to count distinct items in a multi-set. An interesting primer can be found in this blog post if you’re interested, but for the purposes of this article all you need to know is that while it counts distinct items very efficiently, due to it being a statistical approximation you trade some accuracy. This accuracy can be controlled by a precision parameter, which in BigQuery’s case goes up to 24, and by default is 15 (. The tradeoff of increasing accuracy is you need to store bigger sketches, which once again, increase the cost as the amount data that needs to be queried increases.

A simple HLL query in BigQuery would look as follows:

A simple query using HLL in BigQuery

There are two important functions in the preceding query:

INIT — this initialises the HLL using a provided column (with the output of it as a sketch that only represents the elements in it). The second parameter is precision where the higher the number, the better the approximation is

— this initialises the HLL using a provided column (with the output of it as a sketch that only represents the elements in it). The second parameter is precision where the higher the number, the better the approximation is EXTRACT — this function gets the actual count of distinct elements, more on this later

In our example for Publisher A, this query returns an indicated count of 95,242,513 in 12.0 seconds; a saving of 4.9 seconds while still being within 0.11% of the exact count.

HLL’s Key Strength — Composability.

So thats looks OK, but where’s the interesting part? Well, the INIT function generates a sketch and these compose. We love functional programming at Permutive — so we love things that compose and HLL can do just that via the MERGE and MERGE_PARTIAL functions.

Composability allows us to continue counting where we left off and merge the results of two counts to form a more complex count.

Imagine a table that stores pre-aggregated sketches, each row containing an HLL sketch¹ that represents all user ids that have visited a given page on Publisher A’s website:

The page_users table

There’ll be users that have visited both the homepage and a given article, but we still want to get a count of distinct user ids. To do that we’ll first need to merge the sketches:

Merging HLL sketches

This results in a new sketch that represents user ids from both pages without double counting users that have visited both pages. As we did before, we can use the EXTRACT function to get the final count.

BigQuery provides a convenience function called MERGE that combines the two steps of MERGE_PARTIAL followed by EXTRACT into a single step:

Using the single step EXTRACT function

Composability vastly enhances what we can do with HLL. What if we could take a step further and use it to maximise performance and minimise cost? We could pre-aggregate sketches based on some predefined attributes and compose them to build extremely complex queries.

Pre-aggregation and periodic sketch generation

BigQuery supports scheduled queries — the name says everything — it’s a query that runs periodically and appends its result into another table. In our case, these queries run daily and append their results into another table.

We can use cost saving features in BQ to generate them. As previously mentioned, BigQuery optimises queries which have multiple subqueries referencing the same tables — you are billed for these bytes only once even though the data is used in multiple places in the query.

The most obvious solution would be to generate a table that enumerates all possible combinations:

Enumerate tables

But there are a couple of issues:

The sketches become increasingly tiny sets, and we wrote about small sets and their issues previously

Another issue is cost — in order to do the equivalent of a SELECT DISTINCT(COUNT user_id) with no filtering, you have to select all the sketches in the table and merge them

with no filtering, you have to select all the sketches in the table and merge them Finally, merging potentially millions of sketches quickly becomes very slow

So we need a better solution, once that utilises HLL’s composability better:

Composable sketches

We call these tables “master tables” as they contain all the fields necessary but as separate sketches. We don’t query these tables directly, instead using views. We then use scheduled queries and generate sketches for a given day.

This is cost-efficient not only because we selected the data once in order to insert it into this pre-aggregated table, but also because BigQuery only bills for selected columns. This means that if we want to only select the sketches of some countries, we could do it by adding a WHERE country IS NOT NULL and only select country and country_hll . This way no other columns are not referenced and not billed.

Finally, the master table’s structure might allow us to write really small and efficient queries if BigQuery’s MERGE function accepted multiple arguments:

A better solution, that sadly isn’t possible

Unfortunately, this is currently not supported. We’ve made a feature request to Google which hasn’t so far been taken up.

Querying the data

So far we‘ve talked about two main things:

How to use HLL and its benefits (performance, composability)

Utilising its composability to pre-aggregate reusable sketches, which in turn can be used to create more complex queries

However, so far we’ve only done unions of sets. This isn’t enough for an insightful report, so we want to do some filtering — for example, let’s say we have another table which groups users by country:

Grouped by country

Using a traditional (non-HLL) approach, we could do something like:

By country using a traditional approach

The trouble is we’re storing the sketches separately so we can’t do that anymore — we need to do an intersection (we’ll talk more about storing sketches later on in the post, for now assume the sketches are stored separately for ‘reasons’). Unfortunately, HLL doesn’t natively support intersections, but luckily, we have a solution.

Inclusion-exclusion principle.

To quote Wikipedia, the inclusion-exclusion principle is a:

counting technique which generalises the familiar method of obtaining the number of elements in the union of two finite sets

While the main use case of this principle is the union, the equation can be rearranged to get the intersection instead. In our case, it can be expressed as:

|A ∩ B| = |A| + |B| - |A ∪ B|

Inn other words, the size of the intersection of set A and set B is equal to the sum of their individual sizes minus the size of their union. Fortunately, HLL does have all of these operators built-in, but what does it look like?

Intersections using HLL

It doesn’t look great. The MERGE function is limited to a single parameter in BigQuery— which means the sketches have be on the same column. Ideally we could do something like the following, but it’s not currently possible in BigQuery.

A second parameter being passed to MERGE

Going back to the result of the intersection — we end up with the final count and not a sketch. This is important as it means we lose HLL’s comparability, hence there must always be a final step in the query.