Why count(distinct()) is king for accuracy — but definitely not for efficiency.

Most readers will understand what count(distinct()) does; what many people don’t understand (or think about) is HOW it does it.

Query plan for a count distinct operation

Counting distinct entities in a huge dataset is a genuinely hard problem. It is slightly easier if the data is sorted, but re-sorting the data on every insert becomes expensive, depending on the underlying platform.

Generally, a count distinct performs a distinct sort and then counts the items in each “bucket” of grouped values.
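
As a sketch of what that means in practice (using a hypothetical events table with a user_id column), the two queries below are logically equivalent; the second makes the grouping work explicit:

```sql
-- What you write:
SELECT COUNT(DISTINCT user_id) FROM events;

-- Logically equivalent form showing the work involved:
-- de-duplicate the values (a sort or hash aggregation), then count the groups.
SELECT COUNT(*)
FROM (
  SELECT user_id
  FROM events
  GROUP BY user_id
) deduplicated;
```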

If you need a 100% accurate count, this is unfortunately pretty much the only way to get it on a randomised dataset. There are tricks you can play with how the data is structured in the platform to make things more efficient (partitioning, and clustering or bucketing, for example), but the engine essentially still has to perform the same operations.
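
For example, a hedged sketch of that kind of structuring in BigQuery (table and column names are hypothetical):

```sql
-- BigQuery: partition by day and cluster on the column being counted,
-- so a count(distinct) over a date range prunes partitions and scans
-- fewer, better-organised blocks -- but it still de-duplicates the values.
CREATE TABLE analytics.events_optimised
PARTITION BY DATE(event_ts)
CLUSTER BY user_id
AS
SELECT * FROM analytics.events_raw;
```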

The Tradeoff — Accuracy vs Speed

Often there are use cases that don’t require 100% accuracy, and ours is one of them: the audience size is simply an estimate. This gives us a few extra options.

Many sampling methods already exist, but their accuracy was too low for our requirements; we needed something with the right balance of accuracy and speed.
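
A sketch of why naive sampling falls short for distinct counts (using Presto's TABLESAMPLE and hypothetical table and column names):

```sql
-- Naive approach: count distinct on a 1% Bernoulli row sample and scale up.
-- This is fast but biased: a user with many events will almost certainly
-- appear in the sample anyway, so distinct counts do not grow linearly
-- with the sampled fraction and multiplying by 100 badly over-counts.
SELECT COUNT(DISTINCT user_id) * 100 AS estimated_users
FROM events TABLESAMPLE BERNOULLI (1);
```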

The Solution: VerdictDB

Early in the process we contacted VerdictDB, who had released an early beta of their open-source product that purported to do exactly what we required. VerdictDB uses probability and statistical theory to produce estimates of cardinality on large datasets. But most importantly for us, it allows these estimates to be created over the entire table.
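
In practice, the appeal is that queries stay as plain SQL. A hedged sketch of a query issued through a VerdictDB connection rather than directly against the warehouse (table and column names are hypothetical):

```sql
-- Issued through a VerdictDB connection: VerdictDB intercepts the SQL,
-- rewrites it to run against its pre-built sample tables ("scrambles"),
-- and returns an estimate instead of scanning every row.
SELECT COUNT(DISTINCT user_id)
FROM analytics.events;
```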

We set up a pipeline using Airflow to orchestrate the data preparation and ensure that everything was ready.

All the errors!

VerdictDB works by creating “scrambles” of the table. This is a pre-processing stage that requires a significant amount of processing power, but it only needs to be done once, and then again only when new data is added.
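
The scramble is built with a one-off statement along these lines (basic syntax from VerdictDB's documentation; schema and table names are hypothetical):

```sql
-- One-off pre-processing step: build the scramble, a probabilistically
-- partitioned copy of the table that later queries are answered from.
CREATE SCRAMBLE analytics.events_scramble
FROM analytics.events;
```

Presumably this is also the step an orchestrator such as Airflow would re-run as new data lands.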

We performed this on both Presto and BigQuery. BigQuery came out cheaper for our particular use case, but there are a number of reasons for that (not applicable to this article).

Final results

Here is a table listing the final results of each method. BigQuery came out on top as the backing data warehouse for a number of reasons; however, the focus here is on what VerdictDB can provide in terms of simplicity and speed versus traditional methods such as HyperLogLog.
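
For reference, the traditional HyperLogLog route is built into both engines we tested; a sketch with hypothetical table and column names:

```sql
-- BigQuery: HyperLogLog++-backed estimate.
SELECT APPROX_COUNT_DISTINCT(user_id) FROM analytics.events;

-- Presto: equivalent HLL-based function (default standard error of 2.3%).
SELECT approx_distinct(user_id) FROM events;
```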