Have you heard about the HyperLogLog data structure? It sounds something out of science fiction. Something Lt. Cmdr Data certainly would use in conjunction with the deflector dish to communicate with 2-dimensional aliens.

It’s actually quite simple. A HyperLogLog is a substitute for counting things. Instead of storing everything we want to count (like every IP address that visits our site) we can use a HyperLogLog to closely approximate our counting while storing significantly less data.

Don’t Worry, I’m Only Using your SSN For Educational Purposes!

To understand how a HyperLogLog works, let’s solve a problem that doesn’t involve computers, cause computers are annoying. Let’s say we have a whiteboard and we have to count how many individuals visit our home. If our neighbor visits 5 times, they count as one unique visitor. If a friend comes over once, that’s another unique visitor. So far, we have 2 unique visitors.

To use a HyperLogLog to do this, first we need a way to map people to numbers. Well here’s a stupid one – let’s just track everyone by their unhashed social security number. (there’s reasons that may not be optimal, but this is educational).

Let’s further assume we can write two numbers on our whiteboard. These two numbers will represent our HyperLogLog. As someone enters our home, regardless if it’s their first time or not, they tell us their Social Security Number. We then go to count them in our HyperLogLog. If it begins with 0-4 they get counted on the left number. If it begins with 5-9 then they get assigned the right number. We’ll call these numbers the buckets.

Our whiteboard starts out looking like this:

0-4 5-9 0 0

Now pay attention. Here’s the trick – and what makes a HyperLogLog magical. When we take someones social security number and check its bucket, we need to then decide whether to update that person’s bucket or leave the original number there.

The trick is that each bucket here is a high-score in a really stupid competition. Sort of like how many cheerios we can fit in our mouth at once. In this case, it’s a leaderboard for the thrilling game how many 0s are at the end of our visitor’s social security number. If our visitor has a 0 at the end of their social security number, well thats pretty unique. They deserve credit! If nobody has come along yet with more than one 0, then that visitor is currently winning their bucket(bracket?). In other words, if the social security number is 4XXXXXX50, then update the first bucket with a 1 to reflect the high score:

0-4 5-9 1 0

Odds are the 0-4 bucket will keep that high score through many visitors until some even more special person comes along with a social-security number with two zeros at the end! Well isn’t that person special. If their social security number is 3XXXXX500 suddenly we have a new winner! We update the bucket’s high score accordingly:

0-4 5-9 2 0

Wow. We must have seen quite a few people to have found someone with two 0s at the end of their social security number. In fact, if we do the math, we must have seen roughly 10^2 people for one person to show up with two 00s at the end of their social security number. In other words 10^{high score} people for that bucket!

Wait, what did we just do there? Did we use this data structure to provide an approximate count of something? Yes – in fact that’s exactly what we did! And that’s the magic! (see it’s not that hard). A HyperLogLog is a stupid leaderboard that we can reverse-engineer the likelihood of acheiving said score. We can use the high score to approximate how many players have played the game.

We have only been counting within the little competition happening in our 0-4 bucket. If we have a couple more visitors with SSNs that begin with 5-9, they’ll play out our stupid little game in their own 5-9 league. Ultimately, after all the competition is over (ie when we want to count) lets say we end up with this on our whiteboard:

0-4 5-9 2 1

Having two high scores (for two silly competition leagues) lets us get more finely grained than just 10, 100, 1000 etc. In this case, our approximate count turns out to be 10^{high score bucket 1} + 10^{high score bucket 2} or 10^2 + 10^1 or 110!

Adding Back In The Computers

In our example, we’ve been using decimal numbers – as that’s what human beings know and love. Computers, however, live-and-breathe binary. So the main difference between what we’re doing is that while in both cases we count 0s, the counting approximation for a computer is actually done by powers of two. Its much easier for us to come across a binary number that ends in a 0. It has a probability of 1 in 2. So our leaderboard gets higher a bit faster, also resulting is much finer grain approximations.

Another clear difference between this and real applications is the number of buckets. Many applications increase the accuracy of the approximation by increasing the number of buckets pretty significantly, creating a more sensitive approximation of the count.

If you’re counting something like IP addresses, or other identifier, you’ll want to make sure you take care to hash that number into something with a reasonably random distribution before using a HyperLogLog. Social Security Numbers probably work decently for people, but in a real implementation calculating a well distributed hash will only help the accuracy of the data structure.

If you want to play with an example, here’s an educational example I’ve cooked up in C. Enjoy! For further reading read the HyperLogLog Paper and also learn about Redis’s Implementation. You can also see a neat application tracking spammers in this article of ours.

And of course, don’t hesitate to get in touch to discuss your analytics needs!