Count-Min Sketches: Approximately How Many Times Did People Watch Adele’s “Hello”?

Recall that we want a view count for each video in our catalog. Essentially, this is an event-occurrence problem; we have a stream of events (views), and we want to count how many times each event occurred. A popular solution for the event-occurrence problem is the Count-Min Sketch.

The internal structure of a Count-Min Sketch is a table, similar to that of a Hash Table. However, while Hash Tables use a single hash function, Count-Min Sketches use multiple hash functions, one for each column. Every cell in the Count-Min Sketch starts at 0. When an event occurs, the event’s id is hashed by each column’s hash function. Each hash function outputs a row index, and the counter at each resulting row-column combination is incremented. To query an event’s count, we look up the counter that the event maps to in each column and take the minimum of those counters.
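The structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the class name, the method names, and the trick of salting SHA-256 with the column index to simulate one hash function per column are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch with one hash function per column."""

    def __init__(self, num_rows, num_cols):
        self.num_rows = num_rows
        self.num_cols = num_cols
        # table[row][col]; every counter starts at 0.
        self.table = [[0] * num_cols for _ in range(num_rows)]

    def _row(self, event_id, col):
        # Assumption: salting the id with the column index stands in for
        # having a separate hash function per column.
        digest = hashlib.sha256(f"{col}:{event_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_rows

    def add(self, event_id):
        # Increment one counter in every column.
        for col in range(self.num_cols):
            self.table[self._row(event_id, col)][col] += 1

    def count(self, event_id):
        # The estimate is the smallest of the event's counters.
        return min(
            self.table[self._row(event_id, col)][col]
            for col in range(self.num_cols)
        )


sketch = CountMinSketch(num_rows=4, num_cols=3)
for video_id in ["history-of-japan", "hello", "history-of-japan"]:
    sketch.add(video_id)
```

Because every view increments exactly one counter per column, `count` can overestimate a video's views but never underestimate them.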

Let’s take an example. Suppose we have a 3-column, 4-row Count-Min Sketch. The very first user on our platform decides to watch “History of Japan”. When that video’s id is hashed by the three hash functions, we get rows 1, 2, and 1, indicating that we should increment the counters at the (row, column) cells (1,1), (2,2), and (1,3).

Now suppose that another user watches “Hello” by Adele, which hashes to 1, 1, and 4. Following that, yet another user watches “History of Japan”.

At this point, we decide to examine how many times users watched “Hello”. We already saw that “Hello” hashes to 1, 1, and 4, so we should examine the counts at the cells (1,1), (1,2), and (4,3). These counts are 3, 1, and 1 respectively.

We know that “Hello” was watched only once, and yet one of its counters in the Count-Min Sketch says “Hello” was watched 3 times! This is because “Hello” and “History of Japan” collide on the first hash function, so every view of “History of Japan”, which was watched twice, also counts towards “Hello”. Essentially, the Count-Min Sketch is doubling up, allowing multiple events to share the same counter in order to save space. The more cells in the table, the more counters we store in memory, so the less doubling up will occur, and the more accurate our counts will be.

Even if we didn’t know the view history, we could easily conclude that “Hello” was watched no more than once. Every view of “Hello” increments all three of its counters, and counters never decrease, so each counter is at least the true view count. If “Hello” had really been watched 3 times, then all three of its counters would be at least 3. Since the smallest of the three is 1, we can conclude that “Hello” was watched no more than 1 time.
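The walkthrough above can be replayed in code. The sketch below hardcodes the row values given in the example instead of computing real hashes; the helper names `watch` and `counters` are made up for illustration.

```python
# Row picked by each of the three column hashes, as given in the example
# (1-indexed, matching the text).
HASHES = {
    "History of Japan": (1, 2, 1),
    "Hello": (1, 1, 4),
}

NUM_ROWS, NUM_COLS = 4, 3
table = [[0] * NUM_COLS for _ in range(NUM_ROWS)]


def watch(video):
    # Increment the video's counter in every column.
    for col, row in enumerate(HASHES[video]):
        table[row - 1][col] += 1


def counters(video):
    # The video's counter in each column, in column order.
    return [table[row - 1][col] for col, row in enumerate(HASHES[video])]


for video in ["History of Japan", "Hello", "History of Japan"]:
    watch(video)

hello_counters = counters("Hello")
hello_estimate = min(hello_counters)
print(hello_counters)  # [3, 1, 1]
print(hello_estimate)  # 1
```

The first counter is inflated by the two views of “History of Japan”, but taking the minimum recovers the correct count of 1.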

Now let’s look at an example where the Count-Min Sketch is wrong.

Even though “Hello” was never watched in this sequence, each of the videos collided with “Hello” in at least one hash function. As a result, when we query the view count for Adele’s “Hello”, we are incorrectly told that it has been watched once. As mentioned before, we can reduce both the probability and magnitude of errors by using a larger table. A small error isn’t the end of the world, particularly if our goal is to understand general trends rather than to make precise measurements.
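This kind of false count is easy to reproduce. In the sketch below, only the hash values for “Hello” come from the earlier example; “Video A”, “Video B”, and “Video C” and their hash values are hypothetical, chosen so that each one collides with “Hello” in a different column.

```python
# Row picked by each of the three column hashes (1-indexed, as in the text).
# Only the values for "Hello" come from the earlier example; the other
# three videos and their hash values are hypothetical.
HASHES = {
    "Hello": (1, 1, 4),
    "Video A": (1, 3, 2),  # shares row 1 with "Hello" in column 1
    "Video B": (2, 1, 3),  # shares row 1 with "Hello" in column 2
    "Video C": (3, 2, 4),  # shares row 4 with "Hello" in column 3
}

NUM_ROWS, NUM_COLS = 4, 3
table = [[0] * NUM_COLS for _ in range(NUM_ROWS)]


def watch(video):
    # Increment the video's counter in every column.
    for col, row in enumerate(HASHES[video]):
        table[row - 1][col] += 1


def estimate(video):
    # Smallest of the video's counters across all columns.
    return min(table[row - 1][col] for col, row in enumerate(HASHES[video]))


# "Hello" is never watched in this sequence.
for video in ["Video A", "Video B", "Video C"]:
    watch(video)

hello_estimate = estimate("Hello")
print(hello_estimate)  # 1, even though "Hello" was never watched
```

Taking the minimum only helps when at least one of an event's counters is free of collisions. Here all three are shared, so the estimate is off by one.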