Clustering with a Key-Value Store

Let’s say you have a dataset you’d like to cluster. Let’s say you don’t want to write more than 5 lines of code. Let’s say that your only tool is a key-value store. (Why might you be in this position? Perhaps your dataset is really really (really) big and only simple things will scale. Maybe it’s in fact INFINITE and you’re clustering a stream. Maybe MapReduce is just a really big hammer. 🔨👷 Why you’d only want to write 5 lines of code is left as an exercise to the reader.)

At any rate, you would like to make clusters out of your data, but you only get to look at each item once in isolation. After looking at it you have to decide what cluster it should go to, at that moment, without looking at any other information, or any other items in your dataset. You only get one shot, do not throw it away! How can we accomplish this?

Ideally we want a magic function H, where H(X) = H(Y) if and only if X and Y should be in the same cluster. We don’t have such a function! (sorry) But we do have something pretty close.

Let’s say you have a function H(X) that computes a hash of your item, and your hash has the following property:

Pr[H(X) = H(Y)] = Sim(X, Y)

for some similarity measure Sim. H is called a “locality sensitive hash” for Sim. This is pretty close to what we want! Things that are similar to each other will have a high probability of sharing a key, and things that are dissimilar to each other will have a low probability of sharing a key. Later we’ll talk about how to make this behave a bit more like our magic function, but for now, let’s talk about how to build this one.

```cpp
// C++ code implementing each algorithm will be in a block like this one
// at the end of each section.
```

Building a Simple H(X): MinHash

Suppose you have two sets, X and Y, and you would like to know how similar they are. First you might ask, how big is their intersection?

|X ∩ Y|

That’s nice, but |X ∩ Y| isn’t comparable across different sizes of sets, so let’s normalize it by the size of their union:

J(X, Y) = |X ∩ Y| / |X ∪ Y|

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.
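As a concrete sketch (types and function name are my own, for illustration), the Jaccard Index for sets of strings is just a few lines of standard C++:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Jaccard Index: |X ∩ Y| / |X ∪ Y|.
double Jaccard(const std::set<std::string>& X, const std::set<std::string>& Y) {
  std::set<std::string> inter, uni;
  std::set_intersection(X.begin(), X.end(), Y.begin(), Y.end(),
                        std::inserter(inter, inter.begin()));
  std::set_union(X.begin(), X.end(), Y.begin(), Y.end(),
                 std::inserter(uni, uni.begin()));
  if (uni.empty()) return 1.0;  // convention: two empty sets are identical
  return static_cast<double>(inter.size()) / uni.size();
}
```

For example, {a, b, c} and {b, c, d} overlap in 2 of 4 distinct elements, so their Jaccard Index is 0.5.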

Suppose you have a uniform pseudo-random hash function Hash that maps the elements of your set to a large integer range (64-bit integers, say). For simplicity, assume that the output of Hash is unique for each input. I’ll use Hash(X) to denote the set of hashes produced by applying Hash to each element of X, i.e. Hash(X) = {Hash(x) : x ∈ X}.

Consider min(Hash(X)). When you insert and delete elements of X, how often does min(Hash(X)) change?

If you delete x from X, then min(Hash(X)) will only change if Hash(x) = min(Hash(X)). Since any element has an equal chance of having the minimum hash value, the probability of this is 1/|X|.

If you insert x into X, then min(Hash(X)) will only change if Hash(x) < min(Hash(X)). Again, since any element has an equal chance of having the minimum hash value, the probability of this is 1/(|X| + 1).

For our purposes, this means that min(Hash(X)) is useful as a stable description of X.

What is the probability that min(Hash(X)) = min(Hash(Y))?

If an element produces the minimum hash in both sets on their own, it also produces the minimum hash in their union.

min(Hash(X)) = min(Hash(Y)) if and only if the element that produces min(Hash(X ∪ Y)) is in X ∩ Y. Let z be the member of X ∪ Y that produces the minimum hash value. The probability that X and Y share the minimum hash is therefore the probability that z is in both X and Y. Since any element of X ∪ Y has an equal chance of having the minimum hash value, this becomes

Pr[min(Hash(X)) = min(Hash(Y))] = |X ∩ Y| / |X ∪ Y| = J(X, Y)

Look familiar? Presto, we now have a Locality Sensitive Hash for the Jaccard Index.

```cpp
template <typename T>
unsigned long long MinHash(const T& X) {
  unsigned long long min_hash = ULLONG_MAX;
  for (const auto& x : X) {
    min_hash = min(min_hash, Hash(x));
  }
  return min_hash;
}
```
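To sanity-check the math, here’s a sketch that estimates Pr[min(Hash(X)) = min(Hash(Y))] empirically by averaging over many independently seeded hash functions. Mix and Hash are illustrative stand-ins (a splitmix64-style mixer), not a prescribed choice:

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <string>

// Splitmix64-style 64-bit mixer, used here only as a convenient way
// to manufacture a family of seeded hash functions.
uint64_t Mix(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

uint64_t Hash(const std::string& x, uint64_t seed) {
  uint64_t h = Mix(seed);
  for (unsigned char c : x) h = Mix(h ^ c);
  return h;
}

uint64_t MinHash(const std::set<std::string>& X, uint64_t seed) {
  uint64_t min_hash = UINT64_MAX;
  for (const auto& x : X) min_hash = std::min(min_hash, Hash(x, seed));
  return min_hash;
}

// Fraction of seeds for which the two MinHashes collide.
double CollisionRate(const std::set<std::string>& X,
                     const std::set<std::string>& Y, int trials) {
  int hits = 0;
  for (int s = 0; s < trials; ++s)
    if (MinHash(X, s) == MinHash(Y, s)) ++hits;
  return static_cast<double>(hits) / trials;
}
```

For two sets with Jaccard Index 1/3 (say {a, b, c, d} and {c, d, e, f}), the measured collision rate should hover near 1/3.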

Tuning Precision and Recall with Combinatorics

Now that we have a locality sensitive hash, we can use combinatorics to build something that looks a bit more like our magic function. We can concatenate (or sum) hashes to perform an “AND” operation. Let H'(X) = (H_1(X), H_2(X), …, H_A(X)), i.e. the concatenation of A independent hashes. The probability that H'(X) = H'(Y) is then Sim(X, Y)^A.

We can then output multiple hashes to perform an “OR” operation. If we output O independent hashes, then the probability that at least one of those hashes is the same for two items is 1 - (1 - Sim(X, Y))^O.

Using these two tools, we can apply a “sigmoid” function to our similarities: if we output O independent copies of A concatenated hashes, the probability that two items will share at least one key is

1 - (1 - Sim(X, Y)^A)^O

(You can think of this as the probability that two items will “meet each other” at least once in the course of your computation.)

Now we can be pretty sure that things that are similar to each other will share at least one key, and things that aren’t won’t. We can increase the sharpness of the sigmoid as much as we want by spending more storage and CPU to increase A and O.
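For intuition on how A and O shape that probability, the curve is a one-liner (the particular values of A and O below are arbitrary illustrative choices):

```cpp
#include <cmath>

// Probability that two items with similarity s share at least one key,
// when we emit O keys, each built from A concatenated MinHashes:
// 1 - (1 - s^A)^O.
double MatchProbability(double s, int A, int O) {
  return 1.0 - std::pow(1.0 - std::pow(s, A), O);
}
```

With A = 4 and O = 20, for example, a pair with similarity 0.8 shares a key with probability ≈ 0.9999, while a pair with similarity 0.2 does so with probability ≈ 0.03: a sharp sigmoid.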

Great! Clustering with a key-value store! Now let’s talk about ways to improve H(X).

```cpp
template <typename T>
unsigned long long MinHash(const T& X, int s) {
  unsigned long long min_hash = ULLONG_MAX;
  for (const auto& x : X) {
    // Note that the hash function must now accept a seed.
    min_hash = min(min_hash, Hash(x, s));
  }
  return min_hash;
}

template <typename T>
void EmitWithKeys(const T& X, int ands, int ors) {
  for (int o = 0; o < ors; o++) {
    unsigned long long key = 0;
    for (int a = 0; a < ands; a++) {
      // We assume that a large int is enough keyspace that we
      // can get away with adding instead of concatenating.
      key += MinHash(X, a + o * ands);
    }
    Emit(key, X);
  }
}
```
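To close the loop, here’s a self-contained sketch of the key-value side, with an in-memory map standing in for the store. Each item is emitted under O keys, and any two items that share a key meet in the same bucket; the buckets are your (overlapping) cluster candidates. Mix, Hash, and the Store type are illustrative stand-ins of my own, not the text’s prescribed choices:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Splitmix64-style mixer, standing in for a real seeded hash function.
uint64_t Mix(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

uint64_t Hash(const std::string& x, uint64_t seed) {
  uint64_t h = Mix(seed);
  for (unsigned char c : x) h = Mix(h ^ c);
  return h;
}

uint64_t MinHash(const std::set<std::string>& X, int s) {
  uint64_t min_hash = UINT64_MAX;
  for (const auto& x : X) min_hash = std::min(min_hash, Hash(x, s));
  return min_hash;
}

// "Key-value store": key -> all items emitted under that key.
using Store = std::map<uint64_t, std::vector<std::set<std::string>>>;

void EmitWithKeys(const std::set<std::string>& X, int ands, int ors,
                  Store* store) {
  for (int o = 0; o < ors; o++) {
    uint64_t key = 0;
    for (int a = 0; a < ands; a++) {
      // Summing instead of concatenating, as in the block above.
      key += MinHash(X, a + o * ands);
    }
    (*store)[key].push_back(X);
  }
}
```

Running EmitWithKeys for every item in your stream, then scanning the buckets, gives you the one-pass clustering the introduction promised: two disjoint items will essentially never share a 64-bit key, while two similar items almost surely will.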

Integer Weights

This algorithm works on a set, but the things we’d like to cluster usually aren’t sets. For instance, terms in a document (even long n-grams) can occur multiple times, and ideally we don’t want to just discard the counts of how often they occur. How can we fix this? Easy! Just hash everything multiple times. If an element occurs n times in your set, hash it n independent times, insert each of those hashes, and then take the minimum of this expanded set as your hash, just like before. We expand our hash function to accept both an item and an integer as its argument, Hash(x, i), and if item x occurs n times, we insert Hash(x, 0), Hash(x, 1), …, Hash(x, n - 1).

Now think about what this means for the intersection and the union of these sets. If the count of x in object X is m, and the count of x in object Y is n, then the intersection of the two sets of hashes has min(m, n) hashes for x, and the union has max(m, n). You can imagine stacking these hashes on top of each other to form a histogram, and the intersection and union operations then translate into taking a min and a max across the values of each item. So now we’re working with vectors instead of sets, and if you interpret the Jaccard Index on this expanded set in terms of a weighted vector, it turns into the “Weighted Jaccard Index”:

J_W(X, Y) = Σ_x min(X_x, Y_x) / Σ_x max(X_x, Y_x)

```cpp
template <typename T>
unsigned long long IntegerWeightMinHash(const map<T, int>& X, int s) {
  unsigned long long min_hash = ULLONG_MAX;
  for (const auto& x : X) {
    for (int i = 0; i < x.second; ++i) {
      // The hash function now accepts both the repeat index and the seed.
      min_hash = min(min_hash, Hash(x.first, i, s));
    }
  }
  return min_hash;
}
```

If you’d like to have the Weighted Jaccard Index as your match probability but with real weights instead of integer weights, there are two good algorithms for doing so, one for dense data, and the other for sparse data (like the original MinHash). They are complicated enough that I’m not going to talk about them more here, but not so complicated that you’ll have trouble implementing them from the algorithm definitions in the papers.

Real Weights and Probability Distributions