Hashing

To understand Bloom filters, you first have to understand hashing. To understand hashing, you don’t have to understand maths, which is good, because I’m a middling mathematician at best. I sit at the median, maybe.

A hash is like a fingerprint for data. A hash function takes your data — which can be any length — as an input, and gives you back an identifier of a (usually) smaller, (usually) fixed length, which you can use to index, compare, or identify the data.

This is a hash of my full name:

b3f9b3a3504ccb29c4183730a42c8d56

Most hash functions are one-way operations, which is to say you can figure out an identifier from the data, but you typically can’t do the reverse. You couldn’t look at that hash above, for example, and know that my full name is James Graham Michael Swanson Talbot. Reversible hash functions do exist, but they tend to be pretty trivial, and not particularly useful.

There are a few properties which are desirable in a useful hash function. The most important one is that the same input must always hash to the same output. Really, that’s the defining feature. If the same input gave you a different output each time, it wouldn’t be very useful as an identifier.

Ideally, the output values should be distributed uniformly — that is, each possible output should be equally likely. For a lot of use cases it is also important that the outputs be distributed randomly; similar inputs should not give similar outputs. For some use cases — usually in security — it is also important to minimise collisions, meaning, as far as possible, each input should give a unique output.

Finally, in most cases (though not always), you want it to be fast. No one wants to wait for anything these days, and that includes programmers.

The combination of hash function properties that you care about will vary wildly depending on the task for which you are using them. And hash functions are used in a lot of tasks. Shazam uses hashes to figure out what song you’re listening to. If you need to find duplicates in a set of things, you might use a hash. Databases? Giant hashes. This post on Medium was delivered to you securely — you can probably see a green padlock on the browser address bar that confirms this. Hashes are important in that process too.

You may even have heard of some of these algorithms. SHA-1 — Secure Hash Algorithm — was devised by the NSA, and for a long time was used to secure lots of Internet communication. MD5 is another popular one, often used to prove that the file you’re downloading is the one you think it is, and not some malware written by the NSA to infect your computer and spy on you. (In fact, MD5 was the hash function I used to fingerprint my name further up the page.)

Some hash functions have exotic and nonsensical names — this is computing after all — and so you also get things like CityHash, MurmurHash, and SpookyHash.

There are literally thousands of named hashing functions. Some are secure, but comparatively slow to calculate. Some are very fast, but have more collisions. Some are close to perfectly uniformly distributed, but very hard to implement. You get the idea. If there’s one rule in programming it’s this: there will always be trade-offs.

While you don’t have to be a genius to understand hashes, you do have to be a pretty exceptional mathematician or computer scientist to create one — or at least one that’s useful.

Take this elegant, impenetrable piece of code by Paul Hsieh, for example, which he calls SuperFastHash (excluding some helper code):

uint32_t SuperFastHash (const char * data, int len) {
    uint32_t hash = len, tmp;
    int rem;

    if (len <= 0 || data == NULL) return 0;

    rem = len & 3;
    len >>= 2;

    /* Main loop */
    for (; len > 0; len--) {
        hash += get16bits (data);
        tmp   = (get16bits (data + 2) << 11) ^ hash;
        hash  = (hash << 16) ^ tmp;
        data += 2 * sizeof (uint16_t);
        hash += hash >> 11;
    }

    /* Handle end cases */
    switch (rem) {
        case 3: hash += get16bits (data);
                hash ^= hash << 16;
                hash ^= ((signed char)data[sizeof (uint16_t)]) << 18;
                hash += hash >> 11;
                break;
        case 2: hash += get16bits (data);
                hash ^= hash << 11;
                hash += hash >> 17;
                break;
        case 1: hash += (signed char)*data;
                hash ^= hash << 10;
                hash += hash >> 1;
    }

    /* Force "avalanching" of final 127 bits */
    hash ^= hash << 3;
    hash += hash >> 5;
    hash ^= hash << 4;
    hash += hash >> 17;
    hash ^= hash << 25;
    hash += hash >> 6;

    return hash;
}

This is a series of fast operations: additions, XORs flipping ones and zeroes, bit shifts this way and that, munging and melding parts of the data with itself. And at the end of it, you get an almost perfectly unique fingerprint of the input data. Why pick those operations, with those numbers, in that order? Beats me. Paul Hsieh might not be a wizard, but I’ve seen no evidence of that.

What matters is, we have a way of taking any data, like the contents of a YouTube video, or an MP3 file, or the word “monkey”, and getting back a fingerprint that is predictable in length and unique — mostly — to that item.