Hash tables. Dictionaries. Associative arrays. Whatever you like to call them, they are everywhere in software. They are core. And when someone finds a vulnerability in such a low-level data structure, almost all software is implicated.

This is a story of one of those core vulnerabilities, and how it took a decade to uncover and resolve. The story is pretty amazing. But for context, let’s review what hash tables are.

Hash Tables 101

Hash tables are incredibly convenient and fast. They let you put labels on things and throw them into memory buckets, and later on you can pull them back out and use them for whatever you want. They were invented in the 1950s and their underlying mechanics haven’t changed much over the years.

Let’s create a hash table and put some stuff in it:

    h = {}
    h['a'] = 6
    h['b'] = 3
    h['f'] = 9
    print(h['a'])  # prints 6

Each key and value will be stored in a bucket in memory. Let’s say we start out with 5 empty buckets.

When we add the key ‘a’, which bucket should it go into? We want to be able to find it easily later. This is where the hashing function comes in. Every hash table is backed by a deterministic hashing function that turns any key into a large, fixed-length number, which we call the hash. So the hash for ‘a’ might be 12416037344.

Because it’s deterministic, if we run the hash function on ‘a’ again sometime later, we’ll still get 12416037344. Now that we have the hash for our key, we need to reduce it to a bucket number (0–4 in our case). The simplest way to do this is to take the hash modulo the number of buckets:

12416037344 % 5 = 4

Great. So ‘a’ goes into bucket #4.

Now, if we keep going and hash ‘b’, we get 12544037731. And (12544037731 % 5) is 1. So ‘b’ goes into bucket #1.

Now let’s add ‘f’ to the table.

Hashing ‘f’ yields 13056039271. And (13056039271 % 5) is 1. But we already have something in bucket #1! What now?
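The bucket arithmetic so far is easy to check with a few lines of Python. This is only a sketch: the hash values are the illustrative ones used in this article, not the output of any real hash function, and the names `NUM_BUCKETS`, `example_hashes`, and `bucket_index` are made up for the example.

```python
# Illustrative hash values taken from the text; a real hash function
# would produce different numbers.
NUM_BUCKETS = 5
example_hashes = {"a": 12416037344, "b": 12544037731, "f": 13056039271}


def bucket_index(hash_value, num_buckets=NUM_BUCKETS):
    """Reduce a large hash to a bucket number in the range 0..num_buckets-1."""
    return hash_value % num_buckets


occupied = {}  # bucket number -> first key placed there
for key, hash_value in example_hashes.items():
    index = bucket_index(hash_value)
    if index in occupied:
        print(f"{key!r} -> bucket {index} (already holds {occupied[index]!r})")
    else:
        occupied[index] = key
        print(f"{key!r} -> bucket {index}")
# 'a' -> bucket 4
# 'b' -> bucket 1
# 'f' -> bucket 1 (already holds 'b')
```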

We have a collision. Hash table collisions happen pretty often. One of the simplest ways to resolve collisions is to set each bucket up as a list, and just keep adding to that list whenever a collision occurs. This is known as a chained hash table. Here’s what it might look like:

bucket 0: (empty)
bucket 1: ‘b’ → 3, ‘f’ → 9
bucket 2: (empty)
bucket 3: (empty)
bucket 4: ‘a’ → 6

As we add more keys, it’s fine to grow these lists for a while. When we are looking up a key, we simply look through the target bucket’s list for the key we’re interested in.
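To make the chaining idea concrete, here is a minimal sketch of a chained hash table. It is a toy, not how any real language implements its dictionaries. The class name `ChainedHashTable` and its methods are invented for this example, and Python’s built-in `hash()` stands in for the table’s hash function, so keys will not necessarily land in the same buckets as in the walkthrough above.

```python
class ChainedHashTable:
    """Toy chained hash table: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets=5):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # Assumption: the built-in hash() plays the role of the hash function.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))  # new key (or collision): grow the list

    def get(self, key):
        # Look through the target bucket's list for the key we want.
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable()
table.put("a", 6)
table.put("b", 3)
table.put("f", 9)
print(table.get("a"))  # 6
```

Notice that `get` has to walk the bucket’s list one entry at a time. With short lists that costs almost nothing.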

Why not just add more buckets to the table? Well, eventually we will need to do that, but at that point the entire table has to be rehashed, so it should only be done occasionally. Adding to a list is much faster.

Usually.

Unfortunately, collisions open the door to the biggest weakness in hash tables. As soon as we have collisions, the time required for accessing an element starts gradually creeping up because we have to loop through the list within the bucket.
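How bad can it get? The sketch below counts how many list entries are examined while looking up 1,000 keys, first when every key has its own bucket and then when every key collides into a single bucket. Both hash functions (`spread_hash` and `constant_hash`) are hypothetical extremes invented for the demonstration, as are the helper names.

```python
def build_buckets(keys, num_buckets, hash_fn):
    """Place each key into a chained bucket chosen by hash_fn(key) % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[hash_fn(key) % num_buckets].append(key)
    return buckets


def comparisons_to_find(buckets, key, hash_fn):
    """Count how many list entries are examined before `key` is found."""
    bucket = buckets[hash_fn(key) % len(buckets)]
    for examined, existing_key in enumerate(bucket, start=1):
        if existing_key == key:
            return examined
    return len(bucket)  # not found: the whole list was scanned


def spread_hash(key):
    # Hypothetical best case for integer keys: each key gets a distinct hash.
    return key


def constant_hash(key):
    # Hypothetical worst case: every key hashes to the same value.
    return 0


keys = list(range(1000))
for hash_fn in (spread_hash, constant_hash):
    buckets = build_buckets(keys, 1000, hash_fn)
    total = sum(comparisons_to_find(buckets, key, hash_fn) for key in keys)
    print(hash_fn.__name__, total)
# spread_hash 1000
# constant_hash 500500
```

Same keys, same number of buckets, and roughly 500 times more work once everything lands in one list.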

When hash tables were invented, the selection of a great hash function came down to two things:

Performance. It must be fast as hell. Of course.

Uniform density output. A great hash function should distribute arbitrary keys evenly across the hash table, because we want to avoid collisions as much as possible.
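The second criterion is easy to see in a small experiment. The sketch below counts how many keys land in each of 7 buckets under two toy hash functions. Both `first_char_hash` and `polynomial_hash` are hypothetical functions written for this illustration, not ones taken from any real hash table implementation.

```python
from collections import Counter


def first_char_hash(key):
    # Hypothetical poor hash: only looks at the first character.
    return ord(key[0])


def polynomial_hash(key):
    # Hypothetical toy hash that mixes every character into the result.
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h


def bucket_counts(keys, hash_fn, num_buckets):
    """Return how many keys land in each bucket, indexed by bucket number."""
    counts = Counter(hash_fn(key) % num_buckets for key in keys)
    return [counts[i] for i in range(num_buckets)]


keys = [f"user{i}" for i in range(1000)]
print(bucket_counts(keys, first_char_hash, 7))  # all 1000 keys share one bucket
print(bucket_counts(keys, polynomial_hash, 7))  # keys spread over several buckets
```

Both functions are fast and deterministic, but only the second one comes anywhere near uniform output for keys that all start with the same letter.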

And that was it. Security was not in the picture. Some very simple, very fast general-purpose hash functions were developed over the years, and they worked well for several decades.