Source : Cloudflare

A while back, I was working on a task that required me to generate a random unique string and use it as an identifier in a database column and I thought to myself, “It’s a trivial task, will be done in 2 minutes!”.

Rakishly, I began to formulate the code.

String id = createRandomString(6) //length 6 while(dbRepository.hasId(id)){

id = createRandomString(6)

}

“This code will work. But at what cost?”

Time complexity is O(N *M* k), where N is the size of the db and k is the length of the random string to be generated and M is the number of collisions in the DB .

Let’s optimise it.

We can create an index on Id and if N >> k then we can safely ignore k. This drops the complexity to O(MlogN).

“That looks pretty good”, I proclaimed. But then I started to calculate what would be the worst case time taken for a typical relational database.

Number of rows (N) - 10^8 Search time = log(10^8) = 27ops On average - If the random function generates duplicates 0.001% times

then we will iterate 10^-5 * 10^8 = 1000 times The average case of 1000 db calls is unacceptable for an API which will tend to worsen when the table size increases.

But what can we do to remedy this?

Write better random function or increase the length of the String. That might work from small amount of entries in database but as that keeps on increasing then we will be in the same predicament again. Use Bloom Filters.

What is a Bloom Filter?

A simple filter that can answer “Do I have this string already in the storage or not?” in O(1) time.

Let’s first talk about Hashmap.

A hashmap is a simple array with each index calculated by the hash of the String/Object being inserted against this index. This makes retrieval O(1).

Let’s use the same principle but instead of saving the object against the index we just mark that value as true or 1.

A Bloom filter (from Wikipedia)

A bloom filter can be implemented with a binary array. Similar to a Hashmap, we have hash function(s) that generate an integer value which corresponds to an index in the binary array.

Algorithm

INSERT(S) 1. String s is hashed by K hash functions generating K different indices. 2. The values against these indices are marked as 1.

EXISTS(S) 1. String s is hashed by K hash functions generating K different indices. 2. The values against these indices are checked, if all of them are 1 then there is a strong possibility of having this data in storage. If any of them is zero then we definitely don't have this value persisted.

Bloom Filter can definitely answer if that string is not present in the DB but it can’t definitely say that string is in the DB.

Confused?

Let’s take an example —

Empty filter

Insert “Maple”.

Here K(number of hash functions) = 3

so, we get 3 indices — 0, 3, 8

0,3,8 marked as 1

Insert “Canada”

hashed indices — 4, 6, 8

4, 6, 8 marked as 1

Check if “India” exists —

hashed indices — 5, 8, 9

Here 9 and 5 are zero so we return false.

Check if “Apple” exists —

hashed indices — 5, 4, 3

All of them are 1 so we have to return true here. But Apple was never inserted in the filter!

This is called a false positive and is a downside of this data structure. We can mitigate it to a certain extent by increasing the array size.

Important Parameters

There are 2 important parameters of a bloom filter.

Array Size Number of Hash Functions

Let’s understand Bloom filter and importance of these parameters by code.

A Bloom Filter Interface.

Abstract Implementation of the Bloom Filter

We are using BitSet to represent the binary array.

falsePosProbability - How much false positives can we tolerate

expectedElements - How many elements are going to be in the DB.

We can calculate expectedElements by using the formulas (from GeeksforGeeks) —

Probability of False positivies: Let m be the size of bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then the probability of false positive p can be calculated as:

Size of Bit Array: If expected number of elements n is known and desired false positive probability is p then the size of bit array m can be calculated as :

Optimum number of hash functions: The number of hash functions k must be a positive integer. If m is size of bit array and n is number of elements to be inserted, then k can be calculated as :

Let’s create a BasicBloomFilter add some hash functions :

Here I have used 2 Hash Functions

MurmurHash3 Java Hashcode

Let’s Test this class —

We will add a lot of words in the filter and check against words that aren’t in the filter to see how much false positives we are getting.

Result —

TestCase 1 -

BloomFilter ::

size :: 64

falsePosProbability :: -1.0

expectedElements :: -1

optimumHashFunctions :: -1 present :: 5 out of 5 TestCase 2 -

BloomFilter ::

size :: 26048

falsePosProbability :: 0.01

expectedElements :: 3714

optimumHashFunctions :: 7 present :: 0 out of 5

As we can see if the size of the BitSet is small as in TestCase 1, every bit is set and will cause all strings to be reported as false positives.

When we calculated the optimal size when falsePosProbability set to 0.01 it drops to 0. However now the size is 26,048 for 3714 elements.

As this data needs to be in memory to be quickly accessed we need to strike a balance between falsePosProbability and size.

Thanks for reading!

Read more to learn how to implement Bloom Filter via Redis here.