Today I decided to play with some data structures and I was wondering what should I learn next. I remember some time ago while I was reading some articles about Cassandra internals that it used bloom filters to check if the particular SSTables are likely to have the request partition data. I thought it might be cool to implement one and share my thought process with you. I will try to explain things as simple as possible.

Enter the Bloom

So what is a Bloom Filter? A detailed explanation is beyond the scope of this article but here’s a summary from Wikipedia:

Bloom Filter: A space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed.

Bloom filters are used in many cases like Cache filtering to prevent “one-hit-wonders” from being stored in its disk caches or in blogs to avoid recommending articles a user has previously read like Medium does.

So let’s try to implement one to see what we can learn.

Enter the Code

We need a few thing before we can start. As bloom filters are only expanding and we cannot remove items from them reliably, let’s define the simplest abstract interface we need to provide.

type Interface interface {

Add(item []byte) // Adds the item into the Set

Test(item []byte) bool // Check if items is maybe in the Set

}

Based on that let’s define and add the initial constructor, as we need to create one filter before we can use it:

// BloomFilter probabilistic data structure definition

type BloomFilter struct {

bitset []bool // The bloom-filter bitset

k uint // Number of hash values

n uint // Number of elements in the filter

m uint // Size of the bloom filter bitset

hashFuncs[]hash.Hash64 // The hash functions

} // Returns a new BloomFilter object,

func New(size) *BloomFilter {

return &BloomFilter{

bitset: make([]bool, size),

k: 3, // we have 3 hash functions for now

m: size,

n: uint(0),

hashfns:[]hash.Hash64{murmur3.New64(),fnv.New64(),fnv.New64a()},

}

}

I’ve used 3 different hash functions here which will produce 3 different values for simplicity on each item. I’ve used a boolean slice to store the flags.

Now let’s try to implement the interface definitions.

Adding Items to the Set: When Adding an item we simply pass it through the hash functions and store the results in an array. Those numbers are the positions that we need to set in the boolean array to mark them as occupied. As the hash positions might be big numbers we mod the positions with the size of the bit-set to prevent index errors.

// Adds the item into the bloom filter set by hashing in over the . // hash functions

func (bf *BloomFilter) Add(item []byte) {

hashes := bf.hashValues(item)

i := uint(0) for {

if i >= bf.k {

break

} position := uint(hashes[i]) % bf.m

bf.bitset[uint(position)] = true i+= 1

} bf.n += 1

} // Calculates all the hash values by applying in the item over the // hash functions

func (bf *BloomFilter) hashValues(item []byte) []uint64 {

var result []uint64 for _, hashFunc := range bf.hashfns {

hashFunc.Write(item)

result = append(result, hashFunc.Sum64())

hashFunc.Reset()

} return result

}

As you can see we only generate the values and set their values as positions into the bit-set.

Checking if an item is possibly in the Set: To check if the item is possibly in the set we do the same thing as the Add method but now we check if the hash position is flagged as true in the bit-set. If all positions are been set then we say the item possibly exists. If one of the positions has not been set then we say it definitely does not exist.

// Test if the item into the bloom filter is set by hashing in over // the hash functions

func (bf *BloomFilter) Test(item []byte) (exists bool) {

hashes := bf.hashValues(item)

i := uint(0)

exists = true



for {

if i >= bf.k {

break

}



position := uint(hashes[i]) % bf.m

if !bf.bitset[uint(position)] {

exists = false

break

} i+= 1

} return

}

Lets write some tests

We check some values if they exist in the set or not.

// Test items if items may exist into the set

func (s *MySuite) TestIfMayExist(c *C) {

bf := New(1024)

bf.Add([]byte("hello"))

bf.Add([]byte("world"))

bf.Add([]byte("sir"))

bf.Add([]byte("madam"))

bf.Add([]byte("io")) c.Assert(bf.Test([]byte("hello")), Equals, true) c.Assert(bf.Test([]byte("world")), Equals, true) c.Assert(bf.Test([]byte("hi")), Equals, false)

}

Testing for existence may give you false Positive when the item is in the set but in reality its not. That depends on many factors like the size of the bit-set the size of the elements in the set and the quality of the hash functions.

Conclusion

As you can see, implementing bloom filters are not a difficult thing. You only need to divide the problem into small pieces and work on them one by one. In that case, we started with the abstract representation of the Data Structure and implemented the methods using simple conventions. I hope you’ve enjoyed the endeavor and you learned something.

Exercises

Bloom filter has some mathematical properties for calculating the optimal number of elements such as the error rate is within a specific percentage. For example from https://brilliant.org/wiki/bloom-filter/#false-positives-analysis the best value for k is:

k = ln(2) * m / n

Modify the original constructor of the BloomFilter such as it accepts an e parameter that will try to calculate the optimal k such as the invariant holds.

References

You can find more information about bloom filters in the following sites:

Wiki Article: A nice overview of the bloom filter

Original Paper: The original paper describing Bloom filters

Briliant.org article: A nice explanation from Briliant.org

If this post was helpful please clap it and stay tuned for my other articles. You can follow me on GitHub and LinkedIn. If you have any ideas and improvements feel free to share them with me.

If you would like to schedule a mentoring session visit my Codementor Profile.

Happy coding.