Bloom filters explained

yourbasic.org

A Bloom filter is a prob­abi­listic data struc­ture used to test set member­ship.

Bloom filter basics

The Bloom filter data structure tells whether an element may be in a set, or definitely isn’t. The only possible errors are false positives: a search for a nonexistent element can give an incorrect answer. With more elements in the filter, the error rate increases.

Bloom filters are both fast and space-efficient. However, elements can only be added, not removed.

Content distribution networks use them to avoid caching one-hit wonders, files that are seen only once. Web browsers use them to check for potentially harmful URLs.

Example: Blocking shady websites

This piece of code, which uses the Go Bloom filter implementation in package github.com/yourbasic/bloom, shows a typical Bloom filter use case.

blocklist := bloom.New(10000, 200) url := "https://rascal.com" blocklist.Add(url) if blocklist.Test(url) { fmt.Println(url, "may be blocked.") } else { fmt.Println(url, "has not been blocked.") }

https://rascal.com may be blocked.

Implementation

An empty Bloom filter is a bit array of m bits, all set to 0. There are also k different hash functions, each of which maps a set element to one of the m bit positions.

To add an element, feed it to the hash functions to get k bit positions, and set the bits at these positions to 1.

bit positions, and set the bits at these positions to 1. To test if an element is in the set, feed it to the hash functions to get k bit positions. If any of the bits at these positions is 0, the element definitely isn’t the set. If all are 1, then the element may be in the set.

bit positions.

Example

Here is a Bloom filter with three elements x, y and z. It consists of 18 bits and uses 3 hash functions. The colored arrows point to the bits that the elements of the set are mapped to.

The element w definitely isn’t in the set, since it hashes to a bit position containing 0.

Try this cool interactive Bloom filter demonstration by Jason Davies!

Performance

For a fixed error rate, adding a new element and testing for membership are both constant time operations, and a filter with room for n elements requires O(n) space.

In fact, a filter with error rate 1/p can be implemented with 0.26⋅ln(p) bytes per element using ⌈1.4⋅ln(p)⌉ bit array lookups per test.

This means that the example code above, with 100,000 entries and error rate 1/200, can be implemented using 135 KiB and 8 hash functions.

The Wikipedia Bloom filter article has a detailed mathematical analysis.

Share this page: