I recently ran across this bloom filter post by Michael Schmatz and it inspired me to write about a neat variation on the bloom filter that I’ve found useful in my work. Quick refresher: a bloom filter is a probabilistic data structure that tests whether an element is potentially in a set. False positives are possible when testing if an element is in the set, but a negative result means the element is definitely not in the set. Got it? Cool.

A counting filter is essentially a bloom filter that’s had its single-bit booleans replaced with n-bit integers. This makes the filter take up much more space than a standard bloom filter, but in return we get an upper-bound count for insertions of a particular element and can remove elements from the filter. Though removal of elements can be pretty neat, we expose ourselves to the possibility of false negatives if we remove an element that was never inserted into the counting filter. Just be careful with that; Deke Guo has more to say about this than I do.

Implementation

I designed my counting filter class with the following interface:

// Adds an element to the counting filter.
void Add(const T *key, int size = sizeof(T));

// Removes an element from the counting filter.
void Remove(const T *key, int size = sizeof(T));

// Tests whether the element has been added. If false, the element
// is definitely not in the set. If true, the element could be in
// the set or it could be a false positive as described above.
bool MaybeContains(const T *key, int size = sizeof(T)) const;

// Gets the upper bound on the number of times an element could
// have been inserted into the counting filter.
int CountUpperBound(const T *key, int size = sizeof(T)) const;

The counters are stored in a std::vector like so:

std::vector<uint8_t> counters_;

I initially implemented this with the counters stored in a std::array object. Using the array, the filter benchmarked at ~1.35x the runtime of a std::unordered_set insertion on my machine. With the vector, it’s benchmarking at ~2.05x std::unordered_set insertions. I settled on the slower option because the code looks cleaner and it allows us to specify the size and number of hashes in the bloom filter during instantiation without needing to use template arguments (std::arrays must have a size known at compile time).

The constructor is pretty simple: just an up-front allocation of the counters_ vector via a call to resize(), which also fills the vector with zeros. So instead I’ll start by showing how I use hashes to get index pairs:

template <typename T, int64_t kSize, int32_t kNumHashPairs>
void CountingFilter<T, kSize, kNumHashPairs>::IdxFromKey(
    const T *key, const int size, const uint32_t seed,
    int64_t *idx1, int64_t *idx2) const {
  array<uint64_t, 2> results;
  MurmurHash3_x64_128(key, size, seed, results.data());
  *idx1 = results[0] % counters_.size();
  *idx2 = results[1] % counters_.size();
  assert(*idx1 < counters_.size());
  assert(*idx2 < counters_.size());
}

Note that this limits the implementation to even numbers of hashes, but I like the fact that we’re getting a 2-for-1 deal with the 64-bit hashes.

The Add() function has to account for the count potentially exceeding the maximum counter value. In that case, there’s not much we can do except skip the increment and flag the filter data as potentially erroneous.

template <typename T, int64_t kSize, int32_t kNumHashPairs>
void CountingFilter<T, kSize, kNumHashPairs>::Add(
    const T *key, const int size) {
  for (int32_t xx = 0; xx < kNumHashPairs; ++xx) {
    int64_t idx1, idx2;
    IdxFromKey(key, size, xx, &idx1, &idx2);
    // It's possible that the count can exceed the maximum uint8_t
    // value, so we'll just leave it be. After many removals,
    // this could result in a false negative, but this is very
    // unlikely. Let's just assert for this case.
    assert(counters_[idx1] < numeric_limits<uint8_t>::max());
    assert(counters_[idx2] < numeric_limits<uint8_t>::max());
    counters_[idx1] += 1;
    counters_[idx2] += 1;
  }
  ++num_insertions_;
}

Removal is much simpler since we just decrement the counters. Here it’s up to you whether to add an assertion that each counter is nonzero before decrementing it:

template <typename T, int64_t kSize, int32_t kNumHashPairs>
void CountingFilter<T, kSize, kNumHashPairs>::Remove(
    const T *key, const int size) {
  for (int32_t xx = 0; xx < kNumHashPairs; ++xx) {
    int64_t idx1, idx2;
    IdxFromKey(key, size, xx, &idx1, &idx2);
    assert(counters_[idx1] > 0);
    assert(counters_[idx2] > 0);
    counters_[idx1] -= 1;
    counters_[idx2] -= 1;
  }
  --num_insertions_;
}

Checking whether an element might be in the filter is just a check that all of its counters are nonzero. Getting an upper bound on the number of times an element has been inserted just requires finding its minimum counter. It’s only an upper bound because collisions with other elements can inflate the counters.

template <typename T, int64_t kSize, int32_t kNumHashPairs>
int CountingFilter<T, kSize, kNumHashPairs>::CountUpperBound(
    const T *key, const int size) const {
  int count_ub = numeric_limits<uint8_t>::max() + 1;
  for (int32_t xx = 0; xx < kNumHashPairs; ++xx) {
    int64_t idx1, idx2;
    IdxFromKey(key, size, xx, &idx1, &idx2);
    count_ub = min(static_cast<int>(counters_[idx1]), count_ub);
    count_ub = min(static_cast<int>(counters_[idx2]), count_ub);
  }
  return count_ub;
}

Wrapping up

This data structure is a fun variation on a vanilla bloom filter. My implementation is currently benchmarking at ~2.05x the insertion time of a pre-reserved std::unordered_set. I’ve mainly used it to be pickier about cache insertions when the working set of data is very large, but I’d love to find more applications for this.