Bloom Filters - a beginner's introduction

Introduction

In this article we will look at a very useful probabilistic data structure called the Bloom filter.

TLDR: If you need a mechanism to check whether an element is present in a set, don't have enough memory to store the entire set, and can tolerate a configurable false positive rate (with no false negatives), give Bloom filters a try.

Details

The original paper[1] is from 1970, when RAM sizes were small and, for some problems, it was impractical to store the hashed contents of a set in memory. The author compares the conventional approach of storing the entire hashed contents of the set on disk with storing a much smaller Bloom filter entirely in RAM, with the guarantee that the filter exhibits no false negatives but may return some percentage of false positives. Using this approach, the fraction of disk accesses needed to answer set membership queries can be brought down drastically (only a few queries incorrectly hit the disk because of false positives). With the advent of big data, even the hashes of a dataset might not fit into RAM, and a Bloom filter can be a good candidate solution in those situations. The trade-off, obviously, is that we are reducing space at the cost of a (somewhat) increased response time in the false positive case.

A) What is meant by false positive and false negative rate?

False positive rate - the fraction of set membership queries for which the Bloom filter returns true (reporting the element as a member of the set) even though the element is not present in the set.

False negative rate - the fraction of set membership queries for which the Bloom filter returns false (reporting the element as not a member of the set) even though the element is present in the set.

A Bloom filter will never exhibit false negatives but can exhibit some percentage of false positives.

B) When do you want to use it?

As the TLDR above says: when you need fast set membership checks, cannot afford to store the entire set in memory, and can tolerate a small, configurable false positive rate.

How it works

The paper describes two ways of implementing the Bloom filter, and we will only look at the second method, which is demonstrably superior in terms of space usage.

The filter is an array of m one-bit cells, all initialized to 0. A cell holds 1 if at least one hash of some inserted element maps to it, and 0 otherwise.

n  # number of elements expected to be stored in the Bloom filter
m  # number of cells, i.e. the size of the Bloom filter's bit array
k  # number of hash functions f_1, f_2, ..., f_k; each could be a hash function like MurmurHash3, SHA-256, etc., reduced into the range [0, m)
s = [0] * m  # the bit array, initially all zeros
f = []       # array of the k hash functions

# supported operations:

insert(x):
    for i in range(k):
        s[f[i](x)] = 1  # notice that the element itself is never stored; only the cells its hashes select are set

lookup(x):
    for i in range(k):
        if s[f[i](x)] == 0:
            return 0  # definitely not in the set
    return 1  # probably in the set
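
Below is a minimal runnable sketch of the above pseudocode in Python. The class name and parameters are illustrative, and instead of k independent hash functions it uses a common trick of deriving k indexes from two halves of a single SHA-256 digest; this is a sketch, not the paper's implementation:

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # number of one-bit cells
        self.k = k            # number of hash functions to simulate
        self.bits = [0] * m   # the bit array s, initially all zeros

    def _indexes(self, x):
        # Derive k indexes in [0, m) from two halves of one SHA-256 digest.
        digest = hashlib.sha256(x.encode()).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for idx in self._indexes(x):
            self.bits[idx] = 1

    def lookup(self, x):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[idx] == 1 for idx in self._indexes(x))

bf = BloomFilter(m=1000, k=7)
bf.insert("apple")
print(bf.lookup("apple"))   # True: no false negatives
print(bf.lookup("banana"))  # almost certainly False; rarely a false positive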

Time and Space complexity

The time complexity of both the insert and lookup operations is O(k), regardless of the number of elements in the set!

The space complexity is O(m). How we choose m and k depends on how much of a false positive rate we are willing to tolerate.

Analysis for choosing m and k

The probability that a given cell is still 0 after one insertion is (1 - 1/m)^k.
After n insertions, this probability becomes (1 - 1/m)^(kn).

Then the false positive probability, i.e. the chance that for a non-member all k probed cells happen to be 1, is:

(1 - (1 - 1/m)^(kn))^k, which turns out to be approximately ~ (1 - e^(-kn/m))^k
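
As a quick sanity check of the formula (the values of n, m, and k here are illustrative), we can compare the exact and approximate expressions in Python:

import math

n, m, k = 1000, 10000, 7  # illustrative: 10 bits per element, 7 hash functions

exact = (1 - (1 - 1/m) ** (k * n)) ** k
approx = (1 - math.exp(-k * n / m)) ** k
print(exact, approx)  # both ~0.0082, i.e. about a 0.8% false positive rate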

Given this, raising k (the number of hash functions) can bring down the error rate (up to a point), but in the original paper one of the goals is to keep the time spent evaluating whether a member is present in the set low, and that depends on keeping k small. We leave it to the consumer of the Bloom filter to choose m and n according to their space requirements, and then derive the best k from those.

The optimal k, given m and n, is:

optimal k = ln(2) * (m/n)

With this k, the false positive rate is E = (1/2)^k ~ (0.6185)^(m/n)

Here, m/n is the number of bits per element.

In practice, k (which must be an integer) is rounded, and m/n is fixed as a constant.
For example, when m/n = 10 and k = 7, E ~ 0.008, i.e. a 0.8% false positive rate.
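
A small helper (illustrative, not from the paper) that derives k and E from a chosen m and n:

import math

def optimal_k(m, n):
    return round(math.log(2) * m / n)  # k has to be an integer, so round

m, n = 10000, 1000           # m/n = 10 bits per element
k = optimal_k(m, n)          # ln(2) * 10 ~ 6.93, rounds to 7
E = (1 - math.exp(-k * n / m)) ** k
print(k, E)                  # 7, ~0.008, matching the figure above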

Conclusion

In the original paper, the author uses the example of a program hyphenating 500,000 words, of which 450,000 can be hyphenated by simple means. The other 50,000 words need access to a reference dictionary that is entirely disk resident. With a Bloom filter holding those 50,000 exceptional words and a choice of E = 1/16, a disk access is required for at most 50,000 + (450,000/16) ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16% of the cases. This constitutes an 84% reduction in disk accesses compared to keeping the data completely disk resident.
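
Spelling out the arithmetic behind those figures:

total = 500_000       # words to hyphenate
exceptions = 50_000   # words that genuinely need the disk-resident dictionary
E = 1 / 16            # chosen false positive rate

disk_accesses = exceptions + (total - exceptions) * E  # 50,000 + 28,125 = 78,125
print(disk_accesses / total)       # ~0.156, i.e. ~16% of words hit the disk
print(1 - disk_accesses / total)   # ~0.84, an ~84% reduction in disk accesses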

References

[1] - Burton H. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors" (1970): https://dl.acm.org/doi/10.1145/362686.362692

[2] - Counting Bloom filter (supports deletions): https://web.archive.org/web/20200727040826/http://theory.stanford.edu/~rinap/papers/esa2006b.pdf