Bloom Filters: The “Maybe” Set

[!TIP] The Problem: Checking if an item exists in a massive set (1 Billion items) is slow (Disk I/O) or expensive (RAM). The Solution: A data structure that uses very little memory but gives you a Probabilistic answer. “Is X in the set?” -> “No” (100% Certain) OR “Maybe” (Small chance of error).

1. How it Works

A Bloom Filter is built around a Bit Array of size m and k independent Hash Functions, as visualized in the pipeline above.

Operations

  1. Insertion: To add an item (e.g., “Apple”), we pass it through all k hash functions. Each function outputs an index (e.g., Bit 2, Bit 5, Bit 9). We set the value at those indices to 1.
  2. Lookup: To check if “Apple” exists, we re-hash it to find the same indices.
    • Yes: If all associated bits are 1, the item Maybe exists.
    • No: If any bit is 0, the item Definitely Not exists.
Hashing Mechanism: Key to Bit-Array
m Bits | k Hash Functions | Probabilistic Membership
Input Key "Apple"
Hash H1(x)
→ index 2
Hash H2(x)
→ index 5
Hash H3(x)
→ index 9
0
Bit 0
1
Bit 2
0
0
1
Bit 5
0
0
0
1
Bit 9
0

2. False Positives

Why “Maybe”?

  • Imagine we add “Banana”. It sets bits 2, 5, 8.
  • Imagine we add “Car”. It sets bits 9.
  • Now we check “Apple” (2, 5, 9). All bits are 1!
  • But “Apple” was never added. This is a False Positive.

Note: You can never have a False Negative. If the bit is 0, it’s 0.

Interactive Demo: The Collision Visualizer

Visualize False Positives.

  1. Add “Cat” (blue).
  2. Add “Dog” (blue).
  3. Check “Fox” (orange). If it collides with existing bits, you get a “False Positive”.

3. Sizing and Math

You can tune the error rate (p) by changing the size (m) and number of hashes (k).

  • More Bits (m) = Lower Error.
  • More Hashes (k) = Slower, but better precision (up to a point).
  • Formula: m = - (n ln p) / (ln 2)2 where n is number of items.

Interactive Sizing Calculator

Estimate your memory usage for a production system.



Required Memory (m): --
Optimal Hashes (k): --

4. Counting Bloom Filters (CBF)

Standard Bloom Filters do not support Deletion. If you delete “Apple” by setting bits 2, 5, 9 back to 0 (as seen in our diagram), you might accidentally delete “Banana” if it also shared one of those bits.

Solution: Counting Bloom Filter

Instead of the Bit Array (0/1) shown in the visual, a Counting Bloom Filter uses an Integer Array of counters. * Add: Increment counters (2: 1->2, 5: 1->2). * Delete: Decrement counters (2: 2->1, 5: 2->1).

  • Trade-off: Takes up 3-4x more space (8 bits per counter vs 1 bit).
Feature Standard Bloom Filter Counting Bloom Filter
Storage Bit Array (1 bit/entry) Integer Array (4-8 bits/entry)
Operation Set Bit to 1 Increment Counter
Deletion Impossible Supported (Decrement)
Space Minimal 4x Larger

5. Walkthrough: The Math (Dry Run)

Let’s check “Apple” against a small filter.

  • Size (m): 10 bits.
  • Hashes (k): 2.

Step 1: Insertion (“Apple”)

  • Hash 1 (“Apple”) % 10 = 3.
  • Hash 2 (“Apple”) % 10 = 7.
  • Action: Set Bit 3 and Bit 7 to 1.
  • Array: [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

Step 2: Insertion (“Banana”)

  • Hash 1 (“Banana”) % 10 = 3 (Collision!).
  • Hash 2 (“Banana”) % 10 = 9.
  • Action: Set Bit 3 (already 1) and Bit 9 to 1.
  • Array: [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]

Step 3: Check (“Cherry”)

  • Hash 1 (“Cherry”) % 10 = 3. (Bit 3 is 1).
  • Hash 2 (“Cherry”) % 10 = 9. (Bit 9 is 1).
  • Result: “Maybe”.
  • Reality: FALSE POSITIVE. Cherry was never added, but its hashes collided with Apple and Banana.

6. Requirements Traceability Matrix

Requirement Architectural Solution
Disk IO Reduction Bloom Filter checks RAM before touching Disk.
Low Memory Bit Array uses ~10 bits per item (1% error rate).
Speed Constant Time O(k) for insertion and lookup.
Deletion Support Counting Bloom Filter (Trade-off: More RAM).

7. Interview Gauntlet

  1. Can you resize a Bloom Filter?
    • No. Hashing depends on modulo m. If m changes, all indices change. You must rebuild it from scratch.
  2. What is a False Negative?
    • Saying an item is NOT in the set when it IS. Bloom Filters never do this.
  3. Why not use a HashSet?
    • Memory. Storing “Apple” takes 5 bytes. A Bloom Filter takes 5 bits. For 1 Billion items, the difference is GBs vs MBs.
  4. Optimal number of hash functions?
    • k = (m/n) × ln 2. Usually around 7 for 1% error.
  5. Where is this used in real life?
    • Cassandra/HBase: Avoid reading SSTables.
    • Chrome: Malicious URL check (local filter before calling server).
    • CDN: Cache Summary (What URLs does this cache node have?).

8. Summary: The Whiteboard Strategy

1. Core Concepts

  • Probabilistic: Trade accuracy for space.
  • Bit Array: Dense storage (0/1).
  • Hashing: k independent functions.
  • One-Way: Can't retrieve items.

2. Operations

Add(x) -> Hash(x) -> Set Bits
Check(y) -> Hash(y) -> Check Bits
If any 0 -> NO.
If all 1 -> MAYBE.

* False Positive: Unlucky collision.

3. Use Cases

DB: Skip disk reads (LSM Trees).
Cache: "Cache Digests" (Peering).
Security: Weak Password list.

4. Constraints

  • No Deletion: Requires Counting BF.
  • Fixed Size: Cannot grow dynamically (Scalable BF exists but complex).
  • Saturation: As it fills, error rate approaches 100%.