Bloom Filters: The “Maybe” Set
[!TIP] The Problem: Checking if an item exists in a massive set (1 Billion items) is slow (Disk I/O) or expensive (RAM). The Solution: A data structure that uses very little memory but gives you a Probabilistic answer. “Is X in the set?” -> “No” (100% Certain) OR “Maybe” (Small chance of error).
1. How it Works
A Bloom Filter is built around a Bit Array of size m and k independent Hash Functions, as visualized in the pipeline above.
Operations
- Insertion: To add an item (e.g., “Apple”), we pass it through all k hash functions. Each function outputs an index (e.g., Bit 2, Bit 5, Bit 9). We set the value at those indices to
1. - Lookup: To check if “Apple” exists, we re-hash it to find the same indices.
- Yes: If all associated bits are
1, the item Maybe exists. - No: If any bit is
0, the item Definitely Not exists.
- Yes: If all associated bits are
2. False Positives
Why “Maybe”?
- Imagine we add “Banana”. It sets bits
2, 5, 8. - Imagine we add “Car”. It sets bits
9. - Now we check “Apple” (
2, 5, 9). All bits are1! - But “Apple” was never added. This is a False Positive.
Note: You can never have a False Negative. If the bit is 0, it’s 0.
Interactive Demo: The Collision Visualizer
Visualize False Positives.
- Add “Cat” (blue).
- Add “Dog” (blue).
- Check “Fox” (orange). If it collides with existing bits, you get a “False Positive”.
3. Sizing and Math
You can tune the error rate (p) by changing the size (m) and number of hashes (k).
- More Bits (m) = Lower Error.
- More Hashes (k) = Slower, but better precision (up to a point).
- Formula: m = - (n ln p) / (ln 2)2 where n is number of items.
Interactive Sizing Calculator
Estimate your memory usage for a production system.
4. Counting Bloom Filters (CBF)
Standard Bloom Filters do not support Deletion. If you delete “Apple” by setting bits 2, 5, 9 back to 0 (as seen in our diagram), you might accidentally delete “Banana” if it also shared one of those bits.
Solution: Counting Bloom Filter
Instead of the Bit Array (0/1) shown in the visual, a Counting Bloom Filter uses an Integer Array of counters.
* Add: Increment counters (2: 1->2, 5: 1->2).
* Delete: Decrement counters (2: 2->1, 5: 2->1).
- Trade-off: Takes up 3-4x more space (8 bits per counter vs 1 bit).
| Feature | Standard Bloom Filter | Counting Bloom Filter |
|---|---|---|
| Storage | Bit Array (1 bit/entry) | Integer Array (4-8 bits/entry) |
| Operation | Set Bit to 1 | Increment Counter |
| Deletion | Impossible | Supported (Decrement) |
| Space | Minimal | 4x Larger |
5. Walkthrough: The Math (Dry Run)
Let’s check “Apple” against a small filter.
- Size (m): 10 bits.
- Hashes (k): 2.
Step 1: Insertion (“Apple”)
- Hash 1 (“Apple”) % 10 = 3.
- Hash 2 (“Apple”) % 10 = 7.
- Action: Set Bit 3 and Bit 7 to
1. - Array:
[0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
Step 2: Insertion (“Banana”)
- Hash 1 (“Banana”) % 10 = 3 (Collision!).
- Hash 2 (“Banana”) % 10 = 9.
- Action: Set Bit 3 (already 1) and Bit 9 to
1. - Array:
[0, 0, 0, 1, 0, 0, 0, 1, 0, 1]
Step 3: Check (“Cherry”)
- Hash 1 (“Cherry”) % 10 = 3. (Bit 3 is 1).
- Hash 2 (“Cherry”) % 10 = 9. (Bit 9 is 1).
- Result: “Maybe”.
- Reality: FALSE POSITIVE. Cherry was never added, but its hashes collided with Apple and Banana.
6. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Disk IO Reduction | Bloom Filter checks RAM before touching Disk. |
| Low Memory | Bit Array uses ~10 bits per item (1% error rate). |
| Speed | Constant Time O(k) for insertion and lookup. |
| Deletion Support | Counting Bloom Filter (Trade-off: More RAM). |
7. Interview Gauntlet
- Can you resize a Bloom Filter?
- No. Hashing depends on modulo m. If m changes, all indices change. You must rebuild it from scratch.
- What is a False Negative?
- Saying an item is NOT in the set when it IS. Bloom Filters never do this.
- Why not use a HashSet?
- Memory. Storing “Apple” takes 5 bytes. A Bloom Filter takes 5 bits. For 1 Billion items, the difference is GBs vs MBs.
- Optimal number of hash functions?
- k = (m/n) × ln 2. Usually around 7 for 1% error.
- Where is this used in real life?
- Cassandra/HBase: Avoid reading SSTables.
- Chrome: Malicious URL check (local filter before calling server).
- CDN: Cache Summary (What URLs does this cache node have?).
8. Summary: The Whiteboard Strategy
1. Core Concepts
- Probabilistic: Trade accuracy for space.
- Bit Array: Dense storage (0/1).
- Hashing: k independent functions.
- One-Way: Can't retrieve items.
2. Operations
Check(y) -> Hash(y) -> Check Bits
If any 0 -> NO.
If all 1 -> MAYBE.
* False Positive: Unlucky collision.
3. Use Cases
Cache: "Cache Digests" (Peering).
Security: Weak Password list.
4. Constraints
- No Deletion: Requires Counting BF.
- Fixed Size: Cannot grow dynamically (Scalable BF exists but complex).
- Saturation: As it fills, error rate approaches 100%.