The Read Path: Bloom Filters & SSTables
Reading data from a distributed database is significantly harder than writing it. A write is simple: append to the commit log and update the MemTable. A read has to find the latest version of data that may be scattered across the MemTable and multiple SSTables on disk.
To make reads fast, Cassandra relies on a series of caches and probabilistic data structures to avoid unnecessary disk seeks. The star of the show is the Bloom Filter.
1. The Read Path Anatomy
When a node receives a read request for a specific row (Partition Key), it checks the following locations in order:
- MemTable: If the data is in memory (recently written), return it immediately.
- Row Cache: (Optional, off by default) Checks if the hot row is cached in RAM.
- Bloom Filter: Checks if the SSTable might contain the partition key.
- Partition Key Cache: Checks if the exact disk location of the partition is cached.
- Partition Summary: A sampling of the index file (stored in RAM) to find the approximate location.
- Partition Index: The on-disk index that maps partition keys to data offsets.
- SSTable Data: The actual data file.
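The lookup order above can be sketched in Go. This is an illustrative model, not Cassandra's actual internals: the types, field names, and the use of plain maps as stand-ins for the MemTable, Bloom filter, and data file are all assumptions made for the sketch.

```go
package main

import "fmt"

// SSTable is a toy stand-in: a "Bloom filter" set and a data map.
type SSTable struct {
	filter map[string]bool   // stand-in for the real Bloom Filter
	data   map[string]string // stand-in for Data.db
}

// Node holds an in-memory MemTable and on-disk SSTables (newest first).
type Node struct {
	memtable map[string]string
	sstables []SSTable
}

// Read follows the read-path order: MemTable first, then each SSTable,
// consulting its Bloom filter before touching the data.
func (n *Node) Read(key string) (string, bool) {
	// 1. MemTable: recently written data lives in memory.
	if v, ok := n.memtable[key]; ok {
		return v, true
	}
	// 2. For each SSTable, check the Bloom filter before any "disk" access.
	for _, t := range n.sstables {
		if !t.filter[key] {
			continue // filter says "No": skip this SSTable entirely
		}
		if v, ok := t.data[key]; ok {
			return v, true
		}
		// filter said "Maybe" but the key was absent: a false positive
	}
	return "", false
}

func main() {
	n := &Node{
		memtable: map[string]string{"user:1": "alice"},
		sstables: []SSTable{{
			filter: map[string]bool{"user:2": true},
			data:   map[string]string{"user:2": "bob"},
		}},
	}
	v, _ := n.Read("user:1")
	fmt.Println(v) // alice, served from the MemTable
	v, _ = n.Read("user:2")
	fmt.Println(v) // bob, served from the SSTable
}
```

Note how a "No" from the filter lets the loop skip an SSTable without ever opening its data; that is the whole point of the structure.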
2. What is a Bloom Filter?
A Bloom Filter is a space-efficient probabilistic data structure that answers the question: “Is this element in the set?”
- Possible Answers: “No” or “Maybe”.
- False Negatives: Impossible. If it says “No”, the element is definitely not there.
- False Positives: Possible. It might say “Maybe” even if the element isn’t there (wasted disk seek).
Cassandra uses a Bloom Filter for every SSTable. Before touching disk, it checks the Bloom Filter. If the filter returns “No”, Cassandra skips that SSTable entirely.
The Math of Probability
The probability of a false positive p depends on the bit array size m, the number of elements inserted n, and the number of hash functions k.
The formula is:
p ≈ (1 − e^(−kn/m))^k
- More bits (m) → lower false-positive rate, but more RAM.
- More hash functions (k) → fewer false positives up to an optimum of k = (m/n) · ln 2, but slower insertions and lookups.
Cassandra allows you to tune this (e.g., bloom_filter_fp_chance=0.01 for 1%).
3. SSTable Components
An “SSTable” is actually a collection of files:
- Data.db: The actual data (rows and columns).
- Index.db: Maps partition keys to offsets in the Data file.
- Filter.db: The Bloom Filter (kept in memory).
- Summary.db: A sampling of the Index file (kept in memory).
- CompressionInfo.db: Metadata about compression chunks.
- Statistics.db: Metadata (timestamps, tombstones) used for compaction optimization.
Partition Summary & Key Cache
Reading the Index.db from disk for every request is slow.
- Partition Summary: Keeps every 128th key (default) in RAM. It tells Cassandra: “Your key is somewhere between byte 1024 and 2048 in the Index file.” This narrows the scan range significantly.
- Key Cache: Caches the exact offset of hot keys in Index.db, allowing Cassandra to jump straight to the data location without scanning the index.
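The summary's "narrow the scan range" step amounts to a binary search over sampled index entries. The sketch below shows the idea; the entry layout, names, and offsets are illustrative, not Cassandra's on-disk format.

```go
package main

import (
	"fmt"
	"sort"
)

// summaryEntry is a sampled index entry: every 128th partition key and the
// byte offset where it appears in Index.db (illustrative layout).
type summaryEntry struct {
	key    string
	offset int64
}

// scanRange binary-searches the in-RAM summary (sorted by key) and returns
// the byte range of Index.db that could contain the key.
func scanRange(summary []summaryEntry, key string, indexSize int64) (int64, int64) {
	// Index of the first sampled entry whose key is greater than the target.
	i := sort.Search(len(summary), func(i int) bool {
		return summary[i].key > key
	})
	start := int64(0)
	if i > 0 {
		start = summary[i-1].offset
	}
	end := indexSize
	if i < len(summary) {
		end = summary[i].offset
	}
	return start, end
}

func main() {
	summary := []summaryEntry{
		{"apple", 0}, {"mango", 1024}, {"zebra", 2048},
	}
	start, end := scanRange(summary, "pear", 4096)
	fmt.Printf("scan Index.db bytes [%d, %d)\n", start, end) // [1024, 2048)
}
```

Instead of scanning the whole index file, only the slice between two sampled keys is read; the Key Cache then short-circuits even that for hot keys.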
Code Implementation: Bloom Filter
Below is a simplified implementation of a Bloom Filter, first in Java using BitSet, then in Go.
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bitSet;
    private final int size;
    private final int hashFunctions;

    public SimpleBloomFilter(int size, int hashFunctions) {
        this.size = size;
        this.hashFunctions = hashFunctions;
        this.bitSet = new BitSet(size);
    }

    public void add(String value) {
        for (int i = 0; i < hashFunctions; i++) {
            bitSet.set(getHash(value, i));
        }
    }

    public boolean mightContain(String value) {
        for (int i = 0; i < hashFunctions; i++) {
            if (!bitSet.get(getHash(value, i))) {
                return false; // Definitely not present
            }
        }
        return true; // Maybe present
    }

    // Simplified hash function simulation; abs keeps the index in range
    // even when hashCode() is negative
    private int getHash(String value, int seed) {
        return Math.abs((value.hashCode() + seed * 31) % size);
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(100, 3);
        filter.add("user:123");
        System.out.println("Contains user:123? " + filter.mightContain("user:123")); // true
        System.out.println("Contains user:999? " + filter.mightContain("user:999")); // false (probably)
    }
}
The same filter in Go:

package main

import (
	"fmt"
	"hash/fnv"
)

type BloomFilter struct {
	bitSet []bool
	size   uint
	hashes uint
}

func NewBloomFilter(size uint, hashes uint) *BloomFilter {
	return &BloomFilter{
		bitSet: make([]bool, size),
		size:   size,
		hashes: hashes,
	}
}

func (bf *BloomFilter) Add(value string) {
	for i := uint(0); i < bf.hashes; i++ {
		bf.bitSet[bf.getHash(value, i)] = true
	}
}

func (bf *BloomFilter) MightContain(value string) bool {
	for i := uint(0); i < bf.hashes; i++ {
		if !bf.bitSet[bf.getHash(value, i)] {
			return false // Definitely not present
		}
	}
	return true // Maybe present
}

// getHash uses FNV-1a with a seed variation; the hash is already
// unsigned, so no absolute-value step is needed
func (bf *BloomFilter) getHash(value string, seed uint) uint {
	h := fnv.New32a()
	h.Write([]byte(value))
	return uint(h.Sum32()+uint32(seed)*31) % bf.size
}

func main() {
	filter := NewBloomFilter(100, 3)
	filter.Add("user:123")
	fmt.Printf("Contains user:123? %v\n", filter.MightContain("user:123")) // true
	fmt.Printf("Contains user:999? %v\n", filter.MightContain("user:999")) // false (probably)
}
Summary
- Reads are complex but optimized to minimize disk I/O.
- Bloom Filters are the first line of defense, filtering out SSTables that definitely don’t have the data.
- SSTables contain multiple components (Index, Summary, Filter) to speed up lookups.
- False Positives in Bloom Filters are acceptable because they only cost a disk seek, whereas False Negatives are impossible (ensuring correctness).