The Read Path: Bloom Filters & SSTables

Reading data from a distributed database is significantly harder than writing it. When you write, you just append to a log. When you read, you have to find the latest version of data that might be scattered across a MemTable and multiple SSTables on disk.

To make reads fast, Cassandra relies on a series of caches and probabilistic data structures to avoid unnecessary disk seeks. The star of the show is the Bloom Filter.

1. The Read Path Anatomy

When a node receives a read request for a specific row (Partition Key), it checks the following locations in order:

MemTable: If the data is in memory (recently written), return it immediately.
Row Cache: (Optional, off by default) Checks if the hot row is cached in RAM.
Bloom Filter: Checks if the SSTable might contain the partition key.
Partition Key Cache: Checks if the exact disk location of the partition is cached.
Partition Summary: A sampling of the index file (stored in RAM) to find the approximate location.
Partition Index: The on-disk index that maps partition keys to data offsets.
SSTable Data: The actual data file.

2. What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure that answers the question: “Is this element in the set?”

Possible Answers: “No” or “Maybe”.
False Negatives: Impossible. If it says “No”, the element is definitely not there.
False Positives: Possible. It might say “Maybe” even if the element isn’t there (wasted disk seek).

Cassandra uses a Bloom Filter for every SSTable. Before touching disk, it checks the Bloom Filter. If the filter returns “No”, Cassandra skips that SSTable entirely.

The Math of Probability

The probability of a false positive p depends on the bit array size m, the number of elements inserted n, and the number of hash functions k.

The formula is: p ≈ (1 - e<sup>-kn/m</sup>)<sup>k</sup>

More bits (m) → Lower false positive rate (but more RAM).
More hash functions (k) → Slower insertion/lookup.

Cassandra allows you to tune this (e.g., bloom_filter_fp_chance=0.01 for 1%).

3. SSTable Components

An “SSTable” is actually a collection of files:

Data.db: The actual data (rows and columns).
Index.db: Maps partition keys to offsets in the Data file.
Filter.db: The Bloom Filter (kept in memory).
Summary.db: A sampling of the Index file (kept in memory).
CompressionInfo.db: Metadata about compression chunks.
Statistics.db: Metadata (timestamps, tombstones) used for compaction optimization.

Partition Summary & Key Cache

Reading the Index.db from disk for every request is slow.

Partition Summary: Keeps every 128th key (default) in RAM. It tells Cassandra: “Your key is somewhere between byte 1024 and 2048 in the Index file.” This narrows the scan range significantly.
Key Cache: Caches the exact offset of hot keys in the Index.db, allowing Cassandra to jump straight to the data location without scanning the index.

Interactive Visualizer: Bloom Filter Calculator

A Bloom Filter uses multiple hash functions to set bits in a bit array. Adjust the size of the bit array and the number of hash functions to see how they affect the False Positive rate.

Bit Array Size (m): 10

Hash Functions (k): 2

Ready

Code Implementation: Bloom Filter

Below is a simplified implementation of a Bloom Filter using BitSet.

import java.util.BitSet;
import java.util.function.Function;

public class SimpleBloomFilter {
    private final BitSet bitSet;
    private final int size;
    private final int hashFunctions;

    public SimpleBloomFilter(int size, int hashFunctions) {
        this.size = size;
        this.hashFunctions = hashFunctions;
        this.bitSet = new BitSet(size);
    }

    public void add(String value) {
        for (int i = 0; i < hashFunctions; i++) {
            bitSet.set(getHash(value, i));
        }
    }

    public boolean mightContain(String value) {
        for (int i = 0; i < hashFunctions; i++) {
            if (!bitSet.get(getHash(value, i))) {
                return false; // Definitely not present
            }
        }
        return true; // Maybe present
    }

    // Simplified hash function simulation
    private int getHash(String value, int seed) {
        return Math.abs((value.hashCode() + seed * 31) % size);
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(100, 3);

        filter.add("user:123");

        System.out.println("Contains user:123? " + filter.mightContain("user:123")); // true
        System.out.println("Contains user:999? " + filter.mightContain("user:999")); // false (probably)
    }
}

package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

type BloomFilter struct {
	bitSet []bool
	size   uint
	hashes uint
}

func NewBloomFilter(size uint, hashes uint) *BloomFilter {
	return &BloomFilter{
		bitSet: make([]bool, size),
		size:   size,
		hashes: hashes,
	}
}

func (bf *BloomFilter) Add(value string) {
	for i := uint(0); i < bf.hashes; i++ {
		idx := bf.getHash(value, i)
		bf.bitSet[idx] = true
	}
}

func (bf *BloomFilter) MightContain(value string) bool {
	for i := uint(0); i < bf.hashes; i++ {
		idx := bf.getHash(value, i)
		if !bf.bitSet[idx] {
			return false // Definitely not present
		}
	}
	return true // Maybe present
}

// getHash uses FNV hash with a seed variation
func (bf *BloomFilter) getHash(value string, seed uint) uint {
	h := fnv.New32a()
	h.Write([]byte(value))
	hashVal := h.Sum32()
	return uint(math.Abs(float64(hashVal+uint32(seed)*31))) % bf.size
}

func main() {
	filter := NewBloomFilter(100, 3)

	filter.Add("user:123")

	fmt.Printf("Contains user:123? %v\n", filter.MightContain("user:123"))
	fmt.Printf("Contains user:999? %v\n", filter.MightContain("user:999"))
}

Summary

Reads are complex but optimized to minimize disk I/O.
Bloom Filters are the first line of defense, filtering out SSTables that definitely don’t have the data.
SSTables contain multiple components (Index, Summary, Filter) to speed up lookups.
False Positives in Bloom Filters are acceptable because they only cost a disk seek, whereas False Negatives are impossible (ensuring correctness).