The Read Path: Bloom Filters & SSTables

Reading data from a distributed database is significantly harder than writing it. When you write, you just append to a log. When you read, you have to find the latest version of data that might be scattered across a MemTable and multiple SSTables on disk.

To make reads fast, Cassandra relies on a series of caches and probabilistic data structures to avoid unnecessary disk seeks. The star of the show is the Bloom Filter.

1. The Read Path Anatomy

When a node receives a read request for a specific row (Partition Key), it checks the following locations in order:

  1. MemTable: If the data is in memory (recently written), return it immediately.
  2. Row Cache: (Optional, off by default) Checks if the hot row is cached in RAM.
  3. Bloom Filter: Checks if the SSTable might contain the partition key.
  4. Partition Key Cache: Checks if the exact disk location of the partition is cached.
  5. Partition Summary: A sampling of the index file (stored in RAM) to find the approximate location.
  6. Partition Index: The on-disk index that maps partition keys to data offsets.
  7. SSTable Data: The actual data file.

2. What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure that answers the question: “Is this element in the set?”

  • Possible Answers: “No” or “Maybe”.
  • False Negatives: Impossible. If it says “No”, the element is definitely not there.
  • False Positives: Possible. It might say “Maybe” even if the element isn’t there (wasted disk seek).

Cassandra uses a Bloom Filter for every SSTable. Before touching disk, it checks the Bloom Filter. If the filter returns “No”, Cassandra skips that SSTable entirely.

The Math of Probability

The probability of a false positive p depends on the bit array size m, the number of elements inserted n, and the number of hash functions k.

The formula is: p &approx; (1 - e<sup>-kn/m</sup>)<sup>k</sup>

  • More bits (m) → Lower false positive rate (but more RAM).
  • More hash functions (k) → Slower insertion/lookup.

Cassandra allows you to tune this (e.g., bloom_filter_fp_chance=0.01 for 1%).

3. SSTable Components

An “SSTable” is actually a collection of files:

  • Data.db: The actual data (rows and columns).
  • Index.db: Maps partition keys to offsets in the Data file.
  • Filter.db: The Bloom Filter (kept in memory).
  • Summary.db: A sampling of the Index file (kept in memory).
  • CompressionInfo.db: Metadata about compression chunks.
  • Statistics.db: Metadata (timestamps, tombstones) used for compaction optimization.

Partition Summary & Key Cache

Reading the Index.db from disk for every request is slow.

  • Partition Summary: Keeps every 128th key (default) in RAM. It tells Cassandra: “Your key is somewhere between byte 1024 and 2048 in the Index file.” This narrows the scan range significantly.
  • Key Cache: Caches the exact offset of hot keys in the Index.db, allowing Cassandra to jump straight to the data location without scanning the index.

Interactive Visualizer: Bloom Filter Calculator

A Bloom Filter uses multiple hash functions to set bits in a bit array. Adjust the size of the bit array and the number of hash functions to see how they affect the False Positive rate.

Ready

Code Implementation: Bloom Filter

Below is a simplified implementation of a Bloom Filter using BitSet.

import java.util.BitSet;
import java.util.function.Function;

public class SimpleBloomFilter {
    private final BitSet bitSet;
    private final int size;
    private final int hashFunctions;

    public SimpleBloomFilter(int size, int hashFunctions) {
        this.size = size;
        this.hashFunctions = hashFunctions;
        this.bitSet = new BitSet(size);
    }

    public void add(String value) {
        for (int i = 0; i < hashFunctions; i++) {
            bitSet.set(getHash(value, i));
        }
    }

    public boolean mightContain(String value) {
        for (int i = 0; i < hashFunctions; i++) {
            if (!bitSet.get(getHash(value, i))) {
                return false; // Definitely not present
            }
        }
        return true; // Maybe present
    }

    // Simplified hash function simulation
    private int getHash(String value, int seed) {
        return Math.abs((value.hashCode() + seed * 31) % size);
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(100, 3);

        filter.add("user:123");

        System.out.println("Contains user:123? " + filter.mightContain("user:123")); // true
        System.out.println("Contains user:999? " + filter.mightContain("user:999")); // false (probably)
    }
}
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

type BloomFilter struct {
	bitSet []bool
	size   uint
	hashes uint
}

func NewBloomFilter(size uint, hashes uint) *BloomFilter {
	return &BloomFilter{
		bitSet: make([]bool, size),
		size:   size,
		hashes: hashes,
	}
}

func (bf *BloomFilter) Add(value string) {
	for i := uint(0); i < bf.hashes; i++ {
		idx := bf.getHash(value, i)
		bf.bitSet[idx] = true
	}
}

func (bf *BloomFilter) MightContain(value string) bool {
	for i := uint(0); i < bf.hashes; i++ {
		idx := bf.getHash(value, i)
		if !bf.bitSet[idx] {
			return false // Definitely not present
		}
	}
	return true // Maybe present
}

// getHash uses FNV hash with a seed variation
func (bf *BloomFilter) getHash(value string, seed uint) uint {
	h := fnv.New32a()
	h.Write([]byte(value))
	hashVal := h.Sum32()
	return uint(math.Abs(float64(hashVal+uint32(seed)*31))) % bf.size
}

func main() {
	filter := NewBloomFilter(100, 3)

	filter.Add("user:123")

	fmt.Printf("Contains user:123? %v\n", filter.MightContain("user:123"))
	fmt.Printf("Contains user:999? %v\n", filter.MightContain("user:999"))
}

Summary

  • Reads are complex but optimized to minimize disk I/O.
  • Bloom Filters are the first line of defense, filtering out SSTables that definitely don’t have the data.
  • SSTables contain multiple components (Index, Summary, Filter) to speed up lookups.
  • False Positives in Bloom Filters are acceptable because they only cost a disk seek, whereas False Negatives are impossible (ensuring correctness).