LSM Trees & The Write Path

Log-Structured Merge-Trees (LSM Trees) are the beating heart of Cassandra’s storage engine. Unlike B-Trees used in traditional relational databases (which optimize for read performance by updating data in place), LSM Trees are optimized for write throughput.

In this chapter, we’ll dissect the write path, understand why Cassandra writes are so fast, and visualize how data moves from memory to disk.

The Write Path: A High-Level Overview

When a write request hits a coordinator node, it is forwarded to the appropriate replicas. On each replica node, the write operation follows a specific path designed for durability and speed:

  1. Commit Log: The data is appended to the Commit Log on disk for durability.
  2. MemTable: The data is written to an in-memory structure called the MemTable.

That’s it! The write is acknowledged as successful once these two steps are complete. This is why Cassandra writes are often described as “O(1)” or constant time complexity relative to the dataset size.
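The two steps above can be sketched as a single replica-side write function. This is a minimal sketch, not Cassandra's actual code: `CommitLog` and `MemTable` here are illustrative stand-ins, with an in-memory buffer playing the role of the on-disk log.

```go
package main

import (
	"bytes"
	"fmt"
)

// CommitLog is a stand-in for an append-only log; here it appends
// to an in-memory buffer instead of an on-disk file.
type CommitLog struct{ buf bytes.Buffer }

func (c *CommitLog) Append(key, value string) {
	// Sequential append: durability comes first
	fmt.Fprintf(&c.buf, "%s=%s\n", key, value)
}

// MemTable is a stand-in for the sorted in-memory buffer.
type MemTable struct{ data map[string]string }

// Write performs the replica-side write path: commit log first,
// then the MemTable; the write is acknowledged once both succeed.
func Write(log *CommitLog, mem *MemTable, key, value string) {
	log.Append(key, value) // 1. durability (sequential disk append)
	mem.data[key] = value  // 2. fast in-memory write
	// 3. acknowledge success back to the coordinator
}
```

Note that neither step reads or reorganizes existing data, which is exactly why the cost stays constant as the dataset grows.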

Wait, what about disk I/O?

The Commit Log write is sequential (append-only), which is extremely fast on both HDD and SSD. The MemTable write is in-memory. The expensive operation of organizing data on disk happens asynchronously in the background.

1. Hardware Reality: Why Sequential I/O Wins

To understand why LSM Trees are faster than B-Trees for writing, we must look at the hardware.

The Physics of Spinning Disks (HDD)

  • Seek Time (Random I/O): Moving the physical read/write head to the correct track takes ~5-10ms, which limits an HDD to roughly 100-200 IOPS (I/O Operations Per Second).
  • Sequential Throughput: Once the head is positioned, reading/writing continuous blocks is fast (~100-200 MB/s).
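A quick back-of-the-envelope calculation makes the gap concrete. Using the HDD figures above and assuming a 4 KB page per random write (an illustrative page size, not a fixed constant):

```go
package main

// RandomThroughputMBps estimates the effective write throughput when
// every page written costs one seek: iops * pageSizeBytes, in MB/s.
func RandomThroughputMBps(iops, pageSizeBytes float64) float64 {
	return iops * pageSizeBytes / (1024 * 1024)
}
```

At 200 IOPS and 4 KB pages this comes out to about 0.78 MB/s of random-write throughput, versus 100-200 MB/s sequential: a gap of more than two orders of magnitude on the same disk.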

B-Trees (Relational DBs)

When you update a row in a B-Tree (like MySQL or Postgres), the database often has to modify a page in the middle of a file. This triggers Random I/O. If you insert 1000 rows, the disk head might have to jump to 1000 different locations.

LSM Trees (Cassandra)

LSM Trees turn random writes into Sequential Writes.

  1. Append Only: We only ever append to the end of the Commit Log.
  2. In-Memory Sort: We sort data in RAM (MemTable).
  3. Sequential Flush: When we flush to disk, we write the entire sorted MemTable as one contiguous block (SSTable).

Result: Cassandra can saturate the disk’s write bandwidth, achieving tens of thousands of writes per second per node, whereas a B-Tree based system might bottleneck at a few hundred/thousand IOPS on the same hardware.

2. The Commit Log

The Commit Log is a crash-recovery mechanism. Since MemTables are stored in RAM, a server power failure would cause data loss. To prevent this, every write is first appended to the Commit Log on disk.

  • Structure: Append-only log.
  • Purpose: Durability.
  • Lifecycle: Writes are appended immediately; the log is synced to disk either periodically or per batch, depending on the commitlog_sync configuration. Log segments are recycled once the corresponding MemTable data has been flushed to SSTables.

3. The MemTable

The MemTable is an in-memory write buffer, typically implemented as a skip list or another sorted map. It stores writes in sorted order (by partition key, then clustering key).

  • Structure: Sorted Map / Skip List.
  • Purpose: Accumulate and sort writes before flushing them to disk.
  • Lifecycle: When a MemTable reaches a size threshold (e.g., 256MB), it is marked immutable and flushed to disk as an SSTable, while a new MemTable is created for incoming writes.

4. SSTables (Sorted String Tables)

When a MemTable is flushed, it becomes an SSTable on disk. SSTables are:

  • Immutable: Once written, they are never modified.
  • Sorted: Partitions are ordered by token (derived from the partition key), and rows within a partition by clustering key.
  • Persistent: Stored on disk.
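Flushing can be sketched as serializing a MemTable snapshot in key order as one sequential write; after that, the resulting "SSTable" is never modified. A sketch writing key=value records to a string instead of a disk file:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// FlushToSSTable serializes a MemTable snapshot as a sorted,
// immutable run of key=value records (one sequential write).
func FlushToSSTable(mem map[string]string) string {
	keys := make([]string, 0, len(mem))
	for k := range mem {
		keys = append(keys, k)
	}
	sort.Strings(keys) // SSTables are sorted by key
	var sb strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&sb, "%s=%s\n", k, mem[k])
	}
	return sb.String()
}
```

The sorting here is what makes later reads and compactions cheap: a sorted file supports binary search and merge-sort-style merging.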

The Trade-off: Amplification

Since SSTables are immutable, updating a row doesn’t modify the existing file. Instead, a new version of the row is written to a new SSTable.

  • Write Amplification: Low (Good). We just append.
  • Read Amplification: High (Bad). To read a row, we might need to check multiple SSTables to find the latest version.
  • Space Amplification: High (Bad). We store multiple versions of the same data until compaction runs.

Cassandra accepts higher Read and Space amplification to achieve Peak Write Performance.
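Read amplification shows up directly in a sketch of the read path: with immutable SSTables, a lookup must consult the MemTable and then each SSTable from newest to oldest until it finds the key. This is a simplified model where each SSTable is a plain map; the real engine narrows the candidate files with Bloom filters and index summaries first.

```go
package main

// Get searches the MemTable first, then SSTables from newest to
// oldest; the first hit is the latest version of the row.
func Get(mem map[string]string, sstables []map[string]string, key string) (string, bool) {
	if v, ok := mem[key]; ok {
		return v, true
	}
	// sstables[0] is the newest; older files may hold stale versions
	for _, sst := range sstables {
		if v, ok := sst[key]; ok {
			return v, true
		}
	}
	return "", false
}
```

Worst case, a miss touches every SSTable, which is why the file count matters so much for read latency.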

Interactive Visualizer: The Write Path

See how data moves from the Client to the Commit Log and MemTable, and finally flushes to an SSTable.


Code Implementation: MemTable Simulation

Let’s look at how a simplified MemTable might be structured in code, using a sorted, thread-safe map (ConcurrentSkipListMap in Java) to keep keys ordered.

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

public class SimpleMemTable {
    // ConcurrentSkipListMap keeps keys sorted and is thread-safe
    private final ConcurrentSkipListMap<String, String> data;
    // AtomicLong so concurrent put() calls don't race on the size counter
    private final AtomicLong sizeInBytes;
    private final long flushThreshold;

    public SimpleMemTable(long flushThreshold) {
        this.data = new ConcurrentSkipListMap<>();
        this.flushThreshold = flushThreshold;
        this.sizeInBytes = new AtomicLong(0);
    }

    public void put(String key, String value) {
        data.put(key, value);
        // Simplified size accounting: character counts, not encoded bytes
        if (sizeInBytes.addAndGet(key.length() + value.length()) >= flushThreshold) {
            flush();
        }
    }

    public String get(String key) {
        return data.get(key);
    }

    private void flush() {
        System.out.println("Flushing " + data.size() + " keys to SSTable...");
        // In reality, this would serialize the sorted entries to disk
        // and swap in a fresh MemTable for new writes
        data.clear();
        sizeInBytes.set(0);
    }

    public static void main(String[] args) {
        SimpleMemTable memTable = new SimpleMemTable(100); // Small threshold for demo

        memTable.put("user:101", "Alice");
        memTable.put("user:102", "Bob");
        memTable.put("user:103", "Charlie");

        System.out.println("Read user:101 -> " + memTable.get("user:101"));
    }
}
The same simulation in Go, with a plain map standing in for the sorted structure:

package main

import (
	"fmt"
	"sync"
)

// SimpleMemTable simulates an in-memory sorted storage
type SimpleMemTable struct {
	data           map[string]string // In real Cassandra, this would be a SkipList
	sizeInBytes    int
	flushThreshold int
	mu             sync.RWMutex
}

func NewMemTable(threshold int) *SimpleMemTable {
	return &SimpleMemTable{
		data:           make(map[string]string),
		flushThreshold: threshold,
	}
}

func (m *SimpleMemTable) Put(key, value string) {
	m.mu.Lock()
	defer m.mu.Unlock()

	m.data[key] = value
	m.sizeInBytes += len(key) + len(value)

	if m.sizeInBytes >= m.flushThreshold {
		m.flush()
	}
}

func (m *SimpleMemTable) Get(key string) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	val, ok := m.data[key]
	return val, ok
}

func (m *SimpleMemTable) flush() {
	fmt.Printf("Flushing %d keys to SSTable...\n", len(m.data))
	// Clear map (in reality, we'd swap pointers and write to disk async)
	m.data = make(map[string]string)
	m.sizeInBytes = 0
}

func main() {
	memTable := NewMemTable(50) // Small threshold

	memTable.Put("user:101", "Alice")
	memTable.Put("user:102", "Bob")
	memTable.Put("user:103", "Charlie")

	if val, ok := memTable.Get("user:101"); ok {
		fmt.Printf("Read user:101 -> %s\n", val)
	}
}

Compaction: Taming the Chaos

Since every flush creates a new file, we could end up with thousands of SSTables. This would ruin read performance (imagine checking 1000 files for one key).

Compaction is the background process that merges these files.

  1. Merge Sort: Since SSTables are sorted, merging them is efficient (like the merge step in Merge Sort).
  2. Tombstone Processing: Data marked with tombstones (deletions) is physically removed if it’s older than gc_grace_seconds.
  3. Space Reclamation: Old versions of data are discarded; only the latest timestamp wins.
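The merge step can be sketched as last-write-wins over timestamped cells, with tombstoned keys dropped entirely. This is a simplified model: each SSTable is a map of key to cell, and the real engine performs a streaming merge over sorted files rather than materializing maps; the sketch also assumes gc_grace_seconds has already elapsed for every tombstone.

```go
package main

// Cell is one versioned value; Tombstone marks a deletion.
type Cell struct {
	Value     string
	Timestamp int64
	Tombstone bool
}

// Compact merges SSTables: for each key, the cell with the highest
// timestamp wins, and keys whose winner is a tombstone are removed,
// physically reclaiming their space.
func Compact(sstables ...map[string]Cell) map[string]Cell {
	merged := make(map[string]Cell)
	for _, sst := range sstables {
		for k, c := range sst {
			if prev, ok := merged[k]; !ok || c.Timestamp > prev.Timestamp {
				merged[k] = c // latest timestamp wins
			}
		}
	}
	for k, c := range merged {
		if c.Tombstone {
			delete(merged, k) // the deletion finally becomes physical
		}
	}
	return merged
}
```

After compaction, the output SSTable holds exactly one version of each surviving key, which is what restores read performance.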

[!TIP] Leveled Compaction Strategy (LCS) is ideal for read-heavy workloads as it guarantees a key exists in only one SSTable per level (mostly), while Size-Tiered Compaction Strategy (STCS) is better for write-heavy workloads but has higher read amplification.

Summary

  • Writes are fast because they only hit the Commit Log (disk sequential) and MemTable (RAM).
  • SSTables are immutable files on disk.
  • Compaction runs in the background to merge SSTables and reclaim space.
  • The architecture prioritizes Partition Tolerance and Availability (AP) while offering tunable Consistency.