The Disk Structure
Kafka is famous for being able to saturate the network while writing to disk. How? By relying on sequential I/O and the OS Page Cache.
When you create a topic my-topic with 3 partitions, Kafka creates 3 directories on disk:
```
my-topic-0/
my-topic-1/
my-topic-2/
```
Inside each directory, data is split into Segments.
1. Log Segments
Kafka doesn’t store all messages in one massive file (which would be hard to purge). Instead, it “rolls” a new segment file once the current one reaches a size limit (`segment.bytes`, default 1GB) or an age limit (`segment.ms`, default 7 days).
A segment consists of:
- `00000.log`: The actual messages.
- `00000.index`: Maps Offsets → Physical Byte Position.
- `00000.timeindex`: Maps Timestamp → Offset.
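Segment files are named after the offset of the first message they contain, zero-padded to 20 digits. A minimal sketch of that naming scheme:

```python
def segment_name(base_offset, suffix="log"):
    """Build a Kafka-style segment file name: the base offset of the
    segment's first message, zero-padded to 20 digits."""
    return f"{base_offset:020d}.{suffix}"

print(segment_name(0))              # 00000000000000000000.log
print(segment_name(4000, "index"))  # 00000000000000004000.index
```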
[!TIP] Why Sequential I/O? Hard disks (and even SSDs) are fastest when writing linearly. Kafka is an “Append Only” log, which means it avoids slow random writes.
2. Hardware Reality: Zero Copy & Page Cache
How does Kafka send data so fast? It avoids copying bytes into the JVM (“User Space”).
The sendfile System Call
Traditional data transfer involves 4 copies:
- Disk → Kernel Buffer
- Kernel Buffer → Application Buffer (User Space)
- Application Buffer → Socket Buffer (Kernel Space)
- Socket Buffer → NIC Buffer
Kafka uses the sendfile (Zero Copy) system call:
- Disk → Kernel Buffer (Page Cache)
- Kernel Buffer → NIC Buffer
Result: Data never enters the JVM heap. This reduces CPU usage and Garbage Collection overhead significantly.
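Kafka's brokers do this via Java's `FileChannel.transferTo`, but you can observe the same `sendfile(2)` path from Python, whose `socket.sendfile` delegates to `os.sendfile` where the OS supports it. A minimal sketch (the file contents and the socket pair standing in for a broker → consumer connection are illustrative):

```python
import os
import socket
import tempfile

# Write sample data to a file -- a stand-in for a Kafka log segment.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"kafka-segment-bytes" * 100)
    path = f.name

# A connected socket pair stands in for the broker -> consumer connection.
server, client = socket.socketpair()

with open(path, "rb") as segment:
    # socket.sendfile uses the sendfile(2) syscall where available:
    # bytes flow kernel buffer -> socket without entering user space.
    sent = server.sendfile(segment)
server.close()

# Drain the receiving end to confirm the bytes arrived intact.
received = b""
while chunk := client.recv(65536):
    received += chunk
client.close()
os.unlink(path)

print(sent)  # 1900 bytes transferred, none of them copied through user space
```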
The Page Cache
Kafka relies heavily on the OS Page Cache (RAM).
- Write: Kafka writes to the filesystem cache. The OS flushes to disk in the background.
- Read: If consumers are caught up, they read directly from RAM (Page Cache).
- Impact: You don’t need a massive JVM Heap. Give memory to the OS instead!
3. The Index: Fast Lookups
If a consumer asks for “Offset 5000”, Kafka doesn’t scan the whole .log file.
- It checks the Offset Index (`.index`).
- The index is a Sparse Index. It does not have an entry for every message; it might have entries for offsets 4000, 4100, 4200, and so on.
- Kafka performs a Binary Search on the index to find the nearest position ≤ 5000 (e.g., 4900).
- It jumps to that byte position in the `.log` file and scans forward to find 5000.
Why Sparse? It keeps the index small enough to fit entirely in RAM (Page Cache).
4. Log Compaction
By default, Kafka deletes old segments (cleanup.policy=delete). But what if you want to keep the latest state for every key forever?
Log Compaction (cleanup.policy=compact) ensures that Kafka retains at least the last known value for each message key.
- Use Case: Restoring state (e.g., a “User Profile” table). You don’t care about the user’s address 5 years ago, only the current one.
- Mechanism: A background thread (Cleaner) scans the log and removes records where a newer record with the same key exists.
[!WARNING] Tombstones: To delete a key in a compacted topic, you write a message with a `null` value (a tombstone). Kafka eventually removes the key entirely.
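The cleaner's effect can be modeled in a few lines. This is a simplified sketch (the real cleaner works segment by segment and retains tombstones for a configurable period before dropping them); the keys and values are made up:

```python
def compact(records):
    """Model of log compaction: keep only the newest value per key.
    A value of None models a tombstone; after its retention period,
    the cleaner drops the key entirely."""
    latest = {}
    for key, value in records:  # records arrive in offset order, oldest first
        latest[key] = value     # a newer record shadows the older one
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user-1", "addr=Old St"),
    ("user-2", "addr=Elm St"),
    ("user-1", "addr=New St"),  # shadows user-1's old address
    ("user-2", None),           # tombstone: delete user-2
]
print(compact(log))  # {'user-1': 'addr=New St'}
```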
5. Interactive: Index Lookup Simulator
Simulate how Kafka finds a message. It performs a Binary Search on the sparse index to find the closest offset, then scans the log.
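The simulator's logic can be sketched in a few lines of Python. The index entries below are illustrative values, not real Kafka internals:

```python
import bisect

# A sparse offset index: (offset, byte position in the .log file).
# Only some offsets have entries -- that's what keeps the index small.
sparse_index = [(4000, 0), (4100, 4096), (4200, 8192),
                (4400, 16384), (4900, 36864)]

def lookup(target_offset):
    """Binary-search the sparse index for the nearest entry <= target.
    Kafka would then seek to that byte position in the .log file and
    scan forward until it reaches the target offset."""
    offsets = [entry[0] for entry in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        return None  # target precedes the first indexed offset
    return sparse_index[i]

print(lookup(5000))  # (4900, 36864): start the forward scan here
```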
6. Code: Configuring Retention
Here is how you configure log compaction and retention policies.
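A sketch using Kafka's bundled CLI tools (the topic name `user-profiles` and the broker address are placeholders):

```shell
# Create a compacted topic -- Kafka keeps at least the latest value per key.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic user-profiles \
  --partitions 3 --replication-factor 1 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5

# Or change retention on an existing topic:
# keep data 7 days (retention.ms) and roll segments at 1GB (segment.bytes).
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name my-topic \
  --add-config retention.ms=604800000,segment.bytes=1073741824
```

Note that compaction only runs on closed segments, so a smaller `segment.bytes` makes the cleaner kick in sooner at the cost of more files.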