The Distributed Commit Log

At its heart, Apache Kafka is not a traditional message queue. It is a Distributed Commit Log.

In a traditional queue (like RabbitMQ), when a message is consumed, it is deleted. In Kafka, messages are appended to a log and stay there until a retention policy expires them (e.g., 7 days).

Consumers track their own progress using an Offset (a bookmark). This allows different consumers to read the same stream at different speeds and replay history if needed.

1. Topics: The Logical Stream

A Topic is a category or feed name to which records are published.

  • Logical: To the user, it looks like a single stream of data (e.g., user-clicks, payment-processed).
  • Physical: Under the hood, a topic is split into Partitions.

2. Partitions: The Unit of Scalability

This is the single most important concept in Kafka.

If a topic were stored on a single machine, it would be limited by that machine’s I/O and storage. To solve this, Kafka breaks a topic into Partitions (P0, P1, P2…).

  • Distributed: Partitions are spread across different brokers in the cluster.
  • Parallelism: You can have as many consumers processing a topic as you have partitions.
  • Ordering: Guaranteed ONLY within a partition. There is no global ordering across the entire topic.

[!IMPORTANT] Why Partitions? Partitions allow Kafka to scale horizontally. If you need to handle 10 GB/sec of data, you simply add more partitions and more brokers.

Analogy: The Supermarket Checkout

Imagine a supermarket with a single checkout lane. No matter how fast the cashier works, there is a physical limit to how many customers can be served per minute.

  • One Lane (No Partitions): Slow. Customers queue up.
  • 10 Lanes (10 Partitions): Fast. 10 cashiers work in parallel.
  • Consumer Group: The team of cashiers.
  • Partitioning Key: How you decide which lane to join (e.g., “10 items or less” lane, or “A-M” surnames).

3. Hardware Reality: The Physics of Sequential I/O

Why is Kafka so fast? Why does it use a log instead of a B-Tree like a relational database?

The answer lies in the physics of hard drives (and even SSDs).

  • Random Access: Jumping around the disk requires the drive head to physically move (seek time). This is slow (milliseconds).
  • Sequential Access: Writing to the end of a file (appending) is incredibly fast because the drive head doesn’t need to move.

[!NOTE] The Numbers:

  • Random I/O: ~100 IOPS (Input/Output Operations Per Second) on HDD.
  • Sequential I/O: ~100MB/sec to 500MB/sec.
  • Kafka exploits Sequential I/O. By strictly appending to the end of partition logs, Kafka turns disk writes into O(1) operations, achieving speeds comparable to network throughput.

4. Offsets: The Immutable Sequence

Each message within a partition is assigned a unique integer ID called an Offset.

  • Immutable: Once written, a message never changes.
  • Monotonic: Offsets strictly increase (0, 1, 2, 3…).
  • Local: Offset 5 in Partition 0 is completely different from Offset 5 in Partition 1.

5. Message Keys & Partitioning

How does Kafka decide which partition a message goes to?

Strategy A: Round Robin (No Key)

If you send a message with key=null, Kafka distributes it in a round-robin fashion (or using the Sticky Partitioner for efficiency) to balance the load across all partitions.

  • Pros: Maximize load balancing.
  • Cons: No ordering guarantee for related messages.

Strategy B: Semantic Partitioning (With Key)

If you provide a key (e.g., user_id=101), Kafka guarantees that all messages with the same key go to the same partition.

  • Mechanism: Partition = Hash(Key) % NumPartitions
  • Benefit: Strict ordering for that specific user. User 101’s “Login” will always appear before their “Logout”.

[!WARNING] The Hot Partition Problem: If one key is extremely popular (e.g., “Justin Bieber” in a twitter stream), that specific partition will be overwhelmed while others are idle. Choose your keys carefully!

6. Consumer Groups

A Consumer Group is a set of consumers acting as a single logical subscriber.

  • Load Balancing: Kafka automatically assigns partitions to consumers in the group.
  • Rule of 1: A partition is consumed by exactly one consumer within a group.
  • Scalability Limit: You cannot have more active consumers than partitions. If you have 10 partitions and 11 consumers, 1 consumer will sit idle.

Fan-Out Architecture

Multiple different consumer groups can read from the same topic independently.

  • Group A (Fraud Service): Reads payments to detect fraud.
  • Group B (Analytics): Reads payments to build dashboards.
  • Both groups see every message, but manage their own offsets.

7. Interactive: The Partitioning Simulator

Visualise how keys determine message placement. Note how null keys cycle round-robin, while specific keys stick to specific partitions.

Producer
🏭
P0
P1
P2
System Ready. Select a key to begin.

8. Code: Controlling Partitioning

Here is how you control partition assignment using the Producer API.

```java // Java: Sending a message with a Key (Strict Ordering) import java.util.Properties; import org.apache.kafka.clients.producer.KafkaProducer; import org.apache.kafka.clients.producer.Producer; import org.apache.kafka.clients.producer.ProducerRecord; import org.apache.kafka.common.serialization.StringSerializer; public class ProducerExample { public static void main(String[] args) { Properties props = new Properties(); props.put("bootstrap.servers", "localhost:9092"); props.put("key.serializer", StringSerializer.class.getName()); props.put("value.serializer", StringSerializer.class.getName()); // Use try-with-resources to ensure the producer is closed try (Producer<String, String> producer = new KafkaProducer<>(props)) { String key = "user_123"; // Messages with this key always go to the same partition String value = "login_event"; // Send with Key producer.send(new ProducerRecord<>("my-topic", key, value)); // Send without Key (Round Robin / Sticky) producer.send(new ProducerRecord<>("my-topic", "some_random_event")); } } } ```