Kafka (Wait-Free Architecture)
[!TIP] Dumb Broker, Smart Consumer: Traditional queues (RabbitMQ) track “who read what” on the server. Kafka puts that burden on the client. The broker is just a high-speed append-only log. This design choice allows Kafka to handle massive throughput.
1. Event Streaming vs Message Queues
Kafka is often confused with Message Queues like RabbitMQ or ActiveMQ.
- Message Queue (RabbitMQ): “Smart Broker, Dumb Consumer”. The broker tracks state. Messages are deleted once consumed. Good for task processing (job queues).
- Event Streaming (Kafka): “Dumb Broker, Smart Consumer”. The broker is a storage engine. Messages are retained based on time (e.g., 7 days). Consumers track their own position (Offset). Good for replayability, event sourcing, and high throughput. A minimal replay sketch follows this list.
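Because the broker retains the log regardless of consumption, a consumer can rewind its own offset and re-read history at any time. A minimal sketch with the Java client, assuming a local broker and a hypothetical `user-clicks` topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "replay-demo");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign a partition manually so we control the offset ourselves.
            TopicPartition tp = new TopicPartition("user-clicks", 0);
            consumer.assign(List.of(tp));

            // Rewind to the start of the retained log: consuming earlier
            // never deleted anything, so the history is still there.
            consumer.seekToBeginning(List.of(tp));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
            }
        }
    }
}
```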
2. The Core Abstraction: The Log
A Log is the simplest storage abstraction:
- Append-Only: New data goes to the end. No overwrites.
- Ordered: Event A happened before Event B.
- Immutable: You can’t change history.
Topics & Partitions
Kafka scales by splitting a Topic (e.g., “UserClicks”) into Partitions (Shards).
- Ordering Guarantee: Kafka guarantees order within a partition, but NOT across the whole topic.
- Partitioning Strategy: `hash(UserID) % NumPartitions`. This ensures all events for User A go to the same partition (and stay ordered). See the sketch after this list.
- Offset: A unique ID (integer) for every message in a partition. It is monotonic but not necessarily contiguous (in recent versions).
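A toy sketch of the routing invariant. The real `DefaultPartitioner` hashes the serialized key bytes with murmur2; the class and values below are purely illustrative:

```java
// Toy model of Kafka's key-based routing: same key, same partition, always.
public class Partitioning {
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Every event for "user-a" lands on the same partition => stays ordered.
        System.out.println(partitionFor("user-a", 12));
        System.out.println(partitionFor("user-a", 12)); // same result, every time
    }
}
```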
3. Storage Internals (Segments & Indexes)
Kafka achieves high-performance lookups and transfers by combining Sparse Indexing with Zero-Copy data paths, as visualized in the storage engine diagram below.
- The `.log` file: Stores the raw message blocks (blue).
- The `.index` file: A Sparse Index (gold) that maps logical offsets to physical byte positions on disk.
  - Efficiency: Instead of mapping every offset, it stores one entry per chunk of data (e.g., every 4 KB).
- The Search: To find a specific message, Kafka finds the nearest preceding offset in the index, jumps to that byte position in the log, and scans sequentially. A lookup sketch follows this list.
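A toy model of the two-step lookup, with an in-memory `TreeMap` standing in for the `.index` file (entries and sizes are illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of one segment's sparse .index: only every Nth offset has an entry.
public class SparseIndex {
    private final TreeMap<Long, Long> offsetToPosition = new TreeMap<>();

    void add(long offset, long bytePosition) { offsetToPosition.put(offset, bytePosition); }

    // Step 1: find the greatest indexed offset <= target. From that byte
    // position, Kafka scans the .log file forward sequentially (step 2).
    long startScanPosition(long target) {
        Map.Entry<Long, Long> e = offsetToPosition.floorEntry(target);
        return e == null ? 0L : e.getValue();
    }

    public static void main(String[] args) {
        SparseIndex idx = new SparseIndex();
        idx.add(0, 0);        // one entry per ~4 KB chunk, not per message
        idx.add(100, 4096);
        idx.add(200, 8192);
        System.out.println(idx.startScanPosition(137)); // 4096: scan forward from there
    }
}
```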
4. Why is Kafka so Fast?
Kafka pushes millions of messages per second on spinning disks. The secret is doing Less Work.
- Sequential I/O: Random disk access is slow (the head must seek). Sequential access is incredibly fast (in some benchmarks, faster than random RAM access).
- Zero Copy (sketched after this list):
  - Standard I/O: The kernel reads data from disk into the Page Cache, copies it to User Space (the application), the application copies it back to Kernel Space (the Socket Buffer), and finally it goes to the NIC. (4 copies, 4 context switches.)
  - Zero Copy (`sendfile`): The kernel transfers data directly from the Page Cache to the NIC buffer (via DMA). The data never enters application (JVM) memory.
  - Result: CPU usage drops by 60%, and garbage collection is non-existent for data transfer.
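On the JVM, this path is exposed as `FileChannel.transferTo`, which maps to `sendfile(2)` on Linux. A standalone sketch; the segment path and destination socket are assumptions:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Hypothetical segment file and destination; adjust for your environment.
        Path segment = Path.of("/tmp/00000000000000000000.log");
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            // transferTo moves bytes from the page cache to the socket without
            // ever copying them into JVM heap memory.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```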
Interactive Demo: Zero Copy Racer
Visualize the difference in CPU cycles and context switches.
- Top Lane: Standard I/O (The “Taxed” Path).
- Bottom Lane: Zero Copy (The “Gold” Path).
5. Reliability: ISR and High Watermark
How do we ensure data durability without sacrificing speed?
- Leader: Handles writes.
- Follower: Passively replicates data.
- ISR (In-Sync Replicas): Replicas that are “caught up” with the leader.
- High Watermark (HW): The offset of the last message successfully replicated to all ISRs.
- Safety: Consumers can only read up to the HW. This prevents “Ghost Reads” (reading data that gets lost if the leader crashes).
- Acks (see the producer sketch below):
  - `acks=0`: Fire and forget. Fast, unsafe.
  - `acks=1`: Leader ack. Balanced.
  - `acks=all`: All ISRs ack. Slow, safe.
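A minimal producer sketch showing the durability knob; the broker address and `billing-events` topic are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader waits for every in-sync replica before
        // acknowledging, so an acknowledged write survives a leader crash.
        // (Pair with min.insync.replicas on the topic to bound the risk.)
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("billing-events", "user-42", "charge:9.99"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // write was NOT safely replicated
                        }
                    });
        }
    }
}
```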
Interactive Demo: The High Watermark
Visualize how data becomes “Visible” to consumers only after replication.
- Produce: Send messages to Leader.
- Replicate: Follower pulls data.
- Commit: HW moves. Consumers can now read.
6. Consumer Groups (The Secret Weapon)
This is how Kafka scales reads horizontally.
- Consumer Group: A logical application (e.g., “BillingService”).
- Rule: Each partition is consumed by exactly one consumer in the group.
- Example: If you have 4 partitions and 2 consumers, each gets 2 partitions. If you have 5 consumers, 1 is idle!
Rebalancing
When a consumer joins or leaves, Kafka must redistribute partitions.
- Stop-the-World (Eager): All consumers stop reading, give up partitions, and rejoin. (Latency spike).
- Cooperative (Incremental): Only the partitions that move are revoked; consumers keep reading untouched partitions. A group-member sketch follows.
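A minimal group-member sketch; broker, group, and topic names are assumptions. Note the opt-in to the cooperative assignor:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "billing-service");         // the consumer group
        // Incremental (cooperative) rebalancing: only moved partitions pause.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-clicks"));
            while (true) {
                // The group coordinator assigns this member a subset of partitions;
                // each partition is owned by exactly one member of "billing-service".
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("p%d@%d: %s%n", r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```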
Interactive Demo: Consumer Rebalancing
Visualize how partitions are reassigned when consumers join or leave.
- Goal: Ensure every partition has a consumer.
- Idle: If Consumers > Partitions, some consumers are idle.
7. Exactly-Once Semantics (EOS)
“Exactly-Once” is the holy grail. Kafka supports it via:
- Idempotent Producer: The producer assigns a Sequence Number to every batch. If the broker sees a duplicate sequence number, it ignores it.
- Transactions: You can write to multiple partitions atomically. “All or Nothing”. This is critical for “Consume-Process-Produce” loops (e.g., Kafka Streams), as sketched below.
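A sketch of the consume-process-produce loop under transactions; the broker address, topic names, `transactional.id`, and the `process` function are assumptions:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ConsumeProcessProduce {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");     // assumed broker address
        p.put("transactional.id", "billing-processor-1"); // enables transactions
        p.put("enable.idempotence", "true");              // sequence-number dedup
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(p);

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "billing-service");
        c.put("isolation.level", "read_committed");       // skip aborted transactions
        c.put("enable.auto.commit", "false");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
        consumer.subscribe(List.of("orders"));

        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("invoices", r.key(), process(r.value())));
                }
                // Commit consumed offsets inside the SAME transaction: output and
                // progress become visible atomically ("all or nothing").
                producer.sendOffsetsToTransaction(nextOffsets(records), consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction();              // reprocess on next poll
            }
        }
    }

    static String process(String v) { return "invoice:" + v; } // stand-in logic

    static Map<TopicPartition, OffsetAndMetadata> nextOffsets(ConsumerRecords<String, String> recs) {
        Map<TopicPartition, OffsetAndMetadata> out = new HashMap<>();
        for (TopicPartition tp : recs.partitions()) {
            List<ConsumerRecord<String, String>> prs = recs.records(tp);
            out.put(tp, new OffsetAndMetadata(prs.get(prs.size() - 1).offset() + 1));
        }
        return out;
    }
}
```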
8. Log Compaction
Normally, Kafka deletes old data after 7 days (Retention Policy). Log Compaction is different. It retains the latest value for every key.
- Use Case: Restoring a database state (CDC).
- How: If key `A` has values `v1`, `v2`, `v3`, compaction deletes `v1` and `v2`. The log effectively becomes a Snapshot of the final state. A topic-creation sketch follows.
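A sketch of creating a compacted topic with the `AdminClient`; the topic name, partition count, and replication factor are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the LATEST value per key instead of
            // deleting by age: the topic becomes a durable key/value snapshot.
            NewTopic topic = new NewTopic("user-profiles", 6, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```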
9. Case Study: Uber’s Trillion Message Pipeline
Uber operates one of the largest Kafka deployments in the world, powering everything from dynamic pricing to fraud detection.
1. Scenario
- Scale: Trillions of messages per day across thousands of topics.
- Users: Driver App, Rider App, Eats, Freight.
- Requirement: Real-time dispatching, Active-Active redundancy between regions, and zero data loss for billing.
2. The Challenge
- Head-of-Line Blocking: One bad message (a “poison pill”) can halt processing for an entire partition.
- Replication Lag: Synchronizing state between US-West and US-East in real-time.
- Thundering Herd: If a cluster fails, reconnecting thousands of microservices simultaneously can DDoS the backup cluster.
3. The Decision
Why did Uber choose Kafka over RabbitMQ or a REST-based architecture?
| Feature | REST / HTTP | RabbitMQ | Kafka (Chosen) |
|---|---|---|---|
| Throughput | Low (Sync overhead) | Medium | Extreme (Seq I/O) |
| Persistence | None (Ephemeral) | Short-term | Long-term (Replayable) |
| Ordering | No | Weak | Strict (Per partition) |
| Backpressure | Circuit Breakers needed | Limited by RAM | Native (Pull-based) |
4. The Decision Visualizer
Kafka acted as the “Universal Buffer” allowing producers to write at any speed, independent of consumer capacity.
5. Architecture
Uber employs a Federated Kafka architecture to solve global scale.
- Regional Clusters: “Edge” clusters in each data center accept local writes from services.
- uReplicator: A custom replication solution (superior to standard MirrorMaker) to bridge Regional Clusters to the Core Aggregation Cluster.
- Chaperone: An audit service that verifies every message produced is eventually consumed (completeness check).
6. Core Abstraction
The Log is the single source of truth.
- Immutable History: Every trip request, GPS ping, and transaction is an event in the log.
- Materialized Views: Downstream databases (Cassandra/MySQL) are just projections of the Kafka log.
7. Deep Dive: Dead Letter Queues (DLQ)
Handling “Poison Pills” (malformed messages that crash consumers) without stopping the world.
- Fast Lane: The main topic. 99.9% of messages flow here.
- Retry Lane: If consumption fails, republish the message to a “Retry Topic” with a delay.
- DLQ: If retries fail N times, move to a Dead Letter Queue for manual inspection.
- Outcome: The partition never blocks. A minimal routing sketch follows.
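A minimal sketch of the failure path; the topic names (`orders.retry`, `orders.dlq`), the retry limit, and the header name are assumptions, and the happy path is omitted:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Route a failing record out of the partition instead of blocking on it.
class DlqRouter {
    static final int MAX_RETRIES = 3; // assumed policy
    private final KafkaProducer<String, String> producer;

    DlqRouter(KafkaProducer<String, String> producer) { this.producer = producer; }

    void handleFailure(ConsumerRecord<String, String> record, int attempts) {
        // Retry a few times on a side topic, then park the poison pill in the DLQ.
        String target = attempts < MAX_RETRIES ? "orders.retry" : "orders.dlq";
        ProducerRecord<String, String> out =
                new ProducerRecord<>(target, record.key(), record.value());
        // Carry the attempt count in a header so the retry consumer can count up.
        out.headers().add("retry-attempts",
                Integer.toString(attempts + 1).getBytes(StandardCharsets.UTF_8));
        producer.send(out);
        // The caller then commits the original offset: the main partition keeps moving.
    }
}
```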
8. Trade-offs
- Latency vs. Durability: Uber configures `acks=all` for Billing events (Zero Loss, higher latency) but `acks=1` for GPS pings (Speed over safety).
- Complexity: Managing thousands of topics requires automated tooling (uReplicator), which adds operational overhead.
9. Results
- 99.99% Uptime: Even during regional outages, the buffer allows services to queue data until systems recover.
- Decoupling: New teams can build “Fraud Detection” features by simply subscribing to existing topics without asking the Driver Team for API access.
10. Future
- Tiered Storage: Offloading old segments to S3 to save costs while keeping data queryable.
- Kafka on Kubernetes: Moving from bare metal to K8s for better resource utilization.
10. Observability & Tracing
Kafka is complex. Visibility is mandatory.
RED Method
- Rate: Messages/sec (In/Out).
- Errors: Request Timed Out, Not Leader For Partition.
- Duration: Request Latency (P99).
Critical Kafka Metrics
- Consumer Lag: `Max(LogEndOffset - CurrentOffset)`. The most important metric. If this grows, consumers are falling behind (see the sketch after this list).
- Under Replicated Partitions (URP): Number of partitions where `ISR count < replication factor`. If > 0, you are at risk of data loss.
- ISR Shrink/Expand: Flapping ISRs indicate network issues or slow disks.
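A sketch of computing lag from inside a consumer process, using the consumer's own `endOffsets` view (production monitoring typically uses the `AdminClient` or an external exporter instead):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Lag per partition = LogEndOffset - CurrentOffset.
class LagCheck {
    static Map<TopicPartition, Long> lag(KafkaConsumer<?, ?> consumer, Set<TopicPartition> parts) {
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(parts); // LogEndOffset
        Map<TopicPartition, Long> result = new HashMap<>();
        for (TopicPartition tp : parts) {
            long current = consumer.position(tp);         // next offset we will read
            result.put(tp, endOffsets.get(tp) - current); // grows => falling behind
        }
        return result;
    }
}
```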
11. Deployment Strategy
Rolling Restarts
- Why? Config changes, upgrades, OS patches.
- Procedure: Restart one broker at a time.
- Controlled Shutdown: Broker sends signal to Controller to migrate leaderships before stopping. Reduces “Not Leader” errors.
- Restart: Broker comes back.
- Catch Up: Wait for it to rejoin ISR.
- Next: Move to next broker.
Topic Config Changes
- Dynamic: Retention, Partition Count (Increase only); applied live, as sketched below.
- Static: Message Format Version (requires restart).
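A sketch of a live (dynamic) config change via `incrementalAlterConfigs`; the topic name and retention value are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class BumpRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "user-clicks");
            // Set retention to 3 days (in ms); applied live, no broker restart.
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```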
12. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Extreme Throughput | Sequential I/O + Zero Copy (sendfile). |
| Scalability | Partitioning (Horizontal Scale) + Consumer Groups. |
| Durability | Replication (ISR) + WAL. |
| Decoupling | Log abstraction (Broker doesn’t know Consumer). |
| Replayability | Time-based retention (e.g., 7 days). |
13. Interview Gauntlet
I. Internals
- Explain Zero Copy. It avoids copying data to User Space: Disk -> Page Cache -> NIC.
- What is the High Watermark? The offset of the last message replicated to all ISRs. Consumers cannot read past it.
- Pull vs Push? Kafka is Pull. Consumers control the rate (Backpressure).
II. Scalability
- How to handle a Slow Consumer? Add more partitions and consumers. Or optimize the consumer processing logic.
- Can you decrease partitions? No. This would break data ordering and hashing. You must create a new topic.
III. Failure Modes
- What if Zookeeper dies? Old Kafka stops working (Controller election fails). New Kafka (KRaft) removes ZK dependency.
- Message Duplication? Use the Idempotent Producer (`enable.idempotence=true`).
14. Summary: The Whiteboard Strategy
If asked to design Kafka, draw this 4-Quadrant Layout:
1. Requirements
- Func: Pub/Sub, Replay, Order.
- Non-Func: High Throughput, Durability.
- Scale: Trillions/day.
2. Architecture
- Flow: Topic (Partitions) -> Consumer Groups.
- Broker: Dumb Log Storage.
- ZK/KRaft: Metadata.
3. Storage
- Index: Sparse (Offset -> Byte).
- Message: Key, Value, Timestamp.
4. Performance
- Zero Copy: `sendfile` bypasses user-space copies.
- Sequential I/O: Disk seeks are the enemy.
- Batching: Group and compress messages into sets.
Next, we look at Chubby, the system that coordinates all this madness: Chubby (Distributed Lock).