Kafka & Streaming
In 2011, LinkedIn’s infrastructure team was drowning in data. Their Hadoop jobs processed user activity (profile views, connection clicks, skill endorsements) 24 hours after it happened, using batch ETL pipelines. By the time the “People You May Know” recommendation updated, the underlying data was a day old. The team (Jay Kreps, Neha Narkhede, and Jun Rao) invented Kafka to solve this. Named after the author of “The Metamorphosis” (they liked naming a “write-heavy” system after a writer), Kafka was a fundamentally different idea: instead of a queue that deletes messages after consumption, treat every event as a durable, replayable log. LinkedIn open-sourced it in 2011. By 2019, more than one-third of all Fortune 500 companies ran Kafka. In 2021, Confluent (founded by the Kafka inventors) went public at a $9B valuation. Kafka is now used by over 80% of the companies doing large-scale stream processing, all from a problem LinkedIn had with stale recommendation data.
[!IMPORTANT] In this lesson, you will master:
- The Log Abstraction: Why a simple append-only file is the most powerful weapon in high-throughput distributed systems.
- Hardware Acceleration: Exploiting Sequential Disk I/O, Page Cache, and Zero-Copy to reach 1Gbps+ network saturation.
- Distributed Coordination: Mastering Partitions, Consumer Group Rebalancing, and In-Sync Replicas (ISR) for high availability.
1. The Beginner’s Guide: What is Kafka?
If you’ve never used Kafka, the easiest way to understand it is to think of a Diary (or a Tape Recorder).
When you write in a diary:
- You append new entries to the end of the book.
- You never erase or modify old entries (they are historical facts).
- Anyone can read your diary from page 1 to the end, or skip to chapter 5 and start reading from there.
This is exactly how Kafka works. It is not a traditional “queue” where messages are deleted after being read. It is a Distributed Append-Only Log.
Queue vs Stream
Most developers confuse traditional Message Queues (like RabbitMQ) with Event Streams (like Kafka). They solve fundamentally different problems.
- RabbitMQ (The Mailbox/Queue):
- Model: Smart Broker, Dumb Consumer.
- Behavior: You take a letter out, and it’s gone (deleted).
- Use Case: Task processing, Job Queues (e.g., “Resize this image”). The goal is to get a task done once.
- Kafka (The Diary/Stream):
- Model: Dumb Broker, Smart Consumer.
- Behavior: Messages are appended to a log. You can listen, rewind, and listen again. Messages stick around for days (retention).
- Use Case: Event Sourcing, Analytics, Metrics, Change Data Capture (CDC). The goal is to record a history of facts so multiple different systems can react to them.
[!TIP] Kafka is essentially a distributed Write-Ahead Log (WAL). If you understand how a database persists data (see WAL), you understand Kafka.
2. Architecture: The Log
Kafka’s core abstraction is the Log—an append-only sequence of records.
A. Topic & Partition
A Topic is a named stream of events. For scale, each topic is split into multiple Partitions, and each partition is an independent append-only log.
- Hash Key: `Partition = hash(Key) % NumPartitions`. All messages for `User:123` go to the same partition (e.g., P0).
- Ordering Guarantee: Kafka only guarantees strict ordering within a single partition. There is no global order across the entire topic. If order matters (e.g., sequentially processing `DEBIT` and `CREDIT` events for `user_id=123`), you MUST use a Partition Key to ensure all events for that entity land in the same partition and are processed sequentially by the same consumer.
[!NOTE] Hardware-First Intuition: The “Partition Skew” Hotspot. While Kafka scales horizontally, each Partition is physically stored on one server. If you shard by `Country` and 90% of your users are from the US, one single hardware node (Disk + NIC) will handle 90% of the traffic, while the other 9 nodes sit idle. This is a Hardware Bottleneck caused by bad software logic. Always choose a high-cardinality key (like `user_id`) to ensure your hardware load is evenly distributed across the cluster’s physical backplane.
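To make the formula concrete, here is a minimal, self-contained sketch of key-based partitioning. The `Partitioner` class and its polynomial hash are illustrative inventions; Kafka's default partitioner actually hashes the key bytes with murmur2. The property that matters is the same either way: a stable hash means the same key always lands in the same partition.

```java
import java.nio.charset.StandardCharsets;

public class Partitioner {
    // Illustrative sketch only: a stable hash of the key, modulo the
    // partition count. Kafka's real default partitioner uses murmur2.
    static int partitionFor(String key, int numPartitions) {
        int hash = 0;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + (b & 0xff); // stable polynomial hash
        }
        return (hash & 0x7fffffff) % numPartitions; // drop sign bit, then mod
    }

    public static void main(String[] args) {
        // Every event for user:123 resolves to the same partition,
        // so per-key ordering is preserved.
        int p1 = partitionFor("user:123", 3);
        int p2 = partitionFor("user:123", 3);
        System.out.println(p1 == p2); // true
    }
}
```

Note the sign-bit mask: Java's `%` can return negative values for negative hashes, which would be an invalid partition index.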
B. The Offset
Each message in a partition has a unique ID called the Offset.
- The Consumer tracks what it has read.
- “I’ve read up to Offset 500 in Partition 0.”
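The bookkeeping can be sketched in a few lines. The `OffsetTracker` class below is hypothetical; real consumers commit offsets to Kafka's internal `__consumer_offsets` topic, but the model is the same: the consumer, not the broker, remembers its read position.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetTracker {
    // The consumer tracks "read up to offset X" per partition;
    // the broker keeps no per-consumer read state (dumb broker).
    private final Map<Integer, Long> committed = new HashMap<>();

    void commit(int partition, long offset) {
        committed.put(partition, offset);
    }

    long resumeFrom(int partition) {
        // A brand-new consumer has no committed offset: start at 0.
        return committed.getOrDefault(partition, 0L);
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        tracker.commit(0, 500L);                    // "I've read up to Offset 500 in P0"
        System.out.println(tracker.resumeFrom(0));  // 500
    }
}
```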
C. Consumer Groups (The Secret Sauce)
How do we scale processing? We add consumers to a Consumer Group.
- Rule: A partition is consumed by exactly one consumer in the group.
- Implication: If you have 10 partitions, you can have AT MOST 10 active consumers. The 11th consumer will be idle.
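The assignment rule can be simulated without a broker. This `GroupAssignment` helper is an illustrative round-robin sketch, not Kafka's actual assignor, but it demonstrates why an 11th consumer on a 10-partition topic sits idle:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAssignment {
    // Round-robin sketch of the rule: each partition goes to exactly
    // one consumer in the group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        for (String c : consumers) plan.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            plan.get(consumers.get(p % consumers.size())).add(p);
        }
        return plan;
    }

    public static void main(String[] args) {
        // 10 partitions, 11 consumers: the 11th gets nothing.
        List<String> group = new ArrayList<>();
        for (int i = 1; i <= 11; i++) group.add("C" + i);
        Map<String, List<Integer>> plan = assign(group, 10);
        System.out.println(plan.get("C11").isEmpty()); // true: idle consumer
    }
}
```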
3. Interactive Demo: The Kafka Cluster Simulator
Visualize how Partitions distribute load and how Consumer Groups scale.
[!TIP] Try it yourself: Add Consumers until you have 3 (one for each Partition). Then kill one and watch the Rebalancing process!
4. Under the Hood: Why is Kafka Fast?
Kafka can handle millions of messages/sec on spinning hard disks. How?
A. Sequential I/O
Random Disk Access (seeking) is slow (~100 seeks/sec). Sequential Access is fast (~100s MB/sec). Kafka only appends to the end of the file. It never seeks.
- Result: Disk speeds approach Network speeds.
B. Zero Copy
Standard File Send involves 4 copies and 4 context switches.
- Disk → Kernel Buffer (Read)
- Kernel → User Space (App Buffer)
- User Space → Kernel Socket Buffer (Write)
- Socket Buffer → NIC (Network Card)
Kafka uses the sendfile() system call (Zero Copy):
- Disk → Kernel Buffer
- Kernel Buffer → NIC (via Descriptor)
Visualizing the Savings

| Path | Data Copies | Context Switches |
|---|---|---|
| Standard `read()` + `write()` | 4 | 4 |
| Zero Copy (`sendfile()`) | 2 (DMA copies only) | 2 |
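Java exposes `sendfile()` through `FileChannel.transferTo()`, which is what Kafka's broker uses when serving consumers. The sketch below transfers between two temp files so it stays runnable; in the broker, the destination would be the consumer's `SocketChannel`. The `ZeroCopyDemo` class and its helper names are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    // transferTo() maps to sendfile() on Linux: the kernel moves bytes
    // from the page cache straight to the target channel, never copying
    // them into a user-space buffer.
    static long transfer(Path src, Path dst) {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE,
                                                StandardOpenOption.CREATE)) {
            return in.transferTo(0, in.size(), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo: write a tiny "log segment", then zero-copy it.
    static long demoTransfer() {
        try {
            Path segment = Files.createTempFile("segment", ".log");
            Files.writeString(segment, "offset-0\noffset-1\n"); // 18 bytes
            Path sink = Files.createTempFile("sink", ".log");
            return transfer(segment, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoTransfer() + " bytes sent"); // 18 bytes sent
    }
}
```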
[!NOTE] Hardware-First Intuition: The “Page Cache” vs. “Heap” Strategy. Most Java apps store data in the JVM Heap, causing massive Garbage Collection (GC) pauses. Kafka is smarter: it uses the JVM only for control logic. The actual data (messages) is stored in the OS Page Cache (off-heap memory). When a consumer requests data, the OS sends it directly from the RAM to the NIC via DMA (Direct Memory Access). This means Kafka can handle 100GB of data with only a 4GB JVM heap, effectively bypassing the “Stop-the-world” hardware penalty.
5. Distributed Coordination: KRaft vs Zookeeper
For a decade, Kafka relied on Zookeeper to store cluster metadata (broker lists, partitions, offsets).
- The Zookeeper Bottleneck: Zookeeper is a separate system with its own hardware requirements. When a large cluster rebalanced, the Zookeeper write throughput would become the bottleneck, leading to “split-brain” scenarios or indefinitely long rebalances.
- KRaft (Kafka Raft): KRaft became production-ready in Kafka 3.3, and Kafka 4.0 removes Zookeeper entirely. Metadata is now stored directly inside Kafka in a special metadata partition.
- Consensus: It uses the Raft algorithm (similar to Paxos) to elect a “Metadata Leader” among the brokers.
- Scale: KRaft allows clusters to scale to millions of partitions, whereas Zookeeper capped out at around 200,000.
[!TIP] Staff Engineer Tip: The KRaft Advantage Upgrading to KRaft isn’t just about operational simplicity. It significantly reduces the Recovery Time Objective (RTO). In a Zookeeper-based cluster, a controller crash could take several minutes to recover metadata. In KRaft, the standby controllers are “hot,” meaning failover happens in milliseconds, keeping your hardware utilized even during catastrophic node failures.
6. Durability & Reliability
How do we ensure we never lose data?
A. Replication (ISR)
Each partition has one Leader and multiple Followers.
- Leader: Handles all Reads and Writes.
- Follower: Passively replicates data from the Leader.
- ISR (In-Sync Replicas): The set of followers that are “caught up” with the leader. If a leader dies, only an ISR member can become the new leader.
B. Producer Acks
The producer decides how safe the write must be.
- acks=0 (Fire & Forget): Send and don’t wait. Fastest, but data loss possible if broker crashes immediately.
- acks=1 (Leader Ack): Wait for the Leader to append the record to its log. Safe if the Leader stays alive.
- acks=all (Strongest): Wait for Leader AND all ISRs to acknowledge. Zero data loss.
[!TIP] Trade-off: `acks=all` is slower (higher latency) but safest. Use it for Payments. Use `acks=1` for Metrics.
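In client code the choice is just a producer property. The two profiles below sketch the trade-off; the config keys are real Kafka producer settings, while the `ProducerConfigs` class and profile names are illustrative:

```java
import java.util.Properties;

public class ProducerConfigs {
    // "Payments" profile: maximum durability.
    static Properties durable() {
        Properties p = new Properties();
        p.put("acks", "all");                // wait for the full ISR to ack
        p.put("enable.idempotence", "true"); // retries cannot create duplicates
        return p;
    }

    // "Metrics" profile: lower latency, small loss window on leader failure.
    static Properties fast() {
        Properties p = new Properties();
        p.put("acks", "1");      // leader-only acknowledgement
        p.put("linger.ms", "0"); // send immediately, no batching delay
        return p;
    }

    public static void main(String[] args) {
        System.out.println(durable().getProperty("acks")); // all
        System.out.println(fast().getProperty("acks"));    // 1
    }
}
```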
7. Log Compaction: Keeping the Latest State
Kafka can act as a database (KTable). With Log Compaction, Kafka runs a background process that keeps only the latest value for each key.
Example: User Profiles
- Input Stream:
  - `Key: u_101, Value: {name: "Alice"}`
  - `Key: u_101, Value: {name: "Alice", city: "NYC"}` (Update)
  - `Key: u_102, Value: {name: "Bob"}`
- Compacted Log (What is saved):
  - `Key: u_101, Value: {name: "Alice", city: "NYC"}`
  - `Key: u_102, Value: {name: "Bob"}`
- Use Case: Restoring application state after a crash without replaying years of history.
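Compaction is easy to reason about as "replay the log into a map, keeping the last value per key." The `CompactionSketch` class below is an illustrative model of that semantics; the real log cleaner works segment by segment in the background rather than replaying anything.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {
    // Log compaction keeps only the newest record per key. Replaying a
    // key/value log into a map yields exactly the compacted view.
    static Map<String, String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] kv : log) latest.put(kv[0], kv[1]); // newer overwrites older
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = List.of(
            new String[]{"u_101", "{name: \"Alice\"}"},
            new String[]{"u_101", "{name: \"Alice\", city: \"NYC\"}"}, // update
            new String[]{"u_102", "{name: \"Bob\"}"});
        // Only the latest value per key survives.
        System.out.println(compact(log).size()); // 2
    }
}
```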
[!NOTE] War Story: The New York Times and Log Compaction When The New York Times re-architected their CMS, they didn’t just use Kafka as a temporary message queue. They used Kafka as the Source of Truth for every article published since 1851. By configuring a topic with Log Compaction, they created a permanent, replayable history. If they spin up a new search indexing microservice tomorrow, they simply point it to offset 0 of the compacted topic. It reads chronologically from 1851 to today, building its database without ever querying the primary monolithic database. Kafka isn’t just moving their data; it is their database.
8. Deep Dive: Exactly-Once Semantics (Transactions)
Kafka supports Exactly-Once Semantics (EOS) for critical use cases like payments, order processing, and financial transactions.
The Problem: Duplicates from Retries
Scenario: Producer sends OrderCreated event. Network hiccups. Producer retries. Kafka gets it twice.
Without EOS:
Producer → [OrderCreated] → Kafka (Ack Lost)
Producer → [OrderCreated] → Kafka (Retry, Duplicate!)
Consumer processes order TWICE → Double charge 💸
Solution 1: Idempotent Producer
Config: enable.idempotence=true (default in Kafka 3.0+)
How it Works:
- Producer assigns each message a unique sequence number
- Broker tracks `(ProducerId, Partition, SeqNum)` in memory
- If a duplicate arrives, the broker silently discards it
Guarantee: Retries no longer create duplicates; at-least-once delivery becomes effectively exactly-once within a single partition.
Limitation: Only works for single partition writes. If you write to multiple partitions, they’re not atomic.
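The broker-side dedup above can be sketched as a toy in-memory broker. The `DedupSketch` class and its single sequence counter per producer/partition are simplifications of the real protocol (which tracks a window of recent batches), but the discard rule is the same:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    // The broker remembers the highest sequence number it has accepted
    // per (producerId, partition) and silently drops anything older.
    private final Map<String, Integer> lastSeq = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    boolean append(String producerId, int partition, int seq, String msg) {
        String key = producerId + "/" + partition;
        int last = lastSeq.getOrDefault(key, -1);
        if (seq <= last) return false; // duplicate retry: discard
        lastSeq.put(key, seq);
        log.add(msg);
        return true;
    }

    public static void main(String[] args) {
        DedupSketch broker = new DedupSketch();
        broker.append("p1", 0, 0, "OrderCreated");                    // accepted
        boolean retried = broker.append("p1", 0, 0, "OrderCreated");  // retry!
        System.out.println(retried);          // false: duplicate dropped
        System.out.println(broker.log.size()); // 1: no double charge
    }
}
```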
Solution 2: Transactional API (Multi-Partition Atomicity)
Use Case: Transfer $100 from Account A to Account B. You need to:
- Write `DebitEvent` to the `account-A` topic
- Write `CreditEvent` to the `account-B` topic
- Atomic: Both succeed or both fail
Code Example (Java):
```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("transactional.id", "payment-service-1"); // Unique per instance
props.put("enable.idempotence", "true");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions(); // Register with coordinator

try {
    producer.beginTransaction();
    // All-or-nothing batch
    producer.send(new ProducerRecord<>("account-A", "user:123", debitJson));
    producer.send(new ProducerRecord<>("account-B", "user:456", creditJson));
    producer.send(new ProducerRecord<>("orders", "order:803", orderJson));
    producer.commitTransaction(); // Atomic commit
} catch (Exception e) {
    producer.abortTransaction(); // Rollback all
}
```
What Happens Under the Hood:
- Producer writes to all partitions with a transaction marker
- Broker holds messages in pending state (not visible to consumers yet)
- On `commitTransaction()`, the transaction coordinator records the commit in the internal `__transaction_state` topic and writes COMMIT markers to the data partitions
- Messages become visible atomically
Isolation Levels (Consumer Side)
Config: isolation.level=read_committed (default: read_uncommitted)
| Isolation Level | Behavior | Use Case |
|---|---|---|
| read_uncommitted | Reads all messages (even pending transactions) | High throughput, eventual consistency OK |
| read_committed | Only reads committed messages | Financial systems, strong consistency |
Trade-off: read_committed adds ~10ms latency (waits for transaction markers)
Production Pattern: E-Commerce Order Flow
Without Transactions (Risky):
1. Deduct inventory (Kafka write)
2. Network fails
3. Charge payment (Kafka write succeeds)
4. Customer charged, but item still in stock → Oversold!
With Transactions (Safe):
```java
producer.beginTransaction();
// Step 1: Reserve inventory
producer.send(new ProducerRecord<>(inventoryTopic, "item-123", "RESERVE:qty-1"));
// Step 2: Charge payment
producer.send(new ProducerRecord<>(paymentTopic, "user-789", "CHARGE:$99"));
// Step 3: Create order
producer.send(new ProducerRecord<>(ordersTopic, "order-456", "CONFIRMED"));
producer.commitTransaction(); // All 3 succeed or all 3 fail
```
Result: No partial state. Either order fully succeeds or fully fails.
[!IMPORTANT] Interview Insight: Kafka transactions are NOT like database ACID transactions. They guarantee atomicity within Kafka, but if your consumer crashes after reading, you still need idempotent processing downstream.
9. Deep Dive: Consumer Group Rebalancing Strategies
When a consumer joins or leaves a group, Kafka rebalances partition ownership. This can cause downtime.
Rebalancing Protocol Comparison
| Strategy | Downtime | How It Works | Config | Use Case |
|---|---|---|---|---|
| Eager (Range/RoundRobin) | ~30s stop-the-world | All consumers stop, reassign all partitions, resume | `partition.assignment.strategy=range` | Small groups (<10 consumers) |
| Cooperative (Incremental) | 0s (gradual) | Only affected partitions are reassigned | `partition.assignment.strategy=cooperative-sticky` | Large groups (>100 consumers) |
| Static Group Membership | 0s (on planned restart) | Consumer keeps same ID across restarts | `group.instance.id=pod-name` | Kubernetes rolling deploys |
Eager Rebalancing (Default, Legacy)
Flow:
- Consumer C3 crashes
- All consumers stop processing (revoke all partitions)
- Coordinator reassigns partitions
- All consumers resume
Downtime: ~10-30 seconds (depends on session.timeout.ms)
Problem: If you have 100 consumers and 1 crashes, all 100 stop. Overkill.
Cooperative Rebalancing (Kafka 2.4+, Recommended)
Flow:
- Consumer C3 crashes (owned P5, P8)
- Only P5 and P8 are revoked
- Coordinator reassigns P5→C1, P8→C4
- Other 97 consumers keep processing
Downtime: 0 seconds for unaffected partitions
Config:
partition.assignment.strategy=cooperative-sticky
[!TIP] Production Default: Always use `cooperative-sticky` for zero-downtime deployments.
Static Group Membership (Session Pinning)
Problem: Kubernetes rolling deploy restarts pods. Each restart triggers rebalance.
Solution: Assign a static group instance ID:
group.instance.id=payment-consumer-pod-7
Behavior:
- Consumer restarts with same ID → No rebalance (partitions retained)
- Coordinator waits `session.timeout.ms` (default: 45s) before reassigning
- If the pod comes back within 45s, it resumes seamlessly
Use Case: Kubernetes StatefulSets, planned maintenance
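Putting the two recommendations together, a consumer running in a Kubernetes pod might be configured as below. The config keys are real Kafka consumer settings; the `ConsumerConfigSketch` class, group name, and pod name are placeholders:

```java
import java.util.Properties;

public class ConsumerConfigSketch {
    // Cooperative rebalancing + static membership: pod restarts within
    // session.timeout.ms do not trigger a rebalance at all.
    static Properties forPod(String podName) {
        Properties p = new Properties();
        p.put("group.id", "payment-service");
        p.put("partition.assignment.strategy",
              "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        p.put("group.instance.id", podName); // static ID survives restarts
        p.put("session.timeout.ms", "45000"); // grace window for the pod to return
        return p;
    }

    public static void main(String[] args) {
        Properties p = forPod("payment-consumer-pod-7");
        System.out.println(p.getProperty("group.instance.id")); // payment-consumer-pod-7
    }
}
```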
10. Production Configuration Guide
Broker-Level Tuning
| Config | Description | Default | Tuned (High Throughput) |
|---|---|---|---|
| `num.network.threads` | Threads handling requests | 3 | 8 (for high concurrency) |
| `num.io.threads` | Threads writing to disk | 8 | 16 (SSD clusters) |
| `socket.send.buffer.bytes` | TCP send buffer | 100KB | 1MB (10Gbps networks) |
| `log.segment.bytes` | Max segment file size | 1GB | 1GB (keep default) |
Producer Tuning: Latency vs Throughput
| Config | Low Latency | High Throughput | Why |
|---|---|---|---|
| linger.ms | 0 | 100 | Wait time before sending batch (batching improves compression) |
| batch.size | 16KB | 1MB | Larger batches = better compression ratio |
| compression.type | none | zstd | zstd is fastest modern codec (Kafka 2.1+) |
| acks | 1 | all | 1=leader only (fast), all=ISR quorum (durable) |
| buffer.memory | 32MB | 128MB | Total memory for pending sends |
Compression Benchmark (1GB dataset):
- none: 1000 MB, 100 MB/s
- gzip: 350 MB, 40 MB/s (high CPU)
- lz4: 450 MB, 90 MB/s (balanced)
- zstd: 300 MB, 95 MB/s (best ratio + speed)
[!TIP] Interview Tip: Use `linger.ms=10` and `batch.size=256KB` as a balanced starting point. Tune based on P99 latency requirements.
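As a sketch, the balanced starting point from the tip above expressed as producer properties. The keys are real producer configs; the values are a starting point to tune against your P99 targets, not benchmark-derived constants:

```java
import java.util.Properties;

public class BalancedProducerTuning {
    static Properties balanced() {
        Properties p = new Properties();
        p.put("linger.ms", "10");          // wait up to 10ms to fill a batch
        p.put("batch.size", "262144");     // 256KB batches compress better
        p.put("compression.type", "zstd"); // best ratio/speed trade-off (Kafka 2.1+)
        p.put("acks", "all");              // durable writes
        return p;
    }

    public static void main(String[] args) {
        System.out.println(balanced().getProperty("batch.size")); // 262144
    }
}
```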
Consumer Tuning: Polling Behavior
| Config | Description | Default | Tuned |
|---|---|---|---|
| fetch.min.bytes | Min data to fetch before returning | 1 byte | 1MB (reduce poll frequency) |
| fetch.max.wait.ms | Max wait if fetch.min.bytes not met | 500ms | 100ms (lower latency) |
| max.poll.records | Max records per poll() | 500 | 100 (if processing is slow) |
| session.timeout.ms | Max time between heartbeats | 45s | 30s (faster failure detection) |
Rule of Thumb:
- Fast processing (e.g., logs): `max.poll.records=500`
- Slow processing (e.g., ML inference): `max.poll.records=10` (avoid rebalance timeout)
11. Summary
- Kafka is a Log: Think of it as a distributed file system, not a queue.
- Partitions: The unit of parallelism. Max Consumers = Max Partitions.
- Durability: Ensured via ISR and Producer Acks (`acks=all`).
- Zero Copy: The secret sauce behind its speed.
- Consumer Groups: Enable horizontal scaling of processing.
- Exactly-Once Semantics: Possible via transactions and idempotent producers.
- Rebalancing: Prefer `cooperative-sticky` for zero-downtime deployments.
- Production Tuning: Balance latency (`linger.ms=0`) vs throughput (`linger.ms=100`, `compression=zstd`).
Staff Engineer Tip: Watch out for Page Faults when consumers fall behind. If a consumer is reading “Fresh” data (at the tail of the log), it hits the RAM Page Cache (60ns latency). If the consumer lags and needs data from 1 hour ago, the OS must fetch it from the Physical Disk (10ms latency). This is the “Tail Latency Cliff.” When one consumer lags, it can cause the broker to start “thrashing” its disk I/O, potentially slowing down all other healthy consumers on that same hardware node.
Mnemonic — “Kafka = Tape Recorder, RabbitMQ = Mailbox”: RabbitMQ = Smart Broker (“I’ll route it for you”), message deleted after consumption, use for tasks. Kafka = Dumb Broker (“Just append it”), messages retained for days, use for events/streams. Kafka speed formula: Sequential I/O (no seek) + Zero Copy (sendfile syscall, 4 copies → 2) + Page Cache (RAM, not Disk). Partition = unit of parallelism. Max Consumers = Max Partitions. ISR = safety net. acks=all = strongest durability. Consumer Lag > 0 = watch for Page Faults cliff.