One of the most dangerous beliefs in distributed systems is that a message bus can provide Exactly-Once delivery.

Strictly speaking, it’s impossible. If a broker sends a message and the network fails before the ACK arrives, the broker cannot know whether the consumer actually processed it, so it must retry. That’s At-Least-Once. If it times out and gives up instead, that’s At-Most-Once.

So how do companies like Stripe and Netflix claim to have perfectly consistent async workflows? They don’t rely on the “guarantee”—they build for Idempotency.


1. The Three Semantics

  1. At-Most-Once: Fire and forget. No retries. Messages can be lost (UDP style).
  2. At-Least-Once: Retry until ACKed. Messages can be duplicated. (The industry standard).
  3. Exactly-Once: The destination sees the event precisely one time. (Usually achieved via At-Least-Once + Idempotency).
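
As a rough mapping, here is how the three semantics tend to show up in producer configuration. This is a sketch using kafkajs-style options; the values are illustrative, not a complete recipe:

// Hedged sketch: mapping the three semantics to producer settings (kafkajs-style, values illustrative)
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'orders-app', brokers: ['localhost:9092'] });

const atMostOnce  = kafka.producer({ retry: { retries: 0 } });  // pair with acks: 0 on send → messages can be lost
const atLeastOnce = kafka.producer({ retry: { retries: 5 } });  // pair with acks: -1 on send → messages can be duplicated
const exactlyOnce = kafka.producer({                            // idempotent producer + transactions,
  idempotent: true,                                             // still needs consumer-side idempotency
  maxInFlightRequests: 1,                                       // (see the Idempotency section)
  transactionalId: 'orders-tx-1',
});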

2. Exactly-Once Semantics (EOS) Internals

Achieving Exactly-Once in Kafka isn’t magic; it’s a distributed transaction coordinated between the Producer, the Transaction Coordinator (TC), the data partitions, and the consumer-offsets topic.

The Transaction Flow

sequenceDiagram
    participant P as Producer
    participant TC as Transaction Coordinator
    participant T as Topic Partition
    participant OC as Offsets Topic
    
    P->>TC: InitTransactions (ProducerId, Epoch)
    P->>TC: AddPartitionsToTxn
    P->>T: Write Data (In-Flight)
    P->>TC: AddOffsetsToTxn
    P->>OC: Send Offsets
    P->>TC: CommitTransaction
    TC->>TC: Marker: PREPARE_COMMIT
    TC->>T: Marker: COMMIT
    TC->>OC: Marker: COMMIT
    TC->>TC: Marker: COMPLETE_COMMIT
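
From the producer’s side, the same flow looks roughly like the sketch below. It uses the kafkajs transactional API; the transactional id, topics, group id, and offsets are all illustrative:

// Hedged sketch: the client-side half of the flow above (kafkajs; names and offsets illustrative)
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'orders-app', brokers: ['localhost:9092'] });
const producer = kafka.producer({
  transactionalId: 'order-processor-1',  // enables the ProducerId + Epoch handshake (InitTransactions)
  idempotent: true,
  maxInFlightRequests: 1,
});
await producer.connect();

const txn = await producer.transaction();
try {
  await txn.send({ topic: 'orders', messages: [{ value: '{"orderId":"o-1"}' }] });  // AddPartitionsToTxn + in-flight write
  await txn.sendOffsets({                                                           // AddOffsetsToTxn + offsets write
    consumerGroupId: 'order-consumers',
    topics: [{ topic: 'orders-in', partitions: [{ partition: 0, offset: '43' }] }],
  });
  await txn.commit();  // TC logs PREPARE_COMMIT, writes COMMIT markers, then COMPLETE_COMMIT
} catch (err) {
  await txn.abort();   // writes ABORT markers instead; read_committed consumers never see the data
}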

Staff-Level Failure: The TC Stalling

If the Transaction Coordinator crashes after logging PREPARE_COMMIT but before writing the COMMIT markers to the data partitions, those partitions are blocked for transactional readers.

  • The Impact: Consumers in read_committed mode will stall at the Last Stable Offset (LSO).
  • The LSO Definition: The LSO is the minimum of (a) the High Watermark and (b) the first offset of the earliest open (ongoing or prepared) transaction.
  • The Stalling: Even if higher offsets are committed by other producers, the consumer cannot advance past this “prepared” transaction until the TC recovers and writes the final marker. For example, if the High Watermark is 1,500 but a prepared transaction starts at offset 1,200, read_committed consumers are stuck at 1,200 despite 300 newer, fully replicated records.

3. The “Dual Write” Problem & The Outbox Pattern

A common Junior mistake:

// DON'T DO THIS: two independent writes with no shared transaction
await db.save(order);                   // 1) commits the order to the database
await kafka.send(order_created_event);  // 2) if the broker is down or the process crashes here,
                                        //    the order exists but the event is never published

The Failure: If the DB save succeeds but Kafka is down, the order exists but downstream systems never hear about it. If you swap the two calls, you might publish an event for an order that failed to save.

The Staff Solution: the Transactional Outbox. Instead of sending to Kafka directly, write the event into a dedicated outbox table in the same database transaction as the business data; a separate relay then publishes it to the broker.

[Diagram: Application DB (Orders table + Outbox table) → Relay Agent (Debezium) → Kafka / MQ]
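
In code, the key move is that the order row and the outbox row commit or roll back together. A minimal sketch, assuming a hypothetical db helper with a transaction API; the table names, uuid(), and helper methods are illustrative, not a specific ORM:

// Hedged sketch: transactional outbox write (db helper, table names, and uuid() are assumptions)
await db.transaction(async (tx) => {
  await tx.insert('orders', order);          // the business write
  await tx.insert('outbox', {                // the event, in the SAME transaction
    id: uuid(),
    aggregate_id: order.id,
    event_type: 'order_created',
    payload: JSON.stringify(order),
  });
});
// The relay (e.g. Debezium tailing the WAL, or a poller) reads outbox rows and publishes them to Kafka.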

4. The Ultimate Safety: Idempotency

No matter how good your outbox or Kafka semantics are, a consumer will eventually see the same message twice (e.g., the consumer crashes after processing but before committing its offset).

The Only Real Solution: Idempotency Keys. Every event should have a unique ID (UUID/Snowflake). The consumer should check if that ID has already been successfully processed before taking any action.

-- The pattern for idempotent processing (run in the same DB transaction as the side effects)
INSERT INTO processed_events (event_id) VALUES ('event_abc_123');
-- A duplicate-key error here means the event was already handled: skip the work and just ACK.
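
Wrapped into a consumer handler, the duplicate check and the business side effects share one database transaction. A minimal sketch, again with a hypothetical db helper; insertIfAbsent stands in for an INSERT that reports whether a duplicate key was hit:

// Hedged sketch: idempotent event handler (db API, insertIfAbsent, and applyOrderCreated are assumptions)
async function handleEvent(event: { id: string; payload: unknown }): Promise<void> {
  await db.transaction(async (tx) => {
    const fresh = await tx.insertIfAbsent('processed_events', { event_id: event.id });
    if (!fresh) return;                          // duplicate delivery: already processed, just ACK upstream
    await applyOrderCreated(tx, event.payload);  // side effects commit atomically with the marker row
  });
}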

5. Staff Math: Performance Trade-offs

Consistency isn’t free. As a Staff Engineer, you must quantify the cost of “Exactly-Once.”

5.1. The Idempotency Storage Cost

If you store every message_id for a 24-hour window to deduplicate at 10k msgs/sec:

$\text{Storage} = 10{,}000/\text{s} \times 86{,}400\text{ s} \times 64\text{ bytes (UUID + overhead)} \approx 55\text{ GB of RAM}$

  • The Staff Move: Use a Bloom Filter (≈10 bits per key at a 1% false-positive rate) to shrink this to roughly 1GB of RAM; the trade-off is that ~1% of genuinely new messages will be misidentified as duplicates.
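
The sizing falls out of the standard Bloom filter formulas; a quick back-of-the-envelope check:

// Bloom filter sizing for the 24h window above (standard formulas, numbers as in the text)
const n = 10_000 * 86_400;                                    // ≈ 864M message ids per day
const p = 0.01;                                               // target false-positive rate
const mBits = Math.ceil((-n * Math.log(p)) / Math.LN2 ** 2);  // optimal number of bits
const k = Math.round((mBits / n) * Math.LN2);                 // optimal number of hash functions
console.log((mBits / 8 / 1024 ** 3).toFixed(2), 'GiB,', k, 'hash functions');  // ≈ 0.96 GiB, 7 hashes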

5.2. Transactional Overhead (acks=all)

Moving from acks=1 (leader only) to acks=all (every in-sync replica) prevents the “Ghost Commit” problem: a write the leader acknowledged and then lost by failing before any follower replicated it.

  • The Cost: Typically a 30-50% drop in throughput and a 2x-3x increase in P99 latency, because the producer now waits on the slowest in-sync replica.
  • Rule of Thumb: Only use acks=all for “State-Changing” events (Payments, Order Status). Use acks=1 for telemetry and logs.
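
A per-topic choice might look like the sketch below, using kafkajs send options; the topic names and payloads are illustrative, and it assumes a connected, non-transactional producer:

// Hedged sketch: choosing acks per topic criticality (kafkajs; topics and payloads illustrative)
await producer.send({
  topic: 'payments.order-status',
  acks: -1,                                           // acks=all: wait for every in-sync replica
  messages: [{ key: 'order-42', value: '{"status":"PAID"}' }],
});

await producer.send({
  topic: 'telemetry.page-views',
  acks: 1,                                            // leader-only: cheaper, tolerates rare loss
  messages: [{ value: '{"page":"/checkout"}' }],
});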

5.3. Freshness: Polling vs. Streaming

How long does a message sit in the outbox table before being sent?

  • Polling: $\text{Lag} \approx \frac{\text{Poll Interval}}{2}$. (If you poll every 1s, average lag is 500ms).
  • CDC Streaming: $\text{Lag} \approx \text{WAL write} + \text{Network Buffer} \approx \mathbf{10-50ms}$.
  • The Choice: If your system requires < 100ms end-to-end latency, Polling is not an option; you must use CDC (Debezium).

6. Staff Case Study: Uber’s DLQ Strategy

In a system processing billions of messages, errors are a steady state, not an exception. Uber’s approach to Dead Letter Queues (DLQ) has become an industry reference.

6.1. The “Reprocessing” Pipeline

Instead of just dropping failed messages into a black hole (DLQ), Uber built a tiered reprocessing system:

  1. Level 1 (Retry Topic): Messages are moved to a “Retry” topic with a delay (Exponential Backoff).
  2. Level 2 (DLQ): If retries fail after $N$ attempts, the message is moved to a permanent DLQ (a minimal routing sketch follows this list).
  3. The “Purge/Replay” Tool: Engineers have an internal UI to inspect DLQ messages, fix the bug in the consumer, and “Replay” those messages back into the main stream.
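
A minimal sketch of the Level 1 / Level 2 routing, in kafkajs style; the topic names, header handling, and handle() business logic are assumptions, it presumes a connected consumer and producer, and the delayed re-injection from the retry topic is a separate consumer not shown here:

// Hedged sketch: tiered retry routing (kafkajs style; topics, headers, and handle() are assumptions)
const MAX_ATTEMPTS = 5;

await consumer.run({
  eachMessage: async ({ message }) => {
    const attempt = Number(message.headers?.attempt?.toString() ?? '0');
    try {
      await handle(message);                                        // business logic
    } catch (err) {
      const next = attempt + 1;
      const target = next >= MAX_ATTEMPTS ? 'orders.dlq' : 'orders.retry';
      await producer.send({
        topic: target,
        messages: [{
          key: message.key,
          value: message.value,
          headers: { ...message.headers, attempt: String(next) },   // carry the attempt count forward
        }],
      });
      // The original message is ACKed; a separate consumer replays 'orders.retry' after an exponential backoff.
    }
  },
});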

6.2. Lesson: Observability of Failure

Uber treats failure as a first-class workflow. By using specialized retry topics, they keep their primary topics “clean” and prevent a single broken message from blocking an entire partition.


Staff Takeaway

Never trust the “Exactly-Once” marketing checkmark.

  • Infrastructure provides At-Least-Once.
  • Application Logic provides Idempotency.
  • Together, they simulate Exactly-Once.