Idempotency in Messaging
In 2016, Stripe’s engineering team published a post-mortem about duplicate charges. A customer clicked “Place Order.” The browser’s network request timed out at 10 seconds. The browser auto-retried. But Stripe’s server had already processed the first request — it just failed to send the HTTP 200 back. The result: the customer was charged twice for one order. Stripe’s solution — now visible to every developer in their API — is the Idempotency-Key header. Every POST request includes a client-generated UUID. If Stripe sees the same UUID twice, it returns the cached response without re-executing. The principle: the client generates the intent (UUID), the server executes exactly once and remembers. This pattern is now standard across Square, Braintree, PayPal, and Adyen. Without idempotency, every charge API in the world would be a double-spend lottery.
[!IMPORTANT] In this lesson, you will master:
- Intent vs. Execution: Why distinguishing between “The user wanted to pay” and “The computer processed the payment” is the key to distributed safety.
- The Exactly-Once Illusion: Understanding why Exactly-Once is actually At-Least-Once delivery paired with high-performance deduplication.
- Hardware Write Amplification: Calculating the I/O cost of storing idempotency keys and using Bloom Filters to protect your main database.
[!TIP] Interview Communication: When designing financial or critical state systems (like Wise or Stripe), always preempt the interviewer by bringing up Idempotency. Don’t wait for them to ask “what happens if this retries?” State clearly: “Because this is a payment request over an unreliable network, I am designing this API to be idempotent by requiring an `Idempotency-Key` header. This ensures that a client-side retry after a timeout won’t result in a double charge.”
1. The Problem: Duplicates are Inevitable
In a distributed system, the network is unreliable.
- Scenario: Client sends a “Charge $100” request.
- Server: Processes it successfully.
- Network: The “200 OK” response gets lost on the way back.
- Client: Thinks the request failed (Timeout). Retries.
- Result: The user is charged $200.
Idempotency guarantees that performing an operation multiple times has the same effect as performing it once.
f(f(x)) = f(x)
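To make the definition concrete, here is a minimal Python sketch contrasting an idempotent operation with a non-idempotent one. The function names and the dictionary-based account are illustrative assumptions, not part of any real payment API.

```python
def set_balance(account: dict, amount: int) -> dict:
    """Idempotent: applying it twice leaves the same state as applying it once."""
    return {**account, "balance": amount}


def add_to_balance(account: dict, amount: int) -> dict:
    """Not idempotent: every application changes the state again."""
    return {**account, "balance": account["balance"] + amount}


account = {"balance": 0}

set_once = set_balance(account, 100)
set_twice = set_balance(set_once, 100)      # f(f(x)) == f(x)

add_once = add_to_balance(account, 100)
add_twice = add_to_balance(add_once, 100)   # a retry changes the outcome
```

A retried "set" is harmless, while a retried "add" is exactly the double charge from the scenario above.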
[!NOTE] War Story: The “Accidental Billionaire” Issue. At an early-stage e-commerce startup, a bug in the client code caused the “Place Order” button to remain enabled while the network request was pending. During a flash sale, anxious customers mashed the button 5-10 times. Because the backend lacked an `Idempotency-Key` check, the checkout service happily processed every single click as a separate order. The company nearly went bankrupt processing refunds and answering angry support tickets. Adding a simple Redis `SETNX` lock keyed by an `idempotency_key` (generated once per checkout session) fixed it overnight.
2. Why “Exactly-Once” is a Lie
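You cannot achieve “Exactly-Once Delivery” over an unreliable network. The sender can never be certain whether a message or its acknowledgement was lost (the Two Generals Problem), so it must either risk dropping the message or send it again. What systems like Kafka mean by “Exactly-Once” is actually At-Least-Once Delivery + Deduplication: a message may arrive more than once, but its effect is applied once.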
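The "At-Least-Once + Deduplication" idea can be sketched as a consumer that remembers which message IDs it has already handled. This is an illustration only: the `consume` function and the in-memory `seen` set are assumptions, and a real system would need a durable dedup store.

```python
from typing import Callable, Iterable


def consume(messages: Iterable[tuple[str, str]],
            handler: Callable[[str], None]) -> int:
    """Run handler once per unique message ID; return how many duplicates were skipped."""
    seen: set[str] = set()  # assumption: stands in for a durable dedup store
    skipped = 0
    for message_id, payload in messages:
        if message_id in seen:
            skipped += 1    # redelivery: acknowledge it, but do not re-run the effect
            continue
        handler(payload)
        seen.add(message_id)
    return skipped


# An at-least-once transport may redeliver "m1" after a lost acknowledgement.
deliveries = [("m1", "charge $100"), ("m2", "charge $25"), ("m1", "charge $100")]
processed: list[str] = []
duplicates = consume(deliveries, processed.append)
```

Three deliveries arrive, but only two effects are applied.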
3. The Solution: Idempotency Keys
The client must generate a unique ID (UUID) for every intent.
- Client: Generates `ref_id = "uuid-123"`.
- Client: Sends `POST /charge { amount: 100, ref_id: "uuid-123" }`.
- Server:
  - Checks DB: Have I seen `uuid-123`?
  - No: Charge card. Save `uuid-123` to DB. Return Success.
  - Yes: Return the previous success response immediately. Do NOT charge again.
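The server-side steps above can be sketched in a few lines of Python. `ChargeService` is a hypothetical toy class; its `responses` dictionary stands in for the database table of seen keys, and the sketch is single-threaded, so it ignores the race condition that a real store has to handle atomically.

```python
from dataclasses import dataclass, field


@dataclass
class ChargeService:
    """Toy server: `responses` plays the role of the idempotency table."""
    responses: dict = field(default_factory=dict)
    charges_made: int = 0

    def charge(self, amount: int, ref_id: str) -> dict:
        # Yes, seen before: return the stored response without charging again.
        if ref_id in self.responses:
            return self.responses[ref_id]
        # No, first time: charge the card, then remember the outcome.
        self.charges_made += 1
        response = {"status": "success", "amount": amount, "ref_id": ref_id}
        self.responses[ref_id] = response
        return response


service = ChargeService()
first = service.charge(100, "uuid-123")
retry = service.charge(100, "uuid-123")  # same intent, resent after a timeout
```

Both calls return the same response, and the card is charged once.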
[!NOTE] Hardware-First Intuition: The “Idempotency Tax.” To guarantee safety, you must perform a Write (logging the UUID) for every Intent. This is “Write Amplification”: you are doing roughly twice as many disk I/Os for one business transaction. At massive scale (100k+ QPS), checking a Postgres table for a UUID on every request can saturate your SSDs. A common first line of defense is a Bloom Filter held in RAM. It can tell you “I definitely haven’t seen this UUID” in about 50ns, allowing you to skip the slow 10ms DB check for 99.9% of new requests. A “maybe seen” answer can be a false positive, so it still has to be confirmed against the database.
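To show how the RAM-first check fits together, here is a minimal Bloom filter sketch in Python. The bit-array size, the number of hashes, and the salted SHA-256 hashing are arbitrary assumptions chosen for readability; production filters use faster hashes and sizes tuned to the expected key count. The `db_keys` set stands in for the real database.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def is_duplicate(key: str, bloom: BloomFilter, db_keys: set) -> bool:
    """Fast path in RAM; consult the (simulated) database only on a possible hit."""
    if not bloom.might_contain(key):
        return False       # definitely new: skip the database read
    return key in db_keys  # maybe seen: confirm against the source of truth


bloom, db_keys = BloomFilter(), set()
db_keys.add("uuid-123")
bloom.add("uuid-123")
```

The filter never replaces the database; it only lets most new keys skip the read.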
5. Implementation Patterns
A. The “Unique Constraint” (Database)
The simplest way. Rely on the database’s ACID properties.
INSERT INTO payments (idempotency_key, amount, user_id)
VALUES ('uuid-101', 100, 50);
-- If run twice, DB throws "Duplicate Key Violation"
The app catches this error and returns “Success” (since the work is already done).
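As a runnable illustration of catching the duplicate-key error, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout and the `record_payment` helper are assumptions made for the example; the same idea applies to any database that enforces a unique constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " idempotency_key TEXT PRIMARY KEY,"  # uniqueness is what rejects the duplicate
    " amount INTEGER NOT NULL,"
    " user_id INTEGER NOT NULL)"
)


def record_payment(key: str, amount: int, user_id: int) -> str:
    """Return 'created' for the first insert and 'duplicate' for any retry."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (idempotency_key, amount, user_id) "
                "VALUES (?, ?, ?)",
                (key, amount, user_id),
            )
        return "created"
    except sqlite3.IntegrityError:
        # The work is already done, so the caller can report success.
        return "duplicate"


first = record_payment("uuid-101", 100, 50)
retry = record_payment("uuid-101", 100, 50)
row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

The retry is rejected by the database itself, so only one row is ever stored.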
B. Redis + Lua (The Race Condition)
A common mistake is the “Check-Then-Act” pattern, which causes a Race Condition:
# BAD CODE - Race Condition!
if not redis.exists(key):
    # <-- Another thread could insert here!
    redis.set(key, "processing")
    process_payment()
Solution: Use Redis SETNX (Set if Not Exists) or a Lua Script to make the Check+Set operation Atomic.
-- ATOMIC LUA SCRIPT
if redis.call("EXISTS", KEYS[1]) == 1 then
    return 0 -- Already exists
else
    redis.call("SET", KEYS[1], "processing")
    return 1 -- Success, lock acquired
end
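For the SETNX route, here is a minimal Python sketch. It assumes the redis-py client, whose `set(name, value, nx=True, ex=seconds)` call issues `SET key value NX EX seconds` as a single atomic command. `FakeRedis` and `try_acquire` are hypothetical names; the fake client exists only so the example runs without a Redis server.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in that mimics only set(..., nx=True)."""

    def __init__(self) -> None:
        self.data: dict = {}

    def set(self, name, value, ex=None, nx=False):
        if nx and name in self.data:
            return None          # redis-py returns None when NX blocks the write
        self.data[name] = value  # the TTL (ex) is ignored by this stand-in
        return True


def try_acquire(client, key: str, ttl_seconds: int = 86400) -> bool:
    """Atomically claim the key; True means this caller may process the payment."""
    return bool(client.set(key, "processing", nx=True, ex=ttl_seconds))


client = FakeRedis()
first_attempt = try_acquire(client, "uuid-123")
second_attempt = try_acquire(client, "uuid-123")
```

Because the check and the write happen in one command, there is no gap for a second thread to slip through.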
6. Soft Delete vs Hard Delete
DELETE is naturally idempotent in terms of final state, but its response is not: the second call reports that nothing was deleted. “Soft Deletes” make retried deletes easier to handle consistently.
- Hard Delete: `DELETE FROM users WHERE id=1`.
  - First call: Returns “OK”.
  - Second call: Returns “0 rows affected” (or 404). This might confuse the client.
- Soft Delete: `UPDATE users SET deleted_at=NOW() WHERE id=1 AND deleted_at IS NULL`.
  - First call: Sets `deleted_at`. Returns “OK” (1 row affected).
  - Second call: Does nothing (0 rows affected). Returns “OK” (We treat 0 rows as ‘Already Deleted’).
  - Benefit: The record is preserved for audit trails, and retry logic becomes consistent.
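The soft-delete behaviour can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `users` table is reduced to two columns, the `soft_delete` helper is hypothetical, and SQLite's `datetime('now')` replaces `NOW()`, which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, deleted_at TEXT)")
conn.execute("INSERT INTO users (id) VALUES (1)")
conn.commit()


def soft_delete(user_id: int) -> tuple[str, int]:
    """Mark the user deleted; report 'OK' whether or not this call changed a row."""
    with conn:
        cursor = conn.execute(
            "UPDATE users SET deleted_at = datetime('now') "
            "WHERE id = ? AND deleted_at IS NULL",
            (user_id,),
        )
    # 0 rows affected is treated as "already deleted", so retries look the same.
    return "OK", cursor.rowcount


first = soft_delete(1)
second = soft_delete(1)
remaining_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Both calls answer “OK”, and the row is still present for auditing.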
7. Idempotency in REST APIs
Not all HTTP methods are equal.
| Method | Idempotent? | Description |
|---|---|---|
| GET | Yes | Reading data doesn’t change state. Safe to retry. |
| PUT | Yes | “Replace this resource”. Calling PUT /users/1 {name: "Bob"} 10 times results in the same state (Name is Bob). |
| DELETE | Yes | “Delete this resource”. Calling it 10 times results in the same state (Resource is gone). |
| POST | NO | “Create a resource”. Calling POST /payments 10 times creates 10 payments. |
How to make POST idempotent?
Use a custom header like Idempotency-Key (Stripe standard).
- Client generates UUID `uuid-123`.
- Client sends `POST /payments` with header `Idempotency-Key: uuid-123`.
- Server checks Middleware:
  - If Key exists in Redis → Return cached Response.
  - If Key missing → Process Request → Cache Response → Return.
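The flow above can be simulated end to end in plain Python. Everything here is an assumption made for illustration: `FlakyPaymentServer` is a toy in-process server, its `cache` dictionary stands in for Redis, and responses are dropped according to a scripted pattern rather than real network loss so the run is deterministic.

```python
import uuid


class FlakyPaymentServer:
    """Toy server whose responses are dropped according to a scripted pattern."""

    def __init__(self, drop_pattern: list[bool]) -> None:
        self.drop_pattern = list(drop_pattern)  # True = response lost in transit
        self.cache: dict = {}                   # stands in for Redis
        self.payments_created = 0

    def post_payment(self, headers: dict, body: dict) -> dict:
        key = headers["Idempotency-Key"]
        if key not in self.cache:               # key missing: process, then cache
            self.payments_created += 1
            self.cache[key] = {"status": 201, "payment_id": self.payments_created}
        response = self.cache[key]              # key exists: replay cached response
        if self.drop_pattern and self.drop_pattern.pop(0):
            raise TimeoutError("response lost")
        return response


def pay_with_retries(server: FlakyPaymentServer, amount: int,
                     max_attempts: int = 5) -> dict:
    # The key is generated once per intent and reused on every retry.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for _ in range(max_attempts):
        try:
            return server.post_payment(headers, {"amount": amount})
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


server = FlakyPaymentServer(drop_pattern=[True, True, False])
result = pay_with_retries(server, 100)
```

The first two responses are lost, the client retries with the same key, and the server still creates a single payment. Generating a fresh UUID inside the retry loop would defeat the mechanism.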
8. Hardware-First Intuition: The Bloom Filter vs SSD Bottleneck
To guarantee safety, you must perform a Write (logging the UUID) for every Intent. This is the Idempotency Tax.
- Write Amplification: Every business transaction now requires two physical database writes: one for the idempotency key and one for the actual data. This can double your SSD wear and halve your throughput if using a single DB.
- The RAM Shield (Bloom Filters): At massive scale (100k+ QPS), querying Postgres for a UUID on every request becomes a latency death sentence. Staff Engineers use a Bloom Filter in RAM.
- Logic: A Bloom Filter can tell you in ~50 nanoseconds if a key definitely doesn’t exist.
- Path: 99.9% of new requests hit the filter → “MISS” → Immediately proceed to the database insert.
- Optimization: You only incur the “Slow Path” (DB disk read) for the 0.1% of potential duplicates.
- Ghost Write Hazards: If your idempotency check and your data write aren’t Atomic (inside the same SQL transaction), you might successfully debit a user’s wallet but crash before saving the idempotency key. On retry, the wallet is debited again. Always use Transactional Integrity.
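The ghost-write hazard and its fix can be demonstrated with Python's built-in `sqlite3` module. This is a sketch under assumptions: the two tables, the `debit` helper, and the `crash_midway` flag (which simulates a crash before commit) are all hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wallets (user_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("CREATE TABLE idempotency_keys (idempotency_key TEXT PRIMARY KEY)")
conn.execute("INSERT INTO wallets VALUES (50, 500)")
conn.commit()


def debit(key: str, user_id: int, amount: int, crash_midway: bool = False) -> str:
    """Debit and record the key in one transaction so both commit or neither does."""
    try:
        with conn:  # one transaction: commit on success, roll back on any exception
            conn.execute(
                "INSERT INTO idempotency_keys (idempotency_key) VALUES (?)", (key,)
            )
            conn.execute(
                "UPDATE wallets SET balance = balance - ? WHERE user_id = ?",
                (amount, user_id),
            )
            if crash_midway:
                raise RuntimeError("simulated crash before commit")
        return "debited"
    except sqlite3.IntegrityError:
        return "duplicate"
    except RuntimeError:
        return "crashed"


def balance(user_id: int) -> int:
    return conn.execute(
        "SELECT balance FROM wallets WHERE user_id = ?", (user_id,)
    ).fetchone()[0]


outcomes = [
    debit("uuid-7", 50, 100, crash_midway=True),  # rolled back: no key, no debit
    debit("uuid-7", 50, 100),                     # retry succeeds exactly once
    debit("uuid-7", 50, 100),                     # further retry is rejected
]
final_balance = balance(50)
```

Because the key and the debit share a transaction, the simulated crash leaves neither behind, and the wallet is debited once across all three attempts.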
[!TIP] Staff Engineer Tip: The TTL Strategy. Never store idempotency keys forever. Your hardware bill will grow linearly with every transaction in history. Set a Time-To-Live (TTL) proportional to your client’s maximum retry window (e.g., 24 hours). This keeps the index small enough to stay in RAM instead of spilling to disk, which keeps lookups fast.
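A TTL-bound key store can be sketched in a few lines of Python. `TTLKeyStore` is a hypothetical in-memory class used only to show the expiry logic; with Redis the same effect would normally come from setting an expiry on the key. The injected fake clock is an assumption that keeps the example deterministic.

```python
import time
from typing import Callable, Optional


class TTLKeyStore:
    """Idempotency-key store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: dict = {}  # key -> (expires_at, cached_response)

    def put(self, key: str, response: dict) -> None:
        self._entries[key] = (self.clock() + self.ttl_seconds, response)

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # lazily evict expired keys on read
            return None
        return response


now = [0.0]                      # fake clock so the example is deterministic
store = TTLKeyStore(ttl_seconds=24 * 3600, clock=lambda: now[0])
store.put("uuid-123", {"status": "success"})
within_window = store.get("uuid-123")
now[0] += 24 * 3600 + 1          # jump just past the 24-hour retry window
after_window = store.get("uuid-123")
```

The trade-off is explicit: a retry that arrives after the TTL is treated as a new request, so the TTL must be at least as long as the longest client retry window.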
9. Summary
- Network failures (Timeouts) create ambiguity. You never know if the server did the work.
- At-Least-Once delivery means you WILL get duplicates.
- Use Idempotency Keys (UUIDs) to distinguish retries from new requests.
- Use Atomic Operations (DB Constraints) to prevent Race Conditions.
- Prefer Soft Deletes (`deleted_at`) for consistent history and safe retries.
- Bloom Filters protect your database hardware from redundant I/O storms.
Staff Engineer Tip: Beware of Lock Contention on your idempotency store. If a “Poison Message” causes a service to crash and retry 1,000 times per second across 10 servers, every one of those attempts will try to acquire the same Redis lock. Because a single key lives on a single shard, this creates a hot key that saturates that shard while the rest of the cluster sits idle. Always set a Reasonable TTL (e.g., 24 hours) on your idempotency keys so that your storage does not grow without bound.
Mnemonic — “UUID = Intent, Server = Memory”: Client generates UUID (intent). Server checks store → Miss → Execute + Save UUID → Return Success. Server checks store → Hit → Return cached Success (DO NOT execute again). Stripe’s pattern: Idempotency-Key header on every POST. DB pattern: UNIQUE(idempotency_key) constraint catches duplicates at I/O level. Bloom Filter: First-line RAM defense (~50ns) before the 10ms DB check. TTL: 24h (long enough for client retry windows, short enough to save storage).