Circuit Breakers & Retries

[!TIP] The Amazon Rule: “Everything fails, all the time.” - Werner Vogels, Amazon CTO

In a distributed system, network calls fail. Databases time out. APIs return 500s. If your service waits indefinitely for a response, it will hang. If 10,000 users are waiting, your thread pools exhaust and the entire system collapses (a Cascading Failure).

This chapter covers the patterns to fail gracefully and recover automatically.


1. The Circuit Breaker Pattern

Imagine your house electrical wiring. If you plug in too many heaters, the Circuit Breaker trips. It cuts the power to prevent the house from burning down. In software, we wrap dangerous network calls (e.g., to a Payment Gateway) in a Circuit Breaker.

1.1 The Three States & Sliding Windows

To know when to trip, the breaker records success/failure metrics over a Sliding Window:

  • Count-based Window: Tracks the last $N$ requests (e.g., last 100 calls).
  • Time-based Window: Tracks all requests in the last $T$ seconds (e.g., last 10 seconds).

Based on those metrics, the breaker moves between three states:

  1. CLOSED (Normal): Requests flow through freely. The breaker continually updates the sliding window.
  2. OPEN (Tripped): The failure rate exceeds the configured threshold (e.g., >50% failure rate). All subsequent requests Fail Fast immediately. Crucial: We don’t even attempt the network call, saving our own threads from hanging and giving the downstream service breathing room to recover.
  3. HALF-OPEN (Testing): After a configured wait duration (e.g., 5s), the breaker allows a limited number of test requests (probes) to pass through.
    • If the probes succeed → The downstream service is healthy again. Reset to CLOSED.
    • If the probes fail → The downstream service is still struggling. Snap back to OPEN and restart the wait timer.

Interactive Visualizer: The State Machine

Simulate a failing service and watch the Circuit Breaker protect your system.

[!TIP] Try it yourself: Spam the “Send Failure” button to trip the circuit. Then wait for the timeout to try again.


2. The Bulkhead Pattern

When the Titanic hit the iceberg, it sank because its compartments were not sealed at the top, so water spilled from one flooded compartment into the next. The Bulkhead Pattern enforces strict resource isolation so that a failure in one subsystem cannot drain the resources of another.

2.1 Two Types of Bulkheads

  1. Thread Pool Bulkheads (Process Isolation)
    • Instead of one massive, shared thread pool for all incoming API requests, assign a dedicated, fixed-size thread pool for each specific downstream dependency.
    • Example: Pool A (Payment Service) gets 20 threads. Pool B (Recommendations Service) gets 10 threads.
    • Result: If the Recommendations database locks up, those 10 threads will hang waiting for a response. However, Pool A is physically isolated; it still has 20 threads happily processing payments.
    • Trade-off: High context-switching overhead and memory footprint from managing multiple distinct thread pools.
  2. Semaphore Bulkheads (Concurrency Limits)
    • Instead of separate threads, you maintain a shared pool but use a Semaphore (a counter) to limit concurrent in-flight requests to a specific service.
    • Example: Allow a maximum of 50 concurrent calls to the Inventory Service. If a 51st request arrives, it is immediately rejected (Fail Fast) instead of queueing behind potentially hung requests.
    • Trade-off: Lighter on resources, but if threads do hang, they are still hanging on the shared pool (just bounded by the semaphore limit).
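Both flavors can be sketched in a few lines. The pool sizes and class names below are illustrative, not from a specific library; real implementations add per-call timeouts and metrics on top of this.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Thread-pool bulkhead: one fixed-size pool per downstream dependency,
# so a hung dependency can only exhaust its own workers.
payment_pool = ThreadPoolExecutor(max_workers=20)          # Pool A
recommendations_pool = ThreadPoolExecutor(max_workers=10)  # Pool B

class SemaphoreBulkhead:
    """Semaphore bulkhead: cap concurrent in-flight calls on a shared pool."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: the (limit + 1)-th concurrent caller is
        # rejected immediately instead of queueing behind hung requests.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: fail fast")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()
```

With the thread-pool variant you submit work via `payment_pool.submit(charge_card, order)`; with the semaphore variant you wrap the call as `inventory_bulkhead.call(check_stock, sku)` and handle the fast-fail rejection.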

3. Retries & The Thundering Herd

When a request fails, we often retry. “Retry 3 times” seems harmless. But if your database is overloaded, 10,000 users retrying 3 times each means 30,000 extra requests. You just DDoS’d yourself. This is a Retry Storm.

[!NOTE] War Story: The “Friendly” DDoS Attack
At a major streaming company, a database blip caused a 1-second outage during a live sports event. The frontend apps, programmed to aggressively retry on failure, all retried at the exact same moment. The resulting Thundering Herd DDoS’d their own backend, stretching a 1-second database blip into a 45-minute cascading failure: the database could never recover under the compounded retry load.

3.1 Exponential Backoff & Jitter

  1. Exponential Backoff: Wait longer between retries.
    • Wait 1s, then 2s, then 4s, then 8s.
    • Problem: If 10,000 users fail at T=0, they all retry at T=1. Still a synchronized attack.
  2. Jitter: Add randomness.
    • Wait Random(0, 1s), then Random(0, 2s).
    • Result: Spreads the load over time.

[!TIP] Formula (Full Jitter): Sleep = Random(0, min(Cap, Base * 2 ** Attempt))
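This strategy is a few lines of code. The sketch below uses the “Full Jitter” variant, where the entire wait is randomized between zero and the exponential ceiling; the function names and defaults are illustrative.

```python
import random
import time

def full_jitter(attempt, base=1.0, cap=30.0):
    """Sleep duration for retry N: Random(0, min(Cap, Base * 2 ** Attempt))."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=5, base=1.0, cap=30.0):
    """Call fn, retrying transient failures with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(full_jitter(attempt, base, cap))
```

Because each client draws its own random delay, 10,000 clients that failed together at T=0 come back spread across the whole backoff interval instead of in one synchronized wave.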

Interactive Visualizer: Retry Storm vs Jitter

Visualize the load on your database when 100 users fail at once.

[!TIP] Try it yourself: Click “Naive Retry” to see the spikes. Click “Backoff + Jitter” to see the smoothed load.


3.2 Idempotency

[!WARNING] Never Retry non-idempotent operations without safeguards.

If a POST /pay request for $100 times out, do not retry blindly: the charge may have succeeded before the response was lost, and a retry would bill the user twice. Always require the client to generate an Idempotency Key (e.g., a UUID like req_8f7b...).

Anatomy of an Idempotent Request

  1. Storage: The server needs a fast, centralized store (like Redis) to map Idempotency Keys to their final responses. Set a TTL (e.g., 24 hours) to prevent Redis from growing infinitely.
  2. Concurrency Control: Two retries might arrive at the exact same millisecond. Use atomic operations (like Redis SETNX - Set If Not Exists) to ensure only the first thread actually executes the business logic.

The Flow:

  1. Client sends POST /pay with header Idempotency-Key: X.
  2. Server calls Redis: SETNX lock:X "PROCESSING".
  3. If SETNX returns false, another request is already handling it. Wait or return 409 Conflict.
  4. If SETNX returns true, we have the lock. Process the payment.
  5. Save the final response (e.g., {"status": "paid", "id": 99}) to Redis under key X.
  6. If the client retries later with Key X, the server skips the payment and directly serves the cached response from Redis.
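The six-step flow above can be sketched end-to-end. To stay self-contained, an in-memory dict guarded by a lock stands in for Redis (`setnx` mimics the atomic SETNX command); with real Redis you would use `SET key value NX EX <ttl>` to set the lock and TTL in one step. All names here are illustrative.

```python
import threading

class FakeRedis:
    """In-memory stand-in for Redis; setnx mimics the atomic SETNX command."""

    def __init__(self):
        self._data, self._lock = {}, threading.Lock()

    def setnx(self, key, value):
        with self._lock:            # atomic check-and-set, like Redis SETNX
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

store = FakeRedis()

def handle_payment(key, charge_fn):
    cached = store.get(key)
    if cached is not None:
        return cached                        # step 6: replay the cached response
    if not store.setnx("lock:" + key, "PROCESSING"):
        return {"error": "conflict", "code": 409}  # step 3: another request owns it
    response = charge_fn()                   # step 4: we hold the lock, do the work
    store.set(key, response)                 # step 5: persist the final response
    return response
```

A client that times out and resends `Idempotency-Key: req_8f7b` gets the stored response back; the charge itself runs exactly once.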

4. Summary

| Pattern | Goal | Analogy |
| --- | --- | --- |
| Circuit Breaker | Prevent cascading failure. | Fuse box in your house. |
| Bulkhead | Isolate resources. | Watertight compartments on a ship. |
| Backoff + Jitter | Spread out load during recovery. | Polite conversation (don’t all speak at once). |
| Idempotency | Safe retries. | “I already paid you!” |