Reliability Patterns: Designing for Failure

[!TIP] The Amazon Rule: “Everything fails, all the time.” - Werner Vogels, CTO of Amazon

In a distributed system, network calls fail. Databases time out. APIs return 500s. If your service waits indefinitely for a response, it will hang. If 10,000 users are all waiting, the threads and connections they hold pile up until your entire system goes down (a Cascading Failure).

This chapter covers the patterns to fail gracefully and recover automatically.


1. The Circuit Breaker Pattern

Imagine the electrical wiring in your house. If you plug in too many heaters, the Circuit Breaker trips: it cuts the power to prevent the house from burning down. In software, we wrap dangerous network calls (e.g., to a Payment Gateway) in a Circuit Breaker.

1.1 The Three States

  1. CLOSED (Normal): Requests flow through. We count failures.
  2. OPEN (Tripped): Too many failures (e.g., >50% in 10s). All requests fail immediately (Fast Fail). We don’t even try to call the downstream service.
  3. HALF-OPEN (Testing): After a timeout (e.g., 5s), we let one request through.
    • If it succeeds -> Reset to CLOSED.
    • If it fails -> Go back to OPEN.

Interactive Visualizer: The State Machine

Simulate a failing service and watch the Circuit Breaker protect your system.

[Interactive widget: Circuit Breaker Simulator - shows the current state (CLOSED / OPEN / HALF-OPEN), the Client -> Service call flow, and running failure/success counts.]
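
To make the state machine concrete, here is a minimal Python sketch of a circuit breaker. The class name, the failure threshold of 5, and the 5-second reset timeout are illustrative assumptions, not taken from any particular library; in production you would usually reach for an existing resilience library rather than hand-rolling this.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, reset_timeout=5.0):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay OPEN
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            # Fast fail until the reset timeout expires, then probe (HALF-OPEN).
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF-OPEN"

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        else:
            self._on_success()
            return result

    def _on_success(self):
        # A successful call (or HALF-OPEN probe) resets the breaker.
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```

You would wrap each risky downstream call, e.g. `breaker.call(charge_card, order)` (where `charge_card` stands in for your real payment call), so that once the breaker is OPEN the call fails fast instead of tying up a thread.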

2. The Bulkhead Pattern

When the Titanic hit the iceberg, it sank because water spilled from one flooded compartment into the next. The Bulkhead Pattern isolates failures so that one misbehaving dependency cannot sink the whole ship.

  • Thread Pools: Instead of one shared thread pool for all requests, assign a dedicated pool for each downstream service.
    • Pool A (Payment): 10 threads.
    • Pool B (Recommendations): 10 threads.
  • Result: If the “Recommendations” service hangs, it uses up all 10 threads in Pool B. Pool A is unaffected, so users can still pay.
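
As a concrete sketch, assuming a Python service that calls both downstreams, a dedicated `ThreadPoolExecutor` per dependency acts as the bulkhead. The pool sizes, function names, and timeouts below are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the real downstream calls (hypothetical names).
def charge_card(order):
    time.sleep(0.1)               # pretend to talk to the payment gateway
    return {"order": order, "status": "charged"}

def fetch_recommendations(order):
    time.sleep(0.1)               # pretend to talk to the recommendations service
    return ["also-bought-1", "also-bought-2"]

# One dedicated pool per downstream service: a hung recommendations call
# can only exhaust its own 10 threads, never the payment pool.
payment_pool = ThreadPoolExecutor(max_workers=10)
recommendation_pool = ThreadPoolExecutor(max_workers=10)

def handle_checkout(order):
    payment_future = payment_pool.submit(charge_card, order)
    recs_future = recommendation_pool.submit(fetch_recommendations, order)

    payment = payment_future.result(timeout=2.0)   # payment is required: bounded wait
    try:
        recs = recs_future.result(timeout=0.5)     # recommendations are optional
    except Exception:
        recs = []                                  # degrade gracefully if Pool B is slow
    return payment, recs
```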

3. Retries & The Thundering Herd

When a request fails, we often retry. Retrying 3 times seems harmless. But if your database is overloaded, 10,000 users retrying 3 times means 30,000 extra requests. You just DDoS’d yourself. This is a Retry Storm.

3.1 Exponential Backoff & Jitter

  1. Exponential Backoff: Wait longer between retries.
    • Wait 1s, then 2s, then 4s, then 8s.
    • Problem: If 10,000 users fail at T=0, they all retry at T=1. Still a synchronized attack.
  2. Jitter: Add randomness.
    • Wait Random(0, 1s), then Random(0, 2s).
    • Result: Spreads the load over time.
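
A minimal retry helper that combines exponential backoff with “full jitter” might look like the sketch below. The attempt count, base delay, and cap are assumed values for illustration:

```python
import random
import time

def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0):
    """Retry func with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the failure
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount in [0, backoff) so that
            # clients that failed together do not retry together.
            time.sleep(random.uniform(0, backoff))
```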

3.2 Idempotency

[!WARNING] Never retry non-idempotent operations.

If POST /pay $100 times out, do not retry blindly. You might charge the user twice. Always send an Idempotency Key (e.g., a client-generated unique ID such as order_123). The server checks: “Have I seen order_123 before?”

  • If Yes -> Return the previous success response.
  • If No -> Process payment.
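
Server-side, the check is a lookup keyed by the idempotency key before doing any work. The sketch below assumes an in-memory dict and made-up request/response shapes; a real service would keep this in a shared, durable store (e.g., a database or Redis):

```python
# Maps idempotency key -> previously returned response.
# In production this would live in a shared, durable store.
_processed: dict[str, dict] = {}

def handle_payment(idempotency_key: str, amount: int) -> dict:
    # Have we seen this key before? If so, return the original result
    # instead of charging the customer a second time.
    if idempotency_key in _processed:
        return _processed[idempotency_key]

    response = {"status": "charged", "amount": amount}  # stand-in for the real charge
    _processed[idempotency_key] = response
    return response
```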

4. Summary

| Pattern | Goal | Analogy |
| --- | --- | --- |
| Circuit Breaker | Prevent cascading failure. | Fuse box in your house. |
| Bulkhead | Isolate resources. | Watertight compartments on a ship. |
| Backoff + Jitter | Spread out load during recovery. | Polite conversation (don’t all speak at once). |
| Idempotency | Safe retries. | “I already paid you!” |

Next: Security Essentials (OAuth & TLS) ->