Circuit Breakers & Bulkheads

A Junior Engineer thinks of reliability as “having more servers.” A Staff Engineer thinks of reliability as “having better walls.”

When a downstream service slows down or fails, it doesn’t just return errors—it consumes resources. It holds on to threads, memory, and connection pools, causing your healthy services to starve and eventually crash. This is a Cascading Failure.

To stop it, we use two fundamental isolation patterns.


1. The Switch: Circuit Breaking

If a service is failing 100% of the time, there is no point in continuing to hammer it. A Circuit Breaker sits between the caller and the service, monitoring failure rates and “tripping” to block requests before they crash your system.

The Three States

  1. Closed (Green): Normal operation. Requests flow through.
  2. Open (Red): Threshold reached. Requests fail fast instead of reaching the dependency.
  3. Half-Open (Yellow): Test mode. Allows a few requests through to check for recovery.
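
A minimal sketch of these three states in plain Java. The class name, fields, and thresholds below are illustrative, not taken from any particular library (Resilience4j, Hystrix, etc.):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal three-state circuit breaker (illustrative; not a library API).
// The synchronized call() keeps the sketch simple; a production breaker would
// not hold a lock across the remote call.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping
    private final Duration openTimeout;   // how long to stay OPEN before probing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt = Instant.MIN;

    public CircuitBreaker(int failureThreshold, Duration openTimeout) {
        this.failureThreshold = failureThreshold;
        this.openTimeout = openTimeout;
    }

    public synchronized <T> T call(Supplier<T> protectedCall) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openTimeout))) {
                state = State.HALF_OPEN;                  // timeout elapsed: allow a probe
            } else {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = protectedCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;                             // probe (or normal call) succeeded
    }

    private void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;                           // trip, or re-trip after a failed probe
            openedAt = Instant.now();
        }
    }
}
```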

2. The Wall: The Bulkhead Pattern

The pattern is named after the partitions in a ship’s hull that prevent the entire boat from sinking if one section is breached. In software, a Bulkhead isolates resources (thread pools, semaphores) so that a failure in one dependency doesn’t take down others.

Thread-Pool Bulkhead (Isolation)

Every dependency gets its own dedicated thread pool.

  • Pros: Complete isolation. If Service A hangs, Service B’s thread pool is still 100% healthy.
  • Cons: Higher memory/CPU overhead due to many threads.
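
A sketch of thread-pool isolation in Java; the two downstream services and the pool sizes are hypothetical:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Thread-pool bulkheads: each dependency gets its own fixed-size pool, so a hang
// in one dependency can only pin that dependency's threads.
public class ThreadPoolBulkheads {
    private final ExecutorService paymentsPool = Executors.newFixedThreadPool(10);
    private final ExecutorService recommendationsPool = Executors.newFixedThreadPool(5);

    public Future<String> callPayments() {
        // If the payments service hangs, at most 10 threads are stuck waiting on it.
        return paymentsPool.submit(() -> remoteCall("payments"));
    }

    public Future<String> callRecommendations() {
        // The recommendations pool is untouched by a payments outage.
        return recommendationsPool.submit(() -> remoteCall("recommendations"));
    }

    private String remoteCall(String service) {
        return "response from " + service; // placeholder for a real network call
    }
}
```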

Semaphore Bulkhead (Concurrency Limit)

A counter that limits the number of concurrent calls to a dependency.

  • Pros: Very low overhead.
  • Cons: No thread isolation. If Service A hangs, it still holds on to the caller threads.
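
The semaphore variant, sketched in Java with an illustrative permit count and acquire timeout:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Semaphore bulkhead: a counter caps concurrent calls to one dependency.
// The caller's own thread still executes the call, so a hung dependency can
// still pin caller threads -- the semaphore only bounds how many.
public class SemaphoreBulkhead {
    private final Semaphore permits = new Semaphore(20); // illustrative limit

    public String call() throws InterruptedException {
        // Reject quickly instead of queueing if the dependency is already saturated.
        if (!permits.tryAcquire(50, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("bulkhead full: rejecting call");
        }
        try {
            return remoteCall();
        } finally {
            permits.release();
        }
    }

    private String remoteCall() {
        return "response"; // placeholder for a real network call
    }
}
```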


3. Staff Hazard: The “Half-Open” Thundering Herd

A common mistake is treating the Half-Open state as a simple “on/off” switch.

  • The Failure: When the open_timeout expires, your breaker moves to Half-Open. If you have 1,000 requests waiting in an upstream backlog, they might all hit the destination simultaneously.
  • The Result: The destination (which was barely recovering) is immediately crushed again, forcing the circuit back to OPEN. This creates a “saw-tooth” availability pattern.
  • The Staff Move: Limit concurrency in the Half-Open state. Instead of letting any waiting request through, allow exactly one probe request at a time until recovery is confirmed (see the sketch below).
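
One way to sketch that guard in Java: a single-permit semaphore ensures at most one probe is in flight while the breaker is Half-Open, and everyone else keeps failing fast. The class and method names are illustrative:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Half-open probe gate: only one request at a time may test the recovering
// dependency; the rest of the backlog continues to fail fast.
public class HalfOpenProbeGate {
    private final Semaphore probePermit = new Semaphore(1);

    public <T> T probe(Supplier<T> protectedCall) {
        if (!probePermit.tryAcquire()) {
            // A probe is already in flight; don't let the backlog stampede through.
            throw new IllegalStateException("half-open: probe in progress, failing fast");
        }
        try {
            return protectedCall.get(); // success here lets the breaker close again
        } finally {
            probePermit.release();
        }
    }
}
```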

4. Staff Insight: Bulkhead Contamination

Multi-tenant systems often use Bulkheads to isolate customers. But purely software-level bulkheads have a hidden flaw: Resource Contamination.

The Shared Runtime Problem

If Customer A has a bulkhead of 10 threads and Customer B has 10 threads, they are “isolated.” However:

  1. The GC Factor: If Customer A causes massive garbage collection, the Stop-the-World GC pause affects EVERY bulkhead in the JVM (or Go runtime).
  2. The Kernel Factor: If Customer A’s threads saturate the machine’s context-switching budget or I/O bandwidth, Customer B’s performance will degrade despite their clean bulkhead.

Staff Tip: True isolation requires Hardware-level fencing (containers with CPU quotas, or dedicated VMs) rather than just thread-pool limits.


5. The “Secondary Outage”: Logging Storms

When a circuit breaker trips and starts “failing fast,” your application might log 10,000 errors per second.

  • The Failure: The CPU cost of serializing JSON logs and the network cost of shipping them to Splunk/Datadog consume 100% of the node’s resources.
  • The Result: The node becomes unresponsive not because of the dependency failure, but because of the System’s response to the failure.
  • The Defense: Implement Log Sampling or Rate Limited Logging for errors within the circuit breaker’s fail-fast path (see the sketch below).
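
A rough sketch of rate-limited error logging in plain Java. The one-second window, the per-second cap, and the use of System.err are all illustrative; in practice this would hook into your logging framework:

```java
import java.util.concurrent.atomic.AtomicLong;

// Best-effort rate limiter for error logs: emit at most maxLogsPerSecond lines,
// count the rest, and report the suppressed total when the window rolls over.
public class RateLimitedErrorLog {
    private final long maxLogsPerSecond;
    private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
    private final AtomicLong loggedInWindow = new AtomicLong(0);
    private final AtomicLong suppressed = new AtomicLong(0);

    public RateLimitedErrorLog(long maxLogsPerSecond) {
        this.maxLogsPerSecond = maxLogsPerSecond;
    }

    public void error(String message) {
        long now = System.currentTimeMillis();
        long windowStart = windowStartMillis.get();
        if (now - windowStart >= 1000 && windowStartMillis.compareAndSet(windowStart, now)) {
            // New one-second window: report how many lines were dropped, then reset.
            long dropped = suppressed.getAndSet(0);
            loggedInWindow.set(0);
            if (dropped > 0) {
                System.err.println("(suppressed " + dropped + " similar errors)");
            }
        }
        if (loggedInWindow.incrementAndGet() <= maxLogsPerSecond) {
            System.err.println(message);
        } else {
            suppressed.incrementAndGet();
        }
    }
}
```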

6. Staff Math: Reliability Engineering

Reliability isn’t just a boolean; it’s a statistical probability.

6.1. The Blast Radius Equation

As a Staff Engineer, you must quantify how much of the system is at risk if a single bulkhead fails.

$$\textbf{Blast Radius (Users)} = \text{Total Users} \times p_{\text{dep}} \times \frac{\text{Resources in Bulkhead}}{\text{Total Resources}}$$

  • Example: 1,000,000 users. 30% ($p=0.3$) call the “Recommendation Svc”. You give that service its own bulkhead of 20 threads in a 100-thread pool.
    • Blast Radius: $1,000,000 \times 0.3 \times \frac{20}{100} = \mathbf{60,000 \text{ users}}$.
  • The Staff Decision: If 60,000 is too high, you must either shard the bulkhead further or reduce the resource allocation.
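
The same arithmetic as a small Java helper (the method name and inputs are illustrative, mirroring the worked example above):

```java
// Blast radius = total users x share that call the dependency x share of
// resources fenced into its bulkhead.
public class BlastRadius {
    static long blastRadius(long totalUsers, double dependencyUsage,
                            double bulkheadResources, double totalResources) {
        return Math.round(totalUsers * dependencyUsage * (bulkheadResources / totalResources));
    }

    public static void main(String[] args) {
        // 1,000,000 users, 30% hit the dependency, 20 of 100 threads in its bulkhead.
        System.out.println(blastRadius(1_000_000, 0.3, 20, 100)); // prints 60000
    }
}
```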

6.2. Statistical Significance of Failures

Is 5 errors in 10 seconds a “Circuit Breaker” event or just background noise?

  • Small Sample Trap: In a low-traffic service (1 RPS), 5 errors in 10 seconds is a 50% failure rate. In a high-traffic service (1,000 RPS), it’s a 0.05% flicker.
  • The Rule: Always set a minimumThroughput (e.g., 20 requests) before the circuit breaker is allowed to evaluate the failure rate. This prevents the circuit from “flipping” due to a single jittery network packet (see the sketch below).
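
A sketch of that guard in Java. The field names (minimumThroughput, failureRateThreshold) are illustrative, though most circuit-breaker libraries expose similar knobs:

```java
// Failure-rate evaluation with a minimum-throughput guard: the breaker refuses
// to judge the failure rate until it has seen enough calls in the window.
public class FailureRateWindow {
    private final int minimumThroughput;        // e.g. 20 requests per window
    private final double failureRateThreshold;  // e.g. 0.5 = 50%
    private int calls = 0;
    private int failures = 0;

    public FailureRateWindow(int minimumThroughput, double failureRateThreshold) {
        this.minimumThroughput = minimumThroughput;
        this.failureRateThreshold = failureRateThreshold;
    }

    public synchronized void record(boolean success) {
        calls++;
        if (!success) failures++;
    }

    public synchronized boolean shouldTrip() {
        if (calls < minimumThroughput) {
            return false; // not enough data: 5 errors in 10 calls is not yet a signal
        }
        return ((double) failures / calls) >= failureRateThreshold;
    }
}
```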

6.3. The Bulkhead Saturation Point

Every bulkhead (Thread Pool) adds overhead (Context Switching).

$$\textbf{Effective Throughput} = \frac{\text{Theoretical Capacity}}{\text{Bulkheads} \times \text{Context Switch Tax}}$$

  • Constraint: If you have 64 CPU cores and create 1,000 thread-pool bulkheads, your CPU will spend more time switching threads than doing work.
  • The Solution: Use Semaphores (async non-blocking) for high-concurrency I/O and Thread Pools only for truly CPU-bound or blocking legacy code.

Staff Takeaway

Resilience is about Fencing.

  • Use Circuit Breakers when you want to avoid hitting a service that is already dead.
  • Use Bulkheads when you want to ensure that even if Service A is dead, Service B still works.
  • Staff Tip: In a microservice mesh (Istio/Linkerd), these should be configured at the sidecar level, keeping reliability logic out of your application code.