Event-Driven Architecture & The Saga Pattern

When you split a monolith into microservices, you lose ACID transactions across the whole system. You can’t just issue BEGIN on the Order service and expect the Inventory and Payment services to join that same transaction.

The solution is Event-Driven Architecture (EDA), but it introduces a new nightmare: What happens if the middle step fails?

1. Choreography vs. Orchestration

1. Choreography (Distributed)

Each service publishes an event and listens for events from others. There is no central controller.

graph LR
    O[Order Service] -- OrderCreated --> I[Inventory]
    I -- Reserved --> P[Payment]
    P -- Paid --> S[Shipping]
    P -- Failed --> I[Release Inventory]

Pros: Low coupling, highly scalable.
Cons: Hard to visualize the whole flow. “Spaghetti” event loops are common.
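The flow in the diagram can be sketched with a tiny in-memory event bus. This is a minimal illustration, not a production broker: `EventBus`, `wire_services`, `place_order`, the `sku-1` stock record, and the trailing `Shipped` event are all hypothetical names introduced here. The point to notice is that no function knows the whole flow; each handler only reacts to one event and emits the next.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory stand-in for a message broker (hypothetical, single process)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []  # every event type delivered, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def run(self):
        # Deliver events FIFO until no service publishes anything new.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.log.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


def wire_services(bus, stock, card_ok):
    # Each service only knows the events it consumes and the events it emits.
    def inventory_on_order_created(order):
        if stock.get(order["sku"], 0) > 0:
            stock[order["sku"]] -= 1
            bus.publish("Reserved", order)

    def payment_on_reserved(order):
        bus.publish("Paid" if card_ok else "Failed", order)

    def inventory_on_failed(order):
        stock[order["sku"]] += 1  # compensating action: release the reservation

    def shipping_on_paid(order):
        bus.publish("Shipped", order)

    bus.subscribe("OrderCreated", inventory_on_order_created)
    bus.subscribe("Reserved", payment_on_reserved)
    bus.subscribe("Failed", inventory_on_failed)
    bus.subscribe("Paid", shipping_on_paid)


def place_order(card_ok):
    stock = {"sku-1": 1}
    bus = EventBus()
    wire_services(bus, stock, card_ok)
    bus.publish("OrderCreated", {"order_id": 1, "sku": "sku-1"})
    bus.run()
    return bus.log, stock
```

Calling `place_order(False)` yields the event log `OrderCreated`, `Reserved`, `Failed` and leaves the stock untouched, yet the only way to learn that sequence is to read the log after the fact. That is the visibility problem listed under Cons.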

2. Orchestration (Centralized)

A central “Orchestrator” service (or State Machine) tells each service what to do.

sequenceDiagram
    participant Orch as Saga Orchestrator
    participant Inv as Inventory
    participant Pay as Payment
    
    Orch->>Inv: Reserve Item
    Inv-->>Orch: Reserved
    Orch->>Pay: Charge User
    Pay-->>Orch: Failed
    Orch->>Inv: Release Reservation (Compensate)

Pros: Explicit state tracking, easy to debug.
Cons: The Orchestrator becomes a single point of failure and coupling.
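For contrast, here is a minimal sketch of the sequence diagram as an orchestrator with explicit state. The class names (`OrderSagaOrchestrator`, `Inventory`, `Payment`), the `SagaState` values, and the boolean return convention are assumptions made for illustration; a real orchestrator would persist `history` and call the services over the network.

```python
from enum import Enum


class SagaState(Enum):
    STARTED = "started"
    RESERVED = "reserved"
    COMPLETED = "completed"
    COMPENSATED = "compensated"
    FAILED = "failed"


class Inventory:
    def __init__(self, stock):
        self.stock = stock

    def reserve(self, sku):
        if self.stock.get(sku, 0) <= 0:
            return False
        self.stock[sku] -= 1
        return True

    def release(self, sku):
        self.stock[sku] += 1


class Payment:
    def __init__(self, approve):
        self.approve = approve  # hypothetical switch to simulate a declined card

    def charge(self, amount):
        return self.approve


class OrderSagaOrchestrator:
    """Owns the sequence and records every state transition."""

    def __init__(self, inventory, payment):
        self.inventory = inventory
        self.payment = payment
        self.history = [SagaState.STARTED]

    @property
    def state(self):
        return self.history[-1]

    def run(self, sku, amount):
        if not self.inventory.reserve(sku):
            self.history.append(SagaState.FAILED)  # nothing to undo yet
            return self.state
        self.history.append(SagaState.RESERVED)

        if not self.payment.charge(amount):
            self.inventory.release(sku)  # compensate the earlier reservation
            self.history.append(SagaState.COMPENSATED)
            return self.state

        self.history.append(SagaState.COMPLETED)
        return self.state
```

Because the whole sequence lives in `run`, the current state of any transaction is a single lookup in `history`. The cost is that this one class now depends on every participating service.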

The Staff-Level Crisis: The “Cyclic Saga”

In choreography, because there is no central orchestrator, it’s easy to accidentally create an Infinite Event Loop.

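The Failure: Service A emits OrderCreated -> Service B reserves the payment and emits PaymentReserved -> Service A (bug) sees PaymentReserved and emits another OrderCreated.
The Result: A runaway stream of events and database writes that never terminates on its own, and that grows exponentially if any step fans out to more than one consumer.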
The Defense: Always include a correlation_id and a hop_count or source_trace_id in your event schema to detect and kill circular flows.
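A hop-count guard of this kind might look like the sketch below. The `Event` shape, the `MAX_HOPS` ceiling of 10, and the helper names are assumptions chosen for illustration; the right limit depends on how deep your legitimate flows are.

```python
import uuid
from dataclasses import dataclass

MAX_HOPS = 10  # assumed ceiling; set it above your deepest legitimate flow


@dataclass(frozen=True)
class Event:
    type: str
    correlation_id: str
    hop_count: int = 0


class LoopDetected(Exception):
    pass


def new_event(event_type):
    # A root event starts a fresh correlation_id with hop_count 0.
    return Event(event_type, correlation_id=str(uuid.uuid4()))


def derive(parent, event_type):
    # Follow-up events inherit the correlation_id and increment hop_count.
    return Event(event_type, parent.correlation_id, parent.hop_count + 1)


def guard(event):
    # Every consumer calls this before handling an event.
    if event.hop_count > MAX_HOPS:
        raise LoopDetected(
            f"{event.type} exceeded {MAX_HOPS} hops (correlation_id={event.correlation_id})"
        )


def simulate_buggy_loop():
    """Replays the A -> B -> A bug and returns how many events got through."""
    event = new_event("OrderCreated")
    delivered = 0
    try:
        while True:
            guard(event)
            delivered += 1
            next_type = "PaymentReserved" if event.type == "OrderCreated" else "OrderCreated"
            event = derive(event, next_type)
    except LoopDetected:
        return delivered
```

With the guard in place the loop is cut off after `MAX_HOPS + 1` deliveries instead of running forever, and the `correlation_id` in the exception message tells you which flow to investigate.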

2. The Saga Pattern

A Saga is a sequence of local transactions. Each transaction updates the database and publishes an event or message to trigger the next step. If a step fails, the Saga must execute Compensating Transactions to undo the previous steps.

1. Order Created

2. Payment Processed

3. Inventory Reserved
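The three steps above can be expressed as a generic Saga runner: a list of (action, compensation) pairs, with completed steps undone in reverse order when a later step fails. This is a minimal sketch; `SagaStep`, `run_saga`, `demo`, and the in-memory `db` dict are hypothetical, and a real implementation would persist progress between steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensation: Callable[[], None]


def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    trace = []
    for step in steps:
        try:
            step.action()
        except Exception:
            trace.append(f"fail:{step.name}")
            for done in reversed(completed):
                done.compensation()
                trace.append(f"undo:{done.name}")
            return False, trace
        trace.append(f"do:{step.name}")
        completed.append(step)
    return True, trace


def demo(inventory_available):
    db = {"order": None, "charged": 0}

    def create_order():
        db["order"] = "PENDING"

    def cancel_order():
        db["order"] = "CANCELLED"

    def charge():
        db["charged"] = 100

    def refund():
        db["charged"] = 0

    def reserve():
        if not inventory_available:
            raise RuntimeError("out of stock")

    def release():
        pass  # nothing was reserved in this toy model

    steps = [
        SagaStep("order_created", create_order, cancel_order),
        SagaStep("payment_processed", charge, refund),
        SagaStep("inventory_reserved", reserve, release),
    ]
    ok, trace = run_saga(steps)
    return ok, trace, db
```

When inventory is unavailable, the trace reads `do:order_created`, `do:payment_processed`, `fail:inventory_reserved`, `undo:payment_processed`, `undo:order_created`. Note that the sketch assumes compensations never fail; handling that case is where most of the real complexity lives.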


[!WARNING] Lack of Isolation (ACID vs BASE): Sagas do NOT provide isolation. Other transactions can see the “partially complete” state (e.g., funds are gone from the bank, but the ticket isn’t booked yet). An Order might appear “Processing” to a customer after the Payment has been taken but before Inventory is checked. Your UI must handle this “Pivot Point” gracefully.

3. Staff Defense: Semantic Locking

Since Sagas lack traditional database isolation, you must implement Semantic Locking at the application layer.

The “Holding” Pattern

Instead of actually deducting money in Step 1, you put the money in a HELD state.

Try: Move $100 from Balance to Hold.
State Check: Any other transaction that reads the account must work from the available amount (total funds minus active holds), never the raw total, so held money cannot be spent twice.
Expiry: Every “Hold” must have a TTL (Time-to-Live). If the Saga orchestrator disappears, a background process must automatically release the hold to prevent resource leaks.
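The Try / State Check / Expiry rules above can be sketched as a small account model. Everything here is an assumption for illustration: the `Account` class, the choice to keep `balance` as the ledger total with holds tracked separately, and the injectable `clock` used so expiry can be tested without sleeping.

```python
import time


class Account:
    def __init__(self, balance, clock=time.monotonic):
        self.balance = balance  # ledger total, still includes held funds
        self.holds = {}         # saga_id -> (amount, expires_at)
        self.clock = clock

    def available(self):
        # State Check: spendable funds exclude every active hold.
        return self.balance - sum(amount for amount, _ in self.holds.values())

    def try_hold(self, saga_id, amount, ttl_seconds):
        # Try: reserve funds without deducting them yet.
        self.release_expired()
        if amount > self.available():
            return False
        self.holds[saga_id] = (amount, self.clock() + ttl_seconds)
        return True

    def confirm(self, saga_id):
        # Saga succeeded: turn the hold into a real deduction.
        amount, _ = self.holds.pop(saga_id)
        self.balance -= amount

    def cancel(self, saga_id):
        # Saga failed: drop the hold, funds become available again.
        self.holds.pop(saga_id, None)

    def release_expired(self):
        # Expiry: the job a background process would run periodically.
        now = self.clock()
        expired = [sid for sid, (_, expires_at) in self.holds.items() if expires_at <= now]
        for sid in expired:
            del self.holds[sid]
        return expired
```

One design consequence worth noting: `confirm` raises `KeyError` if the hold has already expired, which forces the caller to decide what a late confirmation means instead of silently charging against funds that may have been spent elsewhere.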

4. Staff Math: Complexity & Scale

Comparing a monolith to a distributed Saga requires quantification of latency and operations.

4.1. Coordination Overhead ($N \times RTT$)

In a choreography-based Saga with $N$ steps, every step adds a network hop.

$$ \textbf{End-to-End Latency} = N \times (\text{Local Processing} + \text{Network RTT}) $$

Example: 8 services with 20ms processing and 5ms RTT.
- Monolith: $8 \times 20 = \mathbf{160ms}$.
- Saga: $8 \times (20+5) = \mathbf{200ms}$.
Staff Insight: The overhead scales linearly with depth, so high-depth choreography can turn a sub-second flow into a multi-second user experience. Use Orchestration to run independent steps in parallel.
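The example numbers, plus the effect of parallelizing, can be checked with a few lines of arithmetic. The stage grouping `[1, 3, 3, 1]` is a hypothetical dependency layout, and the staged model assumes every step takes the same time and the orchestrator itself adds no delay.

```python
def monolith_latency_ms(steps, processing_ms):
    # In-process calls: no network hop between steps.
    return steps * processing_ms


def sequential_saga_latency_ms(steps, processing_ms, rtt_ms):
    # Every step pays its local processing plus one network round trip.
    return steps * (processing_ms + rtt_ms)


def staged_saga_latency_ms(stage_sizes, processing_ms, rtt_ms):
    # Steps inside a stage run concurrently, so a stage costs one step's time.
    return len(stage_sizes) * (processing_ms + rtt_ms)


print(monolith_latency_ms(8, 20))                    # 160
print(sequential_saga_latency_ms(8, 20, 5))          # 200
print(staged_saga_latency_ms([1, 3, 3, 1], 20, 5))   # 100
```

Under those assumptions, grouping the same 8 steps into 4 stages halves the sequential Saga latency, which is the payoff of letting an orchestrator fan out independent work.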

4.2. The “Operational Tax” of Sagas

If a Saga cannot recover on its own (e.g., a compensating transaction itself fails), a human must intervene.

$$ \textbf{Ops Hours/Day} = \text{Transactions/Day} \times f_{\text{failure}} \times \text{Human Correction Time (hours)} $$

Example: 1 million transactions/day, 0.1% failure rate, 15 mins (0.25 hours) to fix via SQL.
- $1,000,000 \times 0.001 = 1,000$ failed Sagas/day; $1,000 \times 0.25\text{ h} = \mathbf{250 \text{ hours of work per day}}$.
Staff Move: You must automate every compensating path. If a “Manual Fix” is part of your architecture, the system is not scalable.
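To make the scale of that number concrete, the same calculation can be extended to headcount. The 8-hour shift is an assumption added here, not a figure from the example.

```python
def ops_hours_per_day(transactions_per_day, failure_rate, minutes_per_fix):
    failed_sagas = transactions_per_day * failure_rate
    return failed_sagas * minutes_per_fix / 60  # convert minutes to hours


def engineers_needed(hours_per_day, shift_hours=8):
    # shift_hours is an assumption for illustration.
    return hours_per_day / shift_hours


hours = ops_hours_per_day(1_000_000, 0.001, 15)
print(round(hours))                        # 250
print(round(engineers_needed(hours), 2))   # 31.25
```

Roughly 31 people doing nothing but manual fixes is what a 0.1% unrecoverable failure rate costs at this volume, which is why the manual path cannot be part of the design.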

4.3. State Machine Storage

If a Saga lasts 2 weeks (e.g., waiting for a physical shipment):

$$ \textbf{Concurrent Sagas} = \text{Transactions/sec} \times \text{Saga Duration (sec)} $$

Example: 10 tx/s $\times$ 1.2 million seconds (2 weeks) = 12 million concurrent states.
Impact: Your “Orchestrator DB” will likely be larger than your “Order DB.”
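A rough sizing sketch follows. The 2 KiB of persisted state per Saga is purely an assumed figure for illustration; measure your own workflow state before relying on it.

```python
SECONDS_PER_DAY = 86_400


def concurrent_sagas(tx_per_second, saga_duration_days):
    # Arrival rate multiplied by how long each Saga stays open.
    return tx_per_second * saga_duration_days * SECONDS_PER_DAY


def state_storage_gib(saga_count, bytes_per_saga):
    return saga_count * bytes_per_saga / 2**30


in_flight = concurrent_sagas(10, 14)
print(in_flight)                                      # 12096000
print(round(state_storage_gib(in_flight, 2048), 1))   # 23.1
```

The exact count is 12,096,000, which the example rounds to 12 million. The storage figure only covers current state; if the orchestrator also keeps a per-step history for each Saga, the footprint grows with the number of steps.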

5. Staff Case Study: Temporal (ex-Uber Cadence)

At Uber, managing complex business processes (like an Uber trip) using raw Sagas became a maintenance nightmare. Uber addressed this by building Cadence, an open-source workflow engine; its original creators later forked it into Temporal.

5.1. The “Durable Execution” Shift

Instead of writing explicit compensating logic in every microservice, Temporal allows you to write the workflow as Standard Code (Go/Java).

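The Magic: The framework statefully executes the code, persisting progress so a workflow can resume after a crash. Retries and timers are handled by the framework. Compensation is still your logic, but it becomes ordinary error-handling code inside the workflow instead of event handlers scattered across services.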
The Staff Insight: Move from “Messaging-based Sagas” (where state is hidden in logs) to “Orchestrator-based Workflows” (where state is explicit and queryable).
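The core idea behind durable execution can be illustrated with a toy journal-and-replay loop. This is not Temporal's or Cadence's API: `durable_step`, `order_workflow`, and the plain dict used as a journal are hypothetical stand-ins for a persisted event history.

```python
def durable_step(journal, name, fn):
    """Run fn once; on replay, return the recorded result instead of re-running it."""
    if name in journal:
        return journal[name]
    result = fn()
    journal[name] = result  # a real engine would persist this durably
    return result


def order_workflow(journal, activities):
    # Reads like ordinary sequential code; the journal makes it resumable.
    reservation = durable_step(journal, "reserve", activities["reserve"])
    receipt = durable_step(journal, "charge", activities["charge"])
    return reservation, receipt


def demo():
    calls = {"reserve": 0, "charge": 0}

    def reserve():
        calls["reserve"] += 1
        return "reservation-1"

    def charge():
        calls["charge"] += 1
        if calls["charge"] == 1:
            raise ConnectionError("worker crashed mid-step")
        return "receipt-1"

    journal = {}
    activities = {"reserve": reserve, "charge": charge}
    try:
        order_workflow(journal, activities)  # first attempt dies during charge
    except ConnectionError:
        pass
    result = order_workflow(journal, activities)  # replay from the journal
    return result, calls, journal
```

On the second run `reserve` is not executed again because its result is already in the journal, so the workflow resumes at `charge`. The journal doubles as the explicit, queryable state that messaging-based Sagas lack.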

5.2. Lesson: Complexity Isolation

By centralizing the “Decision Logic” in a workflow engine, you decouple the Sequence of events from the Implementation of the services.

Staff Takeaway

Sagas are the distributed version of TRY/CATCH.

Use Choreography for simple side-effects (logging, stats).
Use Orchestration for critical business logic (payments, bookings) where you need a single source of truth for the current state of a transaction.