When you split a monolith into microservices, you lose ACID transactions across the whole system. You can’t just BEGIN on the Order service and expect the Inventory and Payment services to participate in that same transaction.
The solution is Event-Driven Architecture (EDA), but it introduces a new nightmare: What happens if the middle step fails?
1. Choreography vs. Orchestration
1.1. Choreography (Distributed)
Each service publishes an event and listens for events from others. There is no central controller.
graph LR
O[Order Service] -- OrderCreated --> I[Inventory]
I -- Reserved --> P[Payment]
P -- Paid --> S[Shipping]
P -- Failed --> R[Release Inventory]
- Pros: Low coupling, highly scalable.
- Cons: Hard to visualize the whole flow. “Spaghetti” event loops are common.
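The flow in the diagram above can be sketched with a toy in-memory event bus. This is only an illustration: `EventBus`, `wire_services`, `run_order`, and the `Shipped` and `InventoryReleased` event names are assumptions made for the example, and a real system would publish through a message broker rather than call handlers synchronously.

```python
from collections import defaultdict


class EventBus:
    """Toy synchronous bus (hypothetical); stands in for a real broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_services(bus, payment_succeeds):
    # Inventory reacts to OrderCreated and announces the reservation.
    bus.subscribe("OrderCreated", lambda e: bus.publish("Reserved", e))
    # Payment reacts to Reserved and reports either outcome.
    bus.subscribe(
        "Reserved",
        lambda e: bus.publish("Paid" if payment_succeeds else "Failed", e),
    )
    # Shipping reacts to Paid.
    bus.subscribe("Paid", lambda e: bus.publish("Shipped", e))
    # Inventory also listens for Failed and releases its reservation.
    bus.subscribe("Failed", lambda e: bus.publish("InventoryReleased", e))


def run_order(payment_succeeds):
    bus = EventBus()
    wire_services(bus, payment_succeeds)
    bus.publish("OrderCreated", {"order_id": 1})
    return bus.log
```

Note that no single function describes the whole flow: the sequence only emerges from the subscriptions, which is exactly why choreography is hard to visualize.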
1.2. Orchestration (Centralized)
A central “Orchestrator” service (or State Machine) tells each service what to do.
sequenceDiagram
participant Orch as Saga Orchestrator
participant Inv as Inventory
participant Pay as Payment
Orch->>Inv: Reserve Item
Inv-->>Orch: Reserved
Orch->>Pay: Charge User
Pay-->>Orch: Failed
Orch->>Inv: Release Reservation (Compensate)
- Pros: Explicit state tracking, easy to debug.
- Cons: The Orchestrator becomes a single point of failure and coupling.
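A minimal sketch of the sequence diagram above, written as an explicit state machine. `SagaOrchestrator`, `SagaState`, and the fake services are hypothetical names for illustration; a production orchestrator would persist each state transition before making the next call.

```python
from enum import Enum


class SagaState(Enum):
    STARTED = "started"
    RESERVED = "reserved"
    COMPLETED = "completed"
    COMPENSATED = "compensated"


class FakeInventory:
    """Stand-in for the Inventory service."""

    def __init__(self):
        self.reserved = set()

    def reserve(self, order_id):
        self.reserved.add(order_id)

    def release(self, order_id):
        self.reserved.discard(order_id)


class FakePayment:
    """Stand-in for the Payment service; outcome is fixed at construction."""

    def __init__(self, succeeds):
        self.succeeds = succeeds

    def charge(self, order_id):
        return self.succeeds


class SagaOrchestrator:
    def __init__(self, inventory, payment):
        self.inventory = inventory
        self.payment = payment
        self.state = SagaState.STARTED
        self.history = [SagaState.STARTED]  # explicit, queryable trail

    def _move(self, new_state):
        self.state = new_state
        self.history.append(new_state)

    def run(self, order_id):
        self.inventory.reserve(order_id)
        self._move(SagaState.RESERVED)
        if self.payment.charge(order_id):
            self._move(SagaState.COMPLETED)
        else:
            # Compensate: undo the reservation made in the earlier step.
            self.inventory.release(order_id)
            self._move(SagaState.COMPENSATED)
        return self.state
```

The `history` list is the point of the pattern: at any moment you can ask the orchestrator which step a given order reached.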
The Staff-Level Crisis: The “Cyclic Saga”
In choreography, because there is no central orchestrator, it’s easy to accidentally create an Infinite Event Loop.
- The Failure: Service A emits `OrderCreated` -> Service B reserves payment and emits `PaymentReserved` -> Service A (bug) sees `PaymentReserved` and emits another `OrderCreated`.
- The Result: A runaway, self-sustaining stream of events and database writes that never terminates on its own.
- The Defense: Always include a `correlation_id` and a `hop_count` or `source_trace_id` in your event schema to detect and kill circular flows.
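One way to apply that defense is a hop-count guard checked before every handler runs. The sketch below is hedged: the `Event` class, the `guard` function, and the `MAX_HOPS` ceiling of 10 are assumptions for illustration, not a standard schema.

```python
import uuid
from dataclasses import dataclass, field

MAX_HOPS = 10  # assumed ceiling; tune to the deepest legitimate flow


class LoopDetected(Exception):
    pass


@dataclass(frozen=True)
class Event:
    type: str
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    hop_count: int = 0

    def caused(self, new_type):
        """Derive a follow-up event: same correlation_id, hop_count + 1."""
        return Event(new_type, self.correlation_id, self.hop_count + 1)


def guard(event):
    """Reject events that have travelled further than any valid flow should."""
    if event.hop_count > MAX_HOPS:
        raise LoopDetected(
            f"{event.type} exceeded {MAX_HOPS} hops "
            f"(correlation_id={event.correlation_id})"
        )
    return event


def simulate_buggy_loop():
    """Replay the A -> B -> A bug and count events handled before the guard trips."""
    event = Event("OrderCreated")
    handled = 0
    try:
        while True:
            guard(event)
            handled += 1
            next_type = (
                "PaymentReserved" if event.type == "OrderCreated" else "OrderCreated"
            )
            event = event.caused(next_type)
    except LoopDetected:
        return handled
```

Because every derived event carries the original `correlation_id`, the rejected event can be traced back to the order that started the loop.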
2. The Saga Pattern
A Saga is a sequence of local transactions. Each transaction updates the database and publishes an event or message to trigger the next step. If a step fails, the Saga must execute Compensating Transactions to undo the previous steps.
[!WARNING] Lack of Isolation (ACID vs BASE): Sagas do NOT provide isolation. Other transactions can see the “partially complete” state (e.g., funds are gone from the bank, but the ticket isn’t booked yet). Likewise, an Order might appear “Processing” to a customer after the Payment has been taken but before Inventory is checked. Your UI must handle these in-between states gracefully.
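The compensation rule described above can be sketched as a small runner: execute each local transaction in order and, on failure, undo the completed ones in reverse. `run_saga` and the demo steps are hypothetical; real compensations should also be idempotent and retried, which this sketch omits.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    On the first failure, run the compensations of all completed steps
    in reverse order. Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        completed.append((name, compensate))
        log.append(f"done:{name}")
    return True, log


def demo():
    state = {"order": None, "stock": 5}

    def create_order():
        state["order"] = "PENDING"

    def cancel_order():
        state["order"] = "CANCELLED"

    def reserve_stock():
        state["stock"] -= 1

    def release_stock():
        state["stock"] += 1

    def charge_card():
        raise RuntimeError("card declined")  # simulated failure

    def refund_card():
        pass  # never reached: the charge did not complete

    ok, log = run_saga([
        ("create_order", create_order, cancel_order),
        ("reserve_stock", reserve_stock, release_stock),
        ("charge_card", charge_card, refund_card),
    ])
    return ok, log, state
```

Note that the order ends up `CANCELLED` rather than deleted: a compensation is a new transaction that semantically reverses the earlier one, not a rollback.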
3. Staff Defense: Semantic Locking
Since Sagas lack traditional database isolation, you must implement Semantic Locking at the application layer.
The “Holding” Pattern
Instead of actually deducting money in Step 1, you put the money in a HELD state.
- Try: Place a $100 hold by recording it in `Hold`; `Balance` itself is not deducted yet.
- State Check: If another transaction checks the balance, your business logic must subtract `Hold` from `Balance` to get the funds that are actually available.
- Expiry: Every “Hold” must have a TTL (Time-to-Live). If the Saga orchestrator disappears, a background process must automatically release the hold to prevent resource leaks.
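A minimal sketch of the holding pattern, under one reading of the steps above: the balance is left untouched until the hold is confirmed, and available funds are computed as balance minus active holds. The `Account` class, its method names, and the use of integer cents are assumptions for illustration. The current time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    balance: int  # in cents
    holds: dict = field(default_factory=dict)  # saga_id -> (amount, expires_at)

    def available(self, now):
        """State check: balance minus every hold that has not expired."""
        active = sum(amt for amt, exp in self.holds.values() if exp > now)
        return self.balance - active

    def try_hold(self, saga_id, amount, ttl, now):
        """Try: reserve funds without deducting them."""
        if amount > self.available(now):
            return False
        self.holds[saga_id] = (amount, now + ttl)
        return True

    def confirm(self, saga_id, now):
        """Turn a live hold into a real deduction; fail if it is gone or expired."""
        hold = self.holds.pop(saga_id, None)
        if hold is None or hold[1] <= now:
            return False
        self.balance -= hold[0]
        return True

    def cancel(self, saga_id):
        """Compensation: drop the hold; the balance was never touched."""
        self.holds.pop(saga_id, None)

    def expire_holds(self, now):
        """Background sweep: release holds whose saga never came back."""
        expired = [sid for sid, (_, exp) in self.holds.items() if exp <= now]
        for sid in expired:
            del self.holds[sid]
        return expired
```

In this design cancelling is trivial because nothing was deducted, and a crashed saga costs at most one TTL of locked funds.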
4. Staff Math: Complexity & Scale
Comparing a monolith to a distributed Saga means quantifying both latency and operational cost.
4.1. Coordination Overhead ($N \times RTT$)
In a choreography-based Saga with $N$ steps, every step adds a network hop.
$$ \textbf{End-to-End Latency} = N \times (\text{Local Processing} + \text{Network RTT}) $$
- Example: 8 services with 20ms processing and 5ms RTT.
- Monolith: $8 \times 20 = \mathbf{160ms}$.
- Saga: $8 \times (20+5) = \mathbf{200ms}$.
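- Staff Insight: Latency grows linearly with saga depth, and real hops also pay broker and queueing delay on top of raw RTT, so high-depth choreography can turn a sub-second flow into a multi-second user experience. Use Orchestration to run independent steps in parallel.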
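The numbers above, plus the effect of parallelizing, can be checked with a few lines. This is a simplified model: it assumes every step costs the same, that steps in one stage run fully concurrently, and it ignores the orchestrator's own hops. The `[1, 3, 3, 1]` grouping in the usage note is a hypothetical dependency structure, not something from the example.

```python
def sequential_latency_ms(steps, processing_ms, rtt_ms=0):
    """N x (local processing + network RTT); rtt_ms=0 models the monolith."""
    return steps * (processing_ms + rtt_ms)


def staged_latency_ms(stages, processing_ms, rtt_ms):
    """Latency when independent steps are grouped into concurrent stages.

    `stages` lists how many steps run in parallel in each stage; with
    equal step costs, each stage takes as long as a single step.
    """
    if any(count < 1 for count in stages):
        raise ValueError("every stage needs at least one step")
    return len(stages) * (processing_ms + rtt_ms)
```

With the example's figures, `sequential_latency_ms(8, 20)` gives the monolith's 160 ms and `sequential_latency_ms(8, 20, 5)` gives the saga's 200 ms. If the same 8 steps could be grouped as `[1, 3, 3, 1]`, `staged_latency_ms` gives 100 ms.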
4.2. The “Operational Tax” of Sagas
If your Saga fails (e.g., compensating transaction fails), a human must intervene.
$$ \textbf{Ops Hours/Day} = \text{Total Transactions} \times f_{\text{failure}} \times \text{Human Correction Time} $$
- Example: 1 million transactions/day, 0.1% failure rate, 15 mins (0.25 hours) to fix via SQL.
- $1,000,000 \times 0.001 \times 0.25 = \mathbf{250 \text{ hours of work}}$ per day.
- Staff Move: You must automate every compensating path. If a “Manual Fix” is part of your architecture, the system is not scalable.
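To make the scale of that figure concrete, the same arithmetic can be turned into a headcount estimate. The function names and the 8-hour working day are assumptions for illustration.

```python
def ops_hours_per_day(transactions_per_day, failure_rate, minutes_per_fix):
    """Manual effort per day, in hours, for sagas that need a human fix."""
    failed = transactions_per_day * failure_rate
    return failed * minutes_per_fix / 60


def engineers_needed(hours_per_day, workday_hours=8):
    """Full-time people required to keep up (assumed 8-hour day)."""
    return hours_per_day / workday_hours
```

For the example above, `ops_hours_per_day(1_000_000, 0.001, 15)` is 250 hours, which under the 8-hour assumption is more than 31 people doing nothing but manual repairs.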
4.3. State Machine Storage
If a Saga lasts 2 weeks (e.g., waiting for a physical shipment):
$$ \textbf{Concurrent Sagas} = \text{Transactions/sec} \times \text{Saga Duration} $$
- Example: 10 tx/s $\times$ 1.2 million seconds (2 weeks) = 12 million concurrent states.
- Impact: Your “Orchestrator DB” will likely be larger than your “Order DB.”
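The estimate above is an application of Little's Law (items in the system = arrival rate times time in the system). A small sketch follows; the per-saga state size used in the usage note is purely an assumed figure, since the real size depends on what your orchestrator stores.

```python
SECONDS_PER_DAY = 86_400


def concurrent_sagas(tx_per_second, duration_days):
    """In-flight sagas = arrival rate x how long each one stays open."""
    return tx_per_second * duration_days * SECONDS_PER_DAY


def state_storage_gib(sagas, bytes_per_state):
    """Rough storage for live saga state, in GiB."""
    return sagas * bytes_per_state / 2**30
```

`concurrent_sagas(10, 14)` gives 12,096,000 in-flight sagas, matching the 12 million estimate. At an assumed 2 KiB of state per saga, that is roughly 23 GiB of live orchestrator state before indexes or history.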
5. Staff Case Study: Temporal (ex-Uber Cadence)
At Uber, managing complex business processes (like an Uber trip) using raw Sagas became a maintenance nightmare. They addressed this by building Cadence, an open-source workflow engine; its original creators later forked it to create Temporal.
5.1. The “Durable Execution” Shift
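Instead of scattering saga state and compensation triggers across event handlers in every microservice, Temporal lets you write the workflow as Standard Code (Go, Java, and other SDK languages).
- The Magic: The framework statefully executes the code and persists its progress. If a step fails, the framework handles the retry and the timer for you. Compensation is still logic you write, but it lives in the workflow as ordinary error handling instead of a web of event handlers.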
- The Staff Insight: Move from “Messaging-based Sagas” (where state is hidden in logs) to “Orchestrator-based Workflows” (where state is explicit and queryable).
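The core idea behind durable execution can be sketched as journal-and-replay: record each step's result, and when the workflow function is re-run after a crash, return recorded results instead of re-executing the steps. This is a toy illustration only. `DurableContext`, `order_workflow`, and `crashing_charge` are hypothetical names, and this is not Temporal's API; a real engine persists the journal in a database and requires workflow code to be deterministic.

```python
class DurableContext:
    """Replays journaled step results; executes only steps not yet recorded."""

    def __init__(self, journal):
        self.journal = journal  # shared list that survives a "crash"
        self._cursor = 0
        self.executed = []  # steps actually run by this context

    def step(self, name, fn):
        if self._cursor < len(self.journal):
            recorded_name, result = self.journal[self._cursor]
            if recorded_name != name:
                raise RuntimeError(
                    f"non-deterministic workflow: expected {recorded_name}, got {name}"
                )
            self._cursor += 1
            return result  # replayed, not re-executed
        result = fn()
        self.journal.append((name, result))
        self._cursor += 1
        self.executed.append(name)
        return result


def order_workflow(ctx, charge):
    """The business flow written as ordinary sequential code."""
    reservation = ctx.step("reserve_inventory", lambda: "res-1")
    payment = ctx.step("charge_payment", charge)
    return ctx.step("ship", lambda: f"shipped {reservation} after {payment}")


def crashing_charge():
    raise RuntimeError("worker crashed")  # simulated mid-workflow crash
```

Running `order_workflow` once with `crashing_charge` and then again with a working charge function over the same journal resumes at the payment step: the inventory reservation is replayed from the journal rather than performed twice.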
5.2. Lesson: Complexity Isolation
By centralizing the “Decision Logic” in a workflow engine, you decouple the Sequence of events from the Implementation of the services.
Staff Takeaway
Sagas are the distributed version of TRY/CATCH.
- Use Choreography for simple side-effects (logging, stats).
- Use Orchestration for critical business logic (payments, bookings) where you need a single source of truth for the current state of a transaction.