You’ve been tasked with designing a “Super App” booking flow: Flight + Hotel + Car Rental. All three must succeed together, or all must be cancelled.
A Senior Engineer uses a Saga (Sequential 1-2-3). A Staff Engineer knows that for multi-party bookings with high inventory contention, TCC (Try-Confirm-Cancel) is often superior because it provides better isolation and a “Sealing” phase.
1. TCC (Try-Confirm-Cancel)
TCC is a 3-step application-level consistency pattern that mirrors the Two-Phase Commit (2PC) protocol but at the business logic layer.
- Try: Reserve resources (e.g., put a “held” status on a flight seat). This is the Isolation phase.
- Confirm: Finalize the transaction. If all three services have been successfully “Tried,” the orchestrator calls `Confirm` on all.
- Cancel: If any `Try` fails, call `Cancel` on all services that succeeded.
Why TCC over Saga?
In a Saga, you actually charge the user or book the seat in Step 1. If Step 3 fails, you have to refund or cancel. In TCC, the “Try” phase just hides the resource from others. No money moves until the “Confirm” phase.
2. Interactive: TCC Booking Flow
Simulate a complex booking to see how TCC reserves resources across boundaries.
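A minimal Python sketch of such a simulation, assuming in-memory participants. The `Participant` class, the `book` orchestrator, and the inventory counts are hypothetical names chosen for illustration; a real system would call remote services and persist every state change.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical in-memory service holding inventory for one resource type."""
    name: str
    available: int
    held: dict = field(default_factory=dict)       # txn_id -> units on hold
    confirmed: dict = field(default_factory=dict)  # txn_id -> units booked

    def try_reserve(self, txn_id: str, units: int = 1) -> bool:
        # Try: move units from "available" to "held"; nothing is booked yet.
        if self.available < units:
            return False
        self.available -= units
        self.held[txn_id] = units
        return True

    def confirm(self, txn_id: str) -> None:
        # Confirm: turn the hold into a final booking.
        units = self.held.pop(txn_id, None)
        if units is not None:
            self.confirmed[txn_id] = units

    def cancel(self, txn_id: str) -> None:
        # Cancel: release the hold back into available inventory.
        self.available += self.held.pop(txn_id, 0)


def book(txn_id: str, participants: list) -> bool:
    """Try every participant, then Confirm all, or Cancel the Tries that succeeded."""
    tried = []
    for p in participants:
        if p.try_reserve(txn_id):
            tried.append(p)
        else:
            for done in tried:
                done.cancel(txn_id)
            return False
    for p in tried:
        p.confirm(txn_id)
    return True


flight = Participant("flight", available=1)
hotel = Participant("hotel", available=1)
car = Participant("car", available=0)  # sold out, so this Try fails

ok = book("txn-1", [flight, hotel, car])
print(ok, flight.available, hotel.available)
```

Because the car rental `Try` fails, the flight and hotel holds are released and no booking is confirmed anywhere.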
3. Comparison Table
| Feature | Saga | TCC |
|---|---|---|
| Commit Type | Local (immediately visible) | Global (deferred till confirm) |
| Isolation | None (can see dirty reads) | High (reserved state) |
| Complexity | Moderate | High |
| Best For | Independent events (emails) | Critical coordination (money/booking) |
4. Staff Hazard: The “Phantom Cancel”
In a distributed network, messages can be reordered. What happens if a Cancel request arrives at a service before the Try request?
- The Failure: A naive service sees no active transaction for that ID and returns “Success.” Later, the delayed `Try` request arrives and successfully reserves the resource.
- The Result: You have a “Zombie Reservation” that will never be confirmed or canceled, leading to inventory leaks.
- The Staff Move: Always persist a Tombstone for canceled transactions. If a `Try` arrives for a transaction that already has a “Canceled” tombstone, it must be rejected immediately.
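The tombstone rule can be sketched as follows. This is an illustration under stated assumptions: `SeatService` and its method names are hypothetical, and the tombstone set lives in memory here, whereas a real participant would keep it in durable storage.

```python
class SeatService:
    """Hypothetical participant that records a tombstone for every canceled transaction."""

    def __init__(self, seats: int):
        self.available = seats
        self.held = {}           # txn_id -> units on hold
        self.tombstones = set()  # canceled txn_ids (durable storage in a real system)

    def cancel(self, txn_id: str) -> bool:
        # Record the tombstone even when no Try has been seen yet.
        self.tombstones.add(txn_id)
        self.available += self.held.pop(txn_id, 0)
        return True

    def try_reserve(self, txn_id: str, units: int = 1) -> bool:
        if txn_id in self.tombstones:
            return False  # late Try for an already-canceled transaction
        if txn_id in self.held:
            return True   # duplicate Try: the hold already exists
        if self.available < units:
            return False
        self.available -= units
        self.held[txn_id] = units
        return True


svc = SeatService(seats=1)
svc.cancel("t1")                       # Cancel arrives first
late_try = svc.try_reserve("t1")       # delayed Try is rejected
print(late_try, svc.available)
```

With the tombstone in place, the out-of-order `Try` cannot create a reservation, so the seat stays available for other transactions.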
5. Staff Insight: The “Coordinator Zombie”
The Transaction Coordinator (TC) is a single point of failure.
- The Scenario: The TC sends `Try` to 3 services. All succeed. Before sending `Confirm`, the TC node crashes and its local disk is corrupted.
- The Impact: Those 3 services now have resources (hotel rooms, money) “locked” in a `TRY_RESERVED` state. They will wait forever unless they have a safety valve.
- The Solution: Participant-Side TTLs. Every `Try` request should include an expiration (e.g., 5 minutes). If the service hasn’t received a `Confirm` or `Cancel` by then, it auto-releases the resource.
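A rough sketch of a participant-side TTL, assuming a hypothetical `HoldStore` class with an injectable clock so expiry can be demonstrated without waiting. The names and the sweep-on-demand approach are illustrative choices, not a prescribed design.

```python
import time


class HoldStore:
    """Hypothetical participant that expires unconfirmed holds after a TTL."""

    def __init__(self, capacity: int, clock=time.monotonic):
        self.available = capacity
        self.holds = {}  # txn_id -> (units, expires_at)
        self.clock = clock

    def try_reserve(self, txn_id: str, units: int, ttl_seconds: float) -> bool:
        self.release_expired()  # reclaim stale holds before checking capacity
        if self.available < units:
            return False
        self.available -= units
        self.holds[txn_id] = (units, self.clock() + ttl_seconds)
        return True

    def release_expired(self) -> list:
        # Return inventory for every hold whose deadline has passed.
        now = self.clock()
        expired = [t for t, (_, exp) in self.holds.items() if exp <= now]
        for t in expired:
            units, _ = self.holds.pop(t)
            self.available += units
        return expired


now = [0.0]  # fake clock so the example runs instantly
store = HoldStore(capacity=2, clock=lambda: now[0])
store.try_reserve("t1", units=2, ttl_seconds=300)  # 5-minute hold
now[0] = 301.0                                     # coordinator never came back
released = store.release_expired()
print(released, store.available)
```

One consequence of this design is that a `Confirm` arriving after expiry finds no hold, so the participant has to reject it rather than book inventory it no longer reserves.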
6. Confirm/Cancel Idempotency
A common bug is executing the “Confirm” logic twice.
- The Failure: The TC sends `Confirm` to the Payment service. The service executes the payment but the network ACK times out. The TC retries the `Confirm`.
- The Risk: Charging the user twice.
- Staff Rule: Every participant in a TCC flow must be strictly Idempotent. Use the `transaction_id` as a unique constraint in your database to prevent double-processing.
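One way to enforce this, sketched with Python's built-in `sqlite3` module. The `payments` table, its columns, and the `confirm_payment` function are hypothetical; the point is only that a primary-key (unique) constraint on `transaction_id` turns a retried `Confirm` into a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "transaction_id TEXT PRIMARY KEY, "  # unique constraint blocks double-processing
    "amount_cents INTEGER NOT NULL)"
)


def confirm_payment(conn, transaction_id: str, amount_cents: int) -> bool:
    """Return True if this call recorded the charge, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO payments (transaction_id, amount_cents) VALUES (?, ?)",
                (transaction_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # this transaction_id was already confirmed


first = confirm_payment(conn, "txn-42", 19900)
retry = confirm_payment(conn, "txn-42", 19900)  # coordinator retries after a lost ACK
rows = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(first, retry, rows)
```

In both cases the participant can acknowledge success to the coordinator, since the charge exists exactly once either way.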
4. Staff Math: The Cost of Transactions
Distributed transactions aren’t just slow; they are expensive in terms of system capacity.
4.1. The Reservation Lock-up Ratio
In TCC, inventory is held in a “Frozen” state between the Try and Confirm calls.
$$
\textbf{Frozen Units} = \lambda \times N \times \left( p_s T_s + p_f T_f + p_o T_{gc} \right)
$$
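- Variables: $\lambda$ (throughput, txns/sec), $N$ (items per txn), $p_s$, $p_f$, $p_o$ (fractions of transactions that succeed, fail, or are orphaned), $T_s$ (success hold time), $T_f$ (failure hold time), $T_{gc}$ (zombie timeout).
- Example: 100 reqs/sec, 2 items per txn, 200ms success time, with every transaction succeeding ($p_s = 1$).
- Frozen Items: $100 \times 2 \times 0.2 = \mathbf{40 \text{ items}}$.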
- Staff Insight: If your $T_s$ increases (e.g., due to a slow network), your “Available Inventory” drops even if no one is actually buying anything.
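The formula is easy to experiment with in code. The sketch below assumes a hypothetical `frozen_units` helper; the first call reproduces the example above, while the 1% orphan rate in the second call is an invented figure used only to show how the zombie timeout term behaves with a 5-minute TTL.

```python
def frozen_units(rate, items_per_txn, p_s, t_s, p_f=0.0, t_f=0.0, p_o=0.0, t_gc=0.0):
    """Average inventory held in a reserved state (times in seconds)."""
    return rate * items_per_txn * (p_s * t_s + p_f * t_f + p_o * t_gc)


# The example from the text: 100 req/s, 2 items per txn, 200 ms hold, all succeed.
baseline = frozen_units(rate=100, items_per_txn=2, p_s=1.0, t_s=0.2)

# Hypothetical variation: 1% of transactions are orphaned and wait out a 300 s TTL.
with_orphans = frozen_units(
    rate=100, items_per_txn=2, p_s=0.99, t_s=0.2, p_o=0.01, t_gc=300.0
)

print(round(baseline, 1), round(with_orphans, 1))
```

Under these assumed numbers the orphaned transactions dominate the total, because $T_{gc}$ is three orders of magnitude larger than $T_s$.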
4.2. The Saga Throughput Leak
Sagas don’t use locks, but they leak throughput during failures.
$$
\textbf{Real Throughput} = \text{IOPS} \times (1 - 2 \times p_{\text{fail}})
$$
- Impact: For every failed saga, you pay for the Original Write + the Compensating Undo. If your failure rate is 10%, you are wasting 20% of your database IOPS on work that results in zero business value.
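A small sketch of the same arithmetic, using a hypothetical `real_throughput` helper. The 10,000 IOPS budget is an assumed figure for illustration; the 10% failure rate is the one used above.

```python
def real_throughput(iops: float, p_fail: float) -> float:
    """Useful write capacity left after failed sagas and their compensations."""
    return iops * (1 - 2 * p_fail)


budget = 10_000  # assumed database IOPS budget
useful = real_throughput(budget, p_fail=0.10)
wasted_fraction = 1 - useful / budget

print(round(useful), round(wasted_fraction, 2))
```

The model treats each failed saga as costing one original write plus one compensating write, matching the formula above; sagas that fail after several steps would waste more.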
4.3. Message Complexity (2PC vs. TCC)
Standard 2PC requires 6-8 network messages for a single commit across 3 nodes.
- TCC: Also requires 6-8 messages (Try + Confirm/Cancel).
- Difference: 2PC holds Database Locks for the duration; TCC holds Business Reservations. This is why TCC scales where 2PC fails.
Staff Takeaway
Distributed Transactions are the “nuclear option.”
- Try to avoid them by restructuring your sharding or data model.
- If you can’t, use Sagas for simple long-running flows.
- Use TCC for complex, multi-party systems where “phantom” states or double-bookings are unacceptable.