Distributed Transactions: TCC & Sagas

You’ve been tasked with designing a “Super App” booking flow: Flight + Hotel + Car Rental. All three must succeed together, or all must be cancelled.

A Senior Engineer uses a Saga (Sequential 1-2-3). A Staff Engineer knows that for multi-party bookings with high inventory contention, TCC (Try-Confirm-Cancel) is often superior because it provides better isolation and a “Sealing” phase.

1. TCC (Try-Confirm-Cancel)

TCC is a 3-step application-level consistency pattern that mirrors the Two-Phase Commit (2PC) protocol but at the business logic layer.

Try: Reserve resources (e.g., put a “held” status on a flight seat). This is the Isolation phase.
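Confirm: Finalize the transaction. Once Try has succeeded on all three services, the orchestrator calls Confirm on each of them.
Cancel: If any Try fails, the orchestrator calls Cancel on every service whose Try succeeded, and on any whose outcome is unknown (for example, a Try that timed out).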
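The three steps can be sketched as a small orchestrator. This is a minimal in-memory illustration, not a production design: `Participant`, `run_tcc`, and the state strings are hypothetical names, and the method calls stand in for remote requests. As a conservative choice, the sketch also sends Cancel to the participant whose Try failed, in case that call partially applied.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical in-memory stand-in for a remote service (flight, hotel, car)."""
    name: str
    fail_on_try: bool = False
    # txn_id -> "RESERVED" | "CONFIRMED" | "CANCELED"
    state: dict = field(default_factory=dict)

    def try_reserve(self, txn_id: str) -> bool:
        if self.fail_on_try:
            return False
        self.state[txn_id] = "RESERVED"  # hold the resource, nothing is final yet
        return True

    def confirm(self, txn_id: str) -> None:
        self.state[txn_id] = "CONFIRMED"

    def cancel(self, txn_id: str) -> None:
        self.state[txn_id] = "CANCELED"


def run_tcc(txn_id: str, participants: list[Participant]) -> bool:
    """Try every participant; Confirm all on success, otherwise Cancel the attempted ones."""
    attempted: list[Participant] = []
    for p in participants:
        attempted.append(p)  # recorded before the call so a failed Try is canceled too
        if not p.try_reserve(txn_id):
            for a in attempted:
                a.cancel(txn_id)
            return False
    for p in participants:
        p.confirm(txn_id)
    return True
```

Calling `run_tcc("txn-1", [flight, hotel, car])` with a `car` participant that fails its Try leaves the flight and hotel in the `CANCELED` state and never confirms anything.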

Why TCC over Saga?

In a Saga, you actually charge the user or book the seat in Step 1. If Step 3 fails, you have to refund or cancel. In TCC, the “Try” phase just hides the resource from others. No money moves until the “Confirm” phase.
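For contrast, here is a rough sketch of the Saga behaviour described above. `run_saga`, `ledger`, and `fail_car` are hypothetical names used only for illustration; the point is that the charge and the hotel booking really happen, and are visible, before the car step fails and the compensations run.

```python
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: list[Step]) -> bool:
    """Run each action in order; on failure, undo completed steps in reverse order."""
    compensations: list[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()  # commits locally and is visible to others right away
        except Exception:
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(compensate)
    return True


ledger: list[str] = []  # records every side effect, including the undo work


def fail_car() -> None:
    raise RuntimeError("no cars available")


ok = run_saga([
    (lambda: ledger.append("charge flight"), lambda: ledger.append("refund flight")),
    (lambda: ledger.append("book hotel"), lambda: ledger.append("cancel hotel")),
    (fail_car, lambda: None),
])
```

After this runs, `ok` is `False` and `ledger` contains the charge, the booking, and then both compensations: the user was charged and refunded, which is exactly the window TCC's Try phase is meant to avoid.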

2. Interactive: TCC Booking Flow

Simulate a complex booking to see how TCC reserves resources across boundaries.

[Interactive widget: Flight, Hotel, and Car panels, each stepping through TRY and then CONFIRM.]

TCC uses a reservation phase to avoid "phantom" bookings.

3. Comparison Table

Feature	Saga	TCC
Commit Type	Local (immediately visible)	Global (deferred until Confirm)
Isolation	None (others can read state that is later compensated)	High (reserved state)
Complexity	Moderate	High
Best Fit	Independent events (emails)	Critical coordination (money/booking)

4. Staff Hazard: The “Phantom Cancel”

In a distributed network, messages can be reordered. What happens if a Cancel request arrives at a service before the Try request?

The Failure: A naive service sees no active transaction for that ID and returns “Success.” Later, the delayed Try request arrives and successfully reserves the resource.
The Result: You have a “Zombie Reservation” that will never be confirmed or canceled, leading to inventory leaks.
The Staff Move: Always persist a Tombstone for canceled transactions, even when the Cancel arrives before any Try has been seen. If a Try arrives for a transaction that already has a “Canceled” tombstone, it must be rejected immediately.
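The tombstone rule can be sketched as follows. `SeatService` is a hypothetical participant, and its in-memory `records` dict stands in for what would need to be durable storage in a real service.

```python
class SeatService:
    """Hypothetical participant that records a tombstone for every Cancel."""

    def __init__(self, seats: int) -> None:
        self.available = seats
        self.records: dict[str, str] = {}  # txn_id -> "RESERVED" | "CANCELED"

    def cancel(self, txn_id: str) -> bool:
        if self.records.get(txn_id) == "RESERVED":
            self.available += 1  # release the held seat exactly once
        # Persist the tombstone even when no Try was ever seen for this ID.
        self.records[txn_id] = "CANCELED"
        return True

    def try_reserve(self, txn_id: str) -> bool:
        if self.records.get(txn_id) == "CANCELED":
            return False  # late Try for a transaction that was already canceled
        if txn_id in self.records:
            return True  # duplicate Try: the seat is already held
        if self.available == 0:
            return False
        self.available -= 1
        self.records[txn_id] = "RESERVED"
        return True
```

With this in place, a Cancel followed by a delayed Try for the same ID leaves `available` unchanged, so no zombie reservation is created. How long tombstones are kept before being purged is a separate design decision that this sketch does not address.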

5. Staff Insight: The “Coordinator Zombie”

The Transaction Coordinator (TC) is a single point of failure.

The Scenario: The TC sends Try to 3 services. All succeed. Before sending Confirm, the TC node crashes and its local disk is corrupted.
The Impact: Those 3 services now have resources (hotel rooms, money) “locked” in a TRY_RESERVED state. They will wait forever unless they have a safety valve.
The Solution: Participant-Side TTLs. Every Try request should include an expiration (e.g., 5 minutes). If the service hasn’t received a Confirm or Cancel by then, it auto-releases the resource, and any Confirm that arrives after the release must be rejected rather than applied.
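A participant-side TTL might look roughly like this. `HoldStore`, its `sweep` method, and the `EXPIRED` state are hypothetical names; the clock is injectable only so the behaviour can be exercised without waiting five minutes.

```python
import time
from typing import Callable, Optional


class HoldStore:
    """Hypothetical participant store whose holds expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        # txn_id -> (state, expires_at); expires_at is None once the hold is settled
        self.holds: dict[str, tuple[str, Optional[float]]] = {}

    def try_reserve(self, txn_id: str, ttl_seconds: float) -> None:
        self.holds[txn_id] = ("TRY_RESERVED", self.clock() + ttl_seconds)

    def confirm(self, txn_id: str) -> bool:
        record = self.holds.get(txn_id)
        if record is None:
            return False
        state, expires_at = record
        if state == "CONFIRMED":
            return True  # repeated Confirm is a no-op
        if state != "TRY_RESERVED" or self.clock() >= expires_at:
            return False  # already released or past its deadline
        self.holds[txn_id] = ("CONFIRMED", None)
        return True

    def sweep(self) -> list[str]:
        """Release every hold whose deadline has passed; return the released IDs."""
        now = self.clock()
        expired = [
            txn_id
            for txn_id, (state, expires_at) in self.holds.items()
            if state == "TRY_RESERVED" and now >= expires_at
        ]
        for txn_id in expired:
            self.holds[txn_id] = ("EXPIRED", None)
        return expired
```

In a running service `sweep` would be called periodically by a background job. Note that `confirm` checks the deadline itself, so a late Confirm is refused even if the sweeper has not run yet.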

6. Confirm/Cancel Idempotency

A common bug is executing the “Confirm” logic twice.

The Failure: The TC sends Confirm to the Payment service. The service executes the payment but the network ACK times out. The TC retries the Confirm.
The Risk: Charging the user twice.
Staff Rule: Every participant in a TCC flow must be strictly Idempotent. Use the transaction_id as a unique constraint in your database to prevent double-processing.
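One way to apply the unique-constraint rule, sketched with Python's built-in `sqlite3` module. The `payments` table, its columns, and `confirm_payment` are hypothetical; a real payment service would use its own schema and database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " transaction_id TEXT PRIMARY KEY,"  # the uniqueness guard against double-processing
    " amount_cents INTEGER NOT NULL)"
)


def confirm_payment(conn: sqlite3.Connection, transaction_id: str, amount_cents: int) -> bool:
    """Return True if this call recorded the charge, False if it was a duplicate Confirm."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO payments (transaction_id, amount_cents) VALUES (?, ?)",
                (transaction_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # A row for this transaction_id already exists: the earlier Confirm went through.
        return False
```

A retried Confirm with the same `transaction_id` hits the primary-key constraint and returns `False`, so the handler can acknowledge the retry without charging again.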

7. Staff Math: The Cost of Transactions

Distributed transactions aren’t just slow; they are expensive in terms of system capacity.

7.1. The Reservation Lock-up Ratio

In TCC, inventory is held in a “Frozen” state between the Try and Confirm calls.

$$ \textbf{Frozen Units} = \lambda \times N \times \left( p_s T_s + p_f T_f + p_o T_{gc} \right) $$

Variables: $\lambda$ (throughput in txns/sec), $N$ (items per txn), $p_s$, $p_f$, $p_o$ (fractions of transactions that succeed, fail, or are orphaned), $T_s$ (success hold time), $T_f$ (failure hold time), $T_{gc}$ (zombie timeout).
Example: 100 reqs/sec, 2 items per txn, 200ms success time, with every transaction succeeding ($p_s = 1$, $p_f = p_o = 0$).
- Frozen Items: $100 \times 2 \times 0.2 = \mathbf{40 \text{ items}}$.
Staff Insight: If your $T_s$ increases (e.g., due to a slow network), your “Available Inventory” drops even though sales volume has not changed.
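The formula is easy to check numerically. In this sketch, `frozen_units` and its parameter names are hypothetical, and $p_s$, $p_f$, $p_o$ are read as the fractions of transactions that succeed, fail, and are orphaned. The second call uses made-up numbers (1% orphaned transactions, a 300-second timeout) purely to show how quickly the $T_{gc}$ term can dominate.

```python
def frozen_units(
    rate: float,            # lambda: transactions per second
    items_per_txn: float,   # N
    p_s: float, t_s: float,          # success fraction and hold time (seconds)
    p_f: float = 0.0, t_f: float = 0.0,    # failure fraction and hold time
    p_o: float = 0.0, t_gc: float = 0.0,   # orphan fraction and zombie timeout
) -> float:
    """Average number of units held in the Frozen state at any moment."""
    return rate * items_per_txn * (p_s * t_s + p_f * t_f + p_o * t_gc)


# The article's example: every transaction succeeds after 200 ms.
baseline = frozen_units(100, 2, p_s=1.0, t_s=0.2)

# Hypothetical variation: 1% of transactions are orphaned and wait out a 300 s timeout.
with_orphans = frozen_units(100, 2, p_s=0.99, t_s=0.2, p_o=0.01, t_gc=300)
```

Under these assumed numbers, `baseline` comes out at about 40 units and `with_orphans` at about 640, so a small orphan rate with a long timeout can freeze far more inventory than the normal success path.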

7.2. The Saga Throughput Leak

Sagas don’t use locks, but they leak throughput during failures.

$$ \textbf{Real Throughput} = \text{IOPS} \times (1 - 2 \times p_{\text{fail}}) $$

Impact: For every failed saga, you pay for the Original Write + the Compensating Undo. If your failure rate is 10%, you are wasting 20% of your database IOPS on work that results in zero business value.
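The same arithmetic as a short sketch. `useful_throughput` is a hypothetical helper and the 10,000 IOPS budget is an assumed figure for illustration.

```python
def useful_throughput(iops: float, p_fail: float) -> float:
    """IOPS left for useful work when each failed saga costs a write plus its undo."""
    return iops * (1 - 2 * p_fail)


total_iops = 10_000  # assumed database budget
useful = useful_throughput(total_iops, p_fail=0.10)
wasted_fraction = 1 - useful / total_iops
```

With a 10% failure rate, `useful` is about 8,000 IOPS and `wasted_fraction` is about 0.20, matching the 20% figure above.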

7.3. Message Complexity (2PC vs. TCC)

Standard 2PC requires 6-8 network messages for a single commit across 3 nodes.

TCC: Also requires 6-8 messages (Try + Confirm/Cancel).
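Difference: 2PC holds Database Locks for the duration; TCC holds Business Reservations, and each Try, Confirm, or Cancel is a short local transaction that commits on its own. This is why TCC tends to scale under contention where 2PC struggles.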

Staff Takeaway

Distributed Transactions are the “nuclear option.”

Try to avoid them by restructuring your sharding or data model.
If you can’t, use Sagas for simple long-running flows.
Use TCC for complex, multi-party systems where “phantom” states or double-bookings are unacceptable.