Design International Money Transfers

[!IMPORTANT] In this lesson, you will master:

  1. The Purgatory Account Pattern: How to safely hold user funds during long-running cross-border moves.
  2. Double-Entry Bookkeeping: Why you never update a “balance” column but always append to a ledger.
  3. The State Machine Ledger: Managing 5-day transfer lifecycles without losing a single cent.

1. Parameters: Moving Value Across Borders

Moving money from London (GBP) to Mumbai (INR) isn’t just a database update. It involves compliance screening (AML/KYC), external liquidity rails (SWIFT/ACH), and temporal risk (rates changing while the money is in flight).

[!] Clarifying Questions (The “Staff” Angle)

  • Interviewer: “Design a system to handle 1,000 international transfers per second.”
  • Candidate: “Do we settle with the recipient in real-time (IMPS/SCT Inst) or via batch (ACH/SWIFT)?”
  • Interviewer: “Assume a mix. Some are instant, some take 3 days.”
  • Candidate: “How do we handle funds coverage? Do we pre-fund our currency accounts globally, or buy liquidity on-demand?”
  • Interviewer: “We pre-fund. Assume we have pools of GBP and INR ready.”

2. Requirements & Constraints

2.1 Functional Requirements

  1. Cross-Border Orchestration: Manage the end-to-end lifecycle of a transfer (e.g., GBP to INR) including FX conversion and payout.
  2. Double-Entry Ledger: Every movement of money must be recorded as an immutable pair of Debit and Credit entries.
  3. Rail Adaptability: Support multiple payout “rails” (SWIFT, SEPA, IMPS, ACH) with varying speeds and failure modes.
  4. Compliance Screening: Real-time integration with Sanctions and AML (Anti-Money Laundering) engines before funds release.
  5. State Persistence: Support long-running transfers that may take up to 5 business days to settle.

2.2 Non-Functional Requirements

  1. Financial Integrity (Zero Loss): The system must never lose a cent. Total debits must always equal total credits.
  2. Throughput: Support 1,000 Transfers/sec (TPS) at peak.
  3. Durability: Use RAID-backed SQL storage with synchronous WAL (Write-Ahead Logging) for the ledger.
  4. Idempotency: Every operation must be idempotent to handle network retries without duplicate payouts.

2.3 Compliance & Regulatory

  • Data Residency: User PII and financial records must stay within the user’s home jurisdiction (e.g., EU users in Frankfurt).
  • SAR (Suspicious Activity Reporting): Automated hooks to flag transfers exceeding threshold limits for manual review.

3. Capacity Planning & Estimation (The Ledger Weight)

3.1 Throughput Analysis

  • Transfer TPS: 1,000 TPS.
  • Ledger Amplification: One high-level “Transfer” triggers 6-10 Ledger entries (Debit User, Credit Purgatory, FX Swap, Credit Rail, etc.).
  • Write Load: 10,000 Ledger writes/sec.
  • The IOPS Challenge: 10k writes/sec with synchronous commits requires high-performance NVMe storage. To protect the ledger under extreme spikes, prefer load shedding at the API gateway over complex database sharding, since sharding compromises cross-partition ACID guarantees.

3.2 Storage Volume

  • Row Size: 200 bytes per ledger entry.
  • Daily Volume: 1,000 TPS × 10 entries × 86,400s ≈ 864 Million entries/day.
  • Daily Storage: 864M × 200 B ≈ 170 GB/day.
  • Yearly Projection: ~62 TB/year.
  • Conclusion: We need Hot/Cold Storage partitioning. Keep the last 30 days in Postgres for active reconciliation, and move older data to a WORM (Write Once Read Many) data warehouse like Snowflake or BigQuery.
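The capacity arithmetic above can be sanity-checked in a few lines (the inputs are the figures assumed in this section; the exact results round slightly higher than the headline estimates):

```python
# Back-of-envelope capacity check for the ledger (figures from this section).
TPS = 1_000                 # transfers per second
ENTRIES_PER_TRANSFER = 10   # ledger amplification (debits, credits, FX legs, ...)
ROW_BYTES = 200             # bytes per ledger entry
SECONDS_PER_DAY = 86_400

writes_per_sec = TPS * ENTRIES_PER_TRANSFER            # 10,000 ledger writes/sec
entries_per_day = writes_per_sec * SECONDS_PER_DAY     # 864,000,000 entries/day
gb_per_day = entries_per_day * ROW_BYTES / 1e9         # ~173 GB/day
tb_per_year = gb_per_day * 365 / 1e3                   # ~63 TB/year

print(f"{writes_per_sec:,} writes/sec, {entries_per_day / 1e6:.0f}M entries/day, "
      f"{gb_per_day:.0f} GB/day, {tb_per_year:.0f} TB/year")
```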

3.3 State Duration (The “Open” Transaction Problem)

  • If a transfer takes 5 days and we process 1k TPS, we have 1,000 TPS × (5 × 86,400 s) ≈ 432 Million transfers in an “Intermediate” state (Pending/Outgoing) at any given time.
  • Memory Impact: We cannot keep these states in RAM. The orchestrator must use Durable Execution. Building a custom “Saga DB” and polling workers for 432 million transfers is an operational nightmare. We must use a dedicated workflow engine like Temporal.io or AWS Step Functions to natively handle state persistence, exponential backoffs, and sleeping (e.g., waiting 3 days for SWIFT) without tying up compute resources.
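The durable-execution idea can be sketched in plain Python (this is an illustrative stand-in, not the real Temporal API): state is checkpointed after every completed step, so a replacement worker resumes exactly where the dead one stopped instead of restarting the saga.

```python
import json, os, tempfile

# Minimal durable-execution sketch (illustrative stand-in for Temporal.io /
# Step Functions). Workflow state is persisted after every step, so a crashed
# worker can be replaced and the saga resumes where it left off.
STEPS = ["FUNDED", "CONVERTED", "OUTGOING", "SETTLED"]

def persist(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def resume(path):
    with open(path) as f:
        return json.load(f)

def run_transfer(path, crash_after=None):
    state = resume(path)
    for step in STEPS:
        if step in state["done"]:
            continue                    # replay: skip already-completed steps
        state["done"].append(step)      # execute the step (side effects elided)
        persist(path, state)            # checkpoint before proceeding
        if crash_after == step:
            return state                # simulate the worker dying here
    return state

path = os.path.join(tempfile.mkdtemp(), "saga.json")
persist(path, {"done": []})
run_transfer(path, crash_after="CONVERTED")   # worker 1 dies mid-saga
final = run_transfer(path)                    # worker 2 picks up and finishes
```

A real engine adds durable timers on top of this (e.g., sleeping 3 days for SWIFT without holding a thread), which is exactly what the homegrown sweeper approach struggles to do at 432M open transfers.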

4. The 4-Quadrant Whiteboard Layout

Quadrant 1: Reqs & Math

  • 1k TPS incoming
  • 10k Ledger writes/sec
  • 5-day state duration
  • Constraints: Compliance (AML), Zero-Loss Ledger, Rail Adapter volatility

Quadrant 2: High-Level Design

  [User App] → [API Gateway] → [Orchestrator (Temporal)] → [Ledger] → [Rail Adapters]

Quadrant 3: Ledger Model

  • Entries Table: (id, acc_id, amount, side)
  • Account Types: User Wallet, Wise Hold (EUR), Wise Payout (INR)

Quadrant 4: Deep Dives

  • Edge Idempotency
  • 3-Way Reconciliation
  • Auth vs. Settlement Risk
  • Temporal Saga Engine

5. API Edge Idempotency (The User Risk)

Before a transfer even reaches the ledger, you must protect the system from the user. What if a user clicks the “Send” button twice due to a bad 3G connection?

  1. The Client Request: The mobile app generates a UUID v4 Idempotency-Key and attaches it to the HTTP header.
  2. The Gateway Cache: The API Gateway checks a Redis cluster (with a 24-hour TTL) for this key.
  3. The Short-Circuit: If the key exists, the API immediately returns the exact HTTP 200 response stored from the first request. The ledger and orchestration engine are completely bypassed, guaranteeing Alice is only charged once.
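The short-circuit above can be sketched as follows (a plain dict stands in for the Redis cluster, and the handler name is illustrative; a real deployment would use an atomic Redis `SET ... NX EX` with a 24-hour TTL):

```python
# Gateway-side idempotency sketch: replay the stored response for a known key.
idempotency_cache = {}  # idempotency_key -> stored HTTP response

def handle_transfer(idempotency_key, create_transfer):
    cached = idempotency_cache.get(idempotency_key)
    if cached is not None:
        return cached                      # short-circuit: replay first response
    response = create_transfer()           # reaches ledger/orchestrator only once
    idempotency_cache[idempotency_key] = response
    return response
```

A duplicate click with the same key returns the cached HTTP 200 without ever touching the ledger.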

6. Architecture: Global Regional Cells

To solve for Data Residency and Low Latency, Wise uses a Regional Cell Architecture similar to high-frequency trading systems.

  • EU Cell (Home): holds the Transfer Saga state (e.g., FUNDED), Ledger A (EUR), and the user’s PII & balances, keeping them inside the home jurisdiction.
  • INR Cell (Payout): runs the IMPS Rail Adapter, payout execution, Ledger B (INR), and the local liquidity pool.
  • Cells communicate over mTLS.
  • The Reconciliation Bridge joins three sources: (1) the Internal Ledger, (2) the Rail Callback, and (3) the Bank Statement.

7. Technical Depth: Three-Way Reconciliation

At the Staff/Lead level, you must understand that your internal Postgres Ledger is essentially a hallucination. The only truth is what the physical banks say.

The Problem: Ghost Transfers

A Rail Adapter might report a transfer as SUCCESS, but the bank actually rejected it due to a name mismatch. If you only trust your internal DB, you have “Ghost Money” that was never actually sent.

The Three-Way Match Algorithm

A nightly Apache Spark or Flink batch job joins three distinct datasets based on transfer_id and the bank_reference_number:

  1. System A (Internal Ledger): What we think happened.
  2. System B (Payment Processor/Rail Adapter): What the processor’s API (e.g., Stripe, Plaid, or the bank’s own API) says happened.
  3. System C (The Bank Statement - CAMT.053): The actual file the physical bank drops on a secure SFTP server at 2:00 AM showing the physical cash movement.

Reconciliation Outcomes

  • If A = B = C: The transfer is officially marked as RECONCILED (a Terminal state).
  • If A = B, but C shows a rejection (e.g., Bob’s bank account was closed): The reconciliation engine detects the discrepancy and automatically triggers a Compensating Workflow to refund Alice.
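A toy version of the match logic looks like this (a production job would be a Spark/Flink batch join on transfer_id/bank_reference_number; the statuses and the fallback bucket are illustrative assumptions):

```python
# Toy three-way match: internal ledger vs. rail report vs. bank statement.
def three_way_match(internal, rail, bank):
    """Each input maps transfer_id -> status ('SUCCESS', 'REJECTED', ...)."""
    outcomes = {}
    for tid, a in internal.items():
        b, c = rail.get(tid), bank.get(tid)
        if a == b == c == "SUCCESS":
            outcomes[tid] = "RECONCILED"     # terminal state: all three agree
        elif a == b == "SUCCESS" and c == "REJECTED":
            outcomes[tid] = "COMPENSATE"     # bank disagrees: trigger refund flow
        else:
            outcomes[tid] = "INVESTIGATE"    # any other mismatch: manual review
    return outcomes
```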

[!IMPORTANT] Compensating Transactions: When triggering a refund due to a bank rejection, the system must NOT delete the original ledger row. Immutability is law. It must append a Reversal Leg to maintain the audit trail: Debit Wise Payout (INR), Credit Wise EUR Purgatory.
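The append-only discipline can be sketched like this (account names follow the callout above; the row structure and amounts are illustrative and ignore currency conversion):

```python
# Append-only ledger sketch: a refund appends a reversal pair, never deletes.
ledger = [
    {"acc": "Wise EUR Purgatory", "side": "DEBIT",  "amount": 100},  # original leg
    {"acc": "Wise Payout (INR)",  "side": "CREDIT", "amount": 100},
]

def append_reversal(ledger, debit_acc, credit_acc, amount):
    # The original rows stay untouched; the reversal is two NEW entries.
    ledger.append({"acc": debit_acc,  "side": "DEBIT",  "amount": amount})
    ledger.append({"acc": credit_acc, "side": "CREDIT", "amount": amount})

append_reversal(ledger, "Wise Payout (INR)", "Wise EUR Purgatory", 100)
# Debits still equal credits, and the full audit trail survives.
```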


8. Logic: The “Wise Purgatory” Pattern

Money doesn’t teleport. When Alice sends 100 EUR to Bob, the money first moves into an internal Wise Holding Account (The Purgatory).

The Transfer State Machine & The Settlement Risk

At a Lead level, you must understand the difference between Authorization and Settlement.

  • If Alice pays via Credit Card, we receive an instant Authorization. The money is guaranteed. We can immediately advance to FUNDED.
  • If Alice pays via Direct Debit (ACH in the US, SEPA in Europe), the money can bounce (NSF - Non-Sufficient Funds) up to 3 days after she initiated it. If Wise already paid Bob in India on Day 1, Wise takes the loss.

To solve this, the orchestrator introduces a Risk Engine check and an intermediate FUNDING_PENDING state. High-risk transfers wait 3 days for the physical cash settlement before proceeding to FX conversion.

| State | Action | Financial Event |
| --- | --- | --- |
| CREATED | User clicks Send. | None (Intent recorded). |
| FUNDING_PENDING | Payment requested. | Awaiting Settlement / Risk Engine clearance. |
| FUNDED | Money settled/cleared. | Debit User Wallet, Credit Wise EUR Purgatory. |
| CONVERTED | FX applied. | Debit Wise EUR Purgatory, Credit Wise INR Purgatory. |
| OUTGOING | Bank payout sent. | Debit Wise INR Purgatory, Credit Bob’s Bank. |
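The safety property implied by these states (no skipping ahead) can be enforced with an explicit transition map; this is a sketch, and the allowed-transition sets are an assumption based on this section:

```python
# Guarded state machine: a transfer can never jump CREATED -> OUTGOING.
VALID_TRANSITIONS = {
    "CREATED":         {"FUNDING_PENDING", "FUNDED"},  # card auth may skip pending
    "FUNDING_PENDING": {"FUNDED"},
    "FUNDED":          {"CONVERTED"},
    "CONVERTED":       {"OUTGOING"},
    "OUTGOING":        set(),  # terminal here; a real system adds SETTLED/FAILED
}

def transition(current, target):
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

In production this check lives inside the same database transaction that writes the ledger legs, so an out-of-order event cannot advance the transfer.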

Interactive: The Live Ledger

Trigger a transfer and watch the immutable ledger legs populate in real-time.

(Interactive widget: a Double-Entry Ledger Simulator starting in the CREATED state, with Alice (Sender) holding 100.00 EUR and the Wise EUR Purgatory, Wise INR Purgatory, and Bob (Recipient) accounts all at zero.)

[!] Operational Failure Modes (Playbooks)

Scenario A: Rail Outage (Zoned Failure)

  • Problem: The Indian banking rail (IMPS) goes down for 12 hours.
  • Playbook:
    1. The Orchestrator transitions affected transfers to OUTGOING_RETRYING.
    2. Implement Exponential Backoff on the Rail Adapter polling.
    3. Send a “High Volume Delay” push notification to users to reduce CS tickets.

Scenario B: Double Payout (Idempotency Failure)

  • Problem: A bug in the Rail Adapter causes it to send the “Success” signal to the Orchestrator twice, potentially triggering two bank transfers.
  • Playbook:
    1. Use the ledger_transfer_id as the Rail Idempotency Key.
    2. Most modern banks (SWIFT/ACH) allow a client-provided ID to prevent duplicates at the bank layer.
    3. If a duplicate payout is detected, trigger an Immediate Pullback (if supported) or flag for manual bank recall.

[!] Hardware-First Intuition: Why Atoms Matter

In Payment Systems, we rely on Atomic Commit Protocols.

  1. The Double-Entry Rule: Total Credits - Total Debits MUST == 0.
  2. SQL Serializability: To ensure two people don’t spend the same 100 EUR at once, the database must “freeze” the row during the update.
  3. Latency Penalty: This consistency costs us throughput. While Redis handles ~1,000,000 ops/sec, an ACID-compliant Postgres ledger might only handle ~10,000 ops/sec. We accept this 100x slowdown for Financial Truth.
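The “freeze the row” point can be demonstrated with a toy concurrent spend, where a lock plays the role of the database row lock (`SELECT ... FOR UPDATE`); all names here are illustrative:

```python
import threading

# Two concurrent attempts to spend the same 100 EUR are serialized by a lock,
# so exactly one succeeds and the balance can never go negative.
balance = 100
lock = threading.Lock()
results = []

def spend(amount):
    global balance
    with lock:  # analogous to a row lock: check-and-debit is one critical section
        if balance >= amount:
            balance -= amount
            results.append("OK")
        else:
            results.append("DECLINED")

threads = [threading.Thread(target=spend, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one OK, one DECLINED, balance == 0
```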

[!TIP] Staff Engineer Insight: Why No 2PC? In global payments, we never use Two-Phase Commit (2PC) or distributed transactions across different services/databases. They are too fragile over high-latency global networks. Instead, we use Idempotent Side Effects and Asynchronous Reconciliation. If a step fails, we don’t rollback; we append a “Correction” entry.


9. Summary: Senior Interview Checklist

  • Three-Way Reconciliation: Explain how you match Internal Ledger vs. Rail Reports vs. Bank Statements nightly.
  • Compensating Transactions: If the payout fails at the final leg, how do you “Undo” the conversion? (Never delete; append a Reversal leg).
  • Idempotency: Discuss using the Idempotency-Key to look up existing transfer_ids before creating new ones.
  • State Machine Safety: Ensuring a transfer cannot jump from CREATED to OUTGOING without being FUNDED.

10. Follow-up Interview Questions

  1. “What happens if a process crashes immediately after writing to the local database but before publishing the event to Kafka?” Answer: This is the classic dual-write problem. We solve this using the Transactional Outbox Pattern. Instead of writing to the DB and then to Kafka, we write the business entity (Transfer) and an Event payload into an outbox table in the same database transaction. A separate process (like Debezium) tails the database binlog/WAL and reliably publishes the outbox events to Kafka. Crucially, we use the transfer_id as the Kafka Partition Key to guarantee that all state change events for a specific transfer are consumed in the exact sequential order they were written.
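The outbox write described above can be sketched with sqlite (table names and payload shape are illustrative; the Debezium/relay side is elided):

```python
import json
import sqlite3

# Transactional outbox sketch: the business row and its event land in ONE
# local transaction, so a crash can never leave a committed transfer with no
# event (or vice versa). A separate relay publishes outbox rows to Kafka.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transfers (id TEXT PRIMARY KEY, state TEXT)")
db.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")

def fund_transfer(transfer_id):
    with db:  # single atomic transaction: both writes commit or neither does
        db.execute("INSERT INTO transfers VALUES (?, 'FUNDED')", (transfer_id,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"transfer_id": transfer_id, "event": "FUNDED"}),),
        )

fund_transfer("t-123")
```

The monotonically increasing `seq` column also gives the relay a stable ordering to publish from.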

  2. “How do you handle ‘stuck’ transfers if the Orchestrator dies mid-process?” Answer: We avoid building custom “Sweeper Workers” as they scale poorly and introduce database load. Instead, we use a Durable Execution engine like Temporal.io. Temporal natively persists the state machine and handles timers. If a worker dies, Temporal automatically assigns the workflow to another worker to resume exactly where it left off, handling exponential backoffs and timeouts seamlessly.

  3. “Why avoid Two-Phase Commits (2PC) between the Ledger DB and the Rail Adapters?” Answer: 2PC requires locking resources across network boundaries. If a Rail Adapter (or external bank API) becomes unresponsive during the commit phase, it holds database locks open on our Ledger, causing a system-wide outage. Saga patterns with compensating transactions are preferred because they keep locks strictly local and short-lived.

  4. “If Bob’s bank rejects the transfer after we’ve already converted the funds, who takes the FX risk on the refund?” Answer: From a system design perspective, we issue a Compensating Transaction: the transfer lands in OUTGOING_FAILED, and we append reversal legs that walk the funds back through the purgatory accounts. The business logic decides if we refund Alice the exact EUR she sent (Wise takes the FX hit) or refund the INR equivalent converted back to EUR at the current rate (Alice takes the hit). The Ledger supports both by allowing flexible Reversal Legs.

  5. “How do you guarantee strict ordering and partition the Kafka topics orchestrating the transfer state machine?” Answer: For the transfer orchestration, we partition the Kafka topics by transfer_id. This is critical because a transfer must strictly follow its state transitions (CREATED -> FUNDED -> CONVERTED -> OUTGOING). If we didn’t partition by transfer_id, the events might arrive out of order (e.g., processing CONVERTED before FUNDED), breaking the state machine. Partitioning by transfer_id ensures all events for a single transfer are processed sequentially by the same consumer.
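The partitioning argument can be sketched with a simple hash partitioner (Kafka’s Java client defaults to a murmur2-based hash of the key bytes; md5 here is just an illustrative stand-in):

```python
import hashlib

# Hashing the key (transfer_id) pins every event for a given transfer to one
# partition, which is what preserves per-transfer state-transition order.
def partition_for(transfer_id, num_partitions=12):
    digest = hashlib.md5(transfer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

events = ["CREATED", "FUNDED", "CONVERTED", "OUTGOING"]
partitions = {partition_for("t-123") for _ in events}
assert len(partitions) == 1  # all events for t-123 land on the same partition
```

Ordering is only guaranteed within a partition, which is exactly why the key must be the transfer_id and not, say, the user_id of a customer with many concurrent transfers.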