Deep Dive: Trading Order System (RFQ)

[!NOTE] This module explores the core principles of an RFQ trading order system, deriving solutions from first principles and hardware constraints to build production-ready expertise.

1. Introduction: The High-Stakes World of RFQ

In the world of cryptocurrency exchanges, there are two primary models for trading: CLOB (Central Limit Order Book) and RFQ (Request for Quote). While CLOBs (like Nasdaq or Binance) match buyers and sellers continuously, RFQ systems are designed for institutional clients or “Convert” features where a user asks: “How much for 10 BTC right now?” and the system replies: “$50,000.00. Valid for 7 seconds.”

This creates a unique distributed systems challenge: Ephemeral State Management. You must guarantee a price for a short window, managing the risk that the market might crash during those seconds. If your system is too slow, you lose money. If your system is inconsistent, you lose user funds.

This guide assumes no prior knowledge of financial systems. We will build a production-grade, globally scalable RFQ system from the ground up, dissecting every component from the Load Balancer to the Database Kernel.


2. Requirements & Constraints

2.1 Functional Requirements

  1. Authentication & Authorization: Users must log in via a secure Identity Provider (OIDC). Access is strictly controlled via Role-Based Access Control (RBAC) and KYC Tiering.
  2. Market Data Feed (Ticker): Users need a real-time stream of indicative prices for all supported pairs (e.g., GET /prices), distinct from firm executable quotes.
  3. Get Quote (RFQ): Users request a firm buy/sell price. The system returns a cryptographically signed quote with an absolute expires_at timestamp.
  4. Execute Order: Users can accept the quote only if:
    • Expiry Check: Server time ≤ expires_at (plus grace period).
    • Funds Reservation: Usage of a “Hold” mechanism to lock funds before execution.
    • Atomic Claim: Introduction of a “Claim Key” to prevent double-filling (replay protection).
  5. Wallet Management: Users must be able to manage their balances (deposits, withdrawals). The system must use a Ledger/Journal model for auditability.
  6. Transaction History: Users can view their past trades and current portfolio value.
  7. Settlement: Funds must be exchanged atomically. No partial states.

2.2 Non-Functional Requirements

  1. Reliability: 99.999% (Five Nines).
    • Strategy: Active-Active for stateless services (Quotes), Active-Passive with <30s failover for stateful services (Orders).
    • SLO Split: Execution (99.99%) vs Quoting (99.9% with graceful degradation).
  2. Latency:
    • Quote Generation: < 50ms (99th percentile). Stale quotes are risky.
    • Order Execution: < 100ms (99th percentile). Fast feedback is crucial.
  3. Consistency: Strong Consistency (ACID) for all financial transactions. Eventual consistency is not allowed for balances.
  4. Throughput: Support 10,000 Quotes/sec and 500 Trades/sec initially, scalable to 10x via sharding.
  5. Availability: 24/7. Crypto markets never sleep.
  6. Security: Strict Authentication (mTLS/JWT), Rate Limiting, and Audit Logs.
  7. Global Reach: Multi-region deployment with “Home Region” pinning for users.

2.3 Compliance & Regulations

[!IMPORTANT] Problem Statement: “System should be compliant with country regulations”

  • KYC (Know Your Customer): Mandatory verification before trading.
    • Tier 1: Email (Low limits).
    • Tier 2: ID + Selfie (Medium limits).
    • Tier 3: Source of Funds (High limits).
  • AML (Anti-Money Laundering):
    • Sanctions Screening: Check users against OFAC/UN lists securely.
    • Transaction Monitoring: Real-time flagging of suspicious patterns (e.g., structuring).
  • Travel Rule: For crypto withdrawals >$1,000, originator and beneficiary details must be transmitted to the receiving VASP (Virtual Asset Service Provider).
  • Market Abuse: Prevention of wash trading and spoofing via Rate Limiting and Anomaly Detection.
  • Data Residency:
    • GDPR (EU): European user data stays in eu-west-1.
    • CCPA (US): US user data stays in us-east-1.

3. Capacity Planning & Estimation

Before writing code, we must understand the hardware requirements.

3.1 Traffic Analysis

  • DAU (Daily Active Users): 1,000,000.
  • Quote-to-Trade Ratio: 20:1. Users check prices frequently but trade less often.
  • Quotes: 20 requests/user/day → 20M quotes/day.
  • Average QPS: 20,000,000 / 86,400 ≈ 230 QPS.
  • Peak QPS: Financial markets are uniquely volatile. A single tweet (e.g., from Elon Musk) can trigger massive spikes. We must design for 10x to 50x bursts → 2,300 - 11,500 QPS.
  • Why design for peak? If your system fails during a rally, users can’t trade, and you lose the most profitable moments.
  • Trades: 1M trades/day.
  • Average TPS: 1,000,000 / 86,400 ≈ 12 TPS.
  • Peak TPS: 120 TPS.

3.2 Bandwidth & Network

  • Quote Payload: ~1KB (JSON with prices, metadata, signature).
  • Bandwidth: 2,300 QPS × 1 KB ≈ 2.3 MB/s. Trivial for modern networks (1Gbps links).
  • Order Payload: ~500 Bytes.
  • Bandwidth: Negligible.

3.3 Storage (The Ledger)

  • Trades Table: The primary source of truth.
  • Row Size: ~500 Bytes (IDs, timestamps, prices, fees).
  • Daily Growth: 1,000,000 trades × 500 B = 500 MB/day.
  • Yearly Growth: 500 MB × 365 ≈ 180 GB.
  • 5-Year Retention: 180 GB × 5 ≈ 900 GB.
  • Conclusion: A single master database could hold this volume, but for IOPS (Input/Output Operations Per Second) and concurrency, we will need Sharding (Consistent Hashing).

3.4 Memory (Redis Cache)

  • Active Quotes: Only valid for 7 seconds.
  • At peak 2,300 QPS × 7s = ~16,100 active quotes in memory.
  • Size: 16,100 × 1 KB ≈ 16 MB.
  • Conclusion: Redis memory usage is tiny. We are CPU/Network bound, not Memory bound.
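These estimates are quick to sanity-check in code. A minimal sketch, using only the assumptions stated above (1M DAU, 20 quotes/user, a 10x burst factor, 1 KB quotes, 500 B trade rows):

```python
# Back-of-envelope capacity check using the assumptions above
DAU = 1_000_000
QUOTES_PER_USER = 20
SECONDS_PER_DAY = 86_400
BURST_FACTOR = 10

quotes_per_day = DAU * QUOTES_PER_USER               # 20M quotes/day
avg_qps = quotes_per_day / SECONDS_PER_DAY           # ~230 QPS
peak_qps = avg_qps * BURST_FACTOR                    # ~2,300 QPS

# Redis working set: a quote only lives for 7 seconds
QUOTE_TTL_S = 7
QUOTE_SIZE_B = 1024
active_quotes = peak_qps * QUOTE_TTL_S               # ~16k quotes in flight
redis_mb = active_quotes * QUOTE_SIZE_B / 1_000_000  # ~16 MB

# Ledger growth: 1M trades/day at ~500 B/row, kept 5 years
TRADES_PER_DAY = 1_000_000
ROW_SIZE_B = 500
five_year_gb = TRADES_PER_DAY * ROW_SIZE_B * 365 * 5 / 1_000_000_000

print(f"avg ~{avg_qps:.0f} QPS, peak ~{peak_qps:.0f} QPS")
print(f"~{active_quotes:.0f} active quotes, ~{redis_mb:.0f} MB Redis")
print(f"5-year ledger ~{five_year_gb:.0f} GB")
```

Running the numbers confirms the conclusions above: Redis holds tens of megabytes, and five years of trades fit in under a terabyte.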

4. High-Level Architecture: Global Regional Cells

We move beyond a standard monolith to a Federated Regional Cell architecture. This design balances two conflicting requirements:

  1. Low Latency: Users want fast quotes (Speed of Light matters).
  2. Data Residency: Regulators (GDPR, FinCEN) require user data to stay effectively “domiciled” in their home jurisdiction.

The diagram below illustrates our Hybrid Routing Pattern:

  • Stateless Reads (Blue Path): Served from the nearest Regional Cell for maximum speed.
  • Stateful Writes (Pink Path): Proxied to the user’s Home Cell for ACID compliance and data sovereignty.
System Architecture: Global Regional Cells (Geo-Partitioned Compliance & Home-Cell Routing)

  • Global Edge Network: Traveler (in Japan) → Global DNS/GSLB → Edge WAF → API Gateway (JP) → Home-Cell Resolver (checks JWT claim: home_cell=EU) → Internal LB / Service Discovery.
  • APAC Cell (Nearest) — Quote/Read Path (Fast, Stateless): Quote Service (local execution, TTL 7s) pulls prices from LPs (Binance/CME) and caches them in local Redis (Ticker Cache, Quote:7s).
  • EU Cell (Home / Regulated) — Execution Path (ACID), inside the Compliance Boundary: ILB (EU) → Order Service (Atomic Claim, Funds Hold, ACID Tx) → Redis Cluster (Atomic CLAIM Key) → Postgres Shards (Shard 1: Users A-M, Shard 2: Users N-Z; tables: Wallets (Ledger), Orders, Holds) → Compliance & Risk (KYC Check, Sanctions, Travel Rule) → Kafka Cluster (topics: orders, ledger, audit) → S3 Glacier (Audit).
  • Traveler Behavior: 1. The user connects to the nearest endpoint (APAC). 2. Market data is served locally (fast). 3. Trades are forwarded to the Home Cell (EU) for ACID execution.

5. Detailed Component Design & Trade-offs

5.1 API Gateway (The Doorman)

The Gateway (e.g., Kong, Nginx, or AWS API Gateway) is the single entry point.

  • Authentication:
  • Uses JWT (JSON Web Tokens) for lightweight, stateless auth.
  • The Gateway verifies the signature (RSA-256) and claims (Expiration, Scopes) before passing the request downstream.
  • Rate Limiting:
  • Algorithm: Token Bucket.
  • Why Token Bucket over Leaky Bucket? Token Bucket allows for short bursts of traffic (e.g., a bot reacting to a sudden market move), which is desirable for trading. Leaky Bucket enforces a rigid rate, which might punish legitimate high-frequency users.
  • Scope: Per User ID and Per IP.
  • Config: 100 req/min for Quotes, 20 req/min for Orders.
  • Optimization:
  • SSL Termination: Decrypt HTTPS at the edge to offload CPU from microservices.
  • Keep-Alive: Maintain persistent connections to upstream services to avoid TCP Handshake overhead.
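The Token Bucket behavior is easy to sketch in a few lines: bursts pass until the bucket drains, then requests are shed while tokens refill at the steady rate. This is illustrative only — Kong and Nginx ship production implementations:

```python
import time

class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at `rate` tokens/sec."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 100 req/min for quotes => steady rate ~1.67/s, burst capacity of 10
bucket = TokenBucket(rate=100 / 60, capacity=10)
results = [bucket.allow() for _ in range(12)]  # a burst of 12 requests
```

With capacity 10, a back-to-back burst of 12 requests sees the first 10 admitted and the last 2 rejected; a Leaky Bucket would instead have smoothed (and delayed) the whole burst.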

5.2 The Quote Service (The Sprinter)

This service must be incredibly fast. It is Stateless and Read-Heavy.

  • Role:
    1. Fetch real-time prices from Liquidity Providers (LPs) via WebSocket.
    2. Apply a “Spread” (e.g., +0.5% markup for profit).
    3. Generate a cryptographically signed quote.
    4. Cache the quote in Redis with TTL=7s.
  • Optimizations:
  • Fan-out: Query 3 LPs in parallel, take the median price (to avoid outliers/manipulation).
  • Zero-Allocation: In Go/Rust, reuse memory buffers to avoid Garbage Collection pauses during high load.
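The fan-out-and-median step can be sketched with a thread pool. The LP stubs and prices below are made up; the point is that the outlier feed does not move the final quote:

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for real LP WebSocket/REST clients (illustrative values)
def fetch_binance(): return 50_010.0
def fetch_cme():     return 50_000.0
def fetch_okx():     return 51_500.0   # outlier / manipulated feed

def mid_price(fetchers, timeout_s=0.05):
    """Query all LPs in parallel and take the median to resist outliers."""
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = [pool.submit(f) for f in fetchers]
        prices = [f.result(timeout=timeout_s) for f in futures]
    return statistics.median(prices)

mid = mid_price([fetch_binance, fetch_cme, fetch_okx])
quote_price = mid * 1.005  # apply the 0.5% spread
```

Even with one feed 3% off, the median sticks close to the consensus price; a simple average would have been dragged toward the outlier.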

5.3 The Order Service (The Vault)

This service must be incredibly safe. It is Stateful and Transactional.

  • Role:
    1. Atomically claim the quote (prevent replay attacks).
    2. Check user balance (DB lock).
    3. Execute the trade atomically (Update DB).
  • Idempotency:
  • Crucial for network failures. If a client sends an order but doesn’t get a response (timeout), they will retry.
  • Mechanism: The client sends a unique idempotency_key (UUID). The server checks a unique index in the DB. INSERT INTO orders (id, ...) VALUES ... ON CONFLICT DO NOTHING.

5.3.1 Quote Replay Protection (CRITICAL)

[!CAUTION] The #1 Interview Gotcha: Without atomic claim, your system is vulnerable to double-fill attacks.

The Attack Scenario:

  1. User receives quote: quote_id=abc123, valid for 7 seconds
  2. User opens two browser tabs
  3. Tab A: Submit order with quote_id=abc123
  4. Tab B: Submit order with quote_id=abc123 (1ms later)
  5. Both requests race:
    • Request A: Check quote:abc123 exists? → ✅ YES
    • Request B: Check quote:abc123 exists? → ✅ YES (race!)
    • Request A: Debit $50,000, credit 1 BTC
    • Request B: Debit $50,000, credit 1 BTC
  6. Result: User gets 2 BTC for the price of 1 → You lose $50,000

The Analogy: Ticket Reservation. Imagine you are buying a concert ticket.

  1. Quote: You see a seat for $100.
  2. Race: You and your friend click “Buy” at the exact same millisecond.
  3. Correct System: Only ONE person gets the ticket. The other gets an error.
  4. Bad System: Both get the ticket. The venue is overbooked. You lose money.

The Solution: Atomic Check-and-Claim

We use a Redis Lua script to atomically check if a quote exists AND mark it as claimed in a single operation:

Click to view Lua Script
-- quote_claim.lua
-- This script ensures a quote can only be claimed ONCE
local quote_key = KEYS[1]       -- "quote:abc123"
local claim_key = KEYS[2]       -- "quote:abc123:claimed"
local order_id = ARGV[1]        -- "order_xyz789"
local remaining_ttl = ARGV[2]   -- Remaining seconds for quote validity

-- Atomic check: Does quote exist AND is it not claimed?
if redis.call('EXISTS', quote_key) == 1 and redis.call('EXISTS', claim_key) == 0 then
  -- Claim it for this order_id
  redis.call('SET', claim_key, order_id, 'EX', remaining_ttl)
  -- Return the quote payload
  return redis.call('GET', quote_key)
else
  -- Quote expired or already claimed
  return nil
end

Order Service Execution Flow (Python):

Click to view Python Implementation
import redis
import time
import hmac
import hashlib
import json

redis_client = redis.Redis()

# Load Lua script once on startup (cached on Redis server)
QUOTE_CLAIM_SCRIPT = redis_client.script_load(open('quote_claim.lua').read())

def execute_order(user_id, quote_payload_signed, idempotency_key):
  # Step 1: Verify HMAC signature (prevent tampering)
  quote_id = quote_payload_signed['id']
  received_hmac = quote_payload_signed['signature']
  payload = json.dumps(quote_payload_signed['data'], sort_keys=True)
  expected_hmac = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()

  if not hmac.compare_digest(received_hmac, expected_hmac):
    return {"error": "INVALID_SIGNATURE"}

  # Step 2: Check quote freshness (server-side clock with grace period)
  quote_data = quote_payload_signed['data']
  expires_at = quote_data['expires_at']  # Absolute timestamp from server
  current_time = time.time()

  # Allow 200ms grace period for network latency
  if current_time > expires_at + 0.2:
    return {"error": "QUOTE_EXPIRED"}

  # Step 3: Atomic claim using Lua script
  remaining_ttl = max(1, int((expires_at + 0.2) - current_time))  # Redis EX must be >= 1
  quote_key = f"quote:{quote_id}"
  claim_key = f"quote:{quote_id}:claimed"

  quote_payload = redis_client.evalsha(
    QUOTE_CLAIM_SCRIPT,
    2,  # Number of KEYS
    quote_key, claim_key,
    idempotency_key, remaining_ttl
  )

  if quote_payload is None:
    return {"error": "QUOTE_EXPIRED_OR_ALREADY_USED"}

  # Step 4: Database transaction (Reserve + Execute)
  with db.begin():
    # Lock user's wallet row (Pessimistic Locking / Reserve)
    wallet = db.execute(
      "SELECT balance FROM wallets WHERE user_id = %s AND currency = %s FOR UPDATE",
      (user_id, quote_data['base_currency'])
    ).fetchone()

    if wallet['balance'] < quote_data['total_cost']:
      return {"error": "INSUFFICIENT_BALANCE"}

    # Debit and credit atomically
    db.execute(
      "UPDATE wallets SET balance = balance - %s WHERE user_id = %s AND currency = %s",
      (quote_data['total_cost'], user_id, quote_data['base_currency'])
    )
    db.execute(
      "UPDATE wallets SET balance = balance + %s WHERE user_id = %s AND currency = %s",
      (quote_data['amount'], user_id, quote_data['quote_currency'])
    )

    # Insert order record (the unique index on idempotency_key prevents duplicates)
    db.execute(
      "INSERT INTO orders (order_id, idempotency_key, user_id, quote_id, pair, amount, price, status) "
      "VALUES (%s, %s, %s, %s, %s, %s, %s, 'FILLED') "
      "ON CONFLICT (idempotency_key) DO NOTHING",
      (idempotency_key, idempotency_key, user_id, quote_id, ...)
    )

  return {"status": "SUCCESS", "order_id": idempotency_key}

Why This Works:

  1. Lua Atomicity: Redis executes the entire script as a single atomic operation. No race conditions.
  2. Claim Key: Once set, the claim key prevents the quote from being used again (even if the original quote key still exists).
  3. TTL Inheritance: The claim key expires at the same time as the quote, preventing leaked state.
  4. Grace Period: The 200ms buffer accounts for network delays, ensuring valid requests at t=6.9s are not rejected.

5.4 Redis Architecture

We use Redis not just for caching, but for temporary state (Quotes).

  • Cluster Mode:
  • Data is sharded across multiple nodes (16384 slots).
  • This allows horizontal scaling of memory and throughput.
  • Eviction Policy: volatile-ttl.
  • Why? We only want to evict keys with an expiry set (quotes). If we used allkeys-lru, we might accidentally evict persistent configuration keys or session data. See Module 06: Caching.
  • Persistence:
  • RDB (Snapshot): Every 15 minutes.
  • AOF (Append Only File): Disabled or set to everysec.
  • Why? If Redis crashes, losing the last 1 second of quotes is acceptable. Users will just request a new quote. Performance > Durability for quotes.

5.5 Database Architecture (The Ledger)

We choose PostgreSQL for its robust ACID compliance.

Why PostgreSQL over NewSQL (CockroachDB/TiDB)?

  • Maturity: PostgreSQL has decades of battle-testing in financial systems. NewSQL databases are powerful but introduce complexity in deployment and debugging.
  • Complexity: For our scale (1M users), sharded Postgres is sufficient and well-understood. Distributed SQL adds network overhead (Raft consensus) to every write, increasing latency, which we want to avoid for order execution.

Schema Design

[!CAUTION] Never use FLOAT or DOUBLE for money. Floating-point arithmetic causes rounding errors due to IEEE 754 representation. Example: 0.1 + 0.2 ≠ 0.3 in binary floating point. Always use fixed-point types (NUMERIC/DECIMAL) or store integer minor units (BIGINT for cents/wei).
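A quick demonstration using Python's `decimal` module, mirroring what NUMERIC gives you in the database:

```python
from decimal import Decimal

# IEEE 754 doubles cannot represent 0.1 or 0.2 exactly
float_sum = 0.1 + 0.2
print(float_sum)                  # 0.30000000000000004
print(float_sum == 0.3)           # False

# Fixed-point decimals behave like NUMERIC(36, 18): exact arithmetic
dec_sum = Decimal("0.1") + Decimal("0.2")
print(dec_sum == Decimal("0.3"))  # True

# Alternative: store integer minor units (wei: 1 ETH = 10**18 wei)
one_eth_wei = 10**18
half_eth_wei = one_eth_wei // 2   # exact integer arithmetic
```

Note the string arguments to `Decimal`: constructing from a float (`Decimal(0.1)`) would faithfully preserve the binary rounding error.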

Click to view SQL Schema
CREATE TABLE wallets (
  user_id UUID,
  currency VARCHAR(10),
  -- NUMERIC(36, 18) supports crypto tokens with 18 decimals (e.g., ETH)
  -- Many ERC-20 tokens use 18 decimal places (1 ETH = 10^18 wei)
  balance NUMERIC(36, 18) NOT NULL DEFAULT 0,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  PRIMARY KEY (user_id, currency)
);

CREATE TABLE orders (
  order_id UUID PRIMARY KEY,
  idempotency_key UUID UNIQUE NOT NULL,  -- Prevent duplicate submissions
  user_id UUID NOT NULL,
  quote_id VARCHAR(64) NOT NULL,
  pair VARCHAR(20) NOT NULL,  -- 'BTC-USD', 'ETH-USDT'
  side VARCHAR(4) NOT NULL,   -- 'BUY', 'SELL'
  amount NUMERIC(36, 18) NOT NULL,       -- Quantity of crypto
  price NUMERIC(36, 18) NOT NULL,        -- Price per unit
  total_cost NUMERIC(36, 18) NOT NULL,   -- amount * price
  fee NUMERIC(36, 18) NOT NULL DEFAULT 0,
  status VARCHAR(20) NOT NULL,           -- 'FILLED', 'FAILED', 'CANCELLED'
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  settled_at TIMESTAMPTZ
);

-- PostgreSQL does not allow inline INDEX definitions inside CREATE TABLE
CREATE INDEX idx_user_created ON orders (user_id, created_at DESC);  -- For transaction history

Sharding Strategy (2-Level Hierarchy)

As the user base grows, we need to scale writes. We use a 2-Level Partitioning Strategy:

  1. Level 1: Jurisdiction (Regional Cell): Users are first partitioned by their regulatory home (e.g., EU Users → Frankfurt Cell). This is a hard physical boundary for compliance.
  2. Level 2: User ID Sharding (Consistent Hashing): Within the cell, we shard by user_id.
    • Sharding Key: user_id.
    • Mechanism: Hash(user_id) % 1000 → Bucket_ID → Physical_Node.
    • Why?: All data for a specific user lives on the same shard, enabling Local ACID Transactions.
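A sketch of the two-level lookup — jurisdiction first, then a stable hash into buckets that map to physical shards (cell names and the bucket-to-node map are illustrative):

```python
import hashlib

NUM_BUCKETS = 1000
# Bucket ranges -> physical nodes inside a cell (illustrative mapping)
BUCKET_TO_NODE = {
    range(0, 500): "pg-shard-1",     # buckets 0-499
    range(500, 1000): "pg-shard-2",  # buckets 500-999
}

def route(user_id: str, home_cell: str) -> tuple:
    """Level 1: the user's home_cell pins the jurisdiction.
    Level 2: a stable hash of user_id picks the shard inside that cell."""
    # Use sha256, not Python's built-in hash(): the built-in is salted per process
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    for bucket_range, node in BUCKET_TO_NODE.items():
        if bucket in bucket_range:
            return (home_cell, node)
    raise ValueError("unmapped bucket")

cell, shard = route("u_alice", home_cell="EU")
```

Because every row for a user hashes to one shard, a trade touches a single Postgres instance and stays a local ACID transaction.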

5.6 Asynchronous History & Auditing

While the order execution is synchronous (ACID), reporting is asynchronous.

  1. Change Data Capture (CDC): We use the Outbox Pattern or simple application-level publishing.
  2. Kafka Topic: After the DB commit succeeds, the Order Service publishes an event to the orders_events topic.
      { "event": "ORDER_FILLED", "order_id": "...", "price": 50250, "ts": 169876... }
    
  3. Audit & Compliance Services: Consume these events for Post-Trade Surveillance:
    • Market Abuse Detection: Flagging “Wash Trading” or “Layering”.
    • AML Velocity Checks: Identifying “Structuring” (e.g., user deposits 9,999 twice to avoid the 10k reporting threshold).
    • Chainalysis Integration: If the user withdraws to a wallet linked to a Darknet Market, flag the account immediately.
    • Regulatory Archival: Write all events to WORM (Write Once Read Many) S3 Glacier storage for 7+ years (SEC 17a-4 compliance).

5.7 Authentication & Authorization

[!IMPORTANT] Problem Statement Requirement: “Users should be able to log in to the system and access only the features and data they are authorized to see.”

5.7.1 Authentication Flow

User Registration → Email Verification → KYC (Know Your Customer) → Account Activated

Click to view Auth Logic
# Simplified Auth Flow
import time
import jwt                       # PyJWT
import pyotp                     # TOTP (RFC 6238) verification
from passlib.hash import bcrypt  # provides bcrypt.verify(password, hash)

def login(email, password, totp_code):
  # Step 1: Verify password (bcrypt with cost factor 12)
  user = db.query("SELECT * FROM users WHERE email = ?", email)
  if not bcrypt.verify(password, user.password_hash):
    return {"error": "INVALID_CREDENTIALS"}

  # Step 2: Verify 2FA (Time-based One-Time Password)
  secret = user.totp_secret
  if not pyotp.TOTP(secret).verify(totp_code):
    return {"error": "INVALID_2FA"}

  # Step 3: Generate JWT with user role/tier embedded
  jwt_payload = {
    "user_id": user.id,
    "home_cell": user.jurisdiction,  # 'EU', 'US', 'SG' - Crucial for routing!
    "role": user.role,          # 'retail', 'vip', 'institutional'
    "tier": user.kyc_tier,       # 'bronze', 'silver', 'gold'
    "exp": time.time() + 900     # 15 minutes
  }
  access_token = jwt.encode(jwt_payload, JWT_SECRET, algorithm="HS256")
  refresh_token = generate_refresh_token(user.id)  # 7-day expiry

  # Refresh rotation: a new refresh token is issued on each use

  return {"access_token": access_token, "refresh_token": refresh_token}

5.7.2 Role-Based Access Control (RBAC)

The API Gateway enforces trading limits based on user tier:

| Tier | Max Trade Size | Daily Limit | Rate Limit (Quotes/sec) |
| --- | --- | --- | --- |
| Bronze (Unverified) | 1,000 | 5,000 | 5 |
| Silver (KYC Verified) | 10,000 | 50,000 | 10 |
| Gold (Enhanced DD) | 100,000 | 500,000 | 20 |
| Institutional | Custom | Custom | Custom |

API Gateway Middleware (Kong/Envoy):

Click to view Kong Middleware
-- kong_auth.lua
function check_trade_limit(jwt_payload, trade_amount)
  local tier = jwt_payload.tier
  local limits = {
    bronze = 1000,
    silver = 10000,
    gold = 100000
  }

  local limit = limits[tier]
  -- Institutional tiers have custom limits configured elsewhere;
  -- only enforce the static table when the tier appears in it
  if limit ~= nil and trade_amount > limit then
    return 403, "TRADE_LIMIT_EXCEEDED"
  end

  return 200, "OK"
end

5.7.3 Security Best Practices

  1. Password Storage: bcrypt with cost factor 12 (2^12 iterations)
  2. 2FA: TOTP (Google Authenticator) mandatory for withdrawals
  3. Session Management:
    • Access Token: 15 minutes (short-lived)
    • Refresh Token: 7 days (stored in HttpOnly cookie)
    • Refresh rotation: New refresh token issued on each use
  4. Rate Limiting: Per-user counter in Redis (a fixed 1-second window; a sliding window is a refinement)
Click to view Rate Limiter
   key = f"rate_limit:{user_id}:quote"
   count = redis.incr(key)
   if count == 1:
     redis.expire(key, 1)  # start the 1-second window on the first hit only
   if count > 10:  # Max 10 quotes/sec
     return "RATE_LIMIT_EXCEEDED"

5.8 Multi-Region Deployment: The “Regional Cell” Model

[!IMPORTANT] Key Principle: Compliance and Latency dictate a 2-Level Partitioning Strategy.

  1. Jurisdiction (Cell): Data Residency (GDPR/FinCEN).
  2. Shard (User ID): Horizontal Scalability.

The Analogy: Embassies. Think of Regional Cells like Embassies.

  • EU Cell (Embassy): A piece of European soil in the cloud. All EU citizen data lives here.
  • Traveling User: When an EU user goes to Japan, they can visit the Japanese “Consulate” (Edge Node) for help (Read Data), but for official business like renewing a passport (Trading/Writing), their request is securely mailed back to the HQ in Europe.

5.8.1 Regional Cells (The Outer Layer)

Instead of a monolithic global database, we deploy isolated Regional Cells. Each cell is a self-contained stamp of the architecture (QuoteSvc, OrderSvc, Postgres, Redis, Kafka) located in a specific jurisdiction. Requirements:

  1. Low Latency: Users in Asia shouldn’t hit US servers (adds 200ms+ RTT)
  2. Data Residency: EU users’ data must stay in EU (GDPR compliance)
  3. Disaster Recovery: If US-East fails, failover to US-West < 1 minute
  • EU Cell (eu-central-1): Serves European residents. Data never leaves.
  • US Cell (us-east-1): Serves US residents. Complies with FinCEN/SEC.
  • APAC Cell (ap-southeast-1): Serves Asia-Pacific.

Data Residency Rules:

  • Global Data: Public Market Data, Coin Metadata (stored everywhere).
  • Resident Data: PII (KYC), Wallet Balances, Trade History, Audit Logs (pinned to Home Cell).

5.8.2 Home Cell Routing (The Traveler Problem)

A user’s “Home Cell” is determined at signup based on their country of residence. What happens when an EU user travels to Japan?

Pattern: Nearest Ingress → Home Cell Forwarding

  1. DNS/GSLB: Routes user to the nearest edge (Japan). Handshake is fast.
  2. Ticker/Market Data: Served locally from Japan (Stateless/Public). Fast.
  3. Auth & Trade:
    • API Gateway in Japan decodes the JWT.
    • JWT Claim: {"user_id": "u123", "home_cell": "EU"}.
    • Gateway proxies the request to the EU Ingress.
    • EU Cell processes the Order (ACID execution).

Result: The user gets fast market data, but trade execution pays the latency penalty (Japan → EU) to guarantee Compliance (Data never rests in Japan) and Consistency (No distributed transactions).
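The resolver at the gateway reduces to a few lines: reads stay local, writes follow the JWT's home_cell claim. Endpoint URLs and route names below are illustrative:

```python
LOCAL_CELL = "APAC"  # the cell this gateway instance runs in
CELL_ENDPOINTS = {
    "EU": "https://eu.internal",
    "US": "https://us.internal",
    "APAC": "https://apac.internal",
}
STATELESS_READS = {"GET /prices", "POST /quotes"}  # safe to serve from any cell

def resolve_upstream(method: str, path: str, jwt_claims: dict) -> str:
    """Stateless reads stay local; stateful writes go to the user's Home Cell."""
    if f"{method} {path}" in STATELESS_READS:
        return CELL_ENDPOINTS[LOCAL_CELL]   # fast local path
    home = jwt_claims["home_cell"]          # e.g. {"home_cell": "EU"}
    return CELL_ENDPOINTS[home]             # ACID path; may cross regions

# EU traveler in Japan: quotes are served locally, orders go home
quote_upstream = resolve_upstream("POST", "/quotes", {"home_cell": "EU"})
order_upstream = resolve_upstream("POST", "/orders", {"home_cell": "EU"})
```

Only the order path pays the cross-region round trip; everything latency-sensitive and stateless never leaves the local cell.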

5.8.3 Cross-Region Failure Scenarios

What happens if the Private Backbone between Japan and EU is severed?

  1. Reads (Market Data): ✅ Unaffected. The Japan Edge continues to serve cached prices from local LPs.
  2. Writes (Trading): ❌ Failed. The Gateway cannot forward the POST /order to the Home Cell.
    • Behavior: API returns 503 Service Unavailable.
    • Why not buffer? We specifically avoid buffering orders in the Edge because it creates “hidden risk”. If the link comes back 10 minutes later, the market price will have moved, and executing old orders would be disastrous. Fast failure is a feature here.

5.9 Internal Load Balancer (ILB)

While the API Gateway handles external traffic (Authentication, Rate Limiting), we use an Internal Load Balancer (ILB) between the Gateway and our microservices (Quote, Order, Auth).

  • Role: Traffic Distribution & Service Discovery.
  • Why?:
  • Decoupling: The Gateway addresses the Service (e.g., http://quote-svc), not individual IPs. The ILB resolves this to healthy pod IPs.
  • Health Checks: The ILB automatically removes unhealthy instances from rotation.
  • Protocol Support: Handles gRPC (HTTP/2) load balancing for internal communication if needed.
  • Placement:
  • Region-Specific: Each Regional Cell has its own ILB to keep traffic local.
  • Layer 4 vs Layer 7: Typically Layer 7 (HTTP) for smart routing, or Layer 4 (TCP) for raw speed.

6. Deep Dive: Reliability & Consistency

6.1 The “Double Spend” Attack

Scenario: Malicious User “Eve” has $100. She sends two requests simultaneously:

  1. Buy $100 of BTC.
  2. Buy $100 of ETH.

If processed in parallel without locking, both requests might pass the balance >= 100 check (True) and succeed. Eve spends $200 but only had $100.

The Analogy: The Shared Bank Account. Imagine a husband and wife share a bank account with $100.

  1. Husband goes to ATM A: “Withdraw $100”.
  2. Wife goes to ATM B: “Withdraw $100”.
  3. If the bank doesn’t “lock” the account, both ATMs check balance (100), dispense cash (200 total), and the bank loses $100.

Solution: Pessimistic Locking. We use SELECT ... FOR UPDATE to lock the wallet row.

Click to view Transaction Logic
BEGIN;
  -- This line blocks other transactions trying to read Alice's USD wallet
  SELECT balance FROM wallets WHERE user_id = 'Alice' AND currency = 'USD' FOR UPDATE;

  -- Plain SQL has no IF/ELSE; the application checks the balance it just read:
  -- if balance >= 100:
  UPDATE wallets SET balance = balance - 100 ...;
  INSERT INTO orders ...;
  COMMIT;
  -- else:
  --   ROLLBACK;
  • Trade-off: Locking reduces concurrency for a single user, but prevents fraud. Since we shard by user, this doesn’t block other users.
  • Why not Optimistic Locking? Optimistic locking (using version columns) works well for low-contention. In high-frequency trading, retrying failed transactions due to version conflicts adds latency and complexity to the client. Pessimistic locking guarantees the order executes (or fails) immediately.

6.2 The 7-Second Race Condition

Scenario:

  • T=0.0s: Quote Generated (Valid until T=7.0s).
  • T=6.9s: User clicks “Buy”.
  • T=7.1s: Request reaches Server.
  • Strict Logic: Reject (Expired).
  • User Experience: “I clicked in time! Your system sucks.”

Solution: The Grace Period. We add a server-side buffer (e.g., 500ms). The validator checks: if (CurrentTime < QuoteExpiry + GracePeriod)

  • Risk: The market crashes in that 500ms.
  • Mitigation: The “Spread” (profit margin) we added to the quote covers these small slippage risks.
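The validator itself is tiny; the subtlety is comparing an absolute expiry against the server clock (the 500ms buffer comes from the text above):

```python
GRACE_PERIOD_S = 0.5  # server-side buffer covering client->server latency

def is_quote_valid(expires_at: float, now: float) -> bool:
    """Accept slightly-late clicks; the spread absorbs the slippage risk."""
    return now <= expires_at + GRACE_PERIOD_S

# Quote issued at t=0 with a 7-second TTL
print(is_quote_valid(expires_at=7.0, now=7.1))  # True: inside the grace window
print(is_quote_valid(expires_at=7.0, now=7.6))  # False: past expiry + grace
```

The user who clicked at T=6.9s and arrived at T=7.1s is accepted; a genuinely stale request is still rejected.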

Interactive Simulator

Test your latency luck: an embedded RFQ simulator presents a BTC/USD quote ($54,320.50) with a 7.00s countdown. Can you beat the network lag and click before the quote expires?

7. System Walkthrough: The Life of a Trade

To solidify our understanding, let’s trace a single transaction through the entire stack, examining the exact API payloads, Redis keys, and Database queries.

Scenario A: The Happy Path (Quote → Buy → Success)

Step 1: User requests a quote for 1 BTC.

Click to view Quote Request/Response
  • Request: POST /quotes
      {
    "pair": "BTC-USD",
    "side": "BUY",
    "amount": 1.00000000
      }
    
  • Quote Service Action:
    1. Fetches price from LP: $50,000.
    2. Adds spread (+0.5%): $50,250.
    3. Generates quote_id: q_123.
    4. Signs the payload: HMAC_SHA256(price + expiry, secret).
  • Redis State:
      SET quote:q_123 "{\"price\": 50250, \"expiry\": 1698765432}" EX 7
    
  • Response: 200 OK
      {
    "quote_id": "q_123",
    "price": 50250.00,
    "expiry": 1698765432,
    "signature": "a1b2c3d4..."
      }
    
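Step 1's signing step can be sketched as follows: the service HMACs a canonical serialization of the quote so the client cannot tamper with price or expiry before sending it back. The secret and payload shape here are illustrative:

```python
import hashlib
import hmac
import json
import time

SERVER_SECRET = b"example-secret"  # illustrative; load from a KMS in production

def sign_quote(quote_id: str, price: float, ttl_s: int = 7) -> dict:
    quote = {"quote_id": quote_id, "price": price,
             "expiry": int(time.time()) + ttl_s}
    # Canonical JSON (sorted keys) so signer and verifier hash identical bytes
    payload = json.dumps(quote, sort_keys=True).encode()
    quote["signature"] = hmac.new(SERVER_SECRET, payload, hashlib.sha256).hexdigest()
    return quote

def verify_quote(quote: dict) -> bool:
    data = {k: v for k, v in quote.items() if k != "signature"}
    payload = json.dumps(data, sort_keys=True).encode()
    expected = hmac.new(SERVER_SECRET, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison defends against timing attacks
    return hmac.compare_digest(expected, quote["signature"])

q = sign_quote("q_123", 50250.00)
tampered = dict(q, price=1.00)  # attacker edits the price client-side
```

Any edit to the payload invalidates the signature, so the Order Service can trust the price and expiry it receives back from the client.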

Step 2: User accepts the quote (within 7 seconds).

Click to view Order Execution Flow
  • Request: POST /orders
      {
    "quote_id": "q_123",
    "order_id": "o_999",
    "user_id": "u_alice"
      }
    

    (Note: order_id is a client-generated UUID for Idempotency)

  • Order Service Action:
    1. Validate Quote: GET quote:q_123. (If exists, proceed).
    2. Validate Signature: Recompute HMAC. (Prevents tampering with price).
    3. DB Transaction:
        BEGIN;
        -- 1. Lock Wallet
        SELECT balance FROM wallets WHERE user_id = 'u_alice' AND currency = 'USD' FOR UPDATE;
        -- (Result: 100,000.00)

        -- 2. Update Balance
        UPDATE wallets SET balance = balance - 50250.00 WHERE user_id = 'u_alice';

        -- 3. Insert Order
        INSERT INTO orders (order_id, user_id, quote_id, status)
        VALUES ('o_999', 'u_alice', 'q_123', 'FILLED');

        COMMIT;

  • Response: 200 OK
      {
    "status": "FILLED",
    "tx_id": "tx_555"
      }
    

Scenario B: The “Double Spend” Attempt

Imagine “Eve” has $100. She sends two requests simultaneously:

  1. Req A: Buy $100 BTC.
  2. Req B: Buy $100 ETH.

Timeline:

  • T=0.00s: Req A reaches DB. Starts Transaction.
Click to view Race Condition Logic
      [Tx A] SELECT balance ... FOR UPDATE; -- Locks Row

  • T=0.01s: Req B reaches DB. Starts Transaction.
      [Tx B] SELECT balance ... FOR UPDATE; -- BLOCKED! Waits for Tx A.

  • T=0.05s: Req A updates balance (100 → 0) and COMMIT.
    • Row Lock released.
  • T=0.06s: Req B unblocks and reads the new balance.
    • balance is now $0.
    • Logic: IF balance < 100 THEN ROLLBACK.
  • Result: Req A succeeds (200 OK). Req B fails (402 Payment Required).

Scenario C: The “Expired Quote” Race Condition

Step 1: User attempts to trade at T=7.1s (0.1s too late).

  • Request: POST /orders { quote_id: "q_123" }
  • Order Service Action:
    1. Check Redis: GET quote:q_123
    2. Result: (nil) (Key evicted by Redis).
    3. Fallback Check: Even if the key existed (e.g., due to lag), check payload.expiry < Now().
  • Response: 400 Bad Request
Click to view Error Response
      {
        "error": "QUOTE_EXPIRED",
        "message": "Quote q_123 is no longer valid. Please request a new price."
      }

8. Alternative Solutions (Trade-offs)

8.1 RFQ vs. CLOB (Central Limit Order Book)

  • CLOB (e.g., Nasdaq, Binance):
  • Mechanism: Continuous matching of limit orders.
  • Pros: Transparent pricing, high liquidity discovery.
  • Cons: Extremely complex to engineer (matching engine), high computational cost.
  • RFQ (Our System):
  • Mechanism: Guaranteed price on demand.
  • Pros: Simpler architecture, better UX for large trades (no slippage).
  • Cons: Platform takes market risk.

8.2 SQL vs. NoSQL

  • NoSQL (DynamoDB):
  • Pros: Infinite scaling.
  • Cons: Lack of multi-row ACID transactions. Implementing a ledger in NoSQL requires complex application-level locking (e.g., optimistic locking with version numbers), which is error-prone for financial data.
  • SQL (PostgreSQL):
  • Pros: Native ACID, referential integrity.
  • Cons: Harder to scale writes.
  • Decision: SQL wins because financial correctness > raw write speed. Sharding solves the scale issue.

8.3 Event Sourcing

  • Concept: Store every transaction as an immutable event (Deposited, Bought, Sold). Calculate balance by replaying events.
  • Pros: Perfect audit trail, easy debugging.
  • Cons: Replaying millions of events to get a balance is slow. Requires “Snapshots”.
  • Our Choice: Hybrid. We use a standard SQL table for current balance (fast) but log every change to an audit_logs table (immutable).

9. Low-Level Optimizations (The “Boom” Factor)

To squeeze every millisecond out of the system:

  1. Kernel Tuning:
    • Increase TCP Buffer sizes (net.ipv4.tcp_rmem, net.ipv4.tcp_wmem) to handle high-throughput bursts.
    • Enable TCP Fast Open (TFO) to reduce handshake latency by 1 RTT.
  2. Connection Pooling:
    • Database connections are expensive. Use PgBouncer to maintain a pool of warm connections, reducing overhead.
  3. Garbage Collection (GC):
    • For the Quote Service (Golang), tune GOGC to trade memory for CPU.
    • For the Order Service (Java), use ZGC or Shenandoah for sub-millisecond pause times.
  4. Network:
    • Place Quote Services in the same Availability Zone (AZ) as the Liquidity Providers if possible (AWS us-east-1).
    • Kernel Bypass (DPDK) is likely overkill for 230 QPS, but is worth mentioning for HFT systems that require microsecond latency.

10. Requirements Traceability Matrix

| Requirement | Architectural Solution |
| --- | --- |
| Get Quote (7s) | Quote Service + Redis (TTL 7s) + WebSocket to LPs. |
| Place Order | Order Service with HMAC validation + Idempotency keys. |
| Balance Check | PostgreSQL with `SELECT FOR UPDATE` (Pessimistic Locking). |
| Reliability (99.999%) | Active-Passive DB Failover + Stateless Services + Kubernetes Auto-healing. |
| Latency (<50ms) | In-memory processing (Redis) + Connection Pooling + Geolocation. |
| Consistency | Database Sharding by `user_id` allows local ACID transactions. |
| Scalability | Horizontal scaling of services + DB Sharding + Redis Cluster. |
| Security | API Gateway (JWT, Rate Limit) + Private Subnets + mTLS. |
| Compliance | Async KYC pipeline + Audit Logs (Event Sourcing lite). |

11. Observability & Tracing

You cannot fix what you cannot see. For a system moving millions of dollars, we need total visibility.


11.1 The RED Method (Metrics)

We instrument every service to emit these three golden signals:

  1. Rate: Request counts per second.
    • Metric: http_requests_total{service="quote_svc", status="200"}
    • Use: Detect traffic spikes or DDoS.
  2. Errors: Failed requests.
    • Metric: order_failed_total{reason="insufficient_funds"}
    • Use: Alert if order failures exceed 1% of total traffic.
  3. Duration: Latency distributions.
    • Metric: quote_generation_seconds_bucket (Histogram)
    • Use: Alert if P99 latency > 100ms.
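How a `_bucket` histogram accumulates observations can be sketched in plain Go; a real service would use a Prometheus client library, and these bucket bounds are assumptions:

```go
package main

import "fmt"

// Histogram mimics a Prometheus-style cumulative-bucket histogram.
type Histogram struct {
	Bounds []float64 // upper bounds in seconds, ascending
	Counts []int     // Counts[i] = observations <= Bounds[i]
	Inf    int       // observations above the last bound (+Inf bucket)
}

func NewHistogram(bounds []float64) *Histogram {
	return &Histogram{Bounds: bounds, Counts: make([]int, len(bounds))}
}

// Observe records one latency sample into every bucket it falls under
// (Prometheus buckets are cumulative, which is what makes P99 queries cheap).
func (h *Histogram) Observe(v float64) {
	for i, b := range h.Bounds {
		if v <= b {
			h.Counts[i]++
		}
	}
	if v > h.Bounds[len(h.Bounds)-1] {
		h.Inf++
	}
}

func main() {
	h := NewHistogram([]float64{0.01, 0.05, 0.1}) // 10ms, 50ms, 100ms
	for _, v := range []float64{0.004, 0.02, 0.2} {
		h.Observe(v)
	}
	fmt.Println(h.Counts, h.Inf) // [1 2 2] 1
}
```

The "P99 > 100ms" alert is then just a query over the cumulative counts, with no need to store raw samples.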

11.2 Distributed Tracing

A single order touches 4 systems: Gateway → Order Service → Redis → DB. If an order is slow, Distributed Tracing tells us exactly where.

  • Trace ID: Generated at the Gateway (e.g., x-trace-id: 12345). Passed via HTTP headers to every downstream service.
  • Spans: Each service logs a “Span” with start/end timestamps.
  • Visualization (e.g., Jaeger/Zipkin):
```
[Gateway]     |-------------------------------------------|  205ms
  [Order SVC]       |-----------------------------|          180ms
    [Redis]         |----|                                    10ms
    [DB Lock]             |-----------|                      150ms  (Bottleneck!)
```

In this example, the DB Lock took 150ms, indicating database contention.
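Header propagation can be sketched with `net/http/httptest` standing in for a downstream service; the header name follows the `x-trace-id` convention above:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

const traceHeader = "X-Trace-Id"

// downstream echoes the trace ID it received, standing in for the
// Order Service reading the header injected at the Gateway.
func downstream() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, r.Header.Get(traceHeader))
		}))
}

// callWithTrace forwards the trace ID on the outgoing request, as the
// Gateway does for every downstream hop.
func callWithTrace(url, traceID string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set(traceHeader, traceID)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	srv := downstream()
	defer srv.Close()
	got, _ := callWithTrace(srv.URL, "12345")
	fmt.Println(got) // prints "12345": the downstream span joins the same trace
}
```

In production this plumbing is handled by an OpenTelemetry or Zipkin middleware rather than hand-written, but the mechanism is the same: one header, propagated on every hop.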

11.3 Structured Logging

Forget plain text logs. Use JSON for machine-readability (ELK Stack).

```json
{
  "level": "INFO",
  "timestamp": "2023-10-27T10:00:00Z",
  "service": "order-service",
  "trace_id": "a1b2c3d4",
  "user_id": "u_999",
  "event": "order_placed",
  "amount": 100.00,
  "currency": "USD"
}
```
  • Audit Logs: Separate, immutable logs for compliance. Every balance change must be recorded here and archived to WORM (Write Once Read Many) storage (e.g., S3 Object Lock).

11.4 Alerting Strategy

  • P1 (Critical - Wake up on-call):
  • Order Success Rate < 99.5%.
  • Database connection pool saturation > 90%.
  • Redis Cluster state “FAIL”.
  • P2 (Warning - Ticket for tomorrow):
  • Latency P99 > 150ms (SLA breach warning).
  • Disk usage > 80%.

12. Deployment & Operations

12.1 Deployment Strategy: Blue/Green

For a financial system, we cannot risk a “bad deploy” corrupting the database.

  1. Blue (Active): Serving 100% traffic.
  2. Green (Staging): Deploy new version. Run integration tests.
  3. Switch: Update the Load Balancer to route 1% traffic to Green (Canary).
  4. Monitor: Watch for HTTP 500 or latency spikes.
  5. Rollout: If safe, route 100% to Green.
  6. Rollback: If 1% fails, instantly revert LB to Blue. Users see errors for only 5 seconds.

12.2 Database Schema Evolution

  • Problem: Adding a column locks the table.
  • Solution: Expand-Contract Pattern.
    1. Expand: Add nullable column new_col (Zero downtime).
    2. Code: Update app to write to both old_col and new_col.
    3. Backfill: Run a background job to copy data from old_col to new_col.
    4. Contract: Update code to read only from new_col. Drop old_col.

13. Follow-Up Questions: The Interview Gauntlet

This section covers rapid-fire questions and answers to test the depth of your design.

I. Database & Data Consistency (The Core)

  • Why PostgreSQL over NewSQL? Sharded Postgres is more mature and sufficient for 1M users. Distributed SQL (CockroachDB) adds consensus latency to writes.
  • Handling Hot Shards: If a “Whale” hits 10k TPS, we use Virtual Buckets to migrate that user to a dedicated physical node.
  • Isolation Levels: We use READ COMMITTED for performance. SERIALIZABLE prevents race conditions but causes too many transaction aborts/retries in high-concurrency environments.
  • Lock Contention: If SELECT ... FOR UPDATE hangs, connection pools exhaust. We set NOWAIT or short timeouts (e.g., 2s) to fail fast.
  • Replication Lag: Users reading from replicas might see old balances. We implement “Sticky Sessions” or force reads from Primary for critical wallet views.
  • Schema Migrations: Adding a nullable column is cheap in modern Postgres; for heavier changes, combine the Expand-Contract pattern with online tooling (e.g., pg_repack for lock-free table rebuilds).
  • Archival Strategy: Move data > 1 year old to S3 (Parquet format) and delete from Postgres to keep indices small.
  • Double Booking: Without row locking, two parallel transactions read the same balance, subtract funds, and overwrite each other.
  • Database Failover: Postgres Automatic Failover (PAF) takes ~30s. Writes fail during this window; users see errors.
  • Data Corruption: External reconciliation (Nightly Jobs) sums all wallet balances vs. total deposits to detect drift.
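The double-booking bullet can be demonstrated in-process: a mutex plays the role of the row lock taken by `SELECT ... FOR UPDATE`. This is a sketch of the invariant, not the real DB path.

```go
package main

import (
	"fmt"
	"sync"
)

// Wallet guards the balance with a mutex — the in-process analogue of
// SELECT ... FOR UPDATE locking the balance row.
type Wallet struct {
	mu      sync.Mutex
	balance int64 // micro-units
}

// Debit atomically checks and subtracts. Without the lock, two parallel
// debits could both read the same balance and overwrite each other.
func (w *Wallet) Debit(amount int64) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.balance < amount {
		return false // insufficient funds path
	}
	w.balance -= amount
	return true
}

func main() {
	w := &Wallet{balance: 100}
	var ok1, ok2 bool
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); ok1 = w.Debit(100) }()
	go func() { defer wg.Done(); ok2 = w.Debit(100) }()
	wg.Wait()
	// Exactly one debit wins; the balance never goes negative.
	fmt.Println(ok1 != ok2, w.balance) // true 0
}
```

Delete the lock/unlock pair and the race becomes possible: both goroutines read 100, both succeed, and the wallet is over-drawn — the exact failure the pessimistic row lock prevents.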

II. Scalability & Performance

  • Redis Eviction: volatile-ttl ensures we only drop expired quotes, never persistent configs.
  • Connection Pooling: Use PgBouncer sidecar. If scaling exceeds DB limits, we must shard further.
  • Load Balancing: Least Outstanding Requests handles varying service times better than Round Robin.
  • Traveling Users: We pin users to a Home Cell based on residency. A user traveling to Japan connects to the Japan Edge for speed, but the Gateway forwards trade requests to their EU Home Cell for compliance.
  • Cross-Region Latency: We accept the latency penalty (e.g., 200ms) for traveling users to guarantee Data Residency. Market data remains fast (local).
  • CDN Caching: We generally cannot cache prices as they change every second. WebSocket is preferred.
  • Write-Heavy Spikes: During crashes, we implement Queue-based Load Leveling (Kafka) to smooth out DB writes.
  • Serialization: JSON is fine for this scale. Protobuf saves bandwidth but adds debugging complexity.
  • Autoscaling Triggers: Scale on CPU Usage (>70%) and Request Queue Depth.
  • Cache Penetration: Use Bloom Filters to block requests for non-existent symbols (“FAKE-COIN”).
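A minimal Bloom filter sketch for the cache-penetration case. The sizing (1024 bits, 3 hashes) is illustrative, and the salted-FNV hashing is a stand-in for proper double hashing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a tiny Bloom filter: no false negatives, tunable false
// positives — enough to reject "FAKE-COIN" before it reaches the DB.
type Bloom struct {
	bits []bool
	k    int // number of hash functions
}

func NewBloom(m, k int) *Bloom {
	return &Bloom{bits: make([]bool, m), k: k}
}

func (b *Bloom) idx(s string, i int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, s) // derive k hashes by salting with i
	return int(h.Sum64() % uint64(len(b.bits)))
}

func (b *Bloom) Add(s string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.idx(s, i)] = true
	}
}

// MightContain returns false only if s was definitely never added.
func (b *Bloom) MightContain(s string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.idx(s, i)] {
			return false
		}
	}
	return true
}

func main() {
	symbols := NewBloom(1024, 3)
	for _, s := range []string{"BTC-USD", "ETH-USD", "SOL-USD"} {
		symbols.Add(s)
	}
	fmt.Println(symbols.MightContain("BTC-USD")) // true
}
```

The asymmetry is the point: a "no" is definitive (skip the DB), a "yes" occasionally lets a junk symbol through, which the DB then rejects anyway.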

III. Reliability & Fault Tolerance

  • Redis Persistence: If Redis dies, quotes are lost. This is acceptable; users just request a new quote.
  • Circuit Breaker: Threshold based on Error Rate (e.g., >50% failures in 10s).
  • Bulkhead Pattern: Isolate thread pools for “Notifications” vs “Orders” so one slow dependency doesn’t crash the app.
  • Retry Storms: Add Exponential Backoff and Jitter to client retries.
  • Idempotency Storage: If Redis evicts keys, we fallback to a check in the persistent DB (slower but safer).
  • Graceful Degradation: If History Service fails, the “Trade” button still works.
  • Clock Skew: We use NTP on servers. Tolerance is built into the 7s expiry window.
  • Zonal Failures: Deployment across 3 AZs ensures only ~33% capacity loss, which autoscaling covers.

IV. Architecture & Microservices

  • Saga Pattern: We don’t use Sagas for the core trade (too slow). We use local ACID via sharding.
  • Service Discovery: Kubernetes (CoreDNS) handles service IP resolution.
  • Gateway vs Mesh: Gateway handles Edge concerns (Auth, Rate Limit); Mesh handles inter-service concerns (mTLS, Retries).
  • Configuration: Use a dynamic config server (e.g., Consul/Etcd) with watchers to update expiry_seconds hot.
  • Data Ownership: Order Service cannot access Wallet Table directly. Must call Wallet Service API to decouple schemas.
  • Event Ordering: Kafka Partition Key = user_id ensures events for one user are sequential.
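The event-ordering bullet in code. Note Kafka's default partitioner actually uses murmur2; FNV here is only a stand-in for "any stable hash":

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a user_id to a partition. Same key => same partition
// => strict ordering of that user's events within the partition.
func partitionFor(userID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	p1 := partitionFor("u_999", 12)
	p2 := partitionFor("u_999", 12)
	fmt.Println(p1 == p2) // true: deterministic routing
}
```

The corollary is the trade-off: ordering holds only per user, and a single hyperactive user cannot be spread across partitions without giving that ordering up.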

V. Security & Compliance

  • Insider Trading: Admin actions require Multi-Party Approval and are logged to immutable audit trails.
  • API Key Security: Scoped keys (Read-Only vs Trade). Automated rotation.
  • DDoS Protection: Rate limiting at the Edge (Cloudflare) + Gateway (Token Bucket).
  • Audit Immutability: Write logs to S3 with Object Lock (Governance Mode).
  • PII Data: “Crypto-shredding”: Delete the encryption key for a user’s data to effectively “erase” it without modifying immutable logs.
  • Internal Auth: mTLS (Mutual TLS) ensures only authorized services can talk to the Wallet Service.
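How the quote `sig` from the requirements might be produced and checked, sketched with HMAC-SHA256; the field layout and shared secret are illustrative assumptions:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signQuote produces the `sig` field: HMAC-SHA256 over the canonical
// quote fields, so the Order Service can verify the price/expiry were
// not tampered with. Key distribution/rotation is out of scope here.
func signQuote(secret []byte, quoteID, price string, expiresAt int64) string {
	mac := hmac.New(sha256.New, secret)
	fmt.Fprintf(mac, "%s|%s|%d", quoteID, price, expiresAt)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyQuote recomputes the MAC and compares in constant time to
// defeat timing attacks.
func verifyQuote(secret []byte, quoteID, price string, expiresAt int64, sig string) bool {
	want, err := hex.DecodeString(sig)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	fmt.Fprintf(mac, "%s|%s|%d", quoteID, price, expiresAt)
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	secret := []byte("shared-secret") // illustrative only — use a KMS in production
	sig := signQuote(secret, "q_123", "50000.00", 1_700_000_007)
	fmt.Println(verifyQuote(secret, "q_123", "50000.00", 1_700_000_007, sig)) // true
	fmt.Println(verifyQuote(secret, "q_123", "49000.00", 1_700_000_007, sig)) // false: price tampered
}
```

This is what makes the quote self-validating: the Order Service does not need to trust the client's copy of the price, only the signature over it.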

VI. Operations & Observability

  • Metric Cardinality: Do not tag metrics with user_id. Use logs for high-cardinality debugging.
  • Distributed Tracing: Inject x-trace-id at the Gateway and propagate it everywhere.
  • Deployment: Canary Deployment. Roll out v2 to 1% of users, monitor error rates, then expand.
  • Chaos Engineering: Randomly kill pods (Chaos Monkey) during staging to test recovery.
  • Alert Fatigue: Group related alerts. Use “Symptoms” (User can’t trade) rather than “Causes” (CPU high) for paging.
  • Capacity Planning: Linear regression on past 6 months of data to forecast storage/compute needs.

VII. Business Logic & Edge Cases

  • Negative Balance: Should be impossible with ACID. If it happens, freeze account and trigger manual investigation.
  • Partial Fills: Requires DB schema change (filled_amount vs requested_amount).
  • Market Halted: A global “Kill Switch” in Redis that the Order Service checks before every trade.
  • Rounding Errors: Always use Integers (Micros/Satoshis) or BigDecimal. Never float or double.
  • Settlement Failure: Platform takes the risk. If LP fails, we still owe the user the crypto.
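The rounding bullet, demonstrated: runtime float arithmetic drifts, integer micro-units do not.

```go
package main

import "fmt"

// microsPerUnit: 1 currency unit = 1,000,000 micro-units.
const microsPerUnit = 1_000_000

func main() {
	x, y := 0.1, 0.2        // float64 variables => runtime binary arithmetic
	fmt.Println(x+y == 0.3) // false — classic binary-float drift

	a := int64(0.1 * microsPerUnit) // 100_000 micros, exact
	b := int64(0.2 * microsPerUnit) // 200_000 micros, exact
	fmt.Println(a+b == 300_000)     // true — integer math never drifts
}
```

A drift of one unit in the last place is invisible in a game score and a reconciliation incident in a ledger, which is why the rule is absolute: integers (or decimal types), never binary floats.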

VIII. Advanced Architecture (The 99.999% Club)

  • Event Sourcing vs. CRUD: Event sourcing is better for auditability but harder to query. We use a hybrid approach (CRUD for current state, Events for history).
  • CQRS (Command Query Responsibility Segregation): Use separate models for Writes (Order Service) and Reads (History Service). This allows scaling reads independently via Read Replicas.
  • LMAX Disruptor: A high-performance inter-thread messaging library. Used in HFT to avoid lock contention. Overkill for 500 TPS but good for 500k TPS.
  • Kernel Bypass (DPDK/Solarflare): Bypassing the Linux kernel networking stack to write directly to the NIC. Reduces latency from 10us to 1us.
  • Clock Synchronization: NTP is not enough for sub-millisecond precision. Use PTP (Precision Time Protocol) with hardware timestamping.
  • Garbage Collection Tuning: For Java, use ZGC to keep pauses < 1ms. For Go, use GOGC=off and manual memory management if needed (extreme case).
  • False Sharing: CPU cache line contention. Pad data structures to 64 bytes to prevent cores from invalidating each other’s caches.

IX. Failure Modes & Disaster Recovery

  • Partial Partition: What if the Order Service can reach Redis but not the DB? Answer: Fail the request safely.
  • Zombie Processes: A service that thinks it’s the leader but isn’t. Use Fencing Tokens (epoch numbers) to reject writes from zombies.
  • Thundering Herd: If Redis clears, 10k users hit the DB. Use Request Coalescing (Singleflight) to merge identical requests.
  • Split Brain: If the cluster partitions, do we accept writes on both sides? Answer: No. Pause writes (CP system) to preserve consistency.
  • Corrupted WAL: If Postgres WAL is corrupted, replay from the last snapshot and accept data loss (RPO > 0).
  • Region Failure: Failover to DR region. RTO (Recovery Time Objective) ~15 mins. DNS switch.

X. Market Microstructure

  • Slippage: The difference between the quoted price and executed price. In RFQ, the platform absorbs slippage (the “Spread”).
  • Order Types:
  • FOK (Fill or Kill): Execute fully or not at all.
  • IOC (Immediate or Cancel): Execute what you can, cancel the rest.
  • GTC (Good Till Cancelled): Standard limit orders (not used in RFQ).
  • Spread Capture: The primary revenue model. We buy at 50,000 and sell to users at 50,250.
  • Hedging: When a user buys 1 BTC, we immediately buy 1 BTC from an LP to neutralize our inventory risk.
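The spread-capture arithmetic from the bullets above, in integer cents (the milli-BTC quantity convention is illustrative):

```go
package main

import "fmt"

// All prices in integer cents to avoid float drift.
const (
	lpPriceCents   = 5_000_000 // we buy from the LP at $50,000.00
	userPriceCents = 5_025_000 // we quote the user $50,250.00
)

// spreadCents is the platform's revenue on a hedged round trip: sell to
// the user at the quoted price, buy the same quantity from the LP.
func spreadCents(qtyMilliBTC int64) int64 {
	return (userPriceCents - lpPriceCents) * qtyMilliBTC / 1000
}

func main() {
	fmt.Println(spreadCents(1000)) // 25000 cents = $250 captured on 1 BTC
}
```

Once the hedge fills, the platform holds no directional position; the spread is the fee for carrying market risk during the 7-second quote window.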

14. Summary: The Whiteboard Strategy

If you are asked to design this in 45 minutes, draw this 4-Quadrant Layout:

1. Requirements & Core Math

  • Func: Quote (7s), Order, Wallet.
  • Non-Func: 99.999%, <50ms Latency, ACID.
  • Scale: 230 QPS (Quotes), 12 TPS (Orders).
  • Traffic: Read-heavy (20:1).

2. Architecture

```
            [Client]
               ↓
  [Gateway (Auth / Rate Limit)]
       ↙               ↘
[Quote SVC]         [Order SVC]
     |                   |
[Redis Cluster]     [Sharded DB]
```

  • Separation of Concerns: Fast (Quotes) vs Safe (Orders).
  • Sharding: By User ID for local ACID.

3. Data & API

```
POST /quotes → { price, expiry, sig }
POST /orders → { order_id, status }

Wallets: (user_id, currency, balance)
Orders:  (order_id, status, quote_id)
```

4. Trade-offs & Deep Dives

  • Concurrency: Pessimistic Locking (`FOR UPDATE`) prevents double-spend.
  • Latency: Redis `volatile-ttl` + Connection Pooling + Geo-routing.
  • Reliability: Grace Period for network jitter.
  • Observability: Distributed Tracing + Audit Logs.
