Deep Dive: Trading Order System (RFQ)
[!NOTE] This module explores the core principles of RFQ trading order systems, deriving solutions from first principles and hardware constraints to build production-ready expertise.
1. Introduction: The High-Stakes World of RFQ
In the world of cryptocurrency exchanges, there are two primary models for trading: CLOB (Central Limit Order Book) and RFQ (Request for Quote). While CLOBs (like Nasdaq or Binance) match buyers and sellers continuously, RFQ systems are designed for institutional clients or “Convert” features where a user asks: “How much for 10 BTC right now?” and the system replies: “$50,000.00 per BTC. Valid for 7 seconds.”
This creates a unique distributed systems challenge: Ephemeral State Management. You must guarantee a price for a short window, managing the risk that the market might crash during those seconds. If your system is too slow, you lose money. If your system is inconsistent, you lose user funds.
This guide assumes no prior knowledge of financial systems. We will build a production-grade, globally scalable RFQ system from the ground up, dissecting every component from the Load Balancer to the Database Kernel.
2. Requirements & Constraints
2.1 Functional Requirements
- Authentication & Authorization: Users must log in via a secure Identity Provider (OIDC). Access is strictly controlled via Role-Based Access Control (RBAC) and KYC Tiering.
- Market Data Feed (Ticker): Users need a real-time stream of indicative prices for all supported pairs (e.g., `GET /prices`), distinct from firm executable quotes.
- Get Quote (RFQ): Users request a firm buy/sell price. The system returns a cryptographically signed quote with an absolute `expires_at` timestamp.
- Execute Order: Users can accept the quote only if:
  - Expiry Check: Server time ≤ `expires_at` (plus grace period).
  - Funds Reservation: A “Hold” mechanism locks funds before execution.
  - Atomic Claim: A “Claim Key” prevents double-filling (replay protection).
- Wallet Management: Users must be able to manage their balances (deposits, withdrawals). The system must use a Ledger/Journal model for auditability.
- Transaction History: Users can view their past trades and current portfolio value.
- Settlement: Funds must be exchanged atomically. No partial states.
2.2 Non-Functional Requirements
- Reliability: 99.999% (Five Nines).
- Strategy: Active-Active for stateless services (Quotes), Active-Passive with <30s failover for stateful services (Orders).
- SLO Split: Execution (99.99%) vs Quoting (99.9% with graceful degradation).
- Latency:
- Quote Generation: < 50ms (99th percentile). Stale quotes are risky.
- Order Execution: < 100ms (99th percentile). Fast feedback is crucial.
- Consistency: Strong Consistency (ACID) for all financial transactions. Eventual consistency is not allowed for balances.
- Throughput: Support 10,000 Quotes/sec and 500 Trades/sec initially, scalable to 10x via sharding.
- Availability: 24/7. Crypto markets never sleep.
- Security: Strict Authentication (mTLS/JWT), Rate Limiting, and Audit Logs.
- Global Reach: Multi-region deployment with “Home Region” pinning for users.
2.3 Compliance & Regulations
[!IMPORTANT] Problem Statement: “System should be compliant with country regulations”
- KYC (Know Your Customer): Mandatory verification before trading.
- Tier 1: Email (Low limits).
- Tier 2: ID + Selfie (Medium limits).
- Tier 3: Source of Funds (High limits).
- AML (Anti-Money Laundering):
- Sanctions Screening: Check users against OFAC/UN lists securely.
- Transaction Monitoring: Real-time flagging of suspicious patterns (e.g., structuring).
- Travel Rule: For crypto withdrawals >$1,000, originator and beneficiary details must be transmitted to the receiving VASP (Virtual Asset Service Provider).
- Market Abuse: Prevention of wash trading and spoofing via Rate Limiting and Anomaly Detection.
- Data Residency:
  - GDPR (EU): European user data stays in `eu-west-1`.
  - CCPA (US): US user data stays in `us-east-1`.
3. Capacity Planning & Estimation
Before writing code, we must understand the hardware requirements.
3.1 Traffic Analysis
- DAU (Daily Active Users): 1,000,000.
- Quote-to-Trade Ratio: 20:1. Users check prices frequently but trade less often.
- Quotes: 20 requests/user/day → 20M quotes/day.
- Average QPS: 20,000,000 / 86,400 ≈ 230 QPS.
- Peak QPS: Financial markets are uniquely volatile. A single tweet (e.g., from Elon Musk) can trigger massive spikes. We must design for 10x to 50x bursts → 2,300 - 11,500 QPS.
- Why design for peak? If your system fails during a rally, users can’t trade, and you lose the most profitable moments.
- Trades: 1M trades/day.
- Average TPS: 1,000,000 / 86,400 ≈ 12 TPS.
- Peak TPS: 120 TPS.
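These estimates are simple enough to sanity-check in code; a quick sketch using the figures above:

```python
# Back-of-envelope traffic estimation (figures from the section above).
DAU = 1_000_000
QUOTES_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400

quotes_per_day = DAU * QUOTES_PER_USER_PER_DAY     # 20M quotes/day
avg_qps = quotes_per_day / SECONDS_PER_DAY         # ~230 QPS
peak_qps = (avg_qps * 10, avg_qps * 50)            # 10x-50x burst design range

trades_per_day = 1_000_000
avg_tps = trades_per_day / SECONDS_PER_DAY         # ~12 TPS
peak_tps = avg_tps * 10                            # ~120 TPS
```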
3.2 Bandwidth & Network
- Quote Payload: ~1KB (JSON with prices, metadata, signature).
- Bandwidth: 2,300 QPS × 1 KB ≈ 2.3 MB/s. Trivial for modern networks (1Gbps links).
- Order Payload: ~500 Bytes.
- Bandwidth: Negligible.
3.3 Storage (The Ledger)
- Trades Table: The primary source of truth.
- Row Size: ~500 Bytes (IDs, timestamps, prices, fees).
- Daily Growth: 1,000,000 trades × 500 B = 500 MB/day.
- Yearly Growth: 500 MB × 365 ≈ 180 GB.
- 5-Year Retention: 180 GB × 5 ≈ 900 GB.
- Conclusion: A single master database could hold this volume, but for IOPS (Input/Output Operations Per Second) and concurrency, we will need Sharding (Consistent Hashing).
3.4 Memory (Redis Cache)
- Active Quotes: Only valid for 7 seconds.
- At peak 2,300 QPS × 7s = ~16,100 active quotes in memory.
- Size: 16,100 × 1 KB ≈ 16 MB.
- Conclusion: Redis memory usage is tiny. We are CPU/Network bound, not Memory bound.
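The storage and cache arithmetic above, sketched the same way:

```python
# Storage and cache sizing (figures from sections 3.3-3.4).
TRADE_ROW_BYTES = 500
TRADES_PER_DAY = 1_000_000

daily_growth_mb = TRADES_PER_DAY * TRADE_ROW_BYTES / 1e6   # 500 MB/day
five_year_gb = daily_growth_mb * 365 * 5 / 1e3             # ~912 GB (~900 GB)

PEAK_QPS = 2_300
QUOTE_TTL_S = 7
QUOTE_BYTES = 1_024
active_quotes = PEAK_QPS * QUOTE_TTL_S                     # ~16,100 quotes in flight
redis_mb = active_quotes * QUOTE_BYTES / 1e6               # ~16 MB
```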
4. High-Level Architecture: Global Regional Cells
We move beyond a standard monolith to a Federated Regional Cell architecture. This design balances two conflicting requirements:
- Low Latency: Users want fast quotes (Speed of Light matters).
- Data Residency: Regulators (GDPR, FinCEN) require user data to stay effectively “domiciled” in their home jurisdiction.
The diagram below illustrates our Hybrid Routing Pattern:
- Stateless Reads (Blue Path): Served from the nearest Regional Cell for maximum speed.
- Stateful Writes (Pink Path): Proxied to the user’s Home Cell for ACID compliance and data sovereignty.
(Diagram: the Blue read path terminates at the nearest cell; the Pink write path is proxied to the user’s Home Cell (`home_cell=EU`), which holds the CLAIM Key and the stateful tables: Wallets (Ledger), Orders, Holds.)
5. Detailed Component Design & Trade-offs
5.1 API Gateway (The Doorman)
The Gateway (e.g., Kong, Nginx, or AWS API Gateway) is the single entry point.
- Authentication:
- Uses JWT (JSON Web Tokens) for lightweight, stateless auth.
- The Gateway verifies the signature (RSA-256) and claims (Expiration, Scopes) before passing the request downstream.
- Rate Limiting:
- Algorithm: Token Bucket.
- Why Token Bucket over Leaky Bucket? Token Bucket allows for short bursts of traffic (e.g., a bot reacting to a sudden market move), which is desirable for trading. Leaky Bucket enforces a rigid rate, which might punish legitimate high-frequency users.
- Scope: Per User ID and Per IP.
- Config: `100 req/min` for Quotes, `20 req/min` for Orders.
- Optimization:
- SSL Termination: Decrypt HTTPS at the edge to offload CPU from microservices.
- Keep-Alive: Maintain persistent connections to upstream services to avoid TCP Handshake overhead.
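To make the Token Bucket trade-off concrete, here is a minimal sketch; the capacity and refill rate are illustrative, not the gateway’s real config:

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at `rate` tokens/sec.
    The caller supplies `now` (e.g., time.monotonic()) on each call."""
    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity      # start full, so bursts pass immediately
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A burst of 5 requests at t=0 passes; the 6th is rejected until tokens refill.
bucket = TokenBucket(capacity=5, rate=1.0)
results = [bucket.allow(now=0.0) for _ in range(6)]   # burst, then rejection
```

A Leaky Bucket would have rejected most of this burst outright, which is exactly why the document prefers Token Bucket for bot-driven trading traffic.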
5.2 The Quote Service (The Sprinter)
This service must be incredibly fast. It is Stateless and Read-Heavy.
- Role:
- Fetch real-time prices from Liquidity Providers (LPs) via WebSocket.
- Apply a “Spread” (e.g., +0.5% markup for profit).
- Generate a cryptographically signed quote.
- Cache the quote in Redis with `TTL=7s`.
- Optimizations:
- Fan-out: Query 3 LPs in parallel, take the median price (to avoid outliers/manipulation).
- Zero-Allocation: In Go/Rust, reuse memory buffers to avoid Garbage Collection pauses during high load.
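A minimal sketch of this quote path, with illustrative LP prices and a placeholder secret (a real key would come from a secrets manager, and the Redis write is shown as a comment):

```python
import hashlib
import hmac
import json
import statistics

SERVER_SECRET = b"demo-secret"   # assumption: real key loaded from a KMS
SPREAD = 0.005                   # +0.5% markup

def build_quote(lp_prices, quote_id, expires_at):
    mid = statistics.median(lp_prices)       # median resists a single bad LP
    price = mid * (1 + SPREAD)
    data = {"id": quote_id, "price": price, "expires_at": expires_at}
    payload = json.dumps(data, sort_keys=True)
    sig = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # In production: redis.setex(f"quote:{quote_id}", 7, payload)
    return {"data": data, "signature": sig}

quote = build_quote([50_000, 50_100, 49_900], "q_123", 1_698_765_432)
```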
5.3 The Order Service (The Vault)
This service must be incredibly safe. It is Stateful and Transactional.
- Role:
- Atomically claim the quote (prevent replay attacks).
- Check user balance (DB lock).
- Execute the trade atomically (Update DB).
- Idempotency:
- Crucial for network failures. If a client sends an order but doesn’t get a response (timeout), they will retry.
- Mechanism: The client sends a unique `idempotency_key` (UUID). The server enforces a unique index in the DB: `INSERT INTO orders (id, ...) VALUES ... ON CONFLICT DO NOTHING`.
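The idempotent-insert pattern can be demonstrated with SQLite standing in for PostgreSQL (`ON CONFLICT` needs SQLite ≥ 3.24); the table and key names here are illustrative:

```python
import sqlite3

# SQLite stand-in for the Postgres pattern.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (idempotency_key TEXT PRIMARY KEY, status TEXT)")

def place_order(key):
    cur = db.execute(
        "INSERT INTO orders (idempotency_key, status) VALUES (?, 'FILLED') "
        "ON CONFLICT (idempotency_key) DO NOTHING", (key,))
    return cur.rowcount == 1   # True only on the first attempt

first = place_order("idem-123")   # executes the trade
retry = place_order("idem-123")   # client timeout retry: safely ignored
```

The retry is absorbed by the database itself; no application-level deduplication cache is needed.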
5.3.1 Quote Replay Protection (CRITICAL)
[!CAUTION] The #1 Interview Gotcha: Without atomic claim, your system is vulnerable to double-fill attacks.
The Attack Scenario:
- User receives quote: `quote_id=abc123`, valid for 7 seconds.
- User opens two browser tabs.
- Tab A: Submit order with `quote_id=abc123`.
- Tab B: Submit order with `quote_id=abc123` (1ms later).
- Both requests race:
  - Request A: Check `quote:abc123` exists? → ✅ YES
  - Request B: Check `quote:abc123` exists? → ✅ YES (race!)
  - Request A: Debit $50,000, credit 1 BTC
  - Request B: Debit $50,000, credit 1 BTC
- Result: User gets 2 BTC for the price of 1 → You lose $50,000
The Analogy: Ticket Reservation Imagine you are buying a concert ticket.
- Quote: You see a seat for $100.
- Race: You and your friend click “Buy” at the exact same millisecond.
- Correct System: Only ONE person gets the ticket. The other gets an error.
- Bad System: Both get the ticket. The venue is overbooked. You lose money.
The Solution: Atomic Check-and-Claim
We use a Redis Lua script to atomically check if a quote exists AND mark it as claimed in a single operation:
Click to view Lua Script
-- quote_claim.lua
-- This script ensures a quote can only be claimed ONCE
local quote_key = KEYS[1] -- "quote:abc123"
local claim_key = KEYS[2] -- "quote:abc123:claimed"
local order_id = ARGV[1] -- "order_xyz789"
local remaining_ttl = ARGV[2] -- Remaining seconds for quote validity
-- Atomic check: Does quote exist AND is it not claimed?
if redis.call('EXISTS', quote_key) == 1 and redis.call('EXISTS', claim_key) == 0 then
-- Claim it for this order_id
redis.call('SET', claim_key, order_id, 'EX', remaining_ttl)
-- Return the quote payload
return redis.call('GET', quote_key)
else
-- Quote expired or already claimed
return nil
end
Order Service Execution Flow (Python):
Click to view Python Implementation
import redis
import time
import hmac
import hashlib
import json

redis_client = redis.Redis()
# SERVER_SECRET and `db` (a transactional database handle) are assumed to be
# initialized in the service's startup code.

# Load Lua script once on startup (cached on the Redis server)
QUOTE_CLAIM_SCRIPT = redis_client.script_load(open('quote_claim.lua').read())

def execute_order(user_id, quote_payload_signed, idempotency_key):
    # Step 1: Verify HMAC signature (prevent tampering)
    quote_id = quote_payload_signed['id']
    received_hmac = quote_payload_signed['signature']
    payload = json.dumps(quote_payload_signed['data'], sort_keys=True)
    expected_hmac = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(received_hmac, expected_hmac):
        return {"error": "INVALID_SIGNATURE"}

    # Step 2: Check quote freshness (server-side clock with grace period)
    quote_data = quote_payload_signed['data']
    expires_at = quote_data['expires_at']  # Absolute timestamp from server
    current_time = time.time()
    # Allow a 200ms grace period for network latency
    if current_time > expires_at + 0.2:
        return {"error": "QUOTE_EXPIRED"}

    # Step 3: Atomic claim using the Lua script (Redis EX must be >= 1 second)
    remaining_ttl = max(1, int((expires_at + 0.2) - current_time))
    quote_key = f"quote:{quote_id}"
    claim_key = f"quote:{quote_id}:claimed"
    quote_payload = redis_client.evalsha(
        QUOTE_CLAIM_SCRIPT,
        2,  # Number of KEYS
        quote_key, claim_key,
        idempotency_key, remaining_ttl
    )
    if quote_payload is None:
        return {"error": "QUOTE_EXPIRED_OR_ALREADY_USED"}

    # Step 4: Database transaction (Reserve + Execute)
    with db.begin():
        # Lock the wallet being spent (Pessimistic Locking / Reserve).
        # For a BUY of BTC-USD, the user spends the quote currency (USD).
        wallet = db.execute(
            "SELECT balance FROM wallets WHERE user_id = %s AND currency = %s FOR UPDATE",
            (user_id, quote_data['quote_currency'])
        ).fetchone()
        if wallet['balance'] < quote_data['total_cost']:
            return {"error": "INSUFFICIENT_BALANCE"}

        # Debit the quote currency and credit the base currency atomically
        db.execute(
            "UPDATE wallets SET balance = balance - %s WHERE user_id = %s AND currency = %s",
            (quote_data['total_cost'], user_id, quote_data['quote_currency'])
        )
        db.execute(
            "UPDATE wallets SET balance = balance + %s WHERE user_id = %s AND currency = %s",
            (quote_data['amount'], user_id, quote_data['base_currency'])
        )
        # Insert order record (unique idempotency_key ensures uniqueness)
        db.execute(
            "INSERT INTO orders (order_id, user_id, quote_id, pair, amount, price, status) "
            "VALUES (%s, %s, %s, %s, %s, %s, 'FILLED') "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key, user_id, quote_id, ...)
        )

    return {"status": "SUCCESS", "order_id": idempotency_key}
Why This Works:
- Lua Atomicity: Redis executes the entire script as a single atomic operation. No race conditions.
- Claim Key: Once set, the claim key prevents the quote from being used again (even if the original quote key still exists).
- TTL Inheritance: The claim key expires at the same time as the quote, preventing leaked state.
- Grace Period: The 200ms buffer accounts for network delays, ensuring valid requests at `t=6.9s` are not rejected.
5.4 Redis Architecture
We use Redis not just for caching, but for temporary state (Quotes).
- Cluster Mode:
- Data is sharded across multiple nodes (16384 slots).
- This allows horizontal scaling of memory and throughput.
- Eviction Policy: `volatile-ttl`.
  - Why? We only want to evict keys with an expiry set (quotes). If we used `allkeys-lru`, we might accidentally evict persistent configuration keys or session data. See Module 06: Caching.
- Persistence:
  - RDB (Snapshot): Every 15 minutes.
  - AOF (Append Only File): Disabled or set to `everysec`.
  - Why? If Redis crashes, losing the last 1 second of quotes is acceptable. Users will just request a new quote. Performance > Durability for quotes.
5.5 Database Architecture (The Ledger)
We choose PostgreSQL for its robust ACID compliance.
Why PostgreSQL over NewSQL (CockroachDB/TiDB)?
- Maturity: PostgreSQL has decades of battle-testing in financial systems. NewSQL databases are powerful but introduce complexity in deployment and debugging.
- Complexity: For our scale (1M users), sharded Postgres is sufficient and well-understood. Distributed SQL adds network overhead (Raft consensus) to every write, increasing latency, which we want to avoid for order execution.
Schema Design
[!CAUTION] Never use `FLOAT` or `DOUBLE` for money. Floating-point arithmetic causes rounding errors due to IEEE 754 representation. Example: `0.1 + 0.2 ≠ 0.3` in binary floating point. Always use fixed-point types (`NUMERIC`/`DECIMAL`) or store integer minor units (`BIGINT` for cents/wei).
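This caution is easy to demonstrate; Python’s `decimal` module behaves like SQL `NUMERIC`:

```python
from decimal import Decimal

float_sum = 0.1 + 0.2                          # 0.30000000000000004 (IEEE 754)
decimal_sum = Decimal("0.1") + Decimal("0.2")  # exactly Decimal('0.3')

# Alternative: integer minor units (cents / satoshi / wei) in a BIGINT column.
price_cents = 50_250 * 100                     # $50,250.00 stored as 5,025,000 cents
```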
Click to view SQL Schema
CREATE TABLE wallets (
user_id UUID,
currency VARCHAR(10),
-- NUMERIC(36, 18) supports crypto tokens with 18 decimals (e.g., ETH)
-- Many ERC-20 tokens use 18 decimal places (1 ETH = 10^18 wei)
balance NUMERIC(36, 18) NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (user_id, currency)
);
CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    idempotency_key UUID UNIQUE NOT NULL, -- Prevent duplicate submissions
    user_id UUID NOT NULL,
    quote_id VARCHAR(64) NOT NULL,
    pair VARCHAR(20) NOT NULL,            -- 'BTC-USD', 'ETH-USDT'
    side VARCHAR(4) NOT NULL,             -- 'BUY', 'SELL'
    amount NUMERIC(36, 18) NOT NULL,      -- Quantity of crypto
    price NUMERIC(36, 18) NOT NULL,       -- Price per unit
    total_cost NUMERIC(36, 18) NOT NULL,  -- amount * price
    fee NUMERIC(36, 18) NOT NULL DEFAULT 0,
    status VARCHAR(20) NOT NULL,          -- 'FILLED', 'FAILED', 'CANCELLED'
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    settled_at TIMESTAMPTZ
);

-- PostgreSQL has no inline INDEX clause; create it separately.
CREATE INDEX idx_user_created ON orders (user_id, created_at DESC); -- For transaction history
Sharding Strategy (2-Level Hierarchy)
As the user base grows, we need to scale writes. We use a 2-Level Partitioning Strategy:
- Level 1: Jurisdiction (Regional Cell): Users are first partitioned by their regulatory home (e.g., EU Users → Frankfurt Cell). This is a hard physical boundary for compliance.
- Level 2: User ID Sharding (Consistent Hashing): Within the cell, we shard by `user_id`.
  - Sharding Key: `user_id`.
  - Mechanism: `Hash(user_id) % 1000` → `Bucket_ID` → `Physical_Node`.
  - Why? All data for a specific user lives on the same shard, enabling Local ACID Transactions.
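The two-level lookup can be sketched deterministically (using SHA-256 rather than Python’s salted built-in `hash()`; the node count is illustrative):

```python
import hashlib

BUCKETS = 1000
# Illustrative bucket -> node map: 4 physical nodes inside one regional cell.
NODES = [f"pg-node-{i}" for i in range(4)]

def shard_for(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % BUCKETS       # Level 2: Hash(user_id) % 1000
    return NODES[bucket % len(NODES)]        # bucket -> physical node

# All of a user's rows (wallets, orders, holds) resolve to the same node,
# so their transactions stay local to one shard.
node = shard_for("u_alice")
```

The indirection through buckets (rather than hashing straight to nodes) lets you rebalance by moving whole buckets when nodes are added.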
5.6 Asynchronous History & Auditing
While the order execution is synchronous (ACID), reporting is asynchronous.
- Change Data Capture (CDC): We use the Outbox Pattern or simple application-level publishing.
- Kafka Topic: After the DB commit succeeds, the Order Service publishes an event to the `orders_events` topic: `{ "event": "ORDER_FILLED", "order_id": "...", "price": 50250, "ts": 169876... }`
- Audit & Compliance Services: Consume these events for Post-Trade Surveillance:
- Market Abuse Detection: Flagging “Wash Trading” or “Layering”.
- AML Velocity Checks: Identifying “Structuring” (e.g., a user deposits $9,999 twice to avoid the $10,000 reporting threshold).
- Chainalysis Integration: If the user withdraws to a wallet linked to a Darknet Market, flag the account immediately.
- Regulatory Archival: Write all events to WORM (Write Once Read Many) S3 Glacier storage for 7+ years (SEC 17a-4 compliance).
5.7 Authentication & Authorization
[!IMPORTANT] Problem Statement Requirement: “Users should be able to log in to the system and access only the features and data they are authorized to see.”
5.7.1 Authentication Flow
User Registration → Email Verification → KYC (Know Your Customer) → Account Activated
Click to view Auth Logic
# Simplified Auth Flow
import time
import jwt      # PyJWT
import pyotp    # TOTP (RFC 6238)
from passlib.hash import bcrypt   # bcrypt.verify(password, hash)

def login(email, password, totp_code):
    # Step 1: Verify password (bcrypt with cost factor 12)
    user = db.query("SELECT * FROM users WHERE email = ?", email)
    if not bcrypt.verify(password, user.password_hash):
        return {"error": "INVALID_CREDENTIALS"}

    # Step 2: Verify 2FA (Time-based One-Time Password)
    if not pyotp.TOTP(user.totp_secret).verify(totp_code):
        return {"error": "INVALID_2FA"}

    # Step 3: Generate JWT with user role/tier embedded
    jwt_payload = {
        "user_id": user.id,
        "home_cell": user.jurisdiction,  # 'EU', 'US', 'SG' - Crucial for routing!
        "role": user.role,               # 'retail', 'vip', 'institutional'
        "tier": user.kyc_tier,           # 'bronze', 'silver', 'gold'
        "exp": time.time() + 900         # 15 minutes
    }
    access_token = jwt.encode(jwt_payload, JWT_SECRET, algorithm="HS256")
    # Refresh rotation: a new 7-day refresh token is issued on each use
    refresh_token = generate_refresh_token(user.id)
    return {"access_token": access_token, "refresh_token": refresh_token}
5.7.2 Role-Based Access Control (RBAC)
The API Gateway enforces trading limits based on user tier:
| Tier | Max Trade Size (USD) | Daily Limit (USD) | Rate Limit (Quotes/sec) |
|---|---|---|---|
| Bronze (Unverified) | 1,000 | 5,000 | 5 |
| Silver (KYC Verified) | 10,000 | 50,000 | 10 |
| Gold (Enhanced DD) | 100,000 | 500,000 | 20 |
| Institutional | Custom | Custom | Custom |
API Gateway Middleware (Kong/Envoy):
Click to view Kong Middleware
-- kong_auth.lua
function check_trade_limit(jwt_payload, trade_amount)
  local tier = jwt_payload.tier
  local limits = {
    bronze = 1000,
    silver = 10000,
    gold = 100000
  }
  local limit = limits[tier]
  if limit == nil then
    return 200, "OK"  -- institutional: custom limits enforced elsewhere
  end
  if trade_amount > limit then
    return 403, "TRADE_LIMIT_EXCEEDED"
  end
  return 200, "OK"
end
5.7.3 Security Best Practices
- Password Storage:
bcryptwith cost factor 12 (2^12 iterations) - 2FA: TOTP (Google Authenticator) mandatory for withdrawals
- Session Management:
- Access Token: 15 minutes (short-lived)
- Refresh Token: 7 days (stored in HttpOnly cookie)
- Refresh rotation: New refresh token issued on each use
- Rate Limiting: Per-user fixed-window counter in Redis (a simple approximation of a sliding window)
Click to view Rate Limiter
key = f"rate_limit:{user_id}:quote"
count = redis.incr(key)
if count == 1:
    redis.expire(key, 1)   # start the 1-second window on the first increment
if count > 10:             # Max 10 quotes/sec
    return "RATE_LIMIT_EXCEEDED"
5.8 Multi-Region Deployment: The “Regional Cell” Model
[!IMPORTANT] Key Principle: Compliance and Latency dictate a 2-Level Partitioning Strategy.
- Jurisdiction (Cell): Data Residency (GDPR/FinCEN).
- Shard (User ID): Horizontal Scalability.
The Analogy: Embassies Think of Regional Cells like Embassies.
- EU Cell (Embassy): A piece of European soil in the cloud. All EU citizen data lives here.
- Traveling User: When an EU user goes to Japan, they can visit the Japanese “Consulate” (Edge Node) for help (Read Data), but for official business like renewing a passport (Trading/Writing), their request is securely mailed back to the HQ in Europe.
5.8.1 Regional Cells (The Outer Layer)
Instead of a monolithic global database, we deploy isolated Regional Cells. Each cell is a self-contained stamp of the architecture (QuoteSvc, OrderSvc, Postgres, Redis, Kafka) located in a specific jurisdiction. Requirements:
- Low Latency: Users in Asia shouldn’t hit US servers (adds 200ms+ RTT)
- Data Residency: EU users’ data must stay in EU (GDPR compliance)
- Disaster Recovery: If US-East fails, failover to US-West < 1 minute
- EU Cell (`eu-central-1`): Serves European residents. Data never leaves.
- US Cell (`us-east-1`): Serves US residents. Complies with FinCEN/SEC.
- APAC Cell (`ap-southeast-1`): Serves Asia-Pacific.
Data Residency Rules:
- ✅ Global Data: Public Market Data, Coin Metadata (stored everywhere).
- ❌ Resident Data: PII (KYC), Wallet Balances, Trade History, Audit Logs (pinned to Home Cell).
5.8.2 Home Cell Routing (The Traveler Problem)
A user’s “Home Cell” is determined at signup based on their country of residence. What happens when an EU user travels to Japan?
Pattern: Nearest Ingress → Home Cell Forwarding
- DNS/GSLB: Routes user to the nearest edge (Japan). Handshake is fast.
- Ticker/Market Data: Served locally from Japan (Stateless/Public). Fast.
- Auth & Trade:
- API Gateway in Japan decodes the JWT.
- JWT Claim: `{"user_id": "u123", "home_cell": "EU"}`.
- Gateway proxies the request to the EU Ingress.
- EU Cell processes the Order (ACID execution).
Result: The user gets fast market data, but trade execution pays the latency penalty (Japan → EU) to guarantee Compliance (Data never rests in Japan) and Consistency (No distributed transactions).
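The gateway’s routing decision reduces to a small function; the cell names follow the section above, everything else is a sketch:

```python
LOCAL_CELL = "APAC"   # the cell where this gateway instance runs

STATELESS_PATHS = {"/prices", "/ticker"}   # public market data: serve locally

def route(path: str, jwt_payload: dict) -> str:
    """Return the cell that should handle this request."""
    if path in STATELESS_PATHS:
        return LOCAL_CELL                  # fast local read
    return jwt_payload["home_cell"]        # stateful write: proxy to Home Cell

user = {"user_id": "u123", "home_cell": "EU"}
```

An EU user in Japan gets `/prices` from the local cell, while `/orders` pays the round trip to the EU.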
5.8.3 Cross-Region Failure Scenarios
What happens if the Private Backbone between Japan and EU is severed?
- Reads (Market Data): ✅ Unaffected. The Japan Edge continues to serve cached prices from local LPs.
- Writes (Trading): ❌ Failed. The Gateway cannot forward the `POST /orders` to the Home Cell.
  - Behavior: API returns `503 Service Unavailable`.
  - Why not buffer? We specifically avoid buffering orders at the Edge because it creates “hidden risk”. If the link comes back 10 minutes later, the market price will have moved, and executing stale orders would be disastrous. Fast failure is a feature here.
5.9 Internal Load Balancer (ILB)
While the API Gateway handles external traffic (Authentication, Rate Limiting), we use an Internal Load Balancer (ILB) between the Gateway and our microservices (Quote, Order, Auth).
- Role: Traffic Distribution & Service Discovery.
- Why?:
- Decoupling: The Gateway addresses the Service (e.g., `http://quote-svc`), not individual IPs. The ILB resolves this to healthy pod IPs.
- Health Checks: The ILB automatically removes unhealthy instances from rotation.
- Protocol Support: Handles gRPC (HTTP/2) load balancing for internal communication if needed.
- Placement:
- Region-Specific: Each Regional Cell has its own ILB to keep traffic local.
- Layer 4 vs Layer 7: Typically Layer 7 (HTTP) for smart routing, or Layer 4 (TCP) for raw speed.
6. Deep Dive: Reliability & Consistency
6.1 The “Double Spend” Attack
Scenario: Malicious User “Eve” has $100. She sends two requests simultaneously:
- Buy $100 of BTC.
- Buy $100 of ETH.
If processed in parallel without locking, both might check `balance >= 100` (True) and succeed. Eve spends $200 but only had $100.
The Analogy: The Shared Bank Account Imagine a husband and wife share a bank account with $100.
- Husband goes to ATM A: “Withdraw $100”.
- Wife goes to ATM B: “Withdraw $100”.
- If the bank doesn’t “lock” the account, both ATMs check balance (100), dispense cash (200 total), and the bank loses $100.
Solution: Pessimistic Locking
We use SELECT ... FOR UPDATE to lock the wallet row.
Click to view Transaction Logic
BEGIN;
-- This line blocks other transactions trying to read Alice's USD wallet
SELECT balance FROM wallets WHERE user_id = 'Alice' AND currency = 'USD' FOR UPDATE;
-- Application logic (pseudocode): commit only if funds suffice
IF balance >= 100 THEN
    UPDATE wallets SET balance = balance - 100 ...;
    INSERT INTO orders ...;
    COMMIT;
ELSE
    ROLLBACK;
END IF;
- Trade-off: Locking reduces concurrency for a single user, but prevents fraud. Since we shard by user, this doesn’t block other users.
- Why not Optimistic Locking? Optimistic locking (using `version` columns) works well for low-contention workloads. In high-frequency trading, retrying failed transactions due to version conflicts adds latency and complexity to the client. Pessimistic locking guarantees the order executes (or fails) immediately.
6.2 The 7-Second Race Condition
Scenario:
- T=0.0s: Quote Generated (Valid until T=7.0s).
- T=6.9s: User clicks “Buy”.
- T=7.1s: Request reaches Server.
- Strict Logic: Reject (Expired).
- User Experience: “I clicked in time! Your system sucks.”
Solution: The Grace Period
We add a server-side buffer (e.g., 500ms). The validator checks:
if (CurrentTime < QuoteExpiry + GracePeriod)
- Risk: The market crashes in that 500ms.
- Mitigation: The “Spread” (profit margin) we added to the quote covers these small slippage risks.
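The validator check is worth writing down explicitly (using the 500ms grace period from the example above):

```python
GRACE_PERIOD_S = 0.5   # server-side buffer for network latency

def quote_is_valid(now: float, expires_at: float) -> bool:
    # Accept if we are still inside expiry plus the grace window.
    return now < expires_at + GRACE_PERIOD_S

# Quote valid until t=7.0s; request arrives at t=7.1s.
late_but_ok = quote_is_valid(now=7.1, expires_at=7.0)   # inside grace: accepted
too_late = quote_is_valid(now=7.6, expires_at=7.0)      # beyond grace: rejected
```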
7. System Walkthrough: The Life of a Trade
To solidify our understanding, let’s trace a single transaction through the entire stack, examining the exact API payloads, Redis keys, and Database queries.
Scenario A: The Happy Path (Quote → Buy → Success)
Step 1: User requests a quote for 1 BTC.
Click to view Quote Request/Response
- Request: `POST /quotes` with body `{ "pair": "BTC-USD", "side": "BUY", "amount": 1.00000000 }`
- Quote Service Action:
  - Fetches price from LP: `$50,000`.
  - Adds spread (+0.5%): `$50,250`.
  - Generates `quote_id`: `q_123`.
  - Signs the payload: `HMAC_SHA256(price + expiry, secret)`.
- Redis State: `SET quote:q_123 "{\"price\": 50250, \"expiry\": 1698765432}" EX 7`
- Response: `200 OK` with body `{ "quote_id": "q_123", "price": 50250.00, "expiry": 1698765432, "signature": "a1b2c3d4..." }`
Step 2: User accepts the quote (within 7 seconds).
Click to view Order Execution Flow
- Request: `POST /orders` with body `{ "quote_id": "q_123", "order_id": "o_999", "user_id": "u_alice" }` (Note: `order_id` is a client-generated UUID for Idempotency)
- Order Service Action:
  - Validate Quote: `GET quote:q_123`. (If it exists, proceed.)
  - Validate Signature: Recompute HMAC. (Prevents tampering with the price.)
  - DB Transaction:
    ```sql
    BEGIN;
    -- 1. Lock Wallet
    SELECT balance FROM wallets WHERE user_id = 'u_alice' AND currency = 'USD' FOR UPDATE;
    -- (Result: 100,000.00)
    -- 2. Update Balance
    UPDATE wallets SET balance = balance - 50250.00 WHERE user_id = 'u_alice' AND currency = 'USD';
    -- 3. Insert Order
    INSERT INTO orders (order_id, user_id, quote_id, status) VALUES ('o_999', 'u_alice', 'q_123', 'FILLED');
    COMMIT;
    ```
- Response: `200 OK` with body `{ "status": "FILLED", "tx_id": "tx_555" }`
Scenario B: The “Double Spend” Attempt
Imagine “Eve” has $100. She sends two requests simultaneously:
- Req A: Buy $100 BTC.
- Req B: Buy $100 ETH.
Timeline:
- T=0.00s: Req A reaches DB. Starts Transaction.
Click to view Race Condition Logic
```sql
[Tx A] SELECT balance ... FOR UPDATE; -- Locks Row
```
- T=0.01s: Req B reaches DB. Starts Transaction.
```sql
[Tx B] SELECT balance ... FOR UPDATE; -- BLOCKED! Waits for Tx A.
```
- T=0.05s: Req A updates balance (100 → 0) and `COMMIT`. Row Lock released.
- T=0.06s: Req B unblocks and reads the new balance. `balance` is now $0.
  - Logic: `IF balance < 100 THEN ROLLBACK`.
  - Result: Req A succeeds (200 OK). Req B fails (402 Payment Required).
Scenario C: The “Expired Quote” Race Condition
Step 1: User attempts to trade at T=7.1s (0.1s too late).
- Request: `POST /orders { quote_id: "q_123" }`
- Order Service Action:
  - Check Redis: `GET quote:q_123`
  - Result: `(nil)` (key evicted by Redis).
  - Fallback Check: Even if the key existed (e.g., due to lag), check `payload.expiry < Now()`.
- Response: `400 Bad Request`
Click to view Error Response
```json
{
  "error": "QUOTE_EXPIRED",
  "message": "Quote q_123 is no longer valid. Please request a new price."
}
```
8. Alternative Solutions (Trade-offs)
8.1 RFQ vs. CLOB (Central Limit Order Book)
- CLOB (e.g., Nasdaq, Binance):
- Mechanism: Continuous matching of limit orders.
- Pros: Transparent pricing, high liquidity discovery.
- Cons: Extremely complex to engineer (matching engine), high computational cost.
- RFQ (Our System):
- Mechanism: Guaranteed price on demand.
- Pros: Simpler architecture, better UX for large trades (no slippage).
- Cons: Platform takes market risk.
8.2 SQL vs. NoSQL
- NoSQL (DynamoDB):
  - Pros: Near-limitless horizontal scaling.
  - Cons: Weak support for multi-row ACID transactions (DynamoDB transactions exist, but are size-capped and costly). Implementing a ledger in NoSQL requires complex application-level locking (e.g., optimistic locking with version numbers), which is error-prone for financial data.
- SQL (PostgreSQL):
- Pros: Native ACID, referential integrity.
- Cons: Harder to scale writes.
- Decision: SQL wins because financial correctness > raw write speed. Sharding solves the scale issue.
8.3 Event Sourcing
- Concept: Store every transaction as an immutable event (`Deposited`, `Bought`, `Sold`). Calculate balance by replaying events.
- Pros: Perfect audit trail, easy debugging.
- Cons: Replaying millions of events to get a balance is slow. Requires “Snapshots”.
- Our Choice: Hybrid. We use a standard SQL table for the current balance (fast) but log every change to an `audit_logs` table (immutable).
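The hybrid approach can be sketched as one transaction that updates the current balance and appends to the audit log together (SQLite stand-in; the schema is illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE balances (user_id TEXT PRIMARY KEY, balance INTEGER NOT NULL);
  CREATE TABLE audit_logs (id INTEGER PRIMARY KEY AUTOINCREMENT,
                           user_id TEXT, delta INTEGER, reason TEXT);
  INSERT INTO balances VALUES ('u_alice', 100000);
""")

def apply_change(user_id, delta, reason):
    # One transaction: the fast current-balance table and the immutable
    # audit trail are updated together, or not at all.
    with db:
        db.execute("UPDATE balances SET balance = balance + ? WHERE user_id = ?",
                   (delta, user_id))
        db.execute("INSERT INTO audit_logs (user_id, delta, reason) VALUES (?, ?, ?)",
                   (user_id, delta, reason))

apply_change("u_alice", -50250, "BUY BTC-USD order o_999")
balance = db.execute(
    "SELECT balance FROM balances WHERE user_id = 'u_alice'").fetchone()[0]
```

Reads stay O(1) on `balances`, while `audit_logs` preserves the full history without replay.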
9. Low-Level Optimizations (The “Boom” Factor)
To squeeze every millisecond out of the system:
- Kernel Tuning:
- Increase TCP buffer sizes (`net.ipv4.tcp_rmem`, `net.ipv4.tcp_wmem`) to handle high-throughput bursts.
- Enable TCP Fast Open (TFO) to reduce handshake latency by 1 RTT.
- Connection Pooling:
- Database connections are expensive. Use PgBouncer to maintain a pool of warm connections, reducing overhead.
- Garbage Collection (GC):
- For the Quote Service (Golang), tune
GOGCto trade memory for CPU. - For the Order Service (Java), use ZGC or Shenandoah for sub-millisecond pause times.
- For the Quote Service (Golang), tune
- Network:
- Place Quote Services in the same Availability Zone (AZ) as the Liquidity Providers if possible (e.g., AWS `us-east-1`).
- Kernel Bypass (DPDK) is likely overkill for 230 QPS, but worth mentioning for HFT systems requiring microsecond latency.
10. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Get Quote (7s) | Quote Service + Redis (TTL 7s) + WebSocket to LPs. |
| Place Order | Order Service with HMAC validation + Idempotency keys. |
| Balance Check | PostgreSQL with SELECT FOR UPDATE (Pessimistic Locking). |
| Reliability (99.999%) | Active-Passive DB Failover + Stateless Services + Kubernetes Auto-healing. |
| Latency (<50ms) | In-memory processing (Redis) + Connection Pooling + Geolocation. |
| Consistency | Database Sharding by user_id allows local ACID transactions. |
| Scalability | Horizontal scaling of services + DB Sharding + Redis Cluster. |
| Security | API Gateway (JWT, Rate Limit) + Private Subnets + mTLS. |
| Compliance | Async KYC pipeline + Audit Logs (Event Sourcing lite). |
11. Observability & Tracing
You cannot fix what you cannot see. For a system moving millions of dollars, we need total visibility.
Click to view Logs & Metrics
11.1 The RED Method (Metrics)
We instrument every service to emit these three golden signals:
- Rate: Request counts per second.
  - Metric: `http_requests_total{service="quote_svc", status="200"}`
  - Use: Detect traffic spikes or DDoS.
- Errors: Failed requests.
  - Metric: `order_failed_total{reason="insufficient_funds"}`
  - Use: Alert if order failures exceed 1% of total traffic.
- Duration: Latency distributions.
  - Metric: `quote_generation_seconds_bucket` (Histogram)
  - Use: Alert if P99 latency > 100ms.
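The counters above can be sketched as a minimal in-process collector. This is illustrative only: a production service would use a real Prometheus client library, and the class and field names here are invented for the sketch.

```python
from collections import defaultdict

class RedMetrics:
    """Minimal in-process RED collector (Rate, Errors, Duration).
    A sketch of the idea only; real services would use a Prometheus
    client library rather than plain dicts."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: counts by (service, status)
        self.errors = defaultdict(int)      # Errors: counts by failure reason
        self.durations = defaultdict(list)  # Duration: raw latency samples

    def observe(self, service, status, seconds, reason=None):
        self.requests[(service, status)] += 1
        if reason or status >= 500:
            self.errors[reason or "http_5xx"] += 1
        self.durations[service].append(seconds)

    def p99(self, service):
        samples = sorted(self.durations[service])
        if not samples:
            return 0.0
        return samples[max(0, int(len(samples) * 0.99) - 1)]

metrics = RedMetrics()
metrics.observe("quote_svc", 200, 0.012)
metrics.observe("order_svc", 500, 0.150, reason="insufficient_funds")
```

An alerting rule would then fire on `metrics.p99("quote_svc") > 0.1` or on the error ratio exceeding 1% of total requests.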
11.2 Distributed Tracing
A single order touches 4 systems: Gateway → Order Service → Redis → DB. If an order is slow, Distributed Tracing tells us exactly where.
- Trace ID: Generated at the Gateway (e.g., `x-trace-id: 12345`). Passed via HTTP headers to every downstream service.
- Spans: Each service logs a “Span” with start/end timestamps.
- Visualization (e.g., Jaeger/Zipkin):
```
[Gateway]   |-------------------------------------------| 205ms
[Order SVC]   |-----------------------------| 180ms
[Redis]       |----| 10ms
[DB Lock]           |-----------| 150ms (Bottleneck!)
```
In this example, the DB Lock took 150ms, indicating database contention.
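The span mechanics can be sketched as follows. `Span` and `handle_order` are hypothetical names for this illustration; a real system would use OpenTelemetry rather than hand-rolled timing.

```python
import time
import uuid

class Span:
    """Records one unit of work within a trace. All spans in one
    request share the same trace_id, so a collector (Jaeger/Zipkin)
    can reassemble the full timeline."""
    def __init__(self, name, trace_id):
        self.name = name
        self.trace_id = trace_id
        self.start = time.monotonic()
        self.duration_ms = None

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000
        return self

def handle_order(trace_id=None):
    # The Gateway generates the trace id if the client didn't send one,
    # then forwards it (e.g., via the x-trace-id header) downstream.
    trace_id = trace_id or uuid.uuid4().hex
    gateway = Span("gateway", trace_id)
    redis = Span("redis_hold", trace_id).finish()
    db = Span("db_lock", trace_id).finish()
    return trace_id, [redis, db, gateway.finish()]

tid, spans = handle_order()
```

Because the trace id is threaded through every call, a slow `db_lock` span immediately points at database contention rather than forcing log archaeology across four services.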
11.3 Structured Logging
Forget plain text logs. Use JSON for machine-readability (ELK Stack).
```json
{
  "level": "INFO",
  "timestamp": "2023-10-27T10:00:00Z",
  "service": "order-service",
  "trace_id": "a1b2c3d4",
  "user_id": "u_999",
  "event": "order_placed",
  "amount": 100.00,
  "currency": "USD"
}
```
- Audit Logs: Separate, immutable logs for compliance. Every balance change must be recorded here and archived to WORM (Write Once Read Many) storage (e.g., S3 Object Lock).
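A minimal JSON formatter in the spirit of the example above might look like this. It is a sketch: the field names follow the sample log, not any fixed standard, and a production service would likely use an established structured-logging library.

```python
import datetime
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emits one JSON object per log line for ELK-style ingestion.
    Extra fields (trace_id, user_id, ...) are attached via the
    standard `extra` mechanism of the logging module."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "service": "order-service",
            "message": record.getMessage(),
        }
        # Merge any structured context the caller attached.
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_placed",
            extra={"extra_fields": {"trace_id": "a1b2c3d4", "user_id": "u_999"}})
```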
11.4 Alerting Strategy
- P1 (Critical - Wake up on-call):
- Order Success Rate < 99.5%.
- Database connection pool saturation > 90%.
- Redis Cluster state “FAIL”.
- P2 (Warning - Ticket for tomorrow):
- Latency P99 > 150ms (SLA breach warning).
- Disk usage > 80%.
12. Deployment & Operations
12.1 Deployment Strategy: Blue/Green
For a financial system, we cannot risk a “bad deploy” corrupting the database.
- Blue (Active): Serving 100% traffic.
- Green (Staging): Deploy new version. Run integration tests.
- Switch: Update the Load Balancer to route 1% traffic to Green (Canary).
- Monitor: Watch for `HTTP 500` or latency spikes.
- Rollout: If safe, route 100% to Green.
- Rollback: If the 1% canary fails, instantly revert the LB to Blue. Users see errors for only a few seconds.
12.2 Database Schema Evolution
- Problem: A naive `ALTER TABLE` (e.g., adding a column with a default) can lock the table and block writes.
- Solution: Expand-Contract Pattern.
  - Expand: Add nullable column `new_col` (zero downtime).
  - Code: Update the app to write to both `old_col` and `new_col`.
  - Backfill: Run a background job to copy data `old_col` → `new_col`.
  - Contract: Update code to read only from `new_col`. Drop `old_col`.
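The four phases can be simulated in a few lines. The rows and column names below are in-memory stand-ins for a real table, kept only to make the dual-write/backfill mechanics concrete.

```python
# In-memory simulation of the Expand-Contract migration phases.
rows = [{"id": 1, "old_col": "BTC"}, {"id": 2, "old_col": "ETH"}]

# Expand: the new column is nullable, so an absent key stands in for NULL.
# Code: new writes go to BOTH columns (dual-write), keeping old and new
# readers correct while the migration is in flight.
def write(row_id, value):
    for row in rows:
        if row["id"] == row_id:
            row["old_col"] = value
            row["new_col"] = value

write(1, "SOL")  # a live write during the migration

# Backfill: a background job copies old -> new where new is still NULL.
for row in rows:
    row.setdefault("new_col", row["old_col"])

# Contract: readers now use new_col only; old_col is safe to drop.
assert [row["new_col"] for row in rows] == ["SOL", "ETH"]
```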
13. Follow-Up Questions: The Interview Gauntlet
This section covers rapid-fire questions to test the depth of your design.
I. Database & Data Consistency (The Core)
- Why PostgreSQL over NewSQL? Sharded Postgres is more mature and sufficient for 1M users. Distributed SQL (CockroachDB) adds consensus latency to writes.
- Handling Hot Shards: If a “Whale” hits 10k TPS, we use Virtual Buckets to migrate that user to a dedicated physical node.
- Isolation Levels: We use READ COMMITTED for performance. SERIALIZABLE prevents race conditions but causes too many transaction aborts/retries in high-concurrency environments.
- Lock Contention: If `SELECT ... FOR UPDATE` hangs, connection pools exhaust. We set `NOWAIT` or short timeouts (e.g., 2s) to fail fast.
- Replication Lag: Users reading from replicas might see old balances. We implement “Sticky Sessions” or force reads from Primary for critical wallet views.
- Schema Migrations: Use `pg_repack` or similar tools to add columns without locking tables.
- Archival Strategy: Move data > 1 year old to S3 (Parquet format) and delete from Postgres to keep indices small.
- Double Booking: Without row locking, two parallel transactions read the same balance, subtract funds, and overwrite each other.
- Database Failover: Postgres Automatic Failover (PAF) takes ~30s. Writes fail during this window; users see errors.
- Data Corruption: External reconciliation (Nightly Jobs) sums all wallet balances vs. total deposits to detect drift.
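The “Double Booking” race and its fix can be illustrated with an in-memory analogue of row locking: a `threading.Lock` stands in for `SELECT ... FOR UPDATE`, serializing the read-check-write sequence.

```python
import threading

# Without the lock, two parallel debits could both read 100, both pass
# the check, and both write, losing one deduction. The lock makes the
# read-check-write sequence atomic, just as a row lock does in Postgres.
balance = {"u_999": 100}
row_lock = threading.Lock()  # in-memory stand-in for SELECT ... FOR UPDATE

def debit(user, amount):
    with row_lock:                            # acquire the "row lock"
        current = balance[user]               # read
        if current >= amount:                 # check
            balance[user] = current - amount  # write
            return True
        return False                          # insufficient funds

threads = [threading.Thread(target=debit, args=("u_999", 60)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one of the two 60-unit debits succeeds: 100 -> 40, never -20.
assert balance["u_999"] == 40
```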
II. Scalability & Performance
- Redis Eviction: `volatile-ttl` ensures we only drop expired quotes, never persistent configs.
- Connection Pooling: Use PgBouncer sidecar. If scaling exceeds DB limits, we must shard further.
- Load Balancing: Least Outstanding Requests handles varying service times better than Round Robin.
- Traveling Users: We pin users to a Home Cell based on residency. A user traveling to Japan connects to the Japan Edge for speed, but the Gateway forwards trade requests to their EU Home Cell for compliance.
- Cross-Region Latency: We accept the latency penalty (e.g., 200ms) for traveling users to guarantee Data Residency. Market data remains fast (local).
- CDN Caching: We generally cannot cache prices as they change every second. WebSocket is preferred.
- Write-Heavy Spikes: During crashes, we implement Queue-based Load Leveling (Kafka) to smooth out DB writes.
- Serialization: JSON is fine for this scale. Protobuf saves bandwidth but adds debugging complexity.
- Autoscaling Triggers: Scale on CPU Usage (>70%) and Request Queue Depth.
- Cache Penetration: Use Bloom Filters to block requests for non-existent symbols (“FAKE-COIN”).
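The cache-penetration defense can be shown with a toy Bloom filter. Sizes and hash counts below are illustrative, not tuned, and a production system would use a library implementation.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: set membership with no false negatives and a
    small false-positive rate. Requests for symbols that were never
    added ("FAKE-COIN") are rejected without touching the database."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive k bit positions by salting a hash (sha256 is just an
        # illustrative stand-in for a fast non-cryptographic hash).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

symbols = BloomFilter()
for s in ("BTC-USD", "ETH-USD"):
    symbols.add(s)

assert symbols.might_contain("BTC-USD")  # added items always pass
# An unknown symbol like "FAKE-COIN" is almost certainly rejected here,
# so it never reaches cache or DB.
```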
III. Reliability & Fault Tolerance
- Redis Persistence: If Redis dies, quotes are lost. This is acceptable; users just request a new quote.
- Circuit Breaker: Threshold based on Error Rate (e.g., >50% failures in 10s).
- Bulkhead Pattern: Isolate thread pools for “Notifications” vs “Orders” so one slow dependency doesn’t crash the app.
- Retry Storms: Add Exponential Backoff and Jitter to client retries.
- Idempotency Storage: If Redis evicts keys, we fallback to a check in the persistent DB (slower but safer).
- Graceful Degradation: If History Service fails, the “Trade” button still works.
- Clock Skew: We use NTP on servers. Tolerance is built into the 7s expiry window.
- Zonal Failures: Deployment across 3 AZs ensures only ~33% capacity loss, which autoscaling covers.
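The retry-storm defense above can be sketched as exponential backoff with “full jitter” (one common variant): each retry waits a random amount between zero and an exponentially growing, capped ceiling, so thousands of clients do not retry in lockstep.

```python
import random

def backoff_delays(attempts, base=0.1, cap=5.0):
    """Return the wait (in seconds) before each retry attempt.
    Attempt n waits uniformly in [0, min(cap, base * 2**n)]."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

delays = backoff_delays(5)
# Average delay grows exponentially, but randomization spreads clients out
# and the cap bounds the worst-case wait.
```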
IV. Architecture & Microservices
- Saga Pattern: We don’t use Sagas for the core trade (too slow). We use local ACID via sharding.
- Service Discovery: Kubernetes (CoreDNS) handles service IP resolution.
- Gateway vs Mesh: Gateway handles Edge concerns (Auth, Rate Limit); Mesh handles inter-service concerns (mTLS, Retries).
- Configuration: Use a dynamic config server (e.g., Consul/Etcd) with watchers to update `expiry_seconds` hot.
- Data Ownership: Order Service cannot access the Wallet Table directly. Must call the Wallet Service API to decouple schemas.
- Event Ordering: Kafka Partition Key = `user_id` ensures events for one user are sequential.
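The partition-key guarantee can be sketched as follows. Note that Kafka's default partitioner actually uses murmur2; the sha256 here is only an illustrative stand-in for any deterministic hash.

```python
import hashlib

def partition_for(user_id: str, num_partitions: int = 12) -> int:
    """Deterministically map a user_id to a partition. Because the
    mapping is stable, every event for one user lands on the same
    partition, and Kafka preserves order within a partition."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# All events for u_999 go to one partition and stay sequential.
p = partition_for("u_999")
assert all(partition_for("u_999") == p for _ in range(100))
```

The flip side is the hot-shard problem mentioned earlier: one very active key concentrates load on one partition.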
V. Security & Compliance
- Insider Trading: Admin actions require Multi-Party Approval and are logged to immutable audit trails.
- API Key Security: Scoped keys (Read-Only vs Trade). Automated rotation.
- DDoS Protection: Rate limiting at the Edge (Cloudflare) + Gateway (Token Bucket).
- Audit Immutability: Write logs to S3 with Object Lock (Governance Mode).
- PII Data: “Crypto-shredding”: Delete the encryption key for a user’s data to effectively “erase” it without modifying immutable logs.
- Internal Auth: mTLS (Mutual TLS) ensures only authorized services can talk to the Wallet Service.
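The Gateway's token-bucket limiter can be sketched in a few lines; capacity and refill rate below are illustrative.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: tokens refill continuously up to a
    fixed capacity, and each request spends one token. Bursts up to
    `capacity` are allowed; sustained rate is `refill_per_sec`."""
    def __init__(self, capacity=10, refill_per_sec=5.0):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# With refill disabled, the bucket allows exactly the initial burst.
bucket = TokenBucket(capacity=3, refill_per_sec=0.0)
results = [bucket.allow() for _ in range(5)]
assert results == [True, True, True, False, False]
```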
VI. Operations & Observability
- Metric Cardinality: Do not tag metrics with `user_id`. Use logs for high-cardinality debugging.
- Distributed Tracing: Inject `x-trace-id` at the Gateway and propagate it everywhere.
- Deployment: Canary Deployment. Roll out v2 to 1% of users, monitor error rates, then expand.
- Chaos Engineering: Randomly kill pods (Chaos Monkey) during staging to test recovery.
- Alert Fatigue: Group related alerts. Use “Symptoms” (User can’t trade) rather than “Causes” (CPU high) for paging.
- Capacity Planning: Linear regression on past 6 months of data to forecast storage/compute needs.
VII. Business Logic & Edge Cases
- Negative Balance: Should be impossible with ACID. If it happens, freeze account and trigger manual investigation.
- Partial Fills: Requires DB schema change (`filled_amount` vs `requested_amount`).
- Market Halted: A global “Kill Switch” in Redis that the Order Service checks before every trade.
- Rounding Errors: Always use Integers (Micros/Satoshis) or BigDecimal. Never `float` or `double`.
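The rounding rule is easy to demonstrate: binary floats drift because most decimal fractions have no exact binary representation, while integer minor units and `Decimal` stay exact. The basis-point fee below is an illustrative number.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly, so sums drift.
assert 0.1 + 0.2 != 0.3

# Integer minor units (satoshis here) make arithmetic exact.
notional_sats = 5_000_000_000          # illustrative trade size in satoshis
fee_sats = notional_sats * 25 // 10_000  # 25 bps fee, pure integer math
assert fee_sats == 12_500_000

# Decimal is the alternative when fractional decimal precision is needed.
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```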
VIII. Advanced Architecture (The 99.999% Club)
- Event Sourcing vs. CRUD: Event sourcing is better for auditability but harder to query. We use a hybrid approach (CRUD for current state, Events for history).
- CQRS (Command Query Responsibility Segregation): Use separate models for Writes (Order Service) and Reads (History Service). This allows scaling reads independently via Read Replicas.
- LMAX Disruptor: A high-performance inter-thread messaging library. Used in HFT to avoid lock contention. Overkill for 500 TPS but good for 500k TPS.
- Kernel Bypass (DPDK/Solarflare): Bypassing the Linux kernel networking stack to write directly to the NIC. Reduces latency from ~10µs to ~1µs.
- Clock Synchronization: NTP is not enough for sub-millisecond precision. Use PTP (Precision Time Protocol) with hardware timestamping.
- Garbage Collection Tuning: For Java, use ZGC to keep pauses < 1ms. For Go, use `GOGC=off` and manual memory management if needed (extreme case).
- False Sharing: CPU cache line contention. Pad data structures to 64 bytes to prevent cores from invalidating each other’s caches.
IX. Failure Modes & Disaster Recovery
- Partial Partition: What if the Order Service can reach Redis but not the DB? Answer: Fail the request safely.
- Zombie Processes: A service that thinks it’s the leader but isn’t. Use Fencing Tokens (epoch numbers) to reject writes from zombies.
- Thundering Herd: If Redis clears, 10k users hit the DB. Use Request Coalescing (Singleflight) to merge identical requests.
- Split Brain: If the cluster partitions, do we accept writes on both sides? Answer: No. Pause writes (CP system) to preserve consistency.
- Corrupted WAL: If Postgres WAL is corrupted, replay from the last snapshot and accept data loss (RPO > 0).
- Region Failure: Failover to DR region. RTO (Recovery Time Objective) ~15 mins. DNS switch.
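Fencing tokens can be sketched with a store that remembers the highest epoch it has seen. Class and method names are illustrative.

```python
class FencedStore:
    """Rejects writes carrying a stale fencing token (epoch). A 'zombie'
    leader that lost its lease still holds an old epoch, so its writes
    are refused once any newer leader has written."""
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch: int, key: str, value) -> bool:
        if epoch < self.highest_epoch:
            return False                 # zombie: token is from an old term
        self.highest_epoch = epoch       # remember the newest term seen
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(epoch=2, key="leader_op", value="ok")          # current leader
assert not store.write(epoch=1, key="leader_op", value="stale")   # zombie rejected
assert store.data["leader_op"] == "ok"
```

The epoch typically comes from the lock service (e.g., a ZooKeeper/etcd lease generation number), so the store never has to know who the leader "really" is.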
X. Market Microstructure
- Slippage: The difference between the quoted price and executed price. In RFQ, the platform absorbs slippage (the “Spread”).
- Order Types:
- FOK (Fill or Kill): Execute fully or not at all.
- IOC (Immediate or Cancel): Execute what you can, cancel the rest.
- GTC (Good Till Cancelled): Standard limit orders (not used in RFQ).
- Spread Capture: The primary revenue model. We buy at 50,000 and sell to users at 50,250.
- Hedging: When a user buys 1 BTC, we immediately buy 1 BTC from an LP to neutralize our inventory risk.
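The spread-capture and hedging points above reduce to back-of-envelope arithmetic, using the illustrative prices from the text:

```python
# RFQ revenue model: quote the user above the LP price (the spread),
# then hedge the fill 1:1 with the LP to neutralize inventory risk.
lp_price = 50_000       # price the LP charges us per BTC
user_price = 50_250     # price we quote the user (spread = 250)
qty_btc = 10            # illustrative fill size

user_pays = user_price * qty_btc       # 502,500
hedge_cost = lp_price * qty_btc        # 500,000
spread_revenue = user_pays - hedge_cost
assert spread_revenue == 2_500         # 250 spread * 10 BTC
```

Because the hedge is executed immediately, the platform's profit is the locked-in spread rather than a bet on price direction; the residual risk is the LP failing to settle.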
14. Summary: The Whiteboard Strategy
If you are asked to design this in 45 minutes, draw this 4-Quadrant Layout:
1. Requirements & Core Math
- Func: Quote (7s), Order, Wallet.
- Non-Func: 99.999%, <50ms Latency, ACID.
- Scale: 230 QPS (Quotes), 12 TPS (Orders).
- Traffic: Read-heavy (20:1).
2. Architecture
- Separation of Concerns: Fast (Quotes) vs Safe (Orders).
- Sharding: By User ID for local ACID.
3. Data & API
4. Trade-offs & Deep Dives
- Concurrency: Pessimistic Locking (`FOR UPDATE`) prevents double-spend.
- Latency: Redis `volatile-ttl` + Connection Pooling + Geo-routing.
- Reliability: Grace Period for network jitter.
- Observability: Distributed Tracing + Audit Logs.