Design a Flash Sale System (Ticketmaster)

1. What is a Flash Sale System?

In 2022, Ticketmaster crashed when millions of fans tried to buy Taylor Swift tickets simultaneously. The problem wasn’t just volume; it was Contention.

A Flash Sale is a Distributed Systems nightmare:

  1. Extreme Contention: 1 Million users want 100 items.
  2. Zero Error Margin: You cannot sell 101 items (Overselling).
  3. Short Duration: The event lasts seconds. Auto-scaling groups don’t have time to boot up.

Tip: Real-World Examples:

  • Ticketmaster: Concert tickets.
  • Amazon Prime Day: Limited deals.
  • Sneaker Drops: Nike SNKRS app.

2. Requirements & Goals

2.1 Functional Requirements

  1. Purchase: Users can buy an item.
  2. Inventory: System must track available stock strictly.
  3. Limits: 1 item per user.

2.2 Non-Functional Requirements

  1. High Consistency: Absolutely NO overselling. (CP over AP).
  2. High Availability: The system must not crash under surge.
  3. Low Latency: Responses must be fast (< 200ms).
  4. Fairness: First come, first served.

2.3 Extended Requirements

  1. Bot Protection: Real humans must get priority over scalper bots.

3. Capacity Estimation

3.1 Traffic Analysis

  • Traffic: 1 Million Active Users at 10:00 AM.
  • QPS: Peak traffic of 500,000 requests per second.
  • Inventory: 100 iPhones.

3.2 The Bottleneck

A standard PostgreSQL database handles roughly 1,000 TPS when every transaction contends for the same row lock. Peak traffic is 500,000 requests per second, a 500x gap.

  • Conclusion: The initial “Buy” action cannot touch the database. The hot path needs an In-Memory solution (Redis).

4. System APIs

Minimal API surface area to reduce attack vectors.

Method | Endpoint              | Description
POST   | /v1/flash-sale/buy    | Attempts to reserve an item. Payload: { itemId: "iphone15", userId: "u1" }
GET    | /v1/flash-sale/status | Checks whether the sale is active or sold out.
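
A quick illustration of calling the buy endpoint from a client, assuming the Python requests library; the host name and the JSON response shape are hypothetical, only the endpoint and payload come from the table above.

import requests

resp = requests.post(
    "https://api.example.com/v1/flash-sale/buy",   # hypothetical host
    json={"itemId": "iphone15", "userId": "u1"},
    timeout=2,
)
print(resp.status_code, resp.json())
# e.g. 200 {"status": "RESERVED"} or 409 {"status": "SOLD_OUT"}  (illustrative response shape)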

5. Database Design

5.1 Redis (Hot State)

  • Counter: item:{id}:stock -> Integer
  • User Set: item:{id}:users -> Set<UserId> (To prevent double booking)
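
A minimal sketch of seeding this hot state before the sale opens, assuming the redis-py client and the key names above:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

ITEM_ID = "iphone15"
r.set(f"item:{ITEM_ID}:stock", 100)    # counter that the buy path will DECR
r.delete(f"item:{ITEM_ID}:users")      # clear any leftover per-user purchase set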

5.2 SQL (Cold State)

  • Table: orders
  • Columns: order_id, user_id, item_id, status.

6. High-Level Architecture

We use a Funnel Architecture to filter traffic at every layer.

System Architecture: Flash Sale Traffic Funnel (multi-layer bot filtering, Redis atomic inventory, async queueing). Traffic shrinks at every stage: high-volume traffic -> filtered requests -> valid orders.

  • Zone 1, The Surge (1M+ users): CDN / WAF / Bot Protection. Obvious bots are rejected at the edge.
  • Zone 2, Rate Limiting (~100k req/s): Token Bucket rate limiter and auth validation; roughly 90% of requests are shed here.
  • Zone 3, Atomic Lock (~10k req/s): Redis cache holding the stock counter (Stock: 100); a Lua script performs the check-and-DECR atomically, and requests for sold-out items are rejected.
  • Zone 4, Persistence: Kafka -> SQL. Exactly 100 orders are created.

7. Component Design (Deep Dive)

7.1 Redis Atomic Counter (Lua)

Why Lua? Redis commands (GET, DECR) are atomic individually, but a sequence of them is not: a race condition can occur between checking the stock and decrementing it. A Lua script, however, executes atomically on the Redis server; no other command can run while the script is executing.

-- KEYS[1] = inventory_key ("item:ps5:stock")
-- KEYS[2] = user_history_key ("item:ps5:users")
-- ARGV[1] = user_id

local stock = tonumber(redis.call("GET", KEYS[1]))
if not stock or stock <= 0 then
    return -1 -- Sold Out (or stock key missing / sale not initialized)
end

if redis.call("SISMEMBER", KEYS[2], ARGV[1]) == 1 then
    return -2 -- Already Bought
end

redis.call("DECR", KEYS[1])
redis.call("SADD", KEYS[2], ARGV[1])
return 1 -- Success
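
A minimal sketch of invoking the script atomically from the application server, assuming the redis-py client; "buy.lua" is simply the script above saved to a file.

import redis

r = redis.Redis(decode_responses=True)
buy = r.register_script(open("buy.lua").read())   # SCRIPT LOAD once, EVALSHA on later calls

result = buy(keys=["item:iphone15:stock", "item:iphone15:users"], args=["user_A"])
# 1 = reserved, -1 = sold out, -2 = this user already bought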

7.2 Queue-Based Load Leveling

Once Redis returns 1 (Success), we don’t write to SQL immediately. Instead, we push a message to Kafka:

{ "user_id": "u1", "item_id": "ps5", "timestamp": 1620000000 }

A separate worker process reads from Kafka and inserts into the SQL DB at a safe pace (e.g., 500 inserts/sec).
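
A sketch of that async order-writer, assuming kafka-python and psycopg2 (both are assumptions, not prescribed by the design). It consumes the payload above and writes to the orders table from Section 5.2, generating the order_id at write time.

import json
import time
import uuid

import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders-stream",                      # topic name taken from the walkthrough in Section 10
    bootstrap_servers="localhost:9092",
    group_id="order-writers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
conn = psycopg2.connect("dbname=shop")    # hypothetical DSN

for msg in consumer:
    event = msg.value                     # {"user_id": ..., "item_id": ..., "timestamp": ...}
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (order_id, user_id, item_id, status) VALUES (%s, %s, %s, %s)",
            (str(uuid.uuid4()), event["user_id"], event["item_id"], "PENDING_PAYMENT"),
        )
    time.sleep(0.002)                     # crude throttle: ~500 inserts/sec per worker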


8. Reliability, Caching, & Load Balancing

8.1 Preventing Overselling

  • Pessimistic Locking (SQL): SELECT ... FOR UPDATE. Safe but slow.
  • Optimistic Locking (SQL): UPDATE ... WHERE version = 1. Fast but causes retry storms.
  • Redis Atomic: Best for high contention.
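
For contrast, a minimal sketch of the two SQL approaches, assuming psycopg2 and a hypothetical inventory(item_id, stock) table; neither sits on the hot path in this design.

import psycopg2

conn = psycopg2.connect("dbname=shop")   # hypothetical DSN

def buy_pessimistic(item_id: str) -> bool:
    # Pessimistic: lock the row, check, then decrement. Correct, but serializes every buyer.
    with conn, conn.cursor() as cur:
        cur.execute("SELECT stock FROM inventory WHERE item_id = %s FOR UPDATE", (item_id,))
        row = cur.fetchone()
        if row is None or row[0] <= 0:
            return False
        cur.execute("UPDATE inventory SET stock = stock - 1 WHERE item_id = %s", (item_id,))
        return True

def buy_conditional(item_id: str) -> bool:
    # Optimistic / conditional update: one guarded UPDATE; rowcount tells us whether we won.
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE inventory SET stock = stock - 1 WHERE item_id = %s AND stock > 0",
            (item_id,),
        )
        return cur.rowcount == 1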

8.2 Redis Persistence

If Redis crashes, we lose the stock count.

  • AOF (Append Only File): We must set appendfsync always or everysec to minimize data loss.
  • Trade-off: always is slow (disk I/O). everysec risks losing 1 second of sales (overselling 1 sec worth of inventory). For Ticketmaster, we usually accept everysec and have a small “buffer” inventory or handle refunds.

9. Interactive Decision Visualizer

Compare Pessimistic Locking (SQL) vs. Redis Atomic Decrement.

  • SQL Lock: Traffic jams. High Latency.
  • Redis: High-speed highway. Low Latency.

(Flash Sale Simulator: select a strategy and watch Inventory Left, Avg Latency, and Items Sold update as the 100 units sell out.)

10. System Walkthrough

Let’s trace a successful purchase request during the peak load.

Scenario: User A Buys an iPhone 15

  1. User Action: User A clicks “Buy Now” at 10:00:01 AM.
  2. Edge Layer (CDN/WAF): Checks IP reputation. Passes request.
  3. API Gateway: Token Bucket Rate Limiter allows the request.
  4. Application Server: Executes Redis Lua Script.
    • Input: KEYS=["item:iphone15:stock"], ARGV=["user_A"].
    • Logic: Stock is 50. User not in set. DECR to 49. SADD user.
    • Output: 1 (Success).
  5. Event Publishing: App Server pushes message to Kafka Topic orders-stream.
    { "order_id": "uuid-123", "user_id": "user_A", "item": "iphone15", "status": "PENDING_PAYMENT" }
    
  6. Response: App Server returns 200 OK to User A. “Reservation Confirmed! Redirecting to Payment…”.
  7. Async Worker: Pulls from Kafka, inserts row into SQL orders table.
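
To make the walkthrough concrete, here is a hedged sketch of the application-server handler covering steps 4 through 6, assuming Flask, redis-py, and kafka-python; buy.lua, the host names, and the HTTP status codes are illustrative assumptions.

import json
import time
import uuid

import redis
from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)
r = redis.Redis(decode_responses=True)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
buy_script = r.register_script(open("buy.lua").read())   # the reservation script from Section 7.1

@app.route("/v1/flash-sale/buy", methods=["POST"])
def buy():
    body = request.get_json()
    item_id, user_id = body["itemId"], body["userId"]
    result = buy_script(
        keys=[f"item:{item_id}:stock", f"item:{item_id}:users"], args=[user_id]
    )
    if result == -1:
        return jsonify({"status": "SOLD_OUT"}), 409          # status codes are illustrative
    if result == -2:
        return jsonify({"status": "ALREADY_PURCHASED"}), 409
    producer.send("orders-stream", {
        "order_id": str(uuid.uuid4()),
        "user_id": user_id,
        "item": item_id,
        "status": "PENDING_PAYMENT",
        "timestamp": int(time.time()),
    })
    return jsonify({"status": "RESERVED"}), 200              # "Reservation Confirmed!"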

11. Low-Level Optimizations

11.1 Redis Lazy Free

Deleting a large key (e.g., the Set of 1 Million users after the sale ends) is a blocking operation.

  • Problem: DEL item:iphone:users can block Redis for 2 seconds.
  • Solution: Use UNLINK (non-blocking delete) or configure lazyfree-lazy-user-del yes. This moves the memory reclamation to a background thread.
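
A one-line sketch of that cleanup from the application side, assuming redis-py; UNLINK returns immediately and reclaims the memory on a background thread.

import redis

r = redis.Redis()
r.unlink("item:iphone:users")   # non-blocking delete of the 1M-member set after the sale ends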

11.2 False Sharing & Padding

In highly concurrent in-memory systems (for example, a custom C++ engine written instead of using Redis), False Sharing occurs when two atomic counters sit on the same CPU Cache Line (64 bytes).

  • Scenario: Core A updates Stock_Item_1. Core B updates Stock_Item_2.
  • Issue: If both variables share a cache line, Cores invalidate each other’s L1 cache, killing performance.
  • Solution: Cache Line Padding. Add dummy bytes between variables to ensure they sit on different cache lines.

12. Interview Gauntlet

Q1: How do you prevent bots from buying everything?

  • Answer: Multi-layered approach. 1. CAPTCHA at checkout. 2. WAF (IP rate limits). 3. Post-purchase analysis (cancel bulk orders shipping to the same address). 4. A “Waiting Room” queue.

Q2: Why not use a Database with Row Locking?

  • Answer: Row locking serializes requests. Postgres can handle ~1k TPS on a single hot row, while peak demand is in the hundreds of thousands of requests per second. Serializing that backlog would cause massive timeouts.

Q3: What if the user reserves an item but doesn’t pay?

  • Answer: The reservation has a TTL (e.g., 5 mins). We use a Delayed Queue (or Redis Keyspace Notification). If payment is not received in 5 mins, we increment the stock back (INCR).
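
A hedged sketch of that expiry path using a Redis sorted set as the delayed queue, assuming redis-py; the key names are illustrative, and the 5-minute TTL follows the answer above.

import time
import redis

r = redis.Redis(decode_responses=True)
PENDING = "item:iphone15:pending"   # ZSET: member = user_id, score = payment deadline
STOCK = "item:iphone15:stock"
USERS = "item:iphone15:users"

def reserve(user_id: str, ttl_seconds: int = 300) -> None:
    # Called right after the Lua script succeeds: record when the payment window closes.
    r.zadd(PENDING, {user_id: time.time() + ttl_seconds})

def reap_expired() -> None:
    # Periodic worker: return stock for reservations whose deadline has passed unpaid.
    for user_id in r.zrangebyscore(PENDING, 0, time.time()):
        if r.zrem(PENDING, user_id):    # only one worker wins the removal
            r.incr(STOCK)               # put the unit back into inventory
            r.srem(USERS, user_id)      # optional: allow that user to try again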

Q4: How do you test this system?

  • Answer: Distributed Load Testing (k6 / JMeter) running on multiple cloud instances to simulate 500k concurrent users. We verify “Sold Count” == “Inventory” exactly.

Q5: Can we use a Distributed Lock (Redlock)?

  • Answer: Redlock is too slow for high-frequency counting. It requires multiple network round trips to acquire/release. Atomic Lua scripts are much faster (single RTT).

13. Summary: The Whiteboard Strategy

1. Requirements

  • Func: Buy, Inventory.
  • Non-Func: No Overselling, High Load.

2. Architecture

[Client] -> [Rate Limit] -> [Redis Lua] --(async)--> [Kafka] -> [SQL]

* Redis: Hot State (Stock).
* SQL: Cold State (Orders).

3. Data & API

POST /buy -> { itemId, userId }
Redis: DECR item:stock
SQL: Orders Table

4. Deep Dives

  • Lua: Ensures atomicity.
  • Funnel: Filter traffic early.