Design a Flash Sale System (Ticketmaster)
1. What is a Flash Sale System?
In 2022, Ticketmaster crashed when millions of fans tried to buy Taylor Swift tickets simultaneously. The problem wasn’t just volume; it was Contention.
A Flash Sale is a Distributed Systems nightmare:
- Extreme Contention: 1 Million users want 100 items.
- Zero Error Margin: You cannot sell 101 items (Overselling).
- Short Duration: The event lasts seconds. Auto-scaling groups don’t have time to boot up.
> [!TIP]
> Real-World Examples:
> - Ticketmaster: Concert tickets.
> - Amazon Prime Day: Limited deals.
> - Sneaker Drops: Nike SNKRS app.
2. Requirements & Goals
2.1 Functional Requirements
- Purchase: Users can buy an item.
- Inventory: System must track available stock strictly.
- Limits: 1 item per user.
2.2 Non-Functional Requirements
- High Consistency: Absolutely NO overselling. (CP over AP).
- High Availability: The system must not crash under surge.
- Low Latency: Responses must be fast (< 200ms).
- Fairness: First come, first served.
2.3 Extended Requirements
- Bot Protection: Real humans must get priority over scalper bots.
3. Capacity Estimation
3.1 Traffic Analysis
- Traffic: 1 Million Active Users at 10:00 AM.
- QPS: Peak traffic 500,000 requests per second.
- Inventory: 100 iPhones.
3.2 The Bottleneck
A standard PostgreSQL database handles ~1,000 TPS on a single row lock. We need 500,000 TPS.
- Conclusion: We cannot touch the database for the initial “Buy” action. We need an In-Memory solution (Redis).
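A quick sanity check in plain Python (numbers from the estimates above) shows how lopsided this is:

```python
# Back-of-envelope check using the estimates above.
peak_qps = 500_000        # peak requests per second
db_tps = 1_000            # rough single-row-lock throughput of PostgreSQL
inventory = 100           # items for sale
users = 1_000_000         # active buyers

print(f"DB shortfall: {peak_qps // db_tps}x too slow")       # 500x
print(f"Win rate: {inventory / users:.4%} of users succeed")  # 0.0100%
```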
4. System APIs
Minimal API surface area to reduce attack vectors.
| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/flash-sale/buy` | Attempts to reserve an item. Payload: `{ itemId: "iphone15", userId: "u1" }` |
| GET | `/v1/flash-sale/status` | Checks if sale is active/sold out. |
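A minimal sketch of this surface, assuming FastAPI as the web framework; `try_reserve` is a hypothetical stand-in for the Redis logic developed in Section 7.1:

```python
# Minimal API sketch (framework choice is an assumption, not part of the design).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class BuyRequest(BaseModel):
    itemId: str
    userId: str

def try_reserve(item_id: str, user_id: str) -> int:
    # Placeholder: wire this to the Lua script in Section 7.1.
    # Returns 1 (success), -1 (sold out), or -2 (already bought).
    return -1

@app.post("/v1/flash-sale/buy")
async def buy(req: BuyRequest):
    result = try_reserve(req.itemId, req.userId)
    if result == 1:
        return {"status": "RESERVED"}
    if result == -2:
        return {"status": "ALREADY_BOUGHT"}
    return {"status": "SOLD_OUT"}

@app.get("/v1/flash-sale/status")
async def status(itemId: str):
    # Sold out when the Redis counter (Section 5.1) reaches zero; stubbed here.
    return {"active": False}
```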
5. Database Design
5.1 Redis (Hot State)
- Counter: `item:{id}:stock` -> Integer
- User Set: `item:{id}:users` -> Set<UserId> (to prevent double booking)
5.2 SQL (Cold State)
- Table: `orders`
- Columns: `order_id`, `user_id`, `item_id`, `status` (setup sketch below).
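To make the split concrete, here is a setup sketch assuming redis-py for the hot state, with SQLite standing in for the cold-state SQL database:

```python
# Hot state (Redis, assuming redis-py) and cold state (SQLite as a stand-in).
import sqlite3
import redis

r = redis.Redis()

# 5.1 -- seed the stock counter and reset the per-item buyer set.
r.set("item:iphone15:stock", 100)
r.delete("item:iphone15:users")

# 5.2 -- the orders table.
db = sqlite3.connect("orders.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id TEXT PRIMARY KEY,
        user_id  TEXT NOT NULL,
        item_id  TEXT NOT NULL,
        status   TEXT NOT NULL
    )
""")
db.commit()
```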
6. High-Level Architecture
We use a Funnel Architecture to filter traffic at every layer.
```
1M Requests -> [CDN/WAF] -> [Gateway] -> [App Servers]
            -> [Redis: Stock 100, DECR stock] -> (Async) -> [Kafka]
            -> [Worker] -> [SQL: 100 Orders Created]
```
7. Component Design (Deep Dive)
7.1 Redis Atomic Counter (Lua)
Why Lua? Redis commands (GET, DECR) are atomic individually, but a sequence of them is not. A race condition can occur between checking the stock and decrementing it.
Lua Script executes atomically on the Redis server. No other command can run while the script is executing.
```lua
-- KEYS[1] = inventory_key ("item:ps5:stock")
-- KEYS[2] = user_history_key ("item:ps5:users")
-- ARGV[1] = user_id
local stock = tonumber(redis.call("GET", KEYS[1]))
if not stock or stock <= 0 then
  return -1 -- Sold Out (or key missing)
end
if redis.call("SISMEMBER", KEYS[2], ARGV[1]) == 1 then
  return -2 -- Already Bought
end
redis.call("DECR", KEYS[1])
redis.call("SADD", KEYS[2], ARGV[1])
return 1 -- Success
```
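Invoking the script from an application server might look like this (a sketch assuming redis-py; the script above is saved as `reserve.lua`):

```python
# Register and call the Lua script atomically (a sketch, assuming redis-py).
import redis

r = redis.Redis()
with open("reserve.lua") as f:       # the script above, saved to a file
    reserve = r.register_script(f.read())

result = reserve(keys=["item:ps5:stock", "item:ps5:users"], args=["u1"])
if result == 1:
    print("Reserved")                # next: publish to Kafka (Section 7.2)
elif result == -1:
    print("Sold out")
else:                                # result == -2
    print("Already bought")
```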
7.2 Queue-Based Load Leveling
Once Redis returns 1 (Success), we don’t write to SQL immediately. We push a message to Kafka:
{ "user_id": "u1", "item_id": "ps5", "timestamp": 1620000000 }
A separate worker process reads from Kafka and inserts into the SQL DB at a safe pace (e.g., 500 inserts/sec).
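A sketch of that worker, assuming kafka-python and the SQLite stand-in from Section 5; the pacing value is illustrative:

```python
# Async worker: drain Kafka into SQL at a safe pace (sketch, assuming
# kafka-python and the SQLite stand-in from Section 5).
import json
import sqlite3
import time
import uuid

from kafka import KafkaConsumer

consumer = KafkaConsumer("orders-stream",
                         value_deserializer=lambda b: json.loads(b))
db = sqlite3.connect("orders.db")

for msg in consumer:
    order = msg.value  # {"user_id": ..., "item_id": ..., "timestamp": ...}
    db.execute(
        "INSERT INTO orders (order_id, user_id, item_id, status) "
        "VALUES (?, ?, ?, ?)",
        (str(uuid.uuid4()), order["user_id"], order["item_id"],
         "PENDING_PAYMENT"),
    )
    db.commit()
    time.sleep(0.002)  # crude throttle: ~500 inserts/sec, per the target above
```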
8. Reliability, Caching, & Load Balancing
8.1 Preventing Overselling
- Pessimistic Locking (SQL): `SELECT ... FOR UPDATE`. Safe but slow.
- Optimistic Locking (SQL): `UPDATE ... WHERE version = 1`. Fast but causes retry storms (see the sketch below).
- Redis Atomic: Best for high contention.
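To see why optimistic locking melts down under contention, here is a rough sketch of the retry loop, assuming a DB-API connection and a hypothetical `inventory(item_id, stock, version)` table:

```python
# Optimistic locking: fast per attempt, but under heavy contention most
# attempts lose the version race and must retry (sketch; `db` is a DB-API
# connection, and inventory(item_id, stock, version) is a hypothetical table).
def optimistic_buy(db, item_id: str, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        stock, version = db.execute(
            "SELECT stock, version FROM inventory WHERE item_id = ?",
            (item_id,),
        ).fetchone()
        if stock <= 0:
            return False  # sold out
        cur = db.execute(
            "UPDATE inventory SET stock = stock - 1, version = version + 1 "
            "WHERE item_id = ? AND version = ?",
            (item_id, version),
        )
        if cur.rowcount == 1:  # we won the race
            db.commit()
            return True
        # rowcount == 0: someone else bumped the version first -- retry
    return False  # gave up: this is the "retry storm" at 500k QPS
```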
8.2 Redis Persistence
If Redis crashes, we lose the stock count.
- AOF (Append Only File): We must set `appendfsync always` or `everysec` to minimize data loss.
- Trade-off: `always` is slow (disk I/O). `everysec` risks losing 1 second of sales (overselling 1 second's worth of inventory). For Ticketmaster, we usually accept `everysec` and keep a small "buffer" inventory or handle refunds (config sketch below).
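These settings normally live in `redis.conf`, but a sketch using redis-py's `CONFIG SET` shows the knobs involved:

```python
# Turning on AOF with per-second fsync at runtime (sketch, assuming redis-py;
# production deployments set this in redis.conf instead).
import redis

r = redis.Redis()
r.config_set("appendonly", "yes")        # enable the append-only file
r.config_set("appendfsync", "everysec")  # fsync once per second: at most ~1s
                                         # of acknowledged sales lost on crash
```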
9. Decision Comparison
Compare Pessimistic Locking (SQL) vs. Redis Atomic Decrement.
- SQL Lock: Traffic jams. High Latency.
- Redis: High-speed highway. Low Latency.
10. System Walkthrough
Let’s trace a successful purchase request during the peak load.
Scenario: User A Buys an iPhone 15
- User Action: User A clicks “Buy Now” at 10:00:01 AM.
- Edge Layer (CDN/WAF): Checks IP reputation. Passes request.
- API Gateway: Token Bucket Rate Limiter allows the request.
- Application Server: Executes Redis Lua Script.
  - Input: `KEYS = ["item:iphone15:stock", "item:iphone15:users"]`, `ARGV = ["user_A"]`.
  - Logic: Stock is 50. User not in set. `DECR` to 49. `SADD` user.
  - Output: `1` (Success).
- Event Publishing: App Server pushes a message to Kafka topic `orders-stream`: `{ "order_id": "uuid-123", "user_id": "user_A", "item": "iphone15", "status": "PENDING_PAYMENT" }`
- Response: App Server returns `200 OK` to User A: "Reservation Confirmed! Redirecting to Payment...".
- Async Worker: Pulls from Kafka, inserts the row into the SQL `orders` table. The whole hot path is sketched below.
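Steps 4-6 collapse into a few lines of application code. A sketch, assuming redis-py and kafka-python, with the `reserve.lua` script from Section 7.1:

```python
# App-server hot path: atomic Redis check, then fire-and-forget to Kafka.
import json
import uuid

import redis
from kafka import KafkaProducer

r = redis.Redis()
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())
with open("reserve.lua") as f:
    reserve = r.register_script(f.read())

def buy(item_id: str, user_id: str) -> dict:
    result = reserve(keys=[f"item:{item_id}:stock", f"item:{item_id}:users"],
                     args=[user_id])
    if result != 1:
        return {"status": "SOLD_OUT" if result == -1 else "ALREADY_BOUGHT"}
    producer.send("orders-stream", {
        "order_id": str(uuid.uuid4()),
        "user_id": user_id,
        "item": item_id,
        "status": "PENDING_PAYMENT",
    })  # the SQL insert happens later, via the async worker
    return {"status": "RESERVED"}  # "Redirecting to Payment..."
```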
11. Low-Level Optimizations
11.1 Redis Lazy Free
Deleting a large key (e.g., the Set of 1 Million users after the sale ends) is a blocking operation.
- Problem: `DEL item:iphone:users` can block Redis for 2 seconds.
- Solution: Use `UNLINK` (non-blocking delete) or configure `lazyfree-lazy-user-del yes`. This moves memory reclamation to a background thread, as shown below.
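With redis-py this is a one-liner (`UNLINK` requires Redis 4.0+):

```python
# Non-blocking delete (sketch, assuming redis-py and Redis >= 4.0).
import redis

r = redis.Redis()
r.unlink("item:iphone15:users")  # returns immediately; memory is reclaimed
                                 # by a background thread, unlike DEL
```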
11.2 False Sharing & Padding
In highly concurrent in-memory systems (like if you wrote a custom C++ engine instead of Redis), False Sharing occurs when two atomic counters sit on the same CPU Cache Line (64 bytes).
- Scenario: Core A updates `Stock_Item_1`. Core B updates `Stock_Item_2`.
- Issue: If both variables share a cache line, the cores invalidate each other's L1 caches, killing performance.
- Solution: Cache Line Padding. Add dummy bytes between variables to ensure they sit on different cache lines.
12. Interview Gauntlet
Q1: How do you prevent bots from buying everything?
- Answer: Multi-layered approach. 1. CAPTCHA at checkout. 2. WAF (IP rate limits). 3. Post-analysis (cancel orders delivering to the same address). 4. “Waiting Room” queue.
Q2: Why not use a Database with Row Locking?
- Answer: Row locking serializes requests. Postgres can handle ~1k TPS on a single row; we need ~500k TPS at peak (Section 3). Serializing that many requests would cause massive timeouts.
Q3: What if the user reserves an item but doesn’t pay?
- Answer: The reservation has a TTL (e.g., 5 mins). We use a Delayed Queue (or Redis Keyspace Notifications). If payment is not received in 5 minutes, we increment the stock back (`INCR`), as sketched below.
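Whatever fires the expiry, the release step itself is simple. A sketch assuming redis-py; the trigger (delayed queue or keyspace notification) is left out:

```python
# Return an expired, unpaid reservation to stock (sketch, assuming redis-py;
# MULTI/EXEC keeps the two writes atomic).
import redis

r = redis.Redis()

def release_reservation(item_id: str, user_id: str) -> None:
    pipe = r.pipeline(transaction=True)
    pipe.incr(f"item:{item_id}:stock")           # put the unit back
    pipe.srem(f"item:{item_id}:users", user_id)  # allow the user to retry
    pipe.execute()
```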
Q4: How do you test this system?
- Answer: Distributed Load Testing (k6 / JMeter) running on multiple cloud instances to simulate 500k concurrent users. We verify “Sold Count” == “Inventory” exactly.
Q5: Can we use a Distributed Lock (Redlock)?
- Answer: Redlock is too slow for high-frequency counting. It requires multiple network round trips to acquire/release. Atomic Lua scripts are much faster (single RTT).
13. Summary: The Whiteboard Strategy
1. Requirements
- Func: Buy, Inventory.
- Non-Func: No Overselling, High Load.
2. Architecture
```
[Client] -> [Gateway] -> [App] -> [Redis] -(Async)-> [Kafka] -> [SQL]
```
- Redis: Hot State (Stock).
- SQL: Cold State (Orders).
3. Data & API
- Redis: `DECR item:stock`
- SQL: Orders Table
4. Deep Dives
- Lua: Ensures atomicity.
- Funnel: Filter traffic early.