Design a Flash Sale System (Ticketmaster)
1. What is a Flash Sale System?
In 2022, Ticketmaster crashed when millions of fans tried to buy Taylor Swift tickets simultaneously. The problem wasn’t just volume; it was Contention.
A Flash Sale is a Distributed Systems nightmare:
- Extreme Contention: 1 Million users want 100 items.
- Zero Error Margin: You cannot sell 101 items (Overselling).
- Short Duration: The event lasts seconds. Auto-scaling groups don’t have time to boot up.
> [!TIP]
> Real-World Examples:
> - Ticketmaster: Concert tickets.
> - Amazon Prime Day: Limited deals.
> - Sneaker Drops: Nike SNKRS app.
2. Requirements & Goals
2.1 Functional Requirements
- Purchase: Users can buy an item.
- Inventory: System must track available stock strictly.
- Limits: 1 item per user.
2.2 Non-Functional Requirements
- High Consistency: Absolutely NO overselling. (CP over AP).
- High Availability: The system must not crash under surge.
- Low Latency: Responses must be fast (< 200ms).
- Fairness: First come, first served.
2.3 Extended Requirements
- Bot Protection: Real humans must get priority over scalper bots.
3. Capacity Estimation
3.1 Traffic Analysis
- Traffic: 1 Million Active Users at 10:00 AM.
- QPS: Peak traffic 500,000 requests per second.
- Inventory: 100 iPhones.
3.2 The Bottleneck
A standard PostgreSQL database handles ~1,000 TPS on a single row lock. We need 500,000 TPS.
- Conclusion: We cannot touch the database for the initial “Buy” action. We need an In-Memory solution (Redis).
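A quick sanity check in plain Python (numbers from the estimates above) shows how lopsided this is:

```python
# Back-of-envelope check using the estimates above.
peak_qps = 500_000        # peak requests per second
db_tps = 1_000            # rough single-row-lock throughput of PostgreSQL
inventory = 100           # items for sale
users = 1_000_000         # active buyers

print(f"DB shortfall: {peak_qps // db_tps}x too slow")       # 500x
print(f"Win rate: {inventory / users:.4%} of users succeed")  # 0.0100%
```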
4. System APIs
Minimal API surface area to reduce attack vectors.
| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/flash-sale/buy` | Attempts to reserve an item. Payload: `{ itemId: "iphone15", userId: "u1" }` |
| GET | `/v1/flash-sale/status` | Checks if sale is active/sold out. |
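A minimal sketch of this surface, assuming FastAPI as the web framework; `try_reserve` is a hypothetical stand-in for the Redis logic developed in Section 7.1:

```python
# Minimal API sketch (framework choice is an assumption, not part of the design).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class BuyRequest(BaseModel):
    itemId: str
    userId: str

def try_reserve(item_id: str, user_id: str) -> int:
    # Placeholder: wire this to the Lua script in Section 7.1.
    # Returns 1 (success), -1 (sold out), or -2 (already bought).
    return -1

@app.post("/v1/flash-sale/buy")
async def buy(req: BuyRequest):
    result = try_reserve(req.itemId, req.userId)
    if result == 1:
        return {"status": "RESERVED"}
    if result == -2:
        return {"status": "ALREADY_BOUGHT"}
    return {"status": "SOLD_OUT"}

@app.get("/v1/flash-sale/status")
async def status(itemId: str):
    # Sold out when the Redis counter (Section 5.1) reaches zero; stubbed here.
    return {"active": False}
```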
5. Database Design
5.1 Redis (Hot State)
- Counter: `item:{id}:stock` -> Integer
- User Set: `item:{id}:users` -> Set<UserId> (to prevent double booking)
5.2 SQL (Cold State)
- Table: `orders`
- Columns: `order_id`, `user_id`, `item_id`, `status` (setup sketch below).
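To make the split concrete, here is a setup sketch assuming redis-py for the hot state, with SQLite standing in for the cold-state SQL database:

```python
# Hot state (Redis, assuming redis-py) and cold state (SQLite as a stand-in).
import sqlite3
import redis

r = redis.Redis()

# 5.1 -- seed the stock counter and reset the per-item buyer set.
r.set("item:iphone15:stock", 100)
r.delete("item:iphone15:users")

# 5.2 -- the orders table.
db = sqlite3.connect("orders.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id TEXT PRIMARY KEY,
        user_id  TEXT NOT NULL,
        item_id  TEXT NOT NULL,
        status   TEXT NOT NULL
    )
""")
db.commit()
```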
6. High-Level Architecture
We use a Funnel Architecture to filter traffic at every layer.
```
1M Requests -> [CDN/WAF] -> [Gateway] -> [App Servers]
            -> [Redis: Stock 100, DECR stock] -> (Async) -> [Kafka]
            -> [Worker] -> [SQL: 100 Orders Created]
```
7. Component Design (Deep Dive)
7.1 Redis Atomic Counter (Lua)
Why Lua? Redis commands (GET, DECR) are atomic individually, but a sequence of them is not. A race condition can occur between checking the stock and decrementing it.
Lua Script executes atomically on the Redis server. No other command can run while the script is executing.
```lua
-- KEYS[1] = inventory_key ("item:ps5:stock")
-- KEYS[2] = user_history_key ("item:ps5:users")
-- ARGV[1] = user_id
local stock = tonumber(redis.call("GET", KEYS[1]))
if not stock or stock <= 0 then
  return -1 -- Sold Out (or key missing)
end
if redis.call("SISMEMBER", KEYS[2], ARGV[1]) == 1 then
  return -2 -- Already Bought
end
redis.call("DECR", KEYS[1])
redis.call("SADD", KEYS[2], ARGV[1])
return 1 -- Success
```
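Invoking the script from an application server might look like this (a sketch assuming redis-py; the script above is saved as `reserve.lua`):

```python
# Register and call the Lua script atomically (a sketch, assuming redis-py).
import redis

r = redis.Redis()
with open("reserve.lua") as f:       # the script above, saved to a file
    reserve = r.register_script(f.read())

result = reserve(keys=["item:ps5:stock", "item:ps5:users"], args=["u1"])
if result == 1:
    print("Reserved")                # next: publish to Kafka (Section 7.2)
elif result == -1:
    print("Sold out")
else:                                # result == -2
    print("Already bought")
```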
7.2 Queue-Based Load Leveling
Once Redis returns 1 (Success), we don’t write to SQL immediately. We push a message to Kafka:
{ "user_id": "u1", "item_id": "ps5", "timestamp": 1620000000 }
A separate worker process reads from Kafka and inserts into the SQL DB at a safe pace (e.g., 500 inserts/sec).
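A sketch of that worker, assuming kafka-python and the SQLite stand-in from Section 5; the pacing value is illustrative:

```python
# Async worker: drain Kafka into SQL at a safe pace (sketch, assuming
# kafka-python and the SQLite stand-in from Section 5).
import json
import sqlite3
import time
import uuid

from kafka import KafkaConsumer

consumer = KafkaConsumer("orders-stream",
                         value_deserializer=lambda b: json.loads(b))
db = sqlite3.connect("orders.db")

for msg in consumer:
    order = msg.value  # {"user_id": ..., "item_id": ..., "timestamp": ...}
    db.execute(
        "INSERT INTO orders (order_id, user_id, item_id, status) "
        "VALUES (?, ?, ?, ?)",
        (str(uuid.uuid4()), order["user_id"], order["item_id"],
         "PENDING_PAYMENT"),
    )
    db.commit()
    time.sleep(0.002)  # crude throttle: ~500 inserts/sec, per the target above
```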
8. Reliability, Caching, & Load Balancing
8.1 Preventing Overselling
- Pessimistic Locking (SQL): `SELECT ... FOR UPDATE`. Safe but slow.
- Optimistic Locking (SQL): `UPDATE ... WHERE version = 1`. Fast but causes retry storms (see the sketch below).
- Redis Atomic: Best for high contention.
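To see why optimistic locking melts down under contention, here is a rough sketch of the retry loop, assuming a DB-API connection and a hypothetical `inventory(item_id, stock, version)` table:

```python
# Optimistic locking: fast per attempt, but under heavy contention most
# attempts lose the version race and must retry (sketch; `db` is a DB-API
# connection, and inventory(item_id, stock, version) is a hypothetical table).
def optimistic_buy(db, item_id: str, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        stock, version = db.execute(
            "SELECT stock, version FROM inventory WHERE item_id = ?",
            (item_id,),
        ).fetchone()
        if stock <= 0:
            return False  # sold out
        cur = db.execute(
            "UPDATE inventory SET stock = stock - 1, version = version + 1 "
            "WHERE item_id = ? AND version = ?",
            (item_id, version),
        )
        if cur.rowcount == 1:  # we won the race
            db.commit()
            return True
        # rowcount == 0: someone else bumped the version first -- retry
    return False  # gave up: this is the "retry storm" at 500k QPS
```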
8.2 Redis Persistence
If Redis crashes, we lose the stock count.
- AOF (Append Only File): We must set `appendfsync always` or `everysec` to minimize data loss.
- Trade-off: `always` is slow (disk I/O). `everysec` risks losing 1 second of sales (overselling 1 second's worth of inventory). For Ticketmaster, we usually accept `everysec` and keep a small "buffer" inventory or handle refunds (config sketch below).
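These settings normally live in `redis.conf`, but a sketch using redis-py's `CONFIG SET` shows the knobs involved:

```python
# Turning on AOF with per-second fsync at runtime (sketch, assuming redis-py;
# production deployments set this in redis.conf instead).
import redis

r = redis.Redis()
r.config_set("appendonly", "yes")        # enable the append-only file
r.config_set("appendfsync", "everysec")  # fsync once per second: at most ~1s
                                         # of acknowledged sales lost on crash
```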
9. Decision Comparison
Compare Pessimistic Locking (SQL) vs. Redis Atomic Decrement.
- SQL Lock: Traffic jams. High Latency.
- Redis: High-speed highway. Low Latency.
10. System Walkthrough
Let’s trace a successful purchase request during the peak load.
Scenario: User A Buys an iPhone 15
- User Action: User A clicks “Buy Now” at 10:00:01 AM.
- Edge Layer (CDN/WAF): Checks IP reputation. Passes request.
- API Gateway: Token Bucket Rate Limiter allows the request.
- Application Server: Executes Redis Lua Script.
  - Input: `KEYS = ["item:iphone15:stock", "item:iphone15:users"]`, `ARGV = ["user_A"]`.
  - Logic: Stock is 50. User not in set. `DECR` to 49. `SADD` user.
  - Output: `1` (Success).
- Event Publishing: App Server pushes a message to Kafka topic `orders-stream`: `{ "order_id": "uuid-123", "user_id": "user_A", "item": "iphone15", "status": "PENDING_PAYMENT" }`
- Response: App Server returns `200 OK` to User A: "Reservation Confirmed! Redirecting to Payment...".
- Async Worker: Pulls from Kafka, inserts the row into the SQL `orders` table. The whole hot path is sketched below.
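Steps 4-6 collapse into a few lines of application code. A sketch, assuming redis-py and kafka-python, with the `reserve.lua` script from Section 7.1:

```python
# App-server hot path: atomic Redis check, then fire-and-forget to Kafka.
import json
import uuid

import redis
from kafka import KafkaProducer

r = redis.Redis()
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())
with open("reserve.lua") as f:
    reserve = r.register_script(f.read())

def buy(item_id: str, user_id: str) -> dict:
    result = reserve(keys=[f"item:{item_id}:stock", f"item:{item_id}:users"],
                     args=[user_id])
    if result != 1:
        return {"status": "SOLD_OUT" if result == -1 else "ALREADY_BOUGHT"}
    producer.send("orders-stream", {
        "order_id": str(uuid.uuid4()),
        "user_id": user_id,
        "item": item_id,
        "status": "PENDING_PAYMENT",
    })  # the SQL insert happens later, via the async worker
    return {"status": "RESERVED"}  # "Redirecting to Payment..."
```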
11. Low-Level Optimizations
11.1 Redis Lazy Free
Deleting a large key (e.g., the Set of 1 Million users after the sale ends) is a blocking operation.
- Problem: `DEL item:iphone:users` can block Redis for 2 seconds.
- Solution: Use `UNLINK` (non-blocking delete) or configure `lazyfree-lazy-user-del yes`. This moves memory reclamation to a background thread, as shown below.
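With redis-py this is a one-liner (`UNLINK` requires Redis 4.0+):

```python
# Non-blocking delete (sketch, assuming redis-py and Redis >= 4.0).
import redis

r = redis.Redis()
r.unlink("item:iphone15:users")  # returns immediately; memory is reclaimed
                                 # by a background thread, unlike DEL
```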
11.2 False Sharing & Padding
In highly concurrent in-memory systems (like if you wrote a custom C++ engine instead of Redis), False Sharing occurs when two atomic counters sit on the same CPU Cache Line (64 bytes).
- Scenario: Core A updates `Stock_Item_1`. Core B updates `Stock_Item_2`.
- Issue: If both variables share a cache line, the cores invalidate each other's L1 caches, killing performance.
- Solution: Cache Line Padding. Add dummy bytes between variables to ensure they sit on different cache lines.
12. Interview Gauntlet
Q1: How do you prevent bots from buying everything?
- Answer: Multi-layered approach. 1. CAPTCHA at checkout. 2. WAF (IP rate limits). 3. Post-analysis (cancel orders delivering to the same address). 4. “Waiting Room” queue.
Q2: Why not use a Database with Row Locking?
- Answer: Row locking serializes requests. Postgres can handle ~1k TPS on a single row; we need ~500k TPS at peak (Section 3). Serializing that many requests would cause massive timeouts.
Q3: What if the user reserves an item but doesn’t pay?
- Answer: The reservation has a TTL (e.g., 5 mins). We use a Delayed Queue (or Redis Keyspace Notifications). If payment is not received in 5 minutes, we increment the stock back (`INCR`), as sketched below.
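Whatever fires the expiry, the release step itself is simple. A sketch assuming redis-py; the trigger (delayed queue or keyspace notification) is left out:

```python
# Return an expired, unpaid reservation to stock (sketch, assuming redis-py;
# MULTI/EXEC keeps the two writes atomic).
import redis

r = redis.Redis()

def release_reservation(item_id: str, user_id: str) -> None:
    pipe = r.pipeline(transaction=True)
    pipe.incr(f"item:{item_id}:stock")           # put the unit back
    pipe.srem(f"item:{item_id}:users", user_id)  # allow the user to retry
    pipe.execute()
```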
Q4: How do you test this system?
- Answer: Distributed Load Testing (k6 / JMeter) running on multiple cloud instances to simulate 500k concurrent users. We verify “Sold Count” == “Inventory” exactly.
Q5: Can we use a Distributed Lock (Redlock)?
- Answer: Redlock is too slow for high-frequency counting. It requires multiple network round trips to acquire/release. Atomic Lua scripts are much faster (single RTT).
13. Summary: The Whiteboard Strategy
1. Requirements
- Func: Buy, Inventory.
- Non-Func: No Overselling, High Load.
2. Architecture
```
[Client] -> [Gateway] -> [App] -> [Redis] -(Async)-> [Kafka] -> [SQL]
```
- Redis: Hot State (Stock).
- SQL: Cold State (Orders).
3. Data & API
- Redis: `DECR item:stock`
- SQL: Orders Table
4. Deep Dives
- Lua: Ensures atomicity.
- Funnel: Filter traffic early.