Flash Sale System (Ticketmaster)

[!NOTE] This module explores the core principles of Flash Sale System (Ticketmaster), deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. What is a Flash Sale System?

In 2022, Ticketmaster crashed when millions of fans tried to buy Taylor Swift tickets simultaneously. The problem wasn’t just volume; it was Contention.

A Flash Sale is a Distributed Systems nightmare:

  1. Extreme Contention: 1 Million users want 100 items.
  2. Zero Error Margin: You cannot sell 101 items (Overselling).
  3. Short Duration: The event lasts seconds. Auto-scaling groups don’t have time to boot up.

[!TIP] Real-World Examples:

  • Ticketmaster: Concert tickets.
  • Amazon Prime Day: Limited deals.
  • Sneaker Drops: Nike SNKRS app.

2. Requirements & Goals

2.1 Functional Requirements

  1. Purchase: Users can buy an item.
  2. Inventory: System must track available stock strictly.
  3. Limits: 1 item per user.

2.2 Non-Functional Requirements

  1. High Consistency: Absolutely NO overselling. (CP over AP).
  2. High Availability: The system must not crash under surge.
  3. Low Latency: Responses must be fast (< 200ms).
  4. Fairness: First come, first served.

2.3 Extended Requirements

  1. Bot Protection: Real humans must get priority over scalper bots.

3. Capacity Estimation

3.1 Traffic Analysis

  • Traffic: 1 Million Active Users at 10:00 AM.
  • QPS: Peak traffic 500,000 requests per second.
  • Inventory: 100 iPhones.

3.2 The Bottleneck

A standard PostgreSQL database handles ~1,000 TPS on a single row lock. We need 500,000 TPS.

  • Conclusion: We cannot touch the database for the initial “Buy” action. We need an In-Memory solution (Redis).
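Back-of-envelope check: pushing the 500,000 req/s peak through a single hot row at ~1,000 TPS would mean roughly 500,000 / 1,000 = 500 seconds of queueing, while the sale itself is over in seconds.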

4. System APIs

Minimal API surface area to reduce attack vectors.

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/flash-sale/buy | Attempts to reserve an item. Payload: { itemId: "iphone15", userId: "u1" } |
| GET | /v1/flash-sale/status | Checks whether the sale is active or sold out. |
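A minimal client-side sketch of the buy call, assuming Python's `requests` library and a placeholder host:

```python
# Hypothetical call against the buy endpoint above.
import requests

resp = requests.post(
    "https://api.example.com/v1/flash-sale/buy",   # placeholder host
    json={"itemId": "iphone15", "userId": "u1"},
    timeout=1,  # the hot path must answer fast; fail quickly rather than hang
)
print(resp.status_code, resp.json())
```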

5. Database Design

5.1 Redis (Hot State)

  • Counter: item:{id}:stock → Integer
  • User Set: item:{id}:users → Set<UserId> (to prevent double booking)
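The two keys above can be seeded before the sale opens. A minimal sketch, assuming redis-py and an illustrative item id:

```python
# Seed the Redis hot state before the sale opens (illustrative item id).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

item_id = "iphone15"
r.set(f"item:{item_id}:stock", 100)   # Counter: item:{id}:stock -> Integer
r.delete(f"item:{item_id}:users")     # User Set: item:{id}:users -> Set<UserId>
```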

5.2 SQL (Cold State)

  • Table: orders
  • Columns: order_id, user_id, item_id, status.

6. High-Level Architecture

We use a Funnel Architecture to filter traffic at every layer.

System Architecture: Flash Sale Traffic Funnel
Multi-Layer Bot Filtering | Redis Atomic Inventory | Async Queueing

  • Zone 1: The Surge (1M+ Users) → CDN / WAF / Bot Protection. Bots are rejected at the edge.
  • Zone 2: Rate Limiting (~100k req/s) → Token Bucket / Auth Validation. Roughly 90% of requests are rejected here.
  • Zone 3: Atomic Lock (~10k req/s) → Redis cache (Stock: 100), Lua script decrements the stock. Requests for sold-out items are rejected.
  • Zone 4: Persistence → Kafka → SQL. Exactly 100 orders are created.

Flow: High-Volume Traffic → Filtered Requests → Valid Orders.

7. Component Design (Deep Dive)

7.1 Redis Atomic Counter (Lua)

Why Lua? Redis commands (GET, DECR) are atomic individually, but a sequence of them is not: a race condition can occur between checking the stock and decrementing it. A Lua script executes atomically on the Redis server; no other command can run while it is executing. (See ACID Transactions for DB atomicity details.)

```lua
-- KEYS[1] = inventory_key ("item:ps5:stock")
-- KEYS[2] = user_history_key ("item:ps5:users")
-- ARGV[1] = user_id

local stock = tonumber(redis.call("GET", KEYS[1]))
if not stock or stock <= 0 then
  return -1 -- Sold Out (or stock key missing)
end

if redis.call("SISMEMBER", KEYS[2], ARGV[1]) == 1 then
  return -2 -- Already Bought
end

redis.call("DECR", KEYS[1])
redis.call("SADD", KEYS[2], ARGV[1])
return 1 -- Success
```
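On the application server, the script can be loaded once and then invoked per purchase attempt. A minimal sketch, assuming redis-py and that the script above is saved as buy.lua (a hypothetical filename):

```python
# Register the Lua script once; redis-py invokes it via EVALSHA under the hood.
import redis

r = redis.Redis(decode_responses=True)

with open("buy.lua") as f:            # hypothetical path to the script above
    BUY_SCRIPT = r.register_script(f.read())

def try_buy(item_id: str, user_id: str) -> int:
    # Returns 1 = reserved, -1 = sold out, -2 = user already bought.
    return BUY_SCRIPT(
        keys=[f"item:{item_id}:stock", f"item:{item_id}:users"],
        args=[user_id],
    )
```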

7.2 Queue-Based Load Leveling

Once Redis returns 1 (Success), we don’t write to SQL immediately. Instead, we push a message to Kafka: { "user_id": "u1", "item_id": "ps5", "timestamp": 1620000000 }. A separate worker process reads from Kafka and inserts into the SQL DB at a safe pace (e.g., 500 inserts/sec).
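A minimal sketch of both sides of the queue, assuming kafka-python and a hypothetical insert_order helper that writes the SQL row:

```python
# Producer (app server): publish the reservation event after Redis says "Success".
# Consumer (worker): drain the topic into SQL at a controlled pace.
import json
import time
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("orders-stream", {"user_id": "u1", "item_id": "ps5", "timestamp": 1620000000})

consumer = KafkaConsumer("orders-stream",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
for msg in consumer:
    insert_order(msg.value)   # hypothetical helper: INSERT INTO orders (...)
    time.sleep(0.002)         # crude pacing: ~500 inserts/sec per worker
```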


8. Data Partitioning & Sharding

For a single item Flash Sale (e.g., 100 iPhones), we cannot shard the Item Stock key easily without complex coordination. However, we partition the Traffic using the Load Balancer and Rate Limiters.

  • Request Sharding: Users are hashed by ID to different API Gateways.
  • Inventory Sharding: If we have 10,000 items, we can split them into 10 Redis partitions (1,000 items each). The Load Balancer routes users to a specific partition based on hash(UserID) % 10 (see the sketch after this list).
  • Trade-off: One partition might sell out while others still have stock. We might need a “Work Stealing” algorithm to rebalance, but that adds complexity. For small inventory (100 items), a single Redis node is fine.
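A minimal routing sketch under those assumptions (10 partitions, illustrative host names), using a stable hash so every gateway agrees on the mapping:

```python
# Route each user to a fixed Redis partition via a stable hash of the user id.
import zlib

REDIS_PARTITIONS = [f"redis-{i}" for i in range(10)]  # illustrative host names

def partition_for(user_id: str) -> str:
    # zlib.crc32 is deterministic across processes, unlike Python's built-in hash().
    return REDIS_PARTITIONS[zlib.crc32(user_id.encode("utf-8")) % len(REDIS_PARTITIONS)]

print(partition_for("user_A"))
```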

9. Reliability, Caching, & Load Balancing

9.1 Preventing Overselling

  • Pessimistic Locking (SQL): SELECT ... FOR UPDATE. Safe but slow.
  • Optimistic Locking (SQL): UPDATE ... WHERE version = 1. Fast but causes retry storms (sketched after this list).
  • Redis Atomic: Best for high contention.
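A minimal sketch of the optimistic-locking variant, assuming a DB-API cursor and a hypothetical inventory table with stock and version columns:

```python
# Optimistic locking: the UPDATE only wins if nobody bumped the version in between.
def buy_optimistic(cur, item_id: str) -> bool:
    cur.execute("SELECT stock, version FROM inventory WHERE item_id = %s", (item_id,))
    stock, version = cur.fetchone()
    if stock <= 0:
        return False  # sold out
    cur.execute(
        "UPDATE inventory SET stock = stock - 1, version = version + 1 "
        "WHERE item_id = %s AND version = %s",
        (item_id, version),
    )
    return cur.rowcount == 1  # 0 rows => lost the race; caller retries (retry storm risk)
```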

9.2 Redis Persistence

If Redis crashes, we lose the stock count.

  • AOF (Append Only File): We must set appendfsync always or everysec to minimize data loss.
  • Trade-off: always is slow (disk I/O). everysec risks losing 1 second of sales (overselling 1 sec worth of inventory). For Ticketmaster, we usually accept everysec and have a small “buffer” inventory or handle refunds.

10. Interactive Decision Visualizer

Compare Pessimistic Locking (SQL) vs. Redis Atomic Decrement.

  • SQL Lock: Traffic jams. High Latency.
  • Redis: High-speed highway. Low Latency.


11. System Walkthrough

Let’s trace a successful purchase request during the peak load.

Scenario: User A Buys an iPhone 15

  1. User Action: User A clicks “Buy Now” at 10:00:01 AM.
  2. Edge Layer (CDN/WAF): Checks IP reputation. Passes request.
  3. API Gateway: Token Bucket Rate Limiter allows the request.
  4. Application Server: Executes Redis Lua Script.
    • Input: KEYS=["item:iphone15:stock", "item:iphone15:users"], ARGV=["user_A"].
    • Logic: Stock is 50. User not in set. DECR to 49. SADD user.
    • Output: 1 (Success).
  5. Event Publishing: App Server pushes message to Kafka Topic orders-stream.
      { "order_id": "uuid-123", "user_id": "user_A", "item": "iphone15", "status": "PENDING_PAYMENT" }
    
  6. Response: App Server returns 200 OK to User A. “Reservation Confirmed! Redirecting to Payment…”.
  7. Async Worker: Pulls from Kafka, inserts row into SQL orders table.
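Putting the hot path together, a minimal sketch of the app-server handler (function names reuse the earlier sketches and are illustrative):

```python
# Map the Lua return code to an HTTP-style response; publish to Kafka only on success.
def handle_buy(item_id: str, user_id: str):
    code = try_buy(item_id, user_id)        # Lua script call from the Section 7.1 sketch
    if code == -1:
        return 409, {"error": "SOLD_OUT"}
    if code == -2:
        return 409, {"error": "ALREADY_PURCHASED"}
    publish_order_event(item_id, user_id)   # Kafka producer from the Section 7.2 sketch
    return 200, {"status": "RESERVATION_CONFIRMED"}
```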

12. Low-Level Optimizations

12.1 Redis Lazy Free

Deleting a large key (e.g., the Set of 1 Million users after the sale ends) is a blocking operation.

  • Problem: DEL item:iphone:users can block Redis for 2 seconds.
  • Solution: Use UNLINK (non-blocking delete) or configure lazyfree-lazy-user-del yes. This moves the memory reclamation to a background thread.
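A minimal sketch of the non-blocking cleanup, assuming redis-py:

```python
# UNLINK returns immediately; memory is reclaimed by a background thread.
import redis

r = redis.Redis()
r.unlink("item:iphone:users")
```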

12.2 False Sharing & Padding

In highly concurrent in-memory systems (like if you wrote a custom C++ engine instead of Redis), False Sharing occurs when two atomic counters sit on the same CPU Cache Line (64 bytes).

  • Scenario: Core A updates Stock_Item_1. Core B updates Stock_Item_2.
  • Issue: If both variables share a cache line, Cores invalidate each other’s L1 cache, killing performance.
  • Solution: Cache Line Padding. Add dummy bytes between variables to ensure they sit on different cache lines.

13. Interview Gauntlet

Q1: How do you prevent bots from buying everything?

  • Answer: Multi-layered approach. 1. CAPTCHA at checkout. 2. WAF (IP rate limits). 3. Post-analysis (cancel orders delivering to the same address). 4. “Waiting Room” queue.

Q2: Why not use a Database with Row Locking?

  • Answer: Row locking serializes requests. Postgres can handle ~1k TPS on a single row, but even after edge filtering the hot path still sees on the order of 100k requests per second. Serializing that many requests through one row lock would cause massive timeouts.

Q3: What if the user reserves an item but doesn’t pay?

  • Answer: The reservation has a TTL (e.g., 5 mins). We use a Delayed Queue (or Redis Keyspace Notification). If payment is not received in 5 mins, we increment the stock back (INCR).
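A minimal sketch of the release path, assuming redis-py; the reservation key's expiry is what the delayed queue or keyspace-notification listener reacts to:

```python
# On reserve: record a payment deadline. On expiry: return the unit to stock.
import redis

r = redis.Redis(decode_responses=True)

def reserve(item_id: str, user_id: str, ttl_s: int = 300):
    # Written alongside the Lua reservation; expires if payment never arrives.
    r.set(f"reservation:{item_id}:{user_id}", "PENDING_PAYMENT", ex=ttl_s)

def release_expired(item_id: str, user_id: str):
    # Invoked by the delayed-queue worker / keyspace-notification listener.
    r.incr(f"item:{item_id}:stock")             # put the unit back on sale
    r.srem(f"item:{item_id}:users", user_id)    # let the user attempt to buy again
```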

Q4: How do you test this system?

  • Answer: Distributed Load Testing (k6 / JMeter) running on multiple cloud instances to simulate 500k concurrent users. We verify “Sold Count” == “Inventory” exactly.

Q5: Can we use a Distributed Lock (Redlock)?

  • Answer: Redlock is too slow for high-frequency counting. It requires multiple network round trips to acquire/release. Atomic Lua scripts are much faster (single RTT).

14. Summary: The Whiteboard Strategy

1. Requirements

  • Func: Buy, Inventory.
  • Non-Func: No Overselling, High Load.

2. Architecture

[Client] → [Rate Limit] → [Redis Lua] | (Async) → [Kafka] → [SQL]

  • Redis: Hot State (Stock).
  • SQL: Cold State (Orders).

3. Data & API

  • POST /buy → { itemId, userId }
  • Redis: DECR item:stock
  • SQL: orders table

4. Deep Dives

  • Lua: Ensures atomicity.
  • Funnel: Filter traffic early.