Distributed Locking (Redlock)
In 2011, Ticketmaster’s ticket allocation system had a race condition that allowed double-booking of premium seats during high-demand sales. Two users in different data centers would simultaneously check seat availability, both see “Available”, and both complete their booking. The monolithic SQL database had prevented this with SELECT FOR UPDATE row locks, but after the migration to microservices with distributed databases, a row lock no longer protected work that spanned services. The solution: Distributed Locking via Redis. A single atomic Redis SET key value NX PX 30000 command (“set if not exists, expire in 30s”) became a widely used way to implement mutual exclusion across microservice instances. The Redlock algorithm (5 Redis masters, majority quorum) was proposed by Redis creator Salvatore Sanfilippo (antirez). In 2016 it became the subject of a well-known debate with researcher Martin Kleppmann about whether it is safe when correctness is at stake, as in financial systems. That debate still shapes how distributed locks are used today.
[!IMPORTANT] In this lesson, you will master:
- Lease Mechanics: Why “True Locks” don’t exist in distributed systems, and why everything is actually a time-bound lease.
- The Ghost Writer Anomaly: How Garbage Collection (GC) pauses can invalidate your safety guarantees and corrupt your data.
- Fencing Tokens: Implementing the industry-standard shield that lets the storage layer reject “Zombie” writes.
1. The Anomaly: Double Booking
Imagine a Ticketmaster clone.
- User A checks: “Seat 1A Available?” → YES.
- User B checks: “Seat 1A Available?” → YES.
- User A books Seat 1A.
- User B books Seat 1A.
Result: Collision. Data Corruption. Angry Users.
The Solution: Mutual Exclusion.
- User A acquires Lock for seat_1A.
- User B tries to acquire Lock → FAILS (Wait).
- User A books seat → Releases Lock.
- User B acquires Lock → Checks Seat → “Sold Out”.
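The four steps above can be sketched inside a single process, with a thread lock standing in for the distributed lock. This is only an illustration: `SeatMap` and `race` are hypothetical names, and a real deployment would replace `threading.Lock` with a lock that every service instance can see.

```python
import threading


class SeatMap:
    """Toy in-memory seat inventory guarded by one lock per seat (hypothetical)."""

    def __init__(self, seats):
        self._owner = {seat: None for seat in seats}
        self._locks = {seat: threading.Lock() for seat in seats}

    def book(self, seat, user):
        # The availability check and the write share one critical section,
        # so no other thread can slip in between them.
        with self._locks[seat]:
            if self._owner[seat] is not None:
                return False  # "Sold Out"
            self._owner[seat] = user
            return True

    def owner(self, seat):
        return self._owner[seat]


def race(seat_map, seat, users):
    """Let every user try to book the same seat at once; return the winners."""
    results = {}

    def attempt(user):
        results[user] = seat_map.book(seat, user)

    threads = [threading.Thread(target=attempt, args=(u,)) for u in users]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [user for user, ok in results.items() if ok]
```

Whichever thread enters the critical section first wins the seat; the other sees it as taken and gets `False`.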
2. Efficiency vs Correctness
Before you implement a lock, you must ask: “What happens if the lock fails?”
| Goal | Description | Consequence of Failure | Solution |
|---|---|---|---|
| Efficiency | Prevent doing the same work twice (e.g., sending email). | Minor annoyance (User gets 2 emails). | Redis (Redlock) |
| Correctness | Prevent data corruption (e.g., money transfer). | Catastrophic (Money lost). | Fencing Tokens (ZooKeeper/Etcd) |
[!WARNING] Redis is for Efficiency. If you need absolute safety (Correctness), do not rely solely on Redis. Use a consensus system like ZooKeeper or Etcd because Redis (even Redlock) makes assumptions about system clocks.
3. The Tool: Redis SETNX
The simplest distributed lock is a single atomic command in Redis.
- Command: SET resource_name my_random_value NX PX 30000
  - NX: Not eXists (only set if the key doesn’t already exist).
  - PX 30000: expiry in milliseconds (auto-delete after 30s).
Why the TTL (Time To Live)?
If the client holding the lock crashes before releasing it, a lock without a TTL stays forever (Deadlock). The TTL ensures the lock auto-releases, acting as a Lease.
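To make the lease behaviour concrete, here is a minimal sketch that imitates the command's semantics in memory, so it runs without a Redis server. `FakeRedis`, `set_nx_px`, `delete_if_equal`, and `acquire` are hypothetical names, not a real client API. With the redis-py client, the acquire step would typically be a `set(name, value, nx=True, px=30000)` call, and the compare-and-delete release would need to run atomically on the server (commonly as a Lua script).

```python
import secrets


class FakeRedis:
    """Hypothetical in-memory stand-in for the two Redis behaviours used here."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in ms
        self._data = {}       # key -> (value, expires_at_ms)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]   # lazily expire, like a TTL firing
            return None
        return entry

    def set_nx_px(self, key, value, ttl_ms):
        """Mimics SET key value NX PX ttl_ms: succeed only if key is absent."""
        if self._live(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl_ms)
        return True

    def delete_if_equal(self, key, value):
        """Release only if the stored value is still ours (compare, then delete)."""
        entry = self._live(key)
        if entry is None or entry[0] != value:
            return False
        del self._data[key]
        return True


def acquire(redis, resource, ttl_ms=30_000):
    """Return the random owner value on success, or None if the lock is held."""
    value = secrets.token_hex(16)   # plays the role of my_random_value
    return value if redis.set_nx_px(resource, value, ttl_ms) else None
```

The random value is what makes release safe: a client whose lease has already expired cannot delete a lock that now belongs to someone else.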
[!NOTE] Hardware-First Intuition: The Clock Jump. Redis TTLs (PX) rely on the system clock of the Redis server. If the NTP (Network Time Protocol) daemon detects clock drift and “steps” the clock forward (e.g., by 2 seconds to fix a lag), your 5-second lock effectively becomes a 3-second lock. This clock instability is why Staff Engineers distinguish between Clock Slewing (gradually adjusting frequency) and Clock Stepping (instant jump). For high-value financial transactions, they rely on Logical Clocks (sequence numbers), which do not depend on any machine’s clock.
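The arithmetic in this note can be replayed with a toy clock. The sketch below is a simplified model, not how Redis or NTP are implemented: `SteppableClock` and `real_lifetime_ms` are hypothetical names, and the key's expiry is modelled as a plain comparison against the wall clock.

```python
class SteppableClock:
    """Toy wall clock: real time advances it, and NTP may 'step' it forward."""

    def __init__(self):
        self.real_ms = 0      # true elapsed time
        self.offset_ms = 0    # sum of all steps applied so far

    def now(self):
        return self.real_ms + self.offset_ms

    def advance(self, ms):
        self.real_ms += ms

    def step(self, ms):
        self.offset_ms += ms


def real_lifetime_ms(ttl_ms, step_at_ms, step_ms, tick_ms=100):
    """Real time until a key set at t=0 expires, given one forward clock step."""
    clock = SteppableClock()
    expires_at = clock.now() + ttl_ms   # expiry is fixed in wall-clock terms
    stepped = False
    while clock.now() < expires_at:
        clock.advance(tick_ms)
        if not stepped and clock.real_ms >= step_at_ms:
            clock.step(step_ms)         # the NTP step happens here
            stepped = True
    return clock.real_ms
```

With a 5000 ms TTL and a 2000 ms step applied one second in, `real_lifetime_ms(5000, 1000, 2000)` returns 3000: the lease lasted three real seconds instead of five.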
4. The Trap: The Ghost Writer (GC Pauses)
Here is how a simple Redis lock fails during a Garbage Collection (GC) Pause.
- Client A acquires Lock (TTL 5s).
- Client A freezes for 8s (GC Pause). Lock Expired.
- Client B acquires Lock. Writes to DB.
- Client A wakes up. Thinks it still holds the lock. Writes to DB.
Result: Last Write Wins. Client A overwrites Client B’s valid data.
Sequence Diagram: The Ghost Writer
The Fix: Fencing Tokens
To solve this, we need the Storage Layer to help.
- Lock Service returns a monotonic Token (1, 2, 3…).
- Client A gets Token 33.
- Client B gets Token 34 (after A expires).
- Client A wakes up, tries to write with 33.
- Database checks: “I’ve already seen 34. Reject 33.”
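The exchange above can be sketched with two small classes. Both are hypothetical stand-ins: `LockService` only models the monotonic counter, and `FencedStore` models a storage layer that remembers the highest token it has accepted. The starting value of 32 is chosen only so the first two tokens match the 33 and 34 used in the walkthrough.

```python
class LockService:
    """Hypothetical lock service that hands out a strictly increasing token."""

    def __init__(self, start=32):
        self._last = start

    def grant(self):
        self._last += 1
        return self._last


class FencedStore:
    """Storage that remembers the highest fencing token it has accepted."""

    def __init__(self):
        self.value = None
        self.highest_token = 0

    def write(self, value, token):
        # Equal tokens are allowed so the current holder can write repeatedly;
        # anything older than the highest token seen is a stale (zombie) write.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True
```

The key design point is that the check lives in the storage layer, not in the client: a paused client cannot be trusted to notice that its lease has expired.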
5. Interactive Demo: Redlock & Time Travel
Cyberpunk Mode: Simulate the Race Condition.
- Mission: Acquire the lock and write to the Database.
- Weapon: “Freeze Ray” (Simulates GC Pause).
- Defense: Fencing Tokens (Visualized).
[!TIP] Try it yourself:
- Acquire Lock as Client A.
- Immediately hit “❄️ Freeze (GC)”. This pauses Client A for 6 seconds (longer than the 5s Lock TTL).
- Wait for the lock to expire (watch the red bar).
- Acquire Lock as Client B. Client B will write to the DB (Token 34).
- Watch Client A wake up and try to write with Token 33.
- Result: The Database triggers a “BLOCKED” shield because 33 < 34.
6. Redlock Algorithm (Multi-Master)
Single Redis is a Single Point of Failure. Redlock uses 5 independent Redis masters to solve this.
- Client gets current timestamp.
- Tries to acquire lock in all 5 instances sequentially.
- If acquired in Majority (3/5) and time elapsed < TTL:
  - Lock Acquired.
- Else:
  - Unlock All.
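The steps above can be sketched against in-memory stand-ins for the five masters. This is a simplified illustration, not the reference implementation: `Instance` and `redlock_acquire` are hypothetical names, per-key TTLs and per-instance timeouts are omitted, and the published algorithm additionally subtracts a clock-drift allowance from the validity time.

```python
import secrets


class Instance:
    """Hypothetical in-memory stand-in for one independent Redis master."""

    def __init__(self, up=True):
        self.up = up
        self.locks = {}   # resource -> owner value (TTL handling omitted)

    def set_nx(self, resource, value):
        if not self.up:
            raise ConnectionError("instance unreachable")
        if resource in self.locks:
            return False
        self.locks[resource] = value
        return True

    def release(self, resource, value):
        # Only the owner's value may remove the lock.
        if self.up and self.locks.get(resource) == value:
            del self.locks[resource]


def redlock_acquire(instances, resource, ttl_ms, clock_ms):
    """Return (value, validity_ms) on success, or None after unlocking all."""
    value = secrets.token_hex(16)
    start = clock_ms()
    acquired = 0
    for inst in instances:                  # try every instance in turn
        try:
            if inst.set_nx(resource, value):
                acquired += 1
        except ConnectionError:
            pass                            # a down instance simply does not vote
    elapsed = clock_ms() - start
    quorum = len(instances) // 2 + 1        # 3 of 5
    if acquired >= quorum and elapsed < ttl_ms:
        return value, ttl_ms - elapsed      # remaining time the lock is valid
    for inst in instances:                  # failed: release everywhere
        inst.release(resource, value)
    return None
```

With three of five instances reachable the call succeeds; with only two it fails and cleans up the partial locks, so a later attempt is not blocked by leftovers.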
The Controversy: Kleppmann vs Antirez
Distributed Systems researcher Martin Kleppmann famously critiqued Redlock.
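- The Issue: Redlock relies on Wall-Clock Time. If a server’s clock jumps forward (e.g., NTP sync), it might expire a lock prematurely. Kleppmann also pointed out that Redlock issues no fencing token, so a client that pauses past its lease can still write with a lock it no longer holds.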
- The Verdict:
- Use Redlock for Efficiency (preventing double-processing).
- Use ZooKeeper/Etcd for Correctness (preventing data corruption). ZooKeeper uses logical clocks (Zxid), not wall clocks.
7. The GC Death Spiral: “Pause > TTL”
One of the most dangerous anomalies in distributed systems is a process pause that outlasts the lease: a massive Stop-the-World GC pause, or an OS-level stall such as swapping to disk.
- Lease Acquired: Client A gets a 5s lease.
- Process Stall: The JVM or OS freezes Client A for 10s (e.g., swapping to disk).
- Zombie State: Client A wakes up mid-operation. Nothing tells it that it was frozen: even if its local clock now reads “T+10”, code that already passed its lock check simply proceeds as if it still held the lock.
- Collision: Client A and the new lock holder both write. The storage layer must be the final line of defense, using Fencing Tokens.
8. Summary
- Distributed Locks are essential for Mutual Exclusion.
- TTL (Lease) prevents deadlocks but introduces race conditions.
- Fencing Tokens are the shield against Zombie Leaders (GC Pauses).
- Redlock is great for efficiency, but not for financial safety.
Staff Engineer Tip: When using fencing tokens, treat them as storage-level guardrails. The database (e.g., Postgres or DynamoDB) should enforce a token > last_seen_token condition inside the write itself, as a WHERE clause on the UPDATE or as a conditional write. This ensures that even if your application code is “Zombified” by a massive 30-second stall, the database rejects the stale write instead of applying it.
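As a rough illustration of putting the token check inside the UPDATE, the sketch below uses SQLite from the Python standard library in place of Postgres or DynamoDB. The `seats` table, its `last_token` column, and the helper names are assumptions made for this example.

```python
import sqlite3


def open_db():
    """Create an in-memory database with one unbooked seat (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE seats (id TEXT PRIMARY KEY, owner TEXT, "
        "last_token INTEGER NOT NULL DEFAULT 0)"
    )
    db.execute("INSERT INTO seats (id) VALUES ('seat_1A')")
    db.commit()
    return db


def fenced_book(db, seat_id, owner, token):
    """Apply the write only if this token is newer than the stored one."""
    cur = db.execute(
        "UPDATE seats SET owner = ?, last_token = ? "
        "WHERE id = ? AND last_token < ?",
        (owner, token, seat_id, token),
    )
    db.commit()
    return cur.rowcount == 1   # 0 rows touched means the write was fenced off
```

Because the comparison and the write are one statement, the database decides atomically whether the token is fresh. The comparison here is strict, so each token can update the row at most once; a holder that issues several writes under one token would need `<=` instead.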
Hardware Nuance: NTP Stepping vs Slewing. Many servers use ntpd to keep clocks in sync. If the clock is off by a small amount, NTP slews the clock, speeding it up or slowing it down slightly so that time stays continuous. However, if the offset is large (more than 128 ms by default in ntpd), NTP may step the clock, jumping it instantly. A forward “jump” cuts a Redis PX timeout short, causing a lock to expire “early” in real-world time.
Mnemonic — “Lock = Lease (not forever)”: Redis SETNX = check-and-set atomic. TTL = lease, not true lock. GC Pause > TTL → Zombie Writer → Old Token rejected by DB (Fencing). Redlock: 3/5 masters quorum = stronger. But: NTP clock jump → lock expiry miscalculation → Kleppmann’s critique. Rule: Redis = Efficiency (prevent double emails). ZooKeeper/Etcd = Correctness (prevent data corruption). Always include Fencing Token in DB write query.