Module Review: Caching
[!IMPORTANT] As a Senior Staff Engineer, I expect you to master the following to pass this module:
- Quantify Latency: Distinguish an L1 cache hit (~1ns) from a WAN round trip (~150ms).
- Defend the DB: Choose the right eviction and write strategy for high-throughput systems.
- Hardware Intuition: Explain how Pointer Overhead, Write Amplification, and Light-Speed constraints affect software design.
1. Cheat Sheet: The Numbers You Must Know
| Operation | Where | Latency (Approx) | Analogy |
|---|---|---|---|
| L1/L2 Cache | CPU Internal | < 10ns | Heartbeat |
| RAM (Redis) | Memory Access | ~100ns | Brushing Teeth |
| SSD Read | Fast Disk | ~150,000ns (150µs) | Weekend Trip |
| Network (LAN) | Data Center | ~500,000ns (0.5ms) | 1 Week Vacation |
| Network (WAN) | Cross-Region | ~150,000,000ns (150ms) | 4 Years (University) |
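The table's real lesson is the ratios, not the absolute numbers. A minimal sketch (the dictionary below just restates the cheat-sheet values in nanoseconds) shows how many RAM-speed cache reads fit inside a single cross-region round trip:

```python
# The cheat-sheet numbers, all in nanoseconds (approximate orders of magnitude).
LATENCY_NS = {
    "l1_l2_cache": 10,
    "ram": 100,
    "ssd_read": 150_000,
    "lan_round_trip": 500_000,
    "wan_round_trip": 150_000_000,
}

# How many RAM (Redis-style) lookups fit inside one cross-region trip?
ram_reads_per_wan_trip = LATENCY_NS["wan_round_trip"] // LATENCY_NS["ram"]
print(ram_reads_per_wan_trip)  # 1500000
```

One WAN trip costs as much as about 1.5 million local memory reads, which is why a cache in the wrong region can be worse than no cache at all.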
2. Flashcards: Test Your Knowledge
What is the "Thundering Herd" problem?
Tap to reveal
Concurrency Spike
When a popular cache key expires, thousands of requests hit the database simultaneously to regenerate it. Solved by Mutex Locks or Refresh-Ahead.
LRU vs LFU?
Tap to reveal
Recency vs Frequency
LRU: Evicts least recently used (Good for Recency bias).
LFU: Evicts least frequently used (Good for long-term popularity, resistant to scans).
Write-Through vs Write-Back?
Tap to reveal
Safety vs Speed
Write-Through: Writes to DB + Cache synchronously (Safe).
Write-Back: Writes to Cache, async to DB (Fast, risk of data loss).
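The two write strategies can be sketched side by side. This is a minimal illustration, not a production cache: the class names are invented here, and a real write-back implementation would flush on a timer or under memory pressure rather than on demand.

```python
from collections import deque


class WriteThroughCache:
    """Write to the backing store AND the cache synchronously: safe, slower."""

    def __init__(self, store):
        self.cache, self.store = {}, store

    def put(self, key, value):
        self.store[key] = value   # durable write happens on the request path
        self.cache[key] = value


class WriteBackCache:
    """Write to the cache only; flush to the store later: fast, lossy on crash."""

    def __init__(self, store):
        self.cache, self.store = {}, store
        self.dirty = deque()      # keys not yet persisted

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.append(key)    # durable write is deferred

    def flush(self):
        while self.dirty:
            key = self.dirty.popleft()
            self.store[key] = self.cache[key]
```

Anything sitting in `dirty` when the process dies is lost, which is exactly the write-back risk the flashcard describes.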
What is Consistent Hashing?
Tap to reveal
Distributed Scaling
A technique to map keys to nodes on a ring. When a node is added/removed, only 1/N keys are moved, minimizing cache misses.
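A toy ring makes the "only 1/N keys move" claim concrete. This sketch deliberately omits virtual nodes (which real systems use to smooth out load) and uses MD5 purely as a stable hash:

```python
import hashlib
from bisect import bisect


class ConsistentHashRing:
    """Toy ring (no virtual nodes): adding a node remaps only the keys
    on the arc between the new node and its predecessor."""

    def __init__(self, nodes=()):
        self.ring = []                       # sorted (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(item):
        return int(hashlib.md5(item.encode()).hexdigest(), 16)

    def add(self, node):
        self.ring.append((self._hash(node), node))
        self.ring.sort()

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        # First node clockwise from the key's position on the ring.
        idx = bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]
```

If you map a few thousand keys across three nodes and then add a fourth, every key that moves lands on the new node; the rest keep their old owner, so most cached entries survive the topology change.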
Redis Persistence: RDB vs AOF?
Tap to reveal
Snapshot vs Log
RDB: Periodic snapshots (Compact, fast restart, data loss window).
AOF: Append-only log (Durable, large file, slower restart).
What is Cache Penetration?
Tap to reveal
Querying Non-Existent Data
Users query keys that don't exist in DB (e.g. id=-1). Bypasses cache and hits DB. Solved by Bloom Filters or caching Nulls.
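Of the two fixes, caching nulls is the simpler to sketch. The class name and `db_hits` counter below are invented for illustration; a production version would give the negative entries a short TTL so a key created later becomes visible:

```python
MISSING = object()    # sentinel meaning "key confirmed absent in the DB"


class NullCachingLayer:
    """Cache negative lookups so repeated queries for non-existent
    keys (e.g. id=-1) stop reaching the database."""

    def __init__(self, db):
        self.db = db          # any dict-like backing store
        self.cache = {}
        self.db_hits = 0      # instrumentation for this example

    def get(self, key):
        if key in self.cache:
            hit = self.cache[key]
            return None if hit is MISSING else hit
        self.db_hits += 1                     # only path that touches the DB
        value = self.db.get(key)
        self.cache[key] = MISSING if value is None else value
        return value
```

The first query for a bogus id hits the database; every repeat is absorbed by the cached sentinel.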
2.5 War Story: The Thundering Herd at Ticketmaster
Imagine Taylor Swift concert tickets go on sale. Millions of users hit the homepage. The cache key concert:taylor_swift:details is extremely hot.
Suddenly, the cache TTL expires.
- The Thundering Herd: In the exact millisecond the cache misses, 50,000 requests bypass the cache and hit the database simultaneously to fetch the exact same concert details.
- The Meltdown: The database, unable to handle 50,000 complex joins simultaneously, spikes in CPU, runs out of connections, and crashes.
- The Fix (Mutex Lock):
  - Request 1 arrives, gets a cache miss, and acquires a Mutex Lock (e.g., Redis SETNX).
  - Request 1 goes to the database.
  - Requests 2-50,000 arrive, see the lock is held, sleep for 50ms, then retry the cache.
  - Request 1 updates the cache and releases the lock.
  - Requests 2-50,000 wake up, hit the now-populated cache, and the database survives.
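The mutex pattern above can be sketched with an in-process stand-in for Redis. Everything here is simulated: `store` plays the role of Redis, `setnx` mimics SETNX, and a real deployment would use `SET key value NX EX ttl` so a crashed holder cannot deadlock the system. Note the double-check of the cache after winning the lock, which closes the race where the cache was populated between a request's miss and its lock attempt:

```python
import threading
import time

store = {}                      # stands in for Redis
_guard = threading.Lock()       # makes our fake SETNX atomic


def setnx(key, value):
    """Stand-in for Redis SETNX: set key only if absent; True on success."""
    with _guard:
        if key in store:
            return False
        store[key] = value
        return True


def get_details(cache, db_calls, key="concert:taylor_swift:details"):
    while True:
        if key in cache:                        # 1. try the cache first
            return cache[key]
        if setnx(key + ":lock", 1):             # 2. one request wins the lock
            try:
                if key not in cache:            # 3. double-check after locking
                    db_calls.append(1)          # 4. the single DB query
                    cache[key] = "details-from-db"
            finally:
                store.pop(key + ":lock", None)  # 5. release the lock
            return cache[key]
        time.sleep(0.05)                        # losers sleep 50ms, retry
```

Running fifty concurrent requests against an empty cache produces exactly one "database" call; the other forty-nine are served from the repopulated cache.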
3. Decision Matrix: Redis vs Memcached
| Feature | Redis | Memcached |
|---|---|---|
| Data Types | Strings, Lists, Sets, Hashes, Bitmaps | Strings Only (Binary Blobs) |
| Architecture | Single-Threaded (Event Loop) | Multi-Threaded |
| Persistence | Yes (RDB + AOF) | No (Pure In-Memory) |
| Clustering | Native (Hash Slots) | Client-Side Only (Consistent Hashing) |
| Best For | Complex Apps, Queues, Leaderboards | Simple KV caching that scales vertically across CPU cores |
Key Takeaway: Protect the DB by using Bloom Filters for Cache Penetration and Mutex Locks for Thundering Herds.
4. Anatomy of a Cache Entry (LRU Example)
To understand Pointer Overhead, let’s dissect what a single Redis string key actually looks like in memory (simplified):
| Component | Size | Purpose |
|---|---|---|
| Key String | Variable (e.g., 10 bytes) | The actual key (user:123). |
| Value String | Variable (e.g., 20 bytes) | The cached payload ({"name":"Alice"}). |
| Metadata (TTL) | 8 bytes | Expiration timestamp. |
| LRU Pointers (Prev/Next) | 16 bytes (8 bytes each) | Pointers to maintain the Double-Linked List for the LRU eviction queue. |
| Hash Table Pointer | 8 bytes | Pointer from the main hash slot to this entry. |
| Total Overhead | ~32+ bytes | Extra memory consumed per entry! |
Key Insight: Caching 1 million tiny 5-byte values doesn’t take 5MB of RAM; due to Pointer Overhead and metadata, it might take 40MB+. This is why hardware intuition matters!
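The back-of-envelope behind that insight, using the component sizes from the table plus an assumed 10-byte key (the key length is an illustrative assumption, not a Redis constant):

```python
# 1 million entries, each holding a tiny 5-byte value.
N = 1_000_000
value_bytes = 5
key_bytes = 10        # e.g. a short key like user:123 (assumed)
overhead_bytes = 32   # TTL (8) + LRU prev/next (16) + hash pointer (8)

naive_mb = N * value_bytes / 1e6
actual_mb = N * (value_bytes + key_bytes + overhead_bytes) / 1e6
print(f"naive: {naive_mb:.0f} MB, with overhead: {actual_mb:.0f} MB")
# naive: 5 MB, with overhead: 47 MB
```

The payload is about 11% of the footprint; the rest is keys and bookkeeping, which is why capacity planning from value sizes alone goes badly wrong.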
5. Hardware-First System Design Checklist
| Layer | Item | Why? |
|---|---|---|
| CPU | Are we using arrays for spatial locality? | Hits L1/L2 cache lines (64 bytes). |
| RAM | Have we accounted for Pointer Overhead? | LRU metadata can increase RAM usage by 40%+. |
| Network | Is BGP Anycast routing to the nearest PoP? | Minimizes the speed-of-light penalty by shortening the physical path. |
| Disk | Do we have Battery-Backed (BBU) RAID? | Safely use Write-Back caching for massive throughput. |
| Kernel | Is vm.dirty_ratio tuned correctly? | Optimizes Write Coalescing and reduces SSD wear. |
6. Staff Engineer Challenge
[!IMPORTANT] The Micro-Caching Paradox: Imagine you have a highly dynamic “Real-time Stock Price” service. You can’t cache the prices for 60 seconds because data staleness would be illegal. However, your database is catching fire under 100,000 requests per second.
Challenge: How would you use Micro-Caching (TTL < 1s) and Request Collapsing to save the hardware without violating business requirements?
Hint: Think about the “Thundering Herd” and how many requests arrive within the same 10ms window.
7. Glossary
Need to review terms? Check out the System Design Glossary.