Memory & Cache Optimization
[!NOTE] This module explores the core principles of Memory & Cache Optimization, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Hook: The CPU is Fast, RAM is Slow
Your CPU operates at 4 GHz (0.25 ns per cycle). Reading from RAM takes ~100 ns. That is 400 CPU cycles waiting for data. This is why Cache Layout matters more than instruction count for high-performance systems.
2. The Concepts
A. Spatial Locality & Cache Lines
CPUs fetch memory in chunks of 64 bytes (a Cache Line).
- Good: Iterating an
int[]array. (Next int is in the same cache line). - Bad: Iterating a
LinkedList. (Next node is at random address → Cache Miss).
B. Struct Padding & Alignment
CPUs read words aligned to 4 or 8 bytes.
struct Bad {
char c; // 1 byte
// 7 bytes wasted padding!
long x; // 8 bytes
};
Optimization: Reorder fields from largest to smallest to minimize padding wastage.
C. False Sharing
If two threads modify different variables that sit on the same cache line, the cores force each other to invalidate the cache line repeatedly. Solution: Pad variables to ensure they live on different 64-byte lines.
3. Data-Oriented Design (DoD)
OOP (Object Oriented Programming): class Ball { float x, y; Color c; }. Array of Balls.
- Scatter memory. Pointer chasing.
DoD: class BallManager { float[] xs; float[] ys; Color[] cs; }. Structure of Arrays (SoA).
- Benefit: If you only need to update positions, you iterate
xsandys. You never loadcsinto cache. This maximizes cache utilization.
4. Deep Dive Strategy Lab: Memory
Intuition Through Analogy
RAM is a Warehouse. L1 Cache is your Backpack.
- Linked List: You run to the warehouse, grab one item, come back. Run back for the next.
- Array: You take a forklift and bring a whole pallet (Cache Line) to your desk. You work on 16 items instantly before needing to go back.
Mathematical Anchor: Latency Numbers
Jeff Dean’s famous numbers (approximate):
- L1 Cache Ref: 0.5 ns
- L2 Cache Ref: 7 ns
- Main Memory Ref: 100 ns
- SSD Random Read: 150,000 ns
- Network Round Trip: 150,000,000 ns
Optimization is shifting work up this list.