Memory & Cache Optimization

[!NOTE] This module derives memory and cache optimization techniques from first principles and the hardware constraints that motivate them.

1. The Hook: The CPU is Fast, RAM is Slow

A modern CPU runs at roughly 4 GHz (0.25 ns per cycle), while a read from main memory takes ~100 ns. That is ~400 CPU cycles spent waiting for a single piece of data. This is why, in high-performance systems, cache layout often matters more than instruction count.


2. The Concepts

A. Spatial Locality & Cache Lines

CPUs fetch memory in fixed-size chunks, typically 64 bytes, called a Cache Line.

  • Good: Iterating an int[] array. (The next int is almost always in the same cache line.)
  • Bad: Iterating a LinkedList. (The next node lives at an arbitrary heap address → Cache Miss.)

B. Struct Padding & Alignment

CPUs read words at addresses aligned to their size (typically 4 or 8 bytes), so compilers insert padding bytes to keep each field at its natural alignment.

struct Bad {
  char c;   // 1 byte
            // 7 bytes of padding so x starts at an 8-byte boundary
  long x;   // 8 bytes
};          // sizeof(Bad) == 16 on a typical 64-bit ABI

Optimization: Reorder fields from largest to smallest to minimize padding.

C. False Sharing

If two threads modify different variables that sit on the same cache line, the cores force each other to invalidate the cache line repeatedly. Solution: Pad variables to ensure they live on different 64-byte lines.


3. Data-Oriented Design (DoD)

OOP (Object-Oriented Programming): class Ball { float x, y; Color c; }, stored as an array of Ball objects (Array of Structures, AoS).

  • Position and color data are interleaved in memory; in languages that store objects by reference, iteration also means pointer chasing through scattered heap allocations.

DoD: class BallManager { float[] xs; float[] ys; Color[] cs; }. Structure of Arrays (SoA).

  • Benefit: If you only need to update positions, you iterate xs and ys. You never load cs into cache. This maximizes cache utilization.

4. Deep Dive Strategy Lab: Memory

Intuition Through Analogy

RAM is a Warehouse. L1 Cache is your Backpack.

  • Linked List: You run to the warehouse, grab one item, come back. Run back for the next.
  • Array: You take a forklift and bring a whole pallet (Cache Line) to your desk. You work on 16 items instantly before needing to go back.

Mathematical Anchor: Latency Numbers

Jeff Dean’s famous numbers (approximate):

  • L1 Cache Ref: 0.5 ns
  • L2 Cache Ref: 7 ns
  • Main Memory Ref: 100 ns
  • SSD Random Read: 150,000 ns
  • Network Round Trip (cross-continent): 150,000,000 ns

Optimization is shifting work up this list.