Accelerating Descent: Momentum & Adam

1. Introduction: SGD is Slow

Stochastic Gradient Descent (SGD) has two well-known problems:

  1. Zig-Zagging: In narrow ravines (surfaces much steeper in one direction than another), it bounces between the steep walls instead of moving along the valley floor.
  2. Saddle Points: It slows to a crawl on flat plateaus and near saddle points, where the gradient is close to zero.

We need optimizers with “velocity” (a memory of past gradients) and “intelligence” (a step size tuned per parameter).
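To make the zig-zagging concrete, here is a minimal sketch of plain gradient descent on a toy 2-D “ravine” loss. The loss function, starting point, and learning rate are illustrative choices, not the ones used by the visualizer later on this page.

```python
import numpy as np

# Toy "ravine": loss = 0.5 * (10*x^2 + y^2) -- steep in x, shallow in y.
def grad(w):
    return np.array([10.0 * w[0], 1.0 * w[1]])

def sgd_step(w, g, lr=0.18):
    """Plain gradient step: move straight down the current gradient."""
    return w - lr * g

w = np.array([1.0, 1.0])
for t in range(10):
    w = sgd_step(w, grad(w))
    print(t, w)  # x flips sign every step (zig-zag across the ravine); y shrinks only slowly
```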


2. Momentum

Imagine a heavy ball rolling down a hill.

  • Physics: It gains speed. If it hits a small bump, its momentum carries it over.
  • Math: \(v_{t+1} = \beta v_t + (1-\beta) \nabla L\), then \(w_{t+1} = w_t - \alpha v_{t+1}\). The velocity \(v\) is an exponential moving average of past gradients, which smooths out the path (see the sketch below).
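The two equations translate almost line-for-line into code. A minimal sketch on the same toy ravine as above (the values 0.18 for \(\alpha\) and 0.9 for \(\beta\) are illustrative):

```python
import numpy as np

def grad(w):
    # Same toy ravine as above: loss = 0.5 * (10*x^2 + y^2)
    return np.array([10.0 * w[0], 1.0 * w[1]])

def momentum_step(w, v, g, lr=0.18, beta=0.9):
    """v_{t+1} = beta*v_t + (1 - beta)*grad;  w_{t+1} = w_t - lr*v_{t+1}"""
    v = beta * v + (1.0 - beta) * g   # exponential moving average of gradients
    w = w - lr * v                    # step along the smoothed direction
    return w, v

w, v = np.array([1.0, 1.0]), np.zeros(2)
for t in range(50):
    w, v = momentum_step(w, v, grad(w))
print(w)  # w approaches the minimum at (0, 0); v is a running average of recent gradients
```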

3. RMSProp & Adam (Adaptive Learning Rates)

Different parameters need different learning rates.

  • Sparse features: Parameters tied to rare features (e.g., embeddings of rare words in NLP) receive gradients infrequently, so they benefit from larger updates when a gradient does arrive.
  • Adam (Adaptive Moment Estimation): Combines Momentum (a running mean of gradients, the first moment) with RMSProp (dividing the step by the square root of a running mean of squared gradients, the second moment), plus a bias correction for the zero-initialized moments.

Adam (and its weight-decay variant, AdamW) is the default optimizer choice for most deep learning models today.
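A minimal sketch of the Adam update on the same toy example (0.9, 0.999, and 1e-8 are the commonly used default hyperparameters; the learning rate 0.1 is illustrative):

```python
import numpy as np

def grad(w):
    # Same toy ravine: loss = 0.5 * (10*x^2 + y^2)
    return np.array([10.0 * w[0], 1.0 * w[1]])

def adam_step(w, m, v, g, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum's first moment plus RMSProp's second moment."""
    m = beta1 * m + (1 - beta1) * g        # 1st moment: running mean of gradients (velocity)
    v = beta2 * v + (1 - beta2) * g**2     # 2nd moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction: both moments start at zero
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 11):
    w, m, v = adam_step(w, m, v, grad(w), t)
    print(t, w)  # early on, both coordinates shrink by roughly lr per step,
                 # even though the x gradient is 10x larger than the y gradient
```

The division by \(\sqrt{\hat{v}}\) is what gives rarely-updated, small-gradient parameters their relatively larger steps.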


4. Interactive Visualizer: The Great Race

Watch three balls race to the center (Minimum).

  • Red (SGD): Slow, gets confused by the noise.
  • Blue (Momentum): Builds speed, overshoots slightly but corrects.
  • Green (Adam): Fast and precise.
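There is no animation in this text version, but a rough stand-in for the race is easy to script: run all three update rules on the toy ravine for a fixed budget of steps and compare how close each gets to the minimum. The learning rates below are illustrative, and the final ranking depends on them and on the step budget.

```python
import numpy as np

grad = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])  # toy ravine from above

def sgd(w, g, state, t, lr=0.05):
    return w - lr * g, state

def momentum(w, g, v, t, lr=0.05, beta=0.9):
    v = beta * v + (1 - beta) * g
    return w - lr * v, v

def adam(w, g, state, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m, v = state
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    w = w - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return w, (m, v)

def race(step_fn, state, steps=100):
    w = np.array([1.0, 1.0])
    for t in range(1, steps + 1):
        w, state = step_fn(w, grad(w), state, t)
    return np.linalg.norm(w)  # distance to the minimum at (0, 0)

z = np.zeros(2)
for name, fn, s0 in [("SGD", sgd, None), ("Momentum", momentum, z), ("Adam", adam, (z, z))]:
    print(f"{name:9s} final distance: {race(fn, s0):.4f}")
```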

5. Summary

  • SGD: Baseline, can be slow.
  • Momentum: Adds velocity to plow through noise and valleys.
  • Adam: Adapts the step size per parameter, speeding up where gradients are consistently small (flat regions) and slowing down where they are large (steep cliffs).

Next: Constrained Optimization →