Accelerating Descent: Momentum & Adam
[!NOTE] This module builds up Momentum and Adam from first principles, with from-scratch Python implementations and a visual comparison against plain gradient descent.
1. Introduction: The Need for Speed
Stochastic Gradient Descent (SGD) is like walking down a mountain in a thick fog: you look at your feet, take a step in the locally steepest downhill direction, and repeat.
The Problems:
- Zig-Zagging: In narrow ravines (common in Deep Learning), the gradient points across the ravine, not down it. SGD oscillates wildly.
- Saddle Points: On flat plateaus, the gradient ∇ L ≈ 0. SGD stops moving.
- Local Minima: It has no energy to escape shallow traps.
We need algorithms with Velocity (Momentum) and Intelligence (Adaptive Learning Rates).
2. Momentum: Adding Physics
Imagine a heavy ball rolling down the hill. It accumulates Momentum.
- If the gradient keeps pointing in the same direction, the ball speeds up.
- If the gradient changes direction (zig-zag), the accumulated velocity carries it forward, smoothing out the path.
The Algorithm
We introduce a velocity vector v, an exponential moving average of the gradient. β (usually 0.9) is the friction or decay rate:

v_t = β · v_{t−1} + (1 − β) · ∇L(w_t)
w_{t+1} = w_t − η · v_t

- Result: The optimizer “remembers” its previous direction. It plows through saddle points and dampens oscillations in ravines.
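The dampening effect is easy to check numerically. In this minimal sketch (illustrative values, not part of the module's later implementation), the velocity accumulator is fed one gradient component that flips sign every step, like the cross-ravine zig-zag, and one that stays constant, like the down-ravine direction:

```python
beta = 0.9
v_zigzag, v_steady = 0.0, 0.0
for t in range(50):
    # Cross-ravine component flips sign each step; down-ravine stays constant.
    g_zigzag = 5.0 if t % 2 == 0 else -5.0
    g_steady = 5.0
    # Same velocity update for both components
    v_zigzag = beta * v_zigzag + (1 - beta) * g_zigzag
    v_steady = beta * v_steady + (1 - beta) * g_steady
print(abs(v_zigzag), v_steady)  # oscillation is damped; steady direction nears 5.0
```

The oscillating component averages toward zero while the consistent component accumulates almost the full gradient magnitude, which is exactly why the ball speeds up down the ravine while the zig-zag flattens out.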
3. Adam: Adaptive Moment Estimation
Adam combines the best of two worlds:
- Momentum: Keeps track of the “Average Gradient” (First Moment, m).
- RMSProp: Keeps track of the “Variance of the Gradient” (Second Moment, v).
It scales the learning rate individually for each parameter. Parameters whose recent gradients have been large get smaller effective steps; parameters with small or sparse gradients get larger ones.
The Algorithm
With gradient g_t = ∇L(w_t):
- Update Moments:
  m_t = β₁ · m_{t−1} + (1 − β₁) · g_t
  v_t = β₂ · v_{t−1} + (1 − β₂) · g_t²
- Bias Correction (crucial at the start, since m and v are initialized to zero):
  m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)
- Update Weights:
  w_{t+1} = w_t − η · m̂_t / (√v̂_t + ε)
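Why the bias correction matters at the start can be verified with a single step of the update (the gradient value here is illustrative): since m is initialized to zero, the raw first moment after one step is only (1 − β₁) of the true gradient, and dividing by (1 − β₁ᵗ) restores the scale.

```python
beta1 = 0.9
g = 4.0   # illustrative constant gradient
m = 0.0
m = beta1 * m + (1 - beta1) * g   # raw moment ≈ 0.4, badly underestimates g
m_hat = m / (1 - beta1 ** 1)      # bias-corrected back to ≈ 4.0
print(m, m_hat)
```

Without this correction, Adam's early steps would be scaled down by roughly 10× (and the second moment by 1000×, since β₂ = 0.999), distorting the first few hundred updates.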
4. Python Implementation
Let’s implement these from scratch to see the difference.
```python
import numpy as np

class Optimizer:
    def __init__(self, lr=0.01):
        self.lr = lr

    def step(self, w, grad):
        raise NotImplementedError

class SGD(Optimizer):
    def step(self, w, grad):
        return w - self.lr * grad

class Momentum(Optimizer):
    def __init__(self, lr=0.01, beta=0.9):
        super().__init__(lr)
        self.beta = beta
        self.v = 0  # velocity; broadcasts over array-valued w

    def step(self, w, grad):
        # Update velocity: exponential moving average of the gradient
        self.v = self.beta * self.v + (1 - self.beta) * grad
        # Update weights
        return w - self.lr * self.v

class Adam(Optimizer):
    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999):
        super().__init__(lr)
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = 0   # first moment (mean of gradients)
        self.v = 0   # second moment (uncentered variance)
        self.t = 0   # timestep, needed for bias correction
        self.eps = 1e-8

    def step(self, w, grad):
        self.t += 1
        # Update moments
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grad**2)
        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        # Update weights
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```
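As a quick sanity check, the same three update rules can be raced on a toy ravine loss. The quadratic below (steep along one axis, shallow along the other) and all hyperparameters are illustrative choices for this sketch:

```python
import numpy as np

# Illustrative ravine: steep along w[1], shallow along w[0]
def loss(w):
    return 0.5 * w[0] ** 2 + 12.5 * w[1] ** 2

def grad(w):
    return np.array([w[0], 25.0 * w[1]])

lr, beta, beta1, beta2, eps = 0.03, 0.9, 0.9, 0.999, 1e-8
final = {}

# SGD
w = np.array([5.0, 2.0])
for _ in range(300):
    w = w - lr * grad(w)
final["SGD"] = loss(w)

# Momentum
w, v = np.array([5.0, 2.0]), 0.0
for _ in range(300):
    g = grad(w)
    v = beta * v + (1 - beta) * g
    w = w - lr * v
final["Momentum"] = loss(w)

# Adam (with bias correction, t starting at 1)
w, m, v = np.array([5.0, 2.0]), 0.0, 0.0
for t in range(1, 301):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
final["Adam"] = loss(w)

for name, l in final.items():
    print(f"{name:8s} final loss: {l:.2e}")
```

All three converge on this mild example; the interesting differences show up in the trajectories (SGD oscillates across the steep axis, Momentum smooths it out, Adam equalizes the two scales), which is what the visualizer in the next section animates.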
5. Interactive Visualizer: The Great Race
Watch three balls race to the minimum of a “Ravine” function (steep in one direction, flat in the other). This geometry is notorious for trapping SGD.
- SGD (Red): Struggles, zig-zags violently, or crawls slowly.
- Momentum (Blue): Builds speed, might overshoot slightly, but corrects and reaches the goal fast.
- Adam (Green): The surgical instrument. Adapts to the scale difference and moves directly to the target.
6. Summary
- SGD: Good baseline, but struggles with ravines (oscillation) and saddle points (stagnation).
- Momentum: Adds physical mass (v) to the optimizer, allowing it to plow through flat regions and smooth out zig-zags.
- Adam: The most common default. Combines Momentum with Adaptive Learning Rates (RMSProp), scaling the step size individually for each parameter.