Accelerating Descent: Momentum & Adam

[!NOTE] This module builds Momentum and Adam up from first principles: what slows plain gradient descent down, and how velocity and adaptive learning rates fix it.

1. Introduction: The Need for Speed

Stochastic Gradient Descent (SGD) is like walking down a mountain in thick fog: you look at your feet, take a step in the steepest downhill direction, and repeat.

wₜ₊₁ = wₜ - η ∇L(wₜ)

The Problems:

  1. Zig-Zagging: In narrow ravines (common in Deep Learning), the gradient points across the ravine, not down it. SGD oscillates wildly.
  2. Saddle Points: On flat plateaus, the gradient ∇ L ≈ 0. SGD stops moving.
  3. Local Minima: It has no energy to escape shallow traps.

We need algorithms with Velocity (Momentum) and Intelligence (Adaptive Learning Rates).
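
To make the zig-zag concrete, here is a minimal sketch; the 2-D "ravine" loss L(w) = 0.5·(100x² + y²) and the learning rate are illustrative choices, not from any particular model:

```python
import numpy as np

def grad(w):
    # Gradient of the ravine loss L(w) = 0.5 * (100 * x^2 + y^2)
    return np.array([100.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
lr = 0.019  # just under the stability limit 2/100 for the steep axis
for _ in range(50):
    w = w - lr * grad(w)
    # x is multiplied by (1 - 100*lr) = -0.9 each step: sign flips, the zig-zag
    # y is multiplied by (1 - lr) = 0.981 each step: the slow crawl

print(w)  # x has oscillated down to ~0.005; y is still ~0.38
```

Both pathologies appear at once: the steep axis forces a tiny learning rate, and that tiny rate starves the flat axis.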


2. Momentum: Adding Physics

Imagine a heavy ball rolling down the hill. It accumulates Momentum.

  • If the gradient keeps pointing in the same direction, the ball speeds up.
  • If the gradient changes direction (zig-zag), the accumulated velocity carries it forward, smoothing out the path.

The Algorithm

We introduce a velocity vector v; β (usually 0.9) is the decay rate, acting like friction. (This is the exponential-moving-average form of momentum; the classical heavy-ball variant drops the (1 - β) factor.)

vₜ₊₁ = β vₜ + (1 - β) ∇L(wₜ)
wₜ₊₁ = wₜ - η vₜ₊₁
  • Result: The optimizer “remembers” its previous direction. It plows through saddle points and dampens oscillations in ravines.
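
A quick numeric check of both behaviors (pure Python; the ±1 gradient stream is a made-up stand-in for the across-the-ravine oscillation):

```python
beta = 0.9

# Constant gradient: the EMA velocity converges to the gradient itself,
# so in steady directions Momentum moves at full speed.
v_const, g = 0.0, 1.0
for _ in range(100):
    v_const = beta * v_const + (1 - beta) * g
print(v_const)  # -> ~1.0

# Alternating gradient (the zig-zag): successive gradients cancel inside
# the EMA, so the velocity -- and hence the step -- is heavily damped.
v_alt = 0.0
for t in range(100):
    g = 1.0 if t % 2 == 0 else -1.0
    v_alt = beta * v_alt + (1 - beta) * g
print(abs(v_alt))  # -> ~0.05, roughly 20x smaller than the raw gradient
```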

3. Adam: Adaptive Moment Estimation

Adam combines the best of both worlds:

  1. Momentum: Keeps track of the “Average Gradient” (First Moment, m).
  2. RMSProp: Keeps track of the mean squared gradient (the uncentered Second Moment, v).

It scales the learning rate individually for each parameter. Parameters with large gradients get smaller steps; parameters with small gradients get larger steps.
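
That scaling can be sketched in two lines, under the simplifying assumption that the gradient has been steady long enough that m̂ ≈ g and v̂ ≈ g² (the full update follows below):

```python
import numpy as np

lr, eps = 0.01, 1e-8
g = np.array([100.0, 0.1])  # one steep parameter, one flat one

# With m_hat ≈ g and v_hat ≈ g^2, the step reduces to lr * g / |g|:
step = lr * g / (np.sqrt(g**2) + eps)
print(step)  # both parameters move by roughly lr = 0.01
```

Despite a 1000x difference in gradient magnitude, both parameters take a step of about the same size.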

The Algorithm

  1. Update Moments:
mₜ₊₁ = β₁ mₜ + (1 - β₁) ∇L
vₜ₊₁ = β₂ vₜ + (1 - β₂) (∇L)²
  2. Bias Correction (crucial at the start, when the zero-initialized moments are still biased toward zero!):
m̂ = mₜ₊₁ / (1 - β₁ᵗ),   v̂ = vₜ₊₁ / (1 - β₂ᵗ)
  3. Update Weights:
wₜ₊₁ = wₜ - η · m̂ / (√v̂ + ε)
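
Why the correction matters: the moments start at zero, so for small t the raw EMAs underestimate the true gradient statistics. A one-step check (the gradient value g = 2 is an arbitrary illustration):

```python
beta1, beta2, t = 0.9, 0.999, 1
g = 2.0  # arbitrary first gradient

m = (1 - beta1) * g      # raw first moment after one step: 0.2, far below g
v = (1 - beta2) * g**2   # raw second moment: 0.004, far below g^2

m_hat = m / (1 - beta1**t)  # bias-corrected: recovers g
v_hat = v / (1 - beta2**t)  # bias-corrected: recovers g^2

print(m, v, m_hat, v_hat)
```

The corrected values match the true gradient statistics exactly at t = 1; feeding the raw, biased estimates into the update would distort the effective step size during the first iterations.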

4. Python Implementation

Let’s implement these from scratch to see the difference.

import numpy as np

class Optimizer:
    def __init__(self, lr=0.01):
        self.lr = lr
    def step(self, w, grad):
        raise NotImplementedError

class SGD(Optimizer):
    def step(self, w, grad):
        return w - self.lr * grad

class Momentum(Optimizer):
    def __init__(self, lr=0.01, beta=0.9):
        super().__init__(lr)
        self.beta = beta
        self.v = 0

    def step(self, w, grad):
        # Update velocity
        self.v = self.beta * self.v + (1 - self.beta) * grad
        # Update weights
        return w - self.lr * self.v

class Adam(Optimizer):
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        super().__init__(lr)
        self.beta1 = beta1  # decay rate for the first moment (mean of gradients)
        self.beta2 = beta2  # decay rate for the second moment (mean of squared gradients)
        self.eps = eps      # numerical stability term in the denominator
        self.m = 0.0
        self.v = 0.0
        self.t = 0          # step counter, needed for bias correction

    def step(self, w, grad):
        self.t += 1
        # Update moments
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grad**2)

        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)

        # Update weights
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
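
Here is a condensed, self-contained race of the same three update rules on a toy ravine loss L(w) = 0.5·(100x² + y²); the learning rate, β values, and step count are illustrative choices, not tuned values:

```python
import numpy as np

def grad(w):
    # Gradient of the toy ravine loss L(w) = 0.5 * (100 * x^2 + y^2)
    return np.array([100.0 * w[0], w[1]])

w0 = np.array([1.0, 1.0])
steps, lr = 200, 0.01

# --- SGD ---
w = w0.copy()
for _ in range(steps):
    w = w - lr * grad(w)
sgd_dist = np.linalg.norm(w)

# --- Momentum (EMA form, as above) ---
w, v, beta = w0.copy(), np.zeros(2), 0.9
for _ in range(steps):
    v = beta * v + (1 - beta) * grad(w)
    w = w - lr * v
mom_dist = np.linalg.norm(w)

# --- Adam ---
w, m, s = w0.copy(), np.zeros(2), np.zeros(2)
b1, b2, eps = 0.9, 0.999, 1e-8
for t in range(1, steps + 1):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    s_hat = s / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
adam_dist = np.linalg.norm(w)

print(f"SGD: {sgd_dist:.3f}  Momentum: {mom_dist:.3f}  Adam: {adam_dist:.3f}")
```

On this geometry Adam ends closest to the minimum: its per-parameter scaling moves the flat y coordinate as fast as the steep x coordinate.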

5. Interactive Visualizer: The Great Race

Watch three balls race to the minimum of a “Ravine” function (steep in one direction, flat in the other). This geometry is notorious for trapping SGD.

  • SGD (Red): Struggles, zig-zags violently, or crawls slowly.
  • Momentum (Blue): Builds speed, might overshoot slightly, but corrects and reaches the goal fast.
  • Adam (Green): The surgical instrument. Adapts to the scale difference and moves directly to the target.

6. Summary

  • SGD: Good baseline, but struggles with ravines (oscillation) and saddle points (stagnation).
  • Momentum: Adds inertia (a velocity v) to the optimizer, allowing it to plow through flat regions and smooth out zig-zags.
  • Adam: The gold standard. Combines Momentum with adaptive per-parameter learning rates (from RMSProp), scaling each parameter's step by a running estimate of its gradient magnitude.