Accelerating Descent: Momentum & Adam

[!NOTE] This module builds Momentum and Adam up from first principles: what slows plain gradient descent down, and how velocity and adaptive learning rates fix it.

1. Introduction: The Need for Speed

Stochastic Gradient Descent (SGD) is like walking down a mountain in thick fog: you look at your feet, take a step in the steepest downhill direction, and repeat.

wₜ₊₁ = wₜ - η ∇L(wₜ)

The Problems:

  1. Zig-Zagging: In narrow ravines (common in Deep Learning), the gradient points across the ravine, not down it. SGD oscillates wildly.
  2. Saddle Points: On flat plateaus, the gradient ∇ L ≈ 0. SGD stops moving.
  3. Local Minima: It has no energy to escape shallow traps.

We need algorithms with Velocity (Momentum) and Intelligence (Adaptive Learning Rates).
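
To make the zig-zag concrete, here is a minimal sketch; the 2-D "ravine" loss L(w) = 0.5·(100x² + y²) and the learning rate are illustrative choices, not from any particular model:

```python
import numpy as np

def grad(w):
    # Gradient of the ravine loss L(w) = 0.5 * (100 * x^2 + y^2)
    return np.array([100.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
lr = 0.019  # just under the stability limit 2/100 for the steep axis
for _ in range(50):
    w = w - lr * grad(w)
    # x is multiplied by (1 - 100*lr) = -0.9 each step: sign flips, the zig-zag
    # y is multiplied by (1 - lr) = 0.981 each step: the slow crawl

print(w)  # x has oscillated down to ~0.005; y is still ~0.38
```

Both pathologies appear at once: the steep axis forces a tiny learning rate, and that tiny rate starves the flat axis.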


2. Momentum: Adding Physics

Imagine a heavy ball rolling down the hill. It accumulates Momentum.

  • If the gradient keeps pointing in the same direction, the ball speeds up.
  • If the gradient changes direction (zig-zag), the accumulated velocity carries it forward, smoothing out the path.

The Algorithm

We introduce a velocity vector v; β (usually 0.9) is the decay rate, acting like friction. (This is the exponential-moving-average form of momentum; the classical heavy-ball variant drops the (1 - β) factor.)

vₜ₊₁ = β vₜ + (1 - β) ∇L(wₜ)
wₜ₊₁ = wₜ - η vₜ₊₁
  • Result: The optimizer “remembers” its previous direction. It plows through saddle points and dampens oscillations in ravines.
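
A quick numeric check of both behaviors (pure Python; the ±1 gradient stream is a made-up stand-in for the across-the-ravine oscillation):

```python
beta = 0.9

# Constant gradient: the EMA velocity converges to the gradient itself,
# so in steady directions Momentum moves at full speed.
v_const, g = 0.0, 1.0
for _ in range(100):
    v_const = beta * v_const + (1 - beta) * g
print(v_const)  # -> ~1.0

# Alternating gradient (the zig-zag): successive gradients cancel inside
# the EMA, so the velocity -- and hence the step -- is heavily damped.
v_alt = 0.0
for t in range(100):
    g = 1.0 if t % 2 == 0 else -1.0
    v_alt = beta * v_alt + (1 - beta) * g
print(abs(v_alt))  # -> ~0.05, roughly 20x smaller than the raw gradient
```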

3. Adam: Adaptive Moment Estimation

Adam combines the best of both worlds:

  1. Momentum: Keeps track of the “Average Gradient” (First Moment, m).
  2. RMSProp: Keeps track of the mean squared gradient (the uncentered Second Moment, v).

It scales the learning rate individually for each parameter. Parameters with large gradients get smaller steps; parameters with small gradients get larger steps.
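
That scaling can be sketched in two lines, under the simplifying assumption that the gradient has been steady long enough that m̂ ≈ g and v̂ ≈ g² (the full update follows below):

```python
import numpy as np

lr, eps = 0.01, 1e-8
g = np.array([100.0, 0.1])  # one steep parameter, one flat one

# With m_hat ≈ g and v_hat ≈ g^2, the step reduces to lr * g / |g|:
step = lr * g / (np.sqrt(g**2) + eps)
print(step)  # both parameters move by roughly lr = 0.01
```

Despite a 1000x difference in gradient magnitude, both parameters take a step of about the same size.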

The Algorithm

  1. Update Moments:
mₜ₊₁ = β₁ mₜ + (1 - β₁) ∇L
vₜ₊₁ = β₂ vₜ + (1 - β₂) (∇L)²
  2. Bias Correction (crucial at the start, when the zero-initialized moments are still biased toward zero!):
m̂ = mₜ₊₁ / (1 - β₁ᵗ),   v̂ = vₜ₊₁ / (1 - β₂ᵗ)
  3. Update Weights:
wₜ₊₁ = wₜ - η · m̂ / (√v̂ + ε)
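
Why the correction matters: the moments start at zero, so for small t the raw EMAs underestimate the true gradient statistics. A one-step check (the gradient value g = 2 is an arbitrary illustration):

```python
beta1, beta2, t = 0.9, 0.999, 1
g = 2.0  # arbitrary first gradient

m = (1 - beta1) * g      # raw first moment after one step: 0.2, far below g
v = (1 - beta2) * g**2   # raw second moment: 0.004, far below g^2

m_hat = m / (1 - beta1**t)  # bias-corrected: recovers g
v_hat = v / (1 - beta2**t)  # bias-corrected: recovers g^2

print(m, v, m_hat, v_hat)
```

The corrected values match the true gradient statistics exactly at t = 1; feeding the raw, biased estimates into the update would distort the effective step size during the first iterations.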

4. Python Implementation

Let’s implement these from scratch to see the difference.

import numpy as np

class Optimizer:
    def __init__(self, lr=0.01):
        self.lr = lr
    def step(self, w, grad):
        raise NotImplementedError

class SGD(Optimizer):
    def step(self, w, grad):
        return w - self.lr * grad

class Momentum(Optimizer):
    def __init__(self, lr=0.01, beta=0.9):
        super().__init__(lr)
        self.beta = beta
        self.v = 0

    def step(self, w, grad):
        # Update velocity
        self.v = self.beta * self.v + (1 - self.beta) * grad
        # Update weights
        return w - self.lr * self.v

class Adam(Optimizer):
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        super().__init__(lr)
        self.beta1 = beta1  # decay rate for the first moment (mean of gradients)
        self.beta2 = beta2  # decay rate for the second moment (mean of squared gradients)
        self.eps = eps      # numerical stability term in the denominator
        self.m = 0.0
        self.v = 0.0
        self.t = 0          # step counter, needed for bias correction

    def step(self, w, grad):
        self.t += 1
        # Update moments
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grad**2)

        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)

        # Update weights
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
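
Here is a condensed, self-contained race of the same three update rules on a toy ravine loss L(w) = 0.5·(100x² + y²); the learning rate, β values, and step count are illustrative choices, not tuned values:

```python
import numpy as np

def grad(w):
    # Gradient of the toy ravine loss L(w) = 0.5 * (100 * x^2 + y^2)
    return np.array([100.0 * w[0], w[1]])

w0 = np.array([1.0, 1.0])
steps, lr = 200, 0.01

# --- SGD ---
w = w0.copy()
for _ in range(steps):
    w = w - lr * grad(w)
sgd_dist = np.linalg.norm(w)

# --- Momentum (EMA form, as above) ---
w, v, beta = w0.copy(), np.zeros(2), 0.9
for _ in range(steps):
    v = beta * v + (1 - beta) * grad(w)
    w = w - lr * v
mom_dist = np.linalg.norm(w)

# --- Adam ---
w, m, s = w0.copy(), np.zeros(2), np.zeros(2)
b1, b2, eps = 0.9, 0.999, 1e-8
for t in range(1, steps + 1):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    s_hat = s / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
adam_dist = np.linalg.norm(w)

print(f"SGD: {sgd_dist:.3f}  Momentum: {mom_dist:.3f}  Adam: {adam_dist:.3f}")
```

On this geometry Adam ends closest to the minimum: its per-parameter scaling moves the flat y coordinate as fast as the steep x coordinate.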

5. Interactive Visualizer: The Great Race

Watch three balls race to the minimum of a “Ravine” function (steep in one direction, flat in the other). This geometry is notorious for trapping SGD.

  • SGD (Red): Struggles, zig-zags violently, or crawls slowly.
  • Momentum (Blue): Builds speed, might overshoot slightly, but corrects and reaches the goal fast.
  • Adam (Green): The surgical instrument. Adapts to the scale difference and moves directly to the target.

6. Summary

  • SGD: Good baseline, but struggles with ravines (oscillation) and saddle points (stagnation).
  • Momentum: Adds inertia (a velocity v) to the optimizer, allowing it to plow through flat regions and smooth out zig-zags.
  • Adam: The gold standard. Combines Momentum with adaptive per-parameter learning rates (from RMSProp), scaling each parameter's step by a running estimate of its gradient magnitude.