Optimizers: Navigating the Loss Landscape

Training a neural network is like hiking down a mountain in thick fog. You can only feel the slope under your feet (the gradient). Your goal is to reach the lowest valley (global minimum). Optimizers are the algorithms that decide which direction to take and how big a step to make.

1. Gradient Descent: The Basic Hiker

Stochastic Gradient Descent (SGD) is the simplest approach. It estimates the gradient from a mini-batch of data and takes a step proportional to the negative of that gradient.

  • Rule: w_{t+1} = w_t - η ∇L(w_t)
  • Pros: Simple, low memory footprint.
  • Cons: Can get stuck in local minima, converges slowly in “ravines”, and oscillates if the learning rate is too high.
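As a minimal sketch, here is the update rule applied to a toy one-dimensional loss f(w) = (w - 3)², whose gradient is 2(w - 3). The loss function and learning rate here are illustrative choices, not anything specific to this article:

```python
# Plain gradient descent on f(w) = (w - 3)^2, minimum at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)   # f'(w)

w = 0.0      # starting point
eta = 0.1    # learning rate
for _ in range(100):
    w = w - eta * grad(w)    # w_{t+1} = w_t - eta * grad

# After the loop, w is within ~1e-9 of the minimum at 3.0
```

Each step shrinks the distance to the minimum by a constant factor (1 - 2η here), which is why a too-large η makes the iterates overshoot and oscillate.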

2. Momentum: Adding Velocity

Imagine rolling a heavy ball down the hill. It gathers speed. Momentum helps accelerate SGD in the relevant direction and dampens oscillations.

  • Analogy: A heavy ball rolling down a hill.
  • Mechanism: Accumulates a “velocity” vector from past gradients.
  • Rule:
    1. v_{t+1} = β v_t + (1 - β) ∇L(w_t)
    2. w_{t+1} = w_t - η v_{t+1}
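The two-step rule above can be sketched in plain Python on the same kind of toy loss, f(w) = (w - 3)². The function and hyperparameter values are illustrative:

```python
# Momentum SGD on f(w) = (w - 3)^2, using the EMA form of the rule:
#   v <- beta * v + (1 - beta) * grad
#   w <- w - eta * v
def grad(w):
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9
for _ in range(500):
    v = beta * v + (1.0 - beta) * grad(w)   # accumulate velocity
    w = w - eta * v                         # step along the velocity

# w converges toward the minimum at 3.0
```

Because v averages recent gradients, components that flip sign from step to step (oscillations across a ravine) cancel out, while components that point consistently downhill accumulate.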

3. Adam: The Adaptive Expert

Adam (Adaptive Moment Estimation) combines the best of Momentum and RMSprop. It adapts the learning rate for each parameter.

  • Momentum: Keeps track of the average gradient (first moment).
  • RMSprop: Keeps track of the average squared gradient (the uncentered second moment).
  • Result: Each parameter's step is scaled by its recent gradient magnitudes — relatively large steps where gradients are small and consistent, damped steps where they are large or noisy — making it robust to the choice of learning rate.
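The two moving averages can be sketched in a few lines of plain Python. The constants (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are Adam's usual defaults; the toy loss f(w) = (w - 3)² is again an illustrative choice:

```python
import math

# Adam on f(w) = (w - 3)^2, with bias-corrected moment estimates.
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
m, v = 0.0, 0.0                       # first and second moment estimates
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):              # t starts at 1 for bias correction
    g = grad(w)
    m = b1 * m + (1 - b1) * g         # momentum: EMA of gradients
    v = b2 * v + (1 - b2) * g * g     # RMSprop: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)         # correct the zero-initialization bias
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

# w approaches the minimum at 3.0
```

Note that the per-step movement is roughly lr * m_hat / sqrt(v_hat): dividing by the root of the second moment is what normalizes the step size across parameters with very different gradient scales.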

4. Interactive: The Optimizer Race

Watch how different optimizers navigate a 2D loss surface towards the minimum (center).

[Interactive demo: SGD, Momentum, and Adam race across a 2D loss surface, with a step counter for each optimizer.]

5. PyTorch Implementation

In PyTorch, you rarely implement optimizers from scratch, but knowing how to configure and use them is crucial.

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model
model = nn.Sequential(
  nn.Linear(10, 5),
  nn.ReLU(),
  nn.Linear(5, 1)
)

# 1. Stochastic Gradient Descent (SGD)
# lr: learning rate, momentum: 0.9 (adds velocity)
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 2. Adam (Standard for most tasks)
# lr: learning rate (usually 1e-3 or 3e-4)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# Training loop snippet (assumes an existing DataLoader named `dataloader`)
criterion = nn.MSELoss()  # loss function; pick one to match your task

for inputs, targets in dataloader:
  optimizer_adam.zero_grad()   # Reset gradients
  outputs = model(inputs)      # Forward pass
  loss = criterion(outputs, targets)
  loss.backward()              # Backward pass
  optimizer_adam.step()        # Update weights

6. Custom Optimizer: Understanding the Guts

Here is how you might implement a simple SGD class in Python to understand the mechanics.

class SimpleSGD:
  def __init__(self, params, lr=0.01):
    self.params = list(params)
    self.lr = lr

  def step(self):
    # Iterate over all parameters
    for p in self.params:
      if p.grad is None:
        continue

      # Update rule: w = w - lr * grad
      # Perform the update inside torch.no_grad() so autograd does not
      # track it (preferred over mutating p.data in modern PyTorch)
      with torch.no_grad():
        p -= self.lr * p.grad

  def zero_grad(self):
    for p in self.params:
      if p.grad is not None:
        p.grad.zero_()

[!NOTE] Learning Rate Schedulers: Often, we want to decrease the learning rate as training progresses to fine-tune the weights. PyTorch provides torch.optim.lr_scheduler for this.
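As a sketch of what a scheduler does, an exponential schedule multiplies the learning rate by a constant factor γ each epoch (the values below are illustrative; torch.optim.lr_scheduler.ExponentialLR applies this same rule to an optimizer's learning rate):

```python
# Exponential learning-rate decay: lr_t = lr_0 * gamma ** epoch
lr0, gamma = 0.1, 0.9
lrs = [lr0 * gamma ** epoch for epoch in range(5)]
# lrs shrinks geometrically: 0.1, 0.09, 0.081, ...
```

Large early steps cover ground quickly; the shrinking rate lets the final epochs fine-tune the weights without overshooting.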

7. Summary

| Optimizer | Best Use Case | Analogy |
| --- | --- | --- |
| SGD | Simple problems, or when precise control is needed. | Walking downhill carefully. |
| Momentum | Noisy gradients, ravines. | Rolling a heavy ball downhill. |
| Adam | Default choice for most Deep Learning tasks. | A smart rover with adaptive wheels for different terrain. |