Optimizers: Navigating the Loss Landscape
Training a neural network is like hiking down a mountain in thick fog. You can only feel the slope under your feet (the gradient). Your goal is to reach the lowest valley (global minimum). Optimizers are the algorithms that decide which direction to take and how big a step to make.
1. Gradient Descent: The Basic Hiker
Stochastic Gradient Descent (SGD) is the simplest approach: it takes a step proportional to the negative of the gradient, typically computed on a randomly sampled mini-batch of data (hence "stochastic").
- Rule: w_{t+1} = w_t - η ∇L(w_t)
- Pros: Simple, low memory.
- Cons: Can get stuck in local minima, slow convergence in “ravines”, oscillates if learning rate is high.
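The update rule above can be sketched in a few lines of plain Python. The toy loss L(w) = (w - 3)² used here is a hypothetical example (its gradient is 2(w - 3), with the minimum at w = 3):

```python
def sgd_minimize(grad_fn, w, lr=0.1, steps=100):
    """Plain gradient descent: w <- w - lr * grad."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Toy loss L(w) = (w - 3)^2, gradient 2 * (w - 3); minimum at w = 3.
w_final = sgd_minimize(lambda w: 2 * (w - 3), w=0.0)  # converges close to 3.0
```

With a well-chosen learning rate this walks straight to the minimum; crank `lr` past 1.0 (for this loss) and you can watch the oscillation problem from the cons list above.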
2. Momentum: Adding Velocity
Imagine rolling a heavy ball down the hill. It gathers speed. Momentum helps accelerate SGD in the relevant direction and dampens oscillations.
- Analogy: A heavy ball rolling down a hill.
- Mechanism: Accumulates a “velocity” vector from past gradients.
- Rule:
  - v_{t+1} = β v_t + (1 - β) ∇L(w_t)
  - w_{t+1} = w_t - η v_{t+1}
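A minimal sketch of this two-step rule in plain Python, reusing the same hypothetical toy loss L(w) = (w - 3)²:

```python
def momentum_minimize(grad_fn, w, lr=0.1, beta=0.9, steps=300):
    """Momentum update: v <- beta*v + (1-beta)*grad, then w <- w - lr*v."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad_fn(w)  # accumulate velocity
        w = w - lr * v                           # step along the velocity
    return w

# Toy loss L(w) = (w - 3)^2, gradient 2 * (w - 3); minimum at w = 3.
w_final = momentum_minimize(lambda w: 2 * (w - 3), w=0.0)
```

Note the heavy-ball behavior: the iterate overshoots the minimum and spirals in, rather than approaching it monotonically as plain SGD does. (PyTorch's `optim.SGD` uses a slightly different convention, `v <- beta*v + grad`, without the `(1 - beta)` dampening factor by default.)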
3. Adam: The Adaptive Expert
Adam (Adaptive Moment Estimation) combines the best of Momentum and RMSprop. It adapts the learning rate for each parameter.
- Momentum: Keeps track of the average gradient (first moment).
- RMSprop: Keeps track of a moving average of the squared gradient (the uncentered second moment).
- Result: Each parameter gets its own effective step size: larger where past gradients have been small, smaller where they have been large. This lets it move fast across flat regions and stay stable on steep ones, making it very robust.
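Combining the two moment estimates gives the Adam update. A minimal sketch in plain Python, again on the hypothetical toy loss L(w) = (w - 3)², including the standard bias-correction terms (which the summary above glosses over):

```python
def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Adam: first moment m (momentum) + second moment v (RMSprop-style)."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g       # EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g   # EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction (early steps)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w

# Toy loss L(w) = (w - 3)^2, gradient 2 * (w - 3); minimum at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
```

Because the step is normalized by sqrt(v_hat), the early updates have magnitude close to `lr` regardless of how large the raw gradient is, which is exactly the per-parameter adaptivity described above.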
4. Interactive: The Optimizer Race
Watch how different optimizers navigate a 2D loss surface towards the minimum (center).
5. PyTorch Implementation
In PyTorch you rarely implement optimizers from scratch, but knowing how to configure and use them is crucial.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)

# 1. Stochastic Gradient Descent (SGD)
# lr: learning rate; momentum=0.9 adds velocity
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 2. Adam (standard for most tasks)
# lr: learning rate (usually 1e-3 or 3e-4)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# Training loop snippet
# (assumes a `dataloader` and a loss `criterion`, e.g. nn.MSELoss(), are defined)
for inputs, targets in dataloader:
    optimizer_adam.zero_grad()          # Reset gradients
    outputs = model(inputs)             # Forward pass
    loss = criterion(outputs, targets)  # Compute loss
    loss.backward()                     # Backward pass
    optimizer_adam.step()               # Update weights
```
6. Custom Optimizer: Understanding the Guts
Here is how you might implement a simple SGD class in Python to understand the mechanics.
```python
class SimpleSGD:
    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def step(self):
        # Iterate over all parameters
        for p in self.params:
            if p.grad is None:
                continue
            # Update rule: w = w - lr * grad
            # We use p.data so the update itself is not tracked by autograd
            p.data -= self.lr * p.grad.data

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```
> [!NOTE]
> Learning Rate Schedulers: Often we want to decrease the learning rate as training progresses to fine-tune the weights. PyTorch provides `torch.optim.lr_scheduler` for this.
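As a sketch of what a scheduler actually computes, here is the step-decay schedule that `torch.optim.lr_scheduler.StepLR` implements (multiply the learning rate by `gamma` every `step_size` epochs), written as a plain Python function:

```python
def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate after `epoch` epochs under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# lr stays at 0.1 for epochs 0-29, drops to 0.01 at epoch 30,
# then to 0.001 at epoch 60, and so on.
schedule = [step_lr(0.1, e) for e in (0, 29, 30, 60)]
```

In real training code you would instead construct the scheduler once, e.g. `scheduler = optim.lr_scheduler.StepLR(optimizer_adam, step_size=30, gamma=0.1)`, and call `scheduler.step()` once per epoch.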
7. Summary
| Optimizer | Best Use Case | Analogy |
|---|---|---|
| SGD | Simple problems, or when precise control is needed. | Walking downhill carefully. |
| Momentum | Noisy gradients, ravines. | Rolling a heavy ball downhill. |
| Adam | Default choice for most Deep Learning tasks. | A smart rover with adaptive wheels for different terrain. |