Case Study: Gradient Descent (The Learning Algorithm)
1. Introduction: Learning = Optimization
How does a Neural Network “learn”? It’s not magic. It’s Calculus.
- We have a Loss Function L(w) (Error).
- We want to find weights w where L(w) is minimal.
- We calculate the Gradient ∇L(w) (Slope).
- We take a step downhill.
This simple algorithm, repeated millions of times, is how ChatGPT was trained.
2. The Algorithm
2.1 Vanilla Gradient Descent
w_new = w_old - α · ∇L(w_old)
- w: The weight (parameter).
- ∇L: The Gradient (Slope).
- α: The Learning Rate (Step Size).
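To make the update concrete, here is a minimal pure-Python sketch of this rule on a toy one-dimensional loss. The loss L(w) = (w - 3)², the starting weight, and the learning rate are illustrative choices, not part of the text above.

```python
# Toy loss (illustrative): L(w) = (w - 3)^2, minimized at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)   # dL/dw

w = 0.0        # arbitrary starting weight
alpha = 0.1    # learning rate (step size)

for step in range(50):
    w = w - alpha * grad(w)  # w_new = w_old - α · ∇L(w_old)

print(round(w, 4))  # ≈ 3.0: the weight has descended to the minimum
```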
2.2 The Problem of Local Minima and Saddle Points
Real loss landscapes are not simple bowls.
- Local Minima: Valleys that are not the lowest point. We get stuck.
- Saddle Points: Areas where the gradient is zero (flat), but it’s not a minimum. (Like the center of a Pringle chip). In high dimensions, these are very common.
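As a standard worked example (not from the text above), the function f(x, y) = x² - y² has exactly this shape: the gradient vanishes at the origin even though the origin is not a minimum.

```latex
f(x, y) = x^2 - y^2, \qquad \nabla f(x, y) = (2x,\ -2y), \qquad \nabla f(0, 0) = (0, 0)
% Along the x-axis the origin looks like a minimum, along the y-axis like a
% maximum: a saddle point where plain gradient descent receives no signal.
```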
2.3 Adding Momentum (The Physics Solution)
Standard Gradient Descent has no “mass”. It stops instantly if the gradient is zero. Momentum gives the optimizer inertia. It’s like rolling a heavy ball down a hill instead of a feather.
Momentum Update Rule:
- Update Velocity: v_new = β · v_old - α · ∇L(w_old)
- Update Weight: w_new = w_old + v_new
- v: Velocity vector (accumulated gradients).
- β: Momentum coefficient (usually 0.9). It controls how much of the old velocity carries over, so 1 - β acts like friction.
- Effect: It plows through small bumps (local minima) and speeds up across flat regions (saddle points).
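Below is a minimal sketch of these two update steps on the same style of toy quadratic loss; the loss function and the α and β values are illustrative assumptions, not taken from the text.

```python
# Toy loss (illustrative): L(w) = (w - 3)^2, with gradient 2(w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0       # starting weight
v = 0.0       # velocity: accumulated gradient history
alpha = 0.1   # learning rate
beta = 0.9    # momentum coefficient

for step in range(200):
    v = beta * v - alpha * grad(w)  # v_new = β · v_old - α · ∇L(w_old)
    w = w + v                       # w_new = w_old + v_new

print(round(w, 3))  # overshoots and oscillates at first, then settles at ≈ 3.0
```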
2.4 Batch vs. Stochastic Gradient Descent (SGD)
| Method | Data used per step | Gradient Quality | Speed | Use Case |
|---|---|---|---|---|
| Batch GD | Entire Dataset (N) | Perfect Direction | Very Slow | Small datasets only. |
| SGD | Single Sample (1) | Noisy / Erratic | Very Fast | Massive datasets (Online learning). |
| Mini-Batch SGD | Batch of 32-512 | Good approximation | Fast | Standard for Deep Learning. |
[!TIP] Why Noise is Good: The noise in SGD helps the model “jump” out of sharp local minima, finding flatter, more robust solutions (better generalization).
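As a sketch of the mini-batch variant, here is SGD on a toy one-parameter regression problem. The synthetic dataset, batch size of 32, and learning rate are illustrative assumptions; a real training loop would use a framework's autograd rather than this hand-written gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y ≈ 2x, so the ideal weight is w = 2.
X = rng.normal(size=1000)
y = 2.0 * X + 0.1 * rng.normal(size=1000)

w = 0.0           # single weight to learn
alpha = 0.05      # learning rate
batch_size = 32   # mini-batch size (typical range 32-512)

for epoch in range(5):
    order = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        error = w * xb - yb                       # prediction error on the batch
        g = 2.0 * np.mean(error * xb)             # noisy estimate of ∇L(w)
        w = w - alpha * g                         # same update rule, cheaper gradient

print(round(w, 3))  # close to 2.0 despite never seeing the full dataset at once
```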
3. Interactive Visualizer: The Descent Arena
A ball on a hill represents our model’s state. The goal is to reach the Global Minimum (Lowest Point).
Experiments:
- Double Well: Set Momentum to 0. Click “Reset Ball”. It gets stuck in the left (local) minimum.
- Escape: Increase Momentum to 0.9. Click “Reset Ball”. Watch it gain enough speed to jump over the hill to the global minimum on the right!
- Plateau: Select “Plateau”. Set Momentum to 0. Watch it crawl. Set Momentum to 0.9. Watch it speed up.
4. Summary
- Gradient Descent is an iterative optimization algorithm.
- Momentum adds “inertia” to help escape Local Minima and speed up crossing Saddle Points (Plateaus).
- SGD: Using one sample (or a small mini-batch) at a time adds “noise”, which can also help the optimizer escape local minima.