Case Study: Gradient Descent (The Learning Algorithm)
1. Introduction: Learning = Optimization
How does a Neural Network “learn”? It’s not magic. It’s Calculus.
- We have a Loss Function L(w) (Error).
- We want to find weights w where L(w) is minimal.
- We calculate the Gradient ∇L(w) (Slope).
- We take a step downhill.
This simple algorithm, repeated millions of times, is how ChatGPT was trained.
2. The Algorithm
2.1 Vanilla Gradient Descent
w_new = w_old - α · ∇L(w_old)
- w: The weight (parameter).
- ∇L: The Gradient (Slope).
- α: The Learning Rate (Step Size).
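To make the update concrete, here is a minimal pure-Python sketch of this rule on a toy one-dimensional loss. The loss L(w) = (w - 3)², the starting weight, and the learning rate are illustrative choices, not part of the text above.

```python
# Toy loss (illustrative): L(w) = (w - 3)^2, minimized at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)   # dL/dw

w = 0.0        # arbitrary starting weight
alpha = 0.1    # learning rate (step size)

for step in range(50):
    w = w - alpha * grad(w)  # w_new = w_old - α · ∇L(w_old)

print(round(w, 4))  # ≈ 3.0: the weight has descended to the minimum
```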
2.2 The Problem of Local Minima and Saddle Points
Real loss landscapes are not simple bowls.
- Local Minima: Valleys that are not the lowest point. We get stuck.
- Saddle Points: Areas where the gradient is zero (flat), but it’s not a minimum. (Like the center of a Pringle chip). In high dimensions, these are very common.
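As a standard worked example (not from the text above), the function f(x, y) = x² - y² has exactly this shape: the gradient vanishes at the origin even though the origin is not a minimum.

```latex
f(x, y) = x^2 - y^2, \qquad \nabla f(x, y) = (2x,\ -2y), \qquad \nabla f(0, 0) = (0, 0)
% Along the x-axis the origin looks like a minimum, along the y-axis like a
% maximum: a saddle point where plain gradient descent receives no signal.
```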
2.3 Adding Momentum (The Physics Solution)
Standard Gradient Descent has no “mass”. It stops instantly if the gradient is zero. Momentum gives the optimizer inertia. It’s like rolling a heavy ball down a hill instead of a feather.
Momentum Update Rule:
- Update Velocity: v_new = β · v_old - α · ∇L(w_old)
- Update Weight: w_new = w_old + v_new
- v: Velocity vector (accumulated gradients).
- β: Momentum coefficient (usually 0.9). It controls how much of the old velocity carries over, so 1 - β acts like friction.
- Effect: It plows through small bumps (local minima) and speeds up across flat regions (saddle points).
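Below is a minimal sketch of these two update steps on the same style of toy quadratic loss; the loss function and the α and β values are illustrative assumptions, not taken from the text.

```python
# Toy loss (illustrative): L(w) = (w - 3)^2, with gradient 2(w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0       # starting weight
v = 0.0       # velocity: accumulated gradient history
alpha = 0.1   # learning rate
beta = 0.9    # momentum coefficient

for step in range(200):
    v = beta * v - alpha * grad(w)  # v_new = β · v_old - α · ∇L(w_old)
    w = w + v                       # w_new = w_old + v_new

print(round(w, 3))  # overshoots and oscillates at first, then settles at ≈ 3.0
```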
2.4 Batch vs. Stochastic Gradient Descent (SGD)
| Method | Data used per step | Gradient Quality | Speed | Use Case |
|---|---|---|---|---|
| Batch GD | Entire Dataset (N) | Perfect Direction | Very Slow | Small datasets only. |
| SGD | Single Sample (1) | Noisy / Erratic | Very Fast | Massive datasets (Online learning). |
| Mini-Batch SGD | Batch of 32-512 | Good approximation | Fast | Standard for Deep Learning. |
[!TIP] Why Noise is Good: The noise in SGD helps the model “jump” out of sharp local minima, finding flatter, more robust solutions (better generalization).
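As a sketch of the mini-batch variant, here is SGD on a toy one-parameter regression problem. The synthetic dataset, batch size of 32, and learning rate are illustrative assumptions; a real training loop would use a framework's autograd rather than this hand-written gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y ≈ 2x, so the ideal weight is w = 2.
X = rng.normal(size=1000)
y = 2.0 * X + 0.1 * rng.normal(size=1000)

w = 0.0           # single weight to learn
alpha = 0.05      # learning rate
batch_size = 32   # mini-batch size (typical range 32-512)

for epoch in range(5):
    order = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        error = w * xb - yb                       # prediction error on the batch
        g = 2.0 * np.mean(error * xb)             # noisy estimate of ∇L(w)
        w = w - alpha * g                         # same update rule, cheaper gradient

print(round(w, 3))  # close to 2.0 despite never seeing the full dataset at once
```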
3. Interactive Visualizer: The Descent Arena
A ball on a hill represents our model’s state. The goal is to reach the Global Minimum (Lowest Point).
Experiments:
- Double Well: Set Momentum to 0. Click “Reset Ball”. It gets stuck in the left (local) minimum.
- Escape: Increase Momentum to 0.9. Click “Reset Ball”. Watch it gain enough speed to jump over the hill to the global minimum on the right!
- Plateau: Select “Plateau”. Set Momentum to 0. Watch it crawl. Set Momentum to 0.9. Watch it speed up.
4. Summary
- Gradient Descent is an iterative optimization algorithm.
- Momentum adds “inertia” to help escape Local Minima and speed up crossing Saddle Points (Plateaus).
- SGD: Using one sample (or a small mini-batch) at a time adds “noise”, which can also help the optimizer escape local minima.