Module Review: Advanced Optimization

1. Key Takeaways

  • Loss Landscape: The geometry of the loss function determines training difficulty. Convex functions (bowl-shaped) are easy; neural network losses are non-convex (rugged) and plagued by saddle points more than by poor local minima.
  • Optimizers:
      • SGD: The baseline. Oscillates in ravines and stalls on plateaus.
      • Momentum: Adds "velocity" to the update, letting it plow through flat regions and dampening oscillations.
      • Adam: The de facto standard. Combines Momentum (first moment) and RMSProp (second moment) to adapt the learning rate for each parameter.
  • Constrained Optimization: To optimize subject to a constraint g(x) = 0, we use Lagrange multipliers (\nabla f = \lambda \nabla g), finding points where the objective and constraint gradients align.
  • AutoDiff: Modern frameworks use reverse-mode autodiff (backpropagation), which computes gradients with respect to millions of inputs (parameters) in a single backward pass.
  • Backpropagation: The chain rule applied to the computational graph. Deep networks with sigmoid activations suffer from vanishing gradients because many small derivatives (at most 0.25) multiply toward zero.
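The SGD and momentum updates above can be sketched on a toy ill-conditioned quadratic "ravine" (a minimal sketch; the loss, learning rate, and \beta values are illustrative assumptions, not a prescription):

```python
# Toy ravine: f(w) = 0.5 * (10*w0^2 + 0.1*w1^2).
# The 100:1 curvature ratio makes the steep w0 axis easy but the
# flat w1 axis slow for plain SGD; momentum smooths the updates.

def grad(w):
    return [10.0 * w[0], 0.1 * w[1]]

def sgd(w, lr=0.1, steps=200):
    for _ in range(steps):
        g = grad(w)
        w = [w[i] - lr * g[i] for i in range(2)]
    return w

def momentum(w, lr=0.1, beta=0.9, steps=200):
    v = [0.0, 0.0]  # velocity, an exponential moving average of gradients
    for _ in range(steps):
        g = grad(w)
        v = [beta * v[i] + (1 - beta) * g[i] for i in range(2)]
        w = [w[i] - lr * v[i] for i in range(2)]
    return w

print(sgd([1.0, 1.0]))       # w0 snaps to 0; w1 decays slowly
print(momentum([1.0, 1.0]))  # both coordinates head toward the minimum
```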


2. Interactive Flashcards

What is a Saddle Point?


A point where the gradient is zero (\nabla L = 0), but it is a minimum in one direction and a maximum in another. It is the main obstacle in high-dimensional optimization.
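A quick numeric check of this definition on the textbook saddle f(x, y) = x^2 - y^2 (a minimal sketch; the function is a standard example, not from this module's exercises):

```python
# The origin of f(x, y) = x^2 - y^2 is a saddle point:
# the gradient vanishes, yet it is a minimum along x and a maximum along y.

def f(x, y):
    return x * x - y * y

def grad(x, y):
    return (2 * x, -2 * y)

print(grad(0.0, 0.0))            # gradient is (0, 0) at the origin
print(f(0.1, 0.0) > f(0.0, 0.0)) # True: f increases along x (minimum)
print(f(0.0, 0.1) < f(0.0, 0.0)) # True: f decreases along y (maximum)
```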

Why use Reverse Mode AutoDiff for ML?


Because ML models have millions of inputs (parameters) but only one output (Loss). Reverse mode computes all gradients in a single backward pass, whereas Forward mode would require millions of passes.
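A hand-rolled sketch of the idea on a two-parameter graph (real frameworks build and traverse the graph automatically; the tiny model here is an illustrative assumption):

```python
# Reverse-mode autodiff on a tiny graph: L = (w0*x0 + w1*x1)^2.
# One forward pass caches the intermediate s; one backward pass
# yields dL/dw0 and dL/dw1 simultaneously. Forward mode would need
# a separate pass per parameter.

def forward(w, x):
    s = w[0] * x[0] + w[1] * x[1]  # intermediate node
    L = s * s                      # loss node
    return L, s

def backward(w, x, s):
    dL_ds = 2 * s                             # chain rule, output first
    return [dL_ds * x[0], dL_ds * x[1]]       # dL/dw_i = dL/ds * ds/dw_i

w, x = [1.0, 2.0], [3.0, 4.0]
L, s = forward(w, x)        # s = 11.0, L = 121.0
grads = backward(w, x, s)   # [66.0, 88.0] from a single backward pass
print(L, grads)
```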

What does Adam do?


It combines Momentum (First Moment) and RMSProp (Second Moment) to adapt learning rates individually for each parameter.
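A single bias-corrected Adam update, written out as a minimal sketch (the defaults beta1=0.9, beta2=0.999, eps=1e-8 are the standard ones; the scalar parameter and gradient are illustrative):

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first moment (momentum term)
    v = b2 * v + (1 - b2) * g * g    # second moment (RMSProp term)
    m_hat = m / (1 - b1 ** t)        # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# The first step moves by roughly lr regardless of gradient scale,
# because the first and second moments cancel each other's magnitude:
w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, 100.0, m, v, t=1)
print(w)  # ≈ 0.999 despite the huge gradient of 100
```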

What causes Vanishing Gradients?


Multiplying many small derivatives (e.g., Sigmoid max derivative is 0.25) during backpropagation, causing gradients at early layers to shrink to zero.
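The shrinking effect can be checked directly (a minimal sketch; the 20-layer depth is an illustrative assumption):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)  # maximized at z = 0, where it equals 0.25

# Even in the best case (every pre-activation exactly 0), a chain of
# 20 sigmoid layers scales the gradient by at most 0.25^20:
print(sigmoid_deriv(0.0))  # 0.25
print(0.25 ** 20)          # ~9.1e-13: effectively zero at early layers
```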

What is Jensen's Inequality?


For a convex function f, the function of the average is less than or equal to the average of the function values: f(E[x]) \le E[f(x)].
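A numeric spot-check of the inequality with the convex f(x) = x^2 on a small sample (the sample values are an illustrative assumption):

```python
# Jensen's inequality for convex f: f(E[x]) <= E[f(x)].
xs = [1.0, 2.0, 3.0, 6.0]
mean = sum(xs) / len(xs)                       # E[x] = 3.0
f_of_mean = mean ** 2                          # f(E[x]) = 9.0
mean_of_f = sum(x ** 2 for x in xs) / len(xs)  # E[f(x)] = 12.5
print(f_of_mean <= mean_of_f)                  # True
```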

What is the Tangency Condition?


In constrained optimization, the optimal point occurs where the constraint boundary runs parallel to the objective's contour lines (\nabla f = \lambda \nabla g).
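The tangency condition can be verified on a small example (a minimal sketch; the choice of f(x, y) = x^2 + y^2 constrained to the line x + y = 1 is an illustrative assumption):

```python
# At the constrained optimum, \nabla f = \lambda \nabla g.
# For f(x, y) = x^2 + y^2 with g(x, y) = x + y - 1 = 0,
# the closest point on the line to the origin is (0.5, 0.5).

def grad_f(x, y):
    return (2 * x, 2 * y)

def grad_g(x, y):
    return (1.0, 1.0)

x, y = 0.5, 0.5
gf, gg = grad_f(x, y), grad_g(x, y)
lam = gf[0] / gg[0]  # solve the first component for lambda
# The same lambda must scale every component: gradients are parallel.
print(gf == tuple(lam * c for c in gg))  # True
```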


3. Cheat Sheet: Optimizers

| Optimizer | Formula (simplified) | Pros | Cons |
|---|---|---|---|
| SGD | w = w - \eta \nabla L | Simple, low memory | Slow; stuck at saddle points; oscillates |
| Momentum | v = \beta v + (1-\beta)\nabla L; w = w - \eta v | Fast in ravines; dampens oscillation | Introduces new hyperparameter \beta |
| RMSProp | v = \beta v + (1-\beta)(\nabla L)^2; w = w - \eta \frac{\nabla L}{\sqrt{v} + \epsilon} | Adaptive learning rate per parameter | No momentum term |
| Adam | Momentum + RMSProp | Fast, robust, de facto standard | Can generalize slightly worse than SGD on simple problems |

4. Next Steps

Now that you understand how to train networks, let’s explore the advanced linear algebra that powers them.

Math ML Glossary