Module Review: Advanced Optimization
1. Key Takeaways
- Loss Landscape: The geometry of the loss function determines training difficulty. Convex functions are easy (bowl-shaped); Neural Networks are Non-Convex (rugged), plagued by Saddle Points rather than local minima.
- Optimizers:
- SGD: The baseline. Struggles in ravines and gets stuck on plateaus.
- Momentum: Adds “velocity” to the optimizer, allowing it to plow through flat regions and dampen oscillations.
- Adam: The gold standard. Combines Momentum (First Moment) and RMSProp (Second Moment) to adapt learning rates for each parameter.
- Constrained Optimization: To optimize under constraints (g(x)=0), we use Lagrange Multipliers (\nabla f = \lambda \nabla g), finding points where the objective and constraint gradients align.
- AutoDiff: Modern frameworks use Reverse Mode AutoDiff (Backpropagation), which efficiently computes gradients for millions of inputs (parameters) in a single backward pass.
- Backpropagation: The Chain Rule applied to the computational graph. Deep networks with Sigmoid activations suffer from Vanishing Gradients because the per-layer derivatives (at most 0.25) multiply toward zero.
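The chain-rule view of backpropagation can be sketched on a tiny computational graph. The example below (names `w`, `x`, `y` are illustrative, not from the text) differentiates L = (w*x - y)^2 by hand in reverse mode and checks the result against a finite-difference estimate:

```python
# A minimal sketch (not a framework API): reverse-mode differentiation of
# L = (w*x - y)^2 by hand-applying the Chain Rule, checked against a
# finite-difference estimate.

def loss(w, x, y):
    return (w * x - y) ** 2

def grad_loss_w(w, x, y):
    # Backward pass: dL/dw = dL/de * de/dw, where e = w*x - y.
    e = w * x - y        # forward pass stores the intermediate value
    dL_de = 2 * e        # local derivative of (.)^2
    de_dw = x            # local derivative of (w*x - y) w.r.t. w
    return dL_de * de_dw

w, x, y = 2.0, 3.0, 1.0
analytic = grad_loss_w(w, x, y)    # 2*(2*3 - 1)*3 = 30.0

h = 1e-6                           # central finite difference as a sanity check
numeric = (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)
print(analytic, numeric)
```

Real frameworks do exactly this, but record the intermediate values for every node of the graph during the forward pass and replay them in one backward sweep.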
2. Interactive Flashcards
What is a Saddle Point?
A point where the gradient is zero (\nabla L = 0), but it is a minimum in one direction and a maximum in another. It is the main obstacle in high-dimensional optimization.
Why use Reverse Mode AutoDiff for ML?
Because ML models have millions of inputs (parameters) but only one output (the Loss). Reverse mode computes all gradients in a single backward pass, whereas Forward mode would require millions of passes.
What does Adam do?
It combines Momentum (First Moment) and RMSProp (Second Moment) to adapt learning rates individually for each parameter.
What causes Vanishing Gradients?
Multiplying many small derivatives (e.g., the Sigmoid's maximum derivative is 0.25) during backpropagation, causing gradients at early layers to shrink to zero.
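The shrinkage is easy to quantify. A minimal sketch: even in the best case, where every pre-activation sits at z = 0 (the Sigmoid derivative's maximum), the product of derivatives across 20 layers is already negligible:

```python
# Toy illustration of vanishing gradients: the sigmoid derivative peaks
# at 0.25 (at z = 0), so the product across many layers shrinks
# geometrically even in this best case.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

best_case = sigmoid_deriv(0.0)   # 0.25, the maximum possible value
print(best_case)                 # 0.25
print(best_case ** 20)           # ~9.1e-13 after 20 layers
```

In practice activations are rarely at z = 0, so the true product is even smaller, which is why ReLU-family activations are preferred in deep networks.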
What is Jensen's Inequality?
For a convex function f, the function of the average is less than or equal to the average of the function values: f(E[x]) \le E[f(x)].
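A quick numeric check of the inequality for the convex f(x) = x^2 (the sample points are illustrative):

```python
# Jensen's inequality for f(x) = x^2: f(E[x]) <= E[f(x)].
xs = [1.0, 3.0, 5.0]
mean = sum(xs) / len(xs)                        # E[x] = 3.0
f_of_mean = mean ** 2                           # f(E[x]) = 9.0
mean_of_f = sum(x ** 2 for x in xs) / len(xs)   # E[f(x)] = 35/3 ~ 11.67
print(f_of_mean <= mean_of_f)                   # True
```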
What is the Tangency Condition?
In constrained optimization, the optimal point occurs where the constraint boundary runs parallel to the objective's contour lines (\nabla f = \lambda \nabla g).
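The tangency condition can be verified on a worked example (the problem below is my own illustration, not from the text): maximize f(x, y) = xy subject to g(x, y) = x + y - 10 = 0. Setting \nabla f = \lambda \nabla g gives y = \lambda and x = \lambda, so x = y, and the constraint yields x = y = 5:

```python
# Tangency check at the solution of: maximize x*y subject to x + y = 10.
x, y = 5.0, 5.0
grad_f = (y, x)            # gradient of f(x, y) = x*y is (y, x)
grad_g = (1.0, 1.0)        # gradient of g(x, y) = x + y - 10
lam = grad_f[0] / grad_g[0]
# At the optimum the gradients are parallel: grad_f == lam * grad_g.
print(lam)                                            # 5.0
print(grad_f == (lam * grad_g[0], lam * grad_g[1]))   # True
```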
3. Cheat Sheet: Optimizers
| Optimizer | Formula (Simplified) | Pros | Cons |
|---|---|---|---|
| SGD | w = w - \eta \nabla L | Simple, low memory | Slow; stuck at saddle points; oscillates |
| Momentum | v = \beta v + (1-\beta)\nabla L; w = w - \eta v | Fast in ravines, dampens oscillation | Introduces a new hyperparameter \beta |
| RMSProp | v = \beta v + (1-\beta)(\nabla L)^2; w = w - \eta \frac{\nabla L}{\sqrt{v}} | Adaptive learning rate per parameter | No momentum; can get stuck in local minima |
| Adam | Momentum + RMSProp | Fast, robust, de facto standard | Can generalize slightly worse than SGD on simple problems |
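The four update rules above can be sketched in a few lines each. This is a minimal illustration on the 1-D objective f(w) = w^2 (so the gradient is 2w); the hyperparameters are common defaults, not tuned values, and a small \epsilon is added to the adaptive denominators for numerical stability:

```python
# Minimal sketches of the cheat-sheet update rules, run on f(w) = w^2.
import math

def run(update, steps=100, w0=5.0, eta=0.1):
    w, state = w0, {}
    for t in range(1, steps + 1):
        g = 2.0 * w                      # gradient of w^2
        w = update(w, g, state, eta, t)
    return w

def sgd(w, g, state, eta, t):
    return w - eta * g

def momentum(w, g, state, eta, t, beta=0.9):
    # EMA of gradients acts as "velocity".
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * g
    return w - eta * state["v"]

def rmsprop(w, g, state, eta, t, beta=0.9, eps=1e-8):
    # EMA of squared gradients scales the step per parameter.
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * g * g
    return w - eta * g / (math.sqrt(state["v"]) + eps)

def adam(w, g, state, eta, t, b1=0.9, b2=0.999, eps=1e-8):
    # First moment (momentum) + second moment (RMSProp), bias-corrected.
    m = b1 * state.get("m", 0.0) + (1 - b1) * g
    v = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    state["m"], state["v"] = m, v
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps)

for name, f in [("SGD", sgd), ("Momentum", momentum),
                ("RMSProp", rmsprop), ("Adam", adam)]:
    print(name, run(f))
```

All four drive w toward the minimum at 0 here; the differences in the table only become visible on harder landscapes (ravines, plateaus, saddle points).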
4. Next Steps
Now that you understand how to train networks, let’s explore the advanced linear algebra that powers them.