Module Review: Advanced Optimization
1. Key Takeaways
- Loss Landscape: The geometry of the loss function determines training difficulty. Convex functions are easy (bowl-shaped); Neural Networks are Non-Convex (rugged), plagued by Saddle Points rather than local minima.
- Optimizers:
- SGD: The baseline. Struggles in ravines and gets stuck on plateaus.
- Momentum: Adds “velocity” to the optimizer, allowing it to plow through flat regions and dampen oscillations.
- Adam: The gold standard. Combines Momentum (First Moment) and RMSProp (Second Moment) to adapt learning rates for each parameter.
- Constrained Optimization: To optimize under constraints (g(x)=0), we use Lagrange Multipliers (\nabla f = \lambda \nabla g), finding points where the objective and constraint gradients align.
- AutoDiff: Modern frameworks use Reverse Mode AutoDiff (Backpropagation), which efficiently computes gradients for millions of inputs (parameters) in a single backward pass.
- Backpropagation: The Chain Rule applied to the computational graph. Deep networks with Sigmoid activations suffer from Vanishing Gradients because the per-layer derivatives (at most 0.25) multiply toward zero.
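The chain-rule view of backpropagation can be sketched on a tiny computational graph. The example below (names `w`, `x`, `y` are illustrative, not from the text) differentiates L = (w*x - y)^2 by hand in reverse mode and checks the result against a finite-difference estimate:

```python
# A minimal sketch (not a framework API): reverse-mode differentiation of
# L = (w*x - y)^2 by hand-applying the Chain Rule, checked against a
# finite-difference estimate.

def loss(w, x, y):
    return (w * x - y) ** 2

def grad_loss_w(w, x, y):
    # Backward pass: dL/dw = dL/de * de/dw, where e = w*x - y.
    e = w * x - y        # forward pass stores the intermediate value
    dL_de = 2 * e        # local derivative of (.)^2
    de_dw = x            # local derivative of (w*x - y) w.r.t. w
    return dL_de * de_dw

w, x, y = 2.0, 3.0, 1.0
analytic = grad_loss_w(w, x, y)    # 2*(2*3 - 1)*3 = 30.0

h = 1e-6                           # central finite difference as a sanity check
numeric = (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)
print(analytic, numeric)
```

Real frameworks do exactly this, but record the intermediate values for every node of the graph during the forward pass and replay them in one backward sweep.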
2. Interactive Flashcards
What is a Saddle Point?
A point where the gradient is zero (\nabla L = 0), but it is a minimum in one direction and a maximum in another. It is the main obstacle in high-dimensional optimization.
Why use Reverse Mode AutoDiff for ML?
Because ML models have millions of inputs (parameters) but only one output (the Loss). Reverse mode computes all gradients in a single backward pass, whereas Forward mode would require millions of passes.
What does Adam do?
It combines Momentum (First Moment) and RMSProp (Second Moment) to adapt learning rates individually for each parameter.
What causes Vanishing Gradients?
Multiplying many small derivatives (e.g., the Sigmoid's maximum derivative is 0.25) during backpropagation, causing gradients at early layers to shrink to zero.
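The shrinkage is easy to quantify. A minimal sketch: even in the best case, where every pre-activation sits at z = 0 (the Sigmoid derivative's maximum), the product of derivatives across 20 layers is already negligible:

```python
# Toy illustration of vanishing gradients: the sigmoid derivative peaks
# at 0.25 (at z = 0), so the product across many layers shrinks
# geometrically even in this best case.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

best_case = sigmoid_deriv(0.0)   # 0.25, the maximum possible value
print(best_case)                 # 0.25
print(best_case ** 20)           # ~9.1e-13 after 20 layers
```

In practice activations are rarely at z = 0, so the true product is even smaller, which is why ReLU-family activations are preferred in deep networks.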
What is Jensen's Inequality?
For a convex function f, the function of the average is less than or equal to the average of the function values: f(E[x]) \le E[f(x)].
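A quick numeric check of the inequality for the convex f(x) = x^2 (the sample points are illustrative):

```python
# Jensen's inequality for f(x) = x^2: f(E[x]) <= E[f(x)].
xs = [1.0, 3.0, 5.0]
mean = sum(xs) / len(xs)                        # E[x] = 3.0
f_of_mean = mean ** 2                           # f(E[x]) = 9.0
mean_of_f = sum(x ** 2 for x in xs) / len(xs)   # E[f(x)] = 35/3 ~ 11.67
print(f_of_mean <= mean_of_f)                   # True
```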
What is the Tangency Condition?
In constrained optimization, the optimal point occurs where the constraint boundary runs parallel to the objective's contour lines (\nabla f = \lambda \nabla g).
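The tangency condition can be verified on a worked example (the problem below is my own illustration, not from the text): maximize f(x, y) = xy subject to g(x, y) = x + y - 10 = 0. Setting \nabla f = \lambda \nabla g gives y = \lambda and x = \lambda, so x = y, and the constraint yields x = y = 5:

```python
# Tangency check at the solution of: maximize x*y subject to x + y = 10.
x, y = 5.0, 5.0
grad_f = (y, x)            # gradient of f(x, y) = x*y is (y, x)
grad_g = (1.0, 1.0)        # gradient of g(x, y) = x + y - 10
lam = grad_f[0] / grad_g[0]
# At the optimum the gradients are parallel: grad_f == lam * grad_g.
print(lam)                                            # 5.0
print(grad_f == (lam * grad_g[0], lam * grad_g[1]))   # True
```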
3. Cheat Sheet: Optimizers
| Optimizer | Formula (Simplified) | Pros | Cons |
|---|---|---|---|
| SGD | w = w - \eta \nabla L | Simple, low memory | Slow; stuck at saddle points; oscillates |
| Momentum | v = \beta v + (1-\beta)\nabla L; w = w - \eta v | Fast in ravines, dampens oscillation | Introduces a new hyperparameter \beta |
| RMSProp | v = \beta v + (1-\beta)(\nabla L)^2; w = w - \eta \frac{\nabla L}{\sqrt{v}} | Adaptive learning rate per parameter | No momentum; can get stuck in local minima |
| Adam | Momentum + RMSProp | Fast, robust, de facto standard | Can generalize slightly worse than SGD on simple problems |
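The four update rules above can be sketched in a few lines each. This is a minimal illustration on the 1-D objective f(w) = w^2 (so the gradient is 2w); the hyperparameters are common defaults, not tuned values, and a small \epsilon is added to the adaptive denominators for numerical stability:

```python
# Minimal sketches of the cheat-sheet update rules, run on f(w) = w^2.
import math

def run(update, steps=100, w0=5.0, eta=0.1):
    w, state = w0, {}
    for t in range(1, steps + 1):
        g = 2.0 * w                      # gradient of w^2
        w = update(w, g, state, eta, t)
    return w

def sgd(w, g, state, eta, t):
    return w - eta * g

def momentum(w, g, state, eta, t, beta=0.9):
    # EMA of gradients acts as "velocity".
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * g
    return w - eta * state["v"]

def rmsprop(w, g, state, eta, t, beta=0.9, eps=1e-8):
    # EMA of squared gradients scales the step per parameter.
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * g * g
    return w - eta * g / (math.sqrt(state["v"]) + eps)

def adam(w, g, state, eta, t, b1=0.9, b2=0.999, eps=1e-8):
    # First moment (momentum) + second moment (RMSProp), bias-corrected.
    m = b1 * state.get("m", 0.0) + (1 - b1) * g
    v = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    state["m"], state["v"] = m, v
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps)

for name, f in [("SGD", sgd), ("Momentum", momentum),
                ("RMSProp", rmsprop), ("Adam", adam)]:
    print(name, run(f))
```

All four drive w toward the minimum at 0 here; the differences in the table only become visible on harder landscapes (ravines, plateaus, saddle points).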
4. Next Steps
Now that you understand how to train networks, let’s explore the advanced linear algebra that powers them.