Module 2 Review: Calculus Fundamentals
1. Key Takeaways
- Derivatives measure Sensitivity: “How much does the output change if I nudge the input?”
- The Chain Rule is the engine of Backpropagation. It allows us to compute gradients for deep networks layer by layer (see the sketch after this list).
- Gradient Descent uses the Gradient Vector to find the minimum of a Loss Function.
- Taylor Series approximate complex functions with simple polynomials; these local approximations are what optimization algorithms build on.
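To make the Chain Rule and Gradient Descent takeaways concrete, here is a minimal plain-Python sketch. The tiny composition, the variable names (w, x, b, u), and the learning rate are illustrative choices, not values from the module:

```python
# Backprop by hand on y = (w*x + b)**2, split into u = w*x + b and y = u**2.

x, w, b = 1.5, 0.8, 0.2        # input and parameters (illustrative values)

# Forward pass: compute and remember the intermediate value.
u = w * x + b                  # u = 1.4
y = u ** 2                     # output, y = 1.96

# Backward pass: multiply local derivatives (Chain Rule).
dy_du = 2 * u                  # d(u**2)/du
du_dw = x                      # d(w*x + b)/dw
du_db = 1.0                    # d(w*x + b)/db
dy_dw = dy_du * du_dw          # dy/dw = 2u · x = 4.2
dy_db = dy_du * du_db          # dy/db = 2u · 1 = 2.8

# One Gradient Descent step on the parameters.
alpha = 0.1
w -= alpha * dy_dw
b -= alpha * dy_db
print(w, b)
```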
2. Cheat Sheet
| Concept | Formula / Definition | Intuition |
|---|---|---|
| Derivative | f′(x) = lim_{h→0} (f(x+h) - f(x)) / h | Slope of the tangent line. Sensitivity. |
| Power Rule | d/dx x^n = n·x^(n-1) | Simple polynomial derivatives. |
| Chain Rule | dy/dx = dy/du · du/dx | Multiply local derivatives to get total derivative. |
| Partial Derivative | ∂f / ∂x | Slope along one axis, holding others constant. |
| Gradient | ∇f = [∂f/∂x, ∂f/∂y, …] | Vector pointing in the direction of steepest ascent. |
| Jacobian | J_ij = ∂y_i / ∂x_j | Matrix of first derivatives (Slope Map) for vector functions. |
| Hessian | H_ij = ∂²f / ∂x_i ∂x_j | Matrix of second derivatives (Curvature Map) for scalar functions. |
| Taylor Series | f(x) ≈ f(a) + f’(a)(x-a) + … | Approximating functions with polynomials. |
| Gradient Descent | w = w - α∇L(w) | Stepping downhill to minimize error. |
| Momentum | v = βv - α∇L; w = w+v | Adding inertia (velocity) to escape local minima. |
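As a rough sketch of the last two rows, the snippet below applies plain Gradient Descent and the Momentum update to an assumed 1-D loss L(w) = (w - 3)²; the loss and the values of α and β are illustrative, not prescribed by the module:

```python
# Gradient Descent vs. Momentum on L(w) = (w - 3)**2, whose gradient is 2(w - 3).

def grad_L(w):
    return 2 * (w - 3)

alpha, beta = 0.1, 0.9            # learning rate and momentum coefficient (illustrative)

# Gradient Descent: w = w - α∇L(w)
w = 0.0
for _ in range(200):
    w -= alpha * grad_L(w)

# Momentum: v = βv - α∇L; w = w + v
w_m, v = 0.0, 0.0
for _ in range(200):
    v = beta * v - alpha * grad_L(w_m)
    w_m += v

print(w, w_m)                     # both approach the minimizer w* = 3
```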
3. Flashcards
AutoDiff (Automatic Differentiation)
A technique to compute derivatives exactly by tracking the sequence of operations (the Computation Graph) and applying the Chain Rule. It avoids the cost and approximation error of numerical differentiation and the expression swell of symbolic differentiation.
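A quick way to see why exactness matters is to compare a finite-difference estimate with the true derivative. The function f(x) = x³ and the step sizes below are illustrative assumptions:

```python
# Forward-difference estimates of f'(2) for f(x) = x**3 (exact value: 3*2**2 = 12).

def f(x):
    return x ** 3

x, exact = 2.0, 12.0

for h in (1e-1, 1e-4, 1e-15):
    numeric = (f(x + h) - f(x)) / h       # numerical differentiation
    print(h, numeric, abs(numeric - exact))

# Large h suffers truncation error; tiny h suffers floating-point cancellation.
# AutoDiff sidesteps both by propagating exact local derivatives through the graph.
```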
Saddle Point
A point where the gradient is zero (flat) but that is neither a local minimum nor a local maximum: the surface curves up in one direction and down in another. High-dimensional optimization often gets stuck near these points.
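A concrete (assumed) example is f(x, y) = x² - y², which has a saddle at the origin: the gradient vanishes there, yet the Hessian has eigenvalues of both signs:

```python
import numpy as np

# f(x, y) = x**2 - y**2: curves up along x, down along y.

def grad(x, y):
    return np.array([2 * x, -2 * y])      # ∇f = [∂f/∂x, ∂f/∂y]

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])         # constant Hessian of f

print(grad(0.0, 0.0))                     # zero gradient: a critical point
print(np.linalg.eigvalsh(hessian))        # one negative, one positive eigenvalue: a saddle
```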
Learning Rate (α)
A hyperparameter that controls the step size in Gradient Descent. Too small and convergence is slow; too large and the updates overshoot and diverge.
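A minimal sketch of this trade-off on the assumed loss L(w) = w², whose gradient is 2w (the α values are illustrative):

```python
# Gradient Descent on L(w) = w**2 from w = 1, with three different learning rates.

def run(alpha, steps=20):
    w = 1.0
    for _ in range(steps):
        w -= alpha * 2 * w      # w = w - α∇L(w)
    return w

print(run(0.01))   # too small: still far from the minimum at 0 (slow convergence)
print(run(0.4))    # reasonable: essentially at 0
print(run(1.1))    # too large: |w| grows every step (overshooting / divergence)
```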
SGD (Stochastic Gradient Descent)
An optimization variant that estimates the gradient from a single random training example (or a mini-batch) rather than the entire dataset. Each step is much cheaper, and the gradient noise helps the iterates escape shallow local minima.
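Below is a minimal sketch of single-example SGD on an assumed toy problem: fitting y = w·x to synthetic data generated with w = 2. The data, learning rate, and step count are illustrative:

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 6)]    # synthetic examples on the line y = 2x

w, alpha = 0.0, 0.01
for _ in range(1000):
    x, y = random.choice(data)                # gradient estimated from one random example
    grad = 2 * (w * x - y) * x                # ∂/∂w of the squared error (w*x - y)**2
    w -= alpha * grad                         # cheap, noisy update

print(w)                                      # ≈ 2.0: recovers the true slope
```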
Jacobian vs. Hessian
The Jacobian holds first-order derivatives (a Slope Matrix) of a vector-valued function; the Hessian holds second-order derivatives (a Curvature Matrix) of a scalar-valued function.
Dual Numbers
A number system of the form a + b·ε, where ε² = 0, used in Forward-Mode AutoDiff. The real part tracks the value, and the dual part tracks the derivative.
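Here is a minimal sketch of forward-mode differentiation with a hand-rolled Dual class; the class name, the supported operations (+ and ×), and the test function are illustrative assumptions:

```python
# Forward-mode AutoDiff with dual numbers a + b·ε, where ε² = 0.

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value = value          # real part: the function value
        self.deriv = deriv          # dual part: the derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + bε)(c + dε) = ac + (ad + bc)ε, since ε² = 0: the product rule falls out
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__

def f(x):
    return x * x + 3 * x + 1        # f'(x) = 2x + 3

x = Dual(2.0, 1.0)                  # seed the dual part with dx/dx = 1
y = f(x)
print(y.value, y.deriv)             # 11.0 and 7.0: f(2) = 11, f'(2) = 7
```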