Module 2 Review: Calculus Fundamentals
1. Key Takeaways
- Derivatives measure Sensitivity: “How much does the output change if I nudge the input?”
- The Chain Rule is the engine of Backpropagation. It allows us to compute gradients for deep networks layer by layer (see the sketch after this list).
- Gradient Descent uses the Gradient Vector to find the minimum of a Loss Function.
- Taylor Series approximate complex functions with simple polynomials; these local approximations are what optimization algorithms build on.
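To make the Chain Rule and Gradient Descent takeaways concrete, here is a minimal plain-Python sketch. The tiny composition, the variable names (w, x, b, u), and the learning rate are illustrative choices, not values from the module:

```python
# Backprop by hand on y = (w*x + b)**2, split into u = w*x + b and y = u**2.

x, w, b = 1.5, 0.8, 0.2        # input and parameters (illustrative values)

# Forward pass: compute and remember the intermediate value.
u = w * x + b                  # u = 1.4
y = u ** 2                     # output, y = 1.96

# Backward pass: multiply local derivatives (Chain Rule).
dy_du = 2 * u                  # d(u**2)/du
du_dw = x                      # d(w*x + b)/dw
du_db = 1.0                    # d(w*x + b)/db
dy_dw = dy_du * du_dw          # dy/dw = 2u · x = 4.2
dy_db = dy_du * du_db          # dy/db = 2u · 1 = 2.8

# One Gradient Descent step on the parameters.
alpha = 0.1
w -= alpha * dy_dw
b -= alpha * dy_db
print(w, b)
```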
2. Cheat Sheet
| Concept | Formula / Definition | Intuition |
|---|---|---|
| Derivative | f′(x) = lim_{h→0} (f(x+h) - f(x)) / h | Slope of the tangent line. Sensitivity. |
| Power Rule | d/dx x^n = n·x^(n-1) | Simple polynomial derivatives. |
| Chain Rule | dy/dx = dy/du · du/dx | Multiply local derivatives to get total derivative. |
| Partial Derivative | ∂f / ∂x | Slope along one axis, holding others constant. |
| Gradient | ∇f = [∂f/∂x, ∂f/∂y, …] | Vector pointing in the direction of steepest ascent. |
| Jacobian | J_ij = ∂y_i / ∂x_j | Matrix of first derivatives (Slope Map) for vector functions. |
| Hessian | H_ij = ∂²f / ∂x_i ∂x_j | Matrix of second derivatives (Curvature Map) for scalar functions. |
| Taylor Series | f(x) ≈ f(a) + f’(a)(x-a) + … | Approximating functions with polynomials. |
| Gradient Descent | w = w - α∇L(w) | Stepping downhill to minimize error. |
| Momentum | v = βv - α∇L; w = w+v | Adding inertia (velocity) to escape local minima. |
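As a rough sketch of the last two rows, the snippet below applies plain Gradient Descent and the Momentum update to an assumed 1-D loss L(w) = (w - 3)²; the loss and the values of α and β are illustrative, not prescribed by the module:

```python
# Gradient Descent vs. Momentum on L(w) = (w - 3)**2, whose gradient is 2(w - 3).

def grad_L(w):
    return 2 * (w - 3)

alpha, beta = 0.1, 0.9            # learning rate and momentum coefficient (illustrative)

# Gradient Descent: w = w - α∇L(w)
w = 0.0
for _ in range(200):
    w -= alpha * grad_L(w)

# Momentum: v = βv - α∇L; w = w + v
w_m, v = 0.0, 0.0
for _ in range(200):
    v = beta * v - alpha * grad_L(w_m)
    w_m += v

print(w, w_m)                     # both approach the minimizer w* = 3
```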
3. Flashcards
AutoDiff (Automatic Differentiation)
A technique to compute derivatives exactly by tracking the sequence of operations (the Computation Graph) and applying the Chain Rule. It avoids the cost and approximation error of numerical differentiation and the expression swell of symbolic differentiation.
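A quick way to see why exactness matters is to compare a finite-difference estimate with the true derivative. The function f(x) = x³ and the step sizes below are illustrative assumptions:

```python
# Forward-difference estimates of f'(2) for f(x) = x**3 (exact value: 3*2**2 = 12).

def f(x):
    return x ** 3

x, exact = 2.0, 12.0

for h in (1e-1, 1e-4, 1e-15):
    numeric = (f(x + h) - f(x)) / h       # numerical differentiation
    print(h, numeric, abs(numeric - exact))

# Large h suffers truncation error; tiny h suffers floating-point cancellation.
# AutoDiff sidesteps both by propagating exact local derivatives through the graph.
```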
Saddle Point
A point where the gradient is zero (flat) but that is neither a local minimum nor a local maximum: the surface curves up in one direction and down in another. High-dimensional optimization often gets stuck near these points.
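A concrete (assumed) example is f(x, y) = x² - y², which has a saddle at the origin: the gradient vanishes there, yet the Hessian has eigenvalues of both signs:

```python
import numpy as np

# f(x, y) = x**2 - y**2: curves up along x, down along y.

def grad(x, y):
    return np.array([2 * x, -2 * y])      # ∇f = [∂f/∂x, ∂f/∂y]

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])         # constant Hessian of f

print(grad(0.0, 0.0))                     # zero gradient: a critical point
print(np.linalg.eigvalsh(hessian))        # one negative, one positive eigenvalue: a saddle
```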
Learning Rate (α)
A hyperparameter that controls the step size in Gradient Descent. Too small and convergence is slow; too large and the updates overshoot and diverge.
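A minimal sketch of this trade-off on the assumed loss L(w) = w², whose gradient is 2w (the α values are illustrative):

```python
# Gradient Descent on L(w) = w**2 from w = 1, with three different learning rates.

def run(alpha, steps=20):
    w = 1.0
    for _ in range(steps):
        w -= alpha * 2 * w      # w = w - α∇L(w)
    return w

print(run(0.01))   # too small: still far from the minimum at 0 (slow convergence)
print(run(0.4))    # reasonable: essentially at 0
print(run(1.1))    # too large: |w| grows every step (overshooting / divergence)
```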
SGD (Stochastic Gradient Descent)
An optimization variant that estimates the gradient from a single random training example (or a mini-batch) rather than the entire dataset. Each step is much cheaper, and the gradient noise helps the iterates escape shallow local minima.
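Below is a minimal sketch of single-example SGD on an assumed toy problem: fitting y = w·x to synthetic data generated with w = 2. The data, learning rate, and step count are illustrative:

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 6)]    # synthetic examples on the line y = 2x

w, alpha = 0.0, 0.01
for _ in range(1000):
    x, y = random.choice(data)                # gradient estimated from one random example
    grad = 2 * (w * x - y) * x                # ∂/∂w of the squared error (w*x - y)**2
    w -= alpha * grad                         # cheap, noisy update

print(w)                                      # ≈ 2.0: recovers the true slope
```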
Jacobian vs. Hessian
The Jacobian holds first-order derivatives (a Slope Matrix) of a vector-valued function; the Hessian holds second-order derivatives (a Curvature Matrix) of a scalar-valued function.
Dual Numbers
A number system of the form a + b·ε, where ε² = 0, used in Forward-Mode AutoDiff. The real part tracks the value, and the dual part tracks the derivative.
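Here is a minimal sketch of forward-mode differentiation with a hand-rolled Dual class; the class name, the supported operations (+ and ×), and the test function are illustrative assumptions:

```python
# Forward-mode AutoDiff with dual numbers a + b·ε, where ε² = 0.

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value = value          # real part: the function value
        self.deriv = deriv          # dual part: the derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + bε)(c + dε) = ac + (ad + bc)ε, since ε² = 0: the product rule falls out
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__

def f(x):
    return x * x + 3 * x + 1        # f'(x) = 2x + 3

x = Dual(2.0, 1.0)                  # seed the dual part with dx/dx = 1
y = f(x)
print(y.value, y.deriv)             # 11.0 and 7.0: f(2) = 11, f'(2) = 7
```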