Approximation: The Taylor Series
1. Introduction: Complexity from Simplicity
How does a calculator compute sin(37°)? It doesn’t draw a triangle. It uses Polynomials.
Calculus tells us that any smooth function (like eˣ, sin(x), or a Neural Network Loss Function) can be approximated locally by a sum of simpler terms:
- A Constant (Starting point).
- A Line (Slope).
- A Parabola (Curve).
- …and so on.
This is the basis of Gradient Descent (First-Order Approximation) and Newton’s Method (Second-Order Approximation).
2. The Formula
The Taylor Series of f(x) centered at a is:
f(x) ≈ f(a) + f'(a)(x-a) + f''(a)/2! (x-a)² + f'''(a)/3! (x-a)³ + …
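As a quick illustration, here is a minimal Python sketch (my own, not part of the original text) of a hypothetical helper `taylor_approx` that evaluates this truncated sum given the derivative values f(a), f'(a), f''(a), … at the center. The example uses eˣ at a=0, whose derivatives are all 1.

```python
from math import factorial, exp

def taylor_approx(x, a, derivs):
    """Evaluate the truncated Taylor series of f at x, centered at a.

    derivs[k] holds the k-th derivative of f evaluated at a,
    so derivs[0] = f(a), derivs[1] = f'(a), and so on.
    """
    return sum(d / factorial(k) * (x - a) ** k for k, d in enumerate(derivs))

# For f(x) = e^x centered at a = 0, every derivative equals 1.
approx = taylor_approx(0.5, 0.0, [1.0] * 6)   # terms up to (x-a)^5 / 5!
print(approx, exp(0.5))                        # both print roughly 1.6487
```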
Why “Factorials”?
Imagine we want to approximate f(x) with a polynomial P(x) = c0 + c1x + c2x² + c3x³. We want the derivatives of P(x) to match those of f(x) at x=0.
- P(0) = c0. So c0 = f(0).
- P'(x) = c1 + 2c2x + 3c3x². At x=0, P'(0) = c1. So c1 = f'(0).
- P''(x) = 2c2 + 3×2c3x. At x=0, P''(0) = 2c2. So c2 = f''(0)/2.
- P'''(x) = 3×2c3. At x=0, P'''(0) = 6c3. So c3 = f'''(0)/6 = f'''(0)/3!.
The factorials (n!) appear because differentiating xⁿ repeatedly brings down n, n−1, n−2, … as multipliers. We divide by n! to cancel these factors out.
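To double-check the derivation above, a short symbolic sketch (using sympy, which is assumed to be available) differentiates the generic cubic P(x) three times and confirms that only 3!·c3 survives at x=0:

```python
import sympy as sp

x, c0, c1, c2, c3 = sp.symbols("x c0 c1 c2 c3")
P = c0 + c1 * x + c2 * x**2 + c3 * x**3

# Differentiating three times and plugging in x = 0 leaves only 3! * c3.
print(sp.diff(P, x, 3).subs(x, 0))   # prints 6*c3
```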
3. Example: Approximating sin(x) at a=0
At x=0, sin(0)=0, cos(0)=1.
- f(x) = sin(x) → f(0) = 0
- f'(x) = cos(x) → f'(0) = 1
- f''(x) = -sin(x) → f''(0) = 0
- f'''(x) = -cos(x) → f'''(0) = -1
Putting it together (Maclaurin Series):
sin(x) ≈ x − x³/3! + x⁵/5! − x⁷/7! + …
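The quality of this approximation is easy to check numerically. The sketch below (an illustration, not from the original text) compares the series truncated at the x⁷ term with math.sin: the error is tiny near 0 and grows as x moves away from the center.

```python
from math import sin, factorial

def sin_taylor(x, terms=4):
    """Maclaurin series of sin(x): x - x^3/3! + x^5/5! - x^7/7! + ..."""
    return sum((-1) ** k * x ** (2 * k + 1) / factorial(2 * k + 1)
               for k in range(terms))

for x in (0.1, 1.0, 3.0):
    print(f"x={x}: series={sin_taylor(x):+.6f}  sin={sin(x):+.6f}")
```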
4. Connection to Optimization
In Machine Learning, we want to find the minimum of a Loss Function L(w).
4.1 First-Order: Gradient Descent
We use the first two terms (Line Approximation):
L(w+h) ≈ L(w) + L'(w)h
We step downhill along this line.
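Here is a minimal sketch of that idea in one dimension, using a hypothetical loss L(w) = (w − 3)² and a learning rate of 0.1; it is only meant to show the "step downhill along the local line" logic.

```python
def L(w):          # hypothetical 1-D loss with its minimum at w = 3
    return (w - 3) ** 2

def dL(w):         # its first derivative (the slope of the local line)
    return 2 * (w - 3)

w, lr = 0.0, 0.1
for step in range(25):
    w -= lr * dL(w)          # move against the slope
print(w)                      # close to 3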
4.2 Second-Order: Newton’s Method
We use the first three terms (Parabola Approximation):
L(w+h) ≈ L(w) + L'(w)h + 0.5 L''(w)h²
We jump directly to the bottom of this parabola, which sits at h = −L'(w)/L''(w).
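For comparison, a sketch of the Newton step on the same hypothetical quadratic loss as above: because that loss really is a parabola, a single step lands exactly on the minimum.

```python
def L(w):            # same hypothetical quadratic loss, minimum at w = 3
    return (w - 3) ** 2

def dL(w):           # first derivative
    return 2 * (w - 3)

def d2L(w):          # second derivative (constant for a parabola)
    return 2.0

w = 0.0
w += -dL(w) / d2L(w)     # Newton step: jump to the parabola's minimum
print(w)                  # exactly 3.0
```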
> [!TIP]
> Trade-off: Newton's Method converges in fewer steps, but each step is computationally expensive: calculating (and inverting) the second derivative (the Hessian Matrix) for millions of parameters is infeasible. That's why we stick to Gradient Descent.
5. Interactive Visualizer: Polynomial Builder
See how adding more terms (N) makes the polynomial P_N(x) hug the target curve tightly near the center (x=0), but diverge far from it.
- N=1: Line (Linear Model). Good only near center.
- N=3: Cubic curve.
- Show Error: Visualize the difference (Residuals) between truth and approximation.
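There is no interactive widget in this text version, but a minimal offline sketch with numpy and matplotlib (both assumed available) reproduces the idea: plot sin(x) against its Taylor polynomials P_N for a few values of N and watch where they separate.

```python
import numpy as np
import matplotlib.pyplot as plt
from math import factorial

def sin_taylor(x, N):
    """Maclaurin polynomial of sin(x) including terms up to degree N."""
    return sum((-1) ** k * x ** (2 * k + 1) / factorial(2 * k + 1)
               for k in range(N // 2 + 1) if 2 * k + 1 <= N)

x = np.linspace(-2 * np.pi, 2 * np.pi, 400)
plt.plot(x, np.sin(x), "k", label="sin(x)")
for N in (1, 3, 7):
    plt.plot(x, sin_taylor(x, N), label=f"P_{N}(x)")
plt.ylim(-3, 3)
plt.legend()
plt.show()
```

Subtracting np.sin(x) from each P_N(x) would give the residual curve that the "Show Error" toggle describes.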
6. Summary
- Taylor Series: Converting hard functions to easy polynomials.
- Gradient Descent: A Linear approximation (Taylor N=1). We step down the slope.
- Newton’s Method: A Quadratic approximation (Taylor N=2). We jump to the minimum of the parabola.
- Trust Region: We must restrict our steps to the region where the Taylor approximation is valid.
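To make the last bullet concrete, here is a heavily simplified sketch (my own illustration, not a full trust-region algorithm): the proposed Newton step is clipped to a maximum radius, so we never move farther than the region where we trust the local parabola.

```python
def clipped_newton_step(grad, hess, radius=0.5):
    """Newton step h = -grad/hess, clipped to a trust radius around w."""
    h = -grad / hess
    return max(-radius, min(radius, h))

# A large proposed jump gets shrunk to the trusted radius.
print(clipped_newton_step(grad=-6.0, hess=2.0, radius=0.5))   # 0.5 instead of 3.0
```

Real trust-region methods go further and adapt the radius based on how well the parabola predicted the actual decrease in the loss.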