Multivariable Calculus: The Gradient Vector

1. Introduction: Beyond y=f(x)

Real-world problems rarely depend on a single variable.

  • House Price: Depends on (Size, Rooms, Location, Age).
  • Neural Network Loss: Depends on millions of weights (w₁, w₂, …, wₙ).

We need calculus for functions with vector inputs: f(x).


2. Partial Derivatives

If z = f(x, y) = x² + y², how does z change? It depends on which direction you move!

  • Partial with respect to x (∂f / ∂x): Treat y as a constant (like slicing the mountain along the East-West axis) and differentiate with respect to x.
  • Partial with respect to y (∂f / ∂y): Treat x as a constant (slicing North-South) and differentiate with respect to y.

Example: f(x, y) = 3x²y (checked numerically in the sketch below)

  • ∂f / ∂x = 6xy (Treat y as a fixed constant, like 5).
  • ∂f / ∂y = 3x² (Treat x as a fixed constant, like 5).
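A quick way to sanity-check these hand results is a central finite difference: nudge one input while holding the other fixed. A minimal sketch in plain Python (the helper name `partial` and the probe point are just for illustration):

```python
def f(x, y):
    """The example function f(x, y) = 3x²y."""
    return 3 * x**2 * y

def partial(func, x, y, wrt="x", h=1e-6):
    """Central finite difference: nudge one input, hold the other constant."""
    if wrt == "x":
        return (func(x + h, y) - func(x - h, y)) / (2 * h)
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x, y = 2.0, 5.0
print(partial(f, x, y, "x"), 6 * x * y)   # ≈ 60.0 vs 60.0  (∂f/∂x = 6xy)
print(partial(f, x, y, "y"), 3 * x**2)    # ≈ 12.0 vs 12.0  (∂f/∂y = 3x²)
```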

3. The Gradient Vector (∇f)

If we collect all partial derivatives into a vector, we get the Gradient:

∇f(x, y) = [ ∂f / ∂x, ∂f / ∂y ]
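For example, if f(x, y) = x² + y², then ∇f(x, y) = [2x, 2y]. At the point (1, 2) the gradient is [2, 4], which points directly away from the minimum at the origin, i.e. uphill.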

Properties of the Gradient

  1. It is a Vector (has direction and magnitude).
  2. It points in the Direction of Steepest Ascent (Uphill).
  3. Its magnitude ‖∇f‖ tells you how steep the slope is.

[!TIP] Gradient Descent: To find the minimum (bottom of the valley), we go in the opposite direction of the gradient: -∇f.
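As a rough illustration of that tip (not any particular library's implementation), here are a few steps of plain gradient descent on f(x, y) = x² + y², whose gradient is [2x, 2y]. The starting point and learning rate are arbitrary choices:

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x² + y²."""
    return 2 * x, 2 * y

x, y = 3.0, -4.0   # arbitrary starting point on the bowl
lr = 0.1           # learning rate (step size), arbitrary for this sketch

for _ in range(50):
    gx, gy = grad_f(x, y)
    x -= lr * gx   # step *against* the gradient: downhill
    y -= lr * gy

print(x, y)        # both coordinates have shrunk toward the minimum at (0, 0)
```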


4. Matrix Derivatives

In Deep Learning, we deal with layers of neurons, so we need matrices.

4.1 The Jacobian Matrix (J) - The Slope Map

If we have a function mapping a vector to a vector (f: ℝⁿ → ℝᵐ), the first derivative is an m × n matrix called the Jacobian.

  • Shape: (Output Dim) × (Input Dim).
  • Usage: Backpropagating errors through a layer. It tells us how every output changes with respect to every input (see the sketch below).
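A minimal sketch of the shape convention, using a toy map from ℝ² to ℝ³ and central finite differences (NumPy only; the function `g` is made up for illustration):

```python
import numpy as np

def g(v):
    """Toy map from R^2 to R^3: g(x, y) = (xy, x + y, x²)."""
    x, y = v
    return np.array([x * y, x + y, x**2])

def numerical_jacobian(func, v, h=1e-6):
    """J[i, j] = ∂(output_i)/∂(input_j), so J has shape (output dim, input dim)."""
    v = np.asarray(v, dtype=float)
    m, n = func(v).shape[0], v.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (func(v + e) - func(v - e)) / (2 * h)
    return J

print(numerical_jacobian(g, [2.0, 3.0]))
# Analytic Jacobian at (2, 3): [[3, 2], [1, 1], [4, 0]] (a 3 × 2 matrix)
```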

4.2 The Hessian Matrix (H) - The Curvature Map

If we have a scalar-valued function (L: ℝⁿ → ℝ), the second derivative is an n × n symmetric matrix called the Hessian.

  • Hᵢⱼ = ∂²L / ∂wᵢ ∂wⱼ
  • Shape: (Input Dim) × (Input Dim).
  • Usage: Determines Curvature (Bowl vs Saddle); see the sketch after this list.
    • Positive Definite H: Valley (Minimum). We want to be here.
    • Indefinite H: Saddle Point. We want to escape this.
    • Newton’s Method uses the inverse of the Hessian to jump toward the minimum, but its O(n³) cost is too expensive for Deep Learning.
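A small sketch of how the definiteness check looks in practice, using two made-up 2-variable quadratic losses (a real Hessian would come from your model; the matrices here are invented for illustration):

```python
import numpy as np

# Hessians of two invented quadratic losses in the weights (w1, w2).
H_bowl   = np.array([[2.0, 0.0], [0.0, 6.0]])    # L = w1² + 3·w2²  -> valley
H_saddle = np.array([[2.0, 0.0], [0.0, -6.0]])   # L = w1² - 3·w2²  -> saddle

for name, H in [("bowl", H_bowl), ("saddle", H_saddle)]:
    eig = np.linalg.eigvalsh(H)   # Hessian is symmetric, so eigenvalues are real
    if np.all(eig > 0):
        verdict = "positive definite: minimum (valley)"
    elif np.all(eig < 0):
        verdict = "negative definite: maximum (peak)"
    else:
        verdict = "indefinite: saddle point"
    print(name, eig, verdict)
```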

5. Interactive Visualizer: The Gradient Compass

The background shows a “Hill” function: z = 4 - (x² + y²).

  • Bright Center: Peak (High Z).
  • Dark Edges: Valley (Low Z).

Interaction:

  1. Move Mouse: The Red Arrow represents the Gradient ∇f. It always points Uphill (towards the center peak).
  2. Click: Spawn a “Ball” that rolls Downhill (opposite to the gradient).
    • Unlike simple Gradient Descent, these balls have Momentum (Mass). They accelerate down the slope (v += a), oscillate, and eventually settle due to friction, as in the sketch below.
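A rough sketch of the update rule described above (not the page's actual source; the time step and friction constants are invented). On z = 4 - (x² + y²) the downhill acceleration is -∇z = (2x, 2y), i.e. away from the bright peak at the origin:

```python
# A "ball" with momentum rolling downhill on z = 4 - (x² + y²).
x, y = 0.5, 0.2              # drop point (where the user clicked)
vx, vy = 0.0, 0.0            # initial velocity
dt, friction = 0.05, 0.90    # invented constants for this sketch

for _ in range(50):
    ax, ay = 2 * x, 2 * y            # a = -∇z: downhill, away from the peak
    vx = (vx + ax * dt) * friction   # v += a·dt, then damp with friction
    vy = (vy + ay * dt) * friction
    x += vx * dt
    y += vy * dt

print(x, y)   # the ball has rolled away from the peak, toward the dark valley
```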

6. Summary

  • Partial Derivative: Slope along one axis (Sensitivity to one input).
  • Gradient Vector: Combined direction of steepest ascent. We move against it to learn.
  • Jacobian Matrix: The first derivatives of a vector-valued output (The Slope Map).
  • Hessian Matrix: The second derivatives of a scalar loss (The Curvature Map).

Next: Taylor Series