Jacobian & Hessian Matrices

[!NOTE] This module covers Jacobian & Hessian matrices: what they mean geometrically, how they drive optimization, and how to compute them with automatic differentiation.

1. Introduction: The Landscape of Loss

Training a Neural Network is like hiking down a mountain in the dark. You want to reach the lowest point (Global Minimum Loss). To do this, you need to know:

  1. Which way is down? (Gradient).
  2. Is the ground curving? (Hessian).

2. The Gradient & Jacobian (First Derivative)

The Gradient (∇f) tells you the direction of steepest ascent. You go the opposite way to minimize loss.

If you have a function with multiple outputs (like a layer in a neural net), the derivatives form a matrix called the Jacobian (J).

J_ij = ∂y_i / ∂x_j
  • Meaning: How much does Output i change when I wiggle Input j?
  • Deep Learning: Used in Backpropagation to pass errors backward.
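As a small sketch of this sensitivity matrix, here is the Jacobian of a hypothetical two-output, three-input function (`g` is an illustrative example, not from the text), computed with PyTorch's functional API:

```python
import torch

# A toy function with 2 outputs and 3 inputs (like a tiny nonlinear layer)
def g(x):
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
J = torch.autograd.functional.jacobian(g, x)  # shape (2, 3)
print(J)
# J[i][j] answers: how much does output i change when I wiggle input j?
# Here: [[2., 1., 0.],
#        [0., 1., 6.]]
```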

3. The Hessian (Second Derivative)

The Hessian (H) is a matrix of second derivatives. It describes the curvature of the landscape.

H_ij = ∂²f / ∂x_i ∂x_j

The Eigenvalues of the Hessian tell us the shape of the terrain:

  • All Positive: Bowl (Convex). Local Minimum.
  • All Negative: Hill (Concave). Local Maximum.
  • Mixed Signs: Saddle Point. (Up in one direction, down in another).
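This eigenvalue test can be sketched directly in PyTorch: compute the Hessian at a point, take its eigenvalues, and read off the terrain type (the helper `classify` is an illustrative name, not a library function):

```python
import torch

def classify(f, point):
    """Classify a critical point by the signs of the Hessian eigenvalues."""
    H = torch.autograd.functional.hessian(f, point)
    eigs = torch.linalg.eigvalsh(H)  # the Hessian is symmetric
    if (eigs > 0).all():
        return "local minimum (bowl)"
    if (eigs < 0).all():
        return "local maximum (hill)"
    return "saddle point"

bowl   = lambda v: v[0]**2 + v[1]**2
saddle = lambda v: v[0]**2 - v[1]**2

origin = torch.tensor([0.0, 0.0])
print(classify(bowl, origin))    # eigenvalues 2, 2 -> bowl
print(classify(saddle, origin))  # eigenvalues 2, -2 -> saddle
```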

Newton’s Method (The Smart Jump)

Gradient Descent takes many tiny steps. Newton’s Method uses the curvature (Hessian) to rescale the step and leap straight to the bottom of the bowl (exactly so, when the function is quadratic).

x_new = x_old − H⁻¹ ∇f

Why don’t we always use it? For a billion-parameter network, the Hessian has N² entries, and inverting it costs O(N³). Both storing and inverting it are prohibitively expensive.
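To make the update rule concrete, here is a sketch of a single Newton step on a simple bowl f(v) = 2x² + y² (the specific quadratic is chosen for illustration). Because the function is quadratic, one step lands exactly at the minimum:

```python
import torch

def f(v):
    return 2 * v[0]**2 + v[1]**2  # a bowl with its minimum at the origin

x = torch.tensor([3.0, 4.0])
grad = torch.autograd.functional.jacobian(f, x)   # gradient: [4x, 2y]
H = torch.autograd.functional.hessian(f, x)       # [[4, 0], [0, 2]]

# Newton step: x_new = x_old - H^{-1} grad
# (torch.linalg.solve avoids forming the inverse explicitly)
x_new = x - torch.linalg.solve(H, grad)
print(x_new)  # lands exactly at the minimum: tensor([0., 0.])
```

Note the use of a linear solve rather than an explicit inverse; even in small examples, solving H s = ∇f is cheaper and more numerically stable than computing H⁻¹.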


4. Interactive Visualizer: The Landscape Explorer v3.0

Explore different optimization landscapes.

  • Gradient Step: Takes a small step downhill. Safe but slow.
  • Newton Step: Uses curvature to jump. Fast, but can fail if the Hessian is not positive definite (e.g., Saddle Points).
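The saddle-point failure mode can be sketched in a few lines: on f = x² − y², the Newton step solves for the stationary point of the local quadratic model, which is the saddle itself, so it jumps toward the saddle instead of descending (the step size 0.1 for gradient descent is an illustrative choice):

```python
import torch

def saddle(v):
    return v[0]**2 - v[1]**2

x = torch.tensor([0.5, 0.1])
grad = torch.autograd.functional.jacobian(saddle, x)  # [2x, -2y]
H = torch.autograd.functional.hessian(saddle, x)      # [[2, 0], [0, -2]]

newton_step = -torch.linalg.solve(H, grad)
print(x + newton_step)   # jumps straight to the saddle: tensor([0., 0.])

grad_step = -0.1 * grad  # plain gradient descent, step size 0.1
print(x + grad_step)     # moves downhill: x shrinks, |y| grows
```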

[!TIP] Try it yourself: Click on the canvas to place the probe. Switch between Bowl, Saddle, and Hill. Compare how Gradient Step and Newton Step behave near the Saddle Point (center).

[Interactive canvas: loss heatmap (blue = low, red = high) with a movable probe showing position (x, y), Hessian eigenvalues λ1 and λ2, and the resulting terrain classification, e.g. CONVEX (Bowl).]

5. Coding: Automatic Differentiation

In Deep Learning, we don’t calculate derivatives by hand. We use Autograd. Here is how to calculate the Gradient and Hessian in PyTorch.

import torch

# 1. Define a function (e.g., Saddle Point: x^2 - y^2)
def f(x, y):
    return x**2 - y**2

# 2. Define inputs requiring gradient
x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([1.0], requires_grad=True)

# 3. Compute function value
z = f(x, y)

# 4. Compute Gradient (First Derivative)
z.backward()
print(f"Gradient at (1,1): dx={x.grad.item()}, dy={y.grad.item()}")
# Output: dx=2.0 (2x), dy=-2.0 (-2y)

# 5. Compute Hessian (Second Derivative) using the functional API.
# Note: hessian() called with a tuple of inputs returns a nested tuple
# of blocks, so for clarity we treat the input as a single vector [x, y].
def f_vec(v):
    return v[0]**2 - v[1]**2

print("\nHessian Matrix:")
v = torch.tensor([1.0, 1.0])
H = torch.autograd.functional.hessian(f_vec, v)
print(H)
# Output:
# tensor([[ 2.,  0.],
#         [ 0., -2.]])

6. Summary

  • Gradient: Direction of steepest climb. (Use negative gradient to descend).
  • Jacobian: Matrix of all first-order derivatives. Measures sensitivity.
  • Hessian: Matrix of second-order derivatives. Measures curvature.
  • Optimization: We want to find points where Gradient is zero and Hessian is Positive Definite (Bowl).