Automatic Differentiation: The Magic of PyTorch
> [!NOTE]
> This module builds Automatic Differentiation up from first principles, explaining how PyTorch computes gradients under the hood.
1. Introduction: Who computes the gradients?
In calculus class, you calculated derivatives by hand. In early AI (the 1980s), researchers derived gradients on paper, simplified them, and coded them in C++. In modern frameworks (PyTorch, TensorFlow), you write the forward pass and the framework computes the backward pass (the gradients) automatically. This is Automatic Differentiation (AutoDiff).
It is NOT numerical differentiation (finite differences, which is slow and imprecise). It is NOT symbolic differentiation (like Mathematica, which can lead to expression explosion). It is the exact application of the Chain Rule on a graph structure.
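To make the contrast concrete, here is a quick sketch of why finite differences are only approximate (the function and step size are chosen purely for illustration):

```python
# Forward-difference approximation vs. the exact derivative of f(x) = x**3.
# The true derivative is f'(x) = 3 * x**2, so f'(2) = 12 exactly.
def f(x):
    return x ** 3

x, h = 2.0, 1e-5
finite_diff = (f(x + h) - f(x)) / h   # numerical: approximate, depends on h
exact = 3 * x ** 2                    # what AutoDiff returns: exact

print(finite_diff)  # close to 12, but off by roughly 3*x*h
print(exact)        # 12.0
```

Shrinking `h` reduces the truncation error but increases floating-point cancellation; AutoDiff sidesteps the trade-off entirely by applying exact derivative rules.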
2. The Computational Graph
Every calculation in your code builds a Directed Acyclic Graph (DAG).
- Nodes: Operations (Addition, Multiplication, Sin, Exp).
- Edges: Tensors (Data flowing between operations).
Example: y = (x + w) × b
- Input Nodes: x, w, b.
- Intermediate Node: a = x + w.
- Output Node: y = a \times b.
To find \frac{\partial y}{\partial x}, we just traverse the graph backwards!
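Traversing backwards is just the chain rule applied edge by edge. A minimal hand-worked sketch (the input values are arbitrary):

```python
# y = (x + w) * b, differentiated by hand along the graph.
x, w, b = 2.0, 1.0, 3.0

# Forward pass: compute every node.
a = x + w              # a = 3.0
y = a * b              # y = 9.0

# Backward pass: multiply local derivatives along the path from y to x.
dy_da = b              # d(a*b)/da = b
da_dx = 1.0            # d(x+w)/dx = 1
dy_dx = dy_da * da_dx  # chain rule: 3.0

print(y, dy_dx)        # 9.0 3.0
```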
3. Forward vs Reverse Mode
Why do we always talk about “Backpropagation”? Why not “Forwardpropagation” of gradients?
Forward Mode
Computes \frac{\partial v}{\partial x} for every node v as we go forward.
- Mechanism: We track the value and the derivative w.r.t. one input simultaneously.
- Best for: Functions with few inputs and many outputs.
- Cost: Proportional to the number of inputs.
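Forward mode is commonly implemented with dual numbers: every value carries its derivative with respect to one chosen input as it flows through the computation. A minimal sketch (this `Dual` class is illustrative, not a real library API):

```python
class Dual:
    """Carries a value and its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# Differentiate f(x) = x*x + x at x = 3: f'(x) = 2x + 1 = 7.
x = Dual(3.0, deriv=1.0)  # seed: dx/dx = 1
y = x * x + x
print(y.value, y.deriv)   # 12.0 7.0
```

Note the limitation: one forward pass yields the derivative with respect to a single seeded input, so a million inputs would need a million passes.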
Reverse Mode (Backpropagation)
Computes y first, then goes backward to find \frac{\partial y}{\partial x}, \frac{\partial y}{\partial w}, \dots
- Mechanism: We compute the output, then propagate the error signal backwards.
- Best for: Functions with many inputs (billion weights) and few outputs (1 loss value).
- Cost: Proportional to the number of outputs.
- Winner: Deep Learning! We typically have 1 Loss value and 100B parameters. Reverse mode gives us all 100B gradients in one pass.
4. Python: Autograd from Scratch
Let’s build a tiny AutoDiff engine (inspired by Andrej Karpathy’s Micrograd) to understand what PyTorch does under the hood.
```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # Local gradients for addition are 1.0: the upstream
            # gradient passes through to both operands unchanged
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # Local gradient for multiplication is the other factor
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort (inputs before outputs), so that reversed(topo)
        # runs each node's _backward before the _backward of its inputs
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # Seed the output gradient and walk the graph backwards
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# Usage
x = Value(2.0, label='x')
w = Value(-3.0, label='w')
b = Value(10.0, label='b')
a = x * w; a.label = 'a'
y = a + b; y.label = 'y'
y.backward()

print(f"y = {y.data}")      # y = 4.0
print(f"dy/dx = {x.grad}")  # dy/dx = -3.0
```
5. Interactive Visualizer: Graph Builder
Visualize the Computational Graph for y = (x + w) × b.
- Inputs: x=2, w=1, b=3.
- Forward Pass (Blue): Values flow up. 2+1 = 3, then 3 \times 3 = 9.
- Backward Pass (Red): Gradients flow down. \frac{\partial y}{\partial y} = 1 at the output, then the product rule splits it: \frac{\partial y}{\partial a} = b = 3 and \frac{\partial y}{\partial b} = a = 3. Addition passes \frac{\partial y}{\partial a} through unchanged, so x and w each receive a gradient of 3.
6. Summary
- Computational Graph: Represents math as a DAG of operations.
- AutoDiff: Applies the Chain Rule recursively on this graph.
- Reverse Mode: The secret sauce of Deep Learning that allows us to compute gradients for billions of parameters efficiently in one backward pass.