Case Study: Backpropagation from Scratch
This module derives backpropagation from first principles, implements it by hand in NumPy, and examines the failure modes that appear once networks get deep.
1. Introduction: The Algorithm That Runs the World
GPT-4, Stable Diffusion, AlphaGo: all of them are trained with gradients computed by one algorithm, Backpropagation. At its core it is the Chain Rule applied to the Computational Graph of a Neural Network.
But implementing it by hand reveals hidden dangers:
- Vanishing Gradients: Why deep Sigmoid networks were so hard to train before about 2010.
- Saturated Neurons: Why ReLU usually trains better than Sigmoid in deep networks.
- Initialization: Why the scale of the random initial weights matters.
2. Intuition: The Corporate “Blame Game”
Before diving into the calculus, let’s build a mental model. Think of a deep neural network as a multi-tiered corporation trying to produce a perfect product.
- Forward Pass (Production): Raw data (Inputs) enters the factory floor. Workers (Hidden Layer 1) process it and pass it to Managers (Hidden Layer 2), who hand the final product to the CEO (Output Layer).
- Loss Function (Quality Control): The CEO evaluates the final product against the market target. “This is terrible! The error (Loss) is huge!”
- Backpropagation (Assigning Blame): The CEO doesn’t just yell into the void. They yell at the Managers, calculating exactly: “How much of this error is your fault?” (The derivative of Loss with respect to the Output).
- The Managers then turn to the Workers and say: “We only assembled what you gave us! How much of this is YOUR fault?” This is the Chain Rule: multiplying the manager’s inherited blame by the worker’s direct contribution to the manager’s mistake.
- Gradient Descent (Course Correction): Based on the exact proportion of blame received, everyone adjusts their internal processes (Weights) by a small margin (Learning Rate) so the next iteration is slightly better.
This recursive passing of “blame” proportional to contribution is the mathematical essence of backpropagation.
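The blame-and-correct loop can be sketched on the smallest possible "company": a single weight. This is only an illustration; the model `pred = w * x`, the helper name `loss_and_blame`, and the values of `x`, `y`, and `lr` are assumptions chosen for the example, not part of the network studied below.

```python
# Hypothetical one-weight model: prediction = w * x, loss = 0.5 * (prediction - y)^2
def loss_and_blame(w, x, y):
    pred = w * x                   # forward pass (production)
    error = pred - y               # dL/dpred: the blame handed down from the output
    grad_w = error * x             # chain rule: dL/dw = dL/dpred * dpred/dw
    return 0.5 * error ** 2, grad_w

w, x, y, lr = 0.0, 2.0, 4.0, 0.1   # illustrative values (assumed)
losses = []
for step in range(20):
    loss, grad_w = loss_and_blame(w, x, y)
    losses.append(loss)
    w -= lr * grad_w               # course correction: a small step against the gradient

print(f"final w = {w:.4f}, first loss = {losses[0]:.4f}, last loss = {losses[-1]:.2e}")
```

With these values the weight moves toward 2.0, the value at which `w * x` equals `y`, and the loss shrinks at every step.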
3. The Network Architecture
Consider a tiny network with 1 input, 1 hidden neuron, and 1 output.
- Input: x
- Hidden Layer: h = \sigma(z_1), where z_1 = w_1 x + b_1
- Output Layer: \hat{y} = w_2 h + b_2
- Loss: L = \frac{1}{2}(\hat{y} - y)^2
We want to find \frac{\partial L}{\partial w_1}.
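As a concrete reference, the forward pass of this tiny network can be written in a few lines of plain Python. The function name `forward` and the numeric values passed to it are assumptions picked so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, y, w1, b1, w2, b2):
    """Forward pass of the 1-input, 1-hidden-neuron, 1-output network."""
    z1 = w1 * x + b1               # hidden pre-activation
    h = sigmoid(z1)                # hidden activation
    y_hat = w2 * h + b2            # linear output layer
    loss = 0.5 * (y_hat - y) ** 2  # squared-error loss
    return z1, h, y_hat, loss

# Illustrative values (assumed): w1 = 0 makes z1 = 0, so h = sigmoid(0) = 0.5
z1, h, y_hat, loss = forward(x=1.0, y=0.0, w1=0.0, b1=0.0, w2=2.0, b2=0.0)
print(z1, h, y_hat, loss)  # 0.0 0.5 1.0 0.5
```

Returning the intermediate values `z1` and `h` alongside the loss is a deliberate choice: the backward pass reuses them.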
4. The Derivation (Chain Rule)
To find how the Loss L changes with respect to w1, we chain derivatives backwards from the output:
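- Loss Gradient: \frac{\partial L}{\partial \hat{y}} = (\hat{y} - y)
- Output Weight: \frac{\partial \hat{y}}{\partial h} = w_2
- Activation Gradient: \frac{\partial h}{\partial z_1} = \sigma'(z_1) = \sigma(z_1)(1-\sigma(z_1))
- Input: \frac{\partial z_1}{\partial w_1} = x
Multiplying them together:
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = (\hat{y} - y) \cdot w_2 \cdot \sigma'(z_1) \cdot x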
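One way to sanity-check the product of these four factors is a finite-difference test: nudge w_1 slightly in both directions and compare the measured slope of the loss with the analytic value. The sketch below assumes arbitrary parameter values; the helper names `loss_fn` and `grad_w1` are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(x, y, w1, b1, w2, b2):
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return 0.5 * (y_hat - y) ** 2

def grad_w1(x, y, w1, b1, w2, b2):
    """Analytic dL/dw1 = (y_hat - y) * w2 * sigma'(z1) * x."""
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return (y_hat - y) * w2 * h * (1.0 - h) * x

# Arbitrary test point (assumed values)
params = dict(x=1.5, y=0.3, w1=0.8, b1=-0.2, w2=1.1, b2=0.05)
analytic = grad_w1(**params)

# Central difference: (L(w1 + eps) - L(w1 - eps)) / (2 * eps)
eps = 1e-6
plus = dict(params, w1=params["w1"] + eps)
minus = dict(params, w1=params["w1"] - eps)
numeric = (loss_fn(**plus) - loss_fn(**minus)) / (2 * eps)

print(f"analytic = {analytic:.8f}, numeric = {numeric:.8f}")
```

If the two numbers disagree beyond a few decimal places, the derivation or its implementation has a bug. The same check scales to larger networks, one parameter at a time.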
5. The Vanishing Gradient Problem
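Notice the term \sigma'(z_1). The derivative of the Sigmoid function never exceeds 0.25 (its value at z = 0). In a network with 10 Sigmoid layers, the gradient reaching the first layer therefore picks up an activation factor of at most 0.25^{10} \approx 0.00000095, before the weights are even taken into account.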
The gradient vanishes, meaning the “blame signal” becomes infinitesimally small before it reaches the early layers. The first layers effectively stop learning entirely.
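The shrinkage is easy to reproduce numerically. The sketch below multiplies the Sigmoid derivatives of 10 layers together, one layer at a time. The pre-activations are drawn from a standard normal distribution purely as an assumption for illustration, and the weights are ignored so that only the activation factor is visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
depth = 10
z = rng.normal(0.0, 1.0, size=depth)   # assumed pre-activation, one per layer
factors = sigmoid_derivative(z)        # each factor is at most 0.25
running = np.cumprod(factors)          # activation factor accumulated so far
best_case = 0.25 ** depth              # upper bound: every layer sits exactly at z = 0

for layer, value in enumerate(running, start=1):
    print(f"after {layer:2d} layers: {value:.3e}")
print(f"upper bound 0.25^{depth} = {best_case:.3e}")
```

Because every factor is at most 0.25, the accumulated product can never exceed the 0.25^10 bound, and in practice it is smaller.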
War Story: The AI Winter
In the 1990s and early 2000s, researchers hit a wall. They expected that adding more layers (Deep Learning) would let networks learn more complex features, but training Sigmoid networks more than a few layers deep with plain backpropagation was unreliable. This contributed to a period of reduced funding and interest in neural networks. Deep networks became practical to train in the early 2010s, helped by the widespread adoption of the ReLU activation function (whose gradient is exactly 1 for positive inputs, so the signal does not decay through active units) and by better initialization schemes (Xavier/Glorot Initialization in 2010, He Initialization in 2015), which helped trigger the modern AI boom.
Solution: Use ReLU (gradient is 1 for positive inputs, 0 otherwise) and proper weight initialization to preserve the variance of gradients across layers.
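A rough experiment makes the difference visible. The sketch below pushes a random gradient backwards through 10 fully connected layers and reports how much of its norm survives. Everything here is an assumption made for illustration: the helper name `backward_signal_ratio`, the width of 256, the batch of 32 random inputs, the omission of biases, and the pairing of ReLU with He-style variance (2 / fan_in) and Sigmoid with Xavier-style variance (1 / fan_in for square layers).

```python
import numpy as np

def backward_signal_ratio(activation, depth=10, width=256, seed=0):
    """Return ||gradient at the input|| / ||gradient at the output|| for a random deep stack."""
    rng = np.random.default_rng(seed)
    gain = 2.0 if activation == "relu" else 1.0   # He-style vs Xavier-style variance
    a = rng.normal(size=(32, width))              # random input batch (assumed)
    cache = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(gain / width), size=(width, width))
        z = a @ W
        if activation == "relu":
            a = np.maximum(z, 0.0)
            local = (z > 0).astype(float)         # ReLU derivative: 1 or 0
        else:
            a = 1.0 / (1.0 + np.exp(-z))
            local = a * (1.0 - a)                 # Sigmoid derivative: at most 0.25
        cache.append((W, local))

    g = rng.normal(size=a.shape)                  # stand-in for dL/d(output)
    start = np.linalg.norm(g)
    for W, local in reversed(cache):
        g = (g * local) @ W.T                     # chain rule through one layer
    return np.linalg.norm(g) / start

relu_ratio = backward_signal_ratio("relu")
sigmoid_ratio = backward_signal_ratio("sigmoid")
print(f"ReLU + He-style init:        {relu_ratio:.3e}")
print(f"Sigmoid + Xavier-style init: {sigmoid_ratio:.3e}")
```

In this setup the ReLU stack should keep the gradient norm within roughly the same order of magnitude, while the Sigmoid stack should shrink it by several orders of magnitude. Exact numbers depend on the seed and sizes.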
6. Python: Backprop from Scratch
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Data: the four XOR input pairs and their targets
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Layer sizes
input_size = 2
hidden_size = 4
output_size = 1

# Weights start as small random values; biases start at zero
np.random.seed(42)
W1 = np.random.uniform(-1, 1, (input_size, hidden_size))
b1 = np.zeros((1, hidden_size))
W2 = np.random.uniform(-1, 1, (hidden_size, output_size))
b2 = np.zeros((1, output_size))

lr = 0.5

for epoch in range(5000):
    # Forward pass
    z1 = np.dot(x, W1) + b1
    h = sigmoid(z1)
    z2 = np.dot(h, W2) + b2
    y_pred = sigmoid(z2)  # sigmoid on the output too, to keep predictions in (0, 1)

    # Loss: half the sum of squared errors over the four examples
    loss = 0.5 * np.sum((y_pred - y) ** 2)

    # Backward pass: gradient of the loss with respect to z2
    d_loss_z2 = (y_pred - y) * sigmoid_derivative(z2)

    # Gradients for the output layer
    d_W2 = np.dot(h.T, d_loss_z2)
    d_b2 = np.sum(d_loss_z2, axis=0, keepdims=True)

    # Propagate the gradient back through the hidden layer
    d_loss_h = np.dot(d_loss_z2, W2.T)
    d_loss_z1 = d_loss_h * sigmoid_derivative(z1)

    # Gradients for the hidden layer
    d_W1 = np.dot(x.T, d_loss_z1)
    d_b1 = np.sum(d_loss_z1, axis=0, keepdims=True)

    # Gradient descent update (the biases are trained as well)
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
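Before trusting a hand-written backward pass like the one above, it is worth comparing its gradients against finite differences. The sketch below rebuilds the same 2-4-1 Sigmoid network in a self-contained form and checks the gradient with respect to W1. The helper name `loss_and_grads` is an assumption, and the weights are drawn with `np.random.default_rng`, so they differ from the ones in the listing.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    """Forward and backward pass for a 2-layer Sigmoid network with squared-error loss."""
    h = sigmoid(x @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    loss = 0.5 * np.sum((y_pred - y) ** 2)
    d_z2 = (y_pred - y) * y_pred * (1.0 - y_pred)   # dL/dz2
    d_W2 = h.T @ d_z2
    d_z1 = (d_z2 @ W2.T) * h * (1.0 - h)            # dL/dz1
    d_W1 = x.T @ d_z1
    return loss, d_W1, d_W2

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
W1 = rng.uniform(-1, 1, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.uniform(-1, 1, (4, 1))
b2 = np.zeros((1, 1))

_, d_W1, _ = loss_and_grads(x, y, W1, b1, W2, b2)

# Central-difference estimate of dL/dW1, one entry at a time
eps = 1e-6
numeric_W1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        bump = np.zeros_like(W1)
        bump[i, j] = eps
        loss_plus = loss_and_grads(x, y, W1 + bump, b1, W2, b2)[0]
        loss_minus = loss_and_grads(x, y, W1 - bump, b1, W2, b2)[0]
        numeric_W1[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric_W1 - d_W1))
print(f"max |numeric - analytic| for W1: {max_abs_diff:.2e}")
```

A difference far above floating-point noise usually points to a transposed matrix, a missing activation derivative, or a gradient taken with respect to the wrong variable.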
7. Interactive Visualizer: Neural Flow
Visualize the flow of data (Forward) and Gradients (Backward).
- Blue Pulses: Forward values propagating to the output.
- Red Pulses: Backward gradients updating the weights.
- Weights: Represented by line thickness. Watch them change as you train!
8. Summary
- Forward: Compute prediction by passing data through layers.
- Backward: Compute gradients by passing error backwards using the Chain Rule.
- Vanishing Gradients: Gradients are multiplied by a local derivative at each layer; for Sigmoid that factor is at most 0.25, so the signal can shrink toward zero by the time it reaches the early layers.
- Solution: Use ReLU (derivative is 1 for positive inputs) and proper initialization (He/Xavier) to keep gradients healthy.