Case Study: Backpropagation from Scratch
[!NOTE] This module builds backpropagation from first principles: deriving the gradients by hand, implementing them in NumPy, and examining the numerical pitfalls (vanishing gradients, dead neurons, poor initialization) that appear in practice.
1. Introduction: The Algorithm That Runs the World
GPT-4, Stable Diffusion, AlphaGo—they all train using one algorithm: Backpropagation. It is simply the Chain Rule applied to the Computational Graph of a Neural Network.
But implementing it by hand reveals hidden dangers:
- Vanishing Gradients: Why deep networks couldn’t be trained before 2010.
- Dead Neurons: Why ReLU is better than Sigmoid.
- Initialization: Why random weights matter.
2. The Network Architecture
Consider a tiny network with 1 input, 1 hidden neuron, and 1 output.
- Input: x
- Pre-activation: z_1 = w_1 x + b_1
- Hidden Layer: h = \sigma(z_1)
- Output Layer: \hat{y} = w_2 h + b_2
- Loss: L = \frac{1}{2}(\hat{y} - y)^2
We want to find \frac{\partial L}{\partial w_1}.
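A minimal numeric sketch of this forward pass, with concrete values chosen purely for illustration (they are assumptions, not given in the text):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values: input x = 1.0, target y = 0.5
x, y = 1.0, 0.5
w1, b1 = 0.6, 0.1   # hidden-layer parameters
w2, b2 = 0.8, -0.2  # output-layer parameters

z1 = w1 * x + b1               # pre-activation
h = sigmoid(z1)                # hidden activation
y_hat = w2 * h + b2            # linear output
loss = 0.5 * (y_hat - y) ** 2  # squared-error loss
print(f"h = {h:.4f}, y_hat = {y_hat:.4f}, loss = {loss:.4f}")
```

Each intermediate value (z_1, h, \hat{y}) is kept because the backward pass will reuse it.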
3. The Derivation (Chain Rule)
To find how the Loss L changes with respect to w_1, we chain derivatives backwards from the output:
- Loss Gradient: \frac{\partial L}{\partial \hat{y}} = (\hat{y} - y)
- Output Weight: \frac{\partial \hat{y}}{\partial h} = w_2
- Activation Gradient: \frac{\partial h}{\partial z_1} = \sigma'(z_1) = \sigma(z_1)(1-\sigma(z_1))
- Input: \frac{\partial z_1}{\partial w_1} = x
Multiplying them together:
\frac{\partial L}{\partial w_1} = (\hat{y} - y) \cdot w_2 \cdot \sigma'(z_1) \cdot x
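The chain can be verified numerically: the analytic gradient should match a central finite-difference estimate. A minimal sketch, using illustrative parameter values that are assumptions, not taken from the text:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values (assumptions for this check)
x, y = 1.0, 0.5
w1, b1, w2, b2 = 0.6, 0.1, 0.8, -0.2

def loss(w1):
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y_hat = w2 * h + b2
    return 0.5 * (y_hat - y) ** 2

# Analytic gradient: dL/dw1 = (y_hat - y) * w2 * sigma'(z1) * x
z1 = w1 * x + b1
h = sigmoid(z1)
y_hat = w2 * h + b2
analytic = (y_hat - y) * w2 * h * (1 - h) * x

# Central finite-difference estimate
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
print(analytic, numeric)  # should agree to roughly 6 decimal places
```

This "gradient check" is the standard sanity test for a hand-written backward pass: if the two numbers diverge, the derivation or the code is wrong.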
4. The Vanishing Gradient Problem
Notice the term \sigma'(z_1). The maximum derivative of the Sigmoid function is 0.25 (attained at z = 0). If you have a network with 10 layers, the gradient at the first layer is multiplied by up to 0.25^{10} \approx 9.5 \times 10^{-7}.
The gradient vanishes, and the first layers stop learning. Solution: use ReLU (whose derivative is 1 for positive inputs) and He Initialization.
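The shrinkage is easy to see by multiplying the per-layer derivative factors directly. A sketch comparing sigmoid's best case (every pre-activation at z = 0, where \sigma' = 0.25) with ReLU on active units:

```python
depth = 10

# Sigmoid best case: sigma'(0) = 0.25 at every layer
sigmoid_factor = 0.25 ** depth

# ReLU on active units: derivative is exactly 1 at every layer
relu_factor = 1.0 ** depth

print(f"sigmoid: {sigmoid_factor:.2e}")  # about 9.5e-07
print(f"relu:    {relu_factor:.2e}")     # 1.0 -- no shrinkage
```

In a real network the sigmoid factors are usually smaller than 0.25, so this is an upper bound on how much gradient survives.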
5. Python: Backprop from Scratch
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Data: the XOR problem
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # XOR inputs
y = np.array([[0], [1], [1], [0]])              # XOR targets

# Architecture
input_size = 2
hidden_size = 4
output_size = 1

# Weights
np.random.seed(42)
W1 = np.random.uniform(-1, 1, (input_size, hidden_size))
b1 = np.zeros((1, hidden_size))
W2 = np.random.uniform(-1, 1, (hidden_size, output_size))
b2 = np.zeros((1, output_size))

lr = 0.5

for epoch in range(5000):
    # Forward
    z1 = np.dot(x, W1) + b1
    h = sigmoid(z1)
    z2 = np.dot(h, W2) + b2
    y_pred = sigmoid(z2)  # sigmoid on the output keeps predictions in (0, 1)

    # Loss (MSE)
    loss = 0.5 * np.sum((y_pred - y) ** 2)

    # Backward
    d_z2 = (y_pred - y) * sigmoid_derivative(z2)  # dL/dz2

    # Gradients for the output layer
    d_W2 = np.dot(h.T, d_z2)
    d_b2 = np.sum(d_z2, axis=0, keepdims=True)

    # Gradients for the hidden layer
    d_h = np.dot(d_z2, W2.T)
    d_z1 = d_h * sigmoid_derivative(z1)
    d_W1 = np.dot(x.T, d_z1)
    d_b1 = np.sum(d_z1, axis=0, keepdims=True)

    # Update (biases must be updated too, not just the weights)
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
```
6. Interactive Visualizer: Neural Flow
Visualize the flow of data (Forward) and Gradients (Backward).
- Blue Pulses: Forward values propagating to the output.
- Red Pulses: Backward gradients updating the weights.
- Weights: Represented by line thickness. Watch them change as you train!
7. Summary
- Forward: Compute prediction by passing data through layers.
- Backward: Compute gradients by passing error backwards using the Chain Rule.
- Vanishing Gradients: Gradients get multiplied by small numbers (derivatives < 1) at each layer, potentially becoming zero at early layers.
- Solution: Use ReLU (derivative is 1 for positive inputs) and proper initialization (He/Xavier) to keep gradients healthy.
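The two remedies named above can be sketched in a few lines. The layer sizes and batch size below are arbitrary assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # 1 where the unit is active, 0 otherwise
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128  # arbitrary layer sizes

# He initialization: std = sqrt(2 / fan_in), designed for ReLU layers
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

# Activation variance stays roughly stable through the layer,
# instead of shrinking layer after layer
x = rng.standard_normal((64, fan_in))
h = relu(x @ W)
print(h.var())  # roughly on the order of 1
```

The sqrt(2 / fan_in) scale compensates for ReLU zeroing out about half of its inputs, which is what keeps gradients from shrinking as depth grows.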