Case Study: Backpropagation from Scratch

1. Introduction: The Algorithm That Runs the World

GPT-4, Stable Diffusion, AlphaGo—they all train using one algorithm: Backpropagation. It is simply the Chain Rule applied to the Computational Graph of a Neural Network.

But implementing it by hand reveals hidden dangers:

  1. Vanishing Gradients: Why deep networks were nearly untrainable before ~2010.
  2. Dead Neurons: Why the choice of activation (ReLU vs. Sigmoid) matters.
  3. Initialization: Why the scale of the random starting weights matters.

2. The Network Architecture

Consider a tiny network with 1 input, 1 hidden neuron, and 1 output.

  • Input: x
  • Hidden Layer: h = \sigma(z_1), where z_1 = w_1 x + b_1
  • Output Layer: \hat{y} = w_2 h + b_2
  • Loss: L = \frac{1}{2}(\hat{y} - y)^2

We want to find \frac{\partial L}{\partial w_1}.
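Before deriving the gradient, the forward pass itself can be sketched directly. This is a minimal sketch with arbitrary example values for the parameters (w1, b1, w2, b2 and the training pair x, y are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example parameters (chosen for illustration only)
w1, b1 = 0.5, 0.1    # hidden-layer weight and bias
w2, b2 = -0.3, 0.2   # output-layer weight and bias
x, y = 1.0, 0.0      # one training example

z1 = w1 * x + b1              # hidden pre-activation
h = sigmoid(z1)               # hidden activation
y_hat = w2 * h + b2           # linear output
loss = 0.5 * (y_hat - y) ** 2 # squared-error loss
print(loss)
```

Every quantity in the derivation below (z_1, h, \hat{y}, L) corresponds to one intermediate variable here; backpropagation reuses exactly these values on the way back.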


3. The Derivation (Chain Rule)

To find how the Loss L changes with respect to w1, we chain derivatives backwards from the output:

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}

  1. Loss Gradient: \frac{\partial L}{\partial \hat{y}} = (\hat{y} - y)
  2. Output Weight: \frac{\partial \hat{y}}{\partial h} = w_2
  3. Activation Gradient: \frac{\partial h}{\partial z_1} = \sigma'(z_1) = \sigma(z_1)(1 - \sigma(z_1))
  4. Input: \frac{\partial z_1}{\partial w_1} = x

Multiplying them together:

\frac{\partial L}{\partial w_1} = (\hat{y} - y) \cdot w_2 \cdot \sigma'(z_1) \cdot x
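The four-factor product can be verified numerically with a finite-difference check, a standard way to test a hand-derived gradient. This is a sketch using the same arbitrary example values as above (all parameter values are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, x=1.0, y=0.0, b1=0.1, w2=-0.3, b2=0.2):
    """Loss of the tiny network as a function of w1 alone."""
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return 0.5 * (y_hat - y) ** 2

w1, x, y, b1, w2, b2 = 0.5, 1.0, 0.0, 0.1, -0.3, 0.2

# Analytic gradient: (y_hat - y) * w2 * sigma'(z1) * x
z1 = w1 * x + b1
h = sigmoid(z1)
y_hat = w2 * h + b2
grad = (y_hat - y) * w2 * h * (1 - h) * x  # h*(1-h) is sigma'(z1)

# Central finite difference: (L(w1+eps) - L(w1-eps)) / (2*eps)
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(grad - numeric) < 1e-8
```

The two estimates agree to within floating-point noise, which is strong evidence the chain-rule product above is correct.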

4. The Vanishing Gradient Problem

Notice the term \sigma'(z_1). The maximum derivative of the Sigmoid function is 0.25. In a 10-layer network, the gradient reaching the first layer picks up one such factor per layer: at best 0.25^{10} \approx 0.00000095.

The gradient vanishes, and the earliest layers effectively stop learning. Solution: use ReLU (whose derivative is exactly 1 for positive inputs) together with He Initialization.
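The shrinkage is easy to see by multiplying out the per-layer factors. A minimal sketch, assuming the best case for Sigmoid (every pre-activation at 0, where \sigma'(0) = 0.25) and positive inputs for ReLU:

```python
# Per-layer gradient factor in the best case for each activation
sigmoid_factor = 0.25  # max of sigma'(z), attained at z = 0
relu_factor = 1.0      # ReLU derivative for positive inputs

depth = 10
print(f"sigmoid: {sigmoid_factor ** depth:.2e}")  # → sigmoid: 9.54e-07
print(f"relu:    {relu_factor ** depth:.2e}")     # → relu:    1.00e+00
```

In practice the sigmoid case is even worse: pre-activations are rarely exactly 0, so most factors are well below 0.25.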


5. Python: Backprop from Scratch

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Data
x = np.array([[0,0], [0,1], [1,0], [1,1]]) # XOR inputs
y = np.array([[0], [1], [1], [0]]) # XOR target

# Initialization
input_size = 2
hidden_size = 4
output_size = 1

# Weights
np.random.seed(42)
W1 = np.random.uniform(-1, 1, (input_size, hidden_size))
b1 = np.zeros((1, hidden_size))
W2 = np.random.uniform(-1, 1, (hidden_size, output_size))
b2 = np.zeros((1, output_size))

lr = 0.5

for epoch in range(5000):
    # Forward
    z1 = np.dot(x, W1) + b1
    h = sigmoid(z1)
    z2 = np.dot(h, W2) + b2
    y_pred = sigmoid(z2) # Use sigmoid for output too for 0-1 range

    # Loss (MSE)
    loss = 0.5 * np.sum((y_pred - y)**2)

    # Backward
    # dL/dz2 = (y_pred - y) * sigma'(z2)
    d_z2 = (y_pred - y) * sigmoid_derivative(z2)

    # Gradients for the output layer
    d_W2 = np.dot(h.T, d_z2)
    d_b2 = np.sum(d_z2, axis=0, keepdims=True)

    # Propagate the error back through W2 and the hidden activation
    d_h = np.dot(d_z2, W2.T)
    d_z1 = d_h * sigmoid_derivative(z1)

    # Gradients for the hidden layer
    d_W1 = np.dot(x.T, d_z1)
    d_b1 = np.sum(d_z1, axis=0, keepdims=True)

    # Gradient descent update (biases included)
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
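A useful habit when writing backprop by hand is to gradient-check the vectorized backward pass: compare one analytic gradient entry against a finite difference of the loss. This is a sketch using the same shapes as the network above (the seed and the choice of entry W1[0, 0] are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)
x = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = np.random.uniform(-1, 1, (2, 4))
b1 = np.zeros((1, 4))
W2 = np.random.uniform(-1, 1, (4, 1))
b2 = np.zeros((1, 1))

def forward_loss():
    """Run the forward pass; return loss and intermediates."""
    h = sigmoid(x @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    return 0.5 * np.sum((y_pred - y) ** 2), h, y_pred

# Analytic gradient of the loss w.r.t. W1 (same math as the training loop)
_, h, y_pred = forward_loss()
d_z2 = (y_pred - y) * y_pred * (1 - y_pred)   # sigma'(z2) = y_pred*(1-y_pred)
d_z1 = (d_z2 @ W2.T) * h * (1 - h)
d_W1 = x.T @ d_z1

# Numerical gradient for the single entry W1[0, 0]
eps = 1e-6
W1[0, 0] += eps
loss_plus, _, _ = forward_loss()
W1[0, 0] -= 2 * eps
loss_minus, _, _ = forward_loss()
W1[0, 0] += eps  # restore
numeric = (loss_plus - loss_minus) / (2 * eps)
assert abs(d_W1[0, 0] - numeric) < 1e-7
```

If this assertion fails, the bug is almost always in the backward pass, not the forward pass, since the numerical estimate only depends on the forward computation.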

6. Interactive Visualizer: Neural Flow

Visualize the flow of data (Forward) and Gradients (Backward).

  • Blue Pulses: Forward values propagating to the output.
  • Red Pulses: Backward gradients updating the weights.
  • Weights: Represented by line thickness. Watch them change as you train!

7. Summary

  • Forward: Compute prediction by passing data through layers.
  • Backward: Compute gradients by passing error backwards using the Chain Rule.
  • Vanishing Gradients: Gradients get multiplied by small numbers (derivatives < 1) at each layer, potentially becoming zero at early layers.
  • Solution: Use ReLU (derivative is 1) and proper initialization (He/Xavier) to keep gradients healthy.