Case Study: Backpropagation from Scratch

1. Introduction: The Algorithm That Runs the World

GPT-4, Stable Diffusion, AlphaGo—they all train using one algorithm: Backpropagation. It is simply the Chain Rule applied to the Computational Graph of a Neural Network.

But implementing it by hand reveals hidden dangers:

  1. Vanishing Gradients: Why deep networks were nearly untrainable before ~2010.
  2. Dead Neurons: Why the choice of activation (ReLU vs. Sigmoid) matters.
  3. Initialization: Why the scale of the random starting weights matters.

2. The Network Architecture

Consider a tiny network with 1 input, 1 hidden neuron, and 1 output.

  • Input: x
  • Hidden Layer: h = \sigma(z_1), where z_1 = w_1 x + b_1
  • Output Layer: \hat{y} = w_2 h + b_2
  • Loss: L = \frac{1}{2}(\hat{y} - y)^2

We want to find \frac{\partial L}{\partial w_1}.
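Before deriving the gradient, the forward pass itself can be sketched directly. This is a minimal sketch with arbitrary example values for the parameters (w1, b1, w2, b2 and the training pair x, y are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example parameters (chosen for illustration only)
w1, b1 = 0.5, 0.1    # hidden-layer weight and bias
w2, b2 = -0.3, 0.2   # output-layer weight and bias
x, y = 1.0, 0.0      # one training example

z1 = w1 * x + b1              # hidden pre-activation
h = sigmoid(z1)               # hidden activation
y_hat = w2 * h + b2           # linear output
loss = 0.5 * (y_hat - y) ** 2 # squared-error loss
print(loss)
```

Every quantity in the derivation below (z_1, h, \hat{y}, L) corresponds to one intermediate variable here; backpropagation reuses exactly these values on the way back.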


3. The Derivation (Chain Rule)

To find how the Loss L changes with respect to w1, we chain derivatives backwards from the output:

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}

  1. Loss Gradient: \frac{\partial L}{\partial \hat{y}} = (\hat{y} - y)
  2. Output Weight: \frac{\partial \hat{y}}{\partial h} = w_2
  3. Activation Gradient: \frac{\partial h}{\partial z_1} = \sigma'(z_1) = \sigma(z_1)(1 - \sigma(z_1))
  4. Input: \frac{\partial z_1}{\partial w_1} = x

Multiplying them together:

\frac{\partial L}{\partial w_1} = (\hat{y} - y) \cdot w_2 \cdot \sigma'(z_1) \cdot x
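The four-factor product can be verified numerically with a finite-difference check, a standard way to test a hand-derived gradient. This is a sketch using the same arbitrary example values as above (all parameter values are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, x=1.0, y=0.0, b1=0.1, w2=-0.3, b2=0.2):
    """Loss of the tiny network as a function of w1 alone."""
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return 0.5 * (y_hat - y) ** 2

w1, x, y, b1, w2, b2 = 0.5, 1.0, 0.0, 0.1, -0.3, 0.2

# Analytic gradient: (y_hat - y) * w2 * sigma'(z1) * x
z1 = w1 * x + b1
h = sigmoid(z1)
y_hat = w2 * h + b2
grad = (y_hat - y) * w2 * h * (1 - h) * x  # h*(1-h) is sigma'(z1)

# Central finite difference: (L(w1+eps) - L(w1-eps)) / (2*eps)
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(grad - numeric) < 1e-8
```

The two estimates agree to within floating-point noise, which is strong evidence the chain-rule product above is correct.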

4. The Vanishing Gradient Problem

Notice the term \sigma'(z_1). The maximum derivative of the Sigmoid function is 0.25. In a 10-layer network, the gradient reaching the first layer picks up one such factor per layer: at best 0.25^{10} \approx 0.00000095.

The gradient vanishes, and the earliest layers effectively stop learning. Solution: use ReLU (whose derivative is exactly 1 for positive inputs) together with He Initialization.
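The shrinkage is easy to see by multiplying out the per-layer factors. A minimal sketch, assuming the best case for Sigmoid (every pre-activation at 0, where \sigma'(0) = 0.25) and positive inputs for ReLU:

```python
# Per-layer gradient factor in the best case for each activation
sigmoid_factor = 0.25  # max of sigma'(z), attained at z = 0
relu_factor = 1.0      # ReLU derivative for positive inputs

depth = 10
print(f"sigmoid: {sigmoid_factor ** depth:.2e}")  # → sigmoid: 9.54e-07
print(f"relu:    {relu_factor ** depth:.2e}")     # → relu:    1.00e+00
```

In practice the sigmoid case is even worse: pre-activations are rarely exactly 0, so most factors are well below 0.25.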


5. Python: Backprop from Scratch

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Data
x = np.array([[0,0], [0,1], [1,0], [1,1]]) # XOR inputs
y = np.array([[0], [1], [1], [0]]) # XOR target

# Initialization
input_size = 2
hidden_size = 4
output_size = 1

# Weights
np.random.seed(42)
W1 = np.random.uniform(-1, 1, (input_size, hidden_size))
b1 = np.zeros((1, hidden_size))
W2 = np.random.uniform(-1, 1, (hidden_size, output_size))
b2 = np.zeros((1, output_size))

lr = 0.5

for epoch in range(5000):
    # Forward
    z1 = np.dot(x, W1) + b1
    h = sigmoid(z1)
    z2 = np.dot(h, W2) + b2
    y_pred = sigmoid(z2) # Use sigmoid for output too for 0-1 range

    # Loss (MSE)
    loss = 0.5 * np.sum((y_pred - y)**2)

    # Backward
    # dL/dz2 = (y_pred - y) * sigma'(z2)
    d_z2 = (y_pred - y) * sigmoid_derivative(z2)

    # Gradients for the output layer
    d_W2 = np.dot(h.T, d_z2)
    d_b2 = np.sum(d_z2, axis=0, keepdims=True)

    # Propagate the error back through W2 and the hidden activation
    d_h = np.dot(d_z2, W2.T)
    d_z1 = d_h * sigmoid_derivative(z1)

    # Gradients for the hidden layer
    d_W1 = np.dot(x.T, d_z1)
    d_b1 = np.sum(d_z1, axis=0, keepdims=True)

    # Gradient descent update (biases included)
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
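A useful habit when writing backprop by hand is to gradient-check the vectorized backward pass: compare one analytic gradient entry against a finite difference of the loss. This is a sketch using the same shapes as the network above (the seed and the choice of entry W1[0, 0] are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)
x = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = np.random.uniform(-1, 1, (2, 4))
b1 = np.zeros((1, 4))
W2 = np.random.uniform(-1, 1, (4, 1))
b2 = np.zeros((1, 1))

def forward_loss():
    """Run the forward pass; return loss and intermediates."""
    h = sigmoid(x @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    return 0.5 * np.sum((y_pred - y) ** 2), h, y_pred

# Analytic gradient of the loss w.r.t. W1 (same math as the training loop)
_, h, y_pred = forward_loss()
d_z2 = (y_pred - y) * y_pred * (1 - y_pred)   # sigma'(z2) = y_pred*(1-y_pred)
d_z1 = (d_z2 @ W2.T) * h * (1 - h)
d_W1 = x.T @ d_z1

# Numerical gradient for the single entry W1[0, 0]
eps = 1e-6
W1[0, 0] += eps
loss_plus, _, _ = forward_loss()
W1[0, 0] -= 2 * eps
loss_minus, _, _ = forward_loss()
W1[0, 0] += eps  # restore
numeric = (loss_plus - loss_minus) / (2 * eps)
assert abs(d_W1[0, 0] - numeric) < 1e-7
```

If this assertion fails, the bug is almost always in the backward pass, not the forward pass, since the numerical estimate only depends on the forward computation.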

6. Interactive Visualizer: Neural Flow

Visualize the flow of data (Forward) and Gradients (Backward).

  • Blue Pulses: Forward values propagating to the output.
  • Red Pulses: Backward gradients updating the weights.
  • Weights: Represented by line thickness. Watch them change as you train!

7. Summary

  • Forward: Compute prediction by passing data through layers.
  • Backward: Compute gradients by passing error backwards using the Chain Rule.
  • Vanishing Gradients: Gradients get multiplied by small numbers (derivatives < 1) at each layer, potentially becoming zero at early layers.
  • Solution: Use ReLU (derivative is 1) and proper initialization (He/Xavier) to keep gradients healthy.