Case Study: Backpropagation from Scratch

1. Introduction: The Algorithm That Runs the World

GPT-4, Stable Diffusion, and AlphaGo are all trained with gradients computed by one algorithm: Backpropagation. It is the Chain Rule applied systematically to the Computational Graph of a Neural Network.

But implementing it by hand reveals hidden dangers:

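Vanishing Gradients: Why deep sigmoid networks were so hard to train before roughly 2010.
Saturated and Dead Neurons: Why ReLU usually trains better than Sigmoid, even though ReLU units can themselves go dead.
Initialization: Why the scale of the random initial weights matters.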

2. Intuition: The Corporate “Blame Game”

Before diving into the calculus, let’s build a mental model. Think of a deep neural network as a multi-tiered corporation trying to produce a perfect product.

Forward Pass (Production): Raw data (Inputs) enters the factory floor. Workers (Hidden Layer 1) process it and pass it to Managers (Hidden Layer 2), who hand the final product to the CEO (Output Layer).
Loss Function (Quality Control): The CEO evaluates the final product against the market target. “This is terrible! The error (Loss) is huge!”
Backpropagation (Assigning Blame): The CEO doesn’t just yell into the void. They yell at the Managers, calculating exactly: “How much of this error is your fault?” (The derivative of Loss with respect to the Output). The Managers then turn to the Workers and say: “We only assembled what you gave us! How much of this is YOUR fault?” This is the Chain Rule: multiplying the manager’s inherited blame by the worker’s direct contribution to the manager’s mistake.
Gradient Descent (Course Correction): Based on the exact proportion of blame received, everyone adjusts their internal processes (Weights) by a small margin (Learning Rate) so the next iteration is slightly better.

This recursive passing of “blame” proportional to contribution is the mathematical essence of backpropagation.
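The "course correction" step can be sketched with a single worker and a single weight. This is a minimal illustration, assuming a made-up one-weight model `y_hat = w * x` and arbitrary values for `x`, `y`, and `lr`; none of these numbers come from the lesson.

```python
# Hypothetical one-weight "company": prediction y_hat = w * x, loss = 0.5 * (y_hat - y)^2
x, y = 2.0, 4.0   # illustrative input and target (the ideal weight is 2.0)
w = 0.0           # initial weight
lr = 0.1          # learning rate: how strongly blame changes the weight

for step in range(20):
    y_hat = w * x              # forward pass (production)
    blame = (y_hat - y) * x    # dL/dw: error times this weight's contribution
    w -= lr * blame            # adjust in proportion to the blame received

print(w)  # moves from 0.0 toward 2.0
```

Each iteration shrinks the remaining error by a constant factor here, which is why a small learning rate still converges after a handful of steps.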

3. The Network Architecture

Consider a tiny network with 1 input, 1 hidden neuron, and 1 output.

Input: $x$
Hidden Layer: $z_1 = w_1 x + b_1$, $h = \sigma(z_1)$
Output Layer: $\hat{y} = w_2 h + b_2$
Loss: $L = \frac{1}{2}(\hat{y} - y)^2$

We want to find $\frac{\partial L}{\partial w_1}$.
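The architecture above translates almost line for line into code. The following is a small sketch in plain Python; the function name `forward` and the specific parameter values are assumptions chosen for illustration, not part of the lesson.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, y, w1, b1, w2, b2):
    """Forward pass of the 1-input, 1-hidden-neuron, 1-output network."""
    z1 = w1 * x + b1              # hidden pre-activation
    h = sigmoid(z1)               # hidden activation
    y_hat = w2 * h + b2           # linear output layer
    loss = 0.5 * (y_hat - y) ** 2
    return z1, h, y_hat, loss

# Illustrative values: with w1 = 0 the hidden neuron sits at sigmoid(0) = 0.5
z1, h, y_hat, loss = forward(x=1.0, y=1.0, w1=0.0, b1=0.0, w2=1.0, b2=0.0)
print(z1, h, y_hat, loss)  # 0.0 0.5 0.5 0.125
```

Returning the intermediate values `z1` and `h` is deliberate: the backward pass reuses them instead of recomputing them.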

4. The Derivation (Chain Rule)

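To find how the Loss $L$ changes with respect to $w_1$, we chain derivatives backwards from the output, passing through the hidden pre-activation $z_1 = w_1 x + b_1$:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$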

Loss Gradient: $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$
Output Weight: $\frac{\partial \hat{y}}{\partial h} = w_2$
Activation Gradient: $\frac{\partial h}{\partial z_1} = \sigma'(z_1) = \sigma(z_1)(1 - \sigma(z_1))$
Input: $\frac{\partial z_1}{\partial w_1} = x$

Multiplying them together:

$$\frac{\partial L}{\partial w_1} = (\hat{y} - y) \cdot w_2 \cdot \sigma'(z_1) \cdot x$$
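A hand-derived gradient is easy to get subtly wrong, so it helps to compare it against a finite-difference estimate. The sketch below evaluates the four chain-rule factors separately, multiplies them, and checks the product numerically. The helper names (`loss_fn`, `grad_w1`) and all parameter values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w1, x, y, b1, w2, b2):
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return 0.5 * (y_hat - y) ** 2

def grad_w1(w1, x, y, b1, w2, b2):
    """Analytic dL/dw1 as the product of the four chain-rule factors."""
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y_hat = w2 * h + b2
    dL_dyhat = y_hat - y        # loss gradient
    dyhat_dh = w2               # output weight
    dh_dz1 = h * (1.0 - h)      # sigmoid derivative at z1
    dz1_dw1 = x                 # input
    return dL_dyhat * dyhat_dh * dh_dz1 * dz1_dw1

# Arbitrary illustrative values
x, y = 2.0, 1.0
w1, b1, w2, b2 = 0.5, -0.3, 1.5, 0.1

analytic = grad_w1(w1, x, y, b1, w2, b2)

# Central finite difference: (L(w1 + eps) - L(w1 - eps)) / (2 * eps)
eps = 1e-6
numeric = (loss_fn(w1 + eps, x, y, b1, w2, b2)
           - loss_fn(w1 - eps, x, y, b1, w2, b2)) / (2 * eps)

print(analytic, numeric)  # the two values should agree to many decimal places
```

If the two numbers disagree noticeably, the usual culprits are a missing factor or a derivative evaluated at the wrong point.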

5. The Vanishing Gradient Problem

Notice the term $\sigma'(z_1)$. The maximum derivative of the Sigmoid function is 0.25, reached at $z = 0$. In a network with 10 sigmoid layers, the gradient reaching the first layer contains ten such factors, so, ignoring the weights, it is scaled by at most $0.25^{10} \approx 9.5 \times 10^{-7}$.

The gradient vanishes, meaning the “blame signal” becomes extremely small before it reaches the early layers. The first layers effectively stop learning.
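The arithmetic is quick to reproduce. The sketch below multiplies ten sigmoid derivatives together, first in the best case (every pre-activation exactly 0) and then for a mildly saturated case. It assumes all weights equal 1 so that only the activation factors matter, and the pre-activation value 2.0 is an arbitrary illustrative choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

depth = 10

# Best case: every layer sits at z = 0, where the derivative peaks at 0.25
best_case = sigmoid_derivative(0.0) ** depth
print(f"{best_case:.3e}")  # 9.537e-07

# Mildly saturated case: every layer sits at z = 2.0 (illustrative value)
saturated = sigmoid_derivative(2.0) ** depth
print(f"{saturated:.3e}")  # several orders of magnitude smaller still
```

Even the best case loses about six orders of magnitude over ten layers, and realistic pre-activations are rarely all at zero.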

War Story: The AI Winter
In the 1990s and early 2000s, researchers hit a wall. They expected that adding more layers (Deep Learning) would let networks learn more complex features, but training anything much beyond 2 or 3 layers often failed. This contributed to a period of reduced funding and interest in neural networks. It wasn’t until the widespread adoption of the ReLU activation function (whose gradient is exactly 1 for positive inputs, so the signal does not shrink along active paths) and better initialization techniques (Xavier initialization in 2010, He initialization in 2015) that deep networks became reliably trainable, helping to trigger the modern AI boom.

Solution: Use ReLU (gradient is 1 for positive inputs, 0 for negative inputs) and proper weight initialization to preserve the variance of gradients across layers.
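Both parts of that remedy fit in a few lines of NumPy. This is a sketch under stated assumptions: the function names `relu`, `relu_derivative`, and `he_init` are illustrative, the layer sizes are arbitrary, and the He scale is taken to be a normal distribution with standard deviation sqrt(2 / fan_in).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # 1 where the unit is active, 0 where it is not
    return (z > 0).astype(float)

def he_init(fan_in, fan_out, rng):
    # He initialization: zero-mean normal with std sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_derivative(z))  # [0. 0. 1. 1.]

rng = np.random.default_rng(0)
W = he_init(256, 128, rng)  # illustrative layer sizes
print(W.std())              # close to sqrt(2 / 256), about 0.088
```

The zeros in the derivative are the flip side of ReLU: a unit whose pre-activation stays negative for every input receives no gradient at all, which is the "dead neuron" failure mode.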

6. Python: Backprop from Scratch

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Data: the four XOR input pairs and their targets
x = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

# Initialization
input_size = 2
hidden_size = 4
output_size = 1

# Weights
np.random.seed(42)
W1 = np.random.uniform(-1, 1, (input_size, hidden_size))
b1 = np.zeros((1, hidden_size))
W2 = np.random.uniform(-1, 1, (hidden_size, output_size))
b2 = np.zeros((1, output_size))

lr = 0.5

for epoch in range(5000):
    # Forward
    z1 = np.dot(x, W1) + b1
    h = sigmoid(z1)
    z2 = np.dot(h, W2) + b2
    # Unlike the derivation above, the output also passes through a sigmoid
    # to keep predictions in the (0, 1) range
    y_pred = sigmoid(z2)

    # Loss (half the sum of squared errors over the four samples)
    loss = 0.5 * np.sum((y_pred - y)**2)

    # Backward: dL/dz2 includes the output sigmoid's derivative
    d_loss_z2 = (y_pred - y) * sigmoid_derivative(z2)

    # Gradients for W2 and b2
    d_W2 = np.dot(h.T, d_loss_z2)
    d_b2 = np.sum(d_loss_z2, axis=0, keepdims=True)

    # Gradient for Hidden Layer
    d_loss_h = np.dot(d_loss_z2, W2.T)
    d_loss_z1 = d_loss_h * sigmoid_derivative(z1)

    # Gradients for W1 and b1
    d_W1 = np.dot(x.T, d_loss_z1)
    d_b1 = np.sum(d_loss_z1, axis=0, keepdims=True)

    # Update (biases included; otherwise they would stay at zero forever)
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2

    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
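Before trusting a training loop like this, it is common to verify the vectorized gradients against finite differences. The sketch below does that for a network of the same shape (2 inputs, 4 sigmoid hidden units, 1 sigmoid output) on the XOR data. The helper names `loss_only` and `loss_and_grads`, the random seed, and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_only(x, y, W1, b1, W2, b2):
    h = sigmoid(x @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    return 0.5 * np.sum((y_pred - y) ** 2)

def loss_and_grads(x, y, W1, b1, W2, b2):
    # Forward pass
    h = sigmoid(x @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    loss = 0.5 * np.sum((y_pred - y) ** 2)
    # Backward pass, using sigmoid'(z) = s * (1 - s)
    d_z2 = (y_pred - y) * y_pred * (1 - y_pred)
    d_W2 = h.T @ d_z2
    d_z1 = (d_z2 @ W2.T) * h * (1 - h)
    d_W1 = x.T @ d_z1
    return loss, d_W1, d_W2

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.uniform(-1, 1, (4, 1))
b2 = np.zeros((1, 1))

_, d_W1, _ = loss_and_grads(x, y, W1, b1, W2, b2)

# Perturb each entry of W1 in turn and estimate the slope of the loss
eps = 1e-5
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss_only(x, y, W_plus, b1, W2, b2)
                         - loss_only(x, y, W_minus, b1, W2, b2)) / (2 * eps)

max_diff = np.max(np.abs(numeric - d_W1))
print(max_diff)  # should be tiny if the backward pass is correct
```

The check is far too slow to run during training, since it needs two forward passes per weight, but it is a cheap one-off safeguard when writing a backward pass by hand.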

7. Interactive Visualizer: Neural Flow

Visualize the flow of data (Forward) and Gradients (Backward).

Blue Pulses: Forward values propagating to the output.
Red Pulses: Backward gradients updating the weights.
Weights: Represented by line thickness. Watch them change as you train!


8. Summary

Forward: Compute prediction by passing data through layers.
Backward: Compute gradients by passing error backwards using the Chain Rule.
Vanishing Gradients: With sigmoid activations, gradients get multiplied by a small number (a derivative of at most 0.25) at each layer, shrinking toward zero by the time they reach the early layers.
Solution: Use ReLU (derivative is 1 for positive inputs) and proper initialization (He/Xavier) to keep gradients healthy.
