Module Review: Training Deep Networks

Key Takeaways

  • Backpropagation is the core algorithm for computing gradients. It uses the Chain Rule to assign “blame” (gradients) to each weight in the network.
  • Optimizers use these gradients to update the weights.
  • SGD: Simple updates, but can be slow or get stuck.
  • Momentum: Adds velocity to overcome noise and ravines.
  • Adam: Adapts learning rates for each parameter, generally the best default choice.
  • Batch Normalization stabilizes training by normalizing layer inputs, reducing Internal Covariate Shift, and allowing higher learning rates.
  • Hyperparameters like Learning Rate and Batch Size significantly impact training speed and stability.

Flashcards

Test your understanding of the core concepts.

What is the "Chain Rule" in the context of Backpropagation?
A calculus rule used to compute derivatives of composite functions. In Backprop, it allows us to calculate the gradient of the Loss with respect to any weight by multiplying the local gradients along the path.
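As a concrete illustration (a toy example with made-up numbers, not from the module), the chain rule can be traced by hand on a one-weight "network":

```python
# Tiny composition: z = w*x (linear layer), L = (z - y)^2 (squared-error loss).
x, y, w = 2.0, 3.0, 0.5

z = w * x                # forward pass: z = 1.0
L = (z - y) ** 2         # loss: (1.0 - 3.0)^2 = 4.0

dL_dz = 2 * (z - y)      # local gradient at the loss: -4.0
dz_dw = x                # local gradient of the linear layer: 2.0
dL_dw = dL_dz * dz_dw    # chain rule: multiply local gradients along the path -> -8.0
```

Backpropagation applies exactly this multiplication layer by layer, from the loss back to every weight.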
Why is SGD often better than full-batch Gradient Descent?
SGD updates weights more frequently (after each batch), leading to faster convergence. The noise introduced by mini-batches can also help the model escape shallow local minima.
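A minimal mini-batch SGD loop on a toy linear-regression problem makes the epoch/batch structure concrete (all data and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy linear targets

w = np.zeros(3)
lr, batch_size = 0.1, 10

for epoch in range(5):                         # one epoch = one full pass over the data
    idx = rng.permutation(len(X))              # reshuffle each epoch (the source of SGD's noise)
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / batch_size  # MSE gradient on this mini-batch
        w -= lr * grad                         # one update per mini-batch, not per epoch
```

Note the update happens inside the inner loop: ten updates per epoch here, versus one per epoch for full-batch gradient descent.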
What problem does Batch Normalization solve?
Internal Covariate Shift. It stabilizes the distribution of inputs to each layer, preventing them from shifting wildly as previous layers change during training.
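A sketch of the Batch Normalization forward pass (training mode only; the running statistics used at inference time, and gradient updates for γ and β, are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift."""
    mean = x.mean(axis=0)                       # per-feature mean over the batch
    var = x.var(axis=0)                         # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)     # zero mean, unit variance per feature
    return gamma * x_hat + beta                 # learnable scale/shift restore capacity

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 10))  # badly shifted, badly scaled activations
out = batch_norm(x)
# out now has ~0 mean and ~1 std per feature, whatever the input distribution was
```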
What are the two main components of the Adam optimizer?
Momentum (first moment: a running average of the gradients) and RMSprop (second moment: a running average of the squared gradients). Adam combines them to adapt the learning rate per parameter.
What happens if the learning rate is too high?
The loss may oscillate or diverge (explode), as the optimizer overshoots the minimum.
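This is easy to see on the toy loss L(w) = w², where each gradient-descent step multiplies w by (1 − 2η):

```python
def descend(lr, steps=10, w=1.0):
    """Gradient descent on L(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - lr * 2 * w       # each step multiplies w by (1 - 2*lr)
    return w

print(descend(0.1))   # factor 0.8 per step: shrinks toward the minimum at 0
print(descend(1.5))   # factor -2 per step: overshoots and diverges, |w| -> 1024
```

Any η > 1 (so |1 − 2η| > 1) makes the iterates oscillate with growing magnitude instead of converging.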

Cheat Sheet: Update Rules

Algorithm | Update Rule (Simplified) | Key Feature
SGD | w = w − η ∇L | Basic step in the direction of the negative gradient.
Momentum | v = β v + ∇L;  w = w − η v | Adds inertia to speed up convergence.
RMSprop | s = β s + (1 − β)(∇L)²;  w = w − η ∇L / √s | Divides by the root of a running average of squared gradients to normalize step sizes.
Adam | m = β₁ m + (1 − β₁) ∇L;  v = β₂ v + (1 − β₂)(∇L)²;  w = w − η m_hat / √v_hat | Combines Momentum and RMSprop (m_hat, v_hat are bias-corrected).
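The table's simplified rules transcribe almost directly into NumPy. This is a sketch, not a library implementation; Adam's bias correction (standard in practice, omitted by the simplified table) is included:

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def momentum_step(w, v, g, lr=0.1, beta=0.9):
    v = beta * v + g                      # accumulate velocity
    return w - lr * v, v

def rmsprop_step(w, s, g, lr=0.1, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g**2      # running average of squared gradients
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, m, v, g, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first moment (Momentum part)
    v = b2 * v + (1 - b2) * g**2          # second moment (RMSprop part)
    m_hat = m / (1 - b1**t)               # bias correction for the zero initialization
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize L(w) = w^2 (gradient = 2w) with Adam; w moves toward the minimum at 0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 51):
    w, m, v = adam_step(w, m, v, 2 * w, t)
```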

Quick Revision

  • Vanishing Gradient: Gradients become too small in deep networks (fix: ReLU, BatchNorm, Residual connections).
  • Exploding Gradient: Gradients become too large (fix: Gradient Clipping).
  • Saddle Point: A point where the gradient is zero but which is neither a minimum nor a maximum (Momentum helps escape).
  • Epoch vs Batch: Epoch is one pass over all data; Batch is one step of gradient descent.
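Gradient Clipping from the list above, sketched as clip-by-norm (one common variant; the threshold is arbitrary):

```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm, preserving direction."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g = np.array([30.0, 40.0])            # norm 50: an "exploding" gradient
clipped = clip_by_norm(g, max_norm=5.0)
print(clipped)                        # [3. 4.] -- same direction, norm capped at 5
```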

Next Steps

Now that you know how to train a network, it’s time to learn about specific architectures for different tasks.
