Module Review: Neural Networks

[!NOTE] This module reviews the core principles of neural networks, from the perceptron and activation functions to forward propagation.

1. Key Takeaways

  • Perceptron: The basic unit of a neural network. It performs linear classification but fails on non-linearly separable problems such as XOR.
  • Activation Functions: Essential for introducing non-linearity.
      • ReLU: The standard choice for hidden layers.
      • Sigmoid: Good for probability outputs; poor in hidden layers (vanishing gradient).
      • Softmax: Used for multi-class classification outputs.
  • Universal Approximation: An MLP with at least one hidden layer and a non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy.
  • Forward Propagation: The flow of data from input to output through successive layers of neurons.
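The takeaways above fit together in one small example: a single perceptron cannot compute XOR, but a two-unit ReLU hidden layer can. This is a minimal sketch with illustrative hand-set weights (not learned ones) showing forward propagation through such a network:

```python
# A two-input, two-hidden-unit ReLU network with hand-chosen weights
# that computes XOR -- something a single perceptron cannot do.

def relu(x):
    return max(0.0, x)

def forward(x1, x2):
    # Hidden layer (forward propagation, step 1)
    h1 = relu(x1 + x2 - 0.5)   # fires when at least one input is 1
    h2 = relu(x1 + x2 - 1.5)   # fires only when both inputs are 1
    # Output layer: linear combination, then threshold (step 2)
    y = h1 - 3.0 * h2
    return 1 if y > 0.25 else 0

for a in (0, 1):
    for b in (0, 1):
        print(f"XOR({a}, {b}) = {forward(a, b)}")
```

The hidden layer bends the decision boundary: `h2` cancels out the "both inputs on" case that defeats a single straight-line classifier.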

2. Flashcards

What is the "Vanishing Gradient" problem?
When gradients become extremely small during backpropagation (common with Sigmoid/Tanh), preventing early layers from learning effectively.
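The shrinkage is easy to see numerically: the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) never exceeds 0.25, so each sigmoid layer can scale the backpropagated gradient down by at least 4×. A minimal sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), maximized at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the largest value the derivative can take
# Upper bound on the gradient factor contributed by 10 stacked sigmoid layers:
print(0.25 ** 10)         # below 1e-6 -- early layers barely learn
```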
Why can't a Perceptron solve XOR?
Because XOR is not linearly separable. A single Perceptron can only draw a straight line decision boundary.
What is the purpose of an Activation Function?
To introduce non-linearity into the network, allowing it to learn complex patterns.
What is the output range of Tanh?
(-1, 1). It is zero-centered, unlike Sigmoid which is (0, 1).
What is "Dead ReLU"?
A state where a ReLU neuron only outputs 0 because its weights have updated such that the input is always negative. It stops learning.
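The "Dead ReLU" card can be made concrete by comparing gradients. For a negative pre-activation, ReLU's gradient is exactly zero (no learning signal), while Leaky ReLU keeps a small slope. A minimal sketch:

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def relu_grad(x):
    # Gradient of ReLU w.r.t. its input
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # Gradient of Leaky ReLU w.r.t. its input
    return 1.0 if x > 0 else alpha

x = -2.0  # a neuron whose pre-activation is stuck negative
print(relu_grad(x))        # 0.0  -> no weight updates: the unit is "dead"
print(leaky_relu_grad(x))  # 0.01 -> a small gradient keeps it learning
```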

3. Cheat Sheet

| Concept | Formula / Definition | Key Usage |
| --- | --- | --- |
| Perceptron Output | y = 1 if w·x + b > 0, else 0 | Simple binary classification |
| Sigmoid | σ(x) = 1 / (1 + e⁻ˣ) | Binary probability output |
| Tanh | tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) | Zero-centered hidden layers (legacy) |
| ReLU | max(0, x) | Default for hidden layers |
| Leaky ReLU | max(0.01x, x) | Fixes "Dead ReLU" |
| Softmax | e^zᵢ / Σⱼ e^zⱼ | Multi-class probability output |
| Update Rule | w ← w + α(y − ŷ)x | Perceptron learning |
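Two of the cheat-sheet rows benefit from a direct translation into code: Softmax (with the standard max-subtraction trick for numerical stability, an implementation detail beyond the bare formula) and one step of the perceptron update rule. A minimal sketch:

```python
import math

def softmax(z):
    # e^z_i / sum_j e^z_j; subtracting max(z) avoids overflow
    # without changing the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def perceptron_step(w, b, x, y, lr=0.1):
    # One application of the update rule: w <- w + alpha * (y - y_hat) * x
    y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
    err = y - y_hat
    w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    b = b + lr * err
    return w, b

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # three probabilities summing to 1
print(perceptron_step([0.0, 0.0], 0.0, [1.0, 1.0], 1))
```

Note that if the prediction is already correct, `err` is 0 and the weights are left unchanged, which is why perceptron training converges on linearly separable data.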

4. Next Steps

Now that you understand the architecture, it’s time to learn how to train these deep networks using Gradient Descent and Backpropagation.