Module Review: Neural Networks

[!NOTE] This module reviews the core principles of neural networks, from the perceptron and activation functions to forward propagation.

1. Key Takeaways

  • Perceptron: The basic unit of a neural network. It performs linear classification but fails on non-linearly separable problems such as XOR.
  • Activation Functions: Essential for introducing non-linearity.
      • ReLU: The standard choice for hidden layers.
      • Sigmoid: Good for probability outputs; poor in hidden layers (vanishing gradient).
      • Softmax: Used for multi-class classification outputs.
  • Universal Approximation: An MLP with at least one hidden layer and a non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy.
  • Forward Propagation: The flow of data from input to output through successive layers of neurons.
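The takeaways above fit together in one small example: a single perceptron cannot compute XOR, but a two-unit ReLU hidden layer can. This is a minimal sketch with illustrative hand-set weights (not learned ones) showing forward propagation through such a network:

```python
# A two-input, two-hidden-unit ReLU network with hand-chosen weights
# that computes XOR -- something a single perceptron cannot do.

def relu(x):
    return max(0.0, x)

def forward(x1, x2):
    # Hidden layer (forward propagation, step 1)
    h1 = relu(x1 + x2 - 0.5)   # fires when at least one input is 1
    h2 = relu(x1 + x2 - 1.5)   # fires only when both inputs are 1
    # Output layer: linear combination, then threshold (step 2)
    y = h1 - 3.0 * h2
    return 1 if y > 0.25 else 0

for a in (0, 1):
    for b in (0, 1):
        print(f"XOR({a}, {b}) = {forward(a, b)}")
```

The hidden layer bends the decision boundary: `h2` cancels out the "both inputs on" case that defeats a single straight-line classifier.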

2. Flashcards

What is the "Vanishing Gradient" problem?
When gradients become extremely small during backpropagation (common with Sigmoid/Tanh), preventing early layers from learning effectively.
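The shrinkage is easy to see numerically: the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) never exceeds 0.25, so each sigmoid layer can scale the backpropagated gradient down by at least 4×. A minimal sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), maximized at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the largest value the derivative can take
# Upper bound on the gradient factor contributed by 10 stacked sigmoid layers:
print(0.25 ** 10)         # below 1e-6 -- early layers barely learn
```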
Why can't a Perceptron solve XOR?
Because XOR is not linearly separable. A single Perceptron can only draw a straight line decision boundary.
What is the purpose of an Activation Function?
To introduce non-linearity into the network, allowing it to learn complex patterns.
What is the output range of Tanh?
(-1, 1). It is zero-centered, unlike Sigmoid which is (0, 1).
What is "Dead ReLU"?
A state where a ReLU neuron only outputs 0 because its weights have updated such that the input is always negative. It stops learning.
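The "Dead ReLU" card can be made concrete by comparing gradients. For a negative pre-activation, ReLU's gradient is exactly zero (no learning signal), while Leaky ReLU keeps a small slope. A minimal sketch:

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def relu_grad(x):
    # Gradient of ReLU w.r.t. its input
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # Gradient of Leaky ReLU w.r.t. its input
    return 1.0 if x > 0 else alpha

x = -2.0  # a neuron whose pre-activation is stuck negative
print(relu_grad(x))        # 0.0  -> no weight updates: the unit is "dead"
print(leaky_relu_grad(x))  # 0.01 -> a small gradient keeps it learning
```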

3. Cheat Sheet

| Concept | Formula / Definition | Key Usage |
| --- | --- | --- |
| Perceptron Output | y = 1 if w·x + b > 0, else 0 | Simple binary classification |
| Sigmoid | σ(x) = 1 / (1 + e⁻ˣ) | Binary probability output |
| Tanh | tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) | Zero-centered hidden layers (legacy) |
| ReLU | max(0, x) | Default for hidden layers |
| Leaky ReLU | max(0.01x, x) | Fixes "Dead ReLU" |
| Softmax | e^zᵢ / Σⱼ e^zⱼ | Multi-class probability output |
| Update Rule | w ← w + α(y − ŷ)x | Perceptron learning |
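Two of the cheat-sheet rows benefit from a direct translation into code: Softmax (with the standard max-subtraction trick for numerical stability, an implementation detail beyond the bare formula) and one step of the perceptron update rule. A minimal sketch:

```python
import math

def softmax(z):
    # e^z_i / sum_j e^z_j; subtracting max(z) avoids overflow
    # without changing the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def perceptron_step(w, b, x, y, lr=0.1):
    # One application of the update rule: w <- w + alpha * (y - y_hat) * x
    y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
    err = y - y_hat
    w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    b = b + lr * err
    return w, b

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # three probabilities summing to 1
print(perceptron_step([0.0, 0.0], 0.0, [1.0, 1.0], 1))
```

Note that if the prediction is already correct, `err` is 0 and the weights are left unchanged, which is why perceptron training converges on linearly separable data.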

4. Next Steps

Now that you understand the architecture, it’s time to learn how to train these deep networks using Gradient Descent and Backpropagation.