Module Review: Information Theory
Key Takeaways
- Shannon Entropy (H): Measures the average uncertainty or “surprise” in a probability distribution. Maximized when all outcomes are equally likely.
- KL Divergence (DKL): Measures the information loss when approximating a true distribution P with a model Q. It is non-symmetric and non-negative.
- Mutual Information (I): Measures how much knowing one variable reduces uncertainty about another. I(X; Y) = H(X) − H(X | Y).
- Cross-Entropy (H(P, Q)): The standard loss function for classification. Minimizing cross-entropy is equivalent to minimizing the KL divergence between the true and predicted distributions.
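To make the first takeaway concrete, here is a minimal plain-Python sketch (base-2 logs, so entropy is in bits) showing that a uniform distribution maximizes Shannon entropy; the `entropy` helper is illustrative, not part of any module code.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution (list of probabilities)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Uniform over 4 outcomes: entropy hits its maximum, log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A skewed distribution over the same 4 outcomes has strictly lower entropy.
print(entropy([0.7, 0.1, 0.1, 0.1]))
```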
Flashcards
Test your understanding of the core concepts.
What is Shannon Entropy?
The expected value of surprisal. It quantifies the uncertainty in a probability distribution.
H(X) = − Σ P(x) log P(x)
Is KL Divergence symmetric?
No.
DKL(P || Q) ≠ DKL(Q || P)
It is not a true distance metric.
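A quick numerical check of the asymmetry, using an illustrative `kl` helper (base-2 logs; assumes Q is nonzero wherever P is):

```python
import math

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
# The two directions give different values, so KL is not a distance metric,
# but both are non-negative (Gibbs' inequality).
print(kl(p, q))
print(kl(q, p))
```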
What is Mutual Information if X and Y are independent?
Zero.
If independent, knowing Y gives no information about X, so the reduction in uncertainty is 0.
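This can be verified directly from a joint distribution. The sketch below computes I(X; Y) from a joint table via the equivalent identity I(X; Y) = Σ P(x, y) log [P(x, y) / (P(x)P(y))]; the function name and example tables are illustrative.

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits from a joint distribution P(x, y) given as a 2D list."""
    px = [sum(row) for row in joint]              # marginal P(x)
    py = [sum(col) for col in zip(*joint)]        # marginal P(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Independent X and Y: joint factorizes as P(x)P(y), so I(X; Y) is ~0
# (up to floating-point noise).
independent = [[0.3 * 0.6, 0.3 * 0.4],
               [0.7 * 0.6, 0.7 * 0.4]]
print(mutual_information(independent))

# Perfectly correlated fair bits: I(X; Y) = H(X) = 1 bit.
correlated = [[0.5, 0.0],
              [0.0, 0.5]]
print(mutual_information(correlated))  # 1.0
```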
Why do we minimize Cross-Entropy in ML?
Minimizing Cross-Entropy is mathematically equivalent to minimizing the KL Divergence between the true distribution (labels) and the predicted distribution.
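The equivalence follows from the decomposition H(P, Q) = H(P) + DKL(P || Q): since H(P) is fixed by the labels, minimizing cross-entropy over Q minimizes the KL term. A small sketch verifying the identity numerically (helper names are illustrative):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0]   # one-hot true label
q = [0.7, 0.2, 0.1]   # model prediction
# H(P, Q) = H(P) + D_KL(P || Q). With one-hot labels, H(P) = 0,
# so cross-entropy and KL divergence coincide exactly.
print(cross_entropy(p, q))
print(entropy(p) + kl(p, q))
```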
Cheat Sheet
| Concept | Formula | Notes |
|---|---|---|
| Shannon Entropy | H(X) = − Σ P(x) log P(x) | Average surprisal (bits). |
| KL Divergence | DKL(P \|\| Q) = Σ P(x) log (P(x) / Q(x)) | Info lost when using Q to approximate P. |
| Joint Entropy | H(X, Y) = − Σ Σ P(x, y) log P(x, y) | Uncertainty of the pair (X, Y). |
| Conditional Entropy | H(X \| Y) = − Σ Σ P(x, y) log P(x \| y) | Uncertainty of X given Y. |
| Mutual Information | I(X; Y) = H(X) − H(X \| Y) | Reduction in uncertainty. |
| Cross-Entropy | H(P, Q) = − Σ P(x) log Q(x) | Loss function (labels P vs. predictions Q). |
Next Steps
You have mastered the foundations of Information Theory! These concepts are the bedrock of modern Machine Learning and Statistics.
- Practice: Try deriving the gradients for Cross-Entropy Loss yourself.
- Review: Check the Probability Glossary for definitions.
- Move On: Continue to Module 01: Descriptive Statistics (or whichever module is next in your curriculum).
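For the practice item, one way to check your derivation is against finite differences. For softmax outputs with a one-hot label y, the gradient of the cross-entropy loss with respect to the logits z works out to softmax(z) − y; the sketch below (plain Python, natural log, illustrative variable names) compares that analytic form with a numerical estimate.

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, y):
    """Cross-entropy of one-hot label y against softmax(z), natural log."""
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

z = [2.0, -1.0, 0.5]
y = [0.0, 1.0, 0.0]

# Analytic gradient: dL/dz_i = softmax(z)_i - y_i
analytic = [pi - yi for pi, yi in zip(softmax(z), y)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(z)):
    zp, zm = z[:], z[:]
    zp[i] += eps
    zm[i] -= eps
    numeric.append((loss(zp, y) - loss(zm, y)) / (2 * eps))

print(analytic)
print(numeric)
```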