Module Review: Information Theory

Key Takeaways

  1. Shannon Entropy (H): Measures the average uncertainty or “surprise” in a probability distribution. Maximized when all outcomes are equally likely (see the sketch just after this list).
  2. KL Divergence (DKL): Measures the information loss when approximating a true distribution P with a model Q. It is non-symmetric and non-negative.
  3. Mutual Information (I): Measures how much knowing one variable reduces uncertainty about another. I(X; Y) = H(X) − H(X | Y).
  4. Cross-Entropy (H(P, Q)): The standard loss function for classification. Minimizing Cross-Entropy is equivalent to minimizing KL Divergence between truth and prediction.
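
A minimal check of takeaway 1 (entropy is maximized by the uniform distribution), assuming NumPy is available; the entropy helper and the two toy distributions below are made up purely for illustration:

    import numpy as np

    def entropy(p):
        """Shannon entropy in bits: H = -sum p * log2(p), treating 0 log 0 as 0."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # drop zero-probability outcomes
        return -np.sum(p * np.log2(p))

    uniform = [0.25, 0.25, 0.25, 0.25]
    skewed = [0.70, 0.10, 0.10, 0.10]

    print(entropy(uniform))  # 2.0 bits -- the maximum for four outcomes
    print(entropy(skewed))   # ~1.36 bits -- less uncertainty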

Flashcards

Test your understanding of the core concepts.

What is Shannon Entropy?

The expected value of surprisal. It quantifies the uncertainty in a probability distribution.

H(X) = − Σ P(x) log P(x)
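
For a concrete example, a fair coin with log base 2:

H(X) = − (0.5 log 0.5 + 0.5 log 0.5) = − (−0.5 − 0.5) = 1 bit

A biased coin with P(heads) = 0.9 gives H(X) ≈ 0.47 bits: the more predictable the outcome, the lower the entropy.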

Is KL Divergence symmetric?

No.

DKL(P || Q) ≠ DKL(Q || P)

It is not a true distance metric.
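
A minimal numerical illustration of the asymmetry, assuming NumPy; the kl_divergence helper and the two distributions are only a sketch:

    import numpy as np

    def kl_divergence(p, q):
        """DKL(P || Q) = sum p * log2(p / q) in bits, assuming q > 0 wherever p > 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0  # terms with p = 0 contribute nothing
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

    p = [0.8, 0.2]
    q = [0.5, 0.5]

    print(kl_divergence(p, q))  # ~0.28 bits
    print(kl_divergence(q, p))  # ~0.32 bits -- a different value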

What is Mutual Information if X and Y are independent?

Zero.

If independent, knowing Y gives no information about X, so the reduction in uncertainty is 0.
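
A small sketch of this, assuming NumPy; the mutual_information helper and the marginals below are illustrative only. It builds an independent joint distribution as the outer product of two marginals, so the mutual information comes out to zero:

    import numpy as np

    def mutual_information(joint):
        """I(X; Y) = sum over x, y of P(x, y) * log2( P(x, y) / (P(x) P(y)) ), in bits."""
        joint = np.asarray(joint, dtype=float)
        px = joint.sum(axis=1, keepdims=True)  # marginal P(x) as a column
        py = joint.sum(axis=0, keepdims=True)  # marginal P(y) as a row
        mask = joint > 0
        return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

    marginal_x = np.array([0.3, 0.7])
    marginal_y = np.array([0.6, 0.4])
    independent_joint = np.outer(marginal_x, marginal_y)  # P(x, y) = P(x) P(y)

    print(mutual_information(independent_joint))  # 0.0 (up to floating-point error)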

Why do we minimize Cross-Entropy in ML?

Minimizing Cross-Entropy is mathematically equivalent to minimizing the KL Divergence between the true distribution (labels) and the predicted distribution.
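
The short algebra behind this equivalence:

H(P, Q) = − Σ P(x) log Q(x)
        = − Σ P(x) log P(x) + Σ P(x) log (P(x) / Q(x))
        = H(P) + DKL(P || Q)

Since H(P) is fixed by the true labels and does not depend on the model, minimizing H(P, Q) over Q is the same as minimizing DKL(P || Q).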

Cheat Sheet

Concept                Formula                                      Notes
Shannon Entropy        H(X) = − Σ P(x) log P(x)                     Average surprisal (bits).
KL Divergence          DKL(P || Q) = Σ P(x) log (P(x) / Q(x))       Information lost when using Q to approximate P.
Joint Entropy          H(X, Y) = − Σ Σ P(x, y) log P(x, y)          Uncertainty of the pair (X, Y).
Conditional Entropy    H(X | Y) = − Σ Σ P(x, y) log P(x | y)        Uncertainty of X given Y.
Mutual Information     I(X; Y) = H(X) − H(X | Y)                    Reduction in uncertainty about X from knowing Y.
Cross-Entropy          H(P, Q) = − Σ P(x) log Q(x)                  Loss function (labels P vs. predictions Q).
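
To tie the rows together, here is a minimal sketch, assuming NumPy; the toy joint distribution is made up for illustration. It checks the chain rule H(X | Y) = H(X, Y) − H(Y) and the mutual-information identity from the table:

    import numpy as np

    def H(p):
        """Entropy in bits of any probability table (the table is flattened first)."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Toy dependent joint distribution P(x, y); rows index x, columns index y.
    joint = np.array([[0.4, 0.1],
                      [0.1, 0.4]])

    p_x = joint.sum(axis=1)          # marginal P(x)
    p_y = joint.sum(axis=0)          # marginal P(y)

    h_x_given_y = H(joint) - H(p_y)  # chain rule: H(X | Y) = H(X, Y) - H(Y)
    mi = H(p_x) - h_x_given_y        # I(X; Y) = H(X) - H(X | Y)

    print(h_x_given_y)  # ~0.72 bits
    print(mi)           # ~0.28 bits -- positive because X and Y are dependent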

Next Steps

You have mastered the foundations of Information Theory! These concepts are the bedrock of modern Machine Learning and Statistics.

Back to Module Index