Entropy & KL Divergence
Information Theory provides the mathematical framework for quantifying information, uncertainty, and the “distance” between probability distributions. These concepts are foundational to modern Machine Learning, particularly in loss functions and generative models.
1. The Concept of Information
What is “information”? In this context, information is related to surprise.
- If I tell you “the sun rose today,” there is very little information because the event was highly probable (low surprise).
- If I tell you “it snowed in the Sahara today,” there is a lot of information because the event was highly improbable (high surprise).
Mathematically, the surprisal (or self-information) of an event x with probability P(x) is defined as:

I(x) = -log P(x) = log(1 / P(x))
[!NOTE] We typically use base-2 logarithms (log2), measuring information in bits. In Machine Learning, we often use the natural logarithm (ln), measuring information in nats.
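As a quick sketch of the definition above (the `surprisal` helper name is ours, not a library function), compare a likely event with an unlikely one:

```python
import math

def surprisal(p, base=2):
    """Self-information -log(p); base 2 gives bits, base e gives nats."""
    return -math.log(p, base)

# A near-certain event carries almost no information...
print(f"P=0.99  -> {surprisal(0.99):.4f} bits")
# ...while a rare event carries a lot (~9.97 bits).
print(f"P=0.001 -> {surprisal(0.001):.4f} bits")
# The same rare event measured in nats:
print(f"P=0.001 -> {surprisal(0.001, math.e):.4f} nats")
```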
2. Shannon Entropy
Shannon Entropy is simply the expected value of the surprisal. It measures the average uncertainty in a random variable.
For a discrete random variable X with possible outcomes x1, …, xn occurring with probabilities P(x1), …, P(xn), the entropy H(X) is:

H(X) = -Σi P(xi) log2 P(xi)
- High Entropy: The distribution is nearly uniform (maximum uncertainty). All outcomes are equally likely.
- Low Entropy: The distribution is concentrated (low uncertainty). One outcome is very likely.
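The high/low contrast can be checked numerically. A minimal sketch (the `entropy_bits` helper is our own naming):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; skips zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximum uncertainty over 4 outcomes
peaked  = [0.97, 0.01, 0.01, 0.01]  # one outcome dominates

print(entropy_bits(uniform))  # 2.0 bits == log2(4)
print(entropy_bits(peaked))   # ~0.24 bits, much less uncertain
```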
Interactive: Entropy of a Coin Flip
Consider a coin with probability of heads p. The probability of tails is 1-p. The entropy is:

H(p) = -p log2(p) - (1-p) log2(1-p)
Use the slider below to change p (probability of Heads) and observe how Entropy changes.
Notice that entropy is maximized (1 bit) when p=0.5 (maximum uncertainty). It drops to 0 when the outcome is certain (p=0 or p=1).
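The curve the slider traces can be sketched directly from the formula (the `binary_entropy` helper is our own naming):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0  # limit x*log(x) -> 0 as x -> 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy peaks at p = 0.5 and falls to 0 at the certain extremes:
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={p}: H={binary_entropy(p):.4f} bits")
```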
3. Hardware Reality: Bits, Nats, and Floating Point
Why do we care about different bases for logarithms?
Bits vs. Nats
- Base 2 (Bits): Used in computer science and information theory. A bit represents the information gained from a fair coin flip (p=0.5).
- Base e (Nats): Used in calculus and physics. The derivative of e^x is e^x, making “nats” more convenient for gradients in Machine Learning (e.g., in PyTorch or TensorFlow).
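The two units differ only by a constant factor of ln 2 (one nat ≈ 1.4427 bits), which a quick check confirms:

```python
import math

p = [0.5, 0.5]  # fair coin
h_bits = -sum(x * math.log2(x) for x in p)  # 1.0 bit
h_nats = -sum(x * math.log(x) for x in p)   # ln(2) ~ 0.6931 nats
# Dividing by ln(2) converts nats back to bits:
print(f"{h_bits:.4f} bits == {h_nats:.4f} nats")
```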
Floating Point Underflow
In practice, probabilities P(x) can become extremely small (e.g., 10^-50).
- Multiplying many small probabilities results in underflow (values becoming 0).
- Log-Space: To avoid this, we work in “log-space”. Instead of P(x), we store log P(x).
- Multiplication becomes addition: log(A × B) = log A + log B.
- This transforms tiny numbers into manageable negative numbers (e.g., log10(10^-50) = -50).
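A short demonstration of the underflow and the log-space fix:

```python
import math

# Multiplying 1,000 probabilities of 0.01 each underflows to exactly 0.0,
# since 10^-2000 is far below the smallest positive double (~1e-308):
prob = 1.0
for _ in range(1000):
    prob *= 0.01
print(prob)  # 0.0 -- underflow

# In log-space the product becomes a sum and stays representable:
log_prob = sum(math.log(0.01) for _ in range(1000))
print(log_prob)  # ~ -4605.17, i.e. ln(10^-2000)
```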
[!IMPORTANT] Numerical Stability: When implementing entropy, always handle the P(x)=0 case separately. Mathematically, lim(x→0) x log x = 0. In code, `0 * log(0)` returns `NaN` unless explicitly handled.
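This failure mode is easy to reproduce with NumPy; masking out zero entries is one common fix (a sketch, not a library function):

```python
import numpy as np

p = np.array([0.5, 0.5, 0.0])

# Naive version: log2(0) = -inf, and 0 * -inf = nan, poisoning the sum.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = -np.sum(p * np.log2(p))
print(naive)  # nan

# Masking zero entries applies the limit x*log(x) -> 0 explicitly:
mask = p > 0
safe = -np.sum(p[mask] * np.log2(p[mask]))
print(safe)  # 1.0
```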
4. Kullback-Leibler (KL) Divergence
KL Divergence (or relative entropy) measures how one probability distribution P diverges from a second, expected probability distribution Q:

DKL(P || Q) = Σx P(x) log2( P(x) / Q(x) )

- P(x): The true distribution (e.g., observed data).
- Q(x): The approximating distribution (e.g., our model’s predictions).
[!IMPORTANT] KL Divergence is not a true “distance” metric because it is not symmetric: DKL(P || Q) ≠ DKL(Q || P). It satisfies DKL(P || Q) ≥ 0, and is zero if and only if P = Q.
Interactive: Visualizing Divergence
The charts below represent two distributions over 5 categories.
- Blue (P): The True Distribution (Fixed).
- Green (Q): The Approximated Distribution.

Adjust the sliders to change the weights of Q. The system will normalize them automatically to ensure they sum to 1. Try to match P to minimize DKL.
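What the widget does under the hood can be sketched in a few lines (the raw slider weights below are made up for illustration):

```python
import numpy as np

P = np.array([0.1, 0.5, 0.2, 0.15, 0.05])  # fixed true distribution

# Hypothetical raw slider weights for Q (need not sum to 1):
weights = np.array([2.0, 4.0, 1.0, 2.0, 1.0])
Q = weights / weights.sum()  # normalize so Q is a valid distribution

# D_KL(P || Q) in bits; safe here because every P[i] and Q[i] is > 0:
d_kl = np.sum(P * np.log2(P / Q))
print(f"D_KL(P || Q) = {d_kl:.4f} bits")
```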
5. Code Implementation
Calculate Entropy and KL Divergence in Java, Go, and Python.
Java

```java
public class InformationTheory {

    // Calculate Shannon Entropy (in bits)
    public static double entropy(double[] probabilities) {
        double entropy = 0.0;
        for (double p : probabilities) {
            if (p > 0) {
                entropy -= p * (Math.log(p) / Math.log(2)); // log2(p)
            }
        }
        return entropy;
    }

    // Calculate KL Divergence D(P || Q) (in bits)
    public static double klDivergence(double[] p, double[] q) {
        double divergence = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0) {
                // Handle Q[i] == 0 case (should technically be infinity)
                double qVal = Math.max(q[i], 1e-10);
                divergence += p[i] * (Math.log(p[i] / qVal) / Math.log(2));
            }
        }
        // Clamp tiny negative values caused by floating-point error
        return Math.max(0, divergence);
    }

    public static void main(String[] args) {
        double[] P = {0.1, 0.5, 0.2, 0.15, 0.05};
        double[] Q = {0.2, 0.2, 0.2, 0.2, 0.2};
        System.out.printf("Entropy of P: %.4f bits%n", entropy(P));
        System.out.printf("KL Divergence D(P || Q): %.4f bits%n", klDivergence(P, Q));
    }
}
```
Go

```go
package main

import (
	"fmt"
	"math"
)

// Entropy calculates Shannon Entropy in bits
func Entropy(probabilities []float64) float64 {
	entropy := 0.0
	for _, p := range probabilities {
		if p > 0 {
			entropy -= p * math.Log2(p)
		}
	}
	return entropy
}

// KLDivergence calculates D(P || Q) in bits
func KLDivergence(p, q []float64) float64 {
	divergence := 0.0
	for i, val := range p {
		if val > 0 {
			qVal := q[i]
			if qVal == 0 {
				qVal = 1e-10 // Avoid division by zero
			}
			divergence += val * math.Log2(val/qVal)
		}
	}
	// Clamp tiny negative values caused by floating-point error
	if divergence < 0 {
		return 0
	}
	return divergence
}

func main() {
	P := []float64{0.1, 0.5, 0.2, 0.15, 0.05}
	Q := []float64{0.2, 0.2, 0.2, 0.2, 0.2}
	fmt.Printf("Entropy of P: %.4f bits\n", Entropy(P))
	fmt.Printf("KL Divergence D(P || Q): %.4f bits\n", KLDivergence(P, Q))
}
```
Python

```python
import numpy as np
from scipy.stats import entropy

# Define two probability distributions
# P: The true distribution
# Q: The model's approximation
P = np.array([0.1, 0.5, 0.2, 0.15, 0.05])
Q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# 1. Shannon Entropy of P
# H(P) = -sum(P * log(P))
# scipy.stats.entropy uses base e (nats) by default; pass base=2 for bits.
H_P = entropy(P, base=2)
print(f"Entropy of P: {H_P:.4f} bits")

# 2. KL Divergence D(P || Q)
# scipy.stats.entropy(pk, qk) computes KL divergence when qk is given.
D_KL = entropy(P, Q, base=2)
print(f"KL Divergence D(P || Q): {D_KL:.4f} bits")

# Verify asymmetry
D_KL_reverse = entropy(Q, P, base=2)
print(f"KL Divergence D(Q || P): {D_KL_reverse:.4f} bits")
```
Key Takeaways
- Entropy measures the uncertainty or “information content” of a distribution.
- KL Divergence measures the inefficiency of assuming distribution Q when the true distribution is P.
- Minimizing KL Divergence is equivalent to maximizing the likelihood of the data under the model (more on this in Cross-Entropy).