Entropy & KL Divergence
Information Theory provides the mathematical framework for quantifying information, uncertainty, and the “distance” between probability distributions. These concepts are foundational to modern Machine Learning, particularly in loss functions and generative models.
1. The Concept of Information
What is “information”? In this context, information is related to surprise.
- If I tell you “the sun rose today,” there is very little information because the event was highly probable (low surprise).
- If I tell you “it snowed in the Sahara today,” there is a lot of information because the event was highly improbable (high surprise).
Mathematically, the surprisal (or self-information) of an event x with probability P(x) is defined as:

I(x) = -log P(x) = log(1 / P(x))
[!NOTE] We typically use base-2 logarithms (log2), measuring information in bits. In Machine Learning, we often use the natural logarithm (ln), measuring information in nats.
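As a quick sketch of the definition above (the `surprisal` helper name is ours, not a library function), compare a likely event with an unlikely one:

```python
import math

def surprisal(p, base=2):
    """Self-information -log(p); base 2 gives bits, base e gives nats."""
    return -math.log(p, base)

# A near-certain event carries almost no information...
print(f"P=0.99  -> {surprisal(0.99):.4f} bits")
# ...while a rare event carries a lot (~9.97 bits).
print(f"P=0.001 -> {surprisal(0.001):.4f} bits")
# The same rare event measured in nats:
print(f"P=0.001 -> {surprisal(0.001, math.e):.4f} nats")
```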
2. Shannon Entropy
Shannon Entropy is simply the expected value of the surprisal. It measures the average uncertainty in a random variable.
For a discrete random variable X with possible outcomes x1, …, xn occurring with probabilities P(x1), …, P(xn), the entropy H(X) is:

H(X) = -Σi P(xi) log2 P(xi)
- High Entropy: The distribution is nearly uniform (maximum uncertainty). All outcomes are equally likely.
- Low Entropy: The distribution is concentrated (low uncertainty). One outcome is very likely.
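The high/low contrast can be checked numerically. A minimal sketch (the `entropy_bits` helper is our own naming):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; skips zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximum uncertainty over 4 outcomes
peaked  = [0.97, 0.01, 0.01, 0.01]  # one outcome dominates

print(entropy_bits(uniform))  # 2.0 bits == log2(4)
print(entropy_bits(peaked))   # ~0.24 bits, much less uncertain
```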
Interactive: Entropy of a Coin Flip
Consider a coin with probability of heads p. The probability of tails is 1-p. The entropy is:

H(p) = -p log2(p) - (1-p) log2(1-p)
Use the slider below to change p (probability of Heads) and observe how Entropy changes.
Notice that entropy is maximized (1 bit) when p=0.5 (maximum uncertainty). It drops to 0 when the outcome is certain (p=0 or p=1).
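The curve the slider traces can be sketched directly from the formula (the `binary_entropy` helper is our own naming):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0  # limit x*log(x) -> 0 as x -> 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy peaks at p = 0.5 and falls to 0 at the certain extremes:
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={p}: H={binary_entropy(p):.4f} bits")
```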
3. Hardware Reality: Bits, Nats, and Floating Point
Why do we care about different bases for logarithms?
Bits vs. Nats
- Base 2 (Bits): Used in computer science and information theory. A bit represents the information gained from a fair coin flip (p=0.5).
- Base e (Nats): Used in calculus and physics. The derivative of e^x is e^x, making “nats” more convenient for gradients in Machine Learning (e.g., in PyTorch or TensorFlow).
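The two units differ only by a constant factor of ln 2 (one nat ≈ 1.4427 bits), which a quick check confirms:

```python
import math

p = [0.5, 0.5]  # fair coin
h_bits = -sum(x * math.log2(x) for x in p)  # 1.0 bit
h_nats = -sum(x * math.log(x) for x in p)   # ln(2) ~ 0.6931 nats
# Dividing by ln(2) converts nats back to bits:
print(f"{h_bits:.4f} bits == {h_nats:.4f} nats")
```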
Floating Point Underflow
In practice, probabilities P(x) can become extremely small (e.g., 10^-50).
- Multiplying many small probabilities results in underflow (values becoming 0).
- Log-Space: To avoid this, we work in “log-space”. Instead of P(x), we store log P(x).
- Multiplication becomes addition: log(A × B) = log A + log B.
- This transforms tiny numbers into manageable negative numbers (e.g., log10(10^-50) = -50).
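A short demonstration of the underflow and the log-space fix:

```python
import math

# Multiplying 1,000 probabilities of 0.01 each underflows to exactly 0.0,
# since 10^-2000 is far below the smallest positive double (~1e-308):
prob = 1.0
for _ in range(1000):
    prob *= 0.01
print(prob)  # 0.0 -- underflow

# In log-space the product becomes a sum and stays representable:
log_prob = sum(math.log(0.01) for _ in range(1000))
print(log_prob)  # ~ -4605.17, i.e. ln(10^-2000)
```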
[!IMPORTANT] Numerical Stability: When implementing entropy, always handle the P(x)=0 case separately. Mathematically, lim(x→0) x log x = 0. In code, `0 * log(0)` returns `NaN` unless explicitly handled.
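This failure mode is easy to reproduce with NumPy; masking out zero entries is one common fix (a sketch, not a library function):

```python
import numpy as np

p = np.array([0.5, 0.5, 0.0])

# Naive version: log2(0) = -inf, and 0 * -inf = nan, poisoning the sum.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = -np.sum(p * np.log2(p))
print(naive)  # nan

# Masking zero entries applies the limit x*log(x) -> 0 explicitly:
mask = p > 0
safe = -np.sum(p[mask] * np.log2(p[mask]))
print(safe)  # 1.0
```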
4. Kullback-Leibler (KL) Divergence
KL Divergence (or relative entropy) measures how one probability distribution P diverges from a second, expected probability distribution Q:

DKL(P || Q) = Σx P(x) log2( P(x) / Q(x) )

- P(x): The true distribution (e.g., observed data).
- Q(x): The approximating distribution (e.g., our model’s predictions).
[!IMPORTANT] KL Divergence is not a true “distance” metric because it is not symmetric: DKL(P || Q) ≠ DKL(Q || P). It satisfies DKL(P || Q) ≥ 0, and is zero if and only if P = Q.
Interactive: Visualizing Divergence
The charts below represent two distributions over 5 categories.
- Blue (P): The True Distribution (Fixed).
- Green (Q): The Approximated Distribution.

Adjust the sliders to change the weights of Q. The system will normalize them automatically to ensure they sum to 1. Try to match P to minimize DKL.
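What the widget does under the hood can be sketched in a few lines (the raw slider weights below are made up for illustration):

```python
import numpy as np

P = np.array([0.1, 0.5, 0.2, 0.15, 0.05])  # fixed true distribution

# Hypothetical raw slider weights for Q (need not sum to 1):
weights = np.array([2.0, 4.0, 1.0, 2.0, 1.0])
Q = weights / weights.sum()  # normalize so Q is a valid distribution

# D_KL(P || Q) in bits; safe here because every P[i] and Q[i] is > 0:
d_kl = np.sum(P * np.log2(P / Q))
print(f"D_KL(P || Q) = {d_kl:.4f} bits")
```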
5. Code Implementation
Calculate Entropy and KL Divergence in Java, Go, and Python.
Java

```java
public class InformationTheory {

    // Calculate Shannon Entropy (in bits)
    public static double entropy(double[] probabilities) {
        double entropy = 0.0;
        for (double p : probabilities) {
            if (p > 0) {
                entropy -= p * (Math.log(p) / Math.log(2)); // log2(p)
            }
        }
        return entropy;
    }

    // Calculate KL Divergence D(P || Q) (in bits)
    public static double klDivergence(double[] p, double[] q) {
        double divergence = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0) {
                // Handle Q[i] == 0 case (should technically be infinity)
                double qVal = Math.max(q[i], 1e-10);
                divergence += p[i] * (Math.log(p[i] / qVal) / Math.log(2));
            }
        }
        // Clamp tiny negative values caused by floating-point error
        return Math.max(0, divergence);
    }

    public static void main(String[] args) {
        double[] P = {0.1, 0.5, 0.2, 0.15, 0.05};
        double[] Q = {0.2, 0.2, 0.2, 0.2, 0.2};
        System.out.printf("Entropy of P: %.4f bits%n", entropy(P));
        System.out.printf("KL Divergence D(P || Q): %.4f bits%n", klDivergence(P, Q));
    }
}
```
Go

```go
package main

import (
	"fmt"
	"math"
)

// Entropy calculates Shannon Entropy in bits
func Entropy(probabilities []float64) float64 {
	entropy := 0.0
	for _, p := range probabilities {
		if p > 0 {
			entropy -= p * math.Log2(p)
		}
	}
	return entropy
}

// KLDivergence calculates D(P || Q) in bits
func KLDivergence(p, q []float64) float64 {
	divergence := 0.0
	for i, val := range p {
		if val > 0 {
			qVal := q[i]
			if qVal == 0 {
				qVal = 1e-10 // Avoid division by zero
			}
			divergence += val * math.Log2(val/qVal)
		}
	}
	// Clamp tiny negative values caused by floating-point error
	if divergence < 0 {
		return 0
	}
	return divergence
}

func main() {
	P := []float64{0.1, 0.5, 0.2, 0.15, 0.05}
	Q := []float64{0.2, 0.2, 0.2, 0.2, 0.2}
	fmt.Printf("Entropy of P: %.4f bits\n", Entropy(P))
	fmt.Printf("KL Divergence D(P || Q): %.4f bits\n", KLDivergence(P, Q))
}
```
Python

```python
import numpy as np
from scipy.stats import entropy

# Define two probability distributions
# P: The true distribution
# Q: The model's approximation
P = np.array([0.1, 0.5, 0.2, 0.15, 0.05])
Q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# 1. Shannon Entropy of P
# H(P) = -sum(P * log(P))
# scipy.stats.entropy uses base e (nats) by default; pass base=2 for bits.
H_P = entropy(P, base=2)
print(f"Entropy of P: {H_P:.4f} bits")

# 2. KL Divergence D(P || Q)
# scipy.stats.entropy(pk, qk) computes KL divergence when qk is given.
D_KL = entropy(P, Q, base=2)
print(f"KL Divergence D(P || Q): {D_KL:.4f} bits")

# Verify asymmetry
D_KL_reverse = entropy(Q, P, base=2)
print(f"KL Divergence D(Q || P): {D_KL_reverse:.4f} bits")
```
Key Takeaways
- Entropy measures the uncertainty or “information content” of a distribution.
- KL Divergence measures the inefficiency of assuming distribution Q when the true distribution is P.
- Minimizing KL Divergence is equivalent to maximizing the likelihood of the data under the model (more on this in Cross-Entropy).