Mutual Information

While entropy measures the uncertainty of a single random variable, Mutual Information (MI) measures the relationship between two random variables: it tells us how much knowing one variable reduces uncertainty about the other.

1. Joint and Conditional Entropy

Before defining Mutual Information, we need two building blocks:

Joint Entropy H(X, Y)

The uncertainty associated with a pair of random variables (X, Y).

H(X, Y) = − Σx Σy P(x, y) log P(x, y)
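
As a quick sketch (the joint distribution below is made up for illustration), joint entropy can be computed directly from a table of P(x, y):

```python
import math

# Hypothetical joint distribution P(x, y) for two binary variables
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# H(X, Y) = -sum over (x, y) of P(x, y) * log2 P(x, y)
h_xy = -sum(p * math.log2(p) for p in p_xy.values() if p > 0)
print(f"H(X, Y) = {h_xy:.4f} bits")
```

The `if p > 0` guard matters in general: zero-probability cells contribute nothing to the sum, but `log2(0)` would raise an error.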

Conditional Entropy H(X|Y)

The average uncertainty remaining in X after we know Y.

H(X|Y) = Σy P(y) H(X|Y=y)
H(X|Y) = − Σx Σy P(x, y) log P(x|y)

[!NOTE] The Chain Rule for Entropy: H(X, Y) = H(X) + H(Y|X). The total uncertainty of the pair is the uncertainty of X plus the remaining uncertainty of Y once X is known.
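
The chain rule can be checked numerically; this sketch uses a hypothetical joint distribution and confirms that H(X) + H(Y|X) matches H(X, Y):

```python
import math

# Hypothetical joint distribution P(x, y)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal P(x), obtained by summing the joint over y
p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_x = -sum(p * math.log2(p) for p in p_x.values())
# H(Y|X) = -sum P(x, y) * log2 P(y|x), with P(y|x) = P(x, y) / P(x)
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items())
h_xy = -sum(p * math.log2(p) for p in p_xy.values())

print(f"H(X) + H(Y|X) = {h_x + h_y_given_x:.4f} bits")
print(f"H(X, Y)       = {h_xy:.4f} bits")  # equal, by the chain rule
```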

2. Mutual Information (MI)

Mutual Information I(X; Y) is the reduction in uncertainty of X due to the knowledge of Y.

I(X; Y) = H(X) − H(X|Y)

It is symmetric: I(X; Y) = I(Y; X). It can also be written as:

I(X; Y) = H(X) + H(Y) − H(X, Y)

If X and Y are independent, knowing Y gives no information about X, so H(X|Y) = H(X) and I(X; Y) = 0.
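
A small sketch (with a made-up joint distribution) illustrates the identity I(X; Y) = H(X) + H(Y) − H(X, Y), and that MI vanishes when P(x, y) = P(x)P(y):

```python
import math

def entropy_bits(probs):
    """H(P) = -sum p log2 p over the nonzero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution with dependence between X and Y
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = [0.5, 0.5]
p_y = [0.5, 0.5]

mi = entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(p_xy.values())
print(f"I(X; Y) = {mi:.4f} bits")  # positive: X and Y are dependent

# Independent case: P(x, y) = P(x) * P(y), so I(X; Y) = 0
p_indep = [px * py for px in p_x for py in p_y]
mi_indep = entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(p_indep)
print(f"Independent I(X; Y) = {mi_indep:.4f} bits")
```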

The Information Venn Diagram

The relationship between these quantities is best visualized as a Venn Diagram.

  • Circle 1 (Left): Entropy of X, H(X).
  • Circle 2 (Right): Entropy of Y, H(Y).
  • Intersection: Mutual Information I(X; Y).
  • Union: Joint Entropy H(X, Y).

Worked example: with H(X) = 4.00 bits, H(Y) = 4.00 bits, and overlap I(X; Y) = 2.00 bits, the remaining quantities follow directly: H(X|Y) = H(X) − I(X; Y) = 2.00 bits, H(Y|X) = H(Y) − I(X; Y) = 2.00 bits, and H(X, Y) = H(X) + H(Y) − I(X; Y) = 6.00 bits.

3. Intuition: Information Gain

Think of MI as Information Gain.

  • Scenario A: Correlation. X = “It is raining”, Y = “Ground is wet”. Knowing Y significantly reduces the uncertainty about X, so MI is high.
  • Scenario B: Independence. X = “It is raining”, Y = “Stock market is up”. Knowing Y tells you nothing about X, so MI is zero.

In Decision Trees (e.g., ID3, C4.5 algorithms), we choose the split feature that maximizes Information Gain (Mutual Information) with the target variable.
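
As an illustration of this selection rule (the weather-style dataset and the `mutual_info_bits` helper are made up for the sketch), a tree would score each candidate feature by its MI with the target and split on the best one:

```python
import math
from collections import Counter

def mutual_info_bits(xs, ys):
    """I(X; Y) in bits, estimated from paired samples via empirical counts."""
    n = len(xs)
    p_x = Counter(xs)
    p_y = Counter(ys)
    p_xy = Counter(zip(xs, ys))
    # I(X; Y) = sum P(x, y) * log2( P(x, y) / (P(x) * P(y)) )
    return sum(
        (c / n) * math.log2(c * n / (p_x[x] * p_y[y]))
        for (x, y), c in p_xy.items()
    )

# Hypothetical dataset: two candidate split features and a binary target
outlook = ["sun", "sun", "rain", "rain", "sun", "rain", "rain", "sun"]
windy   = [0, 1, 0, 1, 1, 0, 1, 0]
play    = [1, 1, 0, 0, 1, 0, 0, 1]

# Score each feature by its mutual information with the target
scores = {
    "outlook": mutual_info_bits(outlook, play),
    "windy": mutual_info_bits(windy, play),
}
best = max(scores, key=scores.get)
print(f"Scores: {scores}, best split: {best}")
```

In this toy data, `outlook` perfectly predicts `play` (MI equals H(play) = 1 bit) while `windy` is uninformative (MI = 0), so the tree would split on `outlook` first.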

4. Code Implementation

Calculate Mutual Information from discrete data in Java, Go, and Python.

Java

import java.util.HashMap;
import java.util.Map;

public class MutualInformation {

    public static double calculateMI(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Double> pX = new HashMap<>();
        Map<Integer, Double> pY = new HashMap<>();
        Map<String, Double> pXY = new HashMap<>();

        // 1. Calculate Probabilities
        for (int i = 0; i < n; i++) {
            pX.put(x[i], pX.getOrDefault(x[i], 0.0) + 1.0);
            pY.put(y[i], pY.getOrDefault(y[i], 0.0) + 1.0);
            String key = x[i] + "," + y[i];
            pXY.put(key, pXY.getOrDefault(key, 0.0) + 1.0);
        }

        // Normalize
        pX.replaceAll((k, v) -> v / n);
        pY.replaceAll((k, v) -> v / n);
        pXY.replaceAll((k, v) -> v / n);

        // 2. Calculate MI
        // I(X;Y) = Sum P(x,y) * log2(P(x,y) / (P(x)*P(y)))
        double mi = 0.0;
        for (Map.Entry<String, Double> entry : pXY.entrySet()) {
            String[] parts = entry.getKey().split(",");
            int valX = Integer.parseInt(parts[0]);
            int valY = Integer.parseInt(parts[1]);
            double jointP = entry.getValue();
            double marginalP = pX.get(valX) * pY.get(valY);

            if (jointP > 0) {
                mi += jointP * (Math.log(jointP / marginalP) / Math.log(2));
            }
        }
        return mi;
    }

    public static void main(String[] args) {
        int[] X = {0, 0, 1, 1, 0, 1, 0, 1, 1, 1};
        int[] Y = {0, 0, 1, 1, 0, 1, 1, 1, 1, 0};
        System.out.printf("Mutual Information: %.4f bits%n", calculateMI(X, Y));
    }
}

Go

package main

import (
	"fmt"
	"math"
)

func CalculateMI(x, y []int) float64 {
	n := float64(len(x))
	pX := make(map[int]float64)
	pY := make(map[int]float64)
	pXY := make(map[string]float64)

	// 1. Calculate Counts
	for i := 0; i < len(x); i++ {
		pX[x[i]]++
		pY[y[i]]++
		key := fmt.Sprintf("%d,%d", x[i], y[i])
		pXY[key]++
	}

	// 2. Calculate MI
	mi := 0.0
	for key, count := range pXY {
		jointP := count / n

		var valX, valY int
		fmt.Sscanf(key, "%d,%d", &valX, &valY)

		px := pX[valX] / n
		py := pY[valY] / n

		if jointP > 0 {
			mi += jointP * math.Log2(jointP/(px*py))
		}
	}
	return mi
}

func main() {
	X := []int{0, 0, 1, 1, 0, 1, 0, 1, 1, 1}
	Y := []int{0, 0, 1, 1, 0, 1, 1, 1, 1, 0}
	fmt.Printf("Mutual Information: %.4f bits\n", CalculateMI(X, Y))
}

Python

import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.stats import entropy

# Example: X and Y are discrete variables
# X: 0 = Rain, 1 = No Rain
# Y: 0 = Umbrella, 1 = No Umbrella
X = [0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
Y = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0] # Y is correlated with X

# 1. Using sklearn (works on raw labels)
# Note: mutual_info_score uses natural log (nats)
mi_nats = mutual_info_score(X, Y)
mi_bits = mi_nats / np.log(2)
print(f"Mutual Information: {mi_bits:.4f} bits")

# 2. Manual Calculation from Joint Distribution
def calc_mutual_information(x, y):
    # Compute Joint Probability Matrix P(x,y)
    coords = np.stack((x, y), axis=-1)
    unique, counts = np.unique(coords, axis=0, return_counts=True)
    P_xy = counts / len(x)

    # Compute Marginals P(x) and P(y)
    P_x = np.array([np.sum(P_xy[unique[:,0] == val]) for val in np.unique(x)])
    P_y = np.array([np.sum(P_xy[unique[:,1] == val]) for val in np.unique(y)])

    # H(X) and H(Y)
    H_x = entropy(P_x, base=2)
    H_y = entropy(P_y, base=2)

    # H(X,Y)
    H_xy = entropy(P_xy, base=2)

    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return H_x + H_y - H_xy

mi_manual = calc_mutual_information(X, Y)
print(f"Manual MI Calculation: {mi_manual:.4f} bits")

Key Takeaways

  1. MI measures dependency: It quantifies how much information X and Y share.
  2. Non-linear: Unlike correlation (which measures linear relationships), MI captures non-linear dependencies.
  3. Feature Selection: In Machine Learning, MI is often used to select features that are most informative about the target variable.