Conditional Independence

Two variables X and Y might appear correlated, but that doesn’t mean one causes the other. Often, a third variable Z influences both, creating a spurious correlation.

This chapter explores Conditional Independence, a core concept for causal inference and the secret sauce behind efficient Machine Learning algorithms like Naive Bayes.

1. The Definition

Two random variables X and Y are conditionally independent given a third variable Z if, once we know the value of Z, learning Y gives us no extra information about X.

Mathematically:

P(X, Y | Z) = P(X | Z)P(Y | Z)

Equivalently:

P(X | Y, Z) = P(X | Z)

This is denoted as: **(X ⊥ Y | Z)**.

[!NOTE] Marginal independence (X ⊥ Y) and conditional independence (X ⊥ Y | Z) are distinct properties. Variables can be dependent but conditionally independent (the confounder case below), or independent but conditionally dependent (e.g., after conditioning on a common effect, a "collider").
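The definition can be checked numerically. Below is a small sketch with a hypothetical joint distribution over three binary variables, constructed so that X ⊥ Y | Z holds by design; the code verifies the factorization and shows that X and Y are nonetheless marginally dependent:

```python
import numpy as np

# Hypothetical distribution: P(Z), P(X|Z), P(Y|Z) chosen freely
p_z = np.array([0.4, 0.6])            # P(Z=0), P(Z=1)
p_x_given_z = np.array([[0.9, 0.1],   # P(X=0|Z=0), P(X=1|Z=0)
                        [0.2, 0.8]])  # P(X=0|Z=1), P(X=1|Z=1)
p_y_given_z = np.array([[0.7, 0.3],
                        [0.1, 0.9]])

# Build the joint: P(x, y, z) = P(z) P(x|z) P(y|z)
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

# Check X ⊥ Y | Z: P(x, y | z) = P(x|z) P(y|z) for every z
for z in range(2):
    p_xy_given_z = joint[:, :, z] / joint[:, :, z].sum()
    assert np.allclose(p_xy_given_z,
                       np.outer(p_x_given_z[z], p_y_given_z[z]))

# But X and Y are NOT marginally independent: P(x, y) != P(x) P(y)
p_xy = joint.sum(axis=2)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
print(np.allclose(p_xy, np.outer(p_x, p_y)))  # False: Z couples X and Y
```

This is exactly the "dependent but conditionally independent" case from the note above: marginalizing Z out induces a dependence between X and Y that disappears once Z is held fixed.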


2. The Classic Example: Shoe Size and Reading

Imagine we collect data on elementary school students. We measure their Shoe Size (X) and their Reading Ability (Y).

  • Observation: We find a strong positive correlation. Kids with bigger feet read better!
  • Conclusion?: Should we buy bigger shoes to help kids read? Obviously not.

There is a confounding variable: Age (Z).

  • Older kids have bigger feet.
  • Older kids have better reading ability.

If we condition on Age (e.g., look only at 7-year-olds), the correlation vanishes. Among 7-year-olds, shoe size does not predict reading ability.

Thus: **Shoe Size ⊥ Reading Ability | Age**.

3. Interactive: The Confounder Switch

Explore how a hidden variable Z creates a false correlation between X and Y.

  • Scenario: X and Y are independent within groups (Group A and Group B).
  • The Trap: Group B has both higher X and higher Y on average.
  • Toggle: Switch “Condition on Z” to reveal the groups and see the true relationship. Notice how the trend lines flip or flatten!
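The mechanism behind the toggle can be sketched in a few lines of Python. Within each group, X and Y are independent noise, but Group B is shifted up in both (the shift of 5 and group sizes are illustrative choices, not taken from the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups: X and Y are independent within each, but Group B
# is shifted up in both dimensions (the hidden Z)
n = 500
shift = {'A': 0.0, 'B': 5.0}
groups = np.repeat(['A', 'B'], n)
x = np.concatenate([rng.normal(shift[g], 1, n) for g in ('A', 'B')])
y = np.concatenate([rng.normal(shift[g], 1, n) for g in ('A', 'B')])

# Pooled data: the group offset manufactures a strong correlation
print(f"Overall: {np.corrcoef(x, y)[0, 1]:.2f}")  # large positive

# Condition on Z (the group): the correlation collapses toward 0
for g in ('A', 'B'):
    m = groups == g
    print(f"Group {g}: {np.corrcoef(x[m], y[m])[0, 1]:.2f}")
```

Pooling the groups is equivalent to *not* conditioning on Z; stratifying by group is the code analogue of flipping the "Condition on Z" switch.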

4. Hardware Reality: Naive Bayes and Storage

The assumption of Conditional Independence is the engine behind many high-speed algorithms.

Consider a spam filter that looks at 50,000 words (X1, …, X50000). To calculate P(Spam | Words), we need the joint likelihood P(Words | Spam).

  • Without Independence: We need a probability for every combination of word occurrences. With binary presence/absence features, that's on the order of 2^50000 entries. Impossible.
  • With Independence (Naive Bayes): We assume words are conditionally independent given the class (Spam/Not Spam).
P(Words | Spam) ≈ P(Word_1 | Spam) × P(Word_2 | Spam) × ...

We only need to store P(Word_i | Spam) for each word. That’s just 50,000 entries per class.

Hardware Impact: This reduces the space complexity from exponential O(K^N) to linear O(N·K), where N is the vocabulary size and K the number of classes. A table that small fits comfortably in memory, with hot rows in CPU cache, making Naive Bayes blazingly fast despite being “naive”.
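Here is a minimal sketch of what that linear storage looks like in practice, using a hypothetical four-document toy corpus (the documents, `p_word_given_class` helper, and smoothing constant are illustrative, not a real library API):

```python
from collections import defaultdict

# Toy labeled corpus (hypothetical)
docs = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["free", "money"]),
    ("ham",  ["meeting", "at", "noon"]),
    ("ham",  ["lunch", "money", "meeting"]),
]

# One count per (class, word) pair -- NOT a joint table over word combinations
counts = defaultdict(lambda: defaultdict(int))  # counts[class][word]
class_totals = defaultdict(int)                 # total words per class
for label, words in docs:
    for w in words:
        counts[label][w] += 1
        class_totals[label] += 1

vocab = {w for _, ws in docs for w in ws}

def p_word_given_class(word, label, alpha=1.0):
    """P(word | class) with Laplace smoothing."""
    return (counts[label][word] + alpha) / (class_totals[label] + alpha * len(vocab))

# Storage is O(N * K): one number per (word, class) pair
print(len(vocab) * len(counts), "entries instead of", f"2**{len(vocab)}")
```

Scoring a message then just multiplies these per-word factors (in practice, sums their logs), which is exactly the conditional-independence factorization from the formula above.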


5. Coding Example: Removing Confounders

Let’s simulate the Shoe Size / Reading Ability example in Java, Go, and Python.

Java Example

import java.util.Random;

public class ConfounderDemo {
    public static void main(String[] args) {
        Random rand = new Random();
        int n = 1000;
        double[] x = new double[n]; // Shoe Size
        double[] y = new double[n]; // Reading Ability
        double[] z = new double[n]; // Age (Confounder)

        for (int i = 0; i < n; i++) {
            z[i] = 5 + rand.nextDouble() * 5; // Age 5-10
            x[i] = 2 * z[i] + rand.nextGaussian(); // Bigger feet
            y[i] = 5 * z[i] + rand.nextGaussian() * 2; // Better reading
        }

        // 1. Naive Correlation
        System.out.printf("Naive Correlation: %.2f\n", correlation(x, y));

        // 2. Condition on Age ~ 7 (Stratification)
        int count = 0;
        double[] xSub = new double[n];
        double[] ySub = new double[n];
        for (int i = 0; i < n; i++) {
            if (z[i] > 6.9 && z[i] < 7.1) {
                xSub[count] = x[i];
                ySub[count] = y[i];
                count++;
            }
        }

        // Resize arrays to exact count for correlation
        double[] xFinal = new double[count];
        double[] yFinal = new double[count];
        System.arraycopy(xSub, 0, xFinal, 0, count);
        System.arraycopy(ySub, 0, yFinal, 0, count);

        System.out.printf("Conditional Correlation (Age ~ 7): %.2f\n", correlation(xFinal, yFinal));
    }

    public static double correlation(double[] xs, double[] ys) {
        double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0, syy = 0.0;
        int n = xs.length;
        for (int i = 0; i < n; ++i) {
            sx += xs[i];
            sy += ys[i];
            sxy += xs[i] * ys[i];
            sxx += xs[i] * xs[i];
            syy += ys[i] * ys[i];
        }
        return (n * sxy - sx * sy) / Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    }
}

Go Example

package main

import (
	"fmt"
	"math"
	"math/rand"
)

func main() {
	n := 1000
	x := make([]float64, n)
	y := make([]float64, n)
	z := make([]float64, n)

	for i := 0; i < n; i++ {
		z[i] = 5 + rand.Float64()*5
		x[i] = 2*z[i] + rand.NormFloat64()
		y[i] = 5*z[i] + rand.NormFloat64()*2
	}

	// 1. Naive Correlation
	fmt.Printf("Naive Correlation: %.2f\n", correlation(x, y))

	// 2. Condition on Age ~ 7
	var xSub, ySub []float64
	for i := 0; i < n; i++ {
		if z[i] > 6.9 && z[i] < 7.1 {
			xSub = append(xSub, x[i])
			ySub = append(ySub, y[i])
		}
	}

	fmt.Printf("Conditional Correlation (Age ~ 7): %.2f\n", correlation(xSub, ySub))
}

func correlation(xs, ys []float64) float64 {
	n := float64(len(xs))
	sumX, sumY, sumXY, sumXX, sumYY := 0.0, 0.0, 0.0, 0.0, 0.0

	for i := 0; i < len(xs); i++ {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
		sumYY += ys[i] * ys[i]
	}

	return (n*sumXY - sumX*sumY) / math.Sqrt((n*sumXX-sumX*sumX)*(n*sumYY-sumY*sumY))
}

Python Example

import numpy as np

# 1. Generate Confounder Z (Age in years, 5 to 10)
n_samples = 1000
Z = np.random.uniform(5, 10, n_samples)

# 2. Generate X (Shoe Size) depends on Z
X = 2 * Z + np.random.normal(0, 1, n_samples)

# 3. Generate Y (Reading Ability) depends on Z
Y = 5 * Z + np.random.normal(0, 2, n_samples)

# 4. Naive Correlation (X, Y)
corr_naive = np.corrcoef(X, Y)[0, 1]
print(f"Naive Correlation: {corr_naive:.2f}")

# 5. Condition on Z (Stratification)
mask = (Z > 6.9) & (Z < 7.1)
X_sub = X[mask]
Y_sub = Y[mask]

corr_cond = np.corrcoef(X_sub, Y_sub)[0, 1]
print(f"Conditional Correlation (Age ~ 7): {corr_cond:.2f}")

6. Summary

  • **Conditional Independence (X ⊥ Y | Z)**: Knowing Z separates X and Y.
  • Confounding: A common cause creates a spurious correlation.
  • Naive Bayes: Exploits conditional independence to reduce storage from exponential to linear.
  • Controlling: We “control” for confounders by holding them constant (conditioning) to find true causal links.

Next: Covariance Matrices