Conditional Independence
Two variables X and Y might appear correlated, but that doesn’t mean one causes the other. Often, a third variable Z influences both, creating a spurious correlation.
This chapter explores Conditional Independence, a core concept for causal inference and the secret sauce behind efficient Machine Learning algorithms like Naive Bayes.
1. The Definition
Two random variables X and Y are conditionally independent given a third variable Z if, once we know the value of Z, learning Y gives us no extra information about X.
Mathematically:

P(X | Y, Z) = P(X | Z)

Equivalently:

P(X, Y | Z) = P(X | Z) · P(Y | Z)
This is denoted as: **(X ⊥ Y | Z)**.
> [!NOTE]
> Independence (X ⊥ Y) and Conditional Independence (X ⊥ Y | Z) are different. Variables can be dependent but conditionally independent, or vice versa.
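To make the definition concrete, here is a minimal sketch in Python. The probability tables are hypothetical numbers chosen purely for illustration: the joint distribution is built from the factorization P(Z) · P(X | Z) · P(Y | Z), so X and Y are conditionally independent given Z by construction, and we can verify that learning Y changes nothing once Z is known.

```python
# Hypothetical conditional probability tables over binary X, Y, Z.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_y_given_z[z][y]

# Joint distribution built so that X and Y only interact through Z.
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def prob(predicate):
    """Sum the joint probability over all outcomes matching the predicate."""
    return sum(v for (x, y, z), v in joint.items() if predicate(x, y, z))

# P(X=1 | Y=1, Z=1): condition on both Y and Z.
p_x1_given_y1_z1 = (prob(lambda x, y, z: x == 1 and y == 1 and z == 1)
                    / prob(lambda x, y, z: y == 1 and z == 1))
# P(X=1 | Z=1): condition on Z alone.
p_x1_given_z1 = (prob(lambda x, y, z: x == 1 and z == 1)
                 / prob(lambda x, y, z: z == 1))

print(f"P(X=1 | Y=1, Z=1) = {p_x1_given_y1_z1:.2f}")
print(f"P(X=1 | Z=1)      = {p_x1_given_z1:.2f}")  # equal: Y added nothing
```

Both conditional probabilities come out identical, which is exactly the statement P(X | Y, Z) = P(X | Z).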
2. The Classic Example: Shoe Size and Reading
Imagine we collect data on elementary school students. We measure their Shoe Size (X) and their Reading Ability (Y).
- Observation: We find a strong positive correlation. Kids with bigger feet read better!
- Conclusion?: Should we buy bigger shoes to help kids read? Obviously not.
There is a confounding variable: Age (Z).
- Older kids have bigger feet.
- Older kids have better reading ability.
If we condition on Age (e.g., look only at 7-year-olds), the correlation vanishes. Among 7-year-olds, shoe size does not predict reading ability.
Thus: **Shoe Size ⊥ Reading Ability | Age**.
3. Interactive: The Confounder Switch
Explore how a hidden variable Z creates a false correlation between X and Y.
- Scenario: X and Y are independent within groups (Group A and Group B).
- The Trap: Group B has both higher X and higher Y on average.
- Toggle: Switch “Condition on Z” to reveal the groups and see the true relationship. Notice how the trend lines flip or flatten!
4. Hardware Reality: Naive Bayes and Storage
The assumption of Conditional Independence is the engine behind many high-speed algorithms.
Consider a spam filter that looks at 50,000 words (X₁, …, X₅₀₀₀₀). To calculate P(Spam | Words), we need the joint likelihood P(Words | Spam).
- Without Independence: We need to store a probability for every combination of the 50,000 (binary) word indicators. That’s 2^50,000 entries. Impossible.
- With Independence (Naive Bayes): We assume words are conditionally independent given the class (Spam/Not Spam).
We only need to store P(Wordᵢ | Spam) for each word. That’s just 50,000 entries per class.
Hardware Impact: This reduces the space complexity from exponential O(K^N) to linear O(N·K), where N is the number of words and K the number of classes. A table that small fits comfortably in CPU cache, making Naive Bayes blazingly fast despite being “naive”.
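The payoff at prediction time can be sketched in a few lines of Python. The vocabulary and probabilities below are made up for illustration; the point is that the conditional independence assumption turns the joint likelihood into a product of per-word factors, computed as a sum of logs over the linear-size table.

```python
import math

# Hypothetical per-word conditional probabilities: one entry per
# (word, class) pair -- O(N * K) storage instead of O(K^N).
p_word_given_class = {
    "spam":     {"free": 0.8, "meeting": 0.1, "winner": 0.7},
    "not_spam": {"free": 0.2, "meeting": 0.6, "winner": 0.05},
}
p_class = {"spam": 0.5, "not_spam": 0.5}

def log_score(words, label):
    # log P(class) + sum_i log P(word_i | class):
    # the independence assumption replaces a joint lookup with a sum.
    score = math.log(p_class[label])
    for w in words:
        score += math.log(p_word_given_class[label][w])
    return score

message = ["free", "winner"]
best = max(p_class, key=lambda c: log_score(message, c))
print(best)  # prints "spam"
```

Working in log space is the standard trick here: multiplying 50,000 small probabilities would underflow a float, while summing their logs stays numerically stable.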
5. Coding Example: Removing Confounders
Let’s simulate the Shoe Size / Reading Ability example in Java, Go, and Python.
Java Example
import java.util.Random;

public class ConfounderDemo {
    public static void main(String[] args) {
        Random rand = new Random();
        int n = 1000;
        double[] x = new double[n]; // Shoe Size
        double[] y = new double[n]; // Reading Ability
        double[] z = new double[n]; // Age (Confounder)
        for (int i = 0; i < n; i++) {
            z[i] = 5 + rand.nextDouble() * 5; // Age 5-10
            x[i] = 2 * z[i] + rand.nextGaussian(); // Bigger feet
            y[i] = 5 * z[i] + rand.nextGaussian() * 2; // Better reading
        }
        // 1. Naive Correlation
        System.out.printf("Naive Correlation: %.2f\n", correlation(x, y));
        // 2. Condition on Age ~ 7 (Stratification)
        int count = 0;
        double[] xSub = new double[n];
        double[] ySub = new double[n];
        for (int i = 0; i < n; i++) {
            if (z[i] > 6.9 && z[i] < 7.1) {
                xSub[count] = x[i];
                ySub[count] = y[i];
                count++;
            }
        }
        // Resize arrays to exact count for correlation
        double[] xFinal = new double[count];
        double[] yFinal = new double[count];
        System.arraycopy(xSub, 0, xFinal, 0, count);
        System.arraycopy(ySub, 0, yFinal, 0, count);
        System.out.printf("Conditional Correlation (Age ~ 7): %.2f\n", correlation(xFinal, yFinal));
    }

    public static double correlation(double[] xs, double[] ys) {
        double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0, syy = 0.0;
        int n = xs.length;
        for (int i = 0; i < n; ++i) {
            sx += xs[i];
            sy += ys[i];
            sxy += xs[i] * ys[i];
            sxx += xs[i] * xs[i];
            syy += ys[i] * ys[i];
        }
        return (n * sxy - sx * sy) / Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    }
}
Go Example
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func main() {
	n := 1000
	x := make([]float64, n)
	y := make([]float64, n)
	z := make([]float64, n)
	for i := 0; i < n; i++ {
		z[i] = 5 + rand.Float64()*5
		x[i] = 2*z[i] + rand.NormFloat64()
		y[i] = 5*z[i] + rand.NormFloat64()*2
	}
	// 1. Naive Correlation
	fmt.Printf("Naive Correlation: %.2f\n", correlation(x, y))
	// 2. Condition on Age ~ 7
	var xSub, ySub []float64
	for i := 0; i < n; i++ {
		if z[i] > 6.9 && z[i] < 7.1 {
			xSub = append(xSub, x[i])
			ySub = append(ySub, y[i])
		}
	}
	fmt.Printf("Conditional Correlation (Age ~ 7): %.2f\n", correlation(xSub, ySub))
}

func correlation(xs, ys []float64) float64 {
	n := float64(len(xs))
	sumX, sumY, sumXY, sumXX, sumYY := 0.0, 0.0, 0.0, 0.0, 0.0
	for i := 0; i < len(xs); i++ {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
		sumYY += ys[i] * ys[i]
	}
	return (n*sumXY - sumX*sumY) / math.Sqrt((n*sumXX-sumX*sumX)*(n*sumYY-sumY*sumY))
}
Python Example
import numpy as np
# 1. Generate Confounder Z (Age in years, 5 to 10)
n_samples = 1000
Z = np.random.uniform(5, 10, n_samples)
# 2. Generate X (Shoe Size) depends on Z
X = 2 * Z + np.random.normal(0, 1, n_samples)
# 3. Generate Y (Reading Ability) depends on Z
Y = 5 * Z + np.random.normal(0, 2, n_samples)
# 4. Naive Correlation (X, Y)
corr_naive = np.corrcoef(X, Y)[0, 1]
print(f"Naive Correlation: {corr_naive:.2f}")
# 5. Condition on Z (Stratification)
mask = (Z > 6.9) & (Z < 7.1)
X_sub = X[mask]
Y_sub = Y[mask]
corr_cond = np.corrcoef(X_sub, Y_sub)[0, 1]
print(f"Conditional Correlation (Age ~ 7): {corr_cond:.2f}")
6. Summary
- **Conditional Independence (X ⊥ Y | Z)**: Knowing Z separates X and Y.
- Confounding: A common cause creates a spurious correlation.
- Naive Bayes: Exploits conditional independence to reduce storage from exponential to linear.
- Controlling: We “control” for confounders by holding them constant (conditioning) to find true causal links.