The Bias-Variance Tradeoff
In supervised learning and estimation, we want our models to generalize well to unseen data. However, there are two competing sources of error that prevent us from achieving perfect generalization: Bias and Variance.
The Bias-Variance Tradeoff explains why simple models underfit and complex models overfit. It mathematically decomposes the total error (Mean Squared Error) into three distinct components.
1. Decomposing the Error (MSE)
When we train a model f̂(x) to approximate a true relationship y = f(x) + ε, where ε is zero-mean noise with variance σ², the expected prediction error (Mean Squared Error) at a new data point x is:
MSE = E[(y - f̂(x))²]
Through algebraic expansion, this decomposes into:
MSE = Bias² + Variance + Irreducible Error
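Writing the observation as y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², the decomposition follows by adding and subtracting E[f̂(x)] inside the square; the cross terms vanish in expectation because ε is independent of f̂:

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) + \varepsilon - \hat{f}(x))^2\big] \\
  &= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
   + \underbrace{\sigma^2}_{\text{Irreducible}}
\end{aligned}
```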
1. Bias (Systematic Error)
Bias measures how far the average prediction of our model is from the true value. High bias means the model is too simple to capture the underlying pattern (Underfitting).
- Definition: Bias[f̂(x)] = E[f̂(x)] - f(x)
- High Bias: Linear Regression on quadratic data.
- Low Bias: High-degree polynomial regression.
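This systematic error is easy to see numerically. The sketch below (an illustration of my own, not from the text; the quadratic target, sample size, and noise level are arbitrary choices) repeatedly fits a straight line to noisy quadratic data and compares the average prediction at a fixed point against the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = 0.5                       # fixed query point
true_f = lambda x: x ** 2      # true (quadratic) relationship

preds = []
for _ in range(2000):
    # Fresh training set each round: quadratic signal plus noise
    x = rng.uniform(-1, 1, 30)
    y = true_f(x) + rng.normal(0, 0.1, 30)
    # Degree-1 fit: too simple to capture the curvature
    slope, intercept = np.polyfit(x, y, 1)
    preds.append(slope * x0 + intercept)

bias = np.mean(preds) - true_f(x0)
print(f"Bias at x = 0.5: {bias:.3f}")  # stays away from 0 no matter how many rounds
```

Averaging over more training sets shrinks the variance of the estimate, but the gap between the average prediction and f(x0) persists: that gap is the bias.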
2. Variance (Sensitivity to Training Data)
Variance measures how much the prediction for a given point changes if we retrain the model on a different dataset. High variance means the model is capturing noise in the training data (Overfitting).
- Definition: Var[f̂(x)] = E[(f̂(x) - E[f̂(x)])²]
- High Variance: Decision Trees (without pruning), k-NN with k=1.
- Low Variance: Linear Regression.
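One can measure this empirically by retraining two models on many independently drawn datasets and recording how much their predictions at a single point scatter. The sketch below is illustrative; the sine target, sample size, and polynomial degrees are assumptions for the demo:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x0 = 0.5  # fixed query point

preds_linear, preds_high = [], []
for _ in range(500):
    # A brand-new training set on every iteration
    x = rng.uniform(-1, 1, 30)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, 30)
    preds_linear.append(Polynomial.fit(x, y, 1)(x0))
    preds_high.append(Polynomial.fit(x, y, 12)(x0))

var_linear = np.var(preds_linear)
var_high = np.var(preds_high)
print(f"Prediction variance, degree 1:  {var_linear:.4f}")
print(f"Prediction variance, degree 12: {var_high:.4f}")
```

The rigid linear model barely reacts to resampling, while the degree-12 model's prediction swings with every new draw of the noise.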
3. Irreducible Error (σ²)
This is the noise inherent in the data itself (e.g., measurement error, thermal noise). No matter how good our model is, we cannot reduce this error. It acts as a lower bound on performance.
[!IMPORTANT] Why Squared Error?
Why do we care about Mean Squared Error rather than Mean Absolute Error?
- Differentiability: The square function x² is smooth and differentiable everywhere, making Gradient Descent easy (unlike |x|, which has a kink at 0).
- Gauss-Markov Theorem: Under the Gauss-Markov assumptions (zero-mean, uncorrelated, homoscedastic errors; normality is not required), minimizing squared error yields the Best Linear Unbiased Estimator (BLUE).
- Sensitivity to Outliers: Squared error penalizes large errors heavily (10² = 100, while 2² = 4). This is physically consistent with energy equations (Kinetic Energy ∝ v²).
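A quick arithmetic illustration of the outlier point: with one large residual among small ones, the squared loss is dominated by that single error, while the absolute loss is not.

```python
errors = [1.0, 2.0, 10.0]                        # one outlier among small errors
mse = sum(e ** 2 for e in errors) / len(errors)  # (1 + 4 + 100) / 3
mae = sum(abs(e) for e in errors) / len(errors)  # (1 + 2 + 10) / 3
print(mse)  # 35.0 -- the outlier contributes 100 of the 105 total
print(mae)  # 4.333... -- the outlier's influence is proportional, not quadratic
```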
2. The Tradeoff Visualized
Imagine a bullseye target. The center is the true value f(x).
- Low Bias, Low Variance (Ideal)
- High Bias, Low Variance (Underfitting)
- Low Bias, High Variance (Overfitting)
- High Bias, High Variance (Worst Case)
3. Interactive Demo: Polynomial Fitting
Fit a polynomial to noisy data generated from a sine wave.
- Degree 1 (Linear): High Bias (Underfitting). Cannot capture the curve.
- Degree 3-4: Balanced. Captures the curve well.
- Degree 15: High Variance (Overfitting). Wiggles wildly to hit every noise point.
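The same experiment can be run offline. The sketch below (illustrative only; sample size, noise level, and seed are arbitrary choices) fits polynomials of degree 1, 4, and 15 to noisy sine data and reports the error on the training points versus a held-out noiseless grid:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)  # noisy sine samples
x_test = np.linspace(0.05, 0.95, 200)
y_test = np.sin(2 * np.pi * x_test)                 # noiseless ground truth

results = {}
for deg in (1, 4, 15):
    p = Polynomial.fit(x, y, deg)
    train_mse = np.mean((p(x) - y) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    results[deg] = (train_mse, test_mse)
    print(f"degree {deg:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error falls monotonically with degree, but test error is U-shaped: the degree-1 fit misses the curve entirely, while the degree-15 fit chases the noise.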
4. Code Examples: Bias in Variance Estimation
A classic example of Bias is estimating the variance of a population.
- MLE Estimator: Σ(x - mean)² / n. This is Biased: it systematically underestimates the true variance.
- Sample Variance: Σ(x - mean)² / (n - 1). This is Unbiased.
The following code simulates this property.
Python
```python
import numpy as np

# Simulation settings
n_simulations = 10000
sample_size = 5
true_variance = 4.0

# Store estimates
mle_estimates = []
unbiased_estimates = []

np.random.seed(42)
for _ in range(n_simulations):
    # Draw sample from Normal(0, sqrt(4))
    sample = np.random.normal(0, np.sqrt(true_variance), sample_size)
    # MLE (divide by n)
    mle_var = np.var(sample, ddof=0)
    mle_estimates.append(mle_var)
    # Unbiased (divide by n-1)
    unbiased_var = np.var(sample, ddof=1)
    unbiased_estimates.append(unbiased_var)

# Bias = E[Estimate] - True
bias_mle = np.mean(mle_estimates) - true_variance
bias_unbiased = np.mean(unbiased_estimates) - true_variance

print(f"True Variance: {true_variance}")
print(f"MLE Bias: {bias_mle:.4f} (Expected: -0.8000)")
print(f"Unbiased Bias: {bias_unbiased:.4f} (Expected: 0.0000)")
```
Java
```java
import java.util.Random;

public class VarianceBias {
    public static void main(String[] args) {
        int nSimulations = 10000;
        int sampleSize = 5;
        double trueVariance = 4.0;

        double sumMle = 0;
        double sumUnbiased = 0;
        Random rand = new Random(42);

        for (int i = 0; i < nSimulations; i++) {
            double[] sample = new double[sampleSize];
            double sum = 0;
            for (int j = 0; j < sampleSize; j++) {
                sample[j] = rand.nextGaussian() * Math.sqrt(trueVariance);
                sum += sample[j];
            }
            double mean = sum / sampleSize;

            double sumSqDiff = 0;
            for (double x : sample) {
                sumSqDiff += Math.pow(x - mean, 2);
            }

            // MLE (divide by n)
            sumMle += sumSqDiff / sampleSize;
            // Unbiased (divide by n-1)
            sumUnbiased += sumSqDiff / (sampleSize - 1);
        }

        double expectedMle = sumMle / nSimulations;
        double expectedUnbiased = sumUnbiased / nSimulations;

        System.out.printf("True Variance: %.2f%n", trueVariance);
        System.out.printf("MLE Bias: %.4f%n", expectedMle - trueVariance);
        System.out.printf("Unbiased Bias: %.4f%n", expectedUnbiased - trueVariance);
    }
}
```
Go
```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func main() {
	nSimulations := 10000
	sampleSize := 5
	trueVariance := 4.0

	sumMle := 0.0
	sumUnbiased := 0.0

	// Seed
	rand.Seed(42)

	for i := 0; i < nSimulations; i++ {
		sample := make([]float64, sampleSize)
		sum := 0.0
		for j := 0; j < sampleSize; j++ {
			// NormFloat64 returns N(0, 1)
			val := rand.NormFloat64() * math.Sqrt(trueVariance)
			sample[j] = val
			sum += val
		}
		mean := sum / float64(sampleSize)

		sumSqDiff := 0.0
		for _, x := range sample {
			sumSqDiff += math.Pow(x-mean, 2)
		}

		// MLE (divide by n)
		sumMle += sumSqDiff / float64(sampleSize)
		// Unbiased (divide by n-1)
		sumUnbiased += sumSqDiff / float64(sampleSize-1)
	}

	expectedMle := sumMle / float64(nSimulations)
	expectedUnbiased := sumUnbiased / float64(nSimulations)

	fmt.Printf("True Variance: %.2f\n", trueVariance)
	fmt.Printf("MLE Bias: %.4f\n", expectedMle-trueVariance)
	fmt.Printf("Unbiased Bias: %.4f\n", expectedUnbiased-trueVariance)
}
```
5. Summary
- Bias is the error from erroneous assumptions (e.g., assuming data is linear when it’s quadratic).
- Variance is the error from sensitivity to small fluctuations in the training set.
- Complexity: Increasing model complexity decreases bias but increases variance.
- Regularization: Techniques like L2 Regularization (Ridge) work by deliberately adding bias to reduce variance, often lowering the total MSE.
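As a rough illustration of that last point, the sketch below compares closed-form ridge regression, w = (XᵀX + λI)⁻¹Xᵀy, on degree-10 polynomial features against an almost-unregularized fit, estimating bias² and variance at one query point by resimulation. The feature map, λ values, and noise level are all assumptions made for this demo:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, deg=10):
    # Plain polynomial feature map (an arbitrary choice for this sketch)
    return np.vander(x, deg + 1, increasing=True)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x0 = np.array([0.25])
f0 = np.sin(2 * np.pi * 0.25)  # true value at the query point

totals = {}
for lam in (1e-8, 1e-1):
    preds = []
    for _ in range(500):
        # Retrain on a fresh noisy dataset each round
        x = rng.uniform(0, 1, 30)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
        w = ridge_fit(features(x), y, lam)
        preds.append((features(x0) @ w)[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - f0) ** 2
    var = preds.var()
    totals[lam] = (bias2, var)
    print(f"lam = {lam:g}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

Increasing λ shrinks the coefficients toward zero: the fit becomes stiffer (more bias) but far less sensitive to which noisy sample it was trained on (less variance).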
Next, we will look at a simpler estimation technique: the Method of Moments.