Bias-Variance Tradeoff
The bias-variance tradeoff is a central problem in supervised learning. Ideally, we want to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both perfectly at once.
- High Bias causes an algorithm to miss the relevant relations between features and target outputs (underfitting).
- High Variance causes an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
1. The Error Decomposition
The expected error of a learning algorithm can be decomposed into three components:
Error = Bias² + Variance + Irreducible Error
1.1 Bias (Underfitting)
Bias is the error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- Concept: The model is “too simple” to capture the underlying structure of the data.
- Symptoms: High training error AND high validation error.
- Example: Linear regression on a dataset with a quadratic relationship.
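To make these symptoms concrete, here is a minimal sketch of exactly this example; the dataset size, split, and noise level are illustrative choices. A linear model fit to quadratic data leaves both training and validation MSE far above the noise floor, and more data will not help.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, 200)  # quadratic ground truth + noise

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
val_mse = mean_squared_error(y_val, model.predict(X_val))

# High-bias signature: BOTH errors are high (far above the 0.25 noise
# variance), because a line cannot represent a parabola.
print(f"train MSE = {train_mse:.2f}, val MSE = {val_mse:.2f}")
```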
1.2 Variance (Overfitting)
Variance is the error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
- Concept: The model is “too complex” and memorizes the noise in the training data.
- Symptoms: Low training error BUT high validation error.
- Example: High-degree polynomial regression.
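A matching sketch for the high-variance case, again with illustrative data: a degree-15 polynomial fit to 30 noisy samples drives training error toward zero while validation error stays much larger, producing the characteristic train/validation gap.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 30
X = np.sort(rng.random(n))[:, None]
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.normal(0, 0.1, n)

X_val = rng.random(100)[:, None]
y_val = np.cos(1.5 * np.pi * X_val[:, 0]) + rng.normal(0, 0.1, 100)

model = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
train_mse = mean_squared_error(y, model.predict(X))
val_mse = mean_squared_error(y_val, model.predict(X_val))

# High-variance signature: near-zero training error, much larger
# validation error -- the model has memorized the noise.
print(f"train MSE = {train_mse:.4f}, val MSE = {val_mse:.4f}")
```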
1.3 Irreducible Error (Noise)
The noise term (\epsilon) represents the fundamental limitation of the problem itself (e.g., measurement error, missing features). This error cannot be reduced by any model.
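The three terms of the decomposition can also be estimated empirically: refit the same model on many freshly drawn training sets and measure the spread of its predictions at a fixed point. A minimal sketch, assuming a deliberately simple degree-1 fit to a cosine target (the evaluation point, noise level, and trial count are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fun(x):
    return np.cos(1.5 * np.pi * x)

# Fixed evaluation point and noise level (illustrative choices).
x0, noise_sd, n_train, n_trials = 0.5, 0.1, 30, 2000

preds = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set each trial.
    X = rng.random(n_train)
    y = true_fun(X) + rng.normal(0, noise_sd, n_train)
    # Fit a deliberately simple degree-1 polynomial.
    coef = np.polyfit(X, y, deg=1)
    preds[t] = np.polyval(coef, x0)

bias_sq = (preds.mean() - true_fun(x0)) ** 2  # (E[f_hat] - f)^2
variance = preds.var()                        # spread across training sets
print(f"Bias^2={bias_sq:.4f}  Variance={variance:.4f}  Noise={noise_sd**2:.4f}")
```

For this underfit model, the squared-bias term dominates the variance term, as expected.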
2. Worked Example: Polynomial Fitting
The three regimes below show how model complexity (polynomial degree) affects bias and variance.
- Degree 1 (Underfitting): The line is too simple to capture the curve. High Bias.
- Degree 3 (Balanced): Captures the underlying sine wave pattern well. Low Bias, Low Variance.
- Degree 15 (Overfitting): Wiggles wildly to hit every single noisy point. Low Bias, High Variance.
3. Detecting Bias and Variance with Code
We can diagnose these issues by plotting training and validation error as a function of the training set size (learning curves) or of model complexity (validation curves).
Python Implementation
Here is how you can visualize the bias-variance tradeoff using Python and scikit-learn.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Generate Synthetic Data
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]  # Linear, Balanced, Overfit

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

# 2. Fit Models and Plot
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([
        ("polynomial_features", polynomial_features),
        ("linear_regression", linear_regression),
    ])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate on a dense grid to visualize the fitted curve
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}".format(
        degrees[i], mean_squared_error(y, pipeline.predict(X[:, np.newaxis]))))
plt.show()
```
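The plot above sweeps model complexity; learning curves, the other diagnostic mentioned earlier, can be produced with scikit-learn's `learning_curve` helper. A sketch for the high-bias (degree-1) case, with illustrative data: both curves converge to a similarly high error, which adding more data will not fix.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.random((200, 1))
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 200)

# Deliberately underfit model (degree 1).
model = make_pipeline(PolynomialFeatures(1), LinearRegression())
sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

# High-bias signature: both curves plateau at a similar, high error.
for n, tr, va in zip(sizes, train_mse, val_mse):
    print(f"n={n:3d}  train MSE={tr:.3f}  val MSE={va:.3f}")
```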
4. Key Takeaways
| Metric | High Bias (Underfitting) | High Variance (Overfitting) |
|---|---|---|
| Training Error | High | Low (approx 0) |
| Validation Error | High | High |
| Gap | Small gap between Train/Val | Large gap between Train/Val |
| Solution | Add features, increase complexity | Add data, regularization, decrease complexity |
> [!TIP]
> Regularization (L1/L2) is a technique to explicitly control variance by adding a penalty term to the loss function that discourages complex models (large coefficients).