Residual Analysis & Diagnostics

[!IMPORTANT] Fitting a model is easy; validating it is hard. If your residuals show a pattern, your model is missing something.

The Ordinary Least Squares (OLS) method is powerful, but it relies on strict assumptions. If these are violated, your p-values, confidence intervals, and predictions may be garbage. We check these assumptions by analyzing the residuals (e = y − ŷ).

1. Intuition: The Signal and the Noise

Why do we want our residuals to look like random static (white noise)?

Think about Information Theory.

  • Data = Model + Residuals
  • Information = Pattern + Noise

If your residuals show a pattern (e.g., a curve, a wave), it means there is still information left in the data that your model failed to capture.

  • Ideal Model: Extracts all the pattern. The leftovers (residuals) are pure, uninformative noise (Maximum Entropy).
  • Bad Model: Leaves patterns behind. You left “money on the table.”
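
As a minimal sketch of this idea (NumPy only, with a hypothetical quadratic dataset): fit a straight line to data whose true relationship is curved, and the leftover pattern shows up immediately in the residuals.

```python
import numpy as np

# Hypothetical data: the true relationship is quadratic (y = x^2),
# but we fit a straight line (np.polyfit, degree 1).
x = np.linspace(-1, 1, 101)
y = x ** 2

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# The residuals are not random static: positive at both ends,
# negative in the middle -- a "U" shape, i.e. leftover pattern.
print(f"residual at x=-1: {residuals[0]:+.3f}")
print(f"residual at x= 0: {residuals[50]:+.3f}")
print(f"residual at x=+1: {residuals[-1]:+.3f}")
```

The line "extracts" nothing useful here (the fitted slope is essentially zero by symmetry); all the quadratic structure is left on the table, visible as a curve in the residuals.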

2. The Four Assumptions (LIHN)

We remember the assumptions using the acronym LIHN:

  1. Linearity: The relationship is actually linear.
    • Check: Residuals vs. Fitted plot should show no clear pattern (just a random cloud).
  2. Independence: Errors are independent (no serial correlation).
    • Check: Durbin-Watson test (values near 2 are good). Critical for time-series data.
  3. Homoscedasticity: Constant variance of errors.
    • Check: Residuals vs. Fitted plot should show a band of constant width. No “funnel” shapes.
  4. Normality: Errors are normally distributed.
    • Check: Q-Q Plot should follow the 45-degree line.
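
The plot-based checks above can be roughly approximated with plain numbers. The sketch below (NumPy only; the statistics and thresholds are illustrative stand-ins, not formal hypothesis tests) computes one quick number per LIHN assumption.

```python
import numpy as np

def quick_diagnostics(fitted, residuals):
    """Crude numeric stand-ins for the four LIHN checks (illustrative only)."""
    f = np.asarray(fitted, dtype=float)
    e = np.asarray(residuals, dtype=float)

    # Linearity: a U-shaped residual pattern shows up as correlation
    # between the residuals and the squared (centered) fitted values.
    curvature = np.corrcoef((f - f.mean()) ** 2, e)[0, 1]

    # Independence: Durbin-Watson statistic; values near 2 are good.
    dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # Homoscedasticity: residual spread in the high vs. low half of the
    # fitted values; a ratio far from 1 hints at a funnel shape.
    order = np.argsort(f)
    half = len(e) // 2
    spread_ratio = np.std(e[order][half:]) / np.std(e[order][:half])

    # Normality: sample skewness; roughly 0 for symmetric errors.
    skew = np.mean((e - e.mean()) ** 3) / np.std(e) ** 3

    return {"curvature": curvature, "durbin_watson": dw,
            "spread_ratio": spread_ratio, "skewness": skew}

# Well-behaved residuals (pure white noise) pass all four checks.
rng = np.random.default_rng(0)
diag = quick_diagnostics(np.linspace(0, 10, 400), rng.normal(0, 1, 400))
print(diag)
```

For real work, prefer the named tests (Durbin-Watson, Breusch-Pagan, Q-Q plots); this is only a way to see what each assumption measures.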

3. Hardware Reality: The Autocorrelation Trap

In system design and finance, the Independence assumption is the most dangerous one to violate.

  • Scenario: You measure CPU usage every second.
  • The Trap: High CPU usage at t likely means high CPU usage at t+1. The errors are correlated (Autocorrelation).
  • The Consequence: OLS assumes every data point adds new information. If points are correlated, they are “echoes” of each other, so OLS thinks your effective sample size is far larger than it really is.
  • Result: Your Standard Errors (SE) shrink toward zero and your t-statistics explode. You get p < 0.00001 and think you found a breakthrough, but you actually found nothing.
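
A quick simulation makes the trap concrete (hypothetical numbers, NumPy only). Errors that are mostly echoes of the previous value drag the Durbin-Watson statistic toward 0, and a standard AR(1) approximation shows how little independent information the sample really contains.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Independent errors: every observation is genuinely new information.
iid = rng.normal(0, 1, n)

# Autocorrelated errors (AR(1) with rho = 0.9): each value is mostly
# an echo of the previous one, like second-by-second CPU usage.
rho = 0.9
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = rho * ar1[t - 1] + rng.normal(0, 1)

def durbin_watson(e):
    # DW is roughly 2 * (1 - rho): near 2 when independent, near 0 as rho -> 1.
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(f"DW, independent errors:    {durbin_watson(iid):.2f}")  # near 2
print(f"DW, autocorrelated errors: {durbin_watson(ar1):.2f}")  # near 0.2

# Standard effective-sample-size approximation for AR(1) errors:
# n_eff = n * (1 - rho) / (1 + rho). 1000 correlated points carry
# about as much information as ~50 independent ones.
print(f"Effective sample size: ~{n * (1 - rho) / (1 + rho):.0f} of {n}")
```

This is why time-series data usually needs autocorrelation-aware methods (Newey-West standard errors, GLS, or explicit ARIMA-style models) rather than plain OLS.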

4. Implementation: Calculating Residuals

Here is how to calculate residuals and compute their Root Mean Square Error (RMSE) in Go, Java, and Python.

Go

package main

import (
	"fmt"
	"math"
)

func main() {
	yTrue := []float64{2, 4, 5, 4, 5}
	yPred := []float64{1.8, 3.9, 5.2, 4.1, 4.8}

	var sumSquaredResiduals float64
	for i := 0; i < len(yTrue); i++ {
		residual := yTrue[i] - yPred[i]
		sumSquaredResiduals += residual * residual
		fmt.Printf("Obs %d: Residual = %.2f\n", i, residual)
	}

	rmse := math.Sqrt(sumSquaredResiduals / float64(len(yTrue)))
	fmt.Printf("RMSE: %.4f\n", rmse)
}

Java

public class ResidualsExample {
    public static void main(String[] args) {
        double[] yTrue = {2, 4, 5, 4, 5};
        double[] yPred = {1.8, 3.9, 5.2, 4.1, 4.8};

        double sumSquaredResiduals = 0;
        for (int i = 0; i < yTrue.length; i++) {
            double residual = yTrue[i] - yPred[i];
            sumSquaredResiduals += residual * residual;
            System.out.printf("Obs %d: Residual = %.2f\n", i, residual);
        }

        double rmse = Math.sqrt(sumSquaredResiduals / yTrue.length);
        System.out.printf("RMSE: %.4f\n", rmse);
    }
}

Python (Statsmodels Diagnostics)

We can generate these plots easily in Python. The most critical plot is Residuals vs Fitted.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# 1. Fit the model
# (Assuming X and y are already defined)
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()

# 2. Get Residuals and Fitted Values
residuals = results.resid
fitted_vals = results.fittedvalues

# 3. Create a 2x2 Diagnostic Plot
fig, ax = plt.subplots(2, 2, figsize=(12, 10))

# A. Residuals vs Fitted (Check Linearity & Homoscedasticity)
ax[0, 0].scatter(fitted_vals, residuals, alpha=0.5)
ax[0, 0].axhline(0, color='red', linestyle='--')
ax[0, 0].set_xlabel('Fitted Values')
ax[0, 0].set_ylabel('Residuals')
ax[0, 0].set_title('Residuals vs Fitted')

# B. Q-Q Plot (Check Normality)
sm.qqplot(residuals, line='45', ax=ax[0, 1])
ax[0, 1].set_title('Normal Q-Q')

# C. Scale-Location (Check Homoscedasticity)
# Square root of |standardized residuals| vs. fitted values
sqrt_abs_resid = np.sqrt(np.abs(results.get_influence().resid_studentized_internal))
ax[1, 0].scatter(fitted_vals, sqrt_abs_resid, alpha=0.5)
ax[1, 0].set_xlabel('Fitted Values')
ax[1, 0].set_ylabel('Sqrt(|Standardized Residuals|)')
ax[1, 0].set_title('Scale-Location')

# D. Histogram of Residuals (Check Normality)
ax[1, 1].hist(residuals, bins=15, edgecolor='black', alpha=0.7)
ax[1, 1].set_title('Histogram of Residuals')

plt.tight_layout()
plt.show()

Interpretation Guide

| Plot | What to look for | Violation sign | Solution |
| --- | --- | --- | --- |
| Residuals vs Fitted | Random scatter around 0 | Curved “U” shape | Add polynomial terms (x²) |
| Residuals vs Fitted | Band of constant width | Funnel shape | Log-transform y or Weighted Least Squares (WLS) |
| Normal Q-Q | Points on the 45° reference line | Points deviating at the tails | Robust regression or check for outliers |

[!TIP] Heteroscedasticity often occurs when analyzing financial data (e.g., income vs. spending). High-income earners have more variance in spending than low-income earners.
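
A sketch of that tip (hypothetical income figures, NumPy only): multiplicative noise creates the funnel on the raw scale, and a log transform of y largely removes it.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical income data: spending noise grows with income
# (multiplicative errors), producing the classic funnel shape.
income = rng.uniform(20_000, 200_000, 2000)
spending = 0.6 * income * np.exp(rng.normal(0, 0.3, 2000))

def spread_ratio(x, y):
    """Residual spread in the top half of x vs. the bottom half,
    after a straight-line fit (a crude funnel detector)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    order = np.argsort(x)
    half = len(x) // 2
    return np.std(resid[order][half:]) / np.std(resid[order][:half])

# Raw scale: high earners' residuals are far more variable (ratio >> 1).
print(f"spread ratio (raw): {spread_ratio(income, spending):.2f}")

# Log scale: the multiplicative noise becomes additive, and the
# funnel largely disappears (ratio near 1).
print(f"spread ratio (log): {spread_ratio(np.log(income), np.log(spending)):.2f}")
```

The log transform works here precisely because the noise is multiplicative; when it is not, Weighted Least Squares is the more general fix.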

5. Summary

  • Always plot your residuals. R-squared alone is deceptive.
  • Ideal Residuals: Random noise, constant variance, normally distributed.
  • Violations:
    • Curve: Model is underfitting (needs non-linear terms).
    • Funnel: Heteroscedasticity (needs transformation).
    • Outliers: Data quality issues or special cases.

Next: Regularization (Ridge & Lasso)