Simple Linear Regression
[!NOTE] Linear regression is the “Hello World” of machine learning, but don’t underestimate it. It forms the foundation for advanced techniques like Generalized Linear Models (GLMs) and Neural Networks.
Simple Linear Regression allows us to model the relationship between two continuous variables: a predictor (x) and a response (y). The goal is to find the “best fit” line that minimizes the error between the predicted and actual values.
1. The Genesis: Why Squared Errors?
Why do we minimize the Sum of Squared Errors (SSE) instead of the sum of absolute errors or cubic errors?
- The Geometric Intuition: Squared error is squared Euclidean distance, the quantity the Pythagorean theorem measures, so minimizing it has a clean geometric meaning. (Note that OLS minimizes the vertical distances from the points to the line, not the perpendicular ones.)
- The Probability Link (Gauss): Carl Friedrich Gauss showed that if the errors (ε) are normally distributed, minimizing the squared error is mathematically equivalent to Maximum Likelihood Estimation (MLE). It gives us the most probable coefficients given the data.
- The Calculus Convenience: The square function x<sup>2</sup> is smooth and convex. Its derivative is linear (2x), making it easy to find the minimum using calculus. The absolute value |x| has a sharp corner at 0, making differentiation messier.
2. The Mathematical Model
The population model for simple linear regression is defined as:
y<sub>i</sub> = β<sub>0</sub> + β<sub>1</sub>x<sub>i</sub> + ε<sub>i</sub>
Where:
- y<sub>i</sub>: The dependent variable (response) for the i-th observation.
- x<sub>i</sub>: The independent variable (predictor).
- β<sub>0</sub>: The intercept (the value of y when x = 0).
- β<sub>1</sub>: The slope (the change in y for a one-unit increase in x).
- ε<sub>i</sub>: The error term (residuals), assumed to be normally distributed N(0, σ<sup>2</sup>).
Ordinary Least Squares (OLS)
How do we find β<sub>0</sub> and β<sub>1</sub>? We use the method of Ordinary Least Squares (OLS), which minimizes the Sum of Squared Errors (SSE):

SSE = Σ<sub>i=1</sub><sup>n</sup> (y<sub>i</sub> − ŷ<sub>i</sub>)<sup>2</sup>

Where ŷ<sub>i</sub> is the predicted value.
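Setting the partial derivatives of the SSE to zero yields the textbook closed-form estimates β<sub>1</sub> = Σ(x<sub>i</sub> − x̄)(y<sub>i</sub> − ȳ) / Σ(x<sub>i</sub> − x̄)<sup>2</sup> and β<sub>0</sub> = ȳ − β<sub>1</sub>x̄. A minimal NumPy sketch, using the same toy dataset as the library examples below:

```python
import numpy as np

# Toy dataset (same one used in the library examples)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Closed-form OLS estimates, obtained by setting dSSE/dβ = 0
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(f"Intercept: {beta0:.2f}")  # 2.20
print(f"Slope:     {beta1:.2f}")  # 0.60
```

The library calls shown later should reproduce these same two numbers.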
3. Hardware Reality: The Cost of Inversion
In textbooks, you might see the “Normal Equation” to solve for coefficients directly:
β = (X<sup>T</sup>X)<sup>-1</sup>X<sup>T</sup>y
This looks elegant, but in Production Systems, we rarely use it directly. Why?
- Computational Complexity: Inverting a k × k matrix takes roughly O(k<sup>3</sup>) operations, where k is the number of features. For a dataset with 100,000 features, this is prohibitively expensive.
- Numerical Instability: Computers typically use IEEE 754 floating-point math. If your features are highly correlated (multicollinearity), the matrix X<sup>T</sup>X becomes "ill-conditioned": its determinant is close to zero.
- The Floating-Point Trap: Inverting an ill-conditioned matrix effectively divides by those tiny numbers, so rounding errors in the 16th decimal place explode into massive errors in your result.
The Solution?
- QR Decomposition: A more numerically stable way to solve the least-squares problem without explicit inversion.
- Gradient Descent: For massive datasets (Deep Learning), we don't solve it analytically at all. We iterate towards the solution step by step.
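To make the contrast concrete, here is a small NumPy sketch comparing the explicit normal-equation route with a QR-based solve. On this tiny, well-conditioned dataset the two agree; on ill-conditioned design matrices, the QR route is the one that stays accurate.

```python
import numpy as np

# Design matrix with an intercept column for the toy dataset
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Textbook route: explicit inversion of X^T X (avoid in production)
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Stable route: QR decomposition X = QR, then solve R @ beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(beta_normal)  # both ≈ [2.2, 0.6]
print(beta_qr)
```

In practice `np.linalg.lstsq`, which uses an SVD internally, is the usual one-liner for this.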
4. Interactive Visualizer: Drag the Points
See OLS in action. Drag the points below to change the dataset, and notice how the regression line (the red line) updates instantly to minimize the squared vertical distances (residuals, shown in gray) between the points and the line.
5. Implementation: The Trinity
While understanding the math is crucial, in practice, we use libraries. Here is how to implement Simple Linear Regression in Go, Java, and Python.
Go
Go doesn’t have a built-in Data Science stack like Python, but gonum is the standard for numerical computing.
```go
package main

import (
	"fmt"

	"gonum.org/v1/gonum/stat"
)

func main() {
	// 1. Data
	x := []float64{1, 2, 3, 4, 5}
	y := []float64{2, 4, 5, 4, 5}

	// 2. Fit OLS
	// alpha = intercept, beta = slope
	alpha, beta := stat.LinearRegression(x, y, nil, false)

	// 3. Output
	fmt.Printf("Intercept: %.2f\n", alpha)
	fmt.Printf("Slope: %.2f\n", beta)

	// 4. Prediction
	newX := 6.0
	predY := alpha + beta*newX
	fmt.Printf("Prediction for x=%.1f: %.2f\n", newX, predY)
}
```
Java
In Java, we can use Apache Commons Math for robust statistical analysis.
```java
import org.apache.commons.math3.stat.regression.SimpleRegression;

public class LinearRegressionExample {
    public static void main(String[] args) {
        // 1. Initialize
        SimpleRegression regression = new SimpleRegression();

        // 2. Add Data (x, y)
        regression.addData(1d, 2d);
        regression.addData(2d, 4d);
        regression.addData(3d, 5d);
        regression.addData(4d, 4d);
        regression.addData(5d, 5d);

        // 3. Output
        System.out.println("Intercept: " + regression.getIntercept());
        System.out.println("Slope: " + regression.getSlope());
        System.out.println("R-Square: " + regression.getRSquare());

        // 4. Prediction
        System.out.println("Prediction for x=6: " + regression.predict(6.0));
    }
}
```
Python
statsmodels is generally preferred for statistical analysis because it provides detailed summary tables.
```python
import numpy as np
import statsmodels.api as sm

# 1. Data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# 2. Add constant for intercept (Critical Step!)
X_with_const = sm.add_constant(X)

# 3. Fit OLS model
model = sm.OLS(y, X_with_const)
results = model.fit()

# 4. Print Summary
print(results.summary())
print(f"Prediction for x=6: {results.predict([1, 6])[0]}")
```
Interpreting the Output
- Coef (Coefficient):
  - const: The estimate for β<sub>0</sub> (Intercept).
  - x1: The estimate for β<sub>1</sub> (Slope).
- P>|t| (P-value):
  - Tests the null hypothesis that the coefficient is equal to zero.
  - If p < 0.05, we reject the null hypothesis.
- R-squared (R<sup>2</sup>):
  - The proportion of variance in y explained by x.
[!TIP] Correlation ≠ Causation A high R<sup>2</sup> only indicates a linear association. It does not prove that changes in x cause changes in y.
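R<sup>2</sup> is defined as 1 − SSE/SST, and computing it by hand makes the "proportion of variance explained" reading concrete. A quick NumPy check on the toy dataset (the fitted line ŷ = 2.2 + 0.6x comes from the OLS fits above):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Fitted values from the OLS line y_hat = 2.2 + 0.6x
y_hat = 2.2 + 0.6 * x

sse = np.sum((y - y_hat) ** 2)      # unexplained variation (2.4)
sst = np.sum((y - y.mean()) ** 2)   # total variation (6.0)
r_squared = 1 - sse / sst

print(f"R-squared: {r_squared:.2f}")  # 0.60
```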
6. Key Assumptions (L.I.H.N)
For the p-values and confidence intervals to be valid, OLS relies on four key assumptions. We verify these using Residual Analysis in the next chapter.
- Linearity: The relationship between x and the mean of y is linear.
- Independence: Observations are independent of each other (no autocorrelation).
- Homoscedasticity: The variance of the residuals is constant across all levels of x.
- Normality: The residuals are normally distributed.
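As a small preview of that residual analysis, here is a minimal numeric sketch of two quick checks on the toy dataset. The lag-1 autocorrelation is an illustrative stand-in for a formal test such as Durbin-Watson:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Residuals from the fitted OLS line y_hat = 2.2 + 0.6x
residuals = y - (2.2 + 0.6 * x)

# Zero-mean errors: OLS residuals should average to ~0 by construction
print(f"Mean residual: {residuals.mean():.2e}")

# Independence (rough check): lag-1 autocorrelation of the residuals
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"Lag-1 autocorrelation: {lag1:.2f}")
```

Plotting residuals against fitted values is the usual visual complement to these numeric checks.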