Simple Linear Regression
[!NOTE] Linear regression is the “Hello World” of machine learning, but don’t underestimate it. It forms the foundation for advanced techniques like Generalized Linear Models (GLMs) and Neural Networks.
Simple Linear Regression allows us to model the relationship between two continuous variables: a predictor (x) and a response (y). The goal is to find the “best fit” line that minimizes the error between the predicted and actual values.
1. The Genesis: Why Squared Errors?
Why do we minimize the Sum of Squared Errors (SSE) instead of the sum of absolute errors or cubic errors?
- The Geometric Intuition: Squared error is squared Euclidean distance, the quantity the Pythagorean theorem measures, so minimizing it has a clean geometric meaning. (Note that OLS minimizes the vertical distances from the points to the line, not the perpendicular ones.)
- The Probability Link (Gauss): Carl Friedrich Gauss showed that if the errors (ε) are normally distributed, minimizing the squared error is mathematically equivalent to Maximum Likelihood Estimation (MLE). It gives us the most probable coefficients given the data.
- The Calculus Convenience: The square function x<sup>2</sup> is smooth and convex. Its derivative is linear (2x), making it easy to find the minimum using calculus. The absolute value |x| has a sharp corner at 0, making differentiation messier.
2. The Mathematical Model
The population model for simple linear regression is defined as:
y<sub>i</sub> = β<sub>0</sub> + β<sub>1</sub>x<sub>i</sub> + ε<sub>i</sub>
Where:
- y<sub>i</sub>: The dependent variable (response) for the i-th observation.
- x<sub>i</sub>: The independent variable (predictor).
- β<sub>0</sub>: The intercept (the value of y when x = 0).
- β<sub>1</sub>: The slope (the change in y for a one-unit increase in x).
- ε<sub>i</sub>: The error term (residuals), assumed to be normally distributed N(0, σ<sup>2</sup>).
Ordinary Least Squares (OLS)
How do we find β<sub>0</sub> and β<sub>1</sub>? We use the method of Ordinary Least Squares (OLS), which minimizes the Sum of Squared Errors (SSE):

SSE = Σ<sub>i=1</sub><sup>n</sup> (y<sub>i</sub> − ŷ<sub>i</sub>)<sup>2</sup>

Where ŷ<sub>i</sub> is the predicted value.
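Setting the partial derivatives of the SSE to zero yields the textbook closed-form estimates β<sub>1</sub> = Σ(x<sub>i</sub> − x̄)(y<sub>i</sub> − ȳ) / Σ(x<sub>i</sub> − x̄)<sup>2</sup> and β<sub>0</sub> = ȳ − β<sub>1</sub>x̄. A minimal NumPy sketch, using the same toy dataset as the library examples below:

```python
import numpy as np

# Toy dataset (same one used in the library examples)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Closed-form OLS estimates, obtained by setting dSSE/dβ = 0
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(f"Intercept: {beta0:.2f}")  # 2.20
print(f"Slope:     {beta1:.2f}")  # 0.60
```

The library calls shown later should reproduce these same two numbers.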
3. Hardware Reality: The Cost of Inversion
In textbooks, you might see the “Normal Equation” to solve for coefficients directly:
β = (X<sup>T</sup>X)<sup>-1</sup>X<sup>T</sup>y
This looks elegant, but in Production Systems, we rarely use it directly. Why?
- Computational Complexity: Inverting a k × k matrix takes roughly O(k<sup>3</sup>) operations, where k is the number of features. For a dataset with 100,000 features, this is prohibitively expensive.
- Numerical Instability: Computers typically use IEEE 754 floating-point math. If your features are highly correlated (multicollinearity), the matrix X<sup>T</sup>X becomes "ill-conditioned": its determinant is close to zero.
- The Floating-Point Trap: Inverting an ill-conditioned matrix effectively divides by those tiny numbers, so rounding errors in the 16th decimal place explode into massive errors in your result.
The Solution?
- QR Decomposition: A more numerically stable way to solve the least-squares problem without explicit inversion.
- Gradient Descent: For massive datasets (Deep Learning), we don't solve it analytically at all. We iterate towards the solution step by step.
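To make the contrast concrete, here is a small NumPy sketch comparing the explicit normal-equation route with a QR-based solve. On this tiny, well-conditioned dataset the two agree; on ill-conditioned design matrices, the QR route is the one that stays accurate.

```python
import numpy as np

# Design matrix with an intercept column for the toy dataset
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Textbook route: explicit inversion of X^T X (avoid in production)
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Stable route: QR decomposition X = QR, then solve R @ beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(beta_normal)  # both ≈ [2.2, 0.6]
print(beta_qr)
```

In practice `np.linalg.lstsq`, which uses an SVD internally, is the usual one-liner for this.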
4. Interactive Visualizer: Drag the Points
See OLS in action. Drag the points below to change the dataset, and notice how the regression line (the red line) updates instantly to minimize the squared vertical distances (residuals, shown in gray) between the points and the line.
5. Implementation: The Trinity
While understanding the math is crucial, in practice, we use libraries. Here is how to implement Simple Linear Regression in Go, Java, and Python.
Go
Go doesn’t have a built-in Data Science stack like Python, but gonum is the standard for numerical computing.
```go
package main

import (
	"fmt"

	"gonum.org/v1/gonum/stat"
)

func main() {
	// 1. Data
	x := []float64{1, 2, 3, 4, 5}
	y := []float64{2, 4, 5, 4, 5}

	// 2. Fit OLS
	// alpha = intercept, beta = slope
	alpha, beta := stat.LinearRegression(x, y, nil, false)

	// 3. Output
	fmt.Printf("Intercept: %.2f\n", alpha)
	fmt.Printf("Slope: %.2f\n", beta)

	// 4. Prediction
	newX := 6.0
	predY := alpha + beta*newX
	fmt.Printf("Prediction for x=%.1f: %.2f\n", newX, predY)
}
```
Java
In Java, we can use Apache Commons Math for robust statistical analysis.
```java
import org.apache.commons.math3.stat.regression.SimpleRegression;

public class LinearRegressionExample {
    public static void main(String[] args) {
        // 1. Initialize
        SimpleRegression regression = new SimpleRegression();

        // 2. Add Data (x, y)
        regression.addData(1d, 2d);
        regression.addData(2d, 4d);
        regression.addData(3d, 5d);
        regression.addData(4d, 4d);
        regression.addData(5d, 5d);

        // 3. Output
        System.out.println("Intercept: " + regression.getIntercept());
        System.out.println("Slope: " + regression.getSlope());
        System.out.println("R-Square: " + regression.getRSquare());

        // 4. Prediction
        System.out.println("Prediction for x=6: " + regression.predict(6.0));
    }
}
```
Python
statsmodels is generally preferred for statistical analysis because it provides detailed summary tables.
```python
import numpy as np
import statsmodels.api as sm

# 1. Data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# 2. Add constant for intercept (Critical Step!)
X_with_const = sm.add_constant(X)

# 3. Fit OLS model
model = sm.OLS(y, X_with_const)
results = model.fit()

# 4. Print Summary
print(results.summary())
print(f"Prediction for x=6: {results.predict([1, 6])[0]}")
```
Interpreting the Output
- Coef (Coefficient):
  - const: The estimate for β<sub>0</sub> (Intercept).
  - x1: The estimate for β<sub>1</sub> (Slope).
- P>|t| (P-value):
  - Tests the null hypothesis that the coefficient is equal to zero.
  - If p < 0.05, we reject the null hypothesis.
- R-squared (R<sup>2</sup>):
  - The proportion of variance in y explained by x.
[!TIP] Correlation ≠ Causation A high R<sup>2</sup> only indicates a linear association. It does not prove that changes in x cause changes in y.
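R<sup>2</sup> is defined as 1 − SSE/SST, and computing it by hand makes the "proportion of variance explained" reading concrete. A quick NumPy check on the toy dataset (the fitted line ŷ = 2.2 + 0.6x comes from the OLS fits above):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Fitted values from the OLS line y_hat = 2.2 + 0.6x
y_hat = 2.2 + 0.6 * x

sse = np.sum((y - y_hat) ** 2)      # unexplained variation (2.4)
sst = np.sum((y - y.mean()) ** 2)   # total variation (6.0)
r_squared = 1 - sse / sst

print(f"R-squared: {r_squared:.2f}")  # 0.60
```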
6. Key Assumptions (L.I.H.N)
For the p-values and confidence intervals to be valid, OLS relies on four key assumptions. We verify these using Residual Analysis in the next chapter.
- Linearity: The relationship between x and the mean of y is linear.
- Independence: Observations are independent of each other (no autocorrelation).
- Homoscedasticity: The variance of the residuals is constant across all levels of x.
- Normality: The residuals are normally distributed.
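As a small preview of that residual analysis, here is a minimal numeric sketch of two quick checks on the toy dataset. The lag-1 autocorrelation is an illustrative stand-in for a formal test such as Durbin-Watson:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Residuals from the fitted OLS line y_hat = 2.2 + 0.6x
residuals = y - (2.2 + 0.6 * x)

# Zero-mean errors: OLS residuals should average to ~0 by construction
print(f"Mean residual: {residuals.mean():.2e}")

# Independence (rough check): lag-1 autocorrelation of the residuals
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"Lag-1 autocorrelation: {lag1:.2f}")
```

Plotting residuals against fitted values is the usual visual complement to these numeric checks.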