Linear Regression
> [!NOTE]
> This chapter covers Linear Regression, the simplest and most fundamental algorithm in Machine Learning. We will explore the intuition behind drawing a line through data, derive the mathematical loss function, and see how it works under the hood.
1. Intuition: Drawing a Line
Imagine you have a scatter plot showing house sizes versus their prices. You want to predict the price of a house you haven’t seen before based on its size. The simplest approach is to draw a straight line that “best fits” the data points.
This is Linear Regression. It assumes a linear relationship between the input variables (features) and the single output variable (target).
The Formula
The equation of a line is:
y = mx + b
In Machine Learning notation, this is often written as:
hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
- hθ(x): The hypothesis (our prediction).
- θ₀: The y-intercept (bias term).
- θ₁…θₙ: The weights (coefficients) for each feature.
- x₁…xₙ: The feature values.
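The hypothesis is just a dot product between the parameter vector and the feature vector, once a leading 1 is prepended so the bias term participates. A minimal sketch in NumPy (the values and the helper name `hypothesis` are illustrative, not from a standard API):

```python
import numpy as np

def hypothesis(theta, x):
    """Compute hθ(x) = θ0 + θ1·x1 + … + θn·xn as a single dot product."""
    # Prepend a 1 so theta[0] acts as the bias term θ0.
    return np.dot(theta, np.concatenate(([1.0], x)))

theta = np.array([2.0, 0.5])   # θ0 = 2 (intercept), θ1 = 0.5 (slope)
x = np.array([100.0])          # one feature, e.g. house size
print(hypothesis(theta, x))    # 2 + 0.5 * 100 = 52.0
```

This is also why prediction is so cheap at inference time: one dot product per example.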
2. Interactive Analysis: Finding the Best Fit
Adjust the slope and intercept below to see how the line fits the data points. Watch how the Mean Squared Error (MSE) changes as you move the line. The goal is to minimize the MSE.
3. The Rigor: Mean Squared Error (MSE)
How do we mathematically define the “best fit”? We need a metric that measures how far off our predictions are from the actual values. This is our Cost Function (or Loss Function).
For Linear Regression, the standard cost function is the Mean Squared Error (MSE).
J(θ) = (1 / 2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Where:
- J(θ) is the cost given the parameters θ.
- m is the number of training examples.
- hθ(x⁽ⁱ⁾) is the predicted value for the i-th example.
- y⁽ⁱ⁾ is the actual target value for the i-th example.
We square the errors so that negative and positive errors don’t cancel each other out, and to heavily penalize large outliers. The 1 / 2m term is a mathematical convenience that makes the derivative cleaner later on.
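The cost function translates almost line-for-line into NumPy. A minimal sketch (the helper name `mse_cost` is ours; `X` is assumed to already include the bias column of 1s):

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(θ) = (1 / 2m) · Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²."""
    m = len(y)
    predictions = X @ theta        # hθ(x⁽ⁱ⁾) for every example at once
    errors = predictions - y
    return (1 / (2 * m)) * np.sum(errors ** 2)

# Two points lying exactly on y = 2x: a perfect fit, so the cost is 0.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])        # first column is the bias term
y = np.array([2.0, 4.0])
print(mse_cost(np.array([0.0, 2.0]), X, y))  # 0.0
```

Nudging θ away from the perfect fit makes the cost strictly positive, which is exactly what the interactive slider in the previous section visualizes.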
4. Ordinary Least Squares (OLS)
In linear regression, if the dataset is small enough, we don’t even need to “search” for the best parameters iteratively. We can solve for the optimal weights algebraically using the Normal Equation.
θ = (XᵀX)⁻¹ Xᵀy
Where:
- X is the matrix of feature values (with a column of 1s added for the intercept term).
- y is the vector of target values.
- θ is the vector of optimal weights.
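The Normal Equation maps directly onto NumPy's linear-algebra routines. A sketch on toy data (using `np.linalg.solve` on the system (XᵀX)θ = Xᵀy, which is numerically safer than forming the inverse explicitly):

```python
import numpy as np

# Toy data generated from y = 1 + 2x, so the optimal θ should be [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Build X with a column of 1s for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Solve (XᵀX) θ = Xᵀy instead of computing (XᵀX)⁻¹ directly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1. 2.]
```

Because the data are noise-free here, the recovered θ matches the generating line exactly; with noisy data it would instead be the least-squares compromise.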
Hardware Reality
While mathematically elegant, computing the inverse of (XᵀX) has a time complexity of approximately O(n³), where n is the number of features. If you have hundreds of thousands of features, this becomes computationally infeasible, and you must use iterative methods like Gradient Descent instead.
5. Implementation: Scikit-Learn
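A minimal sketch of the same model with `sklearn.linear_model.LinearRegression`, which solves the least-squares problem internally (the toy data below are ours, chosen to mirror the house-price example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: price = 50 + 0.3 * size (made-up units).
sizes = np.array([[50.0], [80.0], [120.0], [200.0]])  # one feature per row
prices = 50.0 + 0.3 * sizes.ravel()

model = LinearRegression()   # fits the intercept θ0 by default
model.fit(sizes, prices)

print(model.intercept_)             # ≈ 50.0  (θ0)
print(model.coef_)                  # ≈ [0.3] (θ1)
print(model.predict([[100.0]]))     # ≈ [80.0]
```

Scikit-Learn expects a 2-D array of shape `(n_samples, n_features)` for the inputs, which is why each size is wrapped in its own row even though there is only one feature.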
6. Complexity Analysis
| Phase | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Training (OLS Normal Equation) | O(N × M² + M³) | O(N × M) | N=samples, M=features. Matrix inversion is O(M³). |
| Training (Gradient Descent) | O(K × N × M) | O(N × M) | K=iterations. Better for huge number of features. |
| Prediction | O(M) | O(1) | Extremely fast; just a dot product. |