Linear Regression
> [!NOTE]
> This chapter covers Linear Regression, the simplest and most fundamental algorithm in Machine Learning. We will explore the intuition behind drawing a line through data, derive the mathematical loss function, and see how it works under the hood.
1. Intuition: Drawing a Line
Imagine you have a scatter plot showing house sizes versus their prices. You want to predict the price of a house you haven’t seen before based on its size. The simplest approach is to draw a straight line that “best fits” the data points.
This is Linear Regression. It assumes a linear relationship between the input variables (features) and the single output variable (target).
The Formula
The equation of a line is:
y = mx + b
In Machine Learning notation, this is often written as:
hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
- hθ(x): The hypothesis (our prediction).
- θ₀: The y-intercept (bias term).
- θ₁…θₙ: The weights (coefficients) for each feature.
- x₁…xₙ: The feature values.
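The hypothesis is just a dot product between the parameter vector and the feature vector, once a leading 1 is prepended so the bias term participates. A minimal sketch in NumPy (the values and the helper name `hypothesis` are illustrative, not from a standard API):

```python
import numpy as np

def hypothesis(theta, x):
    """Compute hθ(x) = θ0 + θ1·x1 + … + θn·xn as a single dot product."""
    # Prepend a 1 so theta[0] acts as the bias term θ0.
    return np.dot(theta, np.concatenate(([1.0], x)))

theta = np.array([2.0, 0.5])   # θ0 = 2 (intercept), θ1 = 0.5 (slope)
x = np.array([100.0])          # one feature, e.g. house size
print(hypothesis(theta, x))    # 2 + 0.5 * 100 = 52.0
```

This is also why prediction is so cheap at inference time: one dot product per example.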
2. Interactive Analysis: Finding the Best Fit
Adjust the slope and intercept below to see how the line fits the data points. Watch how the Mean Squared Error (MSE) changes as you move the line. The goal is to minimize the MSE.
3. The Rigor: Mean Squared Error (MSE)
How do we mathematically define the “best fit”? We need a metric that measures how far off our predictions are from the actual values. This is our Cost Function (or Loss Function).
For Linear Regression, the standard cost function is the Mean Squared Error (MSE).
J(θ) = (1 / 2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Where:
- J(θ) is the cost given the parameters θ.
- m is the number of training examples.
- hθ(x⁽ⁱ⁾) is the predicted value for the i-th example.
- y⁽ⁱ⁾ is the actual target value for the i-th example.
We square the errors so that negative and positive errors don’t cancel each other out, and to heavily penalize large outliers. The 1 / 2m term is a mathematical convenience that makes the derivative cleaner later on.
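The cost function translates almost line-for-line into NumPy. A minimal sketch (the helper name `mse_cost` is ours; `X` is assumed to already include the bias column of 1s):

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(θ) = (1 / 2m) · Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²."""
    m = len(y)
    predictions = X @ theta        # hθ(x⁽ⁱ⁾) for every example at once
    errors = predictions - y
    return (1 / (2 * m)) * np.sum(errors ** 2)

# Two points lying exactly on y = 2x: a perfect fit, so the cost is 0.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])        # first column is the bias term
y = np.array([2.0, 4.0])
print(mse_cost(np.array([0.0, 2.0]), X, y))  # 0.0
```

Nudging θ away from the perfect fit makes the cost strictly positive, which is exactly what the interactive slider in the previous section visualizes.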
4. Ordinary Least Squares (OLS)
In linear regression, if the dataset is small enough, we don’t even need to “search” for the best parameters iteratively. We can solve for the optimal weights algebraically using the Normal Equation.
θ = (XᵀX)⁻¹ Xᵀy
Where:
- X is the matrix of feature values (with a column of 1s added for the intercept term).
- y is the vector of target values.
- θ is the vector of optimal weights.
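The Normal Equation maps directly onto NumPy's linear-algebra routines. A sketch on toy data (using `np.linalg.solve` on the system (XᵀX)θ = Xᵀy, which is numerically safer than forming the inverse explicitly):

```python
import numpy as np

# Toy data generated from y = 1 + 2x, so the optimal θ should be [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Build X with a column of 1s for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Solve (XᵀX) θ = Xᵀy instead of computing (XᵀX)⁻¹ directly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1. 2.]
```

Because the data are noise-free here, the recovered θ matches the generating line exactly; with noisy data it would instead be the least-squares compromise.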
Hardware Reality
While mathematically elegant, computing the inverse of (XᵀX) has a time complexity of approximately O(n³), where n is the number of features. If you have hundreds of thousands of features, this becomes computationally infeasible, and you must use iterative methods like Gradient Descent instead.
5. Implementation: Scikit-Learn
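A minimal sketch of the same model with `sklearn.linear_model.LinearRegression`, which solves the least-squares problem internally (the toy data below are ours, chosen to mirror the house-price example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: price = 50 + 0.3 * size (made-up units).
sizes = np.array([[50.0], [80.0], [120.0], [200.0]])  # one feature per row
prices = 50.0 + 0.3 * sizes.ravel()

model = LinearRegression()   # fits the intercept θ0 by default
model.fit(sizes, prices)

print(model.intercept_)             # ≈ 50.0  (θ0)
print(model.coef_)                  # ≈ [0.3] (θ1)
print(model.predict([[100.0]]))     # ≈ [80.0]
```

Scikit-Learn expects a 2-D array of shape `(n_samples, n_features)` for the inputs, which is why each size is wrapped in its own row even though there is only one feature.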
6. Complexity Analysis
| Phase | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Training (OLS Normal Equation) | O(N × M² + M³) | O(N × M) | N=samples, M=features. Matrix inversion is O(M³). |
| Training (Gradient Descent) | O(K × N × M) | O(N × M) | K=iterations. Better for huge number of features. |
| Prediction | O(M) | O(1) | Extremely fast; just a dot product. |