Regularization: Ridge & Lasso

[!NOTE] When your model is “too good” on training data but fails in production, you have an overfitting problem. Regularization is the cure.

In OLS, we minimize the Sum of Squared Errors (SSE). But if we have too many features (predictors) relative to data points, OLS tries too hard to fit the noise, leading to massive coefficients and high variance.

Regularization solves this by adding a penalty term to the loss function to discourage complex models (large coefficients).

1. The Intuition: Bias-Variance Tradeoff

Why would we intentionally make our model fit the training data worse (increase bias)?

  • Variance (Complexity): How much the model changes when trained on different data. A complex model (a wiggly line) has high variance; it chases every outlier.
  • Bias (Simplicity): How far the model’s assumptions are from reality. A simple model (a straight line) has high bias if the truth is a curve.
  • The Tradeoff:
      • OLS: Unbiased, but potentially high variance (overfitting).
      • Regularization: Adds a little bias, but drastically reduces variance (trading a bit of underfitting risk for much less overfitting).
  • Goal: Minimize Total Error = Bias² + Variance + Noise.

Regularization helps us find the “Sweet Spot” where the model generalizes best to new, unseen data.
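The tradeoff above can be seen concretely. The sketch below (an illustration using scikit-learn on a synthetic noisy sine dataset, which is an assumption, not data from this module) fits polynomials of increasing degree: the straight line underfits (high bias), while the degree-15 model drives training error down at the cost of chasing the noise (high variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a sine curve plus noise (assumed for illustration)
rng = np.random.default_rng(42)
X_train = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(0, 0.3, 30)
X_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

# Degree 1 = high bias; degree 15 = high variance
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error always falls as the degree grows; it is the gap between train and test error that exposes overfitting.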

2. Ridge Regression (L2 Penalty)

Ridge regression adds the sum of squared coefficients to the loss function:

Loss = SSE + λ · Σ βⱼ²   (sum over j = 1, …, p)

  • λ (Lambda): The tuning parameter (hyperparameter).
  • If λ = 0, we get OLS.
  • If λ → ∞, all coefficients β → 0.
  • Effect: It shrinks coefficients towards zero but rarely to zero.
  • Use Case: When you have many correlated features (Multicollinearity).
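Unlike Lasso, Ridge has a closed-form solution: β̂ = (XᵀX + λI)⁻¹ Xᵀy, and the λI term is exactly what stabilizes a near-singular XᵀX under multicollinearity. A minimal NumPy sketch on synthetic data (the data and function name are assumptions for illustration):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y. Assumes X is scaled and has no intercept column."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data with two nearly collinear columns (multicollinearity)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 1] + rng.normal(scale=0.01, size=50)
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=50)

# Larger lambda -> smaller coefficient norm, but never exactly zero
for lam in (0.0, 1.0, 100.0):
    print(f"lambda={lam:6.1f}  beta={np.round(ridge_closed_form(X, y, lam), 3)}")
```

At λ = 0 the collinear columns make the solution unstable; even a small λ tames it.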

3. Lasso Regression (L1 Penalty)

Lasso (Least Absolute Shrinkage and Selection Operator) adds the sum of absolute coefficients:

Loss = SSE + λ · Σ |βⱼ|   (sum over j = 1, …, p)

  • Effect: It can shrink coefficients exactly to zero.
  • Use Case: Feature Selection. It automatically discards useless features.
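The reason Lasso can hit exactly zero is the soft-thresholding operator, which is the one-dimensional Lasso solution used inside coordinate-descent solvers. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def soft_threshold(z, lam):
    """1-D Lasso update: shrink z toward 0 by lam, snapping to exactly 0 when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

coeffs = np.array([3.0, -1.2, 0.5, -0.05])
print(soft_threshold(coeffs, 0.1))  # the -0.05 entry is zeroed out
```

Because the absolute-value penalty has a kink at zero, small coefficients get snapped to exactly 0.0 rather than merely shrunk, which is what Ridge's smooth squared penalty can never do.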

4. Hardware Reality: The Magic of Sparsity

Why do engineers love Lasso? It’s not just about accuracy; it’s about Memory and Speed.

Imagine you are building a text classifier with a vocabulary of 1,000,000 words (features).

  • Ridge: All 1,000,000 coefficients will be non-zero (e.g., 0.00001). You must store a dense array of 1M floats (4MB).
  • Lasso: Most words are irrelevant. Lasso might set 999,000 coefficients to exactly 0.0. You only have 1,000 non-zero values.

This allows us to use Sparse Matrix formats (like Compressed Sparse Row - CSR).

  • Storage: Instead of storing 1M numbers, we store indices and values for just the 1k active features.
  • Compute: Dot products skip the zeros. 1000x faster inference.
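The savings can be sketched with SciPy's CSR format (the 1M/1k numbers mirror the example above; exact byte counts depend on dtypes):

```python
import numpy as np
from scipy.sparse import csr_matrix

n_features = 1_000_000
coef = np.zeros(n_features)  # dense coefficient vector (float64)
active = np.random.default_rng(0).choice(n_features, size=1_000, replace=False)
coef[active] = 1.0  # pretend only 1k features survived the Lasso

# CSR stores only the non-zero values plus their indices
sparse_coef = csr_matrix(coef)
sparse_bytes = (sparse_coef.data.nbytes
                + sparse_coef.indices.nbytes
                + sparse_coef.indptr.nbytes)
print(f"dense:  {coef.nbytes:,} bytes")   # 8 MB at float64
print(f"sparse: {sparse_bytes:,} bytes")  # roughly 12 KB
```

The dot product at inference time likewise touches only the stored indices, which is where the speedup comes from.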


5. Interactive: The Penalty Slider

See the difference between Ridge and Lasso. We have a model with 5 features. As you increase the regularization strength (λ), watch how the coefficients shrink.

[Interactive widget: a λ slider (low → high) with a bar chart of the coefficients x1–x5]

[!TIP] Notice how in Lasso, the small coefficients (x4, x5) hit exactly 0.0 quickly. In Ridge, they get very small but stay non-zero.


6. Implementation: The Penalty Term

Most standard libraries handle the optimization for you. Here is how to apply Ridge and Lasso in Go, Java, and Python.

Go

Go’s standard library has no regularized-regression solver, so when driving a raw optimizer we may need to implement the cost function ourselves. Here is how to calculate the Ridge cost:

package main

import "fmt"

// RidgeCost returns the Ridge loss: SSE plus lambda times the sum of squared coefficients.
func RidgeCost(sse float64, coefficients []float64, lambda float64) float64 {
	penalty := 0.0
	for _, beta := range coefficients {
		penalty += beta * beta
	}
	return sse + lambda*penalty
}

func main() {
	sse := 500.0 // Assume calculated SSE
	coeffs := []float64{0.5, -1.2, 3.0}
	lambda := 0.1

	cost := RidgeCost(sse, coeffs, lambda)
	fmt.Printf("Ridge Cost: %.4f\n", cost)
}

Java

We can calculate the penalty term directly.

public class RegularizationExample {
    public static double lassoPenalty(double[] coefficients, double lambda) {
        double penalty = 0;
        for (double beta : coefficients) {
            penalty += Math.abs(beta);
        }
        return lambda * penalty;
    }

    public static void main(String[] args) {
        double[] coeffs = {0.5, -1.2, 3.0};
        double lambda = 0.1;

        System.out.println("Lasso Penalty: " + lassoPenalty(coeffs, lambda));
    }
}

Python (Scikit-Learn)

We use scikit-learn for regularization because it handles the optimization efficiently.

[!WARNING] Scale Your Data! Regularization penalizes large coefficients. If one feature is on a scale of 0-1 and another is 0-1,000,000, the second one will have a tiny coefficient naturally. To penalize them fairly, you must scale features (standardize) before regularizing.

from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# 1. Prepare Data (synthetic here; substitute your own X, y)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])  # only 3 useful features
y = X @ true_coef + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Scale Features (CRITICAL STEP)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Ridge Regression (L2)
ridge_model = Ridge(alpha=1.0) # Alpha is Lambda in sklearn
ridge_model.fit(X_train_scaled, y_train)
print(f"Ridge Coeffs: {ridge_model.coef_}")

# 4. Lasso Regression (L1)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)
print(f"Lasso Coeffs: {lasso_model.coef_}")
# You might see zeros here!

# 5. Evaluate
y_pred = lasso_model.predict(X_test_scaled)
print(f"MSE: {mean_squared_error(y_test, y_pred)}")

7. Summary

| Feature | Ridge (L2) | Lasso (L1) |
| --- | --- | --- |
| Penalty | Squared (β²) | Absolute (\|β\|) |
| Feature Selection | No (keeps all features) | Yes (sets some to 0) |
| Multicollinearity | Handles well (distributes weight) | Picks one, discards others |
| Hardware Benefit | None specific | Sparsity (saves RAM/CPU) |

Next: Review & Cheat Sheet