Feature Scaling

Feature scaling is a critical preprocessing step in machine learning that normalizes the range of a dataset's independent variables (features). Without scaling, features with larger magnitudes dominate those with smaller magnitudes: they distort the distance computations at the heart of algorithms like K-Nearest Neighbors (KNN) and Support Vector Machines (SVM), and they significantly slow the convergence of gradient descent.

[!NOTE] This chapter covers the intuition, formulas, and implementation of feature scaling, focusing on how different scaling strategies influence the loss landscape and the convergence of gradient descent.

1. The Intuition: Why Scale?

Imagine an algorithm calculating the Euclidean distance between two data points. If one feature is measured in thousands (e.g., salary) and another in single digits (e.g., years of experience), the distance metric will be entirely dominated by the salary feature. The algorithm will act as if the years of experience do not even exist.
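A quick numeric sketch of this domination effect (the salary and experience figures are made up for illustration):

```python
import numpy as np

# Two employees: (salary, years of experience). Hypothetical numbers.
a = np.array([50000.0, 2.0])
b = np.array([52000.0, 20.0])

# Raw Euclidean distance: the 2000-unit salary gap dwarfs the
# 18-year experience gap.
raw_dist = np.linalg.norm(a - b)

# Fraction of the squared distance contributed by salary alone
salary_share = (a[0] - b[0]) ** 2 / raw_dist ** 2
print(f"distance = {raw_dist:.1f}, salary share = {salary_share:.4%}")
```

Salary accounts for more than 99.9% of the squared distance, so experience is effectively invisible to the algorithm.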

From an optimization perspective, unscaled features create an elongated, elliptical loss landscape. Gradient descent struggles in this terrain, bouncing back and forth across the steep ridges while crawling slowly along the flat valleys. Scaling the features makes the loss landscape more spherical, allowing gradient descent to head much more directly toward the minimum.
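This effect can be sketched on a toy quadratic bowl rather than a real model. Here the `scales` array plays the role of the per-feature curvature: unequal scales give an elongated ellipse, equal scales give a sphere. The step size is the best fixed value for each bowl, so the comparison isolates the effect of the landscape's shape.

```python
import numpy as np

def gd_steps(scales, tol=1e-6, max_iter=100_000):
    """Iterations for gradient descent to converge on
    f(w) = 0.5 * sum(scales * w**2), a quadratic whose contours
    are ellipses stretched according to `scales`."""
    lr = 2.0 / (scales.max() + scales.min())  # best fixed step for this bowl
    w = np.array([1.0, 1.0])
    for i in range(max_iter):
        grad = scales * w
        if np.linalg.norm(grad) < tol:
            return i
        w = w - lr * grad
    return max_iter

# Unscaled features: curvatures differ by 100x -> elongated, elliptical bowl
slow = gd_steps(np.array([1.0, 100.0]))
# Scaled features: equal curvature -> spherical bowl
fast = gd_steps(np.array([1.0, 1.0]))
print(f"elliptical: {slow} steps, spherical: {fast} steps")
```

On the elliptical bowl the stable step size is capped by the steepest direction, so progress along the flat direction takes hundreds of steps; on the spherical bowl the same rule converges almost immediately.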

2. Min-Max Normalization

Min-Max scaling linearly transforms features to lie between a given minimum and maximum value, typically between 0 and 1.

Mathematical Formula: x' = (x - x_min) / (x_max - x_min)

[!WARNING] Min-Max scaling is highly sensitive to outliers. A single extreme outlier will squash the majority of your normal data points into a tiny range, destroying their variance.
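The squashing effect is easy to demonstrate with made-up numbers, nine typical values plus one extreme outlier:

```python
import numpy as np

# Nine typical values plus one extreme outlier (hypothetical data)
x = np.array([30.0, 32, 35, 38, 40, 41, 45, 48, 50, 10000])
scaled = (x - x.min()) / (x.max() - x.min())

# The nine "normal" points are crushed into roughly 0.2% of the [0, 1] range
normal_spread = scaled[:9].max() - scaled[:9].min()
print(f"spread of the nine normal points after scaling: {normal_spread:.4f}")
```

The outlier claims almost the entire [0, 1] interval for itself, leaving the rest of the data nearly indistinguishable.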

Python Implementation

import numpy as np

def min_max_scale(X):
    X_min = np.min(X, axis=0)
    X_max = np.max(X, axis=0)
    # Prevent division by zero for constant columns
    denominator = X_max - X_min
    denominator[denominator == 0] = 1
    return (X - X_min) / denominator

# Example
data = np.array([[10, 100], [20, 200], [30, 300]])
scaled_data = min_max_scale(data)
print(scaled_data)

3. Standardization (Z-Score Scaling)

Standardization transforms the data to have a mean (μ) of 0 and a standard deviation (σ) of 1. Contrary to a common misconception, it does not require the data to follow a Gaussian (Normal) distribution, although the resulting z-scores are easiest to interpret when it does.

Mathematical Formula: z = (x - μ) / σ

Standardization is much less sensitive to outliers than Min-Max scaling, although severe outliers can still shift the mean and inflate the standard deviation.
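Both halves of that claim can be checked on the same kind of outlier-laden sample (hypothetical numbers): the z-scores of the typical points retain noticeably more of their spread than the min-max values do, yet the outlier-inflated σ still compresses them.

```python
import numpy as np

# Nine typical values plus one extreme outlier (hypothetical data)
x = np.array([12.0, 14, 15, 16, 18, 19, 20, 22, 25, 5000])

minmax = (x - x.min()) / (x.max() - x.min())
z = (x - x.mean()) / x.std()

# Peak-to-peak spread retained by the nine non-outlier points
print("min-max spread:", np.ptp(minmax[:9]))
print("z-score spread:", np.ptp(z[:9]))
```

The z-score spread is a few times larger than the min-max spread, but both are far below the spread the nine points would have on their own, which is exactly the residual sensitivity described above.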

Python Implementation

import numpy as np

def standardize(X):
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    # Prevent division by zero for constant columns
    sigma[sigma == 0] = 1
    return (X - mu) / sigma

# Example
data = np.array([[10, 100], [20, 200], [30, 300]])
standardized_data = standardize(data)
print(standardized_data)

4. Summary and Comparison

Here is a comparison of common scaling strategies.

| Strategy | When to Use | Outlier Sensitivity | Math Operation |
|----------|-------------|---------------------|----------------|
| Min-Max | Bounded output needed (e.g., image pixels, neural networks) | Extremely high | (x - min) / (max - min) |
| Z-Score | Gradient-descent algorithms (linear/logistic regression, SVMs) | Moderate | (x - μ) / σ |
| Robust | Datasets with significant, heavy-tailed outliers | Very low | (x - median) / (Q3 - Q1) |
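Robust scaling is the only strategy in the table without an implementation above. A minimal NumPy sketch, using the common definition of centering on the median and dividing by the interquartile range (the same convention scikit-learn's RobustScaler uses by default):

```python
import numpy as np

def robust_scale(X):
    """Robust scaling: center on the median, divide by the IQR (Q3 - Q1).
    Median and quartiles are barely moved by extreme values, which is
    what makes this scheme outlier-resistant."""
    median = np.median(X, axis=0)
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    # Prevent division by zero for constant columns
    iqr[iqr == 0] = 1
    return (X - median) / iqr

# Example with an extreme outlier in the second column (hypothetical data)
data = np.array([[10.0, 100], [20, 200], [30, 300], [40, 100000]])
scaled = robust_scale(data)
print(scaled)
```

Note that the outlier in the second column stretches that column's IQR but leaves the first column's scaling untouched, and the typical points stay within a narrow, interpretable range around zero.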