Feature Selection

Feature selection is the process of identifying and retaining the most relevant subset of variables from your dataset while discarding redundant or noisy features. High-dimensional data not only slows down training but also invites the Curse of Dimensionality: the volume of the feature space grows so fast that the available data becomes sparse, leading models to overfit on noise rather than generalize from the underlying signal.

[!NOTE] This chapter covers the three fundamental paradigms of feature selection: Filter Methods, Wrapper Methods, and Embedded Methods. We will explore the mathematical intuition and hardware efficiency of each approach.

1. The Dimensionality Penalty

From a hardware perspective, every additional feature implies larger matrix multiplications during training. For algorithms like Support Vector Machines, which compute pairwise distances in O(N² × D), compute time grows linearly with the feature count D on top of the quadratic cost in sample count N. Furthermore, caching efficiency drops as larger high-dimensional vectors evict each other from the L1/L2 cache, leading to slow RAM fetches.
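The scaling above can be made concrete with a back-of-envelope sketch. The helper `pairwise_flops` below is a hypothetical illustration (not a real library function): it counts rough multiply-add operations for computing all pairwise squared distances, showing that work grows linearly in D but quadratically in N.

```python
def pairwise_flops(n_samples: int, n_features: int) -> int:
    """Rough multiply-add count for all pairwise squared distances:
    N^2 pairs, each requiring D subtract/square/accumulate steps."""
    return n_samples ** 2 * n_features

base = pairwise_flops(1_000, 50)

# Doubling the feature count doubles the distance work...
print(pairwise_flops(1_000, 100) / base)  # 2.0

# ...while doubling the sample count quadruples it.
print(pairwise_flops(2_000, 50) / base)   # 4.0
```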

2. Embedded Selection via the Lasso (L1 Penalty)

L1 Regularization (Lasso) acts as an embedded feature selector. By applying a diamond-shaped constraint on the weights, it forces the coefficients of less important features exactly to zero; the larger the penalty parameter, the more coefficients are zeroed out.

3. Filter Methods

Filter methods evaluate the relevance of features independently of the predictive model. They apply a statistical measure to score the correlation or dependence between each feature and the target variable.

Examples:

  • Pearson Correlation: Measures linear relationship.
  • Chi-Square: Evaluates categorical vs categorical dependence.
  • Mutual Information: Measures non-linear dependency based on entropy.

Python Implementation (Variance Threshold)

A basic filter is removing features with zero or near-zero variance (constant features that provide no information).

from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Dataset: columns 0 and 2 vary; column 1 is constant (all 5s)
X = np.array([[0, 5, 1],
              [2, 5, 2],
              [0, 5, 3],
              [1, 5, 4]])

# Keep features with variance > 0.1
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)

print(X_selected)
# Column index 1 is removed
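Beyond variance, the statistical scores listed above can be applied with scikit-learn's `SelectKBest`. A minimal sketch on synthetic data, using `f_regression` (an F-test on the linear relationship, closely related to Pearson correlation) to keep the single most relevant feature:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # 4 candidate features
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)   # only feature 0 drives the target

# Score each feature against y independently, then keep the top scorer
selector = SelectKBest(score_func=f_regression, k=1)
X_top = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # feature 0 is retained
```

Swapping `f_regression` for `mutual_info_regression` scores non-linear dependencies instead, at a higher computational cost.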

4. Wrapper Methods

Wrapper methods treat the feature selection process as a search problem. They evaluate different combinations of features by actually training a model and assessing its performance on a hold-out set.

Examples:

  • Recursive Feature Elimination (RFE): Starts with all features, trains the model, drops the least important feature (e.g., smallest weight), and repeats.
  • Forward Selection: Starts with zero features and iteratively adds the one that improves the model the most.

[!WARNING] Wrapper methods are computationally expensive (O(2^D) in exhaustive search over D features) and highly prone to overfitting, as the feature subset is optimized explicitly for the specific model’s validation performance.
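Recursive Feature Elimination from the list above can be sketched with scikit-learn's `RFE`, wrapping a linear model on synthetic data (the data and coefficients here are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 5))
y = 4 * X[:, 0] + 2 * X[:, 3] + 0.1 * rng.normal(size=150)  # features 0 and 3 matter

# RFE: fit the model, drop the feature with the smallest weight,
# and repeat until only n_features_to_select remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # True for the retained features
print(rfe.ranking_)   # 1 = kept; higher numbers were eliminated earlier
```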

5. Embedded Methods

Embedded methods perform feature selection automatically during the model training process. The selection is built into the algorithm’s objective function.

Examples:

  • Lasso Regression (L1 Penalty): Adds an absolute value penalty term to the loss function, mathematically forcing irrelevant feature coefficients to exactly zero.
  • Tree-based Feature Importance: Algorithms like Random Forests measure how much a feature decreases the Gini Impurity or Entropy when used for splitting nodes.

Python Implementation (Lasso)

from sklearn.linear_model import Lasso
import numpy as np

# Synthetic data (seeded so the output is reproducible)
np.random.seed(0)
X = np.random.randn(100, 5) # 5 features
y = 2 * X[:, 0] + 0 * X[:, 1] + X[:, 2] + np.random.randn(100) # Only f0 and f2 matter

# Train Lasso
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

print("Feature Coefficients:", lasso.coef_)
# Coefficients for features 1, 3, and 4 are driven to (or very near) zero
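The tree-based importance mentioned above can be sketched the same way: a random forest's impurity-based `feature_importances_` show how much each feature reduces variance (or Gini impurity, for classification) across all splits. The synthetic data here is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5 * X[:, 1] + 0.1 * rng.normal(size=300)  # only feature 1 is informative

# Impurity-based importances: mean decrease in variance per feature,
# averaged over all trees in the forest
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

print(forest.feature_importances_)  # feature 1 dominates
```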

6. Summary Comparison

| Method   | Approach                                       | Speed     | Risk of Overfitting |
|----------|------------------------------------------------|-----------|---------------------|
| Filter   | Statistical metrics (Correlation, Chi-Square)  | Very fast | Low                 |
| Wrapper  | Search algorithms (RFE, Forward/Backward)      | Very slow | High                |
| Embedded | Built into training (L1 penalty, tree splits)  | Fast      | Moderate            |