Cross-Validation

Training a model on all available data and evaluating it on that same data gives an overly optimistic performance estimate: a model that has simply memorized the training set (overfit) still scores well. The standard solution is to split the data into training and testing sets. However, a single split can be misleading if the test set happens to be “easy” or “hard”. Cross-Validation provides a more robust estimate of model performance.

1. Validation Strategies

1.1 Train/Test Split

The simplest method. Randomly split data into two parts (e.g., 80% Train, 20% Test).

  • Pros: Fast, simple.
  • Cons: High variance in evaluation metric depending on which points end up in the test set.
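A single hold-out split can be sketched with scikit-learn's train_test_split (using the same synthetic dataset as the full example later in this section):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary classification data: 1000 samples, 20 features
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (800, 20) (200, 20)
```

Rerunning this with a different random_state can noticeably change the test score, which is exactly the variance problem noted above.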

1.2 K-Fold Cross-Validation

Split the data into K equal-sized folds. Train on K-1 folds and test on the remaining fold. Repeat K times so each fold is used as the test set exactly once. Average the K scores.

  • Pros: More reliable estimate, uses all data for testing.
  • Cons: Computationally expensive (K times slower).
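The fold mechanics can be made concrete by printing the index sets KFold produces on a tiny dataset (10 samples, 5 folds, no shuffling, so folds are consecutive blocks):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
kf = KFold(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each sample lands in the test set of exactly one fold
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Across the 5 folds, every one of the 10 indices appears in a test set exactly once, which is why averaging the fold scores effectively evaluates on all of the data.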

1.3 Stratified K-Fold

Same as K-Fold, but ensures that each fold has the same proportion of class labels as the whole dataset.

  • Use Case: Mandatory for imbalanced classification tasks.
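The stratification guarantee can be checked directly: on a deliberately imbalanced dataset (the weights parameter of make_classification skews the classes to roughly 90/10), each fold's positive rate stays close to the overall rate:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    # Each fold's class balance mirrors the full dataset's
    print(f"Fold {fold}: positive rate = {y[test_idx].mean():.3f}")
```

With plain KFold, an unlucky fold could contain very few (or zero) minority-class samples, making its score meaningless.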

1.4 Leave-One-Out Cross-Validation (LOOCV)

K is equal to the number of samples (N). Train on N-1 samples, test on 1. Repeat N times.

  • Pros: Nearly unbiased estimate (each training set is as large as possible); deterministic, since there is no randomness in the split.
  • Cons: Extremely expensive for large datasets (N model fits), and the averaged estimate can have high variance.
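LOOCV is just K-Fold with K = N; scikit-learn exposes it as LeaveOneOut, usable anywhere a cv argument is accepted. A sketch on a deliberately small dataset (50 samples, so 50 model fits):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small dataset: LOOCV fits N = 50 models, each tested on one held-out sample
X, y = make_classification(n_samples=50, n_features=5, random_state=42)

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(), X, y, cv=loo)

print(len(scores))             # 50 — one 0/1 accuracy per held-out sample
print(f"{scores.mean():.3f}")  # fraction of samples classified correctly
```

Each individual score is 0 or 1 (one test sample), so only the mean is informative; this is also why the cost scales linearly with N.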

2. Interactive: K-Fold Visualizer

Click “Run K-Fold” to see how the dataset is split and iterated over.

  • Blue: Training Data
  • Orange: Validation (Test) Data

3. Python Implementation

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

# Generate Data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings

# 1. Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f"K-Fold Scores: {scores}")
print(f"Mean Accuracy: {scores.mean():.4f}")

# 2. Stratified K-Fold (Better for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

print(f"Stratified Scores: {strat_scores}")
print(f"Mean Stratified Accuracy: {strat_scores.mean():.4f}")

[!NOTE] Set shuffle=True in K-Fold when your data is ordered (e.g., sorted by date or by class), so each fold gets a representative mix of samples; fix random_state so the shuffled splits are reproducible.

[!CAUTION] For Time Series data, do NOT use standard K-Fold: its folds ignore time order, so the model ends up training on future data to predict the past (“data leakage”). Use TimeSeriesSplit instead, which only ever validates on data that comes after the training window.
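The time-ordering guarantee is easy to see by printing the splits TimeSeriesSplit generates, a sketch on 12 time-ordered samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(12, 1)  # 12 samples in chronological order
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training indices always precede test indices — no future leakage
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Note that the training window grows with each fold while the test window slides forward, mimicking how the model would actually be retrained and deployed over time.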