Cross-Validation
Training a model on all available data and evaluating it on that same data produces an overly optimistic performance estimate: a model that has simply memorized (overfit) the training set will still score well. The standard solution is to split the data into training and testing sets. However, a single split can be misleading if the test set happens to be “easy” or “hard”. Cross-validation provides a more robust estimate of model performance.
1. Validation Strategies
1.1 Train/Test Split
The simplest method. Randomly split data into two parts (e.g., 80% Train, 20% Test).
- Pros: Fast, simple.
- Cons: High variance in evaluation metric depending on which points end up in the test set.
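A minimal sketch of this split using scikit-learn's `train_test_split` (the data here is a small synthetic array for illustration):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Toy dataset: 50 samples, 2 features
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 80% train / 20% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 40 10
```

Re-running with a different `random_state` will land different points in the test set, which is exactly the variance this method suffers from.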
1.2 K-Fold Cross-Validation
Split the data into K equal-sized folds. Train on K-1 folds and test on the remaining fold. Repeat K times so each fold is used as the test set exactly once. Average the K scores.
- Pros: More reliable estimate, uses all data for testing.
- Cons: Computationally expensive (K times slower).
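To make the mechanics concrete, here is a small sketch that prints the train/test indices for each fold of a 10-sample dataset (the data is synthetic; real code would fit a model inside the loop):

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each of the 5 folds serves as the test set exactly once
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Note that every sample appears in exactly one test fold, which is why averaging the K scores uses all of the data for evaluation.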
1.3 Stratified K-Fold
Same as K-Fold, but ensures that each fold has the same proportion of class labels as the whole dataset.
- Use Case: Mandatory for imbalanced classification tasks.
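A quick way to see the difference is to check the class counts inside each test fold. In this sketch with a synthetic 90/10 imbalanced label vector, stratification keeps the 9:1 ratio in every fold:

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each 20-sample test fold preserves the ratio: 18 zeros, 2 ones
    print(np.bincount(y[test_idx]))  # [18  2]
```

With plain `KFold` on the same data, an unlucky fold could contain zero minority-class samples, making the fold's score meaningless.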
1.4 Leave-One-Out Cross-Validation (LOOCV)
K is equal to the number of samples (N). Train on N-1 samples, test on 1. Repeat N times.
- Pros: Nearly unbiased estimate, since each training set is almost the full dataset.
- Cons: Extremely expensive for large datasets (N model fits), and the averaged estimate can have high variance.
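Scikit-learn exposes this directly as `LeaveOneOut`, which behaves like `KFold(n_splits=N)`. A small sketch on a 50-sample synthetic dataset:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=50, n_features=5, random_state=0)

loo = LeaveOneOut()  # N folds, one held-out sample per fold
scores = cross_val_score(LogisticRegression(), X, y, cv=loo)

# One score per sample; each is 0 or 1 (the single test point is
# either classified correctly or not), so the mean is the accuracy
print(len(scores), scores.mean())
```

Even for this tiny dataset the model is fit 50 times, which illustrates why LOOCV is rarely used beyond a few thousand samples.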
2. Interactive: K-Fold Visualizer
Click “Run K-Fold” to see how the dataset is split and iterated over.
- Blue: Training Data
- Orange: Validation (Test) Data
3. Python Implementation
```python
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression()

# 1. Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"K-Fold Scores: {scores}")
print(f"Mean Accuracy: {scores.mean():.4f}")

# 2. Stratified K-Fold (better for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified Scores: {strat_scores}")
print(f"Mean Stratified Accuracy: {strat_scores.mean():.4f}")
```
[!NOTE] Always set `shuffle=True` and a `random_state` in K-Fold to ensure reproducibility, especially if your data is ordered (e.g., sorted by date).
[!CAUTION] For time-series data, do NOT use standard K-Fold: shuffling the data causes “data leakage” (training on future data to predict the past). Use `TimeSeriesSplit` instead.
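A minimal sketch of `TimeSeriesSplit` on six time-ordered samples shows how it avoids leakage: every training index precedes every test index.

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.arange(12).reshape(6, 2)  # 6 samples in chronological order

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # The training window grows forward in time; the test fold
    # always lies strictly after it, so the past predicts the future
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

Unlike K-Fold, the folds are not the same size: each successive split trains on a longer history.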