Evaluation Metrics
Choosing the right metric is as important as choosing the right model. Accuracy alone is often misleading, especially on imbalanced datasets (e.g., fraud detection, where 99.9% of transactions are legitimate). This chapter covers the essential metrics for classification and regression.
1. Classification Metrics
1.1 The Confusion Matrix
A confusion matrix is a table that summarizes a classification model's performance by comparing its predictions against the true labels. For binary classification it has four cells:
- True Positive (TP): Correctly predicted positive.
- True Negative (TN): Correctly predicted negative.
- False Positive (FP): Incorrectly predicted positive (Type I Error).
- False Negative (FN): Incorrectly predicted negative (Type II Error).
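These four counts can be tallied directly. A minimal sketch, using small hand-made label vectors (not from any real model):

```python
# Hypothetical true labels and predictions for a binary classifier.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# Tally each cell of the confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correct positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I errors
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II errors

print(tp, tn, fp, fn)  # 3 2 0 1
```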
1.2 Precision, Recall, and F1
- Precision: Accuracy of positive predictions. High precision means low false positives.
  - Formula: Precision = TP / (TP + FP)
- Recall (Sensitivity): Coverage of actual positives. High recall means low false negatives.
  - Formula: Recall = TP / (TP + FN)
- F1 Score: Harmonic mean of Precision and Recall. Best for imbalanced datasets.
  - Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
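The three formulas translate directly into code. A sketch using hypothetical counts (TP=3, FP=1, FN=1) rather than output from a real model:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)          # Precision = TP / (TP + FP)
    recall = tp / (tp + fn)             # Recall    = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=3, fp=1, fn=1)
print(p, r, f)  # 0.75 0.75 0.75
```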
1.3 The Decision Threshold
Probabilistic classifiers assign a score to each item and apply a decision threshold to turn scores into labels. Moving that threshold changes the Confusion Matrix, Precision, and Recall.
- Low Threshold: Classifies more items as Positive. Increases Recall, decreases Precision (more FP).
- High Threshold: Classifies fewer items as Positive. Increases Precision, decreases Recall (more FN).
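The trade-off described above can be sketched in a few lines, using made-up scores and labels:

```python
# Hypothetical scores and true labels (illustrative only).
y_true = [0, 1, 1, 0, 1, 1]
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8, 0.9]

def metrics_at(threshold):
    """Return (precision, recall) after thresholding the scores."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(metrics_at(0.5))   # high threshold: (1.0, 0.75) -- misses one positive
print(metrics_at(0.15))  # low threshold:  (0.8, 1.0)  -- picks up a false positive
```

Lowering the threshold from 0.5 to 0.15 catches the positive scored at 0.4 (recall rises to 1.0), but also flags the negative scored at 0.2 (precision falls to 0.8).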
1.4 ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at various threshold settings.
- AUC (Area Under the Curve): Represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.
- AUC = 0.5: Random guessing.
- AUC = 1.0: Perfect classifier.
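The probabilistic interpretation of AUC suggests a simple (if O(P×N)) way to compute it: compare every positive/negative score pair, counting ties as half. A sketch with made-up scores:

```python
def auc_by_ranking(y_true, y_prob):
    """AUC as the fraction of positive/negative pairs where the
    positive is scored higher; ties count as half a win."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Every positive outscores every negative, so AUC is perfect.
print(auc_by_ranking([0, 1, 1, 0, 1, 1], [0.1, 0.9, 0.4, 0.2, 0.8, 0.9]))  # 1.0
```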
2. Regression Metrics
For regression problems (predicting continuous values), we use different metrics.
2.1 Mean Squared Error (MSE)
Measures the average of the squares of the errors. Penalizes large errors heavily.
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
2.2 Mean Absolute Error (MAE)
Measures the average of the absolute errors. Less sensitive to outliers than MSE.
MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|
2.3 R-squared (R²)
Represents the proportion of variance in the dependent variable that is predictable from the independent variables.
- R² = 1: Perfect fit.
- R² = 0: Model performs no better than always predicting the mean.
- R² < 0: Possible; the model is worse than predicting the mean.
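Each of the three regression metrics above is a few lines of code. A sketch from the formulas, using illustrative values:

```python
def mse(y, y_hat):
    # Mean of squared errors: (1/n) * sum((y_i - yhat_i)^2)
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    # Mean of absolute errors: (1/n) * sum(|y_i - yhat_i|)
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def r2(y, y_hat):
    # 1 - (residual sum of squares / total sum of squares)
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.1, 7.8]
print(mse(y, y_hat), mae(y, y_hat), round(r2(y, y_hat), 4))  # 0.2875 0.475 0.9606
```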
3. Python Implementation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Classification Example
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8, 0.9]
print("Classification Metrics:")
print(f"Accuracy: {accuracy_score(y_true, y_pred)}")
print(f"Precision: {precision_score(y_true, y_pred)}")
print(f"Recall: {recall_score(y_true, y_pred)}")
print(f"F1 Score: {f1_score(y_true, y_pred)}")
print(f"ROC AUC: {roc_auc_score(y_true, y_prob)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_true, y_pred)}")
# Regression Example
y_reg_true = [3.0, -0.5, 2.0, 7.0]
y_reg_pred = [2.5, 0.0, 2.1, 7.8]
print("\nRegression Metrics:")
print(f"MSE: {mean_squared_error(y_reg_true, y_reg_pred)}")
print(f"MAE: {mean_absolute_error(y_reg_true, y_reg_pred)}")
print(f"R2 Score: {r2_score(y_reg_true, y_reg_pred)}")
[!WARNING] Accuracy Paradox: In a dataset with 99% Negative samples, a model that simply predicts “Negative” for everything will have 99% accuracy but 0% Recall. Always check Precision/Recall for imbalanced data.
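A quick numerical check of the warning, with a made-up dataset of 1,000 samples:

```python
# Imbalanced dataset: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a degenerate model that always predicts Negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.99 0.0
```

The model looks excellent by accuracy yet never catches a single positive.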