Review & Cheat Sheet

[!NOTE] This module reviews the core ideas of descriptive statistics: visualization, central tendency, spread, and how databases apply histograms in query planning.

1. Key Takeaways

Visualize First: Always plot your data before calculating summary statistics (remember Anscombe’s Quartet: four datasets with nearly identical summary statistics but very different plots).
Central Tendency:
- Mean: Best for symmetric data. Sensitive to outliers.
- Median (P50): Best for skewed data (Income, Latency). Robust to outliers.
- P99: Critical for distributed systems (Tail Latency).
Spread:
- Variance/Std Dev: Variance is the average squared deviation from the mean. Standard Deviation (σ) is its square root, so it is in the original units.
- Bessel’s Correction: Divide by N-1 instead of N when computing sample variance, which removes the downward bias.
- IQR: Spread of middle 50%. Robust.
Hardware Reality: Databases use Histograms to optimize query plans (Index Scan vs Seq Scan).

2. Interactive Flashcards

Test your knowledge! Each question is followed by its answer.

Which measure is best for analyzing House Prices?

The Median. Because house prices are usually skewed right with high outliers (mansions), the mean would be misleadingly high.
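The effect is easy to see on a small example. The sketch below uses made-up prices (illustrative assumptions, not real market data) with a single mansion at the high end:

```python
import numpy as np

# Hypothetical house prices in thousands; the last entry is a mansion.
prices = np.array([250, 270, 300, 310, 330, 350, 5000])

mean_price = prices.mean()        # pulled far above the typical home
median_price = np.median(prices)  # stays at the middle of the distribution

print(f"mean   = {mean_price:.1f}k")
print(f"median = {median_price:.1f}k")
```

One extreme value is enough to push the mean to roughly three times the median, while the median still describes a typical house.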

Why do we divide by N-1 for Sample Variance?

Bessel's Correction. The sample mean is "closer" to the sample data than the true mean, causing underestimation of spread. N-1 corrects this bias.
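A quick simulation can make the bias visible. This is a minimal sketch, assuming a normal population with a known variance of 4 and very small samples; the seed, sample size, and trial count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_variance = 4.0  # population: normal with sigma = 2
n = 5                # small samples make the bias easy to see
trials = 100_000

samples = rng.normal(loc=0.0, scale=2.0, size=(trials, n))

# Average each estimator over many independent samples.
biased = samples.var(axis=1, ddof=0).mean()    # divides by N
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by N - 1

print(f"true variance:   {true_variance:.3f}")
print(f"divide by N:     {biased:.3f}")
print(f"divide by N - 1: {unbiased:.3f}")
```

Dividing by N should land close to (N-1)/N times the true variance (about 3.2 here), while dividing by N-1 should land close to 4.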

What does P99 represent in System Design?

Tail Latency. The response time that 99% of requests complete within; the slowest 1% of requests take at least this long. Critical for optimizing user experience.
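To see why the tail matters, here is a sketch on synthetic latencies. The distribution (a fast path plus a 2% slow path) is an assumption chosen for illustration, not measured data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic latencies in ms: most requests are fast, 2% hit a slow path.
fast = rng.normal(loc=50, scale=5, size=9_800)
slow = rng.normal(loc=800, scale=100, size=200)
latencies_ms = np.concatenate([fast, slow])

p50, p99 = np.percentile(latencies_ms, [50, 99])
mean = latencies_ms.mean()

print(f"mean = {mean:.1f} ms, P50 = {p50:.1f} ms, P99 = {p99:.1f} ms")
```

The median stays near the fast path and the mean moves only slightly, so neither reveals the slow path. P99 lands inside the slow cluster and exposes it.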

How do Databases use Histograms?

To estimate Selectivity. The optimizer checks how skewed the values are to decide between an Index Scan (for rare values) and a Seq Scan (for common values).
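The idea can be sketched with a toy planner. Everything here is hypothetical: `build_histogram`, `choose_scan`, and the 5% threshold are illustrative names and values, and real optimizers use cost models rather than a single cutoff. The histogram is a simple per-value frequency table:

```python
from collections import Counter

def build_histogram(column_values):
    """Map each distinct value to the fraction of rows that hold it."""
    total = len(column_values)
    return {value: count / total for value, count in Counter(column_values).items()}

def choose_scan(histogram, value, threshold=0.05):
    """Toy rule: few matching rows favor the index, many favor a full scan."""
    selectivity = histogram.get(value, 0.0)
    plan = "Index Scan" if selectivity < threshold else "Seq Scan"
    return selectivity, plan

# Skewed column: 'active' is common, 'banned' is rare.
status = ["active"] * 9_500 + ["inactive"] * 490 + ["banned"] * 10
histogram = build_histogram(status)

for value in ("active", "banned"):
    selectivity, plan = choose_scan(histogram, value)
    print(f"status = {value!r}: selectivity {selectivity:.3f} -> {plan}")
```

The same query shape gets a different plan depending on the value: fetching 95% of the table through an index would cost more than reading it sequentially, while fetching 0.1% would not.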

3. Cheat Sheet

Formulas

Concept	Formula
Mean	μ = (1/N) Σ x_i
Sample Variance	s² = Σ(x_i - x̄)² / (N - 1)
Sample Std Dev	s = √(s²)
IQR	Q3 (75th percentile) - Q1 (25th percentile)
Outlier (Low)	< Q1 - 1.5 × IQR
Outlier (High)	> Q3 + 1.5 × IQR
Z-Score	z = (x - μ) / σ
Geometric Mean	μ_geo = (x₁ × … × x_N)^(1/N)

Python (NumPy/SciPy)

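import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 100])

# Central Tendency
mean = np.mean(data)          # pulled up by the outlier 100
median = np.median(data)      # robust to the outlier
mode = stats.mode(data).mode  # every value is unique here, so only one of them is returned

# Spread (Note: ddof=1 applies Bessel's correction for Sample Std Dev)
std_dev_sample = np.std(data, ddof=1)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Outliers (Z-Score)
# Caveat: stats.zscore uses the population std dev (ddof=0) by default, and
# |z| can never exceed sqrt(N - 1), which is 2 for N = 5. The |z| > 3 rule
# therefore flags nothing in this tiny dataset, not even 100.
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]

# Outliers (IQR fences from the formula table): flags 100
outliers_iqr = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]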
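
The geometric mean from the formula table fits multiplicative data such as growth rates. A minimal sketch, assuming three made-up yearly growth factors:

```python
import numpy as np

# Hypothetical yearly growth factors: +10%, +50%, -20%.
growth = np.array([1.10, 1.50, 0.80])

# exp(mean(log x)) equals (x_1 * ... * x_N)^(1/N) and avoids overflow
# or underflow when the product of many values gets extreme.
geo_mean = np.exp(np.mean(np.log(growth)))
arith_mean = growth.mean()

# Compounding the geometric mean N times reproduces the total growth.
total_growth = growth.prod()

print(f"arithmetic mean = {arith_mean:.4f}")
print(f"geometric mean  = {geo_mean:.4f}")
print(f"geo_mean ** N   = {geo_mean ** len(growth):.4f} (total growth {total_growth:.4f})")
```

The arithmetic mean overstates the average yearly growth, while compounding the geometric mean over the same number of years recovers the actual total.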

Statistics Glossary