Review & Cheat Sheet
[!NOTE] This module reviews the key ideas of descriptive statistics covered so far and collects the main formulas and Python snippets in one place.
1. Key Takeaways
- Visualize First: Always plot your data before calculating summary statistics. Remember Anscombe’s Quartet: four datasets with nearly identical summary statistics but very different plots.
- Central Tendency:
  - Mean: Best for symmetric data. Sensitive to outliers.
  - Median (P50): Best for skewed data (income, latency). Robust to outliers.
  - P99: A tail percentile rather than a measure of center, but critical for distributed systems (tail latency).
- Spread:
  - Variance/Std Dev: Variance is the average squared deviation from the mean. The standard deviation (σ) is its square root, so it is in the original units.
  - Bessel’s Correction: Divide by N - 1 instead of N when computing sample variance to correct the downward bias.
  - IQR: Spread of the middle 50% (Q3 - Q1). Robust to outliers.
- Hardware Reality: Database query planners use histograms of column values to estimate how many rows a filter will match, and use that estimate to choose a plan (e.g. Index Scan vs Seq Scan).
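
The difference between the mean, the median (P50), and P99 is easiest to see on skewed data. The sketch below uses synthetic latencies drawn from a log-normal distribution. The distribution, its parameters, and the seed are assumptions chosen for illustration, not measurements from a real system.

```python
import numpy as np

# Synthetic request latencies in milliseconds.
# Assumption: log-normal shape with illustrative parameters, not real data.
rng = np.random.default_rng(seed=42)
latencies_ms = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

mean_ms = np.mean(latencies_ms)
p50_ms, p99_ms = np.percentile(latencies_ms, [50, 99])

# On right-skewed data the mean sits above the median,
# and P99 exposes the slow tail that both of them hide.
print(f"mean={mean_ms:.1f} ms  P50={p50_ms:.1f} ms  P99={p99_ms:.1f} ms")
```

For data shaped like this, the mean lands above the median because the long right tail pulls it up, while P99 shows how slow the worst 1% of requests are.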
2. Cheat Sheet
Formulas
| Concept | Formula |
|---|---|
| Mean | μ = (1/N) Σ xᵢ |
| Sample Variance | s² = Σ(xᵢ - x̄)² / (N - 1) |
| Sample Std Dev | s = √(s²) |
| IQR | Q3 (75th percentile) - Q1 (25th percentile) |
| Outlier (Low) | < Q1 - 1.5 × IQR |
| Outlier (High) | > Q3 + 1.5 × IQR |
| Z-Score | z = (x - μ) / σ |
| Geometric Mean | μ_geo = (x₁ × … × xₙ)^(1/n) |
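
To make the formulas concrete, here is a minimal sketch that computes several of them by hand using only the standard library (Python 3.8+ for `math.prod`). The five-value dataset is an illustrative assumption, and the z-scores use the sample mean and sample standard deviation as stand-ins for μ and σ.

```python
import math

data = [10, 12, 11, 13, 100]  # illustrative sample with one extreme value
n = len(data)

# Mean: (1/N) * sum of the values
mean = sum(data) / n

# Sample variance: squared deviations divided by N - 1 (Bessel's correction)
sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)

# Sample standard deviation: square root of the sample variance
sample_std = math.sqrt(sample_var)

# Z-scores, using the sample estimates in place of mu and sigma
z_scores = [(x - mean) / sample_std for x in data]

# Geometric mean: n-th root of the product (values must be positive)
geo_mean = math.prod(data) ** (1 / n)

print(f"mean={mean:.2f}  s^2={sample_var:.2f}  s={sample_std:.2f}  geo={geo_mean:.2f}")
```

The results can be cross-checked against `statistics.variance` and `statistics.geometric_mean`, which implement the same definitions.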
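Python (NumPy/SciPy)
```python
import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 100])

# Central tendency
mean = np.mean(data)      # 29.2, pulled up by the outlier
median = np.median(data)  # 12.0, robust to the outlier
mode = stats.mode(data)   # every value is unique here, so the mode is not informative

# Spread (ddof=1 applies Bessel's correction for a sample)
std_dev_sample = np.std(data, ddof=1)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Outliers (z-score). stats.zscore uses the population std dev (ddof=0) by default.
z_scores = np.abs(stats.zscore(data))
# Caveat: with N points and ddof=0, |z| can never exceed sqrt(N - 1), which is 2 here,
# so a cutoff of 3 flags nothing in this tiny sample. It suits larger datasets.
outliers = np.where(z_scores > 3)[0]  # indices of flagged points (empty for this data)
```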
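
As a complement to the z-score check, the IQR rule from the formula table can be applied to the same data. This is a minimal sketch: the names `low_fence` and `high_fence` are illustrative, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 100])

# Quartiles and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = data[(data < low_fence) | (data > high_fence)]
print(outliers)  # [100]
```

Because the quartiles barely move when a single extreme value is added, the IQR rule flags 100 even on this five-point sample, where a z-score cutoff of 3 cannot.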