Review & Cheat Sheet

[!NOTE] This module reviews the core ideas of descriptive statistics (central tendency, spread, and outliers) and collects the key formulas and Python snippets in one place.

1. Key Takeaways

  1. Visualize First: Always plot your data before calculating summary statistics (Remember Anscombe’s Quartet).
  2. Central Tendency:
    • Mean: Best for symmetric data. Sensitive to outliers.
    • Median (P50): Best for skewed data (Income, Latency). Robust to outliers.
    • P99: Critical for distributed systems (Tail Latency).
  3. Spread:
    • Variance/Std Dev: Variance is the average squared deviation from the mean. Standard Deviation (σ) is its square root, so it is in the original units.
    • Bessel’s Correction: Use N-1 for sample variance to correct bias.
    • IQR: Spread of middle 50%. Robust.
  4. Hardware Reality: Databases use Histograms to optimize query plans (Index Scan vs Seq Scan).
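To make the mean/median/P99 distinction concrete, here is a minimal sketch using made-up latency numbers (the values and the variable names are illustrative assumptions, not measurements). Two slow requests out of 100 barely move the median but dominate both the mean and the tail.

```python
import numpy as np

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = np.array([20.0] * 98 + [2000.0] * 2)

mean_ms = np.mean(latencies_ms)           # pulled up by the two slow requests
median_ms = np.median(latencies_ms)       # P50: what a typical request sees
p99_ms = np.percentile(latencies_ms, 99)  # tail latency

print(f"mean={mean_ms:.1f} ms, median={median_ms:.1f} ms, p99={p99_ms:.1f} ms")
```

In this toy data the mean (59.6 ms) describes no real request: almost everyone sees 20 ms, and the unlucky tail sees 2000 ms. Reporting the median and P99 together shows both groups.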

2. Interactive Flashcards

Test your knowledge! Each question below is followed by its answer.

Which measure is best for analyzing House Prices?

The Median. Because house prices are usually skewed right with high outliers (mansions), the mean would be misleadingly high.

Why do we divide by N-1 for Sample Variance?

Bessel's Correction. The sample mean is "closer" to the sample data than the true mean, causing underestimation of spread. N-1 corrects this bias.
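The bias described in this answer can be checked with a small simulation. This is a sketch under assumed settings (a normal population with variance 4, samples of size 5, a fixed seed); none of these numbers come from the module itself.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
true_variance = 4.0             # population: Normal(mean=0, std=2)
n = 5                           # small samples make the bias easy to see

# 20,000 independent samples, each of size n.
samples = rng.normal(loc=0.0, scale=2.0, size=(20_000, n))

# Average the two variance estimators across all samples.
biased = np.var(samples, axis=1, ddof=0).mean()    # divide by N
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divide by N-1

print(f"true variance: {true_variance:.2f}")
print(f"divide by N:   {biased:.2f}")
print(f"divide by N-1: {unbiased:.2f}")
```

Dividing by N should average out near (N-1)/N × 4 = 3.2, while dividing by N-1 should land close to the true value of 4.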

What does P99 represent in System Design?

Tail Latency. The response time that 99% of requests meet or beat; the slowest 1% of requests take longer. Critical for user experience, because at high request volumes many users hit that slowest 1%.

How do Databases use Histograms?

To estimate Selectivity. The optimizer checks if values are skewed to decide between an Index Scan (for rare values) or Seq Scan (for common values).
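The idea can be sketched in a few lines. This is a toy model, not how any particular database implements its planner: the column contents, the one-bucket-per-value histogram, the helper names `estimate_selectivity` and `choose_plan`, and the 5% cutoff are all illustrative assumptions.

```python
import numpy as np

# Hypothetical column: 990 rows with value 1 (common) and 10 rows with value 7 (rare).
column = np.array([1] * 990 + [7] * 10)

# One bucket per integer value 0..9, standing in for the optimizer's statistics.
counts, edges = np.histogram(column, bins=np.arange(0, 11))

def estimate_selectivity(value):
    """Estimated fraction of rows matching `column == value` (values 0..9 only)."""
    bucket = np.searchsorted(edges, value, side="right") - 1
    return counts[bucket] / counts.sum()

def choose_plan(value, threshold=0.05):
    # threshold is an illustrative cutoff, not a real optimizer setting
    if estimate_selectivity(value) < threshold:
        return "Index Scan"  # few matching rows: jumping through the index is cheap
    return "Seq Scan"        # most rows match: reading the whole table is cheaper

print(choose_plan(7), estimate_selectivity(7))
print(choose_plan(1), estimate_selectivity(1))
```

Here the rare value 7 matches an estimated 1% of rows and gets an Index Scan, while the common value 1 matches 99% and gets a Seq Scan.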


3. Cheat Sheet

Formulas

Concept            Formula
Mean               μ = (1/N) Σ xᵢ
Sample Variance    s² = Σ(xᵢ - x̄)² / (N - 1)
Sample Std Dev     s = √(s²)
IQR                Q3 (75th percentile) - Q1 (25th percentile)
Outlier (Low)      < Q1 - 1.5 × IQR
Outlier (High)     > Q3 + 1.5 × IQR
Z-Score            z = (x - μ) / σ
Geometric Mean     μgeo = (x₁ × … × xₙ)^(1/n)
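As a quick sanity check on two of these formulas, the sketch below computes a geometric mean and a z-score by hand and compares the first against the standard library. The growth factors are made-up illustrative values; the second data set is the one used in the Python section of this cheat sheet.

```python
import math
import statistics

# Hypothetical yearly growth factors (+10%, +20%, -10%).
growth = [1.10, 1.20, 0.90]

# Geometric mean via the formula: n-th root of the product.
geo_formula = math.prod(growth) ** (1 / len(growth))
# Same quantity computed by the standard library.
geo_stdlib = statistics.geometric_mean(growth)

# Z-score of the value 100 against the population mean and std dev.
data = [10, 12, 11, 13, 100]
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)  # population std dev (divide by N)
z_of_100 = (100 - mu) / sigma

print(f"geometric mean: {geo_formula:.4f} (stdlib: {geo_stdlib:.4f})")
print(f"arithmetic mean of growth: {statistics.fmean(growth):.4f}")
print(f"z-score of 100: {z_of_100:.3f}")
```

The geometric mean comes out below the arithmetic mean of the same factors, which is why it is the safer choice for averaging rates of change.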

Python (NumPy/SciPy)

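import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 100])

# Central Tendency
mean = np.mean(data)          # pulled toward the outlier (100)
median = np.median(data)      # robust to the outlier
mode = stats.mode(data).mode  # most frequent value

# Spread (Note: ddof=1 for Sample Std Dev)
std_dev_sample = np.std(data, ddof=1)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Outliers (Z-Score)
# Caution: stats.zscore defaults to ddof=0, and |z| can never exceed
# sqrt(N - 1), which is 2 for these 5 points, so a cutoff of 3 flags nothing here.
z_scores = np.abs(stats.zscore(data))
z_outliers = data[z_scores > 3]

# Outliers (IQR fences from the Formulas table)
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
iqr_outliers = data[(data < low_fence) | (data > high_fence)]  # flags 100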
