Sampling & Hypothesis Testing

[!NOTE] This module explores the core principles of Sampling & Hypothesis Testing, building from first principles toward production-ready statistical practice.

1. Introduction: Signal vs Noise

You launch a new feature.

  • Old Design: 10.0% Conversion.
  • New Design: 10.2% Conversion.

Is this a real improvement, or just random noise? Hypothesis Testing provides the mathematical framework to answer this. It is the judge and jury of scientific experiments.


2. The Central Limit Theorem (CLT)

Before we test hypotheses, we must understand the “Magic” of statistics. The Central Limit Theorem states:

The sum (or average) of many independent random variables tends toward a Normal Distribution, regardless of their original distribution.

This is why we can use Gaussian formulas for almost everything (A/B testing, error analysis, etc.), even if the underlying data (like website clicks) is not Gaussian!
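The CLT is easy to verify empirically. The sketch below (illustrative numbers, not from the text) draws many samples from a heavily skewed exponential distribution and checks that the sample means behave as the theorem predicts: centered on the true mean, with spread shrinking like σ/√n.

```python
import numpy as np

# CLT demo: sample means of a skewed distribution look Gaussian.
# (Illustrative parameters; any non-pathological distribution works.)
rng = np.random.default_rng(0)

true_mean = 1.0  # Exponential(scale=1): mean = 1, std = 1 (very skewed)
n = 50           # observations per sample

# 10,000 experiments, each averaging n skewed draws
sample_means = rng.exponential(scale=true_mean, size=(10_000, n)).mean(axis=1)

# CLT prediction: means ~ Normal(true_mean, 1/sqrt(n))
print(f"Mean of sample means: {sample_means.mean():.3f}")
print(f"Std of sample means:  {sample_means.std():.3f}  (theory: {1/np.sqrt(n):.3f})")
```

Plotting a histogram of `sample_means` would show a near-perfect bell curve, even though the raw exponential data is nothing like Gaussian.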

Interactive Visualizer: The Galton Board (CLT in Action)

Balls fall and bounce left/right at random (Bernoulli trials). The final position of each ball is the sum of these bounces. Notice how the random bounces accumulate to form a Bell Curve.

[!TIP] Try it yourself: Click “Drop 300 Balls”. Watch the piles grow. The white line shows the theoretical Binomial distribution. Notice how closely reality matches theory!
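The Galton board can be sketched in a few lines. Each ball takes a fixed number of left/right bounces (the row count below is an illustrative assumption), so its final slot is a sum of Bernoulli trials, i.e. Binomial, which the CLT says looks Gaussian:

```python
import numpy as np

# Galton board sketch: 300 balls, 12 rows of pegs (assumed values).
# Each bounce is a Bernoulli trial; final slot = number of right-bounces.
rng = np.random.default_rng(1)
n_balls, n_rows = 300, 12

positions = rng.integers(0, 2, size=(n_balls, n_rows)).sum(axis=1)

# Text histogram: the piles trace out a Binomial(12, 0.5) bell curve
counts = np.bincount(positions, minlength=n_rows + 1)
for slot, c in enumerate(counts):
    print(f"{slot:2d} | {'#' * c}")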


3. The Framework of Hypothesis Testing

3.1 The Hypotheses

  • Null Hypothesis (H<sub>0</sub>): The Skeptic’s View. “There is no difference. The change did nothing.”
  • Alternative Hypothesis (H<sub>1</sub>): The Believer’s View. “The new design is better.”

3.2 The P-Value

The most misunderstood concept in science.

The P-Value is the probability of observing data at least this extreme, assuming the Null Hypothesis is true.

  • It is NOT the probability that H<sub>0</sub> is true.
  • Low P-Value (< 0.05): The data is “surprising” if H<sub>0</sub> is true. We Reject H<sub>0</sub>.
  • High P-Value: The data is consistent with H<sub>0</sub>. We Fail to Reject H<sub>0</sub>.
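Returning to the intro example (10.0% vs 10.2% conversion), a two-proportion z-test makes the p-value concrete. The sample sizes here are hypothetical assumptions, since the intro gives only the rates:

```python
import numpy as np
from scipy import stats

# Hypothetical sample sizes for the 10.0% vs 10.2% conversion example
n_a, n_b = 100_000, 100_000
conv_a, conv_b = 0.100, 0.102

# Pooled two-proportion z-test (H0: both rates are equal)
p_pool = (conv_a * n_a + conv_b * n_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (conv_b - conv_a) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

Even with 100,000 users per arm, a 0.2-point lift yields a p-value well above 0.05 here, so we would Fail to Reject H<sub>0</sub>: the observed difference is consistent with noise.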

3.3 Effect Size & Power Analysis

A result can be Statistically Significant (low p-value) but Practically Irrelevant (tiny effect size).

  • Effect Size: The magnitude of the difference (e.g., Cohen’s d).
  • Power: The probability of finding an effect if it exists. Power = 1 - β, where β is the Type II Error rate.

Rule of Thumb: Before running an A/B test, calculate the required Sample Size to achieve 80% Power. If you don’t, you might fail to detect a real improvement (False Negative). This is why companies like Netflix use Canary Deployments (see System Design Module 17).
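The Rule of Thumb above can be turned into arithmetic. A common normal-approximation formula for a two-sided, two-sample test is n = 2(z<sub>1-α/2</sub> + z<sub>1-β</sub>)² (σ/δ)² per group; the σ and δ values below are illustrative assumptions:

```python
import math
from scipy import stats

# Sample-size sketch for 80% Power (normal approximation).
# sigma = noise std, delta = smallest effect we care about (assumed values).
alpha, power = 0.05, 0.80
sigma, delta = 20.0, 10.0

z_alpha = stats.norm.ppf(1 - alpha / 2)  # ~1.96 for a two-sided test
z_beta = stats.norm.ppf(power)           # ~0.84 for 80% Power

n_per_group = math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
print(f"Required sample size per group: {n_per_group}")
```

Note the quadratic dependence on σ/δ: halving the detectable effect size quadruples the required sample.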


4. The Dark Side: P-Hacking (Data Dredging)

If you set α = 0.05, you have a 5% chance of finding a “significant” result purely by luck (Type I Error).

What if you run 20 tests? The probability of getting at least one false positive is:

P(Error) = 1 - (0.95)<sup>20</sup> ≈ 64%

This is P-Hacking: Testing many hypotheses and only reporting the one that worked. This creates “Fake Science” and “Fake ML improvements”.
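The family-wise error rate above is a one-liner to check:

```python
# Probability of at least one false positive across k independent tests
alpha, k = 0.05, 20
fwer = 1 - (1 - alpha) ** k

print(f"P(at least one false positive) = {fwer:.2%}")
```

At 20 tests the chance of a spurious “win” is already about 64%, which is why multiple-testing corrections exist.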

Interactive Visualizer: The Data Dredging Game

We simulate 20 A/B tests where there is NO actual effect (Null Hypothesis is True). Watch how many turn up “Significant” (Red) just by random chance.

A test counts as a False Positive at p < 0.05; the Bonferroni-corrected threshold (0.05 / 20 = 0.0025) is far stricter, and spurious “wins” rarely clear it.
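A minimal sketch of this simulation, under the assumption of 20 independent A/A tests with 200 users per arm:

```python
import numpy as np
from scipy import stats

# Dredging game: 20 A/A tests where H0 is TRUE by construction.
rng = np.random.default_rng(7)
alpha, n_tests = 0.05, 20

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0, 1, 200)
    b = rng.normal(0, 1, 200)  # same distribution: no real effect
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha

print(f"'Significant' results out of {n_tests}: {false_positives}")
print(f"Bonferroni-corrected threshold: {alpha / n_tests}")
```

Re-running with different seeds typically turns up one or more “significant” results despite there being no effect at all.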

Implementation in Python

We can use scipy.stats to run a rigorous T-Test:

```python
import numpy as np
from scipy import stats

# Scenario: A/B Test
# Group A (Control): 100 users, mean time = 120s, std = 20s
# Group B (Variant): 100 users, mean time = 130s, std = 25s

# Generate synthetic data
np.random.seed(42)
group_a = np.random.normal(120, 20, 100)
group_b = np.random.normal(130, 25, 100)

# Perform Independent T-Test
# Null Hypothesis: Means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.6f}")

if p_value < 0.05:
    print("Result: Statistically Significant (Reject Null Hypothesis)")
else:
    print("Result: Not Significant (Fail to Reject Null Hypothesis)")

# With this seed, the 10s gap between the group means is large enough
# that the test rejects the Null Hypothesis (p well below 0.05).
```

5. Summary

  • P-Value: The probability of seeing the data assuming nothing happened. It is NOT the probability that the hypothesis is true.
  • Sample Size: Larger samples = More Power = Lower Type II Error.
  • P-Hacking: Running many tests guarantees false positives. Always use the Bonferroni Correction (α / k) or split your data into Validation Sets.