Sampling & Hypothesis Testing

1. Introduction: Signal vs Noise

You launch a new feature.

  • Old Design: 10.0% Conversion.
  • New Design: 10.2% Conversion.

Is this a real improvement, or just random noise? Hypothesis Testing provides the mathematical framework to answer this. It is the judge and jury of scientific experiments.


2. The Central Limit Theorem (CLT)

Before we test hypotheses, we must understand the “Magic” of statistics. The Central Limit Theorem states:

The distribution of the sum (or average) of many independent, identically distributed random variables (with finite variance) tends toward a Normal Distribution, regardless of their original distribution.

This is why we can use Gaussian formulas for almost everything (A/B testing, error analysis, etc.), even if the underlying data (like website clicks) is not Gaussian!

Interactive Visualizer: The Galton Board (CLT in Action)

Balls fall and bounce left/right randomly (Bernoulli trials). The final position is the sum of these bounces. Notice how the random bounces accumulate to form a perfect Bell Curve. Physics Mode: Watch the balls stack up in real-time.
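The same idea can be reproduced in a few lines of Python (a minimal sketch, assuming NumPy and Matplotlib are available): each ball's final slot is the sum of many independent left/right bounces, and the histogram of those sums forms a bell curve even though each individual bounce is just a coin flip.

```python
# Galton board sketch: each ball's final position is the sum of many
# independent +1/-1 bounces (Bernoulli-style trials). Per the CLT, the
# histogram of these sums approaches a Normal (Gaussian) shape.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_balls, n_rows = 10_000, 16            # balls dropped, peg rows per ball

bounces = rng.choice([-1, 1], size=(n_balls, n_rows))   # left/right bounces
final_positions = bounces.sum(axis=1)                    # sum of the bounces

plt.hist(final_positions, bins=range(-n_rows, n_rows + 2, 2), edgecolor="black")
plt.title("Sums of random bounces form a bell curve")
plt.xlabel("Final position")
plt.ylabel("Number of balls")
plt.show()
```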


3. The Framework of Hypothesis Testing

3.1 The Hypotheses

  • Null Hypothesis (H<sub>0</sub>): The Skeptic’s View. “There is no difference. The change did nothing.”
  • Alternative Hypothesis (H<sub>1</sub>): The Believer’s View. “The new design is better.”

3.2 The P-Value

The most misunderstood concept in science.

The P-Value is the probability of observing data at least this extreme, assuming the Null Hypothesis is true.

  • It is NOT the probability that H<sub>0</sub> is true.
  • Low P-Value (< 0.05): The data is “surprising” if H<sub>0</sub> is true. We Reject H<sub>0</sub>.
  • High P-Value: The data is consistent with H<sub>0</sub>. We Fail to Reject H<sub>0</sub>.
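To make the p-value concrete, here is a hedged sketch of a two-proportion z-test applied to the conversion example from the introduction. The 10.0% vs 10.2% rates come from Section 1; the 100,000 visitors per variant is an assumed sample size chosen purely for illustration.

```python
# Two-proportion z-test sketch for the conversion example.
# The visitor counts per variant are hypothetical assumptions.
import numpy as np
from scipy.stats import norm

n_old, n_new = 100_000, 100_000          # visitors per variant (assumed)
conv_old, conv_new = 10_000, 10_200      # conversions (10.0% and 10.2%)

p_old, p_new = conv_old / n_old, conv_new / n_new
p_pool = (conv_old + conv_new) / (n_old + n_new)   # pooled rate under H0

# Standard error of the difference, assuming H0 (no difference) is true.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
z = (p_new - p_old) / se
p_value = 2 * norm.sf(abs(z))            # two-sided p-value

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With these assumed counts, z ≈ 1.48 and p ≈ 0.14, so the 0.2-point lift would not be significant at α = 0.05; whether it becomes significant depends heavily on how many visitors each variant receives.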

3.3 Effect Size & Power Analysis

A result can be Statistically Significant (low p-value) but Practically Irrelevant (tiny effect size).

  • Effect Size: The magnitude of the difference (e.g., Cohen’s d).
  • Power: The probability of detecting an effect when one truly exists.

    Power = 1 - β, where β is the probability of a Type II Error (missing a real effect)

Rule of Thumb: Before running an A/B test, calculate the required Sample Size to achieve 80% Power. If you don’t, you might fail to detect a real improvement (False Negative).
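As a sketch of that rule of thumb, the normal-approximation formula below estimates the per-variant sample size needed to detect a 10.0% → 10.2% lift with 80% power. The baseline rate and the minimum detectable effect are assumptions carried over from the introduction, not values from any real experiment.

```python
# Rough pre-test sample-size calculation for comparing two proportions,
# using the standard normal-approximation formula.
from scipy.stats import norm

p1 = 0.100                # baseline conversion rate (assumed)
p2 = 0.102                # smallest lift worth detecting (assumed)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
z_beta = norm.ppf(power)            # critical value for the desired power

# Approximate visitors needed PER VARIANT to detect p1 -> p2 with 80% power.
n_per_group = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"~{n_per_group:,.0f} visitors per variant")
```

With these numbers the formula gives on the order of 350,000 visitors per variant, which is why detecting tiny lifts requires very large experiments.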


4. The Dark Side: P-Hacking (Data Dredging)

If you set α = 0.05, you have a 5% chance of finding a “significant” result purely by luck (Type I Error).

What if you run 20 tests? The probability of getting at least one false positive is:

P(Error) = 1 - (0.95)^20 ≈ 64%

This is P-Hacking: Testing many hypotheses and only reporting the one that worked. This creates “Fake Science” and “Fake ML improvements”.

Interactive Visualizer: The Data Dredging Game

We simulate 20 A/B tests where there is NO actual effect (Null Hypothesis is True). Watch how many turn up “Significant” (Red) just by random chance.

Legend: Significant Results Found, marked as False Positive (p < 0.05) or Bonferroni Safe (p < 0.05 / 20 = 0.0025).
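The same game can be reproduced offline. The sketch below runs 20 simulated A/B tests in which both variants share the exact same conversion rate (so the Null Hypothesis is true by construction) and counts how many come out “significant” at p < 0.05 versus the Bonferroni-corrected threshold. The per-arm sample size and base rate are illustrative assumptions.

```python
# Data-dredging sketch: 20 A/A tests with NO real effect. Any "significant"
# result is a false positive by construction.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n_tests, n_per_arm, p_true = 20, 10_000, 0.10   # assumed settings
alpha = 0.05
bonferroni_alpha = alpha / n_tests               # 0.05 / 20 = 0.0025

false_positives, bonferroni_hits = 0, 0
for _ in range(n_tests):
    # Both arms are drawn from the SAME distribution: no real effect exists.
    conv_a = rng.binomial(n_per_arm, p_true)
    conv_b = rng.binomial(n_per_arm, p_true)
    p_a, p_b = conv_a / n_per_arm, conv_b / n_per_arm
    p_pool = (conv_a + conv_b) / (2 * n_per_arm)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_per_arm))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    false_positives += p_value < alpha
    bonferroni_hits += p_value < bonferroni_alpha

print(f"Analytic chance of >=1 false positive: {1 - 0.95 ** n_tests:.0%}")
print(f"'Significant' at p < 0.05:           {false_positives} / {n_tests}")
print(f"Significant after Bonferroni:        {bonferroni_hits} / {n_tests}")
```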

5. Summary

  • P-Value: The probability of seeing the data assuming nothing happened. It is NOT the probability that the hypothesis is true.
  • Sample Size: Larger samples = More Power = Lower Type II Error.
  • P-Hacking: Running many tests makes false positives very likely. Use Bonferroni Correction (α / k) or split your data into Validation Sets.

Next: Case Study: Naive Bayes Spam Classifier →