Hypothesis Testing: The Framework of Truth

How do we decide if a new drug works? Or if a website redesign actually increases sales? Or if an observed warming trend is statistically significant?

We use Hypothesis Testing, a rigorous framework for making decisions using data. It’s not about proving things “true” in an absolute sense, but about determining if there’s enough evidence to reject the status quo.


1. The Judicial Analogy

Think of a criminal trial. This is the perfect analogy for hypothesis testing.

| Concept | In the Courtroom | In Statistics |
| --- | --- | --- |
| The contenders | The Defendant vs. the presumption of Innocence | The "New Idea" (drug, feature) vs. the Status Quo (no effect) |
| Null Hypothesis (H0) | "The Defendant is Innocent." | "The drug has no effect." |
| Alternative Hypothesis (H1) | "The Defendant is Guilty." | "The drug works." |
| Evidence | The Witness Testimony: "We saw the defendant at the scene." | The Data |
| Verdict | "Not Guilty" (we never say "Proven Innocent") | Reject or Fail to Reject H0 |

> [!NOTE]
> **"Fail to Reject" vs. "Accept"**
> In court, a "Not Guilty" verdict doesn't mean the defendant is definitely innocent; it just means there wasn't enough evidence to convict. Similarly, we never "accept" the Null Hypothesis; we only "fail to reject" it.


2. Interactive: P-Value Visualizer

The P-value is the probability of seeing data at least as extreme as what we observed, assuming the Null Hypothesis is true.

  • Low P-value (< 0.05): The observed data would be surprising if H0 were true. The evidence against H0 is strong. Reject H0.
  • High P-value (> 0.05): The observed data is entirely consistent with H0. Fail to reject H0.

As the Test Statistic (Z-score) moves away from the center of the distribution, the P-value (the shaded area in the tails) shrinks. At Z = 0, the two-sided P-value is 1.000, so the verdict at α = 0.05 is "Fail to Reject".
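This Z-score-to-P-value relationship can be sketched in Java. The standard library has no normal CDF, so the sketch below uses the Abramowitz & Stegun polynomial approximation (accurate to roughly 7.5e-8); the class and method names are illustrative, not from any library.

```java
public class PValueDemo {

    // Standard normal CDF via the Abramowitz & Stegun 26.2.17 approximation.
    public static double normalCdf(double z) {
        if (z < 0) return 1.0 - normalCdf(-z);
        double t = 1.0 / (1.0 + 0.2316419 * z);
        double poly = t * (0.319381530
                    + t * (-0.356563782
                    + t * (1.781477937
                    + t * (-1.821255978
                    + t * 1.330274429))));
        double pdf = Math.exp(-0.5 * z * z) / Math.sqrt(2 * Math.PI);
        return 1.0 - pdf * poly;
    }

    // Two-sided P-value: probability of a result at least this extreme under H0.
    public static double twoSidedP(double z) {
        return 2.0 * (1.0 - normalCdf(Math.abs(z)));
    }

    public static void main(String[] args) {
        // At Z = 0 the P-value is 1.0; near |Z| = 1.96 it crosses 0.05.
        for (double z : new double[]{0.0, 1.0, 1.96, 3.0}) {
            System.out.printf("Z = %.2f  ->  P = %.4f%n", z, twoSidedP(z));
        }
    }
}
```

Note that the familiar 0.05 threshold corresponds to |Z| ≈ 1.96 precisely because that is where the two tails together hold 5% of the distribution.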

3. Real World Application: Netflix A/B Testing

At companies like Netflix, every feature (thumbnail images, autoplay behavior) is an A/B test (Hypothesis Test).

The “Sample Ratio Mismatch” (SRM) Trap

Imagine Netflix tests a new “4K Streaming” button.

  • Control (50%): No button.
  • Treatment (50%): New button.

They run the experiment and get 1,000 users in Control and only 800 users in Treatment. The Treatment group spends 20% more time watching. Success?

NO. The uneven sample size (1,000 vs. 800) suggests a bug. Maybe the new button crashed the app for 200 users, who left immediately and weren't tracked. If you analyze only the survivors, your result suffers from survivorship bias.

This is why we run a Chi-Square Goodness of Fit test on the sample ratios before analyzing the metric.
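Such an SRM check can be sketched as a chi-square goodness-of-fit test against the expected 50/50 split. This is a minimal sketch, assuming a two-variant experiment; the class name and the strict α = 0.001 threshold (common practice for SRM alarms) are illustrative.

```java
public class SrmCheck {

    // Chi-square goodness-of-fit statistic for an expected 50/50 split:
    // sum over groups of (observed - expected)^2 / expected.
    public static double chiSquare(int controlCount, int treatmentCount) {
        double expected = (controlCount + treatmentCount) / 2.0;
        double d1 = controlCount - expected;
        double d2 = treatmentCount - expected;
        return (d1 * d1) / expected + (d2 * d2) / expected;
    }

    public static void main(String[] args) {
        // The example above: 1,000 control users vs. 800 treatment users.
        double stat = chiSquare(1000, 800);
        System.out.printf("Chi-square statistic: %.2f%n", stat);

        // Critical value for df=1 at alpha=0.001 -- SRM checks use a
        // deliberately strict threshold to avoid false alarms.
        if (stat > 10.828) {
            System.out.println("Sample Ratio Mismatch detected -- do not trust the metric!");
        } else {
            System.out.println("Sample ratio looks healthy.");
        }
    }
}
```

For 1,000 vs. 800 the statistic is about 22.2, far past the df=1 critical value, so the metric comparison should be discarded until the bug is found.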

> [!WARNING]
> **P-Hacking**
> If you check your experiment results every day and stop "as soon as it's significant", you are cheating: you inflate your False Positive rate massively. Always define your sample size and duration before starting.
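The inflation caused by peeking can be demonstrated with a small Monte Carlo sketch. The look schedule, sample sizes, and class name below are illustrative: we simulate experiments where H0 is true (the metric is pure noise) and stop early whenever an interim Z-test looks significant.

```java
import java.util.Random;

public class PeekingDemo {

    // Simulates experiments under H0 (data ~ N(0,1), no real effect) and
    // returns the fraction declared "significant" when we peek after every
    // lookEvery samples instead of testing once at maxN.
    public static double falsePositiveRate(int experiments, int maxN,
                                           int lookEvery, long seed) {
        Random rng = new Random(seed);
        int falsePositives = 0;
        for (int e = 0; e < experiments; e++) {
            double sum = 0;
            boolean rejected = false;
            for (int n = 1; n <= maxN; n++) {
                sum += rng.nextGaussian();
                if (n % lookEvery == 0) {
                    // Z-test on the running mean at this interim look.
                    double z = (sum / n) * Math.sqrt(n);
                    if (Math.abs(z) > 1.96) { // nominal alpha = 0.05
                        rejected = true;
                        break; // stop "as soon as it's significant"
                    }
                }
            }
            if (rejected) falsePositives++;
        }
        return (double) falsePositives / experiments;
    }

    public static void main(String[] args) {
        double oneLook = falsePositiveRate(5000, 500, 500, 42);  // test once at the end
        double fiveLooks = falsePositiveRate(5000, 500, 100, 42); // peek five times
        System.out.printf("False positive rate, single look: %.3f%n", oneLook);
        System.out.printf("False positive rate, peeking:     %.3f%n", fiveLooks);
    }
}
```

With a single look the false positive rate stays near the nominal 5%; with five interim looks it roughly triples, even though no individual test was "wrong".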


4. Java Implementation: The T-Test Logic

While we usually use libraries, understanding the math requires implementing it. Here is how you calculate a T-statistic from scratch in Java.

The formula for the independent two-sample (unpooled) T-statistic is:

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

public class HypothesisTest {

    // Simple class to hold summary statistics
    static class SummaryStats {
        double mean;
        double variance;
        int n;

        public SummaryStats(double[] data) {
            this.n = data.length;
            double sum = 0;
            for (double x : data) sum += x;
            this.mean = sum / n;

            double sumSqDiff = 0;
            for (double x : data) sumSqDiff += Math.pow(x - mean, 2);
            this.variance = sumSqDiff / (n - 1); // Bessel's correction
        }
    }

    public static double calculateTStatistic(double[] sample1, double[] sample2) {
        SummaryStats s1 = new SummaryStats(sample1);
        SummaryStats s2 = new SummaryStats(sample2);

        double meanDiff = s1.mean - s2.mean;

        // Standard Error calculation
        double se1 = s1.variance / s1.n;
        double se2 = s2.variance / s2.n;
        double standardError = Math.sqrt(se1 + se2);

        return meanDiff / standardError;
    }

    public static void main(String[] args) {
        double[] control = {10.2, 10.5, 10.3, 10.8, 9.9};
        double[] treatment = {12.1, 12.5, 12.3, 11.8, 12.0};

        double tStat = calculateTStatistic(control, treatment);

        System.out.printf("T-Statistic: %.4f%n", tStat);

        // In a real library, we would compute the P-value for this t-stat
        // from the t-distribution. The formula above is the unpooled (Welch)
        // form, whose exact df comes from the Welch-Satterthwaite equation;
        // with these similar variances and equal n, df is approximately 8.
        if (Math.abs(tStat) > 2.306) { // Two-tailed critical value for df=8, alpha=0.05
            System.out.println("Result is Statistically Significant (Reject H0)");
        } else {
            System.out.println("Result is NOT Significant (Fail to Reject H0)");
        }
    }
}

This code explicitly shows the components: the Signal (difference in means) divided by the Noise (standard error).


5. Summary

  • Null Hypothesis (H0) is the assumption of “innocence” (no effect).
  • P-Value is the evidence against H0. Low P-value = Strong Evidence.
  • Type I Error: Convicting the innocent (False Positive).
  • Type II Error: Acquitting the guilty (False Negative).
  • A/B Testing relies on these principles but requires vigilance against biases like SRM and Peeking.