Knowing the center (mean/median) of your data is only half the story. Two datasets can have the exact same mean but look completely different.
To fully understand data, we need to measure its spread (variability) and identify outliers (anomalies).
1. Range and Interquartile Range (IQR)
Range
The simplest measure of spread. Formula: Range = Max - Min
It is extremely sensitive to outliers: a single bad data point can inflate the range dramatically.
Interquartile Range (IQR)
The IQR measures the spread of the middle 50% of your data. It is the distance between the 75th percentile (Q3) and the 25th percentile (Q1).
Formula: IQR = Q3 - Q1
- Pros: Robust to outliers.
- Usage: Used in Box Plots to determine whiskers.
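To see the contrast, here is a quick sketch using NumPy (the dataset is the same one used in the implementation section below; `np.percentile` uses linear interpolation between ranks):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 10, 11, 12, 100])

# Range: one extreme point (100) dominates it
data_range = data.max() - data.min()  # 90

# IQR: spread of the middle 50%, barely affected by the outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"Range: {data_range}, IQR: {iqr}")
```

The single value 100 pushes the Range to 90, while the IQR stays small because it ignores the tails entirely.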
2. Variance and Standard Deviation
These are the most important measures of spread, especially for normally distributed data.
Variance (σ²)
The average of the squared differences from the Mean.
Population Formula: σ² = Σ(xᵢ − μ)² / N
Standard Deviation (σ)
The square root of the Variance. It brings the unit back to the original scale (e.g., from “squared dollars” back to “dollars”).
Formula: σ = √Variance
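As a quick sanity check, here is a sketch using NumPy (`ddof=0` selects the population formula; the dataset is arbitrary):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 10, 11, 12, 100])

variance = np.var(data, ddof=0)  # population variance, in squared units
std_dev = np.sqrt(variance)      # square root brings it back to original units
print(variance, std_dev)
```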
3. Deep Dive: The Mystery of N-1 (Bessel’s Correction)
When calculating Variance for a Sample (subset of population), we divide by N-1 instead of N. Why?
Formula (Sample): s² = Σ(xᵢ − x̄)² / (N − 1)
The Intuition
- Bias: We usually don’t know the True Population Mean (μ), so we use the Sample Mean (x̄) as a proxy.
- Underestimation: The Sample Mean is mathematically “closer” to the sample points than the True Mean is. (The mean is defined as the point that minimizes squared distance).
- Correction: Because x̄ is “too close” to the data, the sum of squared differences is smaller than it would be if we used μ. Dividing by N would give us a value that is consistently too small (biased).
- Fix: Dividing by a smaller number (N-1) inflates the result just enough to correct this bias on average.
> [!TIP]
> Interview Tip: If asked “Why N-1?”, answer: “To correct the negative bias caused by using the sample mean instead of the population mean. It’s called Bessel’s Correction.”
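The claim that dividing by N underestimates on average can be checked empirically. A sketch using NumPy (the distribution, sample size, and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0  # variance of N(0, 2)

# Draw many small samples and average the two variance estimators
samples = rng.normal(0, 2, size=(100_000, 5))
biased = samples.var(axis=1, ddof=0).mean()    # divide by N
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by N-1

print(f"divide by N:   {biased:.2f}")    # consistently below 4
print(f"divide by N-1: {unbiased:.2f}")  # close to 4
```

On average the N divisor lands near 4 × (N−1)/N = 3.2, while the N−1 divisor recovers the true variance.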
4. Interactive Demo: The Normal Distribution (Sigma)
This visualization shows how the Standard Deviation (σ) affects the shape of a Normal Distribution (Bell Curve).
- Low σ: Data is clustered tightly around the mean (tall, narrow curve).
- High σ: Data is spread out (flat, wide curve).
[Interactive demo: Standard Deviation Visualizer]
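The same effect can be reproduced numerically. A sketch (the means and sigmas are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
low_sigma = rng.normal(loc=50, scale=1, size=10_000)
high_sigma = rng.normal(loc=50, scale=15, size=10_000)

# Same center, very different spread
print(low_sigma.mean(), high_sigma.mean())  # both near 50
print(low_sigma.std(), high_sigma.std())    # near 1 vs. near 15
```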
5. Detecting Outliers
Outliers can skew your analysis (they drag statistics like the Mean, for example). We use two common methods to detect them.
Method A: The IQR Rule (Box Plot Method)
A data point is an outlier if it falls outside these bounds:
- Lower Bound: Q1 - (1.5 × IQR)
- Upper Bound: Q3 + (1.5 × IQR)
Method B: The Z-Score Method
Calculate how many standard deviations a point is from the mean (Z-Score).
- If **|Z| > 3**, the point is typically considered an outlier.
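A sketch of the Z-Score method (the dataset is synthetic, with one anomaly injected). Note that on very small samples a single outlier inflates σ itself and can mask its own Z-Score, so a larger sample is used here:

```python
import numpy as np

rng = np.random.default_rng(1)
# 1,000 normal points plus one injected anomaly
data = np.append(rng.normal(loc=50, scale=5, size=1000), 120.0)

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # includes the injected 120
```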
6. Implementation
We demonstrate how to calculate Variance, Standard Deviation, and Outliers from scratch (Java/Go) and using libraries (Python).
Java:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Spread {
    public static void main(String[] args) {
        double[] data = {10, 12, 11, 13, 10, 11, 12, 100};

        // 1. Mean
        double sum = 0;
        for (double x : data) sum += x;
        double mean = sum / data.length;

        // 2. Variance & Std Dev (Sample)
        double sqDiffSum = 0;
        for (double x : data) sqDiffSum += Math.pow(x - mean, 2);
        // Bessel's Correction: divide by N-1
        double variance = sqDiffSum / (data.length - 1);
        double stdDev = Math.sqrt(variance);
        System.out.printf("Standard Deviation: %.2f%n", stdDev);

        // 3. IQR Outliers
        Arrays.sort(data); // must sort before computing percentiles
        double q1 = percentile(data, 25);
        double q3 = percentile(data, 75);
        double iqr = q3 - q1;
        double lowerBound = q1 - (1.5 * iqr);
        double upperBound = q3 + (1.5 * iqr);
        List<Double> outliers = new ArrayList<>();
        for (double x : data) {
            if (x < lowerBound || x > upperBound) outliers.add(x);
        }
        System.out.println("IQR Outliers: " + outliers);
    }

    // Basic linear interpolation for percentile
    public static double percentile(double[] sortedData, double p) {
        int n = sortedData.length;
        double index = (p / 100) * (n - 1);
        int lower = (int) Math.floor(index);
        int upper = (int) Math.ceil(index);
        if (lower == upper) return sortedData[lower];
        return sortedData[lower] * (upper - index) + sortedData[upper] * (index - lower);
    }
}
```
Go:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

func main() {
	data := []float64{10, 12, 11, 13, 10, 11, 12, 100}

	// 1. Mean
	sum := 0.0
	for _, x := range data {
		sum += x
	}
	mean := sum / float64(len(data))

	// 2. Variance & Std Dev (Sample)
	sqDiffSum := 0.0
	for _, x := range data {
		sqDiffSum += math.Pow(x-mean, 2)
	}
	// Bessel's Correction: divide by N-1
	variance := sqDiffSum / float64(len(data)-1)
	stdDev := math.Sqrt(variance)
	fmt.Printf("Standard Deviation: %.2f\n", stdDev)

	// 3. IQR Outliers
	sort.Float64s(data) // must sort before computing percentiles
	q1 := percentile(data, 25)
	q3 := percentile(data, 75)
	iqr := q3 - q1
	lowerBound := q1 - (1.5 * iqr)
	upperBound := q3 + (1.5 * iqr)
	var outliers []float64
	for _, x := range data {
		if x < lowerBound || x > upperBound {
			outliers = append(outliers, x)
		}
	}
	fmt.Printf("IQR Outliers: %.0f\n", outliers)
}

// Basic linear interpolation for percentile
func percentile(sortedData []float64, p float64) float64 {
	n := float64(len(sortedData))
	index := (p / 100) * (n - 1)
	lower := int(math.Floor(index))
	upper := int(math.Ceil(index))
	if lower == upper {
		return sortedData[lower]
	}
	fraction := index - float64(lower)
	return sortedData[lower]*(1-fraction) + sortedData[upper]*fraction
}
```
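Python, leaning on NumPy (a sketch; `ddof=1` selects the sample formula, and `np.percentile` defaults to the same linear interpolation used by the helper functions above):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 10, 11, 12, 100], dtype=float)

# 1 & 2. Sample standard deviation (ddof=1 applies Bessel's Correction)
std_dev = np.std(data, ddof=1)
print(f"Standard Deviation: {std_dev:.2f}")

# 3. IQR outliers (no manual sorting needed; np.percentile handles it)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print("IQR Outliers:", outliers)
```

All three versions flag 100 as the lone IQR outlier in this dataset.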
7. Summary
- Range: Min to Max (Sensitive).
- IQR: Middle 50% (Robust).
- Variance/Std Dev: Spread around Mean (Gold standard for Normal Dist).
- Outliers: Use IQR or Z-Score to find and handle them.