Exploratory Data Analysis (EDA)

Before running a single machine learning model, you must look at your data.

Summary statistics such as the mean and variance can be deceptive: datasets with identical statistics can look completely different when plotted. This is why visualization is one of the most important early steps in any data analysis.

1. Why Visualization Matters: Anscombe’s Quartet

Francis Anscombe created four datasets that have nearly identical descriptive statistics:

Mean of x: 9.0
Mean of y: 7.50
Variance of x: 11.0
Variance of y: 4.12
Correlation: 0.816
Linear regression line: y = 3.00 + 0.500x

Yet, when plotted, they tell completely different stories:

Dataset I: A clean linear relationship.
Dataset II: A perfect curve (non-linear).
Dataset III: A tight linear relationship with one outlier.
Dataset IV: A vertical line with one outlier.

[!WARNING] Never rely on summary statistics alone. Always plot your data first.
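
The statistics listed above can be checked directly. The following is a minimal sketch using only the Python standard library; the helper name `pearson` and the `quartet` dictionary are illustrative choices, and the values are the standard published quartet data.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Collect (mean_x, mean_y, var_x, var_y, r) per dataset using sample variance.
stats = {}
for name, (x, y) in quartet.items():
    stats[name] = (mean(x), mean(y), variance(x), variance(y), pearson(x, y))
    print(name, "mean_x=%.2f mean_y=%.2f var_x=%.2f var_y=%.2f r=%.3f" % stats[name])
```

All four rows should agree to about two decimal places, which is exactly why the numbers alone cannot distinguish the datasets.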

2. Histograms: Understanding Distribution

A Histogram groups data into bins and counts the frequency in each bin. It reveals the shape of the distribution.

Symmetric: Left and right halves mirror each other (e.g., the bell-shaped Normal distribution). Mean ≈ Median.
Skewed Right: Long tail on the right (e.g., Income). Typically Mean > Median.
Skewed Left: Long tail on the left. Typically Mean < Median.
Bimodal: Two peaks (e.g., height of men and women combined).
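
To make the binning step concrete, here is a small sketch of an equal-width histogram in plain Python. The function name `histogram` and the income figures are hypothetical, chosen only to illustrate a right-skewed sample; the sketch assumes the data contains at least two distinct values.

```python
from statistics import mean, median

def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical incomes in thousands: most are modest, a few are very large.
incomes = [22, 25, 27, 30, 31, 33, 35, 38, 40, 42, 45, 50, 58, 75, 120, 210]

counts = histogram(incomes, 5)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
print(f"mean={mean(incomes):.1f} median={median(incomes):.1f}")
```

The text bars pile up in the first bin and trail off to the right, and the mean lands well above the median, which is the signature of right skew.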

3. Box Plots: The 5-Number Summary

A Box Plot (Box-and-Whisker Plot) visualizes the five-number summary:

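Min: Lowest value (excluding outliers); the end of the lower whisker.
Q1: 25th percentile.
Median: 50th percentile (the line inside the box).
Q3: 75th percentile.
Max: Highest value (excluding outliers); the end of the upper whisker.

Outliers: Points more than 1.5 × IQR below Q1 or above Q3 (where IQR = Q3 - Q1) are shown as individual dots.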
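
The quantities a box plot draws can be computed in a few lines. This is a sketch rather than a reference implementation: the function name `box_plot_summary` and the sample values are made up, and quartile conventions differ between tools, so the "inclusive" method of `statistics.quantiles` used here is just one reasonable choice.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return whisker ends, quartiles, and outliers using the 1.5 x IQR rule."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

summary = box_plot_summary([7, 2, 9, 4, 30, 5, 10, 6, 8])
print(summary)
```

In this sample the value 30 falls above the upper fence, so it is reported as an outlier and the upper whisker stops at 10.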

4. Scatter Plots & Correlation

Scatter Plots show the relationship between two numerical variables. Correlation (r) measures the strength and direction of a linear relationship (-1 to +1).

Correlation Explorer

As the correlation coefficient (r) approaches ±1, the points of a scatter plot tighten into a line:

r = 1.0: Perfect positive

r = -1.0: Perfect negative

r = 0.0: Random cloud
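
One caveat is worth seeing in numbers: r measures only linear association, so r = 0 does not always mean a random cloud. The sketch below (the `pearson` helper is an illustrative name, not part of any library) compares a perfectly linear relationship with a perfectly quadratic one.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x = list(range(-3, 4))             # -3, -2, ..., 3
y_linear = [2 * v + 1 for v in x]  # exact straight line
y_quadratic = [v * v for v in x]   # exact parabola, symmetric around x = 0

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)
print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

The parabola is fully determined by x, yet its correlation is zero because the positive and negative halves cancel. A scatter plot reveals the pattern immediately; the coefficient does not.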

5. Hardware Reality: Database Query Planners

Why do Systems Engineers care about histograms? Because your database (Postgres, MySQL) uses them to decide how to run a query.

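The Cost-Based Optimizer (CBO) relies on statistics collected for each column, including a Histogram of the value distribution. In Postgres these statistics are gathered by ANALYZE, which autovacuum normally runs in the background.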

Scenario: Selectivity Estimation

Query: SELECT * FROM users WHERE age > 60

Uniform Assumption: If Postgres assumes ages are uniformly distributed between 0 and 100, it estimates that (100 - 60) / 100 = 40% of rows match. At that selectivity it is likely to choose a Sequential Scan (read the whole table).
Histogram Reality: If the histogram shows a skewed distribution (e.g., most users are 20-30), the planner can estimate that only about 5% of rows match. It is then more likely to choose an Index Scan, which is usually much faster when few rows qualify.
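
The two estimates can be sketched as follows. This is a simplified model, not Postgres's actual planner code: it assumes an equal-frequency histogram (each bucket holds the same share of rows) with linear interpolation inside a bucket, and the function names and bucket bounds are hypothetical.

```python
def uniform_selectivity(lo, hi, threshold):
    """Fraction of rows above threshold if values are spread evenly over [lo, hi]."""
    return max(0.0, min(1.0, (hi - threshold) / (hi - lo)))

def histogram_selectivity(bounds, threshold):
    """Fraction of rows above threshold, given equal-frequency bucket bounds.

    bounds has one more entry than there are buckets; every bucket is assumed
    to hold the same share of rows, spread evenly within the bucket.
    """
    num_buckets = len(bounds) - 1
    matching = 0.0
    for left, right in zip(bounds, bounds[1:]):
        if threshold <= left:
            matching += 1.0  # whole bucket lies above the threshold
        elif threshold < right:
            matching += (right - threshold) / (right - left)  # partial bucket
    return matching / num_buckets

# Hypothetical bounds for a young user base: 20 buckets, most of them narrow
# and packed into the 18-30 range, with one wide bucket covering 60-100.
age_bounds = [0, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26,
              27, 28, 29, 30, 32, 35, 39, 45, 60, 100]

uniform_est = uniform_selectivity(0, 100, 60)
histogram_est = histogram_selectivity(age_bounds, 60)
print(f"uniform estimate:   {uniform_est:.0%}")
print(f"histogram estimate: {histogram_est:.0%}")
```

With these made-up bounds only the last of the 20 buckets lies above 60, so the histogram estimate is far lower than the uniform one, and that gap is what changes the planner's choice of scan.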

[!IMPORTANT] Analyze Command: Running ANALYZE table_name in Postgres forces it to rebuild these histograms. If your query performance suddenly drops, it might be because the histograms are stale and the optimizer is making bad decisions based on old data distributions.

6. Implementation

The two programs below implement the math behind correlation by computing the Pearson correlation coefficient (r), first in Java and then in Go.

Java:

public class Correlation {
    public static void main(String[] args) {
        double[] x = {10, 20, 30, 40, 50};
        double[] y = {12, 24, 32, 45, 52};

        System.out.printf("Pearson Correlation: %.4f%n", pearsonCorrelation(x, y));
    }

    // Calculate Pearson Correlation Coefficient (r)
    // Formula: Covariance(x,y) / (StdDev(x) * StdDev(y))
    public static double pearsonCorrelation(double[] x, double[] y) {
        if (x.length != y.length) throw new IllegalArgumentException("Arrays must be same size");
        int n = x.length;

        double meanX = mean(x);
        double meanY = mean(y);

        double numerator = 0.0;
        double sumSqDiffX = 0.0;
        double sumSqDiffY = 0.0;

        for (int i = 0; i < n; i++) {
            double diffX = x[i] - meanX;
            double diffY = y[i] - meanY;

            numerator += diffX * diffY;
            sumSqDiffX += diffX * diffX;
            sumSqDiffY += diffY * diffY;
        }

        return numerator / Math.sqrt(sumSqDiffX * sumSqDiffY);
    }

    private static double mean(double[] data) {
        double sum = 0;
        for (double d : data) sum += d;
        return sum / data.length;
    }
}

Go:

package main

import (
	"fmt"
	"math"
)

func main() {
	x := []float64{10, 20, 30, 40, 50}
	y := []float64{12, 24, 32, 45, 52}

	r := pearsonCorrelation(x, y)
	fmt.Printf("Pearson Correlation: %.4f\n", r)
}

// Calculate Pearson Correlation Coefficient (r)
// Formula: Covariance(x,y) / (StdDev(x) * StdDev(y))
func pearsonCorrelation(x, y []float64) float64 {
	if len(x) != len(y) {
		panic("Arrays must be same size")
	}
	n := len(x)

	meanX := mean(x)
	meanY := mean(y)

	numerator := 0.0
	sumSqDiffX := 0.0
	sumSqDiffY := 0.0

	for i := 0; i < n; i++ {
		diffX := x[i] - meanX
		diffY := y[i] - meanY

		numerator += diffX * diffY
		sumSqDiffX += diffX * diffX
		sumSqDiffY += diffY * diffY
	}

	return numerator / math.Sqrt(sumSqDiffX*sumSqDiffY)
}

func mean(data []float64) float64 {
	sum := 0.0
	for _, v := range data {
		sum += v
	}
	return sum / float64(len(data))
}

7. Summary

Look at your data. Always.
Histograms show shape (skewness, modality).
Box Plots show summary stats and outliers.
Scatter Plots show relationships (correlation).