Kernel Density Estimation (KDE)

Histograms are the bread and butter of data visualization. But they have a fatal flaw: Binning Bias.

Depending on where you start your bins and how wide they are, the same dataset can look completely different.

Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. Think of it as a “smooth histogram” that doesn’t depend on arbitrary bin edges.

1. The Mechanics: Stacking Kernels

Imagine you have 5 data points: [2, 5, 7, 8, 12].

  1. Place a Kernel: Instead of dropping each point into a bin, place a smooth curve (a “kernel”, usually a Gaussian bell curve) centered directly on top of each data point.
  2. Sum Them Up: Add the height of all these individual curves together at every point along the x-axis.
  3. Normalize: Divide by the number of points (and the bandwidth) so the total area under the curve equals 1.

The resulting curve is your Density Estimate.
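The three steps above can be sketched directly in NumPy (a minimal illustration, assuming a Gaussian kernel and a hand-picked bandwidth h = 1.0):

```python
import numpy as np

def kde(x, data, h=1.0):
    """Evaluate a Gaussian KDE at point(s) x with bandwidth h."""
    x = np.atleast_1d(x)
    # Step 1: a Gaussian kernel centered on each data point
    kernels = np.exp(-0.5 * ((x[:, None] - data[None, :]) / h) ** 2) \
              / (h * np.sqrt(2 * np.pi))
    # Steps 2 & 3: sum the kernels and normalize by n
    return kernels.sum(axis=1) / len(data)

data = np.array([2, 5, 7, 8, 12])
print(kde(5.0, data)[0])  # ≈ 0.0924
```

With these five points and h = 1, the density at x = 5 comes out to roughly 0.0924, matching the Java and Go implementations later in this article.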


2. The Bandwidth Trade-off

The single most important choice in KDE is the bandwidth, which controls how much the estimate is smoothed.

  • Low Bandwidth: The curve is “spiky” and fits every data point (Overfitting).
  • High Bandwidth: The curve is “flat” and washes out the details (Underfitting).

If you plot the individual kernels as dashed lines, the KDE is simply their sum drawn as a single smooth curve.
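The over/under-smoothing trade-off can be checked numerically by counting local maxima of the estimate on a grid (a minimal sketch; the grid and bandwidth values here are arbitrary choices of mine):

```python
import numpy as np

def kde_grid(data, h, grid):
    # Gaussian KDE evaluated at every grid point
    k = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2)
    return k.sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def count_modes(y):
    # Count interior points that are higher than both neighbours
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

data = np.array([2.0, 5.0, 7.0, 8.0, 12.0])
grid = np.linspace(-2, 16, 1000)

print(count_modes(kde_grid(data, 0.3, grid)))  # low h: one spike per point
print(count_modes(kde_grid(data, 4.0, grid)))  # high h: a single smooth bump
```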

3. The Mathematics

The standard formula for KDE is:

f̂(x) = (1 / (nh)) Σᵢ₌₁ⁿ K( (x − xᵢ) / h )

Where:

  • n: Number of data points.
  • h: Bandwidth (smoothing parameter).
  • K: Kernel function (must integrate to 1).
  • xᵢ: The data points.

Bandwidth Selection

Choosing h is the “secret sauce” of KDE.

  • Scott’s Rule or Silverman’s Rule: Common heuristics used by default in software like SciPy and Seaborn. They try to minimize the mean integrated squared error (MISE).
  • If h is too small, the noise in the data is modeled as structure (high variance).
  • If h is too large, the structure in the data is washed out (high bias).
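For reference, Silverman's rule of thumb for a Gaussian kernel is h = 0.9 · min(σ̂, IQR/1.34) · n^(−1/5), which is easy to compute directly (a small sketch; note that SciPy's 'silverman' option uses a slightly different variant based only on the standard deviation):

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb for a 1-D Gaussian KDE."""
    n = len(data)
    sigma = np.std(data, ddof=1)  # sample standard deviation
    iqr = np.subtract(*np.percentile(data, [75, 25]))  # interquartile range
    return 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)

data = np.array([2, 5, 7, 8, 12])
print(silverman_bandwidth(data))  # ≈ 1.46 for this dataset
```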

4. Implementation Examples

Python (SciPy)

If you need the actual probability density values:

from scipy.stats import gaussian_kde
import numpy as np

data = np.array([2, 5, 7, 8, 12])
kde = gaussian_kde(data)

# Evaluate density at specific points
# This automatically selects bandwidth using Scott's Rule
print(f"Density at x=5: {kde(5)[0]:.4f}")
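SciPy also lets you override the default: gaussian_kde accepts a bw_method argument ('scott', 'silverman', or a scalar). Note that a scalar is used as a multiplier of the data's standard deviation, not as the bandwidth itself:

```python
import numpy as np
from scipy.stats import gaussian_kde

data = np.array([2, 5, 7, 8, 12])

kde_scott = gaussian_kde(data)                  # default: Scott's rule
kde_silver = gaussian_kde(data, bw_method='silverman')
kde_narrow = gaussian_kde(data, bw_method=0.1)  # deliberately undersmoothed

for kde in (kde_scott, kde_silver, kde_narrow):
    print(kde.factor, kde(5)[0])  # smoothing factor and density at x = 5
```

The undersmoothed estimate piles density onto the observation at x = 5, so its value there is noticeably higher than under Scott's rule.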

Java

We can implement the Gaussian kernel summation manually.

public class KDE {
    // Gaussian Kernel Function
    public static double gaussian(double x, double mean, double bandwidth) {
        return (1.0 / (bandwidth * Math.sqrt(2 * Math.PI))) *
               Math.exp(-0.5 * Math.pow((x - mean) / bandwidth, 2));
    }

    public static void main(String[] args) {
        double[] data = {2, 5, 7, 8, 12};
        double bandwidth = 1.0; // Fixed bandwidth for simplicity
        double x = 5.0; // Point to evaluate

        double density = 0;
        for (double val : data) {
            density += gaussian(x, val, bandwidth);
        }
        density /= data.length;

        System.out.printf("Density at x=%.1f: %.4f%n", x, density);
    }
}

Go

package main

import (
	"fmt"
	"math"
)

func gaussian(x, mean, bandwidth float64) float64 {
	return (1.0 / (bandwidth * math.Sqrt(2*math.Pi))) *
		math.Exp(-0.5*math.Pow((x-mean)/bandwidth, 2))
}

func main() {
	data := []float64{2, 5, 7, 8, 12}
	bandwidth := 1.0
	x := 5.0

	density := 0.0
	for _, val := range data {
		density += gaussian(x, val, bandwidth)
	}
	density /= float64(len(data))

	fmt.Printf("Density at x=%.1f: %.4f\n", x, density)
}

5. Summary

| Feature     | Histogram         | KDE                  |
| ----------- | ----------------- | -------------------- |
| Type        | Discrete (Bars)   | Continuous (Curve)   |
| Parameter   | Bin Count / Width | Bandwidth            |
| Shape       | Rough / Blocky    | Smooth               |
| Sensitivity | High (Bin Edges)  | Moderate (Bandwidth) |

> [!TIP]
> When to use KDE? Use it when you want to compare the shapes of multiple distributions on the same plot. Overlapping histograms are messy; overlapping KDE lines are elegant.