Kernel Density Estimation (KDE)

Histograms are the bread and butter of data visualization. But they have a fatal flaw: Binning Bias.

Depending on where you start your bins and how wide they are, the same dataset can look completely different.
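A quick sketch makes this concrete (a minimal NumPy illustration with an arbitrary five-point sample): the same data, binned with the same bin width but shifted edges, tells two different stories.

```python
import numpy as np

data = np.array([2, 5, 7, 8, 12])

# Same bin width (5), two different starting edges
counts_a, _ = np.histogram(data, bins=[0, 5, 10, 15])
counts_b, _ = np.histogram(data, bins=[1, 6, 11, 16])

print(counts_a)  # [1 3 1] -- looks like a central peak
print(counts_b)  # [2 2 1] -- looks like a plateau
```

Nothing about the data changed; only the bin edges moved.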

Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. Think of it as a “smooth histogram” that doesn’t depend on arbitrary bin edges.

Real-World Analogy: Heatmaps vs. Pins

Imagine you are a city planner mapping out coffee shops.

  • A Histogram is like dividing the city into rigid ZIP code grids and counting coffee shops per ZIP code. If a coffee shop is exactly on the border, it randomly gets thrown into one bin. The map’s “hot spots” change completely if you shift the grid slightly.
  • A KDE is like placing a glowing heat lamp over every single coffee shop. The heat lamps overlap. Where there are many coffee shops, the heat adds up to form a bright, continuous “hot zone.” There are no artificial borders.

1. The Mechanic: Stacking Kernels

Let’s trace this visually with an example. Suppose we measure the waiting times (in minutes) at a coffee shop during rush hour: [2, 5, 7, 8, 12].

  1. Place a Kernel: Instead of dropping each point into a bin, place a smooth curve (a “kernel”, usually a Gaussian bell curve) centered directly on top of each data point. For example, place a curve centered at x = 2, another at x = 5, etc.
  2. Sum Them Up: Add the height of all these individual curves together at every point along the x-axis. Since 7 and 8 are close together, their individual curves will overlap significantly, creating a high peak in the combined density estimate around x = 7.5. Conversely, the gap between 2 and 5 will result in a dip.
  3. Normalize: Divide by the number of points (and the bandwidth) so the total area under the curve equals 1.

The resulting curve is your Density Estimate.
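The three steps can be sketched in a few lines (a minimal NumPy illustration; the fixed bandwidth of 1.0 is an arbitrary choice):

```python
import numpy as np

data = np.array([2, 5, 7, 8, 12])
h = 1.0  # bandwidth, chosen arbitrarily for illustration

def kde(x):
    # Step 1: a Gaussian kernel centered on each data point, evaluated at x
    heights = np.exp(-0.5 * ((x - data) / h) ** 2) / np.sqrt(2 * np.pi)
    # Steps 2 & 3: sum the kernel heights and normalize by n*h
    return heights.sum() / (len(data) * h)

# The overlapping kernels at 7 and 8 produce a peak near 7.5...
print(f"{kde(7.5):.4f}")
# ...while the gap between 2 and 5 produces a dip
print(f"{kde(3.5):.4f}")
```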


2. Interactive: The KDE Smoother

Adjusting the bandwidth changes how closely the estimated density follows the data.

  • Low Bandwidth: The curve is “spiky” and fits every data point (Overfitting).
  • High Bandwidth: The curve is “flat” and washes out the details (Underfitting).
The dashed lines are individual kernels. The blue line is their sum (the KDE).
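The same effect can be reproduced without the interactive plot. Here is a sketch using SciPy's gaussian_kde, where a scalar bw_method is used directly as the bandwidth factor (the values 0.1 and 2.0 are arbitrary choices for illustration):

```python
from scipy.stats import gaussian_kde
import numpy as np

data = np.array([2, 5, 7, 8, 12])

# A scalar bw_method is used directly as the bandwidth factor
spiky = gaussian_kde(data, bw_method=0.1)   # low bandwidth
smooth = gaussian_kde(data, bw_method=2.0)  # high bandwidth

# Low bandwidth: near-zero density in the 2-5 gap, sharp peak at the 7-8 cluster
print(spiky(3.5)[0], spiky(7.5)[0])
# High bandwidth: the gap and the cluster look almost identical
print(smooth(3.5)[0], smooth(7.5)[0])
```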

3. The Mathematics

The standard formula for KDE is:

f̂(x) = (1 / nh) Σᵢ₌₁ⁿ K( (x − xᵢ) / h )

Where:

  • n: Number of data points.
  • h: Bandwidth (smoothing parameter).
  • K: Kernel function (must integrate to 1).
  • xᵢ: The data points.

Bandwidth Selection

Choosing h is the “secret sauce” of KDE.

  • Scott’s Rule or Silverman’s Rule: Common heuristics used by default in software like SciPy and Seaborn. They try to minimize the mean integrated squared error (MISE).
  • If h is too small, the noise in the data is modeled as structure (high variance).
  • If h is too large, the structure in the data is washed out (high bias).
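Both rules are simple enough to compute by hand. A sketch of their common one-dimensional forms (note that SciPy's built-in "silverman" option uses a slightly different constant and omits the IQR term):

```python
import numpy as np

data = np.array([2, 5, 7, 8, 12])
n = data.size
sigma = data.std(ddof=1)  # sample standard deviation

# Scott's rule (1-D): h = sigma * n^(-1/5)
h_scott = sigma * n ** (-1 / 5)

# Silverman's rule of thumb: h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5)
iqr = np.subtract(*np.percentile(data, [75, 25]))
h_silverman = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)

print(f"Scott: {h_scott:.3f}, Silverman: {h_silverman:.3f}")
```

Silverman's IQR term makes the rule more robust when outliers inflate the standard deviation.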

4. Implementation Examples

Python (SciPy)

If you need the actual probability density values:

from scipy.stats import gaussian_kde
import numpy as np

data = np.array([2, 5, 7, 8, 12])
kde = gaussian_kde(data)  # bandwidth selected automatically via Scott's Rule

# Evaluate the density at a specific point
print(f"Density at x=5: {kde(5)[0]:.4f}")

Java

We can implement the Gaussian kernel summation manually.

public class KDE {
  // Gaussian Kernel Function
  public static double gaussian(double x, double mean, double bandwidth) {
    return (1.0 / (bandwidth * Math.sqrt(2 * Math.PI))) *
       Math.exp(-0.5 * Math.pow((x - mean) / bandwidth, 2));
  }

  public static void main(String[] args) {
    double[] data = {2, 5, 7, 8, 12};
    double bandwidth = 1.0; // Fixed bandwidth for simplicity
    double x = 5.0; // Point to evaluate

    double density = 0;
    for (double val : data) {
      density += gaussian(x, val, bandwidth);
    }
    density /= data.length;

    System.out.printf("Density at x=%.1f: %.4f%n", x, density);
  }
}

Go

package main

import (
  "fmt"
  "math"
)

func gaussian(x, mean, bandwidth float64) float64 {
  return (1.0 / (bandwidth * math.Sqrt(2*math.Pi))) *
    math.Exp(-0.5*math.Pow((x-mean)/bandwidth, 2))
}

func main() {
  data := []float64{2, 5, 7, 8, 12}
  bandwidth := 1.0
  x := 5.0

  density := 0.0
  for _, val := range data {
    density += gaussian(x, val, bandwidth)
  }
  density /= float64(len(data))

  fmt.Printf("Density at x=%.1f: %.4f\n", x, density)
}

5. Summary

Feature       Histogram           KDE
Type          Discrete (Bars)     Continuous (Curve)
Parameter     Bin Count / Width   Bandwidth
Shape         Rough / Blocky      Smooth
Sensitivity   High (Bin Edges)    Moderate (Bandwidth)

[!TIP] When to use KDE? Use it when you want to compare the shapes of multiple distributions on the same plot. Overlapping histograms are messy; overlapping KDE lines are elegant.