Kernel Density Estimation (KDE)
Histograms are the bread and butter of data visualization. But they have a fatal flaw: Binning Bias.
Depending on where you start your bins and how wide they are, the same dataset can look completely different.
Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. Think of it as a “smooth histogram” that doesn’t depend on arbitrary bin edges.
1. The Mechanics: Stacking Kernels
Imagine you have 5 data points: [2, 5, 7, 8, 12].
- Place a Kernel: Instead of dropping each point into a bin, place a smooth curve (a “kernel”, usually a Gaussian bell curve) centered directly on top of each data point.
- Sum Them Up: Add the height of all these individual curves together at every point along the x-axis.
- Normalize: Divide by the number of points (and the bandwidth) so the total area under the curve equals 1.
The resulting curve is your Density Estimate.
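The three steps above can be sketched directly in NumPy (a minimal sketch, assuming a Gaussian kernel and a bandwidth of h = 1.0 for illustration). The final line checks that the normalization step really does make the area under the curve come out to 1:

```python
import numpy as np

data = np.array([2, 5, 7, 8, 12], dtype=float)
h = 1.0                           # bandwidth (assumed for illustration)
grid = np.linspace(-3, 17, 2001)  # x-axis to evaluate the estimate on

# Step 1 - Place a Kernel: one Gaussian curve centered on each data point
kernels = np.exp(-0.5 * ((grid[:, None] - data) / h) ** 2) / np.sqrt(2 * np.pi)

# Step 2 - Sum Them Up: add the curves' heights at every x
# Step 3 - Normalize: divide by n * h so the total area equals 1
density = kernels.sum(axis=1) / (len(data) * h)

area = density.sum() * (grid[1] - grid[0])     # numerical integral (Riemann sum)
print(f"Area under the estimate: {area:.4f}")  # ≈ 1.0
```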
2. Interactive: The KDE Smoother
Adjust the Bandwidth slider below to see how it affects the estimated density.
- Low Bandwidth: The curve is “spiky” and fits every data point (Overfitting).
- High Bandwidth: The curve is “flat” and washes out the details (Underfitting).
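The same experiment can be run numerically. A sketch using SciPy's `gaussian_kde` (the three bandwidth values are illustrative): the smaller the bandwidth, the taller and spikier the highest peak of the estimate becomes.

```python
import numpy as np
from scipy.stats import gaussian_kde

data = np.array([2, 5, 7, 8, 12], dtype=float)
grid = np.linspace(-2, 16, 361)

# bw_method is a multiplier on the sample standard deviation, so dividing
# by the std turns h into an absolute bandwidth.
peaks = []
for h in (0.3, 1.0, 4.0):
    kde = gaussian_kde(data, bw_method=h / data.std(ddof=1))
    peaks.append(kde(grid).max())
    print(f"h={h}: tallest peak = {peaks[-1]:.4f}")
```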
3. The Mathematics
The standard formula for KDE is:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

Where:
- n: Number of data points.
- h: Bandwidth (smoothing parameter).
- K: Kernel function (must integrate to 1).
- x_i: The data points.
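The formula translates almost symbol-for-symbol into code. A minimal NumPy sketch, assuming a Gaussian K and a fixed h = 1.0 for illustration:

```python
import numpy as np

def kde(x, data, h):
    """Estimate f(x) = 1/(n*h) * sum_i K((x - x_i) / h) with a Gaussian K."""
    u = (x - data) / h                              # (x - x_i) / h for each point
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K.sum() / (len(data) * h)                # divide by n * h

data = np.array([2, 5, 7, 8, 12], dtype=float)
print(f"Density at x=5: {kde(5.0, data, h=1.0):.4f}")  # ≈ 0.0924
```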
Bandwidth Selection
Choosing h is the “secret sauce” of KDE.
- Scott’s Rule or Silverman’s Rule: Common heuristics used by default in software like SciPy and Seaborn. They try to minimize the mean integrated squared error (MISE).
- If h is too small, the noise in the data is modeled as structure (high variance).
- If h is too large, the structure in the data is washed out (high bias).
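Both rules are cheap to compute by hand. Here is a sketch of the textbook one-dimensional forms (library defaults can differ slightly; SciPy's Silverman factor, for instance, omits the IQR term):

```python
import numpy as np

data = np.array([2, 5, 7, 8, 12], dtype=float)
n, sigma = len(data), data.std(ddof=1)  # sample standard deviation

# Scott's rule (1-D): h = sigma * n^(-1/5)
h_scott = sigma * n ** (-1 / 5)

# Silverman's rule of thumb (1-D): h = 0.9 * min(sigma, IQR / 1.34) * n^(-1/5)
iqr = np.subtract(*np.percentile(data, [75, 25]))
h_silverman = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)

print(f"Scott: h = {h_scott:.3f}, Silverman: h = {h_silverman:.3f}")
```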
4. Implementation Examples
Python (SciPy)
If you need the actual probability density values:
```python
from scipy.stats import gaussian_kde
import numpy as np

data = np.array([2, 5, 7, 8, 12])

# Bandwidth is selected automatically using Scott's Rule
kde = gaussian_kde(data)

# Evaluate the density at a specific point
print(f"Density at x=5: {kde(5)[0]:.4f}")
```
Java
We can implement the Gaussian kernel summation manually.
```java
public class KDE {
    // Gaussian kernel scaled by the bandwidth (a normal PDF),
    // so the final sum only needs dividing by n
    public static double gaussian(double x, double mean, double bandwidth) {
        return (1.0 / (bandwidth * Math.sqrt(2 * Math.PI))) *
               Math.exp(-0.5 * Math.pow((x - mean) / bandwidth, 2));
    }

    public static void main(String[] args) {
        double[] data = {2, 5, 7, 8, 12};
        double bandwidth = 1.0; // Fixed bandwidth for simplicity
        double x = 5.0;         // Point to evaluate

        double density = 0;
        for (double val : data) {
            density += gaussian(x, val, bandwidth);
        }
        density /= data.length;

        System.out.printf("Density at x=%.1f: %.4f%n", x, density);
    }
}
```
Go
The same logic ports directly to Go:
```go
package main

import (
	"fmt"
	"math"
)

// gaussian is the Gaussian kernel scaled by the bandwidth (a normal PDF).
func gaussian(x, mean, bandwidth float64) float64 {
	return (1.0 / (bandwidth * math.Sqrt(2*math.Pi))) *
		math.Exp(-0.5*math.Pow((x-mean)/bandwidth, 2))
}

func main() {
	data := []float64{2, 5, 7, 8, 12}
	bandwidth := 1.0 // Fixed bandwidth for simplicity
	x := 5.0         // Point to evaluate

	density := 0.0
	for _, val := range data {
		density += gaussian(x, val, bandwidth)
	}
	density /= float64(len(data))

	fmt.Printf("Density at x=%.1f: %.4f\n", x, density)
}
```
5. Summary
| Feature | Histogram | KDE |
|---|---|---|
| Type | Discrete (Bars) | Continuous (Curve) |
| Parameter | Bin Count / Width | Bandwidth |
| Shape | Rough / Blocky | Smooth |
| Sensitivity | High (Bin Edges) | Moderate (Bandwidth) |
> [!TIP]
> When to use KDE? Use it when you want to compare the shapes of multiple distributions on the same plot. Overlapping histograms are messy; overlapping KDE lines are elegant.