Statistics Collector

[!NOTE] This module explores the core principles of the Statistics Collector, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise. We’ll explore the math behind histograms and correlation.

1. Introduction: The Brain of the Planner

The Query Planner relies on Statistics to estimate how many rows a query will return. If these estimates are wrong, the planner will choose a bad execution plan (e.g., using a Nested Loop when it should have used a Hash Join).

These statistics are gathered by a background process (Auto-vacuum) or manually via the ANALYZE command, and stored in the system catalog.

2. pg_statistic and pg_stats

  • pg_statistic: The raw system table where stats are stored. It’s hard to read because it uses obscure column names (like stavista1, stanumbers1) and data types.
  • pg_stats: A user-friendly view built on top of pg_statistic. Always look here first.

Inspecting Stats

You can query pg_stats to see what Postgres knows about your data:

SELECT tablename, attname, null_frac, avg_width, n_distinct
FROM pg_stats
WHERE tablename = 'users';

Key Metrics

  1. null_frac: Fraction of column entries that are NULL. If this is high, an index on this column might be smaller if it’s a partial index (WHERE col IS NOT NULL).
  2. avg_width: Average width in bytes of the column’s entries. Important for calculating memory usage for sorts and hash tables.
  3. n_distinct:
    • Positive Value: The estimated exact number of distinct values (e.g., 5 means 5 unique states).
    • Negative Value: Represents a ratio of distinct values to total rows. For example, -0.5 means “roughly 50% of the values are unique” (or total_rows * 0.5 distinct values). This scales automatically as the table grows.

3. How Data Distribution is Stored

For complex distributions, Postgres stores two key data structures:

1. Most Common Values (MCV)

A list of the most frequent values in the column and their frequencies.

  • Example: In a status column, active might appear 80% of the time, pending 15%, and banned 5%.
  • Planner Use: If you query WHERE status = 'active', the planner sees the MCV entry and estimates 80% of the table will be returned.

2. Histogram

For values not in the MCV list, Postgres divides the data into buckets of equal frequency (equi-depth).

  • Structure: A list of boundaries. If the boundaries are [10, 20, 30, 40], it means roughly the same number of rows fall between 10-20 as between 20-30.
  • Planner Use: Used for range queries (WHERE age > 25). The planner calculates which buckets overlap with the range and sums their frequencies.

4. The ANALYZE Command

The ANALYZE command forces Postgres to recalculate statistics for a table.

ANALYZE users;

[!IMPORTANT] Auto-vacuum usually runs ANALYZE automatically when a significant percentage of the table changes. However, for massive bulk loads or temporary tables, you should run ANALYZE manually immediately after loading data to prevent bad plans.

5. Interactive: Histogram Builder

Draw a distribution of data to see how Postgres would convert it into a Histogram and MCV list. This helps visualize how the database “summarizes” your data.

Histogram Builder

Draw on the canvas to create a data distribution. See how Postgres summarizes it into buckets.

Value Range (0-100)
Frequency

Postgres Histogram (Buckets)

Draw or select a preset...

Most Common Values (MCV)

Draw or select a preset...

6. Extended Statistics

Sometimes, single-column statistics aren’t enough. If columns are correlated, Postgres estimates can be wildly inaccurate.

The Problem: Correlation (Bayesian Reality)

Imagine a table of cities:

  • city: ‘Paris’
  • country: ‘France’

If you query WHERE city = 'Paris' AND country = 'France', Postgres assumes these columns are statistically independent and calculates selectivity by multiplying their probabilities:

Selectivity = P(city = 'Paris') × P(country = 'France')

However, in reality, P(country=’France’ city=’Paris’) = 1.0. The columns are highly correlated. The naive calculation P(Paris) × P(France) yields a number far smaller than reality, causing the planner to radically underestimate the row count and potentially choose a slow Nested Loop scan.

The Solution: CREATE STATISTICS

You can tell Postgres to track correlations (dependencies) between columns.

CREATE STATISTICS city_country_stats (dependencies) ON city, country FROM cities;
ANALYZE cities;

Now, Postgres knows that city determines country and will adjust its estimates accordingly.

7. Next Steps

Now you understand the core components of query performance. Review everything with our Module Review & Cheat Sheet.