Statistics Collector

[!NOTE] This module explores the core principles of Statistics Collector, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Introduction

The Query Planner relies on Statistics to estimate how many rows a query will return. If these estimates are wrong, the planner will choose a bad execution plan (e.g., using a Nested Loop when it should have used a Hash Join).

These statistics are gathered by a background process (Auto-vacuum) or manually via the ANALYZE command, and stored in the system catalog.

2. pg_statistic and pg_stats

  • pg_statistic: The raw system table where stats are stored. It’s hard to read because it uses obscure column names (like stavista1, stanumbers1) and data types.
  • pg_stats: A user-friendly view built on top of pg_statistic. Always look here first.

Inspecting Stats

You can query pg_stats to see what Postgres knows about your data:

SELECT tablename, attname, null_frac, avg_width, n_distinct
FROM pg_stats
WHERE tablename = 'users';

Key Metrics

  1. null_frac: Fraction of column entries that are NULL. If this is high, an index on this column might be smaller if it’s a partial index (WHERE col IS NOT NULL).
  2. avg_width: Average width in bytes of the column’s entries. Important for calculating memory usage for sorts and hash tables.
  3. n_distinct:
    • Positive Value: The estimated exact number of distinct values (e.g., 5 means 5 unique states).
    • Negative Value: Represents a ratio of distinct values to total rows. For example, -0.5 means “roughly 50% of the values are unique” (or total_rows * 0.5 distinct values). This scales automatically as the table grows.

3. How Data Distribution is Stored

For complex distributions, Postgres stores two key data structures:

1. Most Common Values (MCV)

A list of the most frequent values in the column and their frequencies.

  • Example: In a status column, active might appear 80% of the time, pending 15%, and banned 5%.
  • Planner Use: If you query WHERE status = 'active', the planner sees the MCV entry and estimates 80% of the table will be returned.

2. Histogram

For values not in the MCV list, Postgres divides the data into buckets of equal frequency (equi-depth).

  • Structure: A list of boundaries. If the boundaries are [10, 20, 30, 40], it means roughly the same number of rows fall between 10-20 as between 20-30.
  • Planner Use: Used for range queries (WHERE age > 25). The planner calculates which buckets overlap with the range and sums their frequencies.

4. The ANALYZE Command

The ANALYZE command forces Postgres to recalculate statistics for a table.

ANALYZE users;

[!IMPORTANT] Auto-vacuum usually runs ANALYZE automatically when a significant percentage of the table changes. However, for massive bulk loads or temporary tables, you should run ANALYZE manually immediately after loading data to prevent bad plans.

5. Interactive: Histogram Builder

Draw a distribution of data to see how Postgres would convert it into a Histogram and MCV list. This helps visualize how the database “summarizes” your data.

Histogram Builder

Draw on the canvas to create a data distribution. See how Postgres summarizes it into buckets.

Value Range (0-100)
Frequency

Postgres Histogram (Buckets)

Draw or select a preset...

Most Common Values (MCV)

Draw or select a preset... ## 6. Extended Statistics Sometimes, single-column statistics aren't enough. If columns are correlated, Postgres estimates can be wildly inaccurate. ### The Problem: Correlation Imagine a table of `cities`: * `city`: 'Paris' * `country`: 'France' If you query `WHERE city = 'Paris' AND country = 'France'`, Postgres calculates selectivity as: `Selectivity(Paris) * Selectivity(France)` Since 'Paris' implies 'France', these events are NOT independent. The calculation `P(Paris) * P(France)` yields a much smaller number than reality, causing the planner to underestimate rows and potentially choose a bad index scan. ### The Solution: `CREATE STATISTICS` You can tell Postgres to track correlations (dependencies) between columns. ```sql CREATE STATISTICS city_country_stats (dependencies) ON city, country FROM cities; ANALYZE cities; ``` Now, Postgres knows that `city` determines `country` and will adjust its estimates accordingly. ## 7. Next Steps Now you understand the core components of query performance. Review everything with our [Module Review & Cheat Sheet](/postgres/03-query-performance/module-review/).