Statistics Collector
[!NOTE] This module explores the core principles of Statistics Collector, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. Introduction
The Query Planner relies on Statistics to estimate how many rows a query will return. If these estimates are wrong, the planner will choose a bad execution plan (e.g., using a Nested Loop when it should have used a Hash Join).
These statistics are gathered by a background process (Auto-vacuum) or manually via the ANALYZE command, and stored in the system catalog.
2. pg_statistic and pg_stats
pg_statistic: The raw system table where stats are stored. It’s hard to read because it uses obscure column names (likestavista1,stanumbers1) and data types.pg_stats: A user-friendly view built on top ofpg_statistic. Always look here first.
Inspecting Stats
You can query pg_stats to see what Postgres knows about your data:
SELECT tablename, attname, null_frac, avg_width, n_distinct
FROM pg_stats
WHERE tablename = 'users';
Key Metrics
null_frac: Fraction of column entries that are NULL. If this is high, an index on this column might be smaller if it’s a partial index (WHERE col IS NOT NULL).avg_width: Average width in bytes of the column’s entries. Important for calculating memory usage for sorts and hash tables.n_distinct:- Positive Value: The estimated exact number of distinct values (e.g.,
5means 5 unique states). - Negative Value: Represents a ratio of distinct values to total rows. For example,
-0.5means “roughly 50% of the values are unique” (ortotal_rows * 0.5distinct values). This scales automatically as the table grows.
- Positive Value: The estimated exact number of distinct values (e.g.,
3. How Data Distribution is Stored
For complex distributions, Postgres stores two key data structures:
1. Most Common Values (MCV)
A list of the most frequent values in the column and their frequencies.
- Example: In a
statuscolumn,activemight appear 80% of the time,pending15%, andbanned5%. - Planner Use: If you query
WHERE status = 'active', the planner sees the MCV entry and estimates 80% of the table will be returned.
2. Histogram
For values not in the MCV list, Postgres divides the data into buckets of equal frequency (equi-depth).
- Structure: A list of boundaries. If the boundaries are
[10, 20, 30, 40], it means roughly the same number of rows fall between 10-20 as between 20-30. - Planner Use: Used for range queries (
WHERE age > 25). The planner calculates which buckets overlap with the range and sums their frequencies.
4. The ANALYZE Command
The ANALYZE command forces Postgres to recalculate statistics for a table.
ANALYZE users;
[!IMPORTANT] Auto-vacuum usually runs
ANALYZEautomatically when a significant percentage of the table changes. However, for massive bulk loads or temporary tables, you should runANALYZEmanually immediately after loading data to prevent bad plans.
5. Interactive: Histogram Builder
Draw a distribution of data to see how Postgres would convert it into a Histogram and MCV list. This helps visualize how the database “summarizes” your data.
Histogram Builder
Draw on the canvas to create a data distribution. See how Postgres summarizes it into buckets.