MVCC Internals: The Physics of Concurrency

[!NOTE] This module explores the core principles of the subject, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

Imagine you are building a high-frequency trading platform. Millions of users are simultaneously reading stock prices, while background systems are constantly updating those exact same prices. In a traditional database using locks, a reader would be forced to wait for writers (and vice versa), bringing your entire platform to a grinding halt.

PostgreSQL solves this fundamental problem using MVCC (Multi-Version Concurrency Control).

[!TIP] Analogy: Think of MVCC like a Git repository. When you UPDATE a row, Postgres doesn’t overwrite the original data; it makes a new “commit” (a new version) and updates the HEAD pointer. Readers who started their transaction earlier are still looking at an older commit, completely unaffected by your changes.

Unlike traditional locking systems where a reader must block a writer to ensure consistency, MVCC allows multiple versions of the same row to exist simultaneously in the database.

[!IMPORTANT] The Golden Rule of Postgres Concurrency: Readers never block writers, and writers never block readers.

This is achieved by treating data as immutable at the tuple (row) level. When you “update” a row, Postgres doesn’t overwrite the data in place on the disk. Instead, it marks the old row as “dead” and inserts a brand new version of the row.

1. The Tuple Header (The Metadata)

Every row in a Postgres table (a Tuple) carries a hidden header containing metadata used for visibility checks. This header is 23 bytes (usually padded to 24) and comes before your actual column data.

The most critical fields for MVCC are:

Field Size Description
xmin 4 bytes The XID that created (inserted) this tuple version.
xmax 4 bytes The XID that deleted (or updated) this tuple version. If 0, the tuple is live.
cmin/cmax 4 bytes Command identifiers within a transaction (e.g., 1st statement, 2nd statement).
ctid 6 bytes The physical location of the next version of this tuple (Page ID, Offset). Used to chain updates.
infomask 2 bytes Bitmask of boolean flags (e.g., “Hint Bits” like XMIN_COMMITTED).

Visualizing the Tuple Header

xmin 4 bytes
xmax 4 bytes
cid 4 bytes
ctid 6 bytes
infomask 2 bytes
User Data Variable
23-byte Header (Metadata)
Actual Data (Columns)

2. Interactive Visibility Simulator

How does Postgres decide if you (Transaction A) can see a specific row? It compares your Transaction ID (XID) and snapshot against the row’s xmin and xmax.

Experiment with this simulator to understand the visibility rules.

MVCC Time Machine

Current Global XID: 100

Scenario: You are an observer running SELECT * FROM users; inside Transaction 105.

Action

Observer (You)

Your XID: 105
You see rows where:
  1. xmin < 105 (Committed)
  2. xmax = 0 OR xmax ≥ 105
Table is empty.

3. Visibility Rules: The Logic

When your transaction starts, Postgres takes a snapshot. This snapshot contains the Transaction ID of your transaction and a list of all transactions that were active (not committed yet) at that moment.

For a tuple to be visible to you:

  1. Creation Check (xmin):
    • The transaction that created it (xmin) must have committed.
    • The xmin must be “in the past” relative to your snapshot.
  2. Deletion Check (xmax):
    • The tuple must not be deleted (xmax = 0).
    • OR, if it is deleted, the transaction that deleted it (xmax) must not be visible to you (it committed after your snapshot started or is still running).

The “Dirty” Secret of UPDATE

In Postgres, UPDATE is actually a syntactic sugar for DELETE + INSERT.

When you run UPDATE users SET name='Bob' WHERE id=1;:

  1. Postgres marks the old tuple (name=’Alice’) as dead by setting its xmax to the current transaction ID.
  2. Postgres inserts a new tuple (name=’Bob’) with xmin set to the current transaction ID.
  3. It updates the index to point to the new tuple.

This is why UPDATE heavy workloads can be expensive—they generate “garbage” (dead tuples) just like inserts and deletes.

[!NOTE] Optimization (HOT Updates): To mitigate the cost of updating indexes for every row update, Postgres uses HOT (Heap-Only Tuples). If an UPDATE does not modify any indexed columns, and there is free space on the same physical page, Postgres will skip step 3. It links the old tuple directly to the new tuple on the same page, saving massive amounts of I/O.

4. VACUUM and Bloat

Since deleting a row only marks it as invisible (xmax), the data stays on disk. Over time, these dead tuples accumulate, causing Table Bloat.

The Cleanup Crew: VACUUM

The VACUUM process is responsible for reclaiming this space.

[!TIP] Analogy: VACUUM is like a Garbage Collector in languages like Java or Go, but for the database disk. It sweeps through the pages, finding tuples that are no longer visible to any active transaction, and marks their physical space as reusable.

  • Standard VACUUM: Scans the table for dead tuples (where xmax is visible to everyone). It marks that space as “free” in the Free Space Map (FSM). The file size does not shrink, but the space is reused for future inserts.
  • VACUUM FULL: Rewrites the entire table to a new file, removing all bloat. This reclaims disk space but locks the table exclusively, blocking all reads and writes.
  • Autovacuum: A background daemon that wakes up periodically to run standard VACUUM on tables with significant activity.

[!WARNING] Never Disable Autovacuum Disabling autovacuum is a common production mistake. It leads to:

  1. Unbounded table growth (bloat).
  2. Index bloat (slower queries).
  3. Transaction ID Wraparound (database shutdown).

5. Transaction ID Wraparound (The Apocalypse)

xmin and xmax are 32-bit integers. This means there are only ~4 billion IDs available. If the counter overflows, old transactions might suddenly appear to be in the “future” (invisible), causing catastrophic data loss where everything mysteriously vanishes.

[!CAUTION] War Story: The Sentry Outage In 2015, the popular error-tracking service Sentry experienced a massive outage caused by Transaction ID Wraparound. Their database stopped accepting new writes because autovacuum couldn’t keep up with the bloat. To protect against data corruption, Postgres intentionally shut itself down. The recovery took hours of downtime to manually vacuum the database.

To prevent this, Postgres uses a “frozen” XID concept and forces a special “Freeze” VACUUM to run on old data, replacing the xmin with a special FrozenTransactionId (2) to indicate “this is infinitely old and visible to everyone.”

6. Summary

  1. Readers don’t block writers: Because readers see an older version of the row.
  2. Immutability: Rows are never overwritten, only versioned.
  3. Space Cost: MVCC requires extra storage for xmin/xmax headers and dead tuples.
  4. Maintenance: Regular VACUUM is mandatory to recycle space occupied by dead tuples.