Join Algorithms

[!NOTE] This module explores the core principles of Join Algorithms, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Introduction

When a query requests data from multiple tables, Postgres must determine the most efficient way to match the rows. This is the job of the Join Algorithm.

Postgres implements three primary join algorithms:

  1. Nested Loop Join
  2. Hash Join
  3. Merge Join

The planner chooses the algorithm based on the estimated cost, which depends on table sizes, available memory (work_mem), and existing indexes.

2. Nested Loop Join

The simplest and most fundamental join algorithm.

How it Works

It takes two tables (or relations): an Outer relation and an Inner relation.

  1. For each row in the Outer relation…
  2. Scan the Inner relation to find matching rows.

Complexity

  • Time: O(N × M), where N is the number of outer rows and M is the number of inner rows.
  • Space: O(1)

Variations

  • Nested Loop with Seq Scan: Scans the entire inner table for every outer row. Extremely slow for large tables.
  • Index Nested Loop Join: Uses an index on the inner table. Time becomes O(N × log M). This is very efficient when N is small and M is large (but indexed).

Best For

  • Joining a small dataset (Outer) to a large, indexed dataset (Inner).
  • CROSS JOIN (Cartesian products).
Outer Table Inner Table For each row... ...scan inner

3. Hash Join

Used when joining two large datasets where no useful index exists.

How it Works

  1. Build Phase: Postgres takes the smaller of the two relations (Inner) and builds an in-memory Hash Table using the join key.
  2. Probe Phase: It scans the larger relation (Outer) row by row. For each row, it hashes the join key and checks the Hash Table for matches.

Complexity

  • Time: O(N + M) (linear time!).
  • Space: O(M) (must store the inner relation in memory).

Memory Considerations (work_mem)

The hash table must fit in work_mem.

  • In-Memory: If it fits, it’s extremely fast.
  • Disk Spill: If it doesn’t fit, Postgres splits the join into batches and writes temporary files to disk. This is called a Hybrid Hash Join and is significantly slower due to disk I/O.
  • Tuning: Increasing work_mem can prevent disk spills for Hash Joins.

Best For

  • Large, unindexed joins.
  • Equality joins (=) only.
1. Build (Inner) Hash Table In-Memory 2. Probe (Outer)

4. Merge Join

Used when both inputs are already sorted on the join key.

How it Works

  1. Sort both relations on the join key (unless they are already sorted by an index).
  2. Iterate through both lists simultaneously using two pointers (like the merge step in Merge Sort).

Complexity

  • Time: O(N + M) (if already sorted) or O(N log N + M log M) (if sorting is required).
  • Space: O(1) (very memory efficient).

Best For

  • Large datasets that are already sorted (e.g., clustered by primary key or using a B-Tree index).
  • Range joins (e.g., BETWEEN, >, <).
  • When work_mem is too small for a Hash Join.

5. Comparison Cheat Sheet

Algorithm Setup Cost Per-Row Cost Memory Usage Pre-requisites Best Use Case
Nested Loop Low High Very Low None Small Outer table, Indexed Inner table
Hash Join Medium (Hash) Low High work_mem Large tables, Equality joins
Merge Join High (Sort) Very Low Low Sorted Inputs Sorted large tables, Range joins

6. Interactive: Join Algorithm Race

Visualize how different algorithms process data. Note how Hash Join is heavily front-loaded (Build phase), while Nested Loop starts immediately but runs slower per row.

Join Algorithm Visualizer

See how each algorithm processes data.

Table A (Outer)
Table B (Inner)
Select an algorithm to start.

[!NOTE] Forcing Algorithms: You can temporarily force Postgres to use a specific method for testing:

SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Now likely to use Nested Loop

Use this only for debugging, never in production code!

7. Next Steps

How does Postgres know how many rows are in a table to make these cost decisions? It uses statistics. Learn about the Statistics Collector.