Join Algorithms

[!NOTE] This module explores the core principles of Join Algorithms, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Introduction

When a query requests data from multiple tables, Postgres must determine the most efficient way to match the rows. This is the job of the Join Algorithm.

Postgres implements three primary join algorithms:

Nested Loop Join
Hash Join
Merge Join

The planner chooses the algorithm based on the estimated cost, which depends on table sizes, available memory (work_mem), and existing indexes.

2. Nested Loop Join

The simplest and most fundamental join algorithm.

How it Works

It takes two tables (or relations): an Outer relation and an Inner relation.

For each row in the Outer relation…
Scan the Inner relation to find matching rows.

Complexity

Time: O(N × M), where N is the number of outer rows and M is the number of inner rows.
Space: O(1)

Variations

Nested Loop with Seq Scan: Scans the entire inner table for every outer row. Extremely slow for large tables.
Index Nested Loop Join: Uses an index on the inner table. Time becomes O(N × log M). This is very efficient when N is small and M is large (but indexed).

Best For

Joining a small dataset (Outer) to a large, indexed dataset (Inner).
CROSS JOIN (Cartesian products).

3. Hash Join

Used when joining two large datasets where no useful index exists.

How it Works

Build Phase: Postgres takes the smaller of the two relations (Inner) and builds an in-memory Hash Table using the join key.
Probe Phase: It scans the larger relation (Outer) row by row. For each row, it hashes the join key and checks the Hash Table for matches.

Complexity

Time: O(N + M) (linear time!).
Space: O(M) (must store the inner relation in memory).

Memory Considerations (`work_mem`)

The hash table must fit in work_mem.

In-Memory: If it fits, it’s extremely fast.
Disk Spill: If it doesn’t fit, Postgres splits the join into batches and writes temporary files to disk. This is called a Hybrid Hash Join and is significantly slower due to disk I/O.
Tuning: Increasing work_mem can prevent disk spills for Hash Joins.

Best For

Large, unindexed joins.
Equality joins (=) only.

4. Merge Join

Used when both inputs are already sorted on the join key.

How it Works

Sort both relations on the join key (unless they are already sorted by an index).
Iterate through both lists simultaneously using two pointers (like the merge step in Merge Sort).

Complexity

Time: O(N + M) (if already sorted) or O(N log N + M log M) (if sorting is required).
Space: O(1) (very memory efficient).

Best For

Large datasets that are already sorted (e.g., clustered by primary key or using a B-Tree index).
Range joins (e.g., BETWEEN, >, <).
When work_mem is too small for a Hash Join.

5. Comparison Cheat Sheet

Algorithm	Setup Cost	Per-Row Cost	Memory Usage	Pre-requisites	Best Use Case
Nested Loop	Low	High	Very Low	None	Small Outer table, Indexed Inner table
Hash Join	Medium (Hash)	Low	High	`work_mem`	Large tables, Equality joins
Merge Join	High (Sort)	Very Low	Low	Sorted Inputs	Sorted large tables, Range joins

6. Interactive: Join Algorithm Race

Visualize how different algorithms process data. Note how Hash Join is heavily front-loaded (Build phase), while Nested Loop starts immediately but runs slower per row.

Join Algorithm Visualizer

See how each algorithm processes data.

Table A (Outer)

Table B (Inner)

Select an algorithm to start.

[!NOTE] Forcing Algorithms: You can temporarily force Postgres to use a specific method for testing:
SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Now likely to use Nested Loop
Use this only for debugging, never in production code!

7. Next Steps

How does Postgres know how many rows are in a table to make these cost decisions? It uses statistics. Learn about the Statistics Collector.

Join Algorithms

Join Algorithms

1. Introduction

2. Nested Loop Join

How it Works

Complexity

Variations

Best For

3. Hash Join

How it Works

Complexity

Memory Considerations (work_mem)

Best For

4. Merge Join

How it Works

Complexity

Best For

5. Comparison Cheat Sheet

6. Interactive: Join Algorithm Race

Join Algorithm Visualizer

7. Next Steps

Memory Considerations (`work_mem`)