03. Chaining & Open Addressing
1. The Collision Reality
There are two main strategies: Separate Chaining (used by Java's HashMap and C++'s unordered_map) and Open Addressing (used by Python's dict).
Strategy 1: Separate Chaining
Each bucket in the array holds a Linked List (or a dynamic array). When a collision occurs, simply append the new key to the list at that index.
- Pros:
- Simple to implement.
- Gracefully handles high load factors (performance degrades gradually).
- Deletion is easy (just remove from linked list).
- Cons:
- Requires extra memory for pointers.
- Poor cache locality (linked list nodes are scattered in memory).
Strategy 2: Open Addressing (Linear Probing)
All elements are stored directly in the array. If the target bucket is full, we search for the next empty slot using a Probing Sequence.
- Linear Probing: Check `index`, then `index+1`, then `index+2`, …
- Pros:
- Better memory usage (no pointers).
- Excellent cache locality (sequential access).
- Cons:
- Clustering: Groups of occupied slots tend to merge, creating long “runs” that slow down insertions.
- Performance degrades sharply as load factor approaches 1.
- Deletion is tricky (requires “tombstones” or re-inserting elements).
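The probing and tombstone behavior described above can be sketched in a few dozen lines. This is a minimal illustration, not production code: the class name `LinearProbingMap` and the `_TOMBSTONE` sentinel are made-up names for this sketch, and a real table would rehash instead of raising when full.

```python
_TOMBSTONE = object()  # marks a deleted slot so probe chains stay unbroken


class LinearProbingMap:
    def __init__(self, size=8):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def _probe(self, key):
        """Yield index, index+1, index+2, ... wrapping around the array."""
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def put(self, key, value):
        first_free = None  # first tombstone seen, reusable on insert
        for i in self._probe(key):
            if self.keys[i] is None:
                # End of the probe chain: insert at the earliest free slot.
                target = first_free if first_free is not None else i
                self.keys[target] = key
                self.values[target] = value
                return
            if self.keys[i] is _TOMBSTONE:
                if first_free is None:
                    first_free = i
            elif self.keys[i] == key:
                self.values[i] = value  # key already present: update
                return
        if first_free is not None:  # table is tombstones + live keys only
            self.keys[first_free] = key
            self.values[first_free] = value
            return
        raise RuntimeError("table full; a real table would rehash here")

    def get(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:
                return -1  # empty slot: the key cannot be further along
            if self.keys[i] is not _TOMBSTONE and self.keys[i] == key:
                return self.values[i]
        return -1

    def remove(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:
                return
            if self.keys[i] is not _TOMBSTONE and self.keys[i] == key:
                # Setting None here would break lookups for any key whose
                # probe chain passes through this slot, hence the tombstone.
                self.keys[i] = _TOMBSTONE
                self.values[i] = None
                return
```

Note why `remove` cannot simply write `None`: a later key in the same probe chain would become unreachable, because `get` stops at the first empty slot.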
Hardware Truth: CPU Cache vs. Pointers
Modern software is bottlenecked by memory latency, not CPU cycles.
The Chaining Latency
In Separate Chaining, the table contains a pointer to a Node object.
- CPU reads bucket → Pointer address.
- CPU must fetch that address (Pointer Chase).
- Cache Miss Probability: High. Each jump in the linked list likely hits a different “cache line,” forcing a slow trip to main RAM.
The Open Addressing Advantage
In Linear Probing, the data is inline.
- CPU reads bucket i.
- CPU pre-fetches the next 64 bytes (the “Cache Line”) automatically.
- If a collision exists at i+1, the data is already in the L1/L2 Cache.
[!IMPORTANT] Because of this, Linear Probing is often 3x-10x faster than Chaining for small-to-medium datasets, even if it performs more comparisons. In modern systems, a comparison is nearly free; a memory fetch is expensive.
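The pointer-chase pattern can be sketched in Python, with the caveat that interpreter overhead masks most of the raw cache effect; this is an illustrative micro-benchmark shape, not a hardware measurement. The `ListNode` class and the timings printed are this sketch's own:

```python
import timeit


class ListNode:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


# The same 100k integers, stored contiguously and as a linked chain.
N = 100_000
arr = list(range(N))

head = ListNode(0)
curr = head
for i in range(1, N):
    curr.next = ListNode(i)
    curr = curr.next


def sum_array():
    return sum(arr)  # sequential access over a contiguous buffer


def sum_chain():
    total, node = 0, head
    while node:          # one pointer dereference per element
        total += node.value
        node = node.next
    return total


array_time = timeit.timeit(sum_array, number=10)
chain_time = timeit.timeit(sum_chain, number=10)
print(f"array: {array_time:.3f}s  chain: {chain_time:.3f}s")
```

Both loops compute the same sum; the difference is purely in memory access pattern, which is exactly the trade-off separating chaining from open addressing.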
2. Load Factor & Rehashing
The Load Factor (α) is defined as:
α = (Number of Elements) / (Table Size)
- For Chaining, α can be > 1.
- For Open Addressing, α must be ≤ 1.
When α exceeds a threshold (e.g., 0.75), the hash table must Rehash:
- Create a new array double the size.
- Iterate through all old elements.
- Re-insert them into the new array (their new index will be different!).
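The three rehashing steps above can be sketched directly. This minimal chaining table (buckets of `(key, value)` lists) is illustrative; the class name `GrowableTable` and the 0.75 threshold are this sketch's choices, though 0.75 matches the example threshold mentioned above:

```python
class GrowableTable:
    """Minimal chaining table that rehashes when the load factor exceeds 0.75."""

    def __init__(self):
        self.size = 8
        self.count = 0
        self.buckets = [[] for _ in range(self.size)]

    def _index(self, key, size):
        return hash(key) % size

    def put(self, key, value):
        bucket = self.buckets[self._index(key, self.size)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / self.size > 0.75:  # load factor α check
            self._rehash()

    def _rehash(self):
        new_size = self.size * 2                      # 1. double the size
        new_buckets = [[] for _ in range(new_size)]
        for bucket in self.buckets:                   # 2. visit all old elements
            for key, value in bucket:
                # 3. re-insert: the index is recomputed against new_size,
                # so most elements land in a different bucket.
                new_buckets[self._index(key, new_size)].append((key, value))
        self.size, self.buckets = new_size, new_buckets

    def get(self, key):
        for k, v in self.buckets[self._index(key, self.size)]:
            if k == key:
                return v
        return -1
```

Step 3 is the part beginners miss: `hash(key) % old_size` and `hash(key) % new_size` generally differ, so copying buckets wholesale would corrupt the table.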
3. Python Implementation (Chaining)
Here is how you might build a simple Hash Map from scratch using chaining.
```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None


class MyHashMap:
    def __init__(self):
        self.size = 1000
        self.buckets = [None] * self.size

    def _hash(self, key):
        return hash(key) % self.size

    def put(self, key, value):
        index = self._hash(key)
        head = self.buckets[index]
        # Check if key exists (update)
        curr = head
        while curr:
            if curr.key == key:
                curr.value = value
                return
            curr = curr.next
        # Insert new node at head (collision)
        new_node = Node(key, value)
        new_node.next = head
        self.buckets[index] = new_node

    def get(self, key):
        index = self._hash(key)
        curr = self.buckets[index]
        while curr:
            if curr.key == key:
                return curr.value
            curr = curr.next
        return -1  # Not found
```
4. Deep Dive Strategy Lab: Open Addressing vs. Chaining
Intuition Through Analogy
Think of this chapter like a library card catalog lookup. The goal is not to memorize a fixed trick, but to repeatedly answer:
- What is the bottleneck?
- Which constraint dominates (time, memory, latency, correctness)?
- Which representation makes the bottleneck easier to eliminate?