03. Chaining & Open Addressing
1. The Collision Reality
There are two main strategies: Separate Chaining (used by Java's HashMap and C++'s unordered_map) and Open Addressing (used by Python's dict).
Strategy 1: Separate Chaining
Each bucket in the array holds a Linked List (or a dynamic array). When a collision occurs, simply append the new key to the list at that index.
- Pros:
- Simple to implement.
- Gracefully handles high load factors (performance degrades gradually).
- Deletion is easy (just remove from linked list).
- Cons:
- Requires extra memory for pointers.
- Poor cache locality (linked list nodes are scattered in memory).
Strategy 2: Open Addressing (Linear Probing)
All elements are stored directly in the array. If the target bucket is full, we search for the next empty slot using a Probing Sequence.
- Linear Probing: Check `index`, then `index+1`, then `index+2`, …
- Pros:
- Better memory usage (no pointers).
- Excellent cache locality (sequential access).
- Cons:
- Clustering: Groups of occupied slots tend to merge, creating long “runs” that slow down insertions.
- Performance degrades sharply as load factor approaches 1.
- Deletion is tricky (requires “tombstones” or re-inserting elements).
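The probing and tombstone behavior described above can be sketched in a few dozen lines. This is a minimal illustration, not production code: the class name `LinearProbingMap` and the `_TOMBSTONE` sentinel are made-up names for this sketch, and a real table would rehash instead of raising when full.

```python
_TOMBSTONE = object()  # marks a deleted slot so probe chains stay unbroken


class LinearProbingMap:
    def __init__(self, size=8):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def _probe(self, key):
        """Yield index, index+1, index+2, ... wrapping around the array."""
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def put(self, key, value):
        first_free = None  # first tombstone seen, reusable on insert
        for i in self._probe(key):
            if self.keys[i] is None:
                # End of the probe chain: insert at the earliest free slot.
                target = first_free if first_free is not None else i
                self.keys[target] = key
                self.values[target] = value
                return
            if self.keys[i] is _TOMBSTONE:
                if first_free is None:
                    first_free = i
            elif self.keys[i] == key:
                self.values[i] = value  # key already present: update
                return
        if first_free is not None:  # table is tombstones + live keys only
            self.keys[first_free] = key
            self.values[first_free] = value
            return
        raise RuntimeError("table full; a real table would rehash here")

    def get(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:
                return -1  # empty slot: the key cannot be further along
            if self.keys[i] is not _TOMBSTONE and self.keys[i] == key:
                return self.values[i]
        return -1

    def remove(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:
                return
            if self.keys[i] is not _TOMBSTONE and self.keys[i] == key:
                # Setting None here would break lookups for any key whose
                # probe chain passes through this slot, hence the tombstone.
                self.keys[i] = _TOMBSTONE
                self.values[i] = None
                return
```

Note why `remove` cannot simply write `None`: a later key in the same probe chain would become unreachable, because `get` stops at the first empty slot.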
Hardware Truth: CPU Cache vs. Pointers
Modern software is bottlenecked by memory latency, not CPU cycles.
The Chaining Latency
In Separate Chaining, the table contains a pointer to a Node object.
- CPU reads bucket → Pointer address.
- CPU must fetch that address (Pointer Chase).
- Cache Miss Probability: High. Each jump in the linked list likely hits a different “cache line,” forcing a slow trip to main RAM.
The Open Addressing Advantage
In Linear Probing, the data is inline.
- CPU reads bucket i.
- CPU pre-fetches the next 64 bytes (the “Cache Line”) automatically.
- If a collision exists at i+1, the data is already in the L1/L2 Cache.
[!IMPORTANT] Because of this, Linear Probing is often 3x-10x faster than Chaining for small-to-medium datasets, even if it performs more comparisons. In modern systems, a comparison is nearly free; a memory fetch is expensive.
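The pointer-chase pattern can be sketched in Python, with the caveat that interpreter overhead masks most of the raw cache effect; this is an illustrative micro-benchmark shape, not a hardware measurement. The `ListNode` class and the timings printed are this sketch's own:

```python
import timeit


class ListNode:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


# The same 100k integers, stored contiguously and as a linked chain.
N = 100_000
arr = list(range(N))

head = ListNode(0)
curr = head
for i in range(1, N):
    curr.next = ListNode(i)
    curr = curr.next


def sum_array():
    return sum(arr)  # sequential access over a contiguous buffer


def sum_chain():
    total, node = 0, head
    while node:          # one pointer dereference per element
        total += node.value
        node = node.next
    return total


array_time = timeit.timeit(sum_array, number=10)
chain_time = timeit.timeit(sum_chain, number=10)
print(f"array: {array_time:.3f}s  chain: {chain_time:.3f}s")
```

Both loops compute the same sum; the difference is purely in memory access pattern, which is exactly the trade-off separating chaining from open addressing.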
2. Load Factor & Rehashing
The Load Factor (α) is defined as:
α = (Number of Elements) / (Table Size)
- For Chaining, α can be > 1.
- For Open Addressing, α must be ≤ 1.
When α exceeds a threshold (e.g., 0.75), the hash table must Rehash:
- Create a new array double the size.
- Iterate through all old elements.
- Re-insert them into the new array (their new index will be different!).
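The three rehashing steps above can be sketched directly. This minimal chaining table (buckets of `(key, value)` lists) is illustrative; the class name `GrowableTable` and the 0.75 threshold are this sketch's choices, though 0.75 matches the example threshold mentioned above:

```python
class GrowableTable:
    """Minimal chaining table that rehashes when the load factor exceeds 0.75."""

    def __init__(self):
        self.size = 8
        self.count = 0
        self.buckets = [[] for _ in range(self.size)]

    def _index(self, key, size):
        return hash(key) % size

    def put(self, key, value):
        bucket = self.buckets[self._index(key, self.size)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / self.size > 0.75:  # load factor α check
            self._rehash()

    def _rehash(self):
        new_size = self.size * 2                      # 1. double the size
        new_buckets = [[] for _ in range(new_size)]
        for bucket in self.buckets:                   # 2. visit all old elements
            for key, value in bucket:
                # 3. re-insert: the index is recomputed against new_size,
                # so most elements land in a different bucket.
                new_buckets[self._index(key, new_size)].append((key, value))
        self.size, self.buckets = new_size, new_buckets

    def get(self, key):
        for k, v in self.buckets[self._index(key, self.size)]:
            if k == key:
                return v
        return -1
```

Step 3 is the part beginners miss: `hash(key) % old_size` and `hash(key) % new_size` generally differ, so copying buckets wholesale would corrupt the table.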
3. Python Implementation (Chaining)
Here is how you might build a simple Hash Map from scratch using chaining.
```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None


class MyHashMap:
    def __init__(self):
        self.size = 1000
        self.buckets = [None] * self.size

    def _hash(self, key):
        return hash(key) % self.size

    def put(self, key, value):
        index = self._hash(key)
        head = self.buckets[index]
        # Check if key exists (update)
        curr = head
        while curr:
            if curr.key == key:
                curr.value = value
                return
            curr = curr.next
        # Insert new node at head (collision)
        new_node = Node(key, value)
        new_node.next = head
        self.buckets[index] = new_node

    def get(self, key):
        index = self._hash(key)
        curr = self.buckets[index]
        while curr:
            if curr.key == key:
                return curr.value
            curr = curr.next
        return -1  # Not found
```
4. Deep Dive Strategy Lab: Open Addressing vs. Chaining
Intuition Through Analogy
Think of this chapter like a library card catalog lookup. The goal is not to memorize a fixed trick, but to repeatedly answer:
- What is the bottleneck?
- Which constraint dominates (time, memory, latency, correctness)?
- Which representation makes the bottleneck easier to eliminate?