Hash Functions & Collisions

[!NOTE] This module explores the core principles of Hash Functions & Collisions, deriving them from first principles and hardware constraints to build production-ready expertise.

1. The Hook: Instant Retrieval

How do we find data instantly without scanning every element? Answer: Deterministic Mapping.

We use a Hash Function to convert a “Key” (String, Object) into an “Index” (Integer). This allows us to jump directly to the memory address where the data is stored. It is the “teleportation” of data structures.
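As a minimal sketch of that mapping (class and method names here are illustrative, not a library API), a key's hash code can be reduced to a bucket index like this:

```java
public class HashIndexDemo {
    // Map an arbitrary key to one of `buckets` slots.
    // Masking with 0x7fffffff clears the sign bit, since
    // hashCode() may return a negative int.
    static int indexFor(String key, int buckets) {
        return (key.hashCode() & 0x7fffffff) % buckets;
    }

    public static void main(String[] args) {
        // Deterministic: the same key always lands in the same bucket.
        System.out.println("\"Alice\" -> bucket " + indexFor("Alice", 8));
        System.out.println("\"Bob\"   -> bucket " + indexFor("Bob", 8));
    }
}
```

The same key always produces the same index, which is what makes the direct "jump" to the data possible.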


2. The 4 Pillars of Hashing Mastery

| Pillar | Focus | Why it matters |
| --- | --- | --- |
| 1. Intuition | Space-Time Tradeoff | Memory is cheap; time is expensive. |
| 2. Rigor | Birthday Paradox | Understanding why collisions happen so early (O(√N)). |
| 3. Hardware | Open Addressing | Exploiting CPU caches with linear probing vs. pointer chasing. |
| 4. Patterns | Sliding & Summing | Solving Subarray Sums and String Search in O(N). |
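To make the Patterns pillar concrete, the classic O(N) Subarray Sum technique keeps a hash map of prefix sums. A sketch (names are illustrative): a subarray ending at index j sums to `target` exactly when the prefix sum `prefix[j] - target` has been seen before.

```java
import java.util.HashMap;
import java.util.Map;

public class SubarraySum {
    // Count subarrays summing to target in O(N) using a map of prefix sums.
    static int countSubarrays(int[] nums, int target) {
        Map<Long, Integer> seen = new HashMap<>();
        seen.put(0L, 1);              // the empty prefix
        long prefix = 0;
        int count = 0;
        for (int x : nums) {
            prefix += x;
            // How many earlier prefixes differ from this one by exactly target?
            count += seen.getOrDefault(prefix - target, 0);
            seen.merge(prefix, 1, Integer::sum);
        }
        return count;
    }

    public static void main(String[] args) {
        // [1,1] appears twice as a subarray of [1,1,1]
        System.out.println(countSubarrays(new int[]{1, 1, 1}, 2));
    }
}
```

The hash map turns a "check every earlier starting point" scan (O(N²)) into a single O(1) lookup per element.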

3. The Core Concept

Imagine a library where instead of searching for a book alphabetically (O(log N)), you have a magic formula. You feed the book title into the formula, and it tells you exactly which shelf and slot the book is in.


4. Interactive: Hash Table Visualizer

See how keys map to buckets in a small hash table (Size 8).
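Since the interactive widget cannot run on the page here, the sketch below is a rough text stand-in: it prints which of the 8 buckets each of a few arbitrarily chosen sample keys lands in.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketVisualizer {
    static int indexFor(String key, int size) {
        return (key.hashCode() & 0x7fffffff) % size;
    }

    public static void main(String[] args) {
        int size = 8;
        // One list per bucket; keys sharing a bucket are collisions.
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < size; i++) buckets.add(new ArrayList<>());
        for (String key : new String[]{"apple", "banana", "cherry", "date", "fig"}) {
            buckets.get(indexFor(key, size)).add(key);
        }
        for (int i = 0; i < size; i++) {
            System.out.println("bucket " + i + ": " + buckets.get(i));
        }
    }
}
```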


5. The Mathematical Anchor: The Birthday Paradox

Why do we need collision handling so early? If we have 1000 buckets, shouldn’t we be safe until we have 1000 items? No.

The Probability Trap

In a room of 23 people, there is a 50% chance that two people share a birthday.

  • There are 365 days (buckets).
  • We only have 23 people (items).
  • Intuition (Pigeonhole) says we are safe. Math (Probability) says we are colliding.

For Hash Tables

The number of items k needed to find a collision with probability P in N buckets is approximately: k ≈ √(2N ln(1/(1-P)))

For a table with N buckets, you can expect a collision after roughly √N insertions.

  • For N=1,000, collisions start around 32 items (the precise P = 0.5 bound gives k ≈ 37).
  • This is why we use a Load Factor (typically 0.75) and why robust collision resolution isn’t “nice to have”—it’s a mathematical requirement.
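The bound above is easy to check numerically. This sketch evaluates k ≈ √(2N ln(1/(1−P))) for the classic birthday case and for a 1,000-bucket table:

```java
public class BirthdayBound {
    // Approximate number of insertions before a collision occurs
    // with probability p, given n buckets:
    //   k ~= sqrt(2 * n * ln(1 / (1 - p)))
    static double collisionThreshold(int n, double p) {
        return Math.sqrt(2.0 * n * Math.log(1.0 / (1.0 - p)));
    }

    public static void main(String[] args) {
        // 365 "buckets" (days), 50% chance: ~23 people, the classic paradox.
        System.out.printf("365 buckets, p=0.5: k ~= %.1f%n", collisionThreshold(365, 0.5));
        // 1000 buckets: a collision is already likely after a few dozen items.
        System.out.printf("1000 buckets, p=0.5: k ~= %.1f%n", collisionThreshold(1000, 0.5));
    }
}
```

For N = 365 and P = 0.5 this yields k ≈ 22.5, matching the 23-person birthday result.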

6. The Collision Problem

The Pigeonhole Principle states that if you have N items and M buckets, and N > M, at least one bucket must contain more than one item. Since the universe of keys is infinite and memory is finite, collisions are inevitable.

Collision Resolution Strategies

  1. Separate Chaining: Each bucket holds a linked list. If a collision happens, append to the list. (Most common).
  2. Open Addressing: If the bucket is full, find the next available slot (Linear/Quadratic probing).
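A minimal separate-chaining table (strategy 1) might look like the sketch below. The class name and fixed bucket count are illustrative; a production table would also resize as the load factor grows.

```java
import java.util.LinkedList;

// Each bucket holds a linked list of entries; colliding keys share a list.
public class ChainedMap {
    static final class Entry {
        final String key;
        int value;
        Entry(String key, int value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedMap(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
    }

    private int indexFor(String key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(String key, int value) {
        for (Entry e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) { e.value = value; return; } // update in place
        }
        buckets[indexFor(key)].add(new Entry(key, value));      // collision? append
    }

    Integer get(String key) {
        for (Entry e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) return e.value;
        }
        return null; // absent
    }

    public static void main(String[] args) {
        ChainedMap m = new ChainedMap(8);
        m.put("Alice", 25);
        m.put("Bob", 30);
        System.out.println(m.get("Alice")); // 25
        System.out.println(m.get("Carol")); // null
    }
}
```

Note that both `put` and `get` walk the chain, which is why worst-case lookup degrades to O(N) when every key collides.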

7. Implementation: Standard Map Usage

Java (HashMap)

Map<String, Integer> map = new HashMap<>();

// Insert: O(1) Average
map.put("Alice", 25);

// Lookup: O(1) Average
if (map.containsKey("Alice")) {
  int age = map.get("Alice");
}

Go (map)

m := make(map[string]int)

// Insert: O(1) Average
m["Alice"] = 25

// Lookup: O(1) Average
age, exists := m["Alice"]

8. Summary & Complexity

  • Time Complexity: Average O(1), Worst Case O(N) (if all keys collide).
  • Hash Qualities: Deterministic, Uniformly Distributed, Fast.
  • Load Factor: Usually 0.75. Beyond this, the table resizes (doubles) to maintain O(1) performance.
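Java's real API exposes both knobs directly: `HashMap` takes an initial capacity and a load factor in its constructor. A small sketch, making the default 0.75 explicit:

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // 16 initial buckets; the table resizes (doubles) once more than
        // 16 * 0.75 = 12 entries are present.
        Map<String, Integer> map = new HashMap<>(16, 0.75f);
        for (int i = 0; i < 13; i++) {
            map.put("k" + i, i); // the 13th insert triggers a resize internally
        }
        System.out.println(map.size()); // 13 -- the resize is invisible to callers
    }
}
```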

9. Deep Dive Strategy Lab: Hashing Basics

Intuition Through Analogy

Think of this like finding a student in a massive school. Instead of walking through every classroom (Linear Search) or assuming they are in alphabetical order (Binary Search), you check their ID Number, which tells you exactly which locker is theirs (Hashing).

Mathematical Anchor: Polynomial Rolling Hash

For strings, a robust hash treats the string as a number in base P: hash(s) = (Σ s[i] · P^(n-1-i)) mod M, summing over i = 0 … n-1. Because each character is weighted by its position, “abc” and “cba” hash differently; a plain character sum would not distinguish them.
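One way to implement this is Horner's rule, which evaluates the same polynomial left to right; P = 31 and M = 10⁹ + 7 below are common choices assumed for illustration, not the only valid ones.

```java
public class PolyHash {
    static final long P = 31;
    static final long M = 1_000_000_007L;

    // Polynomial rolling hash: treats s as a number in base P, mod M.
    static long hash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h * P + s.charAt(i)) % M; // Horner's rule: h = h*P + s[i]
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc"));
        System.out.println(hash("cba")); // different: position matters
    }
}
```

Taking the modulus at every step keeps intermediate values inside a `long`, avoiding overflow for strings of any length.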