Hash Functions & Collisions
[!NOTE] This module covers the core principles of Hash Functions & Collisions: why hashing gives near-constant-time lookup, why collisions are mathematically unavoidable, and how to handle them efficiently in practice.
1. The Hook: Instant Retrieval
How do we find data instantly without scanning every element? Answer: Deterministic Mapping.
We use a Hash Function to convert a “Key” (String, Object) into an “Index” (Integer). This allows us to jump directly to the memory address where the data is stored. It is the “teleportation” of data structures.
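As a minimal sketch of this key-to-index mapping (FNV-1a from Go's standard `hash/fnv` package is one choice of hash function; the key and bucket count are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// indexFor maps a string key to a bucket index. Any uniformly
// distributed hash works; FNV-1a is used here as one example.
func indexFor(key string, numBuckets int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numBuckets))
}

func main() {
	// The mapping is deterministic: the same key always produces
	// the same index, so a lookup can jump straight to its bucket.
	fmt.Println(indexFor("Alice", 8) == indexFor("Alice", 8)) // always true
}
```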
2. The 4 Pillars of Hashing Mastery
| Pillar | Focus | Why it matters |
|---|---|---|
| 1. Intuition | Space-Time Tradeoff | Memory is cheap; time is expensive. |
| 2. Rigor | Birthday Paradox | Understanding why collisions happen so early (O(√N)). |
| 3. Hardware | Open Addressing | Exploiting CPU caches with linear probing vs. pointer chasing. |
| 4. Patterns | Sliding & Summing | Solving Subarray Sums and String Search in O(N). |
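As an example of the fourth pillar, counting subarrays that sum to a target k can be done in O(N) by storing prefix-sum frequencies in a hash map. A sketch (function name and sample input are illustrative):

```go
package main

import "fmt"

// countSubarraysWithSum counts contiguous subarrays summing to k
// in O(n): for each prefix sum, a hash map tells us how many
// earlier prefixes differ from it by exactly k.
func countSubarraysWithSum(nums []int, k int) int {
	seen := map[int]int{0: 1} // the empty prefix has sum 0
	sum, count := 0, 0
	for _, x := range nums {
		sum += x
		// Each earlier prefix with sum-k closes a subarray summing to k.
		count += seen[sum-k]
		seen[sum]++
	}
	return count
}

func main() {
	fmt.Println(countSubarraysWithSum([]int{1, 2, 3, -3, 3}, 3)) // prints 5
}
```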
3. The Core Concept
Imagine a library where instead of searching for a book alphabetically (O(log N)), you have a magic formula. You feed the book title into the formula, and it tells you exactly which shelf and slot the book is in.
4. Interactive: Hash Table Visualizer
See how keys map to buckets in a small hash table (Size 8).
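In a static document, a small sketch can stand in for the visualizer: hash a handful of keys into 8 buckets and print where each lands (the key names are illustrative, and FNV-1a is one choice of hash):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a key to one of `size` buckets using FNV-1a.
func bucket(key string, size uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % size
}

func main() {
	buckets := make([][]string, 8)
	for _, k := range []string{"Alice", "Bob", "Carol", "Dave", "Eve"} {
		b := bucket(k, 8)
		buckets[b] = append(buckets[b], k)
	}
	// Any bucket holding two or more keys shows a collision.
	for i, keys := range buckets {
		fmt.Printf("bucket %d: %v\n", i, keys)
	}
}
```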
5. The Mathematical Anchor: The Birthday Paradox
Why do we need collision handling so early? If we have 1000 buckets, shouldn’t we be safe until we have 1000 items? No.
The Probability Trap
In a room of 23 people, there is a 50% chance that two people share a birthday.
- There are 365 days (buckets).
- We only have 23 people (items).
- Intuition (Pigeonhole) says we are safe. Math (Probability) says we are colliding.
For Hash Tables
The number of items k needed to find a collision with probability P in N buckets is approximately: k ≈ √(2N ln(1/(1-P)))
For a table of size M, you can expect the first collision after roughly √M insertions (with P = 0.5 the formula gives ≈ 1.18√M).
- For M=1,000, collisions typically start around 32–37 items.
- This is why we use a Load Factor (typically 0.75) and why robust collision resolution isn’t “nice to have”—it’s a mathematical requirement.
6. The Collision Problem
The Pigeonhole Principle states that if you have N items and M buckets, and N > M, at least one bucket must contain more than one item. Since the universe of keys is infinite and memory is finite, collisions are inevitable.
Collision Resolution Strategies
- Separate Chaining: Each bucket holds a linked list. If a collision happens, append to the list. (Most common).
- Open Addressing: If the bucket is full, find the next available slot (Linear/Quadratic probing).
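A minimal separate-chaining sketch, assuming FNV-1a as the hash and omitting resizing/load-factor logic for brevity:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type entry struct {
	key   string
	value int
	next  *entry // chain of colliding keys in the same bucket
}

type chainedMap struct {
	buckets []*entry
}

func newChainedMap(size int) *chainedMap {
	return &chainedMap{buckets: make([]*entry, size)}
}

func (m *chainedMap) index(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(m.buckets)))
}

func (m *chainedMap) put(key string, value int) {
	i := m.index(key)
	for e := m.buckets[i]; e != nil; e = e.next {
		if e.key == key { // key already present: overwrite
			e.value = value
			return
		}
	}
	// Empty bucket or collision: prepend a new node to the chain.
	m.buckets[i] = &entry{key: key, value: value, next: m.buckets[i]}
}

func (m *chainedMap) get(key string) (int, bool) {
	for e := m.buckets[m.index(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.value, true
		}
	}
	return 0, false
}

func main() {
	m := newChainedMap(8)
	m.put("Alice", 25)
	m.put("Bob", 31)
	if age, ok := m.get("Alice"); ok {
		fmt.Println(age) // prints 25
	}
}
```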
7. Implementation: Standard Map Usage
Java (HashMap)
```java
import java.util.HashMap;
import java.util.Map;

Map<String, Integer> map = new HashMap<>();
// Insert: O(1) Average
map.put("Alice", 25);
// Lookup: O(1) Average
if (map.containsKey("Alice")) {
    int age = map.get("Alice");
}
```
Go (map)
```go
m := make(map[string]int)
// Insert: O(1) Average
m["Alice"] = 25
// Lookup: O(1) Average
age, exists := m["Alice"]
```
8. Summary & Complexity
- Time Complexity: Average O(1), Worst Case O(N) (if all keys collide).
- Hash Qualities: Deterministic, Uniformly Distributed, Fast.
- Load Factor: Usually 0.75. Beyond this, the table resizes (doubles) to maintain O(1) performance.
9. Deep Dive Strategy Lab: Hashing Basics
Intuition Through Analogy
Think of this like finding a student in a massive school. Instead of walking through every classroom (Linear Search) or assuming they are in alphabetical order (Binary Search), you check their ID Number, which tells you exactly which locker is theirs (Hashing).
Mathematical Anchor: Polynomial Rolling Hash
For strings, a robust hash treats the string as a number in base P: hash(s) = (Σ s[i] · P^(n-1-i)) mod M, summing over i = 0 … n-1. Because each character is weighted by its position, "abc" and "cba" hash differently (barring a collision mod M).
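A sketch of this hash using Horner's rule (P = 31 and M = 10⁹ + 7 are common choices, assumed here):

```go
package main

import "fmt"

// polyHash computes (sum of s[i] * P^(n-1-i)) mod M via Horner's
// rule: multiply the running hash by P, add the next character,
// and reduce mod M at every step to avoid overflow.
func polyHash(s string) uint64 {
	const P, M = 31, 1_000_000_007
	var h uint64
	for i := 0; i < len(s); i++ {
		h = (h*P + uint64(s[i])) % M
	}
	return h
}

func main() {
	// Position-sensitive: reordering characters changes the hash.
	fmt.Println(polyHash("abc") != polyHash("cba")) // true
}
```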