Hash Functions & Collisions
[!NOTE] This module explores the core principles of Hash Functions & Collisions, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Hook: Instant Retrieval
How do we find data instantly without scanning every element? Answer: Deterministic Mapping.
We use a Hash Function to convert a “Key” (String, Object) into an “Index” (Integer). This allows us to jump directly to the memory address where the data is stored. It is the “teleportation” of data structures.
2. The 4 Pillars of Hashing Mastery
| Pillar | Focus | Why it matters |
|---|---|---|
| 1. Intuition | Space-Time Tradeoff | Memory is cheap; time is expensive. |
| 2. Rigor | Birthday Paradox | Understanding why collisions happen so early (O(√N)). |
| 3. Hardware | Open Addressing | Exploiting CPU caches with linear probing vs. pointer chasing. |
| 4. Patterns | Sliding & Summing | Solving Subarray Sums and String Search in O(N). |
3. The Core Concept
Imagine a library where instead of searching for a book alphabetically (O(log N)), you have a magic formula. You feed the book title into the formula, and it tells you exactly which shelf and slot the book is in.
4. Interactive: Hash Table Visualizer
See how keys map to buckets in a small hash table (Size 8).
5. The Mathematical Anchor: The Birthday Paradox
Why do we need collision handling so early? If we have 1000 buckets, shouldn’t we be safe until we have 1000 items? No.
The Probability Trap
In a room of 23 people, there is a 50% chance that two people share a birthday.
- There are 365 days (buckets).
- We only have 23 people (items).
- Intuition (Pigeonhole) says we are safe. Math (Probability) says we are colliding.
For Hash Tables
The number of items k needed to find a collision with probability P in N buckets is approximately: k ≈ √(2N ln(1/(1-P)))
For a table of size M, you can expect a collision after roughly √M insertions.
- For M=1,000, collisions start around 32 items.
- This is why we use a Load Factor (typically 0.75) and why robust collision resolution isn’t “nice to have”—it’s a mathematical requirement.
6. The Collision Problem
The Pigeonhole Principle states that if you have N items and M buckets, and N > M, at least one bucket must contain more than one item. Since the universe of keys is infinite and memory is finite, collisions are inevitable.
Collision Resolution Strategies
- Separate Chaining: Each bucket holds a linked list. If a collision happens, append to the list. (Most common).
- Open Addressing: If the bucket is full, find the next available slot (Linear/Quadratic probing).
7. Implementation: Standard Map Usage
Java (HashMap)
Map<String, Integer> map = new HashMap<>();
// Insert: O(1) Average
map.put("Alice", 25);
// Lookup: O(1) Average
if (map.containsKey("Alice")) {
int age = map.get("Alice");
}
Go (map)
m := make(map[string]int)
// Insert: O(1) Average
m["Alice"] = 25
// Lookup: O(1) Average
age, exists := m["Alice"]
8. Summary & Complexity
- Time Complexity: Average O(1), Worst Case O(N) (if all keys collide).
- Hash Qualities: Deterministic, Uniformly Distributed, Fast.
- Load Factor: Usually 0.75. Beyond this, the table resizes (doubles) to maintain O(1) performance.