Tries (Prefix Trees)
Master the Trie data structure. Implement efficient autocomplete, spell checkers, and prefix search in O(L) time, and explore the space trade-offs and system-design implications.
1. What is a Trie?
A Trie (derived from Retrieval) is a tree-based data structure used to efficiently store and retrieve keys in a dataset of strings. Unlike a Binary Search Tree, nodes in a Trie do not store the key associated with that node. Instead, the path from the root to the node defines the key.
Why not a HashMap?
A HashMap can look up “Apple” in O(1) (Average). But what if you want all words starting with “App”?
- HashMap: Must scan every key. O(N).
- Trie: Just walks down ‘A’ → ‘P’ → ‘P’. O(PrefixLength).
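To make the contrast concrete, here is a minimal sketch (the class name `PrefixScanDemo` and the sample words are illustrative) of what a HashMap forces you to do for a prefix query: touch every key.

```java
import java.util.*;

// Illustrative sketch: prefix search over a HashMap must scan all N keys,
// while a Trie only walks the prefix path (see the implementation below).
class PrefixScanDemo {
    static List<String> prefixScan(Map<String, Integer> dict, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String key : dict.keySet()) {          // O(N): visits every key
            if (key.startsWith(prefix)) matches.add(key);
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        for (String w : new String[]{"apple", "app", "apply", "banana"}) dict.put(w, 1);
        System.out.println(prefixScan(dict, "app")); // [app, apple, apply]
    }
}
```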
Real-World Examples
- Search Engines (Google Autocomplete): When you type “sys…”, it instantly suggests “system design”, “systemctl”, etc. A massive distributed Trie backed by caching makes this possible.
- IP Routing (Longest Prefix Match): Routers use variations of Tries (like Radix Trees) to quickly determine where to send an IP packet based on its prefix.
- Spell Checkers / Boggle Solvers: Finding all valid words on a board by exploring paths in a Trie.
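Longest Prefix Match can be sketched with a binary trie over address bits. This is an illustrative toy (real routers use heavily optimized radix/LC-trie variants); routes are given as bit strings here, and the deepest matching route seen along the walk wins.

```java
// Toy Longest Prefix Match: a binary trie keyed by address bits.
// The class name LpmTrie and the "ISP-A"/"ISP-B" hops are illustrative.
class LpmTrie {
    LpmTrie[] child = new LpmTrie[2];
    String nextHop;                        // non-null if a route ends here

    void insert(String prefixBits, String hop) {
        LpmTrie n = this;
        for (char b : prefixBits.toCharArray()) {
            int i = b - '0';
            if (n.child[i] == null) n.child[i] = new LpmTrie();
            n = n.child[i];
        }
        n.nextHop = hop;
    }

    // Walk the address bits; remember the deepest route passed on the way.
    String lookup(String addressBits) {
        LpmTrie n = this;
        String best = nextHop;
        for (char b : addressBits.toCharArray()) {
            n = n.child[b - '0'];
            if (n == null) break;
            if (n.nextHop != null) best = n.nextHop;
        }
        return best;
    }

    public static void main(String[] args) {
        LpmTrie table = new LpmTrie();
        table.insert("10", "ISP-A");       // matches 10xxxxxx
        table.insert("1011", "ISP-B");     // more specific prefix
        System.out.println(table.lookup("10110000")); // ISP-B
        System.out.println(table.lookup("10010000")); // ISP-A
    }
}
```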
2. Interactive: The Autocomplete Engine
Type a word to add it. Type a prefix to search.
3. Complexity & Constraints
| Operation | Time Complexity | Note |
|---|---|---|
| Insert | O(L) | L is word length. |
| Search | O(L) | Independent of the total number of words N. |
| Space | O(N × L × Σ) | Σ is the alphabet size (26 for lowercase English). |
Mathematical Anchor: The Space Trade-off
Tries are fast but space-heavy: Space ≈ Total Characters × Pointer Size. For an English dictionary, specific branches (like ‘z’) are sparse. Optimization: a Radix Tree compresses single-child chains like ‘a’ → ‘p’ → ‘p’ into one node “app”.
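A minimal sketch of the radix-tree idea (names like `RadixNode` and `commonPrefix` are illustrative): each edge carries a whole substring, and inserting splits an edge at the longest common prefix, so “apple” and “apply” share one compressed edge “appl”.

```java
import java.util.*;

// Illustrative radix tree (compressed trie): edges store substrings,
// and inserts split an existing edge at the longest common prefix.
class RadixNode {
    Map<String, RadixNode> edges = new HashMap<>(); // edge label -> child
    boolean isWord;

    void insert(String word) {
        String match = null;
        int k = 0;
        for (String label : edges.keySet()) {
            int c = commonPrefix(label, word);
            if (c > 0) { match = label; k = c; break; }
        }
        if (match == null) {                 // no shared edge: new branch
            RadixNode leaf = new RadixNode();
            leaf.isWord = true;
            edges.put(word, leaf);
            return;
        }
        RadixNode child = edges.get(match);
        if (k == match.length()) {           // consumed the whole edge
            if (k == word.length()) child.isWord = true;
            else child.insert(word.substring(k));
            return;
        }
        // Split the edge: "apple" + "apply" -> "appl" -> {"e", "y"}
        RadixNode mid = new RadixNode();
        mid.edges.put(match.substring(k), child);
        edges.remove(match);
        edges.put(match.substring(0, k), mid);
        if (k == word.length()) mid.isWord = true;
        else {
            RadixNode leaf = new RadixNode();
            leaf.isWord = true;
            mid.edges.put(word.substring(k), leaf);
        }
    }

    static int commonPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        RadixNode root = new RadixNode();
        root.insert("apple");
        root.insert("apply");
        root.insert("app");
        System.out.println(root.edges.keySet()); // [app] — chains compressed
    }
}
```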
4. Implementation
```java
class TrieNode {
    TrieNode[] children = new TrieNode[26];
    boolean isEndOfWord = false;
}

class Trie {
    TrieNode root = new TrieNode();

    public void insert(String word) {
        TrieNode curr = root;
        for (char c : word.toCharArray()) {
            int index = c - 'a';
            if (curr.children[index] == null) {
                curr.children[index] = new TrieNode();
            }
            curr = curr.children[index];
        }
        curr.isEndOfWord = true;
    }

    public boolean search(String word) {
        TrieNode curr = root;
        for (char c : word.toCharArray()) {
            int index = c - 'a';
            if (curr.children[index] == null) return false;
            curr = curr.children[index];
        }
        return curr.isEndOfWord;
    }
}
```
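The same walk supports the prefix search promised earlier. A hedged sketch (repeating the Java classes so it stands alone) adds `startsWith`, which stops at the prefix node, and an `autocomplete` DFS that collects every word in that node's subtree; both method names are illustrative.

```java
import java.util.*;

// Sketch: the array-of-26 Trie extended with prefix search.
// startsWith walks to the prefix node; autocomplete DFS-collects
// all words below it.
class TrieNode {
    TrieNode[] children = new TrieNode[26];
    boolean isEndOfWord = false;
}

class Trie {
    TrieNode root = new TrieNode();

    public void insert(String word) {
        TrieNode curr = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (curr.children[i] == null) curr.children[i] = new TrieNode();
            curr = curr.children[i];
        }
        curr.isEndOfWord = true;
    }

    // Walk to the node for `prefix`, or null if no word starts with it.
    private TrieNode node(String prefix) {
        TrieNode curr = root;
        for (char c : prefix.toCharArray()) {
            curr = curr.children[c - 'a'];
            if (curr == null) return null;
        }
        return curr;
    }

    public boolean startsWith(String prefix) { return node(prefix) != null; }

    public List<String> autocomplete(String prefix) {
        List<String> out = new ArrayList<>();
        TrieNode start = node(prefix);
        if (start != null) dfs(start, new StringBuilder(prefix), out);
        return out;
    }

    private void dfs(TrieNode n, StringBuilder path, List<String> out) {
        if (n.isEndOfWord) out.add(path.toString());
        for (int i = 0; i < 26; i++) {
            if (n.children[i] != null) {
                path.append((char) ('a' + i));
                dfs(n.children[i], path, out);
                path.deleteCharAt(path.length() - 1);
            }
        }
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        for (String w : new String[]{"app", "apple", "apply", "banana"}) t.insert(w);
        System.out.println(t.autocomplete("app")); // [app, apple, apply]
    }
}
```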
```go
type TrieNode struct {
    children    [26]*TrieNode
    isEndOfWord bool
}

type Trie struct {
    root *TrieNode
}

func Constructor() Trie {
    return Trie{root: &TrieNode{}}
}

func (t *Trie) Insert(word string) {
    curr := t.root
    for _, c := range word {
        index := c - 'a'
        if curr.children[index] == nil {
            curr.children[index] = &TrieNode{}
        }
        curr = curr.children[index]
    }
    curr.isEndOfWord = true
}

func (t *Trie) Search(word string) bool {
    curr := t.root
    for _, c := range word {
        index := c - 'a'
        if curr.children[index] == nil {
            return false
        }
        curr = curr.children[index]
    }
    return curr.isEndOfWord
}
```
5. Case Study: Designing a Global Autocomplete System (PEDALS)
Let’s design a distributed autocomplete system similar to Google’s search bar.
Process Requirements
- Functional:
- User types a prefix, system returns top 10 completions based on popularity.
- System must update popularity scores based on new searches.
- Non-Functional:
  - Low latency: < 50 ms.
  - High availability.
  - High read throughput (billions of searches per day).
Estimate
- Traffic: Let’s assume 5 billion searches per day. That’s ~58,000 searches per second (QPS). At peak, maybe 120,000 QPS.
- Storage: The English dictionary has ~500,000 words, but we need to store user queries, which can be short phrases. Assume 100 million unique queries.
  - Average query length: 20 characters (20 bytes).
  - Total raw data: 100M × 20 bytes = 2 GB. This easily fits in memory!
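The arithmetic above, checked as a back-of-envelope sketch (the 2× peak factor is an assumption, not from the source):

```java
// Back-of-envelope check of the traffic and storage estimates.
class Estimates {
    public static void main(String[] args) {
        long searchesPerDay = 5_000_000_000L;
        long qps = searchesPerDay / 86_400;            // seconds per day
        long peakQps = qps * 2;                        // assumed ~2x peak factor

        long uniqueQueries = 100_000_000L;
        long bytesPerQuery = 20;
        long rawBytes = uniqueQueries * bytesPerQuery;

        System.out.println(qps);                       // 57870 -> ~58K QPS
        System.out.println(peakQps);                   // ~120K QPS
        System.out.println(rawBytes / 1_000_000_000.0); // 2.0 GB
    }
}
```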
Data Model
Instead of a simple Trie that just stores words, each node must store the top 10 historical searches that pass through it. This avoids traversing the entire subtree to find the most popular words during a read query.
Example of a Trie node data structure in JSON form (the `...` marks elided children):

```json
{
  "char": "a",
  "children": ["p", "u", ...],
  "top_queries": [
    {"query": "apple", "score": 1500},
    {"query": "amazon", "score": 1200},
    {"query": "api", "score": 900}
  ]
}
```
Architecture
- Client: Sends prefix character by character.
- Load Balancer: Distributes traffic.
- API Gateway: Handles rate limiting and auth.
- Trie Cache (Redis/Memcached): Stores the hot paths of the Trie.
- Trie Service: In-memory service that holds the complete Trie. Replicated across multiple nodes.
- Data Aggregation Pipeline:
- We don’t update the Trie synchronously on every search (that would be a write bottleneck).
- Searches are sent to a Kafka topic.
- A stream processing engine (e.g., Spark Streaming or Flink) aggregates frequencies every few minutes.
- The updated frequencies are saved to a Database (e.g., Cassandra) and used to rebuild the offline Trie.
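The aggregation step can be sketched as a plain micro-batch rollup (Kafka/Flink wiring omitted; `BatchAggregator` is illustrative): a batch of raw search events becomes per-query frequency deltas to merge into the store.

```java
import java.util.*;

// Sketch of the stream-aggregation step: roll up a micro-batch of raw
// search events into per-query frequency deltas.
class BatchAggregator {
    static Map<String, Long> aggregate(List<String> events) {
        Map<String, Long> deltas = new HashMap<>();
        for (String query : events) deltas.merge(query, 1L, Long::sum);
        return deltas;
    }

    public static void main(String[] args) {
        List<String> batch = List.of("apple", "api", "apple", "amazon", "apple");
        // Counts: apple=3, api=1, amazon=1 (map print order is unspecified)
        System.out.println(aggregate(batch));
    }
}
```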
Localized Details (The Bottlenecks)
- Heavy Reads: The system is extremely read-heavy. The in-memory Trie service is crucial. We use a Read-Optimized Trie.
- Trie Rebuild: Building the Trie is CPU intensive. We build it offline using a map-reduce job on Hadoop/Spark. Once built, we swap the old Trie with the new Trie in memory seamlessly.
- Caching: Browsers should cache recent results. The API Gateway can use a CDN or Redis to cache exact prefix matches.
Scale
- Partitioning: What if the Trie grows beyond one machine’s memory? (e.g., we support many languages, long queries).
- Range Based Partitioning: ‘a’ to ‘m’ on Server 1, ‘n’ to ‘z’ on Server 2. Problem: ‘s’ (for ‘system’) might get way more traffic than ‘q’.
- Hash Based Partitioning: Hash the first character or prefix to determine the server. Better distribution.
- Zookeeper: Used to manage the cluster state, coordinate the Trie rebuilding process, and handle service discovery.
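Hash-based partitioning can be sketched as a tiny router (class name illustrative): hashing a short leading prefix picks the shard, keeping all queries that share that prefix on one server while spreading prefixes across the cluster.

```java
// Sketch of hash-based partitioning: map a prefix's leading characters
// to a shard by hashing. PrefixRouter and its field names are illustrative.
class PrefixRouter {
    private final int shards;

    PrefixRouter(int shards) { this.shards = shards; }

    int shardFor(String prefix) {
        // Hash the first 1-2 characters; floorMod guards negative hashCodes.
        String key = prefix.length() >= 2 ? prefix.substring(0, 2) : prefix;
        return Math.floorMod(key.hashCode(), shards);
    }

    public static void main(String[] args) {
        PrefixRouter router = new PrefixRouter(4);
        // Prefixes of the same query land on the same shard.
        System.out.println(router.shardFor("system") == router.shardFor("sys")); // true
    }
}
```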