Capstone: E-Commerce Search Engine
[!NOTE] This module explores the core principles of Capstone: E-Commerce Search Engine, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Requirement
Goal: Build the search system for “ShopMega”, a retailer with 50M products.
Traffic: 5k QPS (read), 100 WPS (write).
Features:
- Full-text search with typo tolerance.
- Faceted navigation (Color, Price, Brand).
- Personalized Ranking.
2. The Data Model (Mapping)
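{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 15 } // Example gram sizes; tune for the catalog
      },
      "analyzer": {
        "autocomplete": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase", "autocomplete_filter"] }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english", // Stemming (shoes -> shoe)
        "fields": { "edge": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" } } // Autocomplete; edge_ngram is a token filter, not a built-in analyzer, so it is wrapped in a custom one
      },
      "price": { "type": "float" }, // Range filters
      "category": { "type": "keyword" }, // Facets
      "popularity": { "type": "rank_feature" } // Efficient boosting
    }
  }
}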
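To make the autocomplete subfield more concrete, the sketch below imitates in plain Python what edge n-gram analysis produces for a product title. It is only an approximation: it splits on whitespace rather than using Elasticsearch's tokenizer, and the function names and the `min_gram`/`max_gram` defaults are illustrative assumptions, not values from the mapping above.

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 15) -> list[str]:
    """Return the prefixes of `token` whose lengths run from min_gram to max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def autocomplete_terms(title: str, min_gram: int = 2, max_gram: int = 15) -> list[str]:
    """Lowercase the title, split on whitespace, and expand each word into prefixes.

    Assumption: whitespace splitting stands in for the real tokenizer.
    """
    terms: list[str] = []
    for word in title.lower().split():
        terms.extend(edge_ngrams(word, min_gram, max_gram))
    return terms


if __name__ == "__main__":
    # Each prefix becomes an indexed term, so a partial input such as "sho"
    # can match the title with an ordinary term lookup.
    print(autocomplete_terms("Running Shoes"))
```

Because the prefixes are generated at index time, the search side should not be n-grammed again; otherwise the query "shoes" would also match every title containing a word that starts with "sh".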
3. The Query Strategy (DSL)
We need a compound Bool Query:
- Must: Match the user's text with a multi_match on title/desc, using "fuzziness": "AUTO" to handle typos.
- Should: Boost by popularity with "rank_feature": { "field": "popularity", "boost": 2 }.
- Filter: Apply the user's facets (category = "shoes", price < 100). Filter clauses skip scoring, and frequently used ones can be cached as bitsets.
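The three clauses above can be assembled into a single request body. The following is a minimal sketch that only builds the JSON as a Python dict; `build_search_query` and `auto_max_edits` are hypothetical helper names, and the `desc` field is assumed to exist in the index because the text mentions it. The `auto_max_edits` helper mirrors the default thresholds of "fuzziness": "AUTO" (no edits for terms shorter than 3 characters, one edit for 3 to 5 characters, two edits beyond that).

```python
import json
from typing import Optional


def auto_max_edits(term: str) -> int:
    """Edit distance allowed by "fuzziness": "AUTO" with its default thresholds (3 and 6)."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


def build_search_query(text: str, category: Optional[str] = None,
                       max_price: Optional[float] = None) -> dict:
    """Build the bool query: scored text match, popularity boost, unscored facet filters."""
    filters = []
    if category is not None:
        filters.append({"term": {"category": category}})
    if max_price is not None:
        filters.append({"range": {"price": {"lt": max_price}}})
    return {
        "query": {
            "bool": {
                # Must: contributes to the relevance score and tolerates typos.
                "must": [{
                    "multi_match": {
                        "query": text,
                        "fields": ["title", "desc"],  # "desc" is assumed to be mapped
                        "fuzziness": "AUTO",
                    }
                }],
                # Should: optional clause that raises the score of popular products.
                "should": [{"rank_feature": {"field": "popularity", "boost": 2}}],
                # Filter: yes/no matching only, no scoring.
                "filter": filters,
            }
        }
    }


if __name__ == "__main__":
    body = build_search_query("runing shoes", category="shoes", max_price=100)
    print(json.dumps(body, indent=2))
```

Keeping the facets in `filter` rather than `must` is the design choice that matters here: the facet values do not change relevance, so there is no reason to pay for scoring them.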
4. The Architecture Diagram
Client
↓
Gateway (Rate Limit / Auth)
↓
Elasticsearch Cluster
  Node 1 (Data)
    Shard 1 (P)
    Shard 2 (R)
  Node 2 (Data)
    Shard 2 (P)
    Shard 1 (R)
  Node 3 (Master)
  Coordinator
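The point of the shard placement in the diagram is that a primary (P) and its replica (R) never share a node. As a rough sanity check, the sketch below encodes that layout as a Python dict and tests whether every shard still has a copy after one data node is lost. The node names and helper functions are hypothetical; this models the diagram only and does not talk to a cluster.

```python
# Layout copied from the diagram: data node -> list of (shard number, role).
LAYOUT = {
    "node1": [(1, "P"), (2, "R")],
    "node2": [(2, "P"), (1, "R")],
}


def copies_by_shard(layout: dict) -> dict:
    """Map each shard number to the set of nodes that hold a copy of it."""
    result: dict = {}
    for node, shards in layout.items():
        for shard, _role in shards:
            result.setdefault(shard, set()).add(node)
    return result


def survives_loss_of(layout: dict, lost_node: str) -> bool:
    """True if every shard still has at least one copy once `lost_node` is gone."""
    remaining = {n: s for n, s in layout.items() if n != lost_node}
    return set(copies_by_shard(remaining)) == set(copies_by_shard(layout))


if __name__ == "__main__":
    for node in LAYOUT:
        print(node, "lost -> all shards available:", survives_loss_of(LAYOUT, node))
```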
5. Staff Decisions
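- Sharding: 50M products × 1 KB (assumed average document size) ≈ 50 GB of primary data.
  - Decision: 2 primary shards (about 25 GB each), within the commonly cited 10-50 GB per-shard guideline.
- Replication: 2 replicas.
  - Total copies = 3 (1 primary + 2 replicas). Losing one data center is survivable, provided each copy sits on a different node and the copies are spread across data centers. This needs at least 3 data nodes, so the two-data-node layout in the diagram would need a third.
- Refresh Interval: Set to 30s.
  - Users don’t need to search a product 1 ms after it’s added.
  - Gain: Fewer refreshes produce fewer small segments, which raises indexing throughput.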
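The arithmetic behind these decisions can be checked in a few lines. The sketch below is a back-of-the-envelope calculator, not a capacity-planning tool: `plan_index` is a hypothetical helper, 1 KB is taken as 1,000 bytes, and index overhead (doc values, the n-gram subfield, merges) is ignored. The `settings` dict it returns uses the standard index setting names.

```python
def plan_index(doc_count: int, avg_doc_bytes: int, primaries: int, replicas: int,
               refresh_interval: str = "30s") -> dict:
    """Estimate shard sizes and produce matching index settings."""
    primary_gb = doc_count * avg_doc_bytes / 1e9  # raw primary data, ignoring overhead
    copies = 1 + replicas                         # one primary plus its replicas
    return {
        "primary_data_gb": primary_gb,
        "shard_gb": primary_gb / primaries,
        "copies_per_shard": copies,
        "total_storage_gb": primary_gb * copies,
        # Copies of the same shard are never placed on the same node,
        # so at least `copies` data nodes are needed to allocate them all.
        "min_data_nodes": copies,
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
                "refresh_interval": refresh_interval,
            }
        },
    }


if __name__ == "__main__":
    plan = plan_index(doc_count=50_000_000, avg_doc_bytes=1_000, primaries=2, replicas=2)
    for key, value in plan.items():
        print(f"{key}: {value}")
```

Under these assumptions the plan comes out at 25 GB per shard, 150 GB of total storage across all copies, and a minimum of 3 data nodes.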
Final Verdict: Simple (2 primary shards), scalable (replicas share the 5k QPS read load), and resilient (3 copies of every shard).