Capstone: E-Commerce Search Engine

[!NOTE] This capstone module designs an e-commerce search engine from first principles and hardware constraints, with a production-ready system as the goal.

1. The Requirement

Goal: Build the search system for “ShopMega”, a retailer with 50M products. Traffic: 5k queries per second (reads) and 100 writes per second. Features:

  • Full-text search with typo tolerance.
  • Faceted navigation (Color, Price, Brand).
  • Personalized Ranking.

2. The Data Model (Mapping)

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english", // Stemming (shoes -> shoe)
        "fields": {
          // Autocomplete. "autocomplete" is a custom analyzer built on the edge_ngram
          // token filter; it must be defined under settings.analysis (it is not built in).
          "edge": { "type": "text", "analyzer": "autocomplete" }
        }
      },
      "price": { "type": "float" }, // Range filters
      "category": { "type": "keyword" }, // Facets
      "popularity": { "type": "rank_feature" } // Efficient boosting
    }
  }
}
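The mapping relies on an edge n-gram analyzer for the title.edge subfield. Elasticsearch ships edge_ngram as a tokenizer and a token filter rather than as a ready-made analyzer, so one has to be declared in the index settings. The sketch below shows one way that could look, written as a Python dict that mirrors the create-index request body. The names `autocomplete` and `autocomplete_filter`, the gram sizes, and the `search_analyzer` choice are illustrative assumptions, not values from the original design.

```python
import json


def build_index_body(min_gram: int = 2, max_gram: int = 15) -> dict:
    """Return a create-index body: analysis settings plus the product mapping.

    min_gram/max_gram are assumed defaults for illustration; tune them
    against real query logs.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical name. Emits prefixes: "shoes" -> "sh", "sho", ...
                    "autocomplete_filter": {
                        "type": "edge_ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    # Hypothetical name, referenced by title.edge below.
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "english",
                    "fields": {
                        "edge": {
                            "type": "text",
                            "analyzer": "autocomplete",
                            # Assumption: do not n-gram the user's query itself.
                            "search_analyzer": "standard",
                        }
                    },
                },
                "price": {"type": "float"},
                "category": {"type": "keyword"},
                "popularity": {"type": "rank_feature"},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```

Applying n-grams only at index time keeps a typed prefix such as "sho" matching the stored prefixes of "shoes" without expanding the query into its own prefixes.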

3. The Query Strategy (DSL)

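We need a compound bool query with three clauses:

  1. Must: Match the user's text (multi_match on title/desc; desc is assumed to be mapped as text, like title).
    • "fuzziness": "AUTO" (handles typos; the allowed edit distance grows with term length).
  2. Should: Boost by popularity.
    • "rank_feature": { "field": "popularity", "boost": 2 }.
  3. Filter: Apply user facets (category = "shoes", price < 100).
    • Filter clauses skip scoring, and frequently used ones can be cached as bitsets in the node query cache.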
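The three clauses above can be assembled into a single request body. The following is a minimal sketch in Python that only builds the dict; sending it to a cluster is left out. The function name `build_search_body`, the `size` default, and the `desc` field (mentioned in the text but absent from the mapping) are assumptions for illustration.

```python
from typing import Optional


def build_search_body(
    text: str,
    category: Optional[str] = None,
    max_price: Optional[float] = None,
    size: int = 20,
) -> dict:
    """Build a bool query: fuzzy text match, popularity boost, facet filters."""
    filters = []
    if category is not None:
        # Exact match on the keyword field; no scoring involved.
        filters.append({"term": {"category": category}})
    if max_price is not None:
        # Strictly less than, matching "price < 100" in the text.
        filters.append({"range": {"price": {"lt": max_price}}})

    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            # "desc" is assumed to exist in the mapping.
                            "fields": ["title", "desc"],
                            "fuzziness": "AUTO",
                        }
                    }
                ],
                "should": [
                    {"rank_feature": {"field": "popularity", "boost": 2}}
                ],
                "filter": filters,
            }
        },
    }


if __name__ == "__main__":
    import json

    body = build_search_body("runing shoes", category="shoes", max_price=100)
    print(json.dumps(body, indent=2))
```

Keeping the facets in `filter` rather than `must` means they narrow the result set without changing relevance scores, so only the text match and the popularity boost influence ranking.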

4. The Architecture Diagram

Client → Gateway (Rate Limit / Auth) → Elasticsearch Cluster

Elasticsearch Cluster:

  • Node 1 (Data): Shard 1 (P), Shard 2 (R)
  • Node 2 (Data): Shard 2 (P), Shard 1 (R)
  • Node 3 (Master)
  • Coordinator

P = primary shard, R = replica shard.

5. Staff Decisions

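  1. Sharding: 50M products × 1 KB ≈ 50 GB of source data.
    • Decision: 2 primary shards (about 25 GB each), which sits inside the commonly cited 10-50 GB per-shard guideline.
  2. Replication: 2 replicas.
    • Total copies = 3 (1 primary + 2 replicas). If each copy lives in a different data center, the loss of one data center is survivable. This needs at least three data nodes, because Elasticsearch never places two copies of the same shard on one node.
  3. Refresh interval: set to 30s (the default is 1s).
    • Users don’t need to find a product within a second of it being added.
    • Gain: fewer refreshes mean fewer small segments to write and merge, which raises indexing throughput.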
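The sizing arithmetic behind these decisions is easy to check in a few lines. The sketch below recomputes the per-shard size and total footprint and emits the matching index settings. The helper name `plan_index` is hypothetical, and the 1 KB average document size is the text's own estimate; the real on-disk index size depends on the mapping and compression and should be measured.

```python
def plan_index(
    doc_count: int,
    avg_doc_bytes: int,
    primaries: int,
    replicas: int,
    refresh_interval: str = "30s",
) -> dict:
    """Estimate shard sizes (decimal GB) and return matching index settings."""
    primary_gb = doc_count * avg_doc_bytes / 1e9  # raw data, one copy
    copies = 1 + replicas                         # primary plus its replicas
    return {
        "primary_data_gb": primary_gb,
        "per_shard_gb": primary_gb / primaries,
        "copies_per_shard": copies,
        "total_shard_copies": primaries * copies,
        "total_storage_gb": primary_gb * copies,
        # One copy of a shard per node, so this many data nodes at minimum.
        "min_data_nodes": copies,
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
                "refresh_interval": refresh_interval,
            }
        },
    }


if __name__ == "__main__":
    plan = plan_index(
        doc_count=50_000_000, avg_doc_bytes=1_000, primaries=2, replicas=2
    )
    for key, value in plan.items():
        print(f"{key}: {value}")
```

With the numbers from the text this gives 25 GB per shard, six shard copies in total, and roughly 150 GB of storage across the cluster before any indexing overhead.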

Final Verdict: Simple, Scalable, Resilient. Gold Standard.