Capacity Planning: The Math of Shards
[!NOTE] This module covers capacity planning for Elasticsearch shards, deriving sizing rules from hardware constraints and operational limits.
1. The Goldilocks Zone: 10GB - 50GB
The #1 question: “How many shards should I have?”
The Rule: Keep shard size between 10GB and 50GB.
- Too Small (< 1GB):
- Overhead: Each shard is a full Lucene index with its own file handles, memory buffers, and merge threads, so thousands of tiny shards multiply that fixed cost.
- Symptom: Heap pressure ("OOM (Out of Memory)") and "Cluster State Explosion" as the master tracks ever more shard metadata.
- Too Large (> 50GB):
- Recovery Hell: Recovering or relocating a 100GB shard takes hours.
- Symptom: “Cluster Red” for 4 hours after a node reboot.
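As a rough sanity check, the shard-count side of this rule can be sketched in Python. The 30GB target is an illustrative midpoint of the 10GB-50GB range, not a value prescribed above:

```python
import math

# Sketch: pick a primary shard count so each shard lands inside the
# 10-50 GB "Goldilocks" range. The 30 GB target is an assumption.
def shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Return a primary shard count keeping shards near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(shard_count(600))  # 600 GB index -> 20 shards of ~30 GB each
print(shard_count(5))    # tiny index -> 1 shard; avoid sub-GB shards
```

Note the `max(1, ...)`: small indices collapse to a single shard rather than being split into sub-GB fragments.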
2. Hot-Warm-Cold Architecture
Don’t treat all data equally.
- Hot Nodes (SSD, High CPU): Active writes, frequent searches (Last 7 days).
- Warm Nodes (HDD/Cheap SSD): Read-only, infrequent searches (Day 8 - 30).
- Cold Nodes (S3/Snapshots): “Frozen” indices for compliance (Day 30+).
Mechanism: Index Lifecycle Management (ILM).
Elasticsearch automatically moves shards from Hot → Warm → Cold based on age.
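An ILM policy following this Hot → Warm → Cold flow might look like the sketch below, written as a Python dict mirroring Elasticsearch's ILM JSON. The rollover thresholds, the `data` node attribute, and the `my-snapshot-repo` repository name are illustrative assumptions, not values from this module:

```python
import json

# Illustrative ILM policy: roll over hot indices by size/age, then
# migrate shards to warm nodes at day 7 and to cold storage at day 30.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "7d",  # leave hot nodes after day 7
                "actions": {"allocate": {"require": {"data": "warm"}}}
            },
            "cold": {
                "min_age": "30d",  # day 30+: searchable snapshot in object storage
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "my-snapshot-repo"}
                }
            },
        }
    }
}

# This JSON body would be PUT to the _ilm/policy endpoint.
print(json.dumps(ilm_policy, indent=2))
```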
3. Capacity Calculator
How many nodes do you need? Work it out from three numbers:
Formula: Total Storage = Total Data * (1 + Replicas) * 1.2 (Overhead)
- Total Storage Needed: raw data, multiplied by replica copies, plus ~20% for indexing and filesystem overhead.
- Total Primary Shards (Daily): daily data volume divided by the target shard size (10GB-50GB).
- Nodes Required: total storage divided by the usable disk per node.
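The formula above can be sketched in Python. The 2TB node disk, 30GB target shard size, and 85% usable-disk factor (anticipating the watermark discussed next) are illustrative assumptions:

```python
import math

def capacity_plan(raw_data_gb: float, replicas: int = 1,
                  node_disk_gb: float = 2000.0,   # assumed 2 TB data nodes
                  overhead: float = 1.2,          # ~20% indexing/OS overhead
                  usable: float = 0.85,           # stay under the low watermark
                  target_shard_gb: float = 30.0) -> dict:
    """Apply: Total Storage = Total Data * (1 + Replicas) * 1.2 (Overhead)."""
    total_storage = raw_data_gb * (1 + replicas) * overhead
    primary_shards = max(1, math.ceil(raw_data_gb / target_shard_gb))
    nodes = math.ceil(total_storage / (node_disk_gb * usable))
    return {"total_storage_gb": total_storage,
            "primary_shards": primary_shards,
            "nodes": nodes}

# 1000 GB of data with 1 replica: 1000 * 2 * 1.2 = 2400 GB total storage.
print(capacity_plan(raw_data_gb=1000, replicas=1))
```

Dividing by `node_disk_gb * usable` rather than the raw disk size bakes the free-space buffer directly into the node count.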
4. The 85% Watermark
Elasticsearch stops allocating new shards to a node once its disk usage hits 85% (Low Watermark). At 90% (High Watermark), it actively relocates shards AWAY from that node. At 95% (Flood Stage), it forces indices with a shard on that node into read-only mode (index.blocks.read_only_allow_delete).
Lesson: Always provision 15-20% free space buffer.
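The three default thresholds can be summarized as a simple decision table; this sketch just maps a disk-usage percentage to the behavior described above:

```python
def allocation_state(disk_used_pct: float) -> str:
    """Map disk usage to Elasticsearch's default watermark behavior."""
    if disk_used_pct >= 95:
        return "flood-stage: affected indices forced read-only"
    if disk_used_pct >= 90:
        return "high: shards relocated away from this node"
    if disk_used_pct >= 85:
        return "low: no new shards allocated to this node"
    return "ok: normal allocation"

for pct in (80, 86, 91, 96):
    print(f"{pct}% -> {allocation_state(pct)}")
```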