Capacity Planning: The Math of Shards
[!NOTE] This module covers capacity planning for Elasticsearch shards, deriving sizing rules from hardware constraints and operational limits.
1. The Goldilocks Zone: 10GB - 50GB
The #1 question: “How many shards should I have?”
The Rule: Keep shard size between 10GB and 50GB.
- Too Small (< 1GB):
- Overhead: Each shard is a full Lucene index with its own file handles, memory buffers, and merge threads, so thousands of tiny shards multiply that fixed cost.
- Symptom: Heap pressure ("OOM (Out of Memory)") and "Cluster State Explosion" as the master tracks ever more shard metadata.
- Too Large (> 50GB):
- Recovery Hell: Recovering or relocating a 100GB shard takes hours.
- Symptom: “Cluster Red” for 4 hours after a node reboot.
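As a rough sanity check, the shard-count side of this rule can be sketched in Python. The 30GB target is an illustrative midpoint of the 10GB-50GB range, not a value prescribed above:

```python
import math

# Sketch: pick a primary shard count so each shard lands inside the
# 10-50 GB "Goldilocks" range. The 30 GB target is an assumption.
def shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Return a primary shard count keeping shards near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(shard_count(600))  # 600 GB index -> 20 shards of ~30 GB each
print(shard_count(5))    # tiny index -> 1 shard; avoid sub-GB shards
```

Note the `max(1, ...)`: small indices collapse to a single shard rather than being split into sub-GB fragments.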
2. Hot-Warm-Cold Architecture
Don’t treat all data equally.
- Hot Nodes (SSD, High CPU): Active writes, frequent searches (Last 7 days).
- Warm Nodes (HDD/Cheap SSD): Read-only, infrequent searches (Day 8 - 30).
- Cold Nodes (S3/Snapshots): “Frozen” indices for compliance (Day 30+).
Mechanism: Index Lifecycle Management (ILM).
Elasticsearch automatically moves shards from Hot → Warm → Cold based on age.
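An ILM policy following this Hot → Warm → Cold flow might look like the sketch below, written as a Python dict mirroring Elasticsearch's ILM JSON. The rollover thresholds, the `data` node attribute, and the `my-snapshot-repo` repository name are illustrative assumptions, not values from this module:

```python
import json

# Illustrative ILM policy: roll over hot indices by size/age, then
# migrate shards to warm nodes at day 7 and to cold storage at day 30.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "7d",  # leave hot nodes after day 7
                "actions": {"allocate": {"require": {"data": "warm"}}}
            },
            "cold": {
                "min_age": "30d",  # day 30+: searchable snapshot in object storage
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "my-snapshot-repo"}
                }
            },
        }
    }
}

# This JSON body would be PUT to the _ilm/policy endpoint.
print(json.dumps(ilm_policy, indent=2))
```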
3. Capacity Calculator
How many nodes do you need? Work it out from three numbers:
Formula: Total Storage = Total Data * (1 + Replicas) * 1.2 (Overhead)
- Total Storage Needed: raw data, multiplied by replica copies, plus ~20% for indexing and filesystem overhead.
- Total Primary Shards (Daily): daily data volume divided by the target shard size (10GB-50GB).
- Nodes Required: total storage divided by the usable disk per node.
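The formula above can be sketched in Python. The 2TB node disk, 30GB target shard size, and 85% usable-disk factor (anticipating the watermark discussed next) are illustrative assumptions:

```python
import math

def capacity_plan(raw_data_gb: float, replicas: int = 1,
                  node_disk_gb: float = 2000.0,   # assumed 2 TB data nodes
                  overhead: float = 1.2,          # ~20% indexing/OS overhead
                  usable: float = 0.85,           # stay under the low watermark
                  target_shard_gb: float = 30.0) -> dict:
    """Apply: Total Storage = Total Data * (1 + Replicas) * 1.2 (Overhead)."""
    total_storage = raw_data_gb * (1 + replicas) * overhead
    primary_shards = max(1, math.ceil(raw_data_gb / target_shard_gb))
    nodes = math.ceil(total_storage / (node_disk_gb * usable))
    return {"total_storage_gb": total_storage,
            "primary_shards": primary_shards,
            "nodes": nodes}

# 1000 GB of data with 1 replica: 1000 * 2 * 1.2 = 2400 GB total storage.
print(capacity_plan(raw_data_gb=1000, replicas=1))
```

Dividing by `node_disk_gb * usable` rather than the raw disk size bakes the free-space buffer directly into the node count.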
4. The 85% Watermark
Elasticsearch stops allocating new shards to a node once its disk usage hits 85% (Low Watermark). At 90% (High Watermark), it actively relocates shards AWAY from that node. At 95% (Flood Stage), it forces indices with a shard on that node into read-only mode (index.blocks.read_only_allow_delete).
Lesson: Always provision 15-20% free space buffer.
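The three default thresholds can be summarized as a simple decision table; this sketch just maps a disk-usage percentage to the behavior described above:

```python
def allocation_state(disk_used_pct: float) -> str:
    """Map disk usage to Elasticsearch's default watermark behavior."""
    if disk_used_pct >= 95:
        return "flood-stage: affected indices forced read-only"
    if disk_used_pct >= 90:
        return "high: shards relocated away from this node"
    if disk_used_pct >= 85:
        return "low: no new shards allocated to this node"
    return "ok: normal allocation"

for pct in (80, 86, 91, 96):
    print(f"{pct}% -> {allocation_state(pct)}")
```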