Capacity Planning: The Math of Shards

[!NOTE] This module covers capacity planning for Elasticsearch shards: how large shards should be, how to tier data by age, and how to size a cluster from hardware constraints rather than guesswork.

1. The Goldilocks Zone: 10GB - 50GB

The #1 question: “How many shards should I have?”

The Rule: Keep shard size between 10GB and 50GB.

  1. Too Small (< 1GB):
    • Overhead: Each shard is a full Lucene index with its own file handles, memory buffers, and threads; thousands of tiny shards multiply that fixed cost.
    • Symptom: “OOM (Out of Memory)” errors and “Cluster State Explosion” (the master node struggles to track metadata for every shard).
  2. Too Large (> 50GB):
    • Recovery Hell: Recovering or relocating a 100GB shard takes hours.
    • Symptom: “Cluster Red” for 4 hours after a node reboot.
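The Goldilocks rule turns into a quick sizing check. A minimal sketch in Python; the 10–50GB bounds come from the rule above, and the 40GB target is an assumption matching the shard guidance later in this module:

```python
import math

def shard_size_ok(shard_size_gb: float) -> bool:
    """Check the 10-50GB 'Goldilocks zone' for a single shard."""
    return 10 <= shard_size_gb <= 50

def recommended_shard_count(index_size_gb: float, target_shard_gb: float = 40.0) -> int:
    """Pick a primary-shard count that keeps each shard near the target size.

    Rounds up so no shard exceeds the target.
    """
    if index_size_gb <= 0:
        raise ValueError("index size must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))
```

For example, a 120GB daily index would get 3 primary shards of ~40GB each, while a 5GB index stays at a single shard rather than being over-split.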

2. Hot-Warm-Cold Architecture

Don’t treat all data equally.

  • Hot Nodes (SSD, High CPU): Active writes, frequent searches (Last 7 days).
  • Warm Nodes (HDD/Cheap SSD): Read-only, infrequent searches (Day 8 - 30).
  • Cold Nodes (S3/Snapshots): “Frozen” indices for compliance (Day 30+).

Mechanism: Index Lifecycle Management (ILM). Elasticsearch automatically moves shards from Hot → Warm → Cold based on age.
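An ILM policy is ordinary JSON pushed to the cluster. A hedged sketch of a hot → warm → cold policy; the policy name `logs-policy`, the `data: warm` node attribute, and the repository name `my-repo` are placeholders, and newer clusters typically rely on built-in data tiers instead of custom allocation attributes:

```json
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "my-repo" }
        }
      }
    }
  }
}
```

Note how the rollover cap of 50gb enforces the Goldilocks upper bound from section 1 automatically.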


3. Capacity Calculator

How many nodes do you need? Work it out in three steps:

  • Total Storage Needed: Total Data × (1 + Replicas) × 1.2 (overhead for merges, translog, and temp files).
  • Total Primary Shards (Daily): aim for ~2-3 shards per daily index (~40GB each).
  • Nodes Required: Total Storage ÷ usable disk per node (only the space below the 85% watermark counts — see section 4).
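The arithmetic can be sketched in a few lines of Python. The 1.2 overhead factor and 85% watermark come from this module; the 8TB-per-node default is an illustrative assumption, not a recommendation:

```python
import math

def capacity_plan(raw_data_tb: float, replicas: int = 1,
                  disk_per_node_tb: float = 8.0,   # assumed node size, adjust to your hardware
                  overhead: float = 1.2,            # indexing/merge overhead factor
                  watermark: float = 0.85) -> dict: # usable fraction of each disk
    """Apply: Total Storage = Data * (1 + Replicas) * overhead,
    then divide by the usable (sub-watermark) disk per node."""
    total_storage = raw_data_tb * (1 + replicas) * overhead
    usable_per_node = disk_per_node_tb * watermark
    nodes = math.ceil(total_storage / usable_per_node)
    return {"total_storage_tb": round(total_storage, 2), "nodes": nodes}
```

With 10TB of raw data and 1 replica, total storage is 10 × 2 × 1.2 = 24TB; on 8TB nodes capped at the 85% watermark (6.8TB usable each), that requires 4 nodes.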

4. The 85% Watermark

Elasticsearch stops allocating new shards to a node once its disk usage hits 85% (Low Watermark). At 90% (High Watermark), it actively tries to move shards AWAY from that node. At 95% (Flood Stage), it marks every index with a shard on that node read-only.

Lesson: Always provision 15-20% free space buffer.
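These thresholds are the Elasticsearch defaults, and they can be tuned through the cluster settings API. A sketch that simply restates the stock 85/90/95% values explicitly (raise or lower them only with care, since flood stage is the last line of defense against a full disk):

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}
```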