Module Review: Scaling, Reliability & Operations

[!NOTE] This review chapter consolidates the critical concepts of capacity planning, resilience, and multi-tenancy, providing interactive tools to ensure you have internalized the hardware realities of distributed search.

1. Key Takeaways

  • Shard Sizing: Keep shards between 10GB and 50GB. Many small shards cause “oversharding” (every shard carries fixed Lucene overhead on the JVM heap); oversized shards make recovery and relocation slow.
  • ILM (Index Lifecycle Management): Automatically move data across Hot (NVMe), Warm (HDD), and Cold (S3) nodes based on age and hardware constraints to minimize costs.
  • Split Brain: A network partition causing a cluster to elect two independent masters, leading to divergent, un-mergeable data.
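  • Quorum Rule: Electing a master requires a strict majority of the master-eligible (voting) nodes: floor(N/2) + 1. Run an odd number of master-eligible nodes (3, 5, or 7), never 2: with two nodes the quorum is 2, so losing either node blocks the election.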
  • Multi-Tenancy: “Index-per-Tenant” gives strong isolation but causes oversharding as the tenant count grows. “Shared Index with Custom Routing” provides high hardware density, but every read and write must carry the tenant's routing value; a request without it fans out to all shards (scatter-gather) and pays the latency.
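
The quorum rule above is easy to check by hand. The following is a minimal sketch of the arithmetic only; the function names `quorum` and `tolerated_failures` are illustrative and are not part of any Elasticsearch API.

```python
def quorum(voting_nodes: int) -> int:
    """Smallest strict majority of the master-eligible (voting) nodes."""
    if voting_nodes < 1:
        raise ValueError("need at least one voting node")
    # Integer division implements floor(N/2).
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """Voting nodes that can be lost while a master can still be elected."""
    return voting_nodes - quorum(voting_nodes)


# Compare a few cluster sizes side by side.
for n in (2, 3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Two nodes tolerate zero failures, and four nodes tolerate only one, the same as three. This is why odd counts of master-eligible nodes are the usual recommendation: an extra even node adds cost without adding fault tolerance.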

2. Interactive Flashcards

Click or press Enter on a card to reveal the answer. Use these to test your active recall.


3. Operations Cheat Sheet

Use this cheat sheet to remember key architectural limits and rules.

| Concept | Limit / Rule | Reason |
|---|---|---|
| Shard Size | 10GB to 50GB | Smaller shards waste JVM heap (Lucene overhead). Larger shards slow down cluster recovery (network/disk transfer). |
| Quorum Math | floor(N/2) + 1 | Prevents Split Brain. Ensures a strict majority of master-eligible nodes vote. |
| Minimum Masters | 3 nodes | A 2-node cluster has a quorum of 2, so it cannot survive the loss of either node. |
| Low Watermark | 85% disk used | Elasticsearch stops allocating new shards to the node (primaries of newly created indices are exempt). |
| High Watermark | 90% disk used | Elasticsearch tries to relocate existing shards away from the node. |
| Flood Stage | 95% disk used | Every index with a shard on the node gets the index.blocks.read_only_allow_delete block. Writes are rejected (403 on older versions, 429 on newer ones). |
| Cluster Yellow | Replicas missing | Redundancy is reduced, but all data is fully readable and writable. |
| Cluster Red | Primary missing | Hard outage for the affected indices. Data is lost or temporarily unavailable. |
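
The three disk thresholds in the cheat sheet can be read as a simple ladder. The sketch below classifies a node by disk usage using the default percentages listed above. The function name `disk_stage` and the returned labels are hypothetical, and treating a threshold as crossed only when usage is strictly above it is an assumption of this example.

```python
# Default thresholds from the cheat sheet, as percent of disk used.
LOW_WATERMARK = 85.0
HIGH_WATERMARK = 90.0
FLOOD_STAGE = 95.0


def disk_stage(used_percent: float) -> str:
    """Return which disk threshold, if any, a node has crossed."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    # Check the most severe threshold first.
    if used_percent > FLOOD_STAGE:
        return "flood_stage"  # indices on the node become read-only
    if used_percent > HIGH_WATERMARK:
        return "high"         # shards are relocated away from the node
    if used_percent > LOW_WATERMARK:
        return "low"          # no new shards are allocated to the node
    return "ok"


for usage in (70.0, 87.5, 92.0, 97.0):
    print(f"{usage}% used -> {disk_stage(usage)}")
```

A monitoring script could alert at the "low" stage, since that leaves a 10-point margin before writes start failing at flood stage.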

4. Next Steps

You have now mastered the operational and reliability principles of running Elasticsearch at scale.

  1. Review definitions in the Elasticsearch Glossary.
  2. Proceed to the next module: Data Pipelines & Ingestion.