Module Review: Scaling, Reliability & Operations

[!NOTE] This review chapter consolidates the critical concepts of capacity planning, resilience, and multi-tenancy, providing interactive tools to ensure you have internalized the hardware realities of distributed search.

1. Key Takeaways

Shard Sizing: Keep shards between 10GB and 50GB. Many small shards cause “Oversharding”: every shard is a full Lucene index with its own fixed overhead, so JVM heap pressure grows with shard count. Larger shards make recovery and rebalancing slow.
ILM (Index Lifecycle Management): Automatically move indices across Hot (NVMe), Warm (HDD), and Cold (typically snapshot-backed object storage such as S3) tiers as they age, so that older, rarely queried data sits on cheaper hardware.
Split Brain: A network partition causing a cluster to elect two independent masters, leading to divergent, un-mergeable data.
Quorum Rule: Electing a master requires a strict majority of the master-eligible (voting) nodes: `Floor(N/2) + 1`. Always run 3 (or 5, 7) master-eligible nodes. Never 2: with 2 nodes the quorum is 2, so losing either node blocks the election.
Multi-Tenancy: “Index-per-Tenant” gives strong isolation but causes oversharding as the tenant count grows. “Shared Index with Custom Routing” provides high hardware density, but every request must carry the tenant's routing value; without it, a search fans out to all shards and pays scatter-gather latency.
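The quorum arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the formula only, not of how Elasticsearch elects a master; the function names `quorum` and `tolerated_failures` are illustrative and do not come from any Elasticsearch API.

```python
def quorum(voting_nodes: int) -> int:
    """Smallest strict majority of the voting nodes: floor(N/2) + 1."""
    if voting_nodes < 1:
        raise ValueError("need at least one voting node")
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """Number of voting nodes that can be lost while a majority remains."""
    return voting_nodes - quorum(voting_nodes)


# Compare a few cluster sizes; note that 2 nodes tolerate no failures.
for n in (1, 2, 3, 4, 5, 7):
    print(f"N={n}: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 voting nodes raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why odd counts are the usual recommendation.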

2. Interactive Flashcards

Click or press Enter on a card to reveal the answer. Use these to test your active recall.

3. Operations Cheat Sheet

Use this cheat sheet to remember key architectural limits and rules.

Concept	Limit / Rule	Reason
Shard Size	10GB → 50GB	Smaller wastes JVM heap (Lucene overhead). Larger slows down cluster recovery (network/disk transfer).
Quorum Math	`Floor(N/2) + 1`	Prevents Split Brain. Ensures a strict majority of master-eligible nodes vote.
Minimum Masters	3 Nodes	With 2 master-eligible nodes the quorum is 2, so losing either node stops master election. 2-node clusters cannot achieve High Availability.
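Low Watermark	85% Disk Full (default)	Elasticsearch stops allocating new shards to the node (primary shards of newly created indices are exempt).
High Watermark	90% Disk Full (default)	Elasticsearch tries to relocate existing shards away from the node.
Flood Stage	95% Disk Full (default)	Every index with a shard on the node gets the `read_only_allow_delete` block. Writes are rejected (403 in older releases, 429 in recent ones); reads and index deletion still work.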
Cluster Yellow	Replicas Missing	Missing redundancy, but all data is fully readable and writable.
Cluster Red	Primary Missing	At least one primary shard is unassigned. The affected indices are partly or fully unavailable until that primary recovers; if no copy of the shard survives, its data is lost.
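The three disk thresholds in the table can be read as a simple ladder. The sketch below classifies a node's disk usage against the values from the cheat sheet; the constants and the `disk_state` function are hypothetical names for illustration, and treating each threshold as "strictly above" is an assumption about boundary handling, not a statement of Elasticsearch's exact behavior.

```python
# Thresholds as listed in the cheat sheet (percent of disk used).
LOW_WATERMARK = 85.0
HIGH_WATERMARK = 90.0
FLOOD_STAGE = 95.0


def disk_state(used_percent: float) -> str:
    """Return which disk threshold, if any, a node has crossed."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    # Check the most severe threshold first.
    if used_percent > FLOOD_STAGE:
        return "flood_stage"      # indices on the node become read-only (deletes allowed)
    if used_percent > HIGH_WATERMARK:
        return "high_watermark"   # shards are relocated away from the node
    if used_percent > LOW_WATERMARK:
        return "low_watermark"    # no new shards are allocated to the node
    return "ok"


for used in (50.0, 87.5, 92.0, 97.0):
    print(f"{used}% used -> {disk_state(used)}")
```

A check like this is mainly useful for alerting ahead of the low watermark, since each later stage is progressively more disruptive to recover from.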

4. Next Steps

You have now mastered the operational and reliability principles of running Elasticsearch at scale.

Review definitions in the Elasticsearch Glossary.
Proceed to the next module: Data Pipelines & Ingestion.
