Foundations — Review & Checklist

[!NOTE] This module explores the core principles of Foundations — Review & Checklist, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Key Takeaways

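  • The Inverted Index: Instead of mapping document IDs to text, Elasticsearch maps each term to the list of document IDs that contain it. A query looks up its terms directly, so its cost grows with the number of matching documents rather than requiring an O(N) scan of every row.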
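  • Hardware Physics: Segments are immutable, so Elasticsearch writes them sequentially instead of updating data in place with slow random I/O. Reads lean on the operating system's filesystem cache, which lets hot segments be served and their posting lists intersected in memory.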
  • Horizontal Scalability: Data is partitioned into Shards (mini search engines) spread across multiple Nodes, so capacity grows by adding nodes and a single query can run on many shards in parallel.
  • High Availability: Replica Shards are redundant copies of Primary Shards held on other nodes. They allow failover with minimal interruption when a node is lost, and they serve reads, which increases read throughput.
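  • Indexing Lifecycle: A write is added to the Memory Buffer (not yet searchable) and appended to the Translog (durable, so it can be replayed after a crash). A Refresh turns the buffer into a new Segment in the filesystem cache (searchable, but not yet fsynced to disk). A Flush commits the segments to disk (searchable and durable) and lets the Translog be truncated.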
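The indexing lifecycle above can be sketched as a toy model. This is only an illustration of the four stages, not how Elasticsearch or Lucene is implemented; the class `ToyShard` and all of its attribute and method names are hypothetical.

```python
class ToyShard:
    """Toy model of one shard's write path (hypothetical, for illustration only)."""

    def __init__(self):
        self.memory_buffer = []       # indexed docs, not yet searchable
        self.translog = []            # replayable log of writes since the last flush
        self.cached_segments = []     # searchable segments, not yet fsynced
        self.committed_segments = []  # searchable segments, durable on disk

    def index(self, doc):
        # A write lands in the in-memory buffer and is appended to the translog.
        self.memory_buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # The buffer becomes a new immutable segment; its docs are now searchable.
        if self.memory_buffer:
            self.cached_segments.append(tuple(self.memory_buffer))
            self.memory_buffer = []

    def flush(self):
        # Commit: persist every segment, after which the translog can be truncated.
        self.refresh()
        self.committed_segments.extend(self.cached_segments)
        self.cached_segments = []
        self.translog = []

    def search(self, word):
        # Only segments are searched; the memory buffer is invisible to queries.
        segments = self.committed_segments + self.cached_segments
        return [doc for seg in segments for doc in seg if word in doc.split()]

    def crash_and_recover(self):
        # Simulate power loss: anything not committed is gone,
        # then the translog is replayed to rebuild it.
        replay = list(self.translog)
        self.memory_buffer, self.cached_segments, self.translog = [], [], []
        for doc in replay:
            self.index(doc)


shard = ToyShard()
shard.index("quick brown fox")
before_refresh = shard.search("fox")  # still in the buffer, so no hits
shard.refresh()
after_refresh = shard.search("fox")   # the new segment is searchable
shard.crash_and_recover()
shard.refresh()
after_recovery = shard.search("fox")  # rebuilt from the translog
shard.flush()                         # segments committed, translog emptied
```

The point of the model is that searchability and durability arrive at different stages: a refresh gives the former, while the translog and then a flush give the latter.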

2. Flashcards

What is an Inverted Index?

A data structure mapping terms (words) to the list of documents containing them, so a search looks up its terms directly instead of scanning every document.
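To make the flashcard above concrete, here is a minimal sketch of an inverted index next to a full scan. It assumes whitespace tokenization and lowercasing only, which is far simpler than real text analysis; the sample documents and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, *terms):
    """Return IDs of documents containing every term (posting-list intersection)."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


def scan(documents, word):
    """Baseline for comparison: inspect every document, like a table scan."""
    return sorted(d for d, text in documents.items() if word in text.lower().split())


inverted = build_inverted_index(docs)
quick_hits = search_all(inverted, "quick")           # one dictionary lookup
both_hits = search_all(inverted, "quick", "brown")   # intersect two posting lists
dog_hits = search_all(inverted, "the", "dog")        # "dogs" is a different term here
```

Both approaches return the same answers; the difference is that `scan` touches every document on every query, while `search_all` only touches the posting lists of the query terms.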

What is the difference between a Shard and a Replica?

A Shard is a data partition (each one is a self-contained Lucene index). A Replica is a copy of a primary shard, kept on a different node for high availability and read scaling.
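A rough sketch of how documents might be spread over shards, and replicas kept apart from their primaries, follows. Elasticsearch picks a shard by hashing a routing value (the document ID by default) against the number of primary shards; this sketch substitutes `zlib.crc32` for the real hash function and uses a simple round-robin placement, both of which are assumptions made for illustration. The node names and function names are hypothetical.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Choose a primary shard for a document ID.

    crc32 is a stand-in hash (an assumption); it is used here because it is
    deterministic across runs, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(num_primary_shards, nodes):
    """Round-robin placement that keeps each replica off its primary's node.

    Requires at least two nodes, otherwise a replica has nowhere else to go.
    """
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    layout = {}
    for shard in range(num_primary_shards):
        layout[shard] = {
            "primary": nodes[shard % len(nodes)],
            "replica": nodes[(shard + 1) % len(nodes)],
        }
    return layout


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
layout = place_shards(3, nodes)
shard_for_doc = pick_shard("order-1001", 3)
```

Because the shard is derived from the document ID and the primary shard count, the same ID always maps to the same shard, which is also why the number of primary shards cannot simply be changed on an existing index.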

What happens during a Refresh?

Documents in the memory buffer are written to a new Segment in the filesystem cache, making them searchable.

3. Cheat Sheet

| Concept | Purpose | Analogy |
|---|---|---|
| Inverted Index | Fast text search lookup | Book index at the back |
| Cluster | Collection of all nodes | The entire company |
| Node | Single JVM server instance | A single employee |
| Shard | Horizontal data partition | A specialized department |
| Replica | Copy of a primary shard | The backup department |
| Segment | Immutable disk file | A finalized filing cabinet |
| Refresh | Makes data searchable | Printing temporary documents |
| Flush | Makes data durable on disk | Filing documents permanently |

4. Quick Revision

  • The Problem with SQL: LIKE '%text%' has a leading wildcard, so an ordinary B-tree index cannot be used and the database falls back to a full table scan (O(N)), which makes search slow on large tables.
  • Elasticsearch Scale: An Index is just a logical namespace. Shards do the actual work. You can scale horizontally by distributing Shards across Nodes.
  • Failover: If a node holding a Primary Shard dies, a Replica on another node is promoted to Primary, so the data stays available as long as a replica exists.
  • Performance Trade-offs: Increasing refresh_interval improves indexing throughput, at the cost of new documents taking longer to become searchable.
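As a small illustration of the refresh_interval trade-off, the sketch below builds the JSON body for an index settings update. It only constructs the request body and sends nothing; the index name `my-index` and the helper `refresh_settings_body` are hypothetical, and the 30s value is an arbitrary example rather than a recommendation.

```python
import json


def refresh_settings_body(interval: str) -> str:
    """Build the JSON body for an index settings update (hypothetical helper)."""
    return json.dumps({"index": {"refresh_interval": interval}})


# Send with any HTTP client as:  PUT /my-index/_settings
# ("my-index" is a placeholder index name.)
bulk_load_body = refresh_settings_body("30s")  # fewer refreshes while bulk indexing
normal_body = refresh_settings_body("1s")      # Elasticsearch's default interval
```

A common pattern is to raise the interval before a large bulk load and restore it afterwards, accepting that documents indexed in between appear in search results later.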

5. Next Steps

Continue to the next module to learn about mapping and analysis: Elasticsearch course index.

Don’t forget to check the Elasticsearch Glossary if you need a refresher on the terminology used in this module!