Staff-Level Search Architecture — Review & Checklist
[!NOTE] Module review and readiness checklist for Staff-Level Search Architecture. Test your knowledge with interactive flashcards and our quick revision cheat sheet.
1. Key Takeaways
- Gateway Abstraction: Always place a Gateway Service between the client and Elasticsearch to manage auth, rate limits, and circuit breaking.
- Blue/Green Reindexing: Zero-downtime mapping changes require an atomic alias swap from the V1 index to the V2 index.
- SLO-Driven Reliability: Define latency SLOs on P99, not on average or maximum (P100) latency. The average hides outliers, and the maximum is dominated by single anomalies.
- Multi-Cluster Redundancy: Cross-Cluster Replication (CCR) is expensive: roughly 2x infrastructure plus cross-region bandwidth. Reserve multi-region redundancy for Tier 1 functionality only.
- Error Budgeting: An SLO of 99.9% gives you about 43 minutes of allowed downtime per month (0.1% of a 30-day month is 43.2 minutes). Spend this budget on risky but necessary deployments.
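The error-budget figure in the last takeaway is plain arithmetic on the SLO and the length of the window. A minimal sketch follows; the function names and the 30-day default window are assumptions made for illustration, not part of the course material.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction (0.999 for "three nines"). `days` is the window
    length; 30 is an assumed default for "one month".
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, days: int = 30) -> float:
    """Minutes of budget left after the downtime already spent in the window."""
    return error_budget_minutes(slo, days) - downtime_minutes


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))    # 43.2
    print(round(budget_remaining(0.999, 30.0), 1))  # 13.2
```

A negative result from `budget_remaining` means the budget is exhausted, which is the point at which the cheat sheet's "code freeze" trade-off applies.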
2. Interactive Flashcards
What is the primary purpose of the Gateway Service in a Search Platform?
To provide an abstraction layer that handles routing, rate limiting, and circuit breaking before requests hit the Elasticsearch cluster.
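Circuit breaking is the least obvious of these gateway duties, so here is a minimal sketch of one common form: a consecutive-failure breaker with a cool-down. The class name, thresholds, and injectable clock are all hypothetical choices for illustration; a production gateway would more likely rely on an existing library or proxy feature.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial request
        self._clock = clock                 # injectable so tests can control time
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def allow_request(self) -> bool:
        """Return True if the gateway may forward a request to the cluster."""
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial request through.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """A successful call closes the circuit and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

The gateway would call `allow_request()` before forwarding a search, fail fast when it returns `False`, and report the outcome with `record_success()` or `record_failure()`.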
How do you perform zero-downtime mapping updates in Elasticsearch?
Using Blue/Green Deployment: create a new index, reindex the data, and perform an atomic swap of the index alias.
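The swap step can be sketched as the request body for Elasticsearch's `POST /_aliases` endpoint, which applies all listed actions atomically. The helper name and the index and alias names (`products`, `products_v1`, `products_v2`) are hypothetical; the sketch only builds the body and leaves sending it to whichever client the gateway uses.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for a single POST /_aliases call that moves `alias`.

    Both actions travel in one request, so searches against the alias never
    see a moment where it points at no index.
    """
    if old_index == new_index:
        raise ValueError("old_index and new_index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    body = alias_swap_actions("products", "products_v1", "products_v2")
    print(json.dumps(body, indent=2))
```

Run the swap only after the reindex into the new index has completed and been verified. Rolling back is the same call with the two index names exchanged.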
Why do Staff Engineers focus on P99 latency instead of Average or Max (P100)?
Averages hide outliers. Max is skewed by single anomalies (like GC pauses). P99 is the latency that 99% of requests complete within, so it captures the tail without being dominated by one anomaly.
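A small worked example makes the difference concrete. The sketch below uses the nearest-rank percentile definition and made-up latency samples (985 fast requests, 14 slow ones, and one 5-second pause); both the helper and the numbers are illustrative assumptions, and monitoring systems may interpolate percentiles differently.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, a slow tail, one long pause.
latencies_ms = [20] * 985 + [200] * 14 + [5000]

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))  # 27.5
    print("p50: ", percentile(latencies_ms, 50))   # 20
    print("p99: ", percentile(latencies_ms, 99))   # 200
    print("max: ", percentile(latencies_ms, 100))  # 5000
```

The mean (27.5 ms) suggests nothing is wrong, the max (5000 ms) reports a single pause, and P99 (200 ms) shows the slow tail that a real share of requests actually hits.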
3. Cheat Sheet
| Concept | Action / Strategy | Trade-off / Cost |
|---|---|---|
| Gateway Proxy | Route app requests to aliases | Adds slight latency per hop |
| Blue/Green Reindex | Create V2, Reindex, Swap Alias | Requires 2x storage temporarily |
| SLO Tracking | Monitor P99 Latency & Availability | An overly strict SLO wastes money |
| CCR (Cross-Cluster) | Replicate Tier 1 data cross-region | 2x Infrastructure + Bandwidth |
| Error Budget | Use allowed downtime for deployments | Exhausted budget means code freeze |
4. Quick Revision
- P99 vs Max: Max (P100) is dominated by single anomalies such as GC pauses. Optimize for P99.
- Error Budgets: 99.9% SLO ≈ 43 minutes downtime/month.
- Blue/Green: An existing field's mapping cannot be changed in place. Use the V1 → V2 alias dance.
- Cost vs Redundancy: Don’t CCR your logs. Save the budget for Tier 1 search indexes.
5. Next Steps
- Action: Check out the Elasticsearch Glossary.
- Action: Return to the Elasticsearch Course Index.