SLOs: Engineering Reliability
[!NOTE] This module explores the core principles of SLOs: Engineering Reliability, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Language of Reliability
- SLI (Indicator): The metric. “95th Percentile Latency”.
- SLO (Objective): The goal. “99.9% of requests < 200ms”.
- SLA (Agreement): The punishment. “If we miss the SLO, we pay you back.” (Staff engineers care about SLOs; lawyers care about SLAs.)
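To make the SLI/SLO distinction concrete, here is a minimal sketch. It assumes a hypothetical list of request latencies in milliseconds; the function names `latency_sli` and `meets_slo` are illustrative and do not come from any library. The SLI is the measured fraction of fast requests, and the SLO is the target that fraction is compared against.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """SLI: fraction of requests that finished faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, target=0.999):
    """SLO: the measured SLI must be at or above the target."""
    return sli >= target


# Hypothetical sample: 999 fast requests and 1 slow one.
sample = [50.0] * 999 + [450.0]
sli = latency_sli(sample)
print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli)}")
```

With one more slow request in the same sample, the SLI would drop below 0.999 and the check would fail, which is the point at which an SLA penalty could start to matter.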
2. Defining Search SLOs
What matters for search?
- Availability: Can I search? (HTTP 200 OK). Target: 99.99%.
- Latency: Is it fast? (p95 < 200ms). Target: 99.9%.
- Freshness: Is new data there? (Lag < 5s). Target: 99.0%.
The Error Budget: If the SLO is 99.9%, you have 0.1% allowed failure. In a 30-day month (43,200 minutes), that is 43.2 minutes of downtime. Use this budget to deploy risky changes. If the budget is exhausted → Code Freeze.
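The freeze rule can be sketched as a simple gate. This is an illustrative policy under stated assumptions, not a standard implementation: `bad_minutes` is a hypothetical count of minutes in the window during which the service violated its SLI, and both function names are made up for this example.

```python
def budget_remaining_minutes(slo, window_minutes, bad_minutes):
    """Error budget left in the window, in minutes (negative means overspent)."""
    budget = (1.0 - slo) * window_minutes
    return budget - bad_minutes


def deploy_allowed(slo, window_minutes, bad_minutes):
    """Hypothetical policy: risky deploys are allowed only while budget remains."""
    return budget_remaining_minutes(slo, window_minutes, bad_minutes) > 0


WINDOW = 30 * 24 * 60  # 43,200 minutes in a 30-day month

print(deploy_allowed(0.999, WINDOW, bad_minutes=20))  # budget left, so deploy
print(deploy_allowed(0.999, WINDOW, bad_minutes=50))  # budget spent, so freeze
```

Real policies are usually softer than a hard boolean (for example, slowing rollouts as the budget shrinks), but the comparison against the remaining budget is the core of the idea.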
3. Error Budget Calculator
Adjust your specific targets to see how much “burn” you can afford. For a 99.9% target over a 30-day month, the budget works out to:
- Allowed Downtime: 43.2 min
- Allowed Failed Requests: 1,000 (0.1% of 1,000,000 requests)
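The same calculation can be reproduced offline. The sketch below assumes a 30-day window and a traffic volume of 1,000,000 requests, which is the volume at which a 99.9% target allows 1,000 failures; the function names are illustrative.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * window_minutes


def allowed_failed_requests(slo_percent, total_requests):
    """Failed requests the SLO tolerates, rounded to the nearest whole request."""
    return round((100.0 - slo_percent) / 100.0 * total_requests)


# Compare a few common targets against an assumed 1,000,000 requests.
for slo in (99.0, 99.9, 99.99):
    downtime = allowed_downtime_minutes(slo)
    failures = allowed_failed_requests(slo, 1_000_000)
    print(f"{slo}%: {downtime:.1f} min downtime, {failures:,} failed requests per 1M")
```

Each extra nine cuts the budget by a factor of ten, which is why moving from 99.9% to 99.99% is a much larger engineering commitment than the numbers suggest at first glance.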
4. Hardware Reality: P99 vs Max
Why do we measure P99 (99th percentile) instead of Average?
- Average: Hides problems. [1ms, 1ms, 10s] → Avg ≈ 3.3s, which describes none of the three requests. Not useful.
- P99: “99% of requests are faster than X”. This is the “Tail Latency”.
- P100 (Max): Useless. One GC pause or network blip sets P100 to 30s. Ignore it. Focus on P99.
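The difference between these statistics is easy to see on a small synthetic sample. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring systems often interpolate or estimate from histogram buckets instead. The latency values are made up for illustration.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample in ms: 985 fast requests, 14 slow ones, 1 extreme outlier.
latencies_ms = [20.0] * 985 + [300.0] * 14 + [30_000.0]

print("mean:", statistics.mean(latencies_ms))
print("p95: ", percentile(latencies_ms, 95))
print("p99: ", percentile(latencies_ms, 99))
print("max: ", max(latencies_ms))
```

In this sample the mean is pulled well above the typical 20 ms request by a single outlier, the max reports only that outlier, and P99 lands on the 300 ms group that represents the slow tail users actually experience.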