SLOs: Engineering Reliability

[!NOTE] This module explores the core principles of SLOs: Engineering Reliability, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Language of Reliability

SLI (Indicator): The metric. “95th Percentile Latency”.
SLO (Objective): The goal. “99.9% of requests < 200ms”.
SLA (Agreement): The punishment. “If we miss the SLO, we pay you back.” (Staff engineers care about SLOs; lawyers care about SLAs.)

2. Defining Search SLOs

What matters for search?

Availability: Can I search? (HTTP 200 OK). Target: 99.99%.
Latency: Is it fast? (p95 < 200ms). Target: 99.9%.
Freshness: Is new data there? (Lag < 5s). Target: 99.0%.

The Error Budget: If the SLO is 99.9%, you have 0.1% of allowed failure. In a 30-day month (43,200 minutes), that is 43.2 minutes of downtime. Use this budget to deploy risky changes. If the budget is exhausted → Code Freeze.
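The freeze rule above can be sketched as a small check on how much of the budget has been used. This is a minimal illustration, not a prescribed policy: the function names (`allowed_failures`, `budget_consumed`, `should_freeze`) are hypothetical, and it assumes the budget is counted in failed requests rather than in minutes.

```python
from decimal import Decimal


def allowed_failures(slo_percent: str, total_requests: int) -> Decimal:
    """Requests that may fail in the window before the SLO is missed."""
    # Decimal keeps 100 - 99.9 exact; binary floats would give 0.0999...
    return (Decimal(100) - Decimal(slo_percent)) / Decimal(100) * total_requests


def budget_consumed(slo_percent: str, total_requests: int, failed_requests: int) -> Decimal:
    """Fraction of the error budget already used (1 means fully spent)."""
    budget = allowed_failures(slo_percent, total_requests)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return Decimal(failed_requests) / budget


def should_freeze(slo_percent: str, total_requests: int, failed_requests: int) -> bool:
    """Assumed policy: freeze risky deploys once the budget is fully spent."""
    return budget_consumed(slo_percent, total_requests, failed_requests) >= 1


if __name__ == "__main__":
    used = budget_consumed("99.9", 1_000_000, 250)
    print(f"budget used: {float(used):.0%}")
    print("freeze:", should_freeze("99.9", 1_000_000, 250))
```

Passing the target as a string keeps the arithmetic exact, so a service that has failed exactly its allowed number of requests is reported as out of budget rather than a hair under it.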

3. Error Budget Calculator

A worked example of how much “burn” a given target lets you afford over a 30-day month.

Availability Target (%): 99.9
Total Requests (Monthly): 1,000,000

Allowed Downtime: 43.2 min
Allowed Failed Requests: 1,000
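The same calculation can be reproduced in a few lines for other targets. This is a sketch under stated assumptions: a 30-day month of 43,200 minutes, as used earlier in this module, and a hypothetical helper name `error_budget`.

```python
from decimal import Decimal

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def error_budget(target_percent: str, monthly_requests: int) -> tuple[float, int]:
    """Return (allowed downtime in minutes, allowed failed requests)."""
    # Fraction of the month (or of requests) that is allowed to fail.
    allowed_fraction = (Decimal(100) - Decimal(target_percent)) / Decimal(100)
    downtime_minutes = float(allowed_fraction * MINUTES_PER_MONTH)
    # Round down: a partial request cannot fail.
    failed_requests = int(allowed_fraction * monthly_requests)
    return downtime_minutes, failed_requests


if __name__ == "__main__":
    for target in ("99.0", "99.9", "99.99"):
        minutes, failures = error_budget(target, 1_000_000)
        print(f"{target}% -> {minutes} min downtime, {failures} failed requests")
```

Each extra nine cuts the budget by a factor of ten: at 99.99% the same traffic allows only 4.32 minutes of downtime and 100 failed requests.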

4. Hardware Reality: P99 vs Max

Why do we measure P99 (99th percentile) instead of Average?

Average: Hides problems. [1ms, 1ms, 10s] → Avg ≈ 3.3s, which describes none of the three requests. Not useful.
P99: “99% of requests are faster than X”. The “Tail Latency”.
P100 (Max): Useless. One GC pause or network blip sets P100 to 30s. Ignore it. Focus on P99.
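To make the comparison concrete, the sketch below computes the mean, P99, and max for a synthetic latency sample. The sample itself is made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition; monitoring systems often interpolate or use histogram buckets instead, so their numbers can differ slightly.

```python
import statistics


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need a non-empty sample and 0 < p <= 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids float rounding in the rank.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Synthetic latencies in seconds: mostly fast, a slow tail, one 30 s outlier.
latencies = [0.010] * 985 + [0.200] * 14 + [30.0]

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies))
    print("p50: ", percentile(latencies, 50))
    print("p99: ", percentile(latencies, 99))
    print("max: ", max(latencies))
```

In this sample the single 30 s outlier sets the max and pulls the mean to roughly four times the median, while P99 lands on the 200 ms tail that a meaningful share of users actually experience.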