To design systems, we need a shared language. Here are the definitions you will use every day.
1. Latency vs Throughput
This is the most common confusion for beginners.
Latency (Time)
- Definition: The time it takes to complete a single task.
- Analogy: How fast a Ferrari drives down a highway.
- Unit: Milliseconds (ms) or Seconds (s).
- Goal: Lower is better.
Throughput (Volume)
- Definition: The number of tasks completed per unit of time.
- Analogy: How many cars pass a toll booth per hour.
- Unit: Requests Per Second (RPS) or Queries Per Second (QPS).
- Goal: Higher is better.
[!NOTE] A bus has high latency (slow) but high throughput (moves 50 people at once). A Ferrari has low latency (fast) but low throughput (only 2 seats).
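If it helps to see the distinction in code, here is a minimal sketch (the handler and numbers are made up): the per-request latency stays roughly constant, while throughput scales with how many requests are served concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request():
    time.sleep(0.1)  # each request takes ~100 ms (its latency)

def measure(concurrency, total_requests=50):
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(lambda _: handle_request(), range(total_requests)))
    elapsed = time.time() - start
    print(f"concurrency={concurrency:2d}  latency~100ms  "
          f"throughput~{total_requests / elapsed:.0f} req/s")

measure(1)   # one lane: same per-request latency, low throughput
measure(50)  # fifty lanes: same per-request latency, ~50x the throughput
```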
2. Percentiles: p50, p99, and Tail Latency
If you only look at your Average Latency, you are lying to yourself.
Why Percentiles?
An “Average” is easily skewed by a single outlier. If 9 users have 10ms latency and 1 user has 10,000ms, the average is 1,009ms. This number is useless—it doesn’t describe the reality for either group.
Percentiles solve this:
- Median (p50): The “middle” value. 50% of requests are faster than this, 50% are slower.
- Tail Latency (p99): The 99th percentile. 99% of requests are faster than this. It represents the experience of the unluckiest 1% of users.
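A minimal sketch, using the same hypothetical numbers as above, of how the average hides the tail while percentiles expose it (the nearest-rank percentile helper is illustrative, not a production implementation):

```python
# Nine fast requests and one 10-second outlier, as in the example above.
latencies_ms = [10] * 9 + [10_000]

def percentile(values, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

print("average:", sum(latencies_ms) / len(latencies_ms), "ms")  # 1009.0 ms
print("p50:    ", percentile(latencies_ms, 50), "ms")           # 10 ms
print("p99:    ", percentile(latencies_ms, 99), "ms")           # 10000 ms
```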
The Amplification Effect: Why p99 is the only metric that matters at scale
In a modern microservice architecture, one user request might call 10 different services in parallel.
- If each service has a 1% chance of being slow (p99), what is the chance that the entire request is slow?
- Math: $1 - 0.99^{10} \approx 0.096 \approx \mathbf{10\%}$!
By the time a request fans out to 100 services (typical for a Staff engineer’s domain), roughly 63% of requests will hit at least one service’s p99 latency. This is why Staff designers optimize for the “Tail,” not the average.
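The same math as a quick sketch, so you can plug in your own fan-out:

```python
def p_slow_request(num_services, per_service_slow=0.01):
    """Chance that at least one of N parallel calls lands in its slow 1% tail."""
    return 1 - (1 - per_service_slow) ** num_services

for n in (1, 10, 100):
    print(f"{n:3d} services -> {p_slow_request(n):.1%} of requests hit a p99")
#   1 services -> 1.0% of requests hit a p99
#  10 services -> 9.6% of requests hit a p99
# 100 services -> 63.4% of requests hit a p99
```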
3. SLA, SLO, and SLI: The Reliability Trinity
Engineers often say “I want high availability,” but how do we measure it?
1. SLI (Service Level Indicator)
The Metric. What are we actually counting?
- Example: “The percentage of HTTP 200 responses.”
2. SLO (Service Level Objective)
The Target. What is our goal for that metric?
- Example: “99.9% of requests must return HTTP 200.”
3. SLA (Service Level Agreement)
The Contract. What happens if we miss the SLO? (Money, lawyers, or free credits).
- Note: As an engineer, you focus on SLOs. SLAs are for the business.
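A minimal sketch, with hypothetical numbers, of how the first two relate: the SLI is what you measure, and the SLO is the target you compare it against (the SLA is whatever the contract says happens when that comparison fails).

```python
# A day's worth of hypothetical status codes: 9,990 successes, 10 failures.
responses = [200] * 9_990 + [500] * 10

sli = sum(1 for code in responses if code == 200) / len(responses)  # the Indicator
slo = 0.999                                                         # the Objective

print(f"SLI = {sli:.2%}, SLO = {slo:.1%}, meeting SLO: {sli >= slo}")
# SLI = 99.90%, SLO = 99.9%, meeting SLO: True
```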
The “Nines” Table
| Availability | Yearly Downtime | Monthly Downtime |
| :--- | :--- | :--- |
| 99% (“Two Nines”) | 3.65 days | 7.3 hours |
| 99.9% (“Three Nines”) | 8.77 hours | 43 minutes |
| 99.99% (“Four Nines”) | 52.6 minutes | 4.3 minutes |
| 99.999% (“Five Nines”) | 5.26 minutes | 26 seconds |
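The table falls straight out of arithmetic. A small sketch that reproduces it (a “month” here is 365/12 days, to match the figures above):

```python
def allowed_downtime_minutes(availability, window_days):
    return (1 - availability) * window_days * 24 * 60

for nines in (0.99, 0.999, 0.9999, 0.99999):
    yearly = allowed_downtime_minutes(nines, 365)
    monthly = allowed_downtime_minutes(nines, 365 / 12)
    print(f"{nines:.3%} -> {yearly / 60:6.2f} h/year, {monthly:6.2f} min/month")
# 99.000% ->  87.60 h/year, 438.00 min/month   (~3.65 days / ~7.3 h)
# 99.900% ->   8.76 h/year,  43.80 min/month
# 99.990% ->   0.88 h/year,   4.38 min/month   (~52.6 min/year)
# 99.999% ->   0.09 h/year,   0.44 min/month   (~5.3 min/year / ~26 s)
```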
4. Error Budgets: Why 100% is the Wrong Goal
A common mistake for Seniors is striving for 100.0% reliability. Staff engineers know that nothing is 100%.
If your SLO is 99.9%, you have an Error Budget of 0.1%.
- If the budget is full: You can ship new features, take risks, and perform maintenance.
- If the budget is empty: You stop all deployments and focus 100% on reliability until the budget recovers.
[!IMPORTANT] Error Budgets align the Product team (who want features) with the Infra team (who want stability). If you have budget left, you aren’t shipping fast enough!
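A minimal sketch of a request-based error budget check (all numbers hypothetical):

```python
slo = 0.999
total_requests = 10_000_000   # requests served this 30-day window
failed_requests = 4_200       # requests that violated the SLI

budget = (1 - slo) * total_requests      # errors the SLO allows this window
remaining = budget - failed_requests

print(f"budget: {budget:,.0f} errors, used: {failed_requests:,}, left: {remaining:,.0f}")
if remaining <= 0:
    print("Budget exhausted: freeze feature deploys, work on reliability.")
else:
    print("Budget available: keep shipping features and taking calculated risks.")
```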
5. Client-Server Model
- Client: The “Asker”. Your browser, mobile app, or a smart fridge.
- Server: The “Worker”. A computer in a data center that processes the request and sends back a response.
6. Availability vs Reliability
- Availability: “Uptime”. Is the system operational right now? (e.g., 99.9% uptime).
- Reliability: “Trust”. Does the system do the right thing? (e.g., It doesn’t lose data or calculate 1+1=3).
You can be Available but not Reliable: a server that responds to every request but returns “Error 500” (or the wrong data) is technically up, yet useless.
7. Safety vs. Liveness
When reviewing a new architecture, a Staff engineer asks two formal questions:
1. Safety (“The Bad Thing doesn’t happen”)
Safety properties guarantee that the system remains in a valid state.
- Examples: “The database never loses an acknowledged write.” “Two users never get the same unique ID.”
- Failure: If a safety property is violated, it usually requires a human to fix the data (e.g., untangling a race condition).
2. Liveness (“The Good Thing eventually happens”)
Liveness properties guarantee that the system makes progress.
- Examples: “Every request eventually receives a response.” “The system eventually recovers after a restart.”
- Failure: If a liveness property is violated, the system is “stuck” (e.g., a deadlock or an infinite loop), but the data might still be safe.
[!NOTE] Most distributed systems trade Liveness for Safety. For example, in a network partition, a system might stop accepting writes (Loss of Liveness) to ensure no conflicting data is written (Preserving Safety).
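A minimal sketch of that trade, with hypothetical names: during a partition the node gives up liveness (writes stop making progress) rather than risk violating safety (acknowledging a write it cannot durably replicate).

```python
class QuorumUnavailable(Exception):
    """Raised instead of accepting a write we could not safely replicate."""

def replicate(key, value, replicas):
    # Hypothetical helper: ship the write to each chosen replica.
    return f"'{key}' committed to {len(replicas)} replicas"

def write(key, value, reachable_replicas, quorum=2):
    if len(reachable_replicas) < quorum:
        # Liveness sacrificed: the request makes no progress during the partition...
        raise QuorumUnavailable("partition detected, refusing the write")
    # ...so that Safety holds: every acknowledged write lives on a quorum.
    return replicate(key, value, reachable_replicas[:quorum])

print(write("user:42", "active", reachable_replicas=["r1", "r2", "r3"]))
# write("user:42", "active", reachable_replicas=["r1"])  # raises QuorumUnavailable
```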
8. The “System Qualities” (-ilities)
A Staff engineer evaluates a design by looking through five distinct “lenses.”
1. Availability (Uptime)
The probability that the system is functioning at a given time.
- Metric: SLO (e.g., 99.9% uptime).
- Focus: Keeping the lights on.
2. Reliability (Precision)
The probability that the system performs its function correctly over a period of time.
- Metric: Error Rate.
- Focus: Data integrity and correctness.
3. Resiliency (Recovery)
The ability of a system to recover from faults and gracefully handle pressure.
- Key: A system can be Highly Available but have Low Resiliency if it requires a human to manually reboot it every time a database flickers.
4. Scalability (Growth)
The ability of a system to handle increasing load by adding resources.
- Key: If doubling your users requires doubling your engineering team, your system is not scalable, even if your servers are.
5. Elasticity (Efficiency)
The ability to scale resources down when they aren’t needed to save cost.
- Example: An e-commerce site that scales to 1,000 nodes on Black Friday but costs almost $0 on a Tuesday night.
The Architect’s Matrix
| Quality | Question | Focus |
|---|---|---|
| Availability | Is it up right now? | Survival |
| Reliability | Is the answer right? | Correctness |
| Resiliency | Can it fix itself? | Recovery |
| Scalability | Can it get bigger? | Growth |
| Elasticity | Can it get smaller? | Cost |
9. Conclusion: The “Crux” of a Design
A Staff engineer doesn’t just describe a system; they identify its Crux—the hardest part of the problem.
When presenting a design, use this template:
- The Crux: “The core risk in this design is [Hot Shard in DB / Network Partition in Region A / Cache Invalidation Latency].”
- The Mitigation: “We handle this by [Adding Jitter / Using Eventual Consistency / Implementing a Fallback].”
- The Trade-off: “We chose to prioritize [Availability] over [Strict Consistency] because [User Impact is lower].”