Module Review: Operations
[!NOTE] This module reviews the core principles of Operations, deriving solutions from first principles and hardware constraints to build production-ready expertise.
1. Key Takeaways
- Reliability is a Feature: It must be prioritized like any other feature. SLOs and Error Budgets are the tools to negotiate this prioritization with Product.
- Blameless Culture: You cannot fix a system if people are afraid to admit mistakes. Post-Mortems focus on process, not people.
- Command & Control: During an outage, democracy is suspended. The Incident Commander leads; everyone else follows.
- Mitigate ≠ Resolve: During an incident, your first goal is to stop the bleeding (Mitigate), not to find the perfect fix (Resolve).
- Paved Roads: Don't force standards; make the right way the easiest way using Golden Paths.
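The error budget from the first takeaway comes down to simple arithmetic: the budget is whatever the SLO leaves over. The following is a minimal sketch, assuming a request-based SLO measured over a fixed window. The function name `error_budget` and its parameters are illustrative and do not come from any specific tool.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based SLO.

    slo is a fraction (e.g. 0.999 for "three nines"). All names here are
    hypothetical; adapt them to whatever your monitoring stack reports.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "consumed_fraction": failed_requests / allowed_failures,
        "exhausted": remaining < 0,
    }


if __name__ == "__main__":
    # 99.9% SLO over one million requests with 250 failures so far.
    print(error_budget(0.999, 1_000_000, 250))
```

A budget that is mostly unspent argues for shipping features faster; an exhausted one argues for pausing launches and investing in reliability. That is the negotiation with Product the takeaway describes.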
2. Interactive Flashcards
Test your knowledge of Operational Excellence.
3. Cheat Sheet
The Nines Table
| Availability | Downtime per Year | Downtime per Month | Typical Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | Batch jobs, non-critical internal tools |
| 99.5% | 1.83 days | 3.65 hours | Standard e-commerce, user dashboards |
| 99.9% | 8.76 hours | 43.8 minutes | Industry Standard for SaaS |
| 99.99% | 52.6 minutes | 4.38 minutes | Core Banking, Payments, Auth |
| 99.999% | 5.26 minutes | 26.3 seconds | Telco, Pacemakers |
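The figures in the table follow from one formula: allowed downtime = period × (1 − availability). Below is a minimal sketch for computing other targets or periods. It assumes a 365-day year and an average month of 365.25 / 12 days, which appears to match the table's rounding; the names `allowed_downtime`, `YEAR`, and `MONTH` are illustrative.

```python
from datetime import timedelta

# Assumed period lengths; other conventions shift the last digit slightly.
YEAR = timedelta(days=365)
MONTH = timedelta(days=365.25 / 12)


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime permitted by an availability target over a period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return period * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9, 99.99, 99.999):
        per_year = allowed_downtime(target, YEAR)
        per_month = allowed_downtime(target, MONTH)
        print(f"{target}%: {per_year} per year, {per_month} per month")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why moving from 99.9% to 99.99% usually requires automated failover rather than faster humans.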
Incident Roles
- Incident Commander (IC): Leader. Decision maker.
- Operations Lead: Doer. Executes commands.
- Scribe: Recorder. Timeline keeper.
- Comms Lead: Speaker. Updates stakeholders.
SEV Levels
- SEV-1: Critical. Site down. All hands on deck.
- SEV-2: High. Major feature broken. Fix ASAP.
- SEV-3: Medium. Minor bug. Fix in business hours.
- SEV-4: Low. Cosmetic. Backlog.
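The SEV levels above can be encoded so that triage is consistent across responders. The following is a rough sketch, not a standard: the `Severity` enum, the `classify` inputs, and the rule that only SEV-1 and SEV-2 page the on-call engineer are assumptions made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: site down, all hands on deck
    SEV2 = 2  # High: major feature broken, fix ASAP
    SEV3 = 3  # Medium: minor bug, fix in business hours
    SEV4 = 4  # Low: cosmetic, goes to the backlog


def classify(site_down: bool, major_feature_broken: bool, cosmetic_only: bool) -> Severity:
    """Map coarse impact signals (hypothetical inputs) to a SEV level."""
    if site_down:
        return Severity.SEV1
    if major_feature_broken:
        return Severity.SEV2
    if cosmetic_only:
        return Severity.SEV4
    return Severity.SEV3


def pages_on_call(severity: Severity) -> bool:
    """Assumed policy: only SEV-1 and SEV-2 interrupt someone immediately."""
    return severity <= Severity.SEV2


if __name__ == "__main__":
    sev = classify(site_down=False, major_feature_broken=True, cosmetic_only=False)
    print(sev.name, "pages on-call:", pages_on_call(sev))
```

Checking the most severe condition first means an incident that matches several signals is always assigned the highest applicable level.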