Module Review: Operations

[!NOTE] This review recaps the core principles of the Operations module: reliability targets, incident response, and blameless culture.

1. Key Takeaways

Reliability is a Feature: It must be prioritized like any other feature. SLOs and Error Budgets are the tools to negotiate this prioritization with Product.
Blameless Culture: You cannot fix a system if people are afraid to admit mistakes. Post-Mortems focus on process, not people.
Command & Control: During an outage, democracy is suspended. The Incident Commander leads; everyone else follows.
Mitigate ≠ Resolve: During an incident, your first goal is to stop the bleeding (Mitigate), not to find the perfect fix (Resolve).
Paved Roads: Don’t force standards; make the right way the easiest way using Golden Paths.
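
The error budget mentioned above is just arithmetic on the SLO: whatever fraction of requests the SLO allows to fail is the budget, and every failed request spends some of it. The following is a minimal sketch, assuming a request-based SLO. The function name `error_budget`, its fields, and the "freeze releases when the budget is spent" rule are illustrative assumptions, not something this module prescribes.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based SLO.

    slo is a fraction such as 0.999 (i.e. 99.9%).
    All names here are hypothetical, chosen for illustration.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")

    # The budget is the share of requests the SLO permits to fail.
    allowed_failures = (1.0 - slo) * total_requests
    remaining = allowed_failures - failed_requests
    consumed_fraction = failed_requests / allowed_failures

    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "consumed_fraction": consumed_fraction,
        # Assumed policy: once the budget is spent, reliability work
        # takes priority over shipping new features.
        "freeze_releases": remaining <= 0,
    }


if __name__ == "__main__":
    report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
    print(report)
```

With a 99.9% SLO over one million requests, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget and releases can continue. This is the number to bring to the negotiation with Product.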

2. Interactive Flashcards

Test your knowledge of Operational Excellence.

3. Cheat Sheet

The Nines Table

Availability	Downtime per Year	Downtime per Month	Typical Use Case
99%	3.65 days	7.31 hours	Batch jobs, non-critical internal tools
99.5%	1.83 days	3.65 hours	Standard e-commerce, user dashboards
99.9%	8.76 hours	43.8 minutes	Industry Standard for SaaS
99.99%	52.6 minutes	4.38 minutes	Core Banking, Payments, Auth
99.999%	5.26 minutes	26.3 seconds	Telco, Pacemakers
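
The downtime columns can be reproduced from the availability percentage. The sketch below assumes a 365-day year and treats a month as one twelfth of that year, so a value may differ from the table in the last digit (for example 7.30 rather than 7.31 hours per month at 99%). The names `HOURS_PER_YEAR`, `HOURS_PER_MONTH`, and `downtime_hours` are illustrative.

```python
HOURS_PER_YEAR = 365 * 24               # assumes a non-leap year
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # average month, about 730 hours


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given availability."""
    if not 0.0 < availability_pct <= 100.0:
        raise ValueError("availability_pct must be in (0, 100]")
    return (1.0 - availability_pct / 100.0) * period_hours


if __name__ == "__main__":
    for pct in (99.0, 99.5, 99.9, 99.99, 99.999):
        per_year = downtime_hours(pct, HOURS_PER_YEAR)
        per_month_min = downtime_hours(pct, HOURS_PER_MONTH) * 60
        print(f"{pct}%: {per_year:.3f} h/year, {per_month_min:.2f} min/month")
```

Each extra nine divides the allowed downtime by ten, which is why moving from 99.9% to 99.99% shrinks the monthly allowance from about 43.8 minutes to about 4.38 minutes.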

Incident Roles

Incident Commander (IC): Leader. Decision maker.
Operations Lead: Doer. Executes commands.
Scribe: Recorder. Timeline keeper.
Comms Lead: Speaker. Updates stakeholders.

SEV Levels

SEV-1: Critical. Site down. All hands on deck.
SEV-2: High. Major feature broken. Fix ASAP.
SEV-3: Medium. Minor bug. Fix in business hours.
SEV-4: Low. Cosmetic. Backlog.
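
The SEV levels above can be encoded as a small lookup so that tooling and humans use the same definitions. This is a sketch under stated assumptions: the `Severity` enum, the `ResponsePolicy` fields, the `page_immediately` flag, and the `classify` rules are hypothetical and only mirror the one-line descriptions in the list.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class ResponsePolicy:
    label: str
    example: str
    response: str
    page_immediately: bool  # assumption: only SEV-1 and SEV-2 page on-call


POLICIES = {
    Severity.SEV1: ResponsePolicy("Critical", "Site down", "All hands on deck", True),
    Severity.SEV2: ResponsePolicy("High", "Major feature broken", "Fix ASAP", True),
    Severity.SEV3: ResponsePolicy("Medium", "Minor bug", "Fix in business hours", False),
    Severity.SEV4: ResponsePolicy("Low", "Cosmetic", "Backlog", False),
}


def classify(site_down: bool = False,
             major_feature_broken: bool = False,
             cosmetic_only: bool = False) -> Severity:
    """Pick a severity, checking the most severe condition first."""
    if site_down:
        return Severity.SEV1
    if major_feature_broken:
        return Severity.SEV2
    if cosmetic_only:
        return Severity.SEV4
    return Severity.SEV3  # anything else is treated as a minor bug


if __name__ == "__main__":
    sev = classify(major_feature_broken=True)
    print(sev.name, POLICIES[sev].response)
```

Checking the most severe condition first means an incident that matches several descriptions is never under-classified.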

4. Further Reading

Staff Prep Glossary