Module Review: Operations

[!NOTE] This module review recaps the core principles of Operations, deriving solutions from first principles and hardware constraints to build production-ready expertise.

1. Key Takeaways

  1. Reliability is a Feature: It must be prioritized like any other feature. SLOs and Error Budgets are the tools to negotiate this prioritization with Product.
  2. Blameless Culture: You cannot fix a system if people are afraid to admit mistakes. Post-Mortems focus on process, not people.
  3. Command & Control: During an outage, democracy is suspended. The Incident Commander leads, others follow.
  4. Mitigate ≠ Resolve: During an incident, your first goal is to stop the bleeding (Mitigate), not to find the perfect fix (Resolve).
  5. Paved Roads: Don't force standards; make the right way the easiest way using Golden Paths.
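
The error budget from takeaway 1 is simple arithmetic: whatever the SLO does not promise is the amount of failure you are allowed to spend. The sketch below assumes a request-based SLO measured over a fixed window; the function names and the example numbers are illustrative and not taken from any particular tool.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Requests allowed to fail in the window, given an SLO such as 0.999."""
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # A 100% SLO leaves no budget to spend at all.
        return 0.0
    return (budget - failed_requests) / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO, 250 failures so far.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A remaining fraction near zero is the signal to negotiate with Product: slow down feature releases and spend the time on reliability work instead.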

2. Interactive Flashcards

Test your knowledge of Operational Excellence.


3. Cheat Sheet

The Nines Table

Availability  Downtime per Year  Downtime per Month  Typical Use Case
99%           3.65 days          7.31 hours          Batch jobs, non-critical internal tools
99.5%         1.83 days          3.65 hours          Standard e-commerce, user dashboards
99.9%         8.76 hours         43.8 minutes        Industry standard for SaaS
99.99%        52.6 minutes       4.38 minutes        Core banking, payments, auth
99.999%       5.26 minutes       26.3 seconds        Telco, pacemakers
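
The figures in the table follow from one formula: allowed downtime = (1 - availability) × period. A minimal sketch, assuming a 365-day year and a month defined as one twelfth of a year (a different calendar convention shifts the last digit slightly); `allowed_downtime` is an illustrative helper, not a library function.

```python
from datetime import timedelta

YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = YEAR / 12            # assumption: average month length


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime in `period` that still meets the availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9, 99.99, 99.999):
        per_year = allowed_downtime(target, YEAR)
        per_month = allowed_downtime(target, MONTH)
        print(f"{target}%: {per_year} per year, {per_month} per month")
```

Each extra nine divides the budget by ten, which is why moving from 99.9% to 99.99% usually requires automated failover rather than faster humans: about 4 minutes per month is less time than it takes to acknowledge a page.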

Incident Roles

  • Incident Commander (IC): Leader. Decision maker.
  • Operations Lead: Doer. Executes commands.
  • Scribe: Recorder. Timeline keeper.
  • Comms Lead: Speaker. Updates stakeholders.

SEV Levels

  • SEV-1: Critical. Site down. All hands on deck.
  • SEV-2: High. Major feature broken. Fix ASAP.
  • SEV-3: Medium. Minor bug. Fix in business hours.
  • SEV-4: Low. Cosmetic. Backlog.
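
The SEV levels above are easier to apply consistently when they are encoded once and shared by alerting and ticketing. The sketch below is one possible encoding: the descriptions come from the cheat sheet, while the `Severity` enum, the `RESPONSE` mapping, and the rule that SEV-1 and SEV-2 need an immediate response are assumptions made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means higher urgency."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Expected response per level, paraphrasing the cheat sheet.
RESPONSE = {
    Severity.SEV1: "Critical: site down, all hands on deck",
    Severity.SEV2: "High: major feature broken, fix ASAP",
    Severity.SEV3: "Medium: minor bug, fix in business hours",
    Severity.SEV4: "Low: cosmetic, add to backlog",
}


def needs_immediate_response(sev: Severity) -> bool:
    """Assumption: only SEV-1 and SEV-2 interrupt people outside business hours."""
    return sev <= Severity.SEV2


if __name__ == "__main__":
    for sev in Severity:
        print(sev.name, needs_immediate_response(sev), RESPONSE[sev])
```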

4. Further Reading

Staff Prep Glossary