Incident Management & Response
When the pager goes off at 2 AM, your heart rate spikes. The site is down. Twitter is angry. The CEO is texting you.
As a Staff Engineer, your role shifts from “Technical Architect” to Incident Commander. Your job isn’t necessarily to fix the bug yourself, but to coordinate the team to resolve it as fast as possible.
1. The Incident Command System (ICS)
Modeled after fire department protocols, the ICS provides a clear hierarchy during chaos. There is no democracy in a fire.
Key Roles
| Role | Responsibility | Who plays it? |
|---|---|---|
| Incident Commander (IC) | The Single Source of Truth. Manages the incident state, not the code. Delegates tasks. | Staff/Senior Eng |
| Operations Lead | The “Hands on Keyboard”. Executes commands, checks logs, deploys fixes. | DevOps/SRE |
| Scribe | The Historian. Documents every event, command, and timestamp in a shared doc. | Junior/Mid Eng |
| Comms Lead | The Shield. Updates internal/external stakeholders so the IC can focus. | Eng Manager/PM |
[!TIP] The IC should not touch the keyboard. If the IC gets sucked into debugging a specific log file, they lose situational awareness. Delegate!
2. SEV Levels: Defining Urgency
Not all alerts are emergencies. You need a clear language for severity.
- SEV-1 (Critical): Site down, data loss, or major revenue impact. Drop everything. 24/7 response. (e.g., A primary database volume failure causing total read/write outage, or a DNS misconfiguration preventing all traffic).
- SEV-2 (High): Major feature broken, high latency, but workaround exists. Fix immediately (waking hours).
- SEV-3 (Medium): Minor bug, edge case, or internal tool issue. Fix within standard SLA (e.g., 3 days).
- SEV-4 (Low): Cosmetic issue or minor annoyance. Backlog.
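One way to make these definitions stick is to encode them where alerts are triaged, so every on-call engineer classifies incidents the same way. A minimal sketch, assuming illustrative alert fields that your monitoring system may name differently:

```python
# Sketch: encoding the SEV definitions above as a triage function.
# The Alert fields are hypothetical; map them to your monitoring payload.
from dataclasses import dataclass

@dataclass
class Alert:
    revenue_impacted: bool    # site down, data loss, or major revenue impact
    feature_broken: bool      # a major feature is broken
    workaround_exists: bool   # users have a viable workaround
    cosmetic_only: bool       # visual glitch or minor annoyance

def classify_sev(alert: Alert) -> int:
    """Return the SEV level (1 = most urgent) for an alert."""
    if alert.revenue_impacted:
        return 1  # SEV-1: drop everything, 24/7 response
    if alert.feature_broken and alert.workaround_exists:
        return 2  # SEV-2: fix immediately during waking hours
    if alert.cosmetic_only:
        return 4  # SEV-4: backlog
    return 3      # SEV-3: fix within standard SLA

print(classify_sev(Alert(True, False, False, False)))  # → 1 (SEV-1)
```

A function like this also makes the severity policy reviewable and testable, instead of living only in a wiki page nobody reads at 2 AM.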
3. Interactive: Incident Commander Simulator
You are the Incident Commander for StreamFlow, a video streaming service. Users are reporting buffering issues. Make the right calls to save the service.
4. The Incident Lifecycle
- Detect: Monitoring triggers an alert, or a customer reports an issue.
- Respond: IC is paged. Incident channel opened. Roles assigned.
- Mitigate: Primary Goal. Stop the bleeding. Rollback, degrade features, or scale up. Do NOT fix the root cause yet if it takes too long.
- War Story (Thundering Herd): At Company X, a cache invalidation bug caused 100,000 requests/sec to hit the database, bringing it down. The IC didn’t wait for a code fix. The mitigation was to temporarily block 90% of traffic at the API Gateway. The database recovered, and the team had breathing room to fix the underlying cache issue. Temporary degradation is better than a full outage.
- Resolve: Clean up. Fix the root cause. Restore full service.
- Review: The Post-Mortem. Here, you focus on key metrics: MTTD (Mean Time To Detect) and MTTR (Mean Time To Resolve). The goal is to lower both.
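MTTD and MTTR fall straight out of the scribe's timeline: MTTD is the gap between the fault starting and detection, MTTR the gap between the fault starting and resolution. A small sketch with illustrative timestamps:

```python
# Sketch: computing MTTD and MTTR from incident timelines.
# The timestamps are illustrative; in practice they come from the
# scribe's doc or your incident-tracking tool.
from datetime import datetime
from statistics import mean

incidents = [
    # (fault_started, detected, resolved)
    (datetime(2024, 3, 1, 2, 0),  datetime(2024, 3, 1, 2, 6),  datetime(2024, 3, 1, 2, 40)),
    (datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 2), datetime(2024, 3, 9, 14, 30)),
]

mttd = mean((d - s).total_seconds() for s, d, _ in incidents) / 60  # minutes
mttr = mean((r - s).total_seconds() for s, _, r in incidents) / 60  # minutes
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # → MTTD: 4 min, MTTR: 35 min
```

Tracking these two numbers over quarters tells you whether your monitoring (MTTD) and your mitigation playbooks (MTTR) are actually improving.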
5. Communication: The “Holding Statement”
When the site is down, silence is terrifying to stakeholders. The IC (or Comms Lead) must provide regular updates.
Template for Executive Update:
[SEV-1] Checkout 500 Errors
Status: Investigating
Impact: ~15% of users unable to checkout.
Current Action: Rolling back last deployment.
Next Update: 15 mins.
Keep it brief. Facts only. No speculation.
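Because the Comms Lead is writing these under pressure, it helps to template the update so every field is forced to be filled in. A minimal sketch of that idea, using the example values from the template above:

```python
# Sketch: templating the executive update so nothing is improvised
# mid-incident. Field values are the illustrative ones from the template.
def holding_statement(sev, title, status, impact, action, next_update_mins):
    return (
        f"[SEV-{sev}] {title}\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"Current Action: {action}\n"
        f"Next Update: {next_update_mins} mins."
    )

msg = holding_statement(
    1, "Checkout 500 Errors", "Investigating",
    "~15% of users unable to checkout.",
    "Rolling back last deployment.", 15,
)
print(msg)
```

The same function can post to a status page and a Slack channel, so internal and external stakeholders always see a consistent message.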
6. The Post-Mortem: Learning from Failure
A post-mortem (or Incident Review) is NOT about finding who to blame. It is about understanding how the system allowed this to happen.
The 5 Whys Technique
Problem: The database CPU spiked to 100%.
- Why? A bad query was introduced.
- Why? The new “Recommended for You” feature didn’t have an index.
- Why? The developer forgot to add it.
- Why? The code review didn’t catch it.
- Why? We don’t have automated performance testing in CI to catch missing indexes. → ROOT CAUSE
Action Item: Add pg_stat_statements check in CI pipeline.
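One possible shape for that CI check, assuming the `pg_stat_statements` extension is enabled on the test database and the CI image has a Postgres driver (`psycopg2` here); the DSN, the 50 ms budget, and the function names are illustrative:

```python
# Hypothetical CI gate: after the integration suite has exercised the
# staging database, fail the build if any query's mean execution time
# exceeds a budget (a missing index usually shows up here).
import sys

BUDGET_MS = 50  # illustrative per-query latency budget

def over_budget(rows, budget_ms=BUDGET_MS):
    """Filter (query, mean_exec_time_ms) rows that exceed the budget."""
    return [(query, ms) for query, ms in rows if ms > budget_ms]

def run_check(dsn="dbname=ci_test"):  # illustrative DSN
    """Invoked by the CI job; exits non-zero to fail the build."""
    import psycopg2  # assumption: the CI image ships the Postgres driver
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        # mean_exec_time is the PostgreSQL 13+ column name
        cur.execute("SELECT query, mean_exec_time FROM pg_stat_statements")
        offenders = over_budget(cur.fetchall())
    for query, ms in offenders:
        print(f"SLOW ({ms:.1f} ms): {query[:80]}")
    sys.exit(1 if offenders else 0)
```

The point is the systemic fix: the same class of mistake (a forgotten index) now gets caught by a machine in CI rather than by a human at 2 AM.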
[!WARNING] If a post-mortem ends with “Developer needs to be more careful,” it has failed. Human error is inevitable. You must build systems that are resilient to human error.
7. Case Study: The “Delete All” Script
The Incident: An Ops engineer ran a cleanup script to delete old logs. Due to a typo in the path variable, it started deleting production data files.
The Response:
- Detection: Monitoring alerted on “Disk Usage Dropping Fast”.
- Mitigation: The engineer realized the mistake and Ctrl+C‘d the script.
- Impact: 5% of user avatars were lost.
The Post-Mortem Outcome: Instead of firing the engineer, the Staff Engineer asked: “Why was it possible to delete production data with a single command?”
- Fix 1: Remove direct SSH access to production for day-to-day tasks.
- Fix 2: Implement “Soft Delete” for all storage buckets.
- Fix 3: Require peer approval for any script run against prod.
The engineer who made the mistake became the champion for these new safety tools.
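Fix 2 is worth sketching, because soft delete is what turns a catastrophic script into a recoverable one: objects are moved to a trash area instead of being destroyed. A minimal in-memory sketch (the storage API is hypothetical; real object stores like S3 offer similar behavior via versioning):

```python
# Sketch of "soft delete": delete() moves objects to a trash area with a
# timestamp instead of destroying them, so a runaway cleanup can be undone.
import time

class SoftDeleteBucket:
    def __init__(self):
        self._objects = {}  # key -> data
        self._trash = {}    # key -> (data, deleted_at)

    def put(self, key, data):
        self._objects[key] = data

    def delete(self, key):
        # A background job would purge trash entries older than the
        # retention window (e.g., 30 days); nothing is destroyed here.
        self._trash[key] = (self._objects.pop(key), time.time())

    def restore(self, key):
        data, _ = self._trash.pop(key)
        self._objects[key] = data

    def get(self, key):
        return self._objects.get(key)
```

With this in place, the same typo'd cleanup script costs you a `restore` loop instead of 5% of user avatars.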
8. Interview Questions
When interviewing for Staff or Principal roles, expect questions that test your composure, delegation skills, and post-incident processes.
Q: Describe a major incident you were involved in and how you handled it.
- Bad Answer: “I noticed the site was down, so I ssh’d into the server and restarted the database. I was the hero.”
- Good Answer (STAR):
- Situation: “During Black Friday, our payment gateway started throwing 500s.”
- Task: “We needed to stop the revenue loss immediately while coordinating with multiple teams.”
- Action: “I stepped in as Incident Commander. I delegated the investigation to the payments lead, assigning a scribe to document the timeline. I realized the root cause would take hours to fix, so I authorized a temporary rollback to the previous day’s deployment as a mitigation.”
- Result: “We restored service in 12 minutes. The next day, I led a blameless post-mortem that identified a missing index, and we added automated performance checks to CI to prevent recurrence.”
Q: How do you balance feature development with technical debt?
- Answer: “I use Error Budgets. I work with Product to define an SLO. If we are meeting our SLO, we ship features. If we blow our Error Budget, we pause feature work and focus exclusively on reliability and paying down tech debt. This shifts the conversation from a subjective argument to an objective, data-driven decision.”
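The error-budget math behind that answer is simple enough to sketch. With a 99.9% SLO, 0.1% of requests are allowed to fail; the budget is how much of that allowance remains. The numbers below are illustrative:

```python
# Sketch of the error-budget arithmetic: a 99.9% SLO over 1,000,000
# requests allows 1,000 failures; spending 400 leaves 60% of the budget.
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative = blown)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures

remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of error budget left")  # → 60% of error budget left
```

When `remaining` drops below zero, the policy kicks in: feature work pauses and reliability work takes over, which is exactly the objective, data-driven trigger the answer describes.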