Incident Management & Response
When the pager goes off at 2 AM, your heart rate spikes. The site is down. Twitter is angry. The CEO is texting you.
As a Staff Engineer, your role shifts from “Technical Architect” to Incident Commander. Your job isn’t necessarily to fix the bug yourself, but to coordinate the team to resolve it as fast as possible.
1. The Incident Command System (ICS)
Modeled after fire department protocols, the ICS provides a clear hierarchy during chaos. There is no democracy in a fire.
Key Roles
| Role | Responsibility | Who plays it? |
|---|---|---|
| Incident Commander (IC) | The Single Source of Truth. Manages the incident state, not the code. Delegates tasks. | Staff/Senior Eng |
| Operations Lead | The “Hands on Keyboard”. Executes commands, checks logs, deploys fixes. | DevOps/SRE |
| Scribe | The Historian. Documents every event, command, and timestamp in a doc. | Junior/Mid Eng |
| Comms Lead | The Shield. Updates internal/external stakeholders so the IC can focus. | Eng Manager/PM |
[!TIP] The IC should not touch the keyboard. If the IC gets sucked into debugging a specific log file, they lose situational awareness. Delegate!
2. SEV Levels: Defining Urgency
Not all alerts are emergencies. You need a clear language for severity.
- SEV-1 (Critical): Site down, data loss, or major revenue impact. Drop everything. 24/7 response.
- SEV-2 (High): Major feature broken or high latency, but a workaround exists. Fix immediately during waking hours.
- SEV-3 (Medium): Minor bug, edge case, or internal tool issue. Fix within standard SLA (e.g., 3 days).
- SEV-4 (Low): Cosmetic issue or minor annoyance. Backlog.
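The severity ladder above can be written down as a small triage function so that everyone applies the same rules. This is a minimal sketch, not a standard: the `Signal` fields, the `classify` function, and the paging rule are hypothetical names chosen for illustration, and the thresholds behind each flag would be defined by your own team.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sev(IntEnum):
    SEV1 = 1  # Critical: drop everything, 24/7 response
    SEV2 = 2  # High: fix immediately during waking hours
    SEV3 = 3  # Medium: fix within the standard SLA
    SEV4 = 4  # Low: backlog


@dataclass
class Signal:
    # Hypothetical triage inputs; a real system would derive these from alerts.
    site_down: bool = False
    data_loss: bool = False
    major_revenue_impact: bool = False
    major_feature_broken: bool = False
    high_latency: bool = False
    cosmetic_only: bool = False


def classify(signal: Signal) -> Sev:
    """Map triage signals to a severity, checking the worst case first."""
    if signal.site_down or signal.data_loss or signal.major_revenue_impact:
        return Sev.SEV1
    if signal.major_feature_broken or signal.high_latency:
        return Sev.SEV2
    if signal.cosmetic_only:
        return Sev.SEV4
    # Anything else (minor bug, edge case, internal tool issue) is medium.
    return Sev.SEV3


def pages_on_call_at_night(sev: Sev) -> bool:
    """Only SEV-1 justifies waking someone up in this sketch."""
    return sev == Sev.SEV1
```

Checking the most severe conditions first means an incident that matches several rules always lands on the higher severity.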
3. Interactive: Incident Commander Simulator
You are the Incident Commander for StreamFlow, a video streaming service. Users are reporting buffering issues. Make the right calls to save the service.
4. The Incident Lifecycle
- Detect: Monitoring triggers an alert, or a customer reports an issue.
- Respond: IC is paged. Incident channel opened. Roles assigned.
- Mitigate: The primary goal. Stop the bleeding: roll back, degrade features, or scale up. Do NOT try to fix the root cause yet if that would delay recovery.
- Resolve: Clean up. Fix the root cause. Restore full service.
- Review: The Post-Mortem.
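The five phases above form a strict sequence, which can be modeled as a tiny state machine. The sketch below is illustrative only; the `Incident` class and its `advance` method are hypothetical names, not part of any incident tooling.

```python
from enum import Enum


class Phase(Enum):
    DETECT = "detect"
    RESPOND = "respond"
    MITIGATE = "mitigate"
    RESOLVE = "resolve"
    REVIEW = "review"


# Enum members iterate in definition order, which is the lifecycle order.
ORDER = list(Phase)


class Incident:
    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self) -> Phase:
        """Move to the next phase; phases cannot be skipped or repeated."""
        index = ORDER.index(self.phase)
        if index == len(ORDER) - 1:
            raise ValueError("incident is already in review")
        self.phase = ORDER[index + 1]
        self.history.append(self.phase)
        return self.phase
```

Because `advance` only ever moves one step forward, an incident cannot be marked resolved before mitigation has been recorded.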
5. Communication: The “Holding Statement”
When the site is down, silence is terrifying to stakeholders. The IC (or Comms Lead) must provide regular updates.
Template for Executive Update:
[SEV-1] Checkout 500 Errors
Status: Investigating
Impact: ~15% of users unable to checkout.
Current Action: Rolling back last deployment.
Next Update: 15 mins.
Keep it brief. Facts only. No speculation.
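Generating the update from structured fields keeps every message in the same shape, even under pressure. The following is a minimal sketch of the template above; the `StatusUpdate` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    severity: int
    title: str
    status: str
    impact: str
    current_action: str
    next_update_minutes: int

    def render(self) -> str:
        """Render the update in the fixed five-line template."""
        return "\n".join([
            f"[SEV-{self.severity}] {self.title}",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Current Action: {self.current_action}",
            f"Next Update: {self.next_update_minutes} mins.",
        ])


update = StatusUpdate(
    severity=1,
    title="Checkout 500 Errors",
    status="Investigating",
    impact="~15% of users unable to checkout.",
    current_action="Rolling back last deployment.",
    next_update_minutes=15,
)
print(update.render())
```

Requiring every field at construction time makes it hard to send an update that omits the impact or the time of the next update.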
6. The Post-Mortem: Learning from Failure
A post-mortem (or Incident Review) is NOT about finding who to blame. It is about understanding how the system allowed this to happen.
The 5 Whys Technique
Problem: The database CPU spiked to 100%.
- Why? A bad query was introduced.
- Why? The new “Recommended for You” feature didn’t have an index.
- Why? The developer forgot to add it.
- Why? The code review didn’t catch it.
- Why? We don’t have automated performance testing in CI to catch missing indexes. → ROOT CAUSE
Action Item: Add a pg_stat_statements check to the CI pipeline.
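One possible shape for such a check is sketched below. It assumes the CI job has already run the test suite against a database with the `pg_stat_statements` extension enabled and has fetched rows containing the view's `query`, `calls`, and `mean_exec_time` (milliseconds, PostgreSQL 13 and later) columns. The function name and both thresholds are hypothetical. Slow mean execution time is only a proxy for a missing index, so treat this as a starting point rather than a complete detector.

```python
from typing import Iterable, List, Mapping


def find_slow_statements(
    rows: Iterable[Mapping[str, object]],
    max_mean_ms: float = 100.0,  # hypothetical budget per statement
    min_calls: int = 5,          # ignore statements run too rarely to judge
) -> List[str]:
    """Return the query text of statements whose mean time exceeds the budget."""
    offenders = []
    for row in rows:
        if row["calls"] >= min_calls and row["mean_exec_time"] > max_mean_ms:
            offenders.append(row["query"])
    return offenders


# Sample rows shaped like a pg_stat_statements result (values are made up).
sample = [
    {"query": "SELECT * FROM users WHERE id = $1", "calls": 900, "mean_exec_time": 0.4},
    {"query": "SELECT * FROM recommendations WHERE user_id = $1", "calls": 120, "mean_exec_time": 850.0},
    {"query": "VACUUM", "calls": 1, "mean_exec_time": 3000.0},
]

slow = find_slow_statements(sample)
if slow:
    print("Slow statements found:", slow)
```

A CI step could exit with a non-zero status when the returned list is non-empty, which turns a reviewer's judgment call into an automated gate.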
[!WARNING] If a post-mortem ends with “Developer needs to be more careful,” it has failed. Human error is inevitable. You must build systems that are resilient to human error.
7. Case Study: The “Delete All” Script
The Incident: An Ops engineer ran a cleanup script to delete old logs. Due to a typo in the path variable, it started deleting production data files.
The Response:
- Detection: Monitoring alerted on “Disk Usage Dropping Fast”.
- Mitigation: The engineer realized the mistake and Ctrl+C’d the script.
- Impact: 5% of user avatars were lost.
The Post-Mortem Outcome: Instead of firing the engineer, the Staff Engineer asked: “Why was it possible to delete production data with a single command?”
- Fix 1: Remove direct SSH access to production for day-to-day tasks.
- Fix 2: Implement “Soft Delete” for all storage buckets.
- Fix 3: Require peer approval for any script run against prod.
The engineer who made the mistake became the champion for these new safety tools.
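Fixes 2 and 3 can be combined in a cleanup helper that refuses to run without an approver, refuses to touch anything outside an allowed directory, and moves files to a trash location instead of deleting them. This is a sketch under stated assumptions: the `soft_delete` function, its parameters, and the trash-directory approach are hypothetical, and it works on a local filesystem rather than on storage buckets.

```python
import shutil
from pathlib import Path
from typing import Optional


def soft_delete(
    target: Path,
    allowed_root: Path,
    trash_dir: Path,
    approved_by: Optional[str],
) -> Path:
    """Move target into trash_dir instead of deleting it.

    Raises PermissionError without an approver and ValueError when the
    target is not located under allowed_root (for example after a path typo).
    """
    if not approved_by:
        raise PermissionError("peer approval is required for this operation")

    # Resolve symlinks and ".." so the containment check cannot be bypassed.
    target = target.resolve()
    allowed_root = allowed_root.resolve()
    if allowed_root not in target.parents:
        raise ValueError(f"{target} is outside {allowed_root}; refusing to delete")

    trash_dir.mkdir(parents=True, exist_ok=True)
    # Name collisions in the trash are not handled in this sketch.
    destination = trash_dir / target.name
    shutil.move(str(target), str(destination))
    return destination
```

Because the file is moved rather than removed, a mistaken run can be undone by moving it back, and a separate job can empty the trash after a retention period.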