Service Maturity & Operational Excellence
Building a service is easy. Keeping it running for 5 years with 99.99% availability is hard.
As a Staff Engineer, you are responsible for the Operational Bar. You define what “Production Ready” means for your organization. Without standards, you end up with a zoo of unmaintained, fragile microservices.
1. The Production Readiness Review (PRR)
Before a service takes real traffic, it must pass a PRR. This is not a code review. It is an operational review.
Key Questions in a PRR:
- Observability: Are logs structured (JSON)? Do we have dashboards for latency, errors, and saturation?
- Alerting: Is there an on-call rotation? Are alerts actionable?
- Disaster Recovery: Is there a backup strategy? Have we tested the restore process?
- Capacity: Have we load tested? Do we know the breaking point?
- Documentation: Is there a runbook? Is there an architecture diagram?
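The questions above can be tracked as data rather than as a meeting agenda. The following is a minimal sketch, assuming a hypothetical `PrrCheck` record and `summarize_prr` helper (neither is part of any real tool); it groups the questions a service has not yet answered "yes" to by review area, so the review output is a to-do list rather than a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrrCheck:
    """One PRR question and the reviewer's answer (hypothetical structure)."""
    area: str       # review area, e.g. "Observability"
    question: str   # the question asked in the review
    passed: bool    # True if the team could answer "yes"


def summarize_prr(checks: list[PrrCheck]) -> dict[str, list[str]]:
    """Group the failed questions by review area."""
    gaps: dict[str, list[str]] = {}
    for check in checks:
        if not check.passed:
            gaps.setdefault(check.area, []).append(check.question)
    return gaps


# Example answers for an imaginary service.
checks = [
    PrrCheck("Observability", "Are logs structured (JSON)?", True),
    PrrCheck("Observability", "Dashboards for latency, errors, saturation?", True),
    PrrCheck("Alerting", "Is there an on-call rotation?", True),
    PrrCheck("Alerting", "Are alerts actionable?", False),
    PrrCheck("Disaster Recovery", "Is there a backup strategy?", True),
    PrrCheck("Disaster Recovery", "Have we tested the restore process?", False),
    PrrCheck("Capacity", "Have we load tested?", True),
    PrrCheck("Documentation", "Is there a runbook?", True),
]

gaps = summarize_prr(checks)
for area, questions in gaps.items():
    print(f"{area}: {len(questions)} open item(s)")
    for question in questions:
        print(f"  - {question}")
```

Reporting gaps per area, instead of a single pass/fail flag, keeps the review closer to a consultation: the team leaves with a list of what to fix.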
[!NOTE] A PRR shouldn’t be a blocker that stops innovation. It should be a consultation to help teams succeed.
2. The Service Maturity Model
You can’t fix everything at once. A maturity model gives teams a roadmap to improve.
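One way to turn a maturity model into a roadmap is to make each level a cumulative set of checks. The sketch below is illustrative only: the level definitions and check names (`has_owner`, `restore_tested`, and so on) are assumptions made up for this example, not the criteria of any specific model. A service sits at the highest level for which it passes that level's checks and every lower level's checks.

```python
# Hypothetical level definitions: each level lists the checks it adds.
LEVELS: list[tuple[int, set[str]]] = [
    (1, {"has_owner", "has_runbook"}),
    (2, {"structured_logs", "dashboards", "on_call_rotation"}),
    (3, {"restore_tested", "load_tested"}),
]


def maturity_level(passed: set[str]) -> int:
    """Return the highest level reached; levels are cumulative, 0 means none."""
    level = 0
    for number, required in LEVELS:
        if not required <= passed:  # some check for this level is missing
            break
        level = number
    return level


def next_steps(passed: set[str]) -> set[str]:
    """Return the checks still missing for the next level (empty at the top)."""
    current = maturity_level(passed)
    for number, required in LEVELS:
        if number == current + 1:
            return required - passed
    return set()


# An imaginary service: load tested, but nobody is on call for it.
service = {"has_owner", "has_runbook", "structured_logs", "dashboards", "load_tested"}
print(maturity_level(service))      # 1
print(sorted(next_steps(service)))  # ['on_call_rotation']
```

Because levels are cumulative, passing a Level 3 check early does not raise the score; the output of `next_steps` tells the team what to work on first.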
3. Golden Paths (Paved Roads)
You cannot expect every team to become experts in Kubernetes, Terraform, and Prometheus.
Instead of mandating standards (“You MUST use tool X”), provide Golden Paths. A Golden Path is a supported, opinionated way to build a service where the “right thing” is the default.
- Bad: “Here is the AWS console. Go build your infra.”
- Good: “Here is a `terraform-module-service` that gives you a load balancer, auto-scaling group, and standard monitoring dashboards out of the box.”
If teams stay on the Golden Path, they get upgrades, security patches, and PRR compliance for free. If they go off-road, they are on their own.
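The idea that the “right thing” is the default can be sketched in a few lines. This is an illustration in Python rather than Terraform, and every name in it (`ServiceSpec`, `off_path_settings`, the default values) is hypothetical. A team supplies only a name and an owner; everything else comes from the paved road, and any override is visible as an explicit deviation.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical service definition; only name and owner are required."""
    name: str
    owner_team: str
    # Golden Path defaults: teams get these unless they explicitly opt out.
    load_balancer: bool = True
    autoscaling: bool = True
    min_instances: int = 2
    max_instances: int = 10
    dashboards: tuple[str, ...] = ("latency", "errors", "saturation")


def off_path_settings(spec: ServiceSpec) -> dict[str, object]:
    """Return the fields where a spec deviates from the Golden Path defaults."""
    baseline = asdict(ServiceSpec(name=spec.name, owner_team=spec.owner_team))
    actual = asdict(spec)
    return {key: value for key, value in actual.items() if baseline[key] != value}


paved = ServiceSpec("contest-signup", "marketing")
custom = ServiceSpec(
    "contest-signup", "marketing",
    autoscaling=False, min_instances=1, max_instances=1,
)

print(off_path_settings(paved))   # {}
print(off_path_settings(custom))  # the three overridden fields
```

A platform team could use a report like `off_path_settings` to decide which services still qualify for free upgrades and which have gone off-road.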
4. Case Study: The Orphaned Service
Context: A marketing team built a “Contest Signup” service 2 years ago using a niche framework. The original developers left.
The Issue: The service started crashing during a Super Bowl ad. No one knew how to deploy it. No logs were being collected. The on-call engineer couldn’t even log in to the server.
The Fix:
- Immediate: Reboot the server manually (Band-aid).
- Strategic: The Staff Engineer audited all 50+ services in the org.
- Policy: Every service must have an Owner (Team) and a Tier.
  - Tier 1: Critical. Must meet Level 3 Maturity.
  - Tier 2: Internal. Must meet Level 2.
  - Tier 3: Experimental/Deprecating. Best effort.
The “Contest Signup” service was marked Tier 3 and scheduled for decommissioning.
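The ownership and tier policy lends itself to an automated audit of a service catalog. The sketch below assumes a hypothetical `Service` record and a made-up catalog; only the tier-to-level mapping comes from the policy above, with Tier 3 treated as having no minimum level.

```python
from dataclasses import dataclass
from typing import Optional

# Minimum maturity level per tier, per the policy; Tier 3 is best effort.
REQUIRED_LEVEL = {1: 3, 2: 2, 3: 0}


@dataclass(frozen=True)
class Service:
    """Hypothetical catalog entry."""
    name: str
    owner: Optional[str]   # owning team, or None if orphaned
    tier: Optional[int]    # 1, 2, or 3; None if never classified
    maturity_level: int


def audit(service: Service) -> list[str]:
    """Return the policy violations for one service (empty list if compliant)."""
    findings = []
    if not service.owner:
        findings.append("no owning team")
    if service.tier not in REQUIRED_LEVEL:
        findings.append("no valid tier")
    elif service.maturity_level < REQUIRED_LEVEL[service.tier]:
        required = REQUIRED_LEVEL[service.tier]
        findings.append(
            f"Tier {service.tier} requires Level {required}, "
            f"currently Level {service.maturity_level}"
        )
    return findings


# Made-up catalog for illustration.
catalog = [
    Service("checkout", "payments", 1, 3),
    Service("internal-wiki", "it-tools", 2, 1),
    Service("contest-signup", None, None, 0),
]

report = {s.name: audit(s) for s in catalog if audit(s)}
for name, findings in report.items():
    print(f"{name}: {'; '.join(findings)}")
```

Running a check like this on a schedule surfaces orphaned services before an incident does.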
[!TIP] Zombie services eat operational capacity. Identify them and shut them down or bring them up to standard.