Review & Cheat Sheet

Congratulations! You have mastered the “Day 2” operations that keep systems alive. Before moving to the Final Assessment, let’s review.


1. Key Takeaways

  • Observability is Not Monitoring: Monitoring tells you something is broken. Observability gives you the data to ask why it’s broken (Logs, Metrics, Traces).
  • Beware Cardinality: Tagging metrics with unbounded values like user_id causes a Cardinality Explosion. Every distinct value creates a new time series, which can exhaust the memory of a time series database such as Prometheus.
  • Fail Fast with Circuit Breakers: Don’t let a slow dependency drag down your entire service. Open the circuit, shed load, and recover gracefully.
  • Retries Require Jitter: Naive retries create Thundering Herds. Always use Exponential Backoff and Jitter to spread out the recovery load.
  • Zero Trust Architecture: Inside your VPC, trust nothing. Use mTLS for service-to-service authentication, and OAuth 2.0 (Valet Keys) for authorization.
  • Deploy ≠ Release: Use Deployment Strategies (Blue/Green, Canary) and Feature Flags to separate the act of deploying code from exposing it to users.
  • Infrastructure as Code: GitOps ensures your infrastructure matches the state defined in your Git repository, providing a clear audit trail and easy rollbacks.
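
The circuit breaker takeaway can be made concrete with a small state machine. The following is a minimal, single-threaded sketch, not a production implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical breaker with Closed, Open, and Half-Open states (not thread-safe)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the struggling dependency at all.
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half-open"  # timeout elapsed: let one probe through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Wrapping a dependency call as `breaker.call(fetch_profile, user)` (where `fetch_profile` is a placeholder) means callers get an immediate `CircuitOpenError` while the circuit is open, instead of waiting on a timeout.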

2. War Story: The 3 AM Retry Storm

In a well-documented outage, a major ride-sharing app experienced a minor network blip that disconnected thousands of mobile clients. When the network recovered seconds later, every single app attempted to reconnect simultaneously without any delay.

This massive spike in requests overwhelmed their API gateways, causing them to time out and drop connections. The apps interpreted the timeouts as failures and immediately retried again. The system had entered a Thundering Herd death spiral, effectively DDoS-ing itself. The resolution required engineers to completely shut off traffic at the load balancer and slowly bleed it back in, all because the client retry logic lacked Exponential Backoff and Jitter. This is why we always spread out recovery loads.
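
A client-side fix for this failure mode might look like the sketch below. It uses the "full jitter" variant of exponential backoff (each delay is drawn uniformly between zero and an exponentially growing ceiling). The function names and the `base`, `cap`, and `retries` parameters are illustrative assumptions, not the retry logic of any real app.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, retries=5, rng=random.random):
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(func, is_transient, sleep=time.sleep, **backoff_kwargs):
    """Call func, retrying only transient errors with jittered exponential backoff."""
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return func()
        except Exception as exc:
            if not is_transient(exc):
                raise  # permanent errors are not worth retrying
            sleep(delay)
    return func()  # final attempt: let any error propagate to the caller
```

Because every client draws its own random delay, reconnect attempts spread out over time instead of arriving in synchronized waves.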


3. Interactive Flashcards

  • High Cardinality (Why it kills metrics): Tagging metrics with unique IDs (e.g., `user_id`) creates millions of time series, causing TSDB memory exhaustion (OOM).
  • Circuit Breaker (Fail Fast): Prevents cascading failure by stopping requests to a failing service. States: Closed (OK), Open (Blocked), Half-Open (Testing).
  • mTLS (Zero Trust): Mutual TLS. Both Client and Server present certificates to authenticate each other, so an attacker inside the network cannot impersonate a service.
  • Canary Deployment (Low Risk Release): Rolling out a new version to a small % of users (e.g., 1%) to test stability before full rollout.
  • Trace Context (Propagation): Passing the trace ID and span ID to downstream services in the W3C Trace Context `traceparent` header, so logs and spans can be linked across microservices.
  • Idempotency Key (Safe Retries): A unique ID sent with requests (e.g., payments) so the server can ignore duplicate requests if a network retry happens.
  • Thundering Herd (The Problem): Many clients retry at once (a Retry Storm), or many cache entries expire at once, and the resulting burst overwhelms the system.
  • GitOps (Pull Model): An agent (e.g., ArgoCD) inside the cluster pulls changes from Git. More secure than CI pushing to the cluster, because cluster credentials do not have to be handed to the CI system.

4. Interactive Scenario: The Panic Button

It’s 3 AM. You are on-call. The system is down. What do you do?

PAGERDUTY ALERT

"High Latency Detected on Payment Service (p99 > 5s)"


5. System Design Cheat Sheet

| Category      | Concept         | Key Takeaway                                                                  |
|---------------|-----------------|-------------------------------------------------------------------------------|
| Observability | Logs            | Structured (JSON) for querying specific events.                               |
|               | Metrics         | Aggregates for trends. Watch out for Cardinality Explosion (no user_id).      |
|               | Tracing         | Follow request across microservices. Use Sampling (Head/Tail).                |
|               | OTel            | Vendor-neutral standard. Use SDK + Collector.                                 |
| Reliability   | Circuit Breaker | Stop cascading failures. States: Closed, Open, Half-Open.                     |
|               | Retry           | Only for transient errors. Always use Exponential Backoff + Jitter.           |
|               | Idempotency     | Ensure f(f(x)) = f(x). Use Idempotency-Key header.                            |
| Security      | TLS 1.3         | Encrypts transit. 1-RTT handshake. Forward Secrecy.                           |
|               | OAuth 2.0       | Authorization (Valet Key). Flows: Auth Code (User), Client Creds (Service).   |
|               | mTLS            | Mutual TLS. Zero Trust for service-to-service calls.                          |
|               | JWT             | Stateless token. Header.Payload.Signature.                                    |
| Deployment    | Rolling         | Low cost, K8s default. Slow rollback.                                         |
|               | Blue/Green      | Safe, instant rollback, 2x cost.                                              |
|               | Canary          | Test in production with real users (1% → 100%). Lowest risk.                  |
|               | GitOps          | Infrastructure as Code + Automated Sync (ArgoCD). Pull Model.                 |
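
The Idempotency row is easiest to see in code. Below is a minimal sketch of a server that deduplicates on an idempotency key, so that a retried request returns the stored response instead of repeating the side effect. `PaymentService`, its in-memory dictionary, and the response shape are all hypothetical; a real service would keep the keys in a durable store with an expiry.

```python
class PaymentService:
    """Hypothetical in-memory handler that makes charge() safe to retry."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.charges = []     # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # Replay: return the stored response without charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]

        # First time we see this key: perform the side effect once.
        self.charges.append((account, amount_cents))
        response = {"status": "charged", "charge_id": len(self.charges)}
        self._responses[idempotency_key] = response
        return response
```

The client generates the key once per logical operation and reuses it on every retry of that operation; a new purchase gets a new key.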

6. Quick Revision

  • Logs: Point-in-time events (structured JSON). Best for deep debugging.
  • Metrics: Aggregated numerical data over time. Best for alerting and dashboards. Low cardinality is key.
  • Traces: Tracks a single request as it traverses multiple services (Context Propagation). Vital for latency analysis.
  • Circuit Breaker: Stops requests to failing services. Closed → Open → Half-Open.
  • Bulkhead Pattern: Isolates resources (like thread pools) to prevent a failure in one area from affecting others.
  • TLS 1.3: Faster (1-RTT) and more secure (Forward Secrecy).
  • OAuth 2.0 vs OIDC: OAuth is for Authorization (Delegated access). OIDC is for Authentication (Identity).
  • Canary Deployment: Among the lowest-risk deployment strategies, because a bad release only reaches a small share of users. Roll out to a small percentage, verify, then expand.
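
To see why low cardinality matters for metrics, it helps to do the arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The label names and counts below are made-up numbers for illustration only.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-latency metric.
bounded = {"method": 5, "status": 6, "endpoint": 40}
unbounded = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))    # 1200
print(series_count(unbounded))  # 1200000000
```

Adding one unbounded label multiplies the series count by the number of users, which is why per-user detail belongs in logs or traces rather than in metric labels.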

Review all the terms mentioned in this module: System Design Glossary


8. What’s Next?

You have completed the core technical modules! You are now ready for the Final Boss.

The next module is Module 18: Final Assessment, where we will simulate a real System Design Interview with a full Mock Scenario.
