Blue/Green & Canary

[!TIP] The Goal: Deploy to production without the user noticing a single error.

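In the old days, we had “Maintenance Windows” (the site was down from 2 AM to 4 AM). Today, companies like Netflix deploy thousands of times a day without taking the site offline. How? They use Deployment Strategies that limit how many users a bad release can reach and how long it takes to undo it.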


1. The Big Three Strategies

1.1 Rolling Update (Kubernetes Default)

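  • Mechanism: Replace instances in small batches (one by one in the simplest case). Kubernetes sets the batch size with maxSurge and maxUnavailable, which both default to 25%.
    • Start 1 new pod (v2). Wait for it to be healthy.
    • Kill 1 old pod (v1).
    • Repeat until all are v2.
  • Pros: Zero downtime. Low cost (only the surge pods need extra capacity, e.g. +1).
  • Cons: Slow rollback (the rollout has to run again in reverse). Hard to debug (traffic hits both v1 and v2 simultaneously).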
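
The loop described above can be sketched in a few lines of Python. This is a toy simulation, not how the Kubernetes Deployment controller is implemented: `rolling_update` and the `is_healthy` callback are hypothetical names, and pods are represented only by their version string.

```python
from typing import Callable, List


def rolling_update(
    pods: List[str],
    new_version: str,
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Replace pods one at a time: surge by one, then retire one old pod.

    Returns the final pod list. Stops early (leaving the fleet as it is)
    if a new pod fails its health check, which mimics a stalled rollout.
    """
    pods = list(pods)  # work on a copy
    desired = len(pods)
    while any(version != new_version for version in pods):
        # 1. Start one new pod and wait for it to report healthy.
        if not is_healthy(new_version):
            break  # rollout stalls; the remaining old pods keep serving
        pods.append(new_version)
        # 2. Only now remove one old pod, so capacity never drops below desired.
        old_index = next(i for i, v in enumerate(pods) if v != new_version)
        pods.pop(old_index)
        assert len(pods) == desired
    return pods


fleet = rolling_update(["v1", "v1", "v1"], "v2", is_healthy=lambda v: True)
print(fleet)  # ['v2', 'v2', 'v2']
```

Between iterations the list holds both `v1` and `v2` entries, which is the mixed-version window listed under Cons.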

1.2 Blue/Green Deployment

  • Mechanism: Two identical environments.
    • Blue: Live (v1).
    • Green: Staging (v2).
    • Run tests on Green. When ready, switch the Load Balancer from Blue → Green.
  • Pros: Instant rollback (switch back to Blue). No mixed versions.
  • Cons: Expensive (needs 2x resources while both environments are running).
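
The key idea is that the cutover is a single pointer flip, which a small sketch makes concrete. `BlueGreenRouter` below is a hypothetical stand-in for a load balancer, not a real library class, and `smoke_test_passed` is an assumed input representing the result of the tests run on Green.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BlueGreenRouter:
    """Hypothetical load balancer that sends all traffic to one environment."""

    environments: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "v2"}
    )
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def handle_request(self) -> str:
        # Every request goes to the live environment; versions never mix.
        return self.environments[self.live]

    def switch(self, smoke_test_passed: bool) -> bool:
        """Cut over to the idle environment only if its tests passed."""
        if not smoke_test_passed:
            return False
        self.live = self.idle
        return True

    def rollback(self) -> None:
        # Rollback is the same pointer flip in the other direction.
        self.live = self.idle


router = BlueGreenRouter()
print(router.handle_request())  # v1
router.switch(smoke_test_passed=True)
print(router.handle_request())  # v2
router.rollback()
print(router.handle_request())  # v1
```

Because the old environment stays up after the switch, rollback costs nothing extra at that moment, which is also why the strategy needs double the resources.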

1.3 Canary Deployment (The Safest)

  • Mechanism: Roll out v2 to a small subset of users (e.g., 1%).
    • Monitor metrics (Error Rate, Latency).
    • If healthy, increase to 10%, 50%, 100%.
    • If errors spike, roll back automatically.
  • Pros: Lowest risk. Real user testing.
  • Cons: Complex setup (needs weighted routing from an advanced Load Balancer or a Service Mesh like Istio, plus reliable metrics).
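
A minimal sketch of the two moving parts, assuming a hash-based traffic split and a single error-rate threshold. The names `bucket`, `route`, `run_canary`, and `error_rate_at` are hypothetical; a real setup would read metrics from a monitoring system and set weights in the load balancer or mesh.

```python
import hashlib
from typing import Callable, Sequence


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    # The same user always lands in the same bucket, so raising the
    # percentage only adds users to v2; nobody flips back and forth.
    return "v2" if bucket(user_id) < canary_percent else "v1"


def run_canary(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (1, 10, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Walk through the rollout steps.

    error_rate_at(percent) is an assumed hook returning the observed v2
    error rate at that traffic share. Returns the final v2 percentage:
    the last step if every check passed, 0 if the canary was rolled back.
    """
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return 0  # automatic rollback: all traffic back to v1
    return steps[-1]


healthy = run_canary(lambda percent: 0.001)
buggy = run_canary(lambda percent: 0.20 if percent >= 10 else 0.001)
print(healthy)  # 100
print(buggy)    # 0
```

In the second run the bug only shows up at 10% traffic, so the rollout stops there and at most a tenth of users ever saw v2.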

[!NOTE] War Story: The $460 Million Deployment. In 2012, Knight Capital Group lost $460 million in 45 minutes because a dead code path was accidentally triggered on a single un-updated server during a manual deployment. A strict Canary deployment strategy with automated anomaly detection could have caught the spike in erroneous trades on the first 1% of traffic and triggered an automatic rollback before the losses grew.

2. Feature Flags: Decoupling Deploy from Release

What if you want to deploy code but not show it to users yet?

Feature Flags (Toggles) allow you to wrap new code in an if statement:

if feature_flags.is_enabled("new_checkout_flow", user_id):
    render_new_checkout()
else:
    render_old_checkout()

  1. Deploy: Push code to production (Flag = False).
  2. Test: Enable flag for internal users (“dogfooding”).
  3. Release: Enable flag for 10% of users.
  4. Full Launch: Enable flag for 100%.
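
The `feature_flags` object in the snippet above is left undefined, so here is one way it might look. This is a sketch under assumptions: the `FeatureFlags` class, its `set_flag` method, the allow list, and the hash-based percentage split are all illustrative and not the API of any particular flag service.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Flag:
    rollout_percent: int = 0  # share of users who see the feature (0..100)
    allow_list: Set[str] = field(default_factory=set)  # e.g. internal users


class FeatureFlags:
    def __init__(self) -> None:
        self._flags: Dict[str, Flag] = {}

    def set_flag(
        self, name: str, rollout_percent: int = 0, allow_list: Iterable[str] = ()
    ) -> None:
        self._flags[name] = Flag(rollout_percent, set(allow_list))

    def is_enabled(self, name: str, user_id: str) -> bool:
        flag = self._flags.get(name)
        if flag is None:
            return False  # unknown flags default to off
        if user_id in flag.allow_list:
            return True
        # Hash flag name + user so each flag gets its own stable split.
        digest = hashlib.sha256(f"{name}:{user_id}".encode("utf-8")).hexdigest()
        return int(digest, 16) % 100 < flag.rollout_percent


feature_flags = FeatureFlags()

# Steps 1-2: code is deployed with the flag off, except for internal users.
feature_flags.set_flag("new_checkout_flow", 0, allow_list={"employee-42"})
print(feature_flags.is_enabled("new_checkout_flow", "employee-42"))  # True
print(feature_flags.is_enabled("new_checkout_flow", "customer-7"))   # False

# Step 4: full launch is a configuration change, not a deployment.
feature_flags.set_flag("new_checkout_flow", 100)
print(feature_flags.is_enabled("new_checkout_flow", "customer-7"))   # True
```

Step 3 (10% of users) would be `set_flag("new_checkout_flow", 10)`. Which users fall into that 10% depends on the hash, so it is not shown with an expected output.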

[!NOTE] War Story: Dark Launching at Facebook. When Facebook rewrote their entire Messenger backend infrastructure, they avoided a risky “big bang” release. Instead, they used feature flags to route 1% of messages to the new system in the background. They silently verified database consistency against the old system (Dark Launching) for weeks before ever showing a new UI to a single user.


3. GitOps: Infrastructure as Code

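GitOps (popularized by tools like ArgoCD and Flux) means using a Git repository as the Single Source of Truth for your infrastructure: the desired state of the cluster is declared in files, and the cluster is continuously brought in line with what is committed.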

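  • Push Model (Old): The CI Pipeline (e.g., Jenkins) runs kubectl apply against the cluster.
    • Risk: CI needs full access to the Production Cluster, so a compromised CI server can change production.
  • Pull Model (GitOps): An agent (ArgoCD) inside the cluster watches the Git Repo and applies any changes it finds.
    • Security: Cluster credentials never leave the cluster. The agent pulls changes, so no outside system needs write access.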
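
The pull model boils down to a reconcile loop: compare the desired state in Git with the live state and apply the difference. The sketch below illustrates that idea only and is not how ArgoCD or Flux are implemented. The `State` type, `diff`, and `reconcile` are hypothetical names, and resources are reduced to a name and a version tag.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # resource name -> version tag


def diff(desired: State, live: State) -> List[Tuple[str, str]]:
    """List the actions needed to make the live state match the desired one."""
    actions: List[Tuple[str, str]] = []
    for name, version in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != version:
            actions.append(("update", name))
    # Anything running that Git does not declare gets pruned.
    for name in sorted(live.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


def reconcile(desired: State, live: State) -> State:
    """One sync pass, run by the in-cluster agent. Returns the new live state."""
    new_live = dict(live)
    for action, name in diff(desired, live):
        if action == "delete":
            del new_live[name]
        else:
            new_live[name] = desired[name]
    return new_live


git_repo = {"web": "v1.1", "worker": "v1.0"}  # what was committed
cluster = {"web": "v1.0", "cron": "v0.9"}     # what is actually running

print(diff(git_repo, cluster))
# [('update', 'web'), ('create', 'worker'), ('delete', 'cron')]

cluster = reconcile(git_repo, cluster)
print(cluster == git_repo)  # True
print(diff(git_repo, cluster))  # []
```

Running `reconcile` again on an already synced cluster does nothing, which is what lets the agent run it on a timer without side effects.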

[!TIP] Why GitOps? Audit trail. Every change to production is a Git Commit. You can revert infrastructure changes just like code (git revert).


4. Summary

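| Strategy       | Speed   | Safety  | Cost   | Best For                             |
|----------------|---------|---------|--------|--------------------------------------|
| Rolling Update | Medium  | Medium  | Low    | Standard microservices.              |
| Blue/Green     | Fast    | High    | High   | Critical legacy apps. DB migrations. |
| Canary         | Slow    | Highest | Medium | High-scale, user-facing apps.        |
| Feature Flags  | Instant | High    | Low    | UI changes, Dark Launches.           |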