Immortality: Restart Policies & Healing

In a distributed system, failure is inevitable. Processes crash. Memory leaks happen. The network blips. A robust system isn’t one that never crashes; it’s one that recovers automatically.

The “Heart Monitor” Analogy

Think of the Docker daemon (dockerd) as a hospital’s central monitoring station. When a container (the patient) is running, the daemon watches its vital signs: the container’s main process, which runs as PID 1 inside the container. If that process exits, the daemon (notified by containerd, the runtime it delegates to) checks the container’s restart policy to decide whether to deploy the defibrillator (restart) or leave it stopped.


1. The 4 Policies

Docker provides four restart policies to control container resurrection. Choosing the wrong one can lead to silent failures or infinite crash loops that burn through CPU.

| Policy | Description | When to use it |
| --- | --- | --- |
| no | Do not restart automatically. (Default) | One-off scripts, local debugging, or when an orchestrator (like Kubernetes) handles restarts. |
| on-failure[:max-retries] | Restart only if the process exits with a non-zero exit code (indicating an error). Optionally limit retries. | Batch jobs, data migrations, or background workers that might fail transiently but succeed on retry. |
| always | Always restart the container if it stops. If manually stopped, it will restart when the Docker daemon restarts. | Critical infrastructure components, web servers, databases. |
| unless-stopped | Like always, but if you manually docker stop it, the daemon remembers this state and won’t wake it up after a system reboot. | Production services you might want to purposefully sideline for maintenance. |

War Story: The "always" trap
A junior engineer once set a faulty database migration container to always. It crashed immediately, but Docker kept resurrecting it. Later, they manually stopped it to fix the issue. During a routine server patch that weekend, the server rebooted. The Docker daemon started back up, saw the always policy, and happily resurrected the broken migration script, corrupting the production database before anyone noticed. Use unless-stopped unless you have a specific reason to override manual stops!
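
The policy table and the war story can be condensed into two small decision functions. This is only a sketch of the documented behaviour, not Docker's actual implementation: the function names should_restart and starts_on_daemon_boot and their arguments are hypothetical, and the sketch assumes any retry limit N is a positive number.

```shell
# Hypothetical helper (not part of Docker): would the daemon restart a
# container right after its main process exits?
# Usage: should_restart POLICY EXIT_CODE MANUALLY_STOPPED [RESTARTS_SO_FAR]
#   POLICY            no | always | unless-stopped | on-failure | on-failure:N
#   MANUALLY_STOPPED  yes | no
should_restart() {
    policy=$1 exit_code=$2 manual=$3 restarts=${4:-0}

    if [ "$manual" = "yes" ]; then
        # A manual stop suppresses the immediate restart under every policy.
        echo "stay-stopped"
    elif [ "$policy" = "always" ] || [ "$policy" = "unless-stopped" ]; then
        echo "restart"
    elif [ "${policy#on-failure}" != "$policy" ]; then
        max=${policy#on-failure}   # "" or ":N"
        max=${max#:}               # "" or "N"
        if [ "$exit_code" -eq 0 ]; then
            echo "stay-stopped"    # clean exit is not a failure
        elif [ -n "$max" ] && [ "$restarts" -ge "$max" ]; then
            echo "stay-stopped"    # retry budget used up
        else
            echo "restart"
        fi
    else
        echo "stay-stopped"        # policy "no" (or anything unrecognised)
    fi
}

# Hypothetical helper: is the container started again when the daemon
# itself comes back up (for example after a host reboot)?
starts_on_daemon_boot() {
    policy=$1 manual=$2
    case $policy in
        always) echo "start" ;;    # even if it was stopped by hand
        unless-stopped)
            if [ "$manual" = "yes" ]; then echo "stay-stopped"; else echo "start"; fi ;;
        *) echo "stay-stopped" ;;
    esac
}
```

Calling starts_on_daemon_boot always yes prints start, which is exactly the trap in the war story; the same call with unless-stopped prints stay-stopped.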

2. The Technical Reality: Exit Codes

To understand when Docker triggers an on-failure restart, you must understand exit codes. When process PID 1 terminates, it leaves behind an integer indicating why it died.

  • Exit 0: Graceful shutdown. The process finished successfully, or it handled a stop request (the SIGTERM sent by docker stop) and shut down cleanly.
  • Exit 1: Catch-all for application errors, for example an unhandled exception.
  • Exit 137: Killed by SIGKILL (128 + 9). Often this means the kernel's OOM killer ended the process because it ran out of memory, but docker kill, or docker stop after its grace period expires, produces the same code.
  • Exit 143: Terminated by SIGTERM (128 + 15). The process received a stop signal and exited because of it, without handling the signal and returning 0. A process that ignores SIGTERM for too long is killed with SIGKILL instead and reports 137.

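Docker uses these codes to enforce the policy. on-failure ignores Exit 0 but acts on any non-zero code, such as 1, 137, or 143. A container that you stop manually is not restarted immediately under any policy, whatever its exit code.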
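
To make the 128 + signal convention concrete, here is a small sketch that turns an exit code into a short description. The function name describe_exit is hypothetical, and the container name my-container in the usage comments is only a placeholder.

```shell
# Hypothetical helper: describe a container exit code.
describe_exit() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "clean exit"
    elif [ "$code" -gt 128 ]; then
        # Codes above 128 conventionally mean "killed by signal (code - 128)".
        sig=$((code - 128))
        case $sig in
            9)  echo "killed by SIGKILL (signal 9)" ;;
            15) echo "terminated by SIGTERM (signal 15)" ;;
            *)  echo "killed by signal $sig" ;;
        esac
    else
        echo "application error (exit $code)"
    fi
}

# Usage against a real container (assumes one named my-container exists):
#   describe_exit "$(docker inspect --format '{{.State.ExitCode}}' my-container)"
# To tell an OOM kill apart from other SIGKILLs:
#   docker inspect --format '{{.State.OOMKilled}}' my-container
```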


3. Exponential Backoff

If your app crashes immediately on startup (the pattern Kubernetes reports as CrashLoopBackOff), Docker does not restart it in a tight, CPU-burning loop.

Instead, Docker doubles the delay before each restart attempt: 100ms, 200ms, 400ms, 800ms… up to a maximum of 1 minute. The sequence stops when the on-failure retry limit is reached or when you stop or remove the container. If a restarted container stays up for at least 10 seconds, the delay resets to 100ms.
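
The doubling rule is easy to tabulate. The sketch below prints the delay before each of the first N restart attempts, assuming a 100 ms starting delay, a doubling step, and a 60 000 ms cap; the function name backoff_schedule is hypothetical, and the sketch does not model the reset after a successful run.

```shell
# Hypothetical helper: print the delay in milliseconds before each of the
# first N restart attempts (start at 100, double each time, cap at 60000).
backoff_schedule() {
    attempts=$1
    delay=100
    max=60000
    i=1
    while [ "$i" -le "$attempts" ]; do
        echo "$delay"
        delay=$((delay * 2))
        if [ "$delay" -gt "$max" ]; then
            delay=$max
        fi
        i=$((i + 1))
    done
}

# Example: show the first 12 delays, one per line.
#   backoff_schedule 12
```

Under these assumptions the cap is reached on the 11th attempt, so a container that keeps crashing settles into roughly one restart per minute.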


4. Interactive: Resurrection Lab

Test how different policies react to different exit scenarios. Notice how on-failure treats a simulated crash differently than a graceful stop.
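
The same experiment can be approximated from a terminal. This is a rough sketch that assumes a running Docker daemon and access to the alpine image; the function name resurrection_lab and the container names lab-crash and lab-ok are arbitrary choices for this example.

```shell
# Hypothetical lab script: compare a crashing and a cleanly exiting container
# under the same on-failure:3 policy. Nothing runs until you call the function.
resurrection_lab() {
    # Exits with code 1 after one second, so on-failure should retry it.
    docker run -d --name lab-crash --restart on-failure:3 alpine sh -c 'sleep 1; exit 1'
    # Exits with code 0 after one second, so on-failure should leave it alone.
    docker run -d --name lab-ok --restart on-failure:3 alpine sh -c 'sleep 1; exit 0'

    sleep 15  # give the retries (and their backoff delays) time to play out

    # Report how often each container was restarted and how it ended.
    docker inspect \
        --format '{{.Name}} restarts={{.RestartCount}} exit={{.State.ExitCode}} status={{.State.Status}}' \
        lab-crash lab-ok

    docker rm -f lab-crash lab-ok  # clean up
}
```

After running resurrection_lab, the crashing container should report a non-zero restart count while the clean one should report none.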


5. Code Example: Defining Policies

You can enforce restart policies either during the imperative docker run command or declaratively via Docker Compose.

version: '3.8'
services:
  web-server:
    image: nginx
    restart: always  # Will always attempt to keep Nginx alive
    ports:
      - "80:80"

  worker:
    image: my-worker
    restart: on-failure:3  # Try to restart up to 3 times if it crashes
    command: ["./process-jobs"]

The equivalent imperative commands:

# Run a new container with a restart policy
docker run -d --restart unless-stopped redis

# Update an existing, already-running container's policy dynamically
docker update --restart always my-container