# Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ primary mechanism for scaling applications out (adding replicas) and in (removing replicas) based on demand.
It answers the question: “How many copies of this application do I need right now to handle the current load?”
## 1. The Control Loop: First Principles
HPA is a control loop that runs inside the kube-controller-manager (usually every 15 seconds). It constantly compares the current metric value against your desired target.
### The Formula

The number of replicas is calculated using this formula:

\text{Desired} = \left\lceil \text{CurrentReplicas} \times \frac{\text{CurrentMetricValue}}{\text{TargetMetricValue}} \right\rceil
Example:
- Current Replicas: 2
- Current CPU Load: 100% (Avg per pod)
- Target CPU Load: 50%
\text{Desired} = \lceil 2 \times (100 / 50) \rceil = \lceil 2 \times 2 \rceil = 4
The HPA will scale up to 4 replicas.
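The calculation, including the clamping to `minReplicas`/`maxReplicas` that the controller also applies, can be sketched in Python (the bounds here are illustrative defaults, not part of the formula itself):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """HPA core formula: ceil(current * (currentMetric / targetMetric)),
    clamped to the configured minReplicas/maxReplicas bounds."""
    desired = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, desired))

# The worked example above: 2 replicas at 100% CPU, target 50%
print(desired_replicas(2, 100, 50))  # -> 4
```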
## 2. Interactive: The Scaling Simulator

Visualize how HPA reacts to traffic spikes. Notice that scaling up is fast, but scaling down is slow (to prevent “thrashing”).

Example state:
- Traffic Load: 200 RPS
- Target: 100 RPS per Pod
- Replicas: 2
- Utilization: 100%
## 3. Stabilization Windows: Preventing “Flapping”
HPA faces a problem called Flapping (or Thrashing).
- Load spikes → Scale Up.
- Load drops slightly → Scale Down.
- Load spikes again → Scale Up.
This causes pods to be created and destroyed rapidly, wasting CPU on startup costs.
### The Solution: Behavior Policy
By default, HPA has a 5-minute scale-down stabilization window. This means: “I see that load is low, but I will wait 5 minutes before deleting pods to make sure it’s not a temporary dip.”
You can configure this in the behavior section:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
```
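Conceptually, scale-down stabilization remembers the replica recommendations from the last window and only scales down to the highest of them. A simplified Python model (the real controller tracks wall-clock timestamps rather than loop counts):

```python
from collections import deque

class ScaleDownStabilizer:
    """Simplified model of HPA scale-down stabilization: keep the
    recommendations from the last `window` control-loop iterations
    and scale down only to the maximum of them."""
    def __init__(self, window: int):
        self.recommendations = deque(maxlen=window)

    def stabilize(self, recommendation: int) -> int:
        self.recommendations.append(recommendation)
        return max(self.recommendations)

# 300s window / 15s sync period ~= 20 recommendations kept
stab = ScaleDownStabilizer(window=20)
for rec in [4, 4, 2, 2, 2]:  # load drops after two iterations
    replicas = stab.stabilize(rec)
print(replicas)  # still 4: the dip has not outlasted the window
```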
## 4. Implementation Guide
### 1. Prerequisites: Metrics Server

HPA cannot function without the Metrics Server. It provides the `currentMetricValue`.
```bash
# Verify the Metrics Server is running
kubectl top nodes
kubectl top pods
```
### 2. Deployment Manifest

Your Deployment MUST have resource requests defined. HPA uses requests to calculate the utilization percentage.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m # Critical for HPA calculation
```
### 3. HPA Manifest (v2)

Use `autoscaling/v2`, which supports multiple metrics, custom metrics, and `behavior` policies.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```
> [!TIP]
> **Why 50% Utilization?** A target of 100% is dangerous: if traffic spikes, you have zero buffer while new pods are booting up. A target of 50–70% leaves headroom for spikes during the scale-up lag.
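A quick back-of-the-envelope check of that headroom, assuming pods were sitting exactly at the target utilization when the spike hits:

```python
def per_pod_load_during_spike(target_utilization: float,
                              spike_factor: float) -> float:
    """Per-pod load right after traffic multiplies by spike_factor,
    before the newly created pods become ready (pods were running at
    target_utilization beforehand)."""
    return target_utilization * spike_factor

# 60% traffic spike while replacement pods boot:
print(per_pod_load_during_spike(0.50, 1.6))  # 0.8: headroom remains
print(per_pod_load_during_spike(1.00, 1.6))  # 1.6: pods saturated
```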
## 5. Scaling on Custom Metrics
Sometimes CPU/Memory isn’t enough. You might want to scale on:
- Requests Per Second (RPS) (from Ingress)
- Queue Length (from RabbitMQ/SQS)
This requires the Prometheus Adapter. It translates Prometheus metrics into the Kubernetes Custom Metrics API so HPA can read them.
```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: packets-per-second
    target:
      type: AverageValue
      averageValue: 1k
```
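On the adapter side, a rules entry maps a Prometheus series onto a custom metric name that HPA can query. A sketch, assuming a counter named `packets_total` labeled with `namespace` and `pod` (the metric name and query here are illustrative):

```yaml
rules:
- seriesQuery: 'packets_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

This exposes the counter's per-second rate as `packets_per_second` through the Custom Metrics API, matching the metric name in the HPA manifest above.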
## 6. Common Gotchas
> [!WARNING]
> **Missing Requests:** If your containers don’t have `resources.requests` defined, HPA cannot calculate a utilization percentage, so CPU/Memory scaling will not work.
> [!WARNING]
> **Cold Starts:** HPA reacts to current load. If your app takes 60 seconds to boot (Java/Spring), your users will see errors during that scale-up window. Use over-provisioning or lower utilization targets to mitigate this.