# Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ primary mechanism for scaling applications out (adding replicas) and in (removing replicas) based on demand.
It answers the question: “How many copies of this application do I need right now to handle the current load?”
## 1. The Control Loop: First Principles
HPA is a control loop that runs inside the kube-controller-manager (usually every 15 seconds). It constantly compares the current metric value against your desired target.
### The Formula

The number of replicas is calculated using this formula:

\text{Desired} = \left\lceil \text{CurrentReplicas} \times \frac{\text{CurrentMetricValue}}{\text{TargetMetricValue}} \right\rceil
Example:
- Current Replicas: 2
- Current CPU Load: 100% (Avg per pod)
- Target CPU Load: 50%
\text{Desired} = \lceil 2 \times (100 / 50) \rceil = \lceil 2 \times 2 \rceil = 4
The HPA will scale up to 4 replicas.
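The calculation, including the clamping to `minReplicas`/`maxReplicas` that the controller also applies, can be sketched in Python (the bounds here are illustrative defaults, not part of the formula itself):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """HPA core formula: ceil(current * (currentMetric / targetMetric)),
    clamped to the configured minReplicas/maxReplicas bounds."""
    desired = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, desired))

# The worked example above: 2 replicas at 100% CPU, target 50%
print(desired_replicas(2, 100, 50))  # -> 4
```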
## 2. Interactive: The Scaling Simulator

Visualize how HPA reacts to traffic spikes. Notice that scaling up is fast, but scaling down is slow (to prevent “thrashing”).

Example state:
- Traffic Load: 200 RPS
- Target: 100 RPS per Pod
- Replicas: 2
- Utilization: 100%
## 3. Stabilization Windows: Preventing “Flapping”
HPA faces a problem called Flapping (or Thrashing).
- Load spikes → Scale Up.
- Load drops slightly → Scale Down.
- Load spikes again → Scale Up.
This causes pods to be created and destroyed rapidly, wasting CPU on startup costs.
### The Solution: Behavior Policy
By default, HPA has a 5-minute scale-down stabilization window. This means: “I see that load is low, but I will wait 5 minutes before deleting pods to make sure it’s not a temporary dip.”
You can configure this in the behavior section:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
```
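Conceptually, scale-down stabilization remembers the replica recommendations from the last window and only scales down to the highest of them. A simplified Python model (the real controller tracks wall-clock timestamps rather than loop counts):

```python
from collections import deque

class ScaleDownStabilizer:
    """Simplified model of HPA scale-down stabilization: keep the
    recommendations from the last `window` control-loop iterations
    and scale down only to the maximum of them."""
    def __init__(self, window: int):
        self.recommendations = deque(maxlen=window)

    def stabilize(self, recommendation: int) -> int:
        self.recommendations.append(recommendation)
        return max(self.recommendations)

# 300s window / 15s sync period ~= 20 recommendations kept
stab = ScaleDownStabilizer(window=20)
for rec in [4, 4, 2, 2, 2]:  # load drops after two iterations
    replicas = stab.stabilize(rec)
print(replicas)  # still 4: the dip has not outlasted the window
```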
## 4. Implementation Guide
### 1. Prerequisites: Metrics Server

HPA cannot function without the Metrics Server. It provides the `currentMetricValue`.
```bash
# Verify the Metrics Server is running
kubectl top nodes
kubectl top pods
```
### 2. Deployment Manifest

Your Deployment MUST have resource requests defined. HPA uses requests to calculate the utilization percentage.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m # Critical for HPA calculation
```
### 3. HPA Manifest (v2)

Use `autoscaling/v2`, which supports multiple metrics, custom metrics, and `behavior` policies.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```
> [!TIP]
> **Why 50% Utilization?** A target of 100% is dangerous: if traffic spikes, you have zero buffer while new pods are booting up. A target of 50–70% leaves headroom for spikes during the scale-up lag.
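A quick back-of-the-envelope check of that headroom, assuming pods were sitting exactly at the target utilization when the spike hits:

```python
def per_pod_load_during_spike(target_utilization: float,
                              spike_factor: float) -> float:
    """Per-pod load right after traffic multiplies by spike_factor,
    before the newly created pods become ready (pods were running at
    target_utilization beforehand)."""
    return target_utilization * spike_factor

# 60% traffic spike while replacement pods boot:
print(per_pod_load_during_spike(0.50, 1.6))  # 0.8: headroom remains
print(per_pod_load_during_spike(1.00, 1.6))  # 1.6: pods saturated
```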
## 5. Scaling on Custom Metrics
Sometimes CPU/Memory isn’t enough. You might want to scale on:
- Requests Per Second (RPS) (from Ingress)
- Queue Length (from RabbitMQ/SQS)
This requires the Prometheus Adapter. It translates Prometheus metrics into the Kubernetes Custom Metrics API so HPA can read them.
```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: packets-per-second
    target:
      type: AverageValue
      averageValue: 1k
```
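On the adapter side, a rules entry maps a Prometheus series onto a custom metric name that HPA can query. A sketch, assuming a counter named `packets_total` labeled with `namespace` and `pod` (the metric name and query here are illustrative):

```yaml
rules:
- seriesQuery: 'packets_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

This exposes the counter's per-second rate as `packets_per_second` through the Custom Metrics API, matching the metric name in the HPA manifest above.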
## 6. Common Gotchas
> [!WARNING]
> **Missing Requests:** If your containers don’t have `resources.requests` defined, HPA cannot calculate a utilization percentage, so CPU/Memory scaling will not work.
> [!WARNING]
> **Cold Starts:** HPA reacts to current load. If your app takes 60 seconds to boot (Java/Spring), your users will see errors during that scale-up window. Use over-provisioning or lower utilization targets to mitigate this.