Cluster Autoscaler (CA)

While HPA and VPA scale your applications, the Cluster Autoscaler (CA) scales your infrastructure.

It answers the question: “Do I have enough compute nodes to run all these pods?”

1. The Trigger: Pending Pods

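A common misconception is that CA reacts to actual CPU usage (e.g., “Add a node if cluster CPU > 80%”). It does not.

CA works from pod resource requests, not live metrics. Scale-up is triggered only by Pending Pods that cannot be scheduled; scale-down looks at how much of a node's capacity is requested.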

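Scale Up: CA adds a node when a pod is stuck in the Pending state because no existing node has enough unreserved CPU/Memory to satisfy its requests.
Scale Down: CA removes a node when it is underutilized (e.g., the sum of pod requests is below 50% of its capacity) and its pods can be rescheduled onto other nodes.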
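
The two rules above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not CA's actual implementation: the `Node` class and both helper functions are hypothetical names, only CPU requests (in millicores) are modeled, and the 0.5 threshold mirrors the 50% example above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_capacity_m: int                                  # allocatable CPU, millicores
    pod_requests_m: list = field(default_factory=list)   # CPU requests of pods on the node

    def requested(self) -> int:
        return sum(self.pod_requests_m)

    def utilization(self) -> float:
        # Ratio of *requested* CPU to capacity; live CPU usage is never consulted.
        return self.requested() / self.cpu_capacity_m


def needs_scale_up(nodes, pending_request_m):
    """True if no existing node has enough free request space for the pending pod."""
    return all(n.cpu_capacity_m - n.requested() < pending_request_m for n in nodes)


def scale_down_candidates(nodes, threshold=0.5):
    """Names of nodes below the threshold whose pods all fit on the other nodes."""
    candidates = []
    for node in nodes:
        if node.utilization() >= threshold:
            continue
        free = [n.cpu_capacity_m - n.requested() for n in nodes if n is not node]
        movable = True
        # Greedy first-fit: try to place each pod (largest first) on another node.
        for req in sorted(node.pod_requests_m, reverse=True):
            for i, space in enumerate(free):
                if space >= req:
                    free[i] -= req
                    break
            else:
                movable = False
                break
        if movable:
            candidates.append(node.name)
    return candidates


nodes = [
    Node("node-a", 4000, [1500, 1500]),   # 75% requested
    Node("node-b", 4000, [500]),          # 12.5% requested
]
print(needs_scale_up(nodes, 2000))   # False: node-b still has 3500m free
print(scale_down_candidates(nodes))  # ['node-b']
```

Note that a node running hot on real CPU but with small requests would still look "underutilized" in this model, which is why accurate requests matter.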

2. Interactive: The Cluster Visualizer

Add pods until the cluster is full. Watch the CA provision a new node to handle the overflow.
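
The same behavior can be reproduced offline with a toy simulation. This is a minimal sketch, assuming each node holds exactly 4 pods ("slots") and that the node group has a hypothetical upper bound `max_nodes`; real nodes are sized by CPU/Memory requests rather than slots.

```python
SLOTS_PER_NODE = 4  # toy capacity: every node holds exactly four pods


def simulate(pod_count, max_nodes=3):
    """Add pods one at a time; provision a node whenever a pod would stay Pending."""
    nodes = [0]      # pods per node, starting with a single empty node
    pending = 0
    for _ in range(pod_count):
        # Index of the first node with a free slot, or None if the cluster is full.
        target = next((i for i, used in enumerate(nodes) if used < SLOTS_PER_NODE), None)
        if target is None and len(nodes) < max_nodes:
            nodes.append(0)              # scale-up triggered by the unschedulable pod
            target = len(nodes) - 1
        if target is None:
            pending += 1                 # node group is already at its maximum size
        else:
            nodes[target] += 1
    return nodes, pending


for pods in (4, 5, 13):
    layout, pending = simulate(pods)
    print(f"pods={pods} nodes={len(layout)} layout={layout} pending={pending}")
# pods=4 nodes=1 layout=[4] pending=0
# pods=5 nodes=2 layout=[4, 1] pending=0
# pods=13 nodes=3 layout=[4, 4, 4] pending=1
```

The fifth pod is the first one that cannot be placed, so it is the one that causes a second node to appear. The thirteenth pod stays Pending because the assumed `max_nodes` limit has been reached.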

3. Expanders: How to Choose a Node Group

When CA scales up, it often has multiple Node Groups (AWS ASGs) to choose from (e.g., t3.medium, m5.large, spot-instances).

Which one does it pick? The Expander strategy decides.

Random (Default): Picks a node group at random.
Most-Pods: Picks the group that can schedule the most pending pods.
Least-Waste: Picks the group that will have the least idle CPU/Memory after scheduling.
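Price: (Cloud Specific) Picks the node group with the lowest estimated cost (e.g., Spot before On-Demand). Only available on cloud providers whose CA integration supplies pricing data.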
Priority: User-defined priority (e.g., “Always try Spot first”).
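
To make the differences between strategies concrete, here is a rough sketch of how each one would choose among three candidate node groups. The `Option` class, the group names, and all numbers are hypothetical; the real expanders work on CA's internal scale-up simulation, and the Price expander is omitted because it depends on provider pricing data.

```python
import random
from dataclasses import dataclass


@dataclass
class Option:
    group: str           # node group name (made up for this example)
    pods_scheduled: int  # pending pods that would fit after scaling this group
    idle_cpu_m: int      # CPU left unrequested after scheduling, in millicores
    priority: int        # user-assigned; the higher value wins


def pick(options, strategy, rng=random):
    """Return the name of the node group the given strategy would select."""
    if strategy == "random":
        return rng.choice(options).group
    if strategy == "most-pods":
        return max(options, key=lambda o: o.pods_scheduled).group
    if strategy == "least-waste":
        return min(options, key=lambda o: o.idle_cpu_m).group
    if strategy == "priority":
        return max(options, key=lambda o: o.priority).group
    raise ValueError(f"unknown strategy: {strategy}")


options = [
    Option("on-demand-m5", pods_scheduled=6, idle_cpu_m=1200, priority=10),
    Option("spot-mixed",   pods_scheduled=5, idle_cpu_m=300,  priority=50),
    Option("t3-medium",    pods_scheduled=3, idle_cpu_m=100,  priority=10),
]

for strategy in ("most-pods", "least-waste", "priority"):
    print(strategy, "->", pick(options, strategy))
# most-pods -> on-demand-m5
# least-waste -> t3-medium
# priority -> spot-mixed
```

The same pending pods lead to three different node groups depending on the strategy, which is why the expander choice has a direct effect on cost and bin-packing.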

4. Configuration: Scale Down Delay

CA is conservative about scaling down. It doesn’t want to terminate a node just because it was empty for 10 seconds.

--scale-down-unneeded-time: How long a node must be continuously unneeded (underutilized, with pods that can be moved elsewhere) before it’s eligible for deletion (Default: 10 minutes).
--scale-down-delay-after-add: How long to wait after a scale-up event before considering scale-down (Default: 10 minutes).
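
How the two timers interact can be illustrated with a small sketch. The function name and its arguments are hypothetical, the constants mirror the defaults listed above, and CA applies additional checks that are not modeled here.

```python
from datetime import datetime, timedelta

SCALE_DOWN_UNNEEDED_TIME = timedelta(minutes=10)    # --scale-down-unneeded-time
SCALE_DOWN_DELAY_AFTER_ADD = timedelta(minutes=10)  # --scale-down-delay-after-add


def eligible_for_removal(now, unneeded_since, last_scale_up):
    """True only if both timers have elapsed for a node marked as unneeded."""
    if unneeded_since is None:
        return False  # node is currently needed
    if last_scale_up is not None and now - last_scale_up < SCALE_DOWN_DELAY_AFTER_ADD:
        return False  # still in the cool-down window after the last scale-up
    return now - unneeded_since >= SCALE_DOWN_UNNEEDED_TIME


now = datetime(2024, 1, 1, 12, 0)
at = lambda h, m: datetime(2024, 1, 1, h, m)

print(eligible_for_removal(now, at(11, 55), at(11, 0)))   # False: unneeded for only 5 min
print(eligible_for_removal(now, at(11, 45), at(11, 55)))  # False: scale-up 5 min ago
print(eligible_for_removal(now, at(11, 45), at(11, 30)))  # True: both timers elapsed
```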

[!TIP] Use Spot Instances
CA works well with Spot Instances. Create two Node Groups: one On-Demand (base capacity) and one Spot (burst capacity). Use the Priority expander to prefer Spot.

5. Cloud Provider Integration

CA requires permissions to talk to your cloud provider (AWS, GCP, Azure) to actually provision the VM.

AWS Example (IAM Policy)

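The CA pod needs an IAM Role with at least these permissions. Depending on the CA version and on whether your ASGs use launch configurations or launch templates, additional read-only actions such as autoscaling:DescribeLaunchConfigurations or ec2:DescribeLaunchTemplateVersions may also be required: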

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*"
    }
  ]
}

Kubernetes Deployment (Helm)

Use the official Helm chart:

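# Register the chart repository, then install CA into kube-system
# (the namespace used by the troubleshooting commands in this guide).
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<YOUR_CLUSTER_NAME> \
  --set awsRegion=us-east-1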

6. Troubleshooting: “CA is Stuck”

If you have Pending Pods but CA isn’t adding nodes, check:

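Pod Logs: kubectl logs -n kube-system -l app=cluster-autoscaler (adjust the label selector to your install; Helm releases typically label pods with app.kubernetes.io/instance=<release name>)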
ConfigMap: kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
ASG Limits: Did you hit the maxSize of your Auto Scaling Group?
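Affinity/Taints: CA simulates scheduling before it scales up. If no node group's nodes would satisfy the pod's nodeSelector/nodeAffinity, or those nodes carry taints the pod does not tolerate, CA will not add a node, because the pod still could not be scheduled on it.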
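
The last two checks can be reasoned about with a small helper. This is only an illustrative sketch: the dictionary layout, the `blocking_reasons` function, and the sample data are hypothetical, and taints are compared as plain strings rather than with the full Kubernetes toleration-matching rules.

```python
def blocking_reasons(pod, group):
    """Return reasons why scaling `group` would not help `pod` (empty list if it would)."""
    reasons = []
    if group["current_size"] >= group["max_size"]:
        reasons.append(f"{group['name']}: already at maxSize ({group['max_size']})")
    for key, value in pod.get("node_selector", {}).items():
        if group["labels"].get(key) != value:
            reasons.append(f"{group['name']}: label {key}={value} not present")
    for taint in group.get("taints", []):
        if taint not in pod.get("tolerations", []):
            reasons.append(f"{group['name']}: taint {taint} not tolerated")
    return reasons


pod = {"node_selector": {"workload": "batch"}, "tolerations": []}
groups = [
    {"name": "on-demand", "current_size": 5, "max_size": 5,
     "labels": {"workload": "batch"}, "taints": []},
    {"name": "spot", "current_size": 1, "max_size": 10,
     "labels": {"workload": "batch"}, "taints": ["spot=true:NoSchedule"]},
]

for group in groups:
    print(blocking_reasons(pod, group) or f"{group['name']}: can host the pod")
# ['on-demand: already at maxSize (5)']
# ['spot: taint spot=true:NoSchedule not tolerated']
```

In this made-up scenario the pod stays Pending for two unrelated reasons, one per node group, which is a common pattern when CA appears stuck.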