Configure Kubernetes autoscaling under load
Deploy a stateless API to a Kubernetes cluster with a HorizontalPodAutoscaler targeting 60% CPU utilisation. Run a k6 load test ramping to 200 virtual users and watch pods scale up in real time. Then remove the load and verify the cluster scales back down to the minimum replica count.
Why this matters
Autoscaling is the mechanism that converts cloud infrastructure from fixed cost to variable cost. Misconfiguring it is how teams end up with either runaway pod counts (expensive) or throttled APIs under load (catastrophic). Understanding the relationship between HPA metrics, resource requests, and pod scheduling is core cloud engineering knowledge.
Before you start
- A running Kubernetes cluster (minikube, kind, or a managed cluster like EKS)
- kubectl configured and pointing at the cluster
- metrics-server installed in the cluster (required for CPU-based HPA)
- k6 installed locally for load testing
Step-by-step guide
1. Deploy the API with resource requests
Write a Deployment manifest for a simple HTTP API with replicas: 2, resources.requests.cpu: 100m, and resources.limits.cpu: 200m. Resource requests are mandatory for HPA to function: the HPA computes utilisation as current usage divided by the requested amount, so without requests it has no denominator for the percentage.
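A minimal Deployment sketch along these lines — the name demo-api, the image, and the port are placeholders for your own API:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-api                 # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-api
  template:
    metadata:
      labels:
        app: demo-api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/demo-api:latest   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m          # the denominator for HPA utilisation
            limits:
              cpu: 200m
```

With a 100m request and a 60% target, the HPA will add pods once average usage per pod exceeds roughly 60m of CPU.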
2. Configure the HorizontalPodAutoscaler
Write an HPA manifest targeting the Deployment with minReplicas: 2, maxReplicas: 10, and a CPU utilisation target of 60%. Apply it and verify it is active with kubectl get hpa; the TARGETS column will show <unknown>/60% until the metrics server has data.
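Using the autoscaling/v2 API, the HPA manifest could look like this (the target name demo-api assumes the Deployment above):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-api               # must match your Deployment's name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization      # percentage of the pods' CPU request
          averageUtilization: 60
```

Apply with `kubectl apply -f hpa.yaml`, then confirm with `kubectl get hpa demo-api`.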
3. Write the k6 load test
Write a k6 script that ramps from 0 to 200 virtual users over 2 minutes, holds for 3 minutes, then ramps down. Each VU makes a GET request to your API. Set thresholds: p95 latency under 500ms and error rate under 1%. These will tell you whether the autoscaling kept up.
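A sketch of that script follows; it runs under the k6 runtime (`k6 run script.js`), not Node, and the target URL is a placeholder you must replace with your service's endpoint:

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 200 }, // ramp 0 -> 200 VUs
    { duration: '3m', target: 200 }, // hold at 200 VUs
    { duration: '1m', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 latency under 500ms
    http_req_failed: ['rate<0.01'],   // error rate under 1%
  },
};

export default function () {
  http.get('http://<your-api-endpoint>/'); // replace with your API's URL
  sleep(1);                                // ~1 request/second per VU
}
```

If a threshold is breached, k6 exits non-zero and flags it in the end-of-run summary, which makes the test usable as a pass/fail gate.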
4. Run the load test and watch scaling
In one terminal, run kubectl get pods -w to watch pod count in real time. In another, run the k6 test. You should see pods scale up within 30-90 seconds of the load increasing (HPA has a default 15s sync period). Watch the HPA target percentage: kubectl get hpa -w.
5. Verify scale-down
After the load test ends, watch the pod count. Kubernetes waits 5 minutes by default before scaling down (the stabilisation window prevents flapping). Verify pods eventually return to the minimum replica count. Then check k6's summary; confirm p95 latency and error rate stayed within thresholds during scaling.
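The 5-minute delay is the HPA's default downscale stabilisation window. If you want to demonstrate its effect, it can be tuned per-HPA via the behavior field in autoscaling/v2 — a fragment such as:

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # default is 300s; raise to damp flapping further
```

`kubectl describe hpa` shows the scaling events and their timing, which is useful evidence that the stabilisation window behaved as expected.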