KEDA — Kubernetes Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) extends the Kubernetes HPA to scale workloads based on external event sources: Kafka consumer lag, SQS queue depth, Prometheus metrics, Redis list length, and 60+ other scalers.
Why KEDA
Standard HPA scales on CPU and memory. KEDA scales on what actually matters for queue-driven workloads:
- Kafka consumer group lag → scale up processors when messages pile up
- SQS queue depth → scale workers when jobs accumulate
- Prometheus metric → scale on business metrics (orders per minute)
- Scale to zero — no events, zero pods (true serverless on Kubernetes)
Install
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --set prometheus.metricServer.enabled=true
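A couple of read-only checks confirm the install worked before any ScaledObjects are created; the deployment and APIService names below are the chart's defaults, so adjust if you overrode them:

kubectl get pods -n keda                                 # keda-operator and keda-operator-metrics-apiserver should be Running
kubectl get apiservice v1beta1.external.metrics.k8s.io   # KEDA's external metrics adapter; AVAILABLE should be True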
ScaledObject — Core Resource
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor        # Deployment to scale
  minReplicaCount: 0             # scale to zero when idle
  maxReplicaCount: 50
  pollingInterval: 15            # check scaler every 15s
  cooldownPeriod: 300            # wait 5m before scaling down
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.messaging.svc:9092
        consumerGroup: order-processors
        topic: orders
        lagThreshold: "100"      # one replica per 100 messages of lag
        offsetResetPolicy: latest
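Under the hood KEDA exposes the trigger as an external metric and manages a regular HPA, so the replica count works out to roughly ceil(total lag / lagThreshold), clamped between minReplicaCount and maxReplicaCount; with the settings above, a lag of 750 messages drives about 8 replicas. A quick way to inspect what KEDA computed (object names assume the ScaledObject above; the generated HPA is named keda-hpa-<scaledobject-name>):

kubectl get scaledobject order-processor -n production        # READY / ACTIVE columns
kubectl get hpa keda-hpa-order-processor -n production        # the HPA KEDA generates
kubectl describe scaledobject order-processor -n production   # events surface scaler errors, if any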
SQS Scaler
triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: keda-aws-credentials
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/orders
      queueLength: "50"          # scale up when > 50 messages
      awsRegion: eu-west-1
      scaleOnInFlight: "true"    # count in-flight messages too
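A quick sanity check of the numbers KEDA will read, using the same queue URL as above (needs AWS CLI credentials allowed to call sqs:GetQueueAttributes); ApproximateNumberOfMessagesNotVisible is what scaleOnInFlight adds to the count:

aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-west-1.amazonaws.com/123456789/orders \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible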
# TriggerAuthentication using IRSA (no hardcoded credentials)
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-aws-credentials
  namespace: production
spec:
  podIdentity:
    provider: aws-eks            # use pod's IAM role (IRSA)
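Where IRSA is not available, a TriggerAuthentication can pull static credentials from a Secret instead. A minimal sketch, assuming a Secret named aws-credentials with the usual key names (Secret name and keys are hypothetical, not part of the original note):

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-aws-secret-auth
  namespace: production
spec:
  secretTargetRef:
    - parameter: awsAccessKeyID        # parameter names the AWS scalers expect
      name: aws-credentials            # hypothetical Secret name
      key: AWS_ACCESS_KEY_ID
    - parameter: awsSecretAccessKey
      name: aws-credentials
      key: AWS_SECRET_ACCESS_KEY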
Prometheus Scaler
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: orders_per_second
      threshold: "10"            # one replica per 10 orders/second
      query: sum(rate(http_requests_total{job="order-service"}[2m]))
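Before relying on the query, confirm it returns a single sample (the Prometheus scaler expects one value, not a multi-series vector); a quick check against the same endpoint from inside the cluster:

curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{job="order-service"}[2m]))'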
Redis Scaler (list/stream)
triggers:
  - type: redis
    authenticationRef:
      name: redis-auth
    metadata:
      address: redis.cache.svc:6379
      listName: job-queue
      listLength: "20"           # scale when list has > 20 items
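The redis-auth reference above is not defined in this note; a minimal sketch of what it could look like, assuming the Redis password lives in a Secret named redis-credentials (hypothetical name):

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: redis-auth
  namespace: production
spec:
  secretTargetRef:
    - parameter: password              # the redis scaler reads its auth from this parameter
      name: redis-credentials          # hypothetical Secret name
      key: redis-password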
ScaledJob — for batch workloads
Scale Kubernetes Jobs (not Deployments) for batch processing. Each SQS message gets its own Job.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-resizer
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: resizer
            image: myregistry/image-resizer:latest
        restartPolicy: Never
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/image-jobs
        queueLength: "1"         # one job per message
        awsRegion: eu-west-1
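Unlike a ScaledObject, this spawns a fresh Job per pending message (up to maxReplicaCount) instead of resizing a Deployment. To watch what it produced, assuming generated Jobs carry the ScaledJob name as a prefix, which is the behaviour I have seen but worth confirming on your KEDA version:

kubectl get scaledjob image-resizer          # trigger state and max replica count
kubectl get jobs | grep image-resizer        # generated Jobs named image-resizer-<suffix>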
Scale-to-Zero Considerations
- Pods at zero = cold start on first event (image pull + init time)
- Use minReplicaCount: 1 if cold start latency is unacceptable
- Combine with Karpenter or Cluster Autoscaler — scaling pods and nodes
Common Failure Cases
ScaledObject created but HPA is not scaling — stuck at zero or minReplicas
Why: KEDA creates the HPA successfully but the metrics adapter cannot reach the external scaler endpoint (e.g., Kafka broker unreachable, wrong bootstrap server, auth misconfigured).
Detect: kubectl describe hpa <name> shows unable to get external metric; kubectl logs -n keda deployment/keda-operator shows connection refused or auth errors.
Fix: verify the scaler endpoint is reachable from within the cluster (kubectl exec into a pod and test connectivity), and confirm the TriggerAuthentication secret contains the correct credentials.
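One way to run that reachability check from inside the cluster; the image choice is arbitrary and the host/port are the ones from the Kafka example above:

kubectl run net-test -it --rm --restart=Never --image=busybox:1.36 -- \
  nc -zv kafka.messaging.svc 9092
kubectl logs -n keda deployment/keda-operator --tail=100 | grep -i "error\|refused"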
Scale-to-zero causes thundering herd when traffic returns
Why: with minReplicaCount: 0 the first batch of events must wait for pod cold start (image pull + init) before being processed, and if the queue filled up during idle time all events arrive simultaneously.
Detect: queue depth spikes to a large value after a quiet period; consumer lag metric shows a sudden large value before any pods are running.
Fix: set minReplicaCount: 1 for latency-sensitive workloads, or raise cooldownPeriod so the last replica survives expected idle gaps; note that cooldownPeriod only delays the final scale-down to zero, it does not pre-warm anything.
IRSA credentials on KEDA operator fail after cluster upgrade
Why: the KEDA operator pod was not restarted after the IRSA token was refreshed following a node group rotation, and the mounted token has expired.
Detect: SQS or DynamoDB scaler logs show ExpiredTokenException or InvalidClientTokenId.
Fix: restart the KEDA operator pod to force a fresh token mount: kubectl rollout restart deployment/keda-operator -n keda.
Kafka scaler reports zero lag but consumers are actually behind
Why: offsetResetPolicy: latest combined with a consumer group that has never committed an offset causes KEDA to compute lag against the latest offset rather than the committed position, reporting zero lag.
Detect: kafka-consumer-groups.sh --describe for the group shows CURRENT-OFFSET = - (no committed offset) while the partition LOG-END-OFFSET is much higher.
Fix: change offsetResetPolicy to earliest for new consumer groups, or ensure the consumer group commits an initial offset before KEDA begins evaluating lag.
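Spelled out, the describe command from the detect step, with the broker address and consumer group taken from the ScaledObject example above:

kafka-consumer-groups.sh --bootstrap-server kafka.messaging.svc:9092 \
  --describe --group order-processors
# CURRENT-OFFSET of "-" with a large LOG-END-OFFSET means no offset has been committed yet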
Connections
cloud-hub · cloud/kubernetes · cloud/aws-sqs-sns · cloud/cloud-monitoring · cloud/kubernetes-operators · llms/ae-hub
Open Questions
- What monitoring and alerting matter most when this is deployed in production?
- At what scale or workload does this approach hit its practical limits?
Related reading