Debug: Cloud Cost Spike

Runbook for diagnosing unexpected cloud bill spikes and identifying the source of runaway spend.

Symptom: Cloud bill significantly higher than expected. Cost spike appeared this month. Usage looks normal but spend is not.


Quick Diagnosis

Pattern → Likely cause
Spike on a specific service → New resource created, autoscaling triggered, or increased data transfer
Gradual increase over weeks → Resource created and forgotten, or traffic growing without a scaling policy
Spike on data transfer → Cross-AZ traffic, NAT Gateway processing, or unexpected egress
Spike after a deploy → New feature making expensive API calls or spawning resources
Spike on storage → Log retention, S3 versioning accumulating, or snapshots not expiring

Likely Causes (ranked by frequency)

  1. NAT Gateway data processing — private subnet traffic not using VPC endpoints
  2. Cross-AZ data transfer — services in different AZs calling each other at high frequency
  3. Autoscaling triggered and not scaled back — peak load scaled up, minimum not reset
  4. Forgotten resource — dev environment, load test cluster, or snapshot left running
  5. LLM API costs — token usage growing without bound because no cost controls are in place

First Checks (fastest signal first)

  • Open Cost Explorer grouped by service — which service changed most vs the previous period? (A scripted version of this comparison is sketched after this list.)
  • Filter to the spike date — what was created, scaled, or changed on that day?
  • Check data transfer costs specifically — NAT Gateway processing and cross-AZ transfer are the most common hidden costs
  • Check for resources with no tags — untagged resources are often forgotten ones
  • Check LLM API usage if applicable — token costs grow non-linearly with conversation length, because each turn resends the full history
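
A minimal sketch of that first check in Python with boto3, assuming credentials with Cost Explorer (ce:GetCostAndUsage) access; the dates are placeholders for the previous and current billing periods:

```python
# Rank services by month-over-month cost change using Cost Explorer.
import boto3

ce = boto3.client("ce")

def cost_by_service(start: str, end: str) -> dict:
    """Return {service: unblended cost in USD} for the given period."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return {
        g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
        for r in resp["ResultsByTime"]
        for g in r["Groups"]
    }

last = cost_by_service("2024-04-01", "2024-05-01")   # placeholder dates
this = cost_by_service("2024-05-01", "2024-06-01")

# Sort by absolute month-over-month change so the spiking service floats to the top.
deltas = sorted(
    ((svc, this.get(svc, 0.0) - last.get(svc, 0.0)) for svc in {*last, *this}),
    key=lambda kv: abs(kv[1]),
    reverse=True,
)
for svc, delta in deltas[:10]:
    print(f"{delta:+10.2f} USD  {svc}")
```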

Signal example: AWS bill up $800 this month — Cost Explorer shows EC2 data transfer $600 higher; a new microservice is calling an internal API across AZs on every request; 50M calls/month at $0.01/GB adds up fast.
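
Back-of-envelope arithmetic for that example, assuming roughly 600 KB transferred per call; the $0.01/GB cross-AZ rate is charged on each side of the hop, so the effective rate is $0.02/GB:

```python
# Rough arithmetic for the signal example; the payload size is an assumption.
calls_per_month = 50_000_000
gb_per_call = 600 / 1024**2        # assume ~600 KB of request + response per call
usd_per_gb = 0.02                  # $0.01/GB out + $0.01/GB in across AZs
print(f"~${calls_per_month * gb_per_call * usd_per_gb:,.0f}/month")  # ~$572
```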


Drill Paths

Suspect → Go to
NAT Gateway and VPC endpoint costs → cloud/vpc-design-patterns
Autoscaling and right-sizing → cloud/finops-cost-management
FinOps tooling and tagging strategy → cloud/finops-cost-management
LLM token cost monitoring → observability/langfuse
Identifying idle resources → cloud/finops-cost-management

Fix Patterns

  • Add VPC endpoints for S3 and DynamoDB immediately — gateway endpoints for these two services are free and eliminate NAT Gateway processing charges for that traffic
  • Set a resource tagging policy — every resource must carry environment, team, and service tags; untagged means under review
  • Set the autoscaling minimum back to baseline after load events — scaling up is automatic, scaling down is not always
  • Add budget alerts at 80% and 100% of expected spend — catch spikes before the bill arrives (a sketch follows this list)
  • For LLM costs: set hard token limits per request and per user session; monitor cost per conversation in Langfuse
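
One way to wire up the budget-alert pattern, sketched with boto3's AWS Budgets API; the account ID, budget amount, and notification address are placeholders:

```python
# Create a monthly cost budget that emails at 80% and 100% of expected spend.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",                             # placeholder account ID
    Budget={
        "BudgetName": "monthly-spend-guardrail",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"}, # expected monthly spend
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,                   # percent of the limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
        for threshold in (80.0, 100.0)
    ],
)
```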

When This Is Not the Issue

If costs are rising but no specific service is spiking:

  • Usage has genuinely grown — the system is working as designed but traffic has increased
  • Check whether Savings Plans or Reserved Instances have expired — same usage, higher on-demand cost (a quick check is sketched below)
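
A quick way to flag expired Reserved Instances with boto3 (Savings Plans expiry lives in the separate savingsplans API and is not shown here):

```python
# List Reserved Instances that are no longer active, e.g. retired after term end.
import boto3

ec2 = boto3.client("ec2")

for ri in ec2.describe_reserved_instances()["ReservedInstances"]:
    if ri["State"] != "active":
        print(
            f'{ri["ReservedInstancesId"]}: {ri["InstanceType"]} '
            f'x{ri["InstanceCount"]}, state={ri["State"]}, ended {ri["End"]:%Y-%m-%d}'
        )
```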

Pivot to cloud/finops-cost-management for a systematic approach to ongoing cost governance rather than reactive spike investigation.


Connections

cloud/finops-cost-management · cloud/vpc-design-patterns · observability/langfuse

Open Questions

  • What has changed since this synthesis was written that would alter the conclusions?
  • What evidence would cause you to revise the key recommendation here?