FinOps and Cloud Cost Management
Engineering discipline for understanding, controlling, and optimising cloud spend.
Engineering discipline for understanding, controlling, and optimising cloud spend.
FinOps Principles
Three phases (FinOps Foundation):
Inform → visibility into what you're spending and why
Optimise → identify waste, rightsize, commit for discounts
Operate → embed cost accountability into engineering workflows
Key rules:
1. Tagging is the foundation — untagged resources are invisible costs
2. Cost follows architecture decisions, not budget decisions
3. Engineers who create resources should see what they cost
4. Optimise after you understand, not before
Tagging Strategy
# Enforce tags at account level via Service Control Policy
# scp-require-tags.json
SCP_POLICY = {
"Version": "2012-10-17",
"Statement": [{
"Sid": "RequireCostTags",
"Effect": "Deny",
"Action": [
"ec2:RunInstances",
"rds:CreateDBInstance",
"lambda:CreateFunction",
],
"Resource": "*",
"Condition": {
"Null": {
"aws:RequestTag/team": "true",
"aws:RequestTag/service": "true",
"aws:RequestTag/environment": "true",
}
}
}]
}
# Standard tag schema
REQUIRED_TAGS = {
"team": "platform", # who owns this resource
"service": "order-service", # logical service name
"environment": "prod", # prod | staging | dev
"cost-centre": "CC-1234", # finance mapping
}# CDK: apply tags to all resources in a stack using Aspects
from aws_cdk import Aspects, Tags
class RequiredTagsAspect(cdk.Aspect):
def __init__(self, tags: dict[str, str]) -> None:
self.tags = tags
def visit(self, node: constructs.IConstruct) -> None:
if isinstance(node, cdk.Resource):
for key, value in self.tags.items():
Tags.of(node).add(key, value)
# Apply to entire app
app = cdk.App()
stack = MyStack(app, "OrderStack")
Aspects.of(stack).add(RequiredTagsAspect({
"team": "platform",
"service": "order-service",
"environment": app.node.try_get_context("env") or "dev",
}))AWS Budgets and Alerts
import boto3
budgets = boto3.client("budgets")
def create_service_budget(account_id: str, service: str, monthly_limit_usd: float,
alert_email: str) -> None:
budgets.create_budget(
AccountId=account_id,
Budget={
"BudgetName": f"{service}-monthly",
"BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST",
"CostFilters": {
"TagKeyValue": [f"user:service${service}"]
},
},
NotificationsWithSubscribers=[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80, # alert at 80% of budget
"ThresholdType": "PERCENTAGE",
},
"Subscribers": [{"SubscriptionType": "EMAIL", "Address": alert_email}],
},
{
"Notification": {
"NotificationType": "FORECASTED",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100, # alert if forecasted to exceed
"ThresholdType": "PERCENTAGE",
},
"Subscribers": [{"SubscriptionType": "EMAIL", "Address": alert_email}],
},
],
)Cost Explorer — Programmatic Analysis
import boto3
from datetime import date, timedelta
ce = boto3.client("ce", region_name="us-east-1")
def get_cost_by_service(days: int = 30) -> dict[str, float]:
end = date.today().isoformat()
start = (date.today() - timedelta(days=days)).isoformat()
response = ce.get_cost_and_usage(
TimePeriod={"Start": start, "End": end},
Granularity="MONTHLY",
Metrics=["UnblendedCost"],
GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
Filter={
"Tags": {
"Key": "environment",
"Values": ["prod"],
}
},
)
costs = {}
for result in response["ResultsByTime"]:
for group in result["Groups"]:
service = group["Keys"][0]
cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
costs[service] = costs.get(service, 0) + cost
return dict(sorted(costs.items(), key=lambda x: x[1], reverse=True))
def get_rightsizing_recommendations() -> list[dict]:
response = ce.get_rightsizing_recommendation(
Service="AmazonEC2",
Configuration={
"RecommendationTarget": "CROSS_INSTANCE_FAMILY",
"BenefitsConsidered": True,
},
)
return [
{
"instance_id": r["CurrentInstance"]["ResourceId"],
"current_type": r["CurrentInstance"]["ResourceDetails"]["EC2ResourceDetails"]["InstanceType"],
"recommended_type": r["RightsizingOptions"][0]["ModifyRecommendationDetail"]["TargetInstances"][0]["ResourceDetails"]["EC2ResourceDetails"]["InstanceType"],
"monthly_savings": float(r["RightsizingOptions"][0]["ModifyRecommendationDetail"]["TargetInstances"][0]["EstimatedMonthlyGain"]),
}
for r in response["RightsizingRecommendations"]
if r["RightsizingOptions"]
]Savings Plans vs Reserved Instances
Savings Plans (preferred — more flexible):
Compute SP: applies to EC2 (any family/size/region/OS), Fargate, Lambda
Up to 66% discount vs on-demand
EC2 Instance SP: applies to a specific EC2 family in a region (higher discount ~72%)
SageMaker SP: applies to SageMaker ML workloads
Reserved Instances:
Standard RI: specific instance type, region, OS — up to 72% discount
Convertible RI: can exchange for different type — up to 54% discount
More rigid than Savings Plans; use when you have very stable, predictable workloads
Commitment terms: 1-year (lower discount) or 3-year (higher discount)
Payment: All upfront (best discount), partial upfront, no upfront
Strategy:
1. Run on-demand for 2-3 months to establish baseline
2. Use Compute Savings Plans for first 40-60% of stable spend
3. Cover remainder with on-demand + Spot where possible
4. Review quarterly — purchase in small tranches, not one big commitment
Spot Instances
# CDK: ECS service with Spot Fargate capacity provider
fargate_spot = ecs.FargateService(self, "SpotService",
cluster=cluster,
task_definition=task_def,
capacity_provider_strategies=[
ecs.CapacityProviderStrategy(
capacity_provider="FARGATE_SPOT",
weight=4, # 80% Spot
base=0,
),
ecs.CapacityProviderStrategy(
capacity_provider="FARGATE",
weight=1, # 20% on-demand fallback
base=1, # always keep at least 1 on-demand task
),
],
)
# Spot Fargate is ~70% cheaper than on-demand.
# Use for: batch processing, stateless web tiers with graceful shutdown (SIGTERM handler)
# Avoid for: databases, stateful workloads, tasks that can't tolerate 2-min interruption noticeCost Dashboard in CloudWatch
# Emit daily cost as a custom metric for dashboarding
import schedule
import time
def emit_daily_costs() -> None:
costs = get_cost_by_service(days=1)
cw = boto3.client("cloudwatch")
for service_name, cost in costs.items():
cw.put_metric_data(
Namespace="FinOps",
MetricData=[{
"MetricName": "DailyServiceCost",
"Dimensions": [
{"Name": "Service", "Value": service_name[:256]},
{"Name": "Environment", "Value": "prod"},
],
"Value": cost,
"Unit": "None", # dollars
}],
)
# Run daily at 8am UTC (Lambda schedule or cron job)S3 Cost Reduction
# Intelligent-Tiering: auto-moves objects based on access patterns
aws s3api put-bucket-intelligent-tiering-configuration \
--bucket my-bucket --id AllObjects \
--intelligent-tiering-configuration '{
"Id": "AllObjects", "Status": "Enabled",
"Tierings": [
{"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
{"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
]
}'
# Lifecycle rule — transition old objects to IA, expire old versions
# lifecycle.json: {"Rules": [{"ID": "TransitionToIA", "Status": "Enabled",
# "Filter": {}, "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
# "NoncurrentVersionExpiration": {"NoncurrentDays": 30}}]}
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
--lifecycle-configuration file://lifecycle.jsonData Transfer Cost Reduction
- Egress to internet: $0.09/GB — use CloudFront to cache and reduce origin calls
- Cross-AZ: $0.01/GB each way — keep services in the same AZ where latency allows
- NAT Gateway: $0.045/GB processed — route S3/DynamoDB through Gateway VPC Endpoints (free within VPC)
# Create S3 Gateway VPC endpoint (eliminates NAT Gateway charges for S3)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-xxxxxxxxx \
--service-name com.amazonaws.eu-west-1.s3 \
--route-table-ids rtb-xxxxxxxxxCommon Failure Cases
Cost allocation broken because tagging SCP was added after resources were created
Why: The SCP that enforces required tags (team, service) only applies at resource creation time; existing resources created before the SCP are not retroactively tagged and show as unallocated cost.
Detect: Cost Explorer shows a large "No tag" segment in the tag-breakdown view; aws resourcegroupstaggingapi get-resources --tag-filters returns many untagged resources.
Fix: Run AWS Config rule required-tags to identify untagged resources; use AWS Tag Editor to bulk-tag existing resources; enforce the SCP going forward and set a 30-day remediation deadline for untagged legacy resources.
Budget alert fires too late because it checks monthly actual spend, not forecasted spend
Why: An ACTUAL spend threshold of 80% only alerts after 80% of the budget is already consumed; a spend spike at month-end can blow the budget before the alert is actionable.
Detect: Budget alarm fires with less than 5 days left in the month, leaving no time to reduce spending before the threshold is exceeded.
Fix: Add a FORECASTED notification at 80% of the budget in addition to the ACTUAL 80% notification (as shown in the code above); the forecast alert gives advance warning when spend trajectory is off.
Savings Plan purchased against a service that is being migrated, leaving commitment stranded
Why: A 1-year EC2 Savings Plan was purchased against an instance family (c5) that the platform team plans to migrate to Graviton (c7g) within 6 months; EC2 Instance Savings Plans are family-specific and cannot cover Graviton instances.
Detect: Savings Plans Utilization drops below 70% after the Graviton migration; Cost Explorer shows the old SP covering near-zero usage.
Fix: Use Compute Savings Plans instead of EC2 Instance Savings Plans — Compute SPs apply to any instance family, any OS, and also cover Fargate and Lambda, providing flexibility across a migration.
Cost dashboard showing yesterday's data mistaken for real-time spend
Why: AWS Cost Explorer and the Billing console have a 24-hour data lag; teams building custom cost dashboards from the Cost Explorer API assume the latest data reflects current-day spend.
Detect: Cost metrics appear stable during an incident that is clearly generating unusual traffic; the dashboard shows correct totals only after a 24-hour delay.
Fix: Use CloudWatch EstimatedCharges metric (updated multiple times per day) for near-real-time billing alerts; document the 24-hour lag prominently on the cost dashboard so engineers do not mistake stale data for current state.
Connections
cloud-hub · cloud/aws-core · cloud/infrastructure-monitoring · cloud/github-actions · cloud/kubernetes
Open Questions
- What monitoring and alerting matter most when this is deployed in production?
- At what scale or workload does this approach hit its practical limits?
Related reading