Build a production CloudWatch dashboard for a running service
Instrument a running Lambda or ECS service with custom CloudWatch metrics, define alarms for error rate and p99 latency, add anomaly detection to catch gradual degradation, and build a CloudWatch dashboard that shows the full service picture in one view, all managed in Terraform.
Why this matters
CloudWatch is the monitoring system you already have in every AWS account, yet most engineers only use the basic metrics emitted automatically. Building a custom dashboard forces you to decide what failure looks like before it happens, which is the core discipline of SRE applied to a single service. An alarm without a dashboard is a page with no context.
Before you start
- A running AWS Lambda or ECS service (the Lambda from the previous exercise works perfectly)
- Terraform for infrastructure management
- AWS CLI for manual testing
- Basic understanding of what p50/p95/p99 latency means
Step-by-step guide
- 1
Emit custom metrics from your service
In the Lambda handler, use boto3 to call put_metric_data with a custom namespace (e.g., MyApp/Lambda). Emit three metrics: request_count (Count), error_count (Count), and response_time_ms (Milliseconds). Emitting one raw latency value per invocation is what lets CloudWatch compute percentiles like p99 later. Invoke the function 20 times with varied inputs and verify the metrics appear in CloudWatch Metrics within 2 minutes. A sketch of the handler follows.
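A minimal sketch of the handler, assuming a Python Lambda runtime (which bundles boto3); `do_work` is a hypothetical stand-in for your business logic, while the namespace and metric names come from this step:

```python
import time

import boto3

cloudwatch = boto3.client("cloudwatch")  # create once, outside the handler


def do_work(event):
    # Hypothetical placeholder for the function's real logic.
    return {"statusCode": 200}


def handler(event, context):
    start = time.time()
    error = 0
    try:
        return do_work(event)
    except Exception:
        error = 1
        raise
    finally:
        # One raw latency value per invocation lets CloudWatch
        # compute percentile statistics (p50/p95/p99) later.
        cloudwatch.put_metric_data(
            Namespace="MyApp/Lambda",
            MetricData=[
                {"MetricName": "request_count", "Unit": "Count", "Value": 1},
                {"MetricName": "error_count", "Unit": "Count", "Value": error},
                {
                    "MetricName": "response_time_ms",
                    "Unit": "Milliseconds",
                    "Value": (time.time() - start) * 1000,
                },
            ],
        )
```

The `finally` block runs on both success and failure, so every invocation emits a request count and a latency sample, and failed invocations also increment error_count before the exception propagates.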
- 2
Create CloudWatch alarms in Terraform
Write aws_cloudwatch_metric_alarm resources for two conditions: error rate above 5% (use a metric math expression such as 100 * error_count / request_count so the threshold of 5 reads as a percentage) and p99 latency above 500ms (percentile alarms use extended_statistic rather than statistic). Set alarm_actions to an SNS topic. Test by flipping the alarm manually with aws cloudwatch set-alarm-state. A sketch of both alarms follows.
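A sketch of both alarms, assuming an `aws_sns_topic.alerts` resource already exists; the alarm names are hypothetical. The error-rate alarm builds the math expression out of metric_query blocks:

```hcl
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "myapp-error-rate" # hypothetical name
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 5 # percent, because the expression multiplies by 100
  alarm_actions       = [aws_sns_topic.alerts.arn]
  treat_missing_data  = "notBreaching" # no traffic is not an error

  metric_query {
    id          = "rate"
    expression  = "100 * errors / requests"
    label       = "Error rate (%)"
    return_data = true
  }
  metric_query {
    id = "errors"
    metric {
      namespace   = "MyApp/Lambda"
      metric_name = "error_count"
      period      = 60
      stat        = "Sum"
    }
  }
  metric_query {
    id = "requests"
    metric {
      namespace   = "MyApp/Lambda"
      metric_name = "request_count"
      period      = 60
      stat        = "Sum"
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "p99_latency" {
  alarm_name          = "myapp-p99-latency" # hypothetical name
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  namespace           = "MyApp/Lambda"
  metric_name         = "response_time_ms"
  period              = 60
  extended_statistic  = "p99" # percentiles use extended_statistic, not statistic
  threshold           = 500
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
```

Once applied, `aws cloudwatch set-alarm-state --alarm-name myapp-error-rate --state-value ALARM --state-reason "manual test"` flips the alarm without waiting for real traffic, which should produce an SNS notification.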
- 3
Add anomaly detection
For the latency alarm, replace the static threshold with an anomaly detection band: set comparison_operator to GreaterThanUpperThreshold and point threshold_metric_id at a metric_query whose expression is ANOMALY_DETECTION_BAND. CloudWatch models the normal latency pattern and alerts when the metric leaves the band, which catches gradual performance degradation that a static threshold misses. The band becomes accurate after roughly 24 hours of data, so expect it to be loose at first. A sketch follows.
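A sketch under the same assumptions (existing SNS topic, hypothetical names). Note that in an anomaly detection alarm both metric_query blocks set return_data = true, and threshold replaces threshold with threshold_metric_id:

```hcl
resource "aws_cloudwatch_metric_alarm" "latency_anomaly" {
  alarm_name          = "myapp-latency-anomaly" # hypothetical name
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "band" # compare against the band, not a static number
  alarm_actions       = [aws_sns_topic.alerts.arn]

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(latency, 2)" # 2 = band width in standard deviations
    label       = "Expected p99 latency"
    return_data = true
  }
  metric_query {
    id          = "latency"
    return_data = true
    metric {
      namespace   = "MyApp/Lambda"
      metric_name = "response_time_ms"
      period      = 60
      stat        = "p99"
    }
  }
}
```

Widening the second argument to ANOMALY_DETECTION_BAND makes the alarm less sensitive; 2 standard deviations is a common starting point.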
- 4
Build the dashboard in Terraform
Write an aws_cloudwatch_dashboard resource with a JSON dashboard body. Add three widgets: a number widget for the current error rate, a line graph of p99 latency over the last hour, and a Logs Insights widget showing the last 10 error log lines. Use Terraform's jsonencode() to avoid manual JSON escaping. A sketch follows.
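A sketch of the dashboard resource; the dashboard name, region, and log group path are assumptions to replace with your own values:

```hcl
resource "aws_cloudwatch_dashboard" "service" {
  dashboard_name = "myapp-service" # hypothetical name

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 6
        height = 6
        properties = {
          view   = "singleValue"
          region = "us-east-1" # assumption
          title  = "Error rate (%)"
          period = 60
          metrics = [
            [{ expression = "100 * m2 / m1", label = "Error rate", id = "e1" }],
            ["MyApp/Lambda", "request_count", { id = "m1", stat = "Sum", visible = false }],
            ["MyApp/Lambda", "error_count", { id = "m2", stat = "Sum", visible = false }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 6
        y      = 0
        width  = 12
        height = 6
        properties = {
          view    = "timeSeries"
          region  = "us-east-1" # assumption
          title   = "p99 latency (ms)"
          period  = 60
          metrics = [["MyApp/Lambda", "response_time_ms", { stat = "p99" }]]
        }
      },
      {
        type   = "log"
        x      = 0
        y      = 6
        width  = 18
        height = 6
        properties = {
          view   = "table"
          region = "us-east-1" # assumption
          title  = "Recent errors"
          query  = "SOURCE '/aws/lambda/my-function' | filter @message like /ERROR/ | sort @timestamp desc | limit 10"
        }
      },
    ]
  })
}
```

Because jsonencode() builds the body from native HCL maps and lists, Terraform catches structural mistakes at plan time instead of CloudWatch rejecting a malformed string at apply time.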
- 5
Add a composite alarm
Create an aws_cloudwatch_composite_alarm that fires when both the error rate alarm AND the latency alarm are in ALARM state simultaneously. This reduces alert fatigue: a single noisy metric is informational, but two correlated bad metrics mean something is genuinely wrong. A sketch follows.
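A sketch referencing the two alarms from the earlier steps by their Terraform addresses (the resource and alarm names are the hypothetical ones used above):

```hcl
resource "aws_cloudwatch_composite_alarm" "service_degraded" {
  alarm_name    = "myapp-service-degraded" # hypothetical name
  alarm_actions = [aws_sns_topic.alerts.arn]

  # Fires only when both underlying alarms are in ALARM at the same time.
  alarm_rule = join(" AND ", [
    "ALARM(\"${aws_cloudwatch_metric_alarm.error_rate.alarm_name}\")",
    "ALARM(\"${aws_cloudwatch_metric_alarm.latency_anomaly.alarm_name}\")",
  ])
}
```

You can route alarm_actions for the composite alarm to your paging topic and leave the individual alarms pointing at a low-urgency channel, so correlated failures page a human while single-metric noise does not.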
- 6
Test the full alert path
Deliberately break the function (raise an exception for every request). Verify the chain end to end: metrics spike, the alarms transition to ALARM, SNS sends a notification, and the composite alarm fires. Fix the function and verify the alarms recover to OK. The test is complete when you can describe the full path from bad request to recovery. The commands below sketch the verification loop.
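One possible verification loop from the CLI, with hypothetical function and alarm names carried over from the sketches above:

```bash
# Drive traffic at the deliberately broken function.
for i in $(seq 1 20); do
  aws lambda invoke --function-name my-function \
    --payload '{"test": true}' \
    --cli-binary-format raw-in-base64-out /dev/null
done

# Watch the metric alarms flip to ALARM (this takes a few evaluation periods).
aws cloudwatch describe-alarms \
  --alarm-names myapp-error-rate myapp-latency-anomaly \
  --query 'MetricAlarms[].[AlarmName,StateValue]' --output table

# Composite alarms need an explicit alarm type to show up.
aws cloudwatch describe-alarms \
  --alarm-names myapp-service-degraded --alarm-types CompositeAlarm \
  --query 'CompositeAlarms[].[AlarmName,StateValue]' --output table
```

After fixing and redeploying the function, rerun the traffic loop and the same describe-alarms commands to confirm every alarm returns to OK.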