Build a production CloudWatch dashboard for a running service
Instrument a running Lambda or ECS service with custom CloudWatch metrics, define alarms for error rate and p99 latency, add anomaly detection to catch gradual degradation, and build a CloudWatch dashboard that shows the full service picture in one view, all managed in Terraform.
Why this matters
CloudWatch is the monitoring system you already have in every AWS account, yet most engineers only use the basic metrics emitted automatically. Building a custom dashboard forces you to decide what failure looks like before it happens, which is the core discipline of SRE applied to a single service. An alarm without a dashboard is a page with no context.
Before you start
- A running AWS Lambda or ECS service (the Lambda from the previous exercise works perfectly)
- Terraform for infrastructure management
- AWS CLI for manual testing
- Basic understanding of what p50/p95/p99 latency means
Step-by-step guide
- 1
Emit custom metrics from your service
In the Lambda handler, use boto3 to call put_metric_data with a custom namespace (e.g., MyApp/Lambda). Emit three metrics: request_count (Count), error_count (Count), and response_time_ms (Milliseconds). Emitting one raw latency value per invocation is what lets CloudWatch compute percentiles like p99 later. Invoke the function 20 times with varied inputs and verify the metrics appear in CloudWatch Metrics within 2 minutes. A sketch of the handler follows.
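A minimal sketch of the handler, assuming a Python Lambda runtime (which bundles boto3); `do_work` is a hypothetical stand-in for your business logic, while the namespace and metric names come from this step:

```python
import time

import boto3

cloudwatch = boto3.client("cloudwatch")  # create once, outside the handler


def do_work(event):
    # Hypothetical placeholder for the function's real logic.
    return {"statusCode": 200}


def handler(event, context):
    start = time.time()
    error = 0
    try:
        return do_work(event)
    except Exception:
        error = 1
        raise
    finally:
        # One raw latency value per invocation lets CloudWatch
        # compute percentile statistics (p50/p95/p99) later.
        cloudwatch.put_metric_data(
            Namespace="MyApp/Lambda",
            MetricData=[
                {"MetricName": "request_count", "Unit": "Count", "Value": 1},
                {"MetricName": "error_count", "Unit": "Count", "Value": error},
                {
                    "MetricName": "response_time_ms",
                    "Unit": "Milliseconds",
                    "Value": (time.time() - start) * 1000,
                },
            ],
        )
```

The `finally` block runs on both success and failure, so every invocation emits a request count and a latency sample, and failed invocations also increment error_count before the exception propagates.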
- 2
Create CloudWatch alarms in Terraform
Write aws_cloudwatch_metric_alarm resources for two conditions: error rate above 5% (use a metric math expression such as 100 * error_count / request_count so the threshold of 5 reads as a percentage) and p99 latency above 500ms (percentile alarms use extended_statistic rather than statistic). Set alarm_actions to an SNS topic. Test by flipping the alarm manually with aws cloudwatch set-alarm-state. A sketch of both alarms follows.
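A sketch of both alarms, assuming an `aws_sns_topic.alerts` resource already exists; the alarm names are hypothetical. The error-rate alarm builds the math expression out of metric_query blocks:

```hcl
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "myapp-error-rate" # hypothetical name
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 5 # percent, because the expression multiplies by 100
  alarm_actions       = [aws_sns_topic.alerts.arn]
  treat_missing_data  = "notBreaching" # no traffic is not an error

  metric_query {
    id          = "rate"
    expression  = "100 * errors / requests"
    label       = "Error rate (%)"
    return_data = true
  }
  metric_query {
    id = "errors"
    metric {
      namespace   = "MyApp/Lambda"
      metric_name = "error_count"
      period      = 60
      stat        = "Sum"
    }
  }
  metric_query {
    id = "requests"
    metric {
      namespace   = "MyApp/Lambda"
      metric_name = "request_count"
      period      = 60
      stat        = "Sum"
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "p99_latency" {
  alarm_name          = "myapp-p99-latency" # hypothetical name
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  namespace           = "MyApp/Lambda"
  metric_name         = "response_time_ms"
  period              = 60
  extended_statistic  = "p99" # percentiles use extended_statistic, not statistic
  threshold           = 500
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
```

Once applied, `aws cloudwatch set-alarm-state --alarm-name myapp-error-rate --state-value ALARM --state-reason "manual test"` flips the alarm without waiting for real traffic, which should produce an SNS notification.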
- 3
Add anomaly detection
For the latency alarm, replace the static threshold with an anomaly detection band: set comparison_operator to GreaterThanUpperThreshold and point threshold_metric_id at a metric_query whose expression is ANOMALY_DETECTION_BAND. CloudWatch models the normal latency pattern and alerts when the metric leaves the band, which catches gradual performance degradation that a static threshold misses. The band becomes accurate after roughly 24 hours of data, so expect it to be loose at first. A sketch follows.
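A sketch under the same assumptions (existing SNS topic, hypothetical names). Note that in an anomaly detection alarm both metric_query blocks set return_data = true, and threshold replaces threshold with threshold_metric_id:

```hcl
resource "aws_cloudwatch_metric_alarm" "latency_anomaly" {
  alarm_name          = "myapp-latency-anomaly" # hypothetical name
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "band" # compare against the band, not a static number
  alarm_actions       = [aws_sns_topic.alerts.arn]

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(latency, 2)" # 2 = band width in standard deviations
    label       = "Expected p99 latency"
    return_data = true
  }
  metric_query {
    id          = "latency"
    return_data = true
    metric {
      namespace   = "MyApp/Lambda"
      metric_name = "response_time_ms"
      period      = 60
      stat        = "p99"
    }
  }
}
```

Widening the second argument to ANOMALY_DETECTION_BAND makes the alarm less sensitive; 2 standard deviations is a common starting point.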
- 4
Build the dashboard in Terraform
Write an aws_cloudwatch_dashboard resource with a JSON dashboard body. Add three widgets: a number widget for the current error rate, a line graph of p99 latency over the last hour, and a Logs Insights widget showing the last 10 error log lines. Use Terraform's jsonencode() to avoid manual JSON escaping. A sketch follows.
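A sketch of the dashboard resource; the dashboard name, region, and log group path are assumptions to replace with your own values:

```hcl
resource "aws_cloudwatch_dashboard" "service" {
  dashboard_name = "myapp-service" # hypothetical name

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 6
        height = 6
        properties = {
          view   = "singleValue"
          region = "us-east-1" # assumption
          title  = "Error rate (%)"
          period = 60
          metrics = [
            [{ expression = "100 * m2 / m1", label = "Error rate", id = "e1" }],
            ["MyApp/Lambda", "request_count", { id = "m1", stat = "Sum", visible = false }],
            ["MyApp/Lambda", "error_count", { id = "m2", stat = "Sum", visible = false }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 6
        y      = 0
        width  = 12
        height = 6
        properties = {
          view    = "timeSeries"
          region  = "us-east-1" # assumption
          title   = "p99 latency (ms)"
          period  = 60
          metrics = [["MyApp/Lambda", "response_time_ms", { stat = "p99" }]]
        }
      },
      {
        type   = "log"
        x      = 0
        y      = 6
        width  = 18
        height = 6
        properties = {
          view   = "table"
          region = "us-east-1" # assumption
          title  = "Recent errors"
          query  = "SOURCE '/aws/lambda/my-function' | filter @message like /ERROR/ | sort @timestamp desc | limit 10"
        }
      },
    ]
  })
}
```

Because jsonencode() builds the body from native HCL maps and lists, Terraform catches structural mistakes at plan time instead of CloudWatch rejecting a malformed string at apply time.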
- 5
Add a composite alarm
Create an aws_cloudwatch_composite_alarm that fires when both the error rate alarm AND the latency alarm are in ALARM state simultaneously. This reduces alert fatigue: a single noisy metric is informational, but two correlated bad metrics mean something is genuinely wrong. A sketch follows.
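A sketch referencing the two alarms from the earlier steps by their Terraform addresses (the resource and alarm names are the hypothetical ones used above):

```hcl
resource "aws_cloudwatch_composite_alarm" "service_degraded" {
  alarm_name    = "myapp-service-degraded" # hypothetical name
  alarm_actions = [aws_sns_topic.alerts.arn]

  # Fires only when both underlying alarms are in ALARM at the same time.
  alarm_rule = join(" AND ", [
    "ALARM(\"${aws_cloudwatch_metric_alarm.error_rate.alarm_name}\")",
    "ALARM(\"${aws_cloudwatch_metric_alarm.latency_anomaly.alarm_name}\")",
  ])
}
```

You can route alarm_actions for the composite alarm to your paging topic and leave the individual alarms pointing at a low-urgency channel, so correlated failures page a human while single-metric noise does not.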
- 6
Test the full alert path
Deliberately break the function (raise an exception for every request). Verify the chain end to end: metrics spike, the alarms transition to ALARM, SNS sends a notification, and the composite alarm fires. Fix the function and verify the alarms recover to OK. The test is complete when you can describe the full path from bad request to recovery. The commands below sketch the verification loop.
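One possible verification loop from the CLI, with hypothetical function and alarm names carried over from the sketches above:

```bash
# Drive traffic at the deliberately broken function.
for i in $(seq 1 20); do
  aws lambda invoke --function-name my-function \
    --payload '{"test": true}' \
    --cli-binary-format raw-in-base64-out /dev/null
done

# Watch the metric alarms flip to ALARM (this takes a few evaluation periods).
aws cloudwatch describe-alarms \
  --alarm-names myapp-error-rate myapp-latency-anomaly \
  --query 'MetricAlarms[].[AlarmName,StateValue]' --output table

# Composite alarms need an explicit alarm type to show up.
aws cloudwatch describe-alarms \
  --alarm-names myapp-service-degraded --alarm-types CompositeAlarm \
  --query 'CompositeAlarms[].[AlarmName,StateValue]' --output table
```

After fixing and redeploying the function, rerun the traffic loop and the same describe-alarms commands to confirm every alarm returns to OK.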