Intermediate · Cloud Engineer

Build a production CloudWatch dashboard for a running service

Instrument a running Lambda or ECS service with custom CloudWatch metrics, define alarms for error rate and p99 latency, add anomaly detection to catch gradual degradation, and build a CloudWatch dashboard that shows the full service picture in one view, all managed in Terraform.

Why this matters

CloudWatch is the monitoring system you already have in every AWS account, yet most engineers use only the default metrics AWS emits automatically. Building a custom dashboard forces you to decide what failure looks like before it happens, which is the core discipline of SRE applied to a single service. An alarm without a dashboard is a page with no context.

Step-by-step guide

  1. Emit custom metrics from your service

    In the Lambda handler, use boto3 to call put_metric_data with a custom namespace (e.g., MyApp/Lambda). Emit request_count (Count), error_count (Count), and response_time_ms (Milliseconds). Invoke the function 20 times with varied inputs and verify the metrics appear in CloudWatch Metrics within 2 minutes. A sketch of the handler follows.
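
    A minimal sketch of the instrumentation, assuming Python; do_work() is a hypothetical stand-in for your existing business logic:

        import time

        import boto3

        cloudwatch = boto3.client("cloudwatch")

        def handler(event, context):
            """Wrap the real work and emit one data point per metric per invocation."""
            start = time.monotonic()
            error = 0
            try:
                return do_work(event)  # hypothetical: your existing business logic
            except Exception:
                error = 1
                raise
            finally:
                cloudwatch.put_metric_data(
                    Namespace="MyApp/Lambda",
                    MetricData=[
                        {"MetricName": "request_count", "Value": 1, "Unit": "Count"},
                        {"MetricName": "error_count", "Value": error, "Unit": "Count"},
                        {
                            "MetricName": "response_time_ms",
                            "Value": (time.monotonic() - start) * 1000,
                            "Unit": "Milliseconds",
                        },
                    ],
                )

    Note that put_metric_data is a synchronous API call on the request path; at higher traffic, the CloudWatch Embedded Metric Format (emitting metrics as structured log lines) avoids that per-invocation latency.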

  2. Create CloudWatch alarms in Terraform

    Write aws_cloudwatch_metric_alarm resources for an error rate above 5% (a metric math expression: error_count / request_count) and for p99 latency above 500ms (extended_statistic = "p99"). Set alarm_actions to an SNS topic. Test by forcing the state with aws cloudwatch set-alarm-state. See the sketch below.
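
    A sketch of both alarms, assuming the MyApp/Lambda namespace from step 1 and a hypothetical aws_sns_topic.alerts already defined:

        resource "aws_cloudwatch_metric_alarm" "error_rate" {
          alarm_name          = "myapp-error-rate"
          comparison_operator = "GreaterThanThreshold"
          evaluation_periods  = 3
          threshold           = 5 # percent, because the expression is scaled by 100
          alarm_actions       = [aws_sns_topic.alerts.arn]

          metric_query {
            id          = "rate"
            expression  = "100 * errors / requests"
            label       = "Error rate (%)"
            return_data = true
          }

          metric_query {
            id = "errors"
            metric {
              metric_name = "error_count"
              namespace   = "MyApp/Lambda"
              period      = 60
              stat        = "Sum"
            }
          }

          metric_query {
            id = "requests"
            metric {
              metric_name = "request_count"
              namespace   = "MyApp/Lambda"
              period      = 60
              stat        = "Sum"
            }
          }
        }

        resource "aws_cloudwatch_metric_alarm" "p99_latency" {
          alarm_name          = "myapp-p99-latency"
          comparison_operator = "GreaterThanThreshold"
          evaluation_periods  = 3
          threshold           = 500
          metric_name         = "response_time_ms"
          namespace           = "MyApp/Lambda"
          period              = 60
          extended_statistic  = "p99" # percentile stats use this, not statistic
          alarm_actions       = [aws_sns_topic.alerts.arn]
        }

    To exercise delivery without breaking anything, force the state once: aws cloudwatch set-alarm-state --alarm-name myapp-error-rate --state-value ALARM --state-reason "manual test". CloudWatch flips it back on the next evaluation.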

  3. Add anomaly detection

    For the latency alarm, change the comparison to GreaterThanUpperThreshold against an ANOMALY_DETECTION_BAND expression. CloudWatch models the normal latency pattern and alerts when the metric deviates significantly, which catches the gradual performance degradation a static threshold misses. Expect the band to become accurate only after about 24 hours of data. A sketch follows.
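
    A sketch of the anomaly-detection variant, reusing the same metric; the 2 in ANOMALY_DETECTION_BAND is the band width in standard deviations:

        resource "aws_cloudwatch_metric_alarm" "latency_anomaly" {
          alarm_name          = "myapp-p99-latency-anomaly"
          comparison_operator = "GreaterThanUpperThreshold"
          evaluation_periods  = 3
          threshold_metric_id = "band" # compare against the band, not a static number
          alarm_actions       = [aws_sns_topic.alerts.arn]

          metric_query {
            id          = "band"
            expression  = "ANOMALY_DETECTION_BAND(latency, 2)"
            label       = "Expected p99 latency"
            return_data = true
          }

          metric_query {
            id          = "latency"
            return_data = true
            metric {
              metric_name = "response_time_ms"
              namespace   = "MyApp/Lambda"
              period      = 60
              stat        = "p99"
            }
          }
        }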

  4. Build the dashboard in Terraform

    Write an aws_cloudwatch_dashboard resource with a JSON dashboard body. Add three widgets: a number widget for the current error rate, a line graph of p99 latency over the last hour, and a Logs Insights widget showing the last 10 error log lines. Use Terraform's jsonencode() to avoid escaping JSON by hand. A sketch follows.
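
    A sketch of the dashboard, assuming region us-east-1 and a hypothetical log group /aws/lambda/my-function; adjust both to your deployment:

        resource "aws_cloudwatch_dashboard" "service" {
          dashboard_name = "myapp-service"

          dashboard_body = jsonencode({
            widgets = [
              {
                # Single number: current error rate, same math as the alarm.
                type   = "metric"
                x      = 0
                y      = 0
                width  = 6
                height = 6
                properties = {
                  view   = "singleValue"
                  region = "us-east-1"
                  metrics = [
                    [{ expression = "100 * errors / requests", label = "Error rate (%)", id = "rate" }],
                    ["MyApp/Lambda", "error_count", { id = "errors", stat = "Sum", visible = false }],
                    ["MyApp/Lambda", "request_count", { id = "requests", stat = "Sum", visible = false }],
                  ]
                }
              },
              {
                # Line graph: p99 latency over the last hour.
                type   = "metric"
                x      = 6
                y      = 0
                width  = 18
                height = 6
                properties = {
                  view    = "timeSeries"
                  region  = "us-east-1"
                  period  = 60
                  start   = "-PT1H"
                  metrics = [["MyApp/Lambda", "response_time_ms", { stat = "p99" }]]
                }
              },
              {
                # Logs Insights: the ten most recent error lines.
                type   = "log"
                x      = 0
                y      = 6
                width  = 24
                height = 6
                properties = {
                  region = "us-east-1"
                  title  = "Recent errors"
                  query  = "SOURCE '/aws/lambda/my-function' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 10"
                }
              },
            ]
          })
        }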

  5. Add a composite alarm

    Create an aws_cloudwatch_composite_alarm that fires when both the error rate alarm AND the latency alarm are in ALARM state simultaneously. This reduces alert fatigue: a single noisy metric is informational, but two correlated bad metrics mean something is genuinely wrong. A sketch follows.
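
    A sketch, assuming the alarm resource names from the step 2 sketch:

        resource "aws_cloudwatch_composite_alarm" "service_degraded" {
          alarm_name    = "myapp-service-degraded"
          alarm_actions = [aws_sns_topic.alerts.arn]

          # Fires only when both children are in ALARM at the same time.
          alarm_rule = "ALARM(\"${aws_cloudwatch_metric_alarm.error_rate.alarm_name}\") AND ALARM(\"${aws_cloudwatch_metric_alarm.p99_latency.alarm_name}\")"
        }

    A common refinement is to move alarm_actions off the child alarms so only the composite pages; the children then serve as dashboard context rather than independent alerts.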

  6. Test the full alert path

    Deliberately break the function (raise an exception for every request). Verify the chain: metrics spike, both alarms transition to ALARM, SNS delivers a notification, and the composite alarm fires. Fix the function and confirm the alarms recover. The test is complete when you can describe the full path from bad request to recovery; the commands below help drive it.
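
    A few CLI commands to drive the test, assuming the function is named my-function and the alarm names share the myapp- prefix used above:

        # Generate traffic against the deliberately broken function.
        for i in $(seq 1 20); do
          aws lambda invoke --function-name my-function /tmp/out.json >/dev/null
        done

        # Watch metric and composite alarms change state.
        aws cloudwatch describe-alarms \
          --alarm-name-prefix "myapp" \
          --alarm-types MetricAlarm CompositeAlarm \
          --query "[MetricAlarms, CompositeAlarms][] | [].[AlarmName, StateValue]" \
          --output text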
