Infrastructure Testing
Testing infrastructure-as-code before it reaches production: static validation, compliance scanning, and live end-to-end infrastructure tests. Broken infrastructure tends to cause failures that are harder to diagnose than broken application code.
Testing Layers for IaC
Layer 1 — Static (< 30 seconds):
- terraform validate: syntax and provider schema
- tflint: best practices and deprecated APIs
- Checkov: security and compliance rules
Layer 2 — Plan analysis (< 2 minutes):
- terraform plan in CI, checked against Sentinel or OPA policies
- Detect unexpected resource destruction
- Cost estimation (Infracost)
Layer 3 — Live tests (5-30 minutes, destroy after):
- Terratest: spin up real resources, assert behaviour, teardown
- Kitchen-Terraform (legacy Chef ecosystem)
- Pulumi Automation API tests
Layer 4 — Contract tests:
- Output variable contracts (consumers of infra modules)
- DNS resolution, TLS cert validity, health endpoint
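The "detect unexpected resource destruction" check in Layer 2 can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for Sentinel or OPA: it assumes the plan has been rendered to JSON with terraform show -json, and the function name and the allow-list parameter are hypothetical.

```python
from typing import Iterable


def find_destructive_changes(plan: dict, allowed: Iterable[str] = ()) -> list[str]:
    """Return addresses of resources the plan would delete or replace.

    `plan` is the parsed JSON of a saved plan. `allowed` is a hypothetical
    allow-list of addresses whose destruction is expected in this change.
    """
    allowed_set = set(allowed)
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        address = resource_change.get("address", "<unknown>")
        # A replacement shows up as a delete paired with a create,
        # so checking for "delete" covers both destroy and replace.
        if "delete" in actions and address not in allowed_set:
            flagged.append(address)
    return flagged
```

A CI step could load the plan JSON, call find_destructive_changes, and fail the job when the returned list is non-empty.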
Terraform Validation in CI
# .github/workflows/infra-validate.yaml
name: Infrastructure Validation
on:
  pull_request:
    paths: ["infra/**"]
jobs:
  validate:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra/
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.0"
      - name: Terraform init
        run: terraform init -backend=false
      - name: Terraform validate
        run: terraform validate
      - name: Set up tflint
        uses: terraform-linters/setup-tflint@v4
      - name: tflint
        run: tflint --config .tflint.hcl
      - name: Checkov security scan
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infra/
          framework: terraform
          soft_fail: false
          output_format: github_failed_only
      - name: Set up Infracost
        uses: infracost/actions/setup@v3
        with:
          # Assumes the Infracost API key is stored as a repository secret with this name.
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: Infracost estimate
        run: infracost breakdown --path . --format json --out-file /tmp/infracost.json
      - uses: infracost/actions/comment@v3
        with:
          path: /tmp/infracost.json
          behavior: update

Checkov Custom Policies
# checks/custom/ensure_s3_encryption.py
from checkov.common.models.enums import CheckResult, CheckCategories
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck


class S3ServerSideEncryptionEnabled(BaseResourceCheck):
    def __init__(self):
        name = "Ensure S3 bucket has server-side encryption"
        id = "CUSTOM_S3_001"
        supported_resources = ["aws_s3_bucket"]
        categories = [CheckCategories.ENCRYPTION]
        super().__init__(name=name, id=id, categories=categories,
                         supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        # Only inspects the inline block on aws_s3_bucket. A bucket encrypted through a
        # separate aws_s3_bucket_server_side_encryption_configuration resource
        # (AWS provider v4 and later) will fail this check.
        encryption = conf.get("server_side_encryption_configuration")
        if encryption:
            return CheckResult.PASSED
        return CheckResult.FAILED


# Instantiating the class registers the check when Checkov imports this module.
check = S3ServerSideEncryptionEnabled()

Terratest: Go Integration Tests
// infra/test/vpc_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestVPCModule(t *testing.T) {
	t.Parallel()

	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/vpc",
		Vars: map[string]interface{}{
			"vpc_cidr":    "10.0.0.0/16",
			"environment": "test",
			"region":      "eu-west-1",
		},
	}

	// Register cleanup before applying so a partially failed apply is still torn down.
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	vpcID := terraform.Output(t, terraformOptions, "vpc_id")
	require.NotEmpty(t, vpcID)

	// Verify VPC exists in AWS
	vpc := aws.GetVpcById(t, vpcID, "eu-west-1")
	assert.Equal(t, "10.0.0.0/16", aws.GetCidrBlockOfVpc(t, vpc))

	// Verify subnets created
	privateSubnets := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
	assert.Len(t, privateSubnets, 3) // one per AZ
	publicSubnets := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
	assert.Len(t, publicSubnets, 3)
}
Python Pulumi Automation API Tests
# tests/infra/test_lambda_stack.py
import boto3
import pytest
from pulumi import automation as auto

REGION = "eu-west-1"


@pytest.fixture(scope="module")
def stack():
    stack = auto.create_or_select_stack(
        stack_name="test",
        work_dir="infra/lambda",
        opts=auto.LocalWorkspaceOptions(env_vars={
            "AWS_REGION": REGION,
            "PULUMI_CONFIG_PASSPHRASE": "test",
        }),
    )
    try:
        stack.up(on_output=print)
        yield stack
    finally:
        # Runs even if `up` fails partway, so partial deployments are torn down.
        stack.destroy(on_output=print)
        stack.workspace.remove_stack("test")


@pytest.fixture(scope="module")
def lambda_client():
    # Assumed fixture: a boto3 Lambda client using ambient AWS credentials.
    return boto3.client("lambda", region_name=REGION)


def test_lambda_function_exists(stack):
    outputs = stack.outputs()
    function_name = outputs["function_name"].value
    assert function_name.startswith("myapp-")


def test_lambda_has_correct_runtime(stack, lambda_client):
    outputs = stack.outputs()
    config = lambda_client.get_function_configuration(
        FunctionName=outputs["function_name"].value
    )
    assert config["Runtime"] == "python3.12"
    assert config["Timeout"] == 30
    assert config["MemorySize"] == 512

OPA Policy for Terraform Plans
# policies/required_tags.rego
package terraform.analysis

import future.keywords.contains
import future.keywords.if

deny contains msg if {
    resource := input.resource_changes[_]
    resource.change.actions[_] == "create"
    resource.type == "aws_instance"
    not resource.change.after.tags.Environment
    msg := sprintf("EC2 instance %v missing required Environment tag", [resource.address])
}

deny contains msg if {
    resource := input.resource_changes[_]
    resource.change.actions[_] == "create"
    resource.type == "aws_s3_bucket"
    not resource.change.after.tags.Owner
    msg := sprintf("S3 bucket %v missing required Owner tag", [resource.address])
}
Common Failure Cases
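Terratest resources left orphaned after an interrupted test run
Why: deferred calls run when a test fails or panics, but not when the process is killed. A cancelled CI job, or the go test timeout (10 minutes by default), ends the process before defer terraform.Destroy runs. Registering the defer after InitAndApply has the same effect when the apply fails partway, because the cleanup was never registered. Either way, real cloud resources remain, incurring cost and consuming quota.
Detect: an audit of the test account shows resources tagged environment=test that have no corresponding active test run.
Fix: register defer terraform.Destroy before calling InitAndApply, raise the go test timeout above the expected apply time, and set up a nightly cleanup Lambda or scheduled job to delete stale test resources by tag.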
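The selection logic for such a nightly cleanup job might look like the sketch below. It is an assumption-laden illustration: the resource records are plain dictionaries with hypothetical id, tags, and created_at keys, the six-hour threshold is arbitrary, and listing and deleting the real cloud resources is left out.

```python
from datetime import datetime, timedelta


def select_stale_resources(resources: list[dict], now: datetime,
                           max_age: timedelta = timedelta(hours=6),
                           tag_key: str = "environment",
                           tag_value: str = "test") -> list[str]:
    """Return ids of test-tagged resources older than max_age.

    Each record is assumed to look like
    {"id": str, "tags": dict, "created_at": datetime}; this shape is
    hypothetical and would be built from the cloud provider's list APIs.
    """
    stale_ids = []
    for resource in resources:
        # Never touch anything that is not explicitly tagged as a test resource.
        if resource.get("tags", {}).get(tag_key) != tag_value:
            continue
        if now - resource["created_at"] > max_age:
            stale_ids.append(resource["id"])
    return stale_ids
```

Setting max_age comfortably above the longest expected test run avoids deleting resources that belong to a test still in progress.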
Checkov false positives from dynamic references
Why: Checkov's static analysis cannot always resolve Terraform var.* or data.* references, so a resource that looks misconfigured at scan time may be correct once the plan resolves those values.
Detect: Checkov reports a failing check for a resource that is correctly configured when you examine the actual plan.
Fix: combine Checkov static analysis with OPA policies that evaluate the JSON rendering of a saved plan (terraform show -json on the plan file), where variable and data source references are resolved, apart from values known only after apply.
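terraform validate passes but terraform plan fails in CI
Why: validate checks only syntax and internal consistency and never contacts providers or remote state; plan requires provider credentials and may fail on missing IAM permissions, an unreachable remote state backend, or quota limits.
Detect: the CI pipeline passes the validate step, but the later plan or apply step fails with an authentication or resource error.
Fix: add a terraform plan -detailed-exitcode step in CI after validate, using narrowly scoped credentials with read-only access to the state backend (pass -lock=false if those credentials cannot write the state lock). With this flag, exit code 2 means the plan succeeded with changes pending, so the step should fail only on exit code 1.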
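Because -detailed-exitcode returns a non-zero code for a successful plan that contains changes, a CI wrapper has to interpret the code rather than fail on any non-zero value. The sketch below shows one way to do that. The enum and function names are hypothetical, and treating unknown codes as errors is a design assumption, not Terraform behaviour.

```python
from enum import Enum


class PlanOutcome(Enum):
    NO_CHANGES = 0       # plan succeeded, empty diff
    ERROR = 1            # plan failed
    CHANGES_PRESENT = 2  # plan succeeded, non-empty diff


def interpret_plan_exit_code(code: int) -> PlanOutcome:
    """Map a `terraform plan -detailed-exitcode` exit code to an outcome."""
    try:
        return PlanOutcome(code)
    except ValueError:
        # Anything unexpected (for example a signal-derived code) is treated as an error.
        return PlanOutcome.ERROR


def ci_should_fail(code: int) -> bool:
    """Fail the pipeline only when the plan itself failed."""
    return interpret_plan_exit_code(code) is PlanOutcome.ERROR
```

A pull-request pipeline would typically let CHANGES_PRESENT through, while a drift-detection job on the main branch might instead alert on it.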
OPA policy input schema mismatch after Terraform version upgrade
Why: the structure of terraform show -json output can change between Terraform versions (the document carries a format_version field for this reason), so OPA rules that reference input.resource_changes[_].change.after may silently match nothing on the new schema.
Detect: after a Terraform upgrade, OPA deny rules return no violations even for clearly non-compliant plans.
Fix: pin the OPA policy input schema with a unit test that asserts the expected keys are present in a known-bad plan fixture, and that the policy still denies that fixture.
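One possible shape for that schema test is sketched below in Python. It only checks for the keys that the Rego rules in this note rely on, and the helper name is hypothetical. A real suite would load a plan fixture regenerated with the new Terraform version instead of an inline dictionary.

```python
def missing_plan_keys(plan: dict) -> list[str]:
    """List the keys that the policies depend on but the plan JSON lacks.

    An empty list means the fixture still has the shape the rules expect.
    """
    problems = []
    if "format_version" not in plan:
        problems.append("format_version")

    changes = plan.get("resource_changes")
    if not isinstance(changes, list) or not changes:
        # Without resource_changes every deny rule silently matches nothing.
        problems.append("resource_changes")
        return problems

    for index, resource_change in enumerate(changes):
        prefix = f"resource_changes[{index}]"
        for key in ("address", "type", "change"):
            if key not in resource_change:
                problems.append(f"{prefix}.{key}")
        change = resource_change.get("change", {})
        for key in ("actions", "after"):
            if key not in change:
                problems.append(f"{prefix}.change.{key}")
    return problems
```

Asserting that missing_plan_keys returns an empty list for the fixture turns a silent policy bypass into a failing test.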
Connections
tqa-hub · technical-qa/security-automation · technical-qa/chaos-engineering · cloud/gitops-patterns · cloud/platform-engineering · cloud/cloud-security
Open Questions
- What is the most common failure mode when implementing this at scale?
- How does this testing approach need to adapt for distributed or microservice architectures?