VPC Design Patterns
Network architecture patterns for secure, scalable AWS deployments — from single-account to multi-account.
Standard 3-Tier VPC
The foundational pattern for most production workloads:
Public Subnet (per AZ) → Load balancers, NAT Gateways, bastion hosts
Private Subnet (per AZ) → Application servers, ECS tasks, Lambda VPC
Data Subnet (per AZ) → RDS, ElastiCache, OpenSearch
CIDR allocation example (3 AZs, 10.0.0.0/16):
10.0.1.0/24 Public us-east-1a
10.0.2.0/24 Public us-east-1b
10.0.3.0/24 Public us-east-1c
10.0.8.0/22 Private us-east-1a (1019 usable hosts; AWS reserves 5 IPs per subnet)
10.0.12.0/22 Private us-east-1b
10.0.16.0/22 Private us-east-1c
10.0.100.0/24 Data us-east-1a
10.0.101.0/24 Data us-east-1b
10.0.102.0/24 Data us-east-1c
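Note that a /22 must start on a multiple of 4 in the third octet (10.0.10.0/22 would be invalid). A quick way to validate a plan like this before provisioning anything is a minimal sketch with Python's ipaddress module; the plan list below is just the table above:
# Sanity-check a CIDR plan: every subnet must sit inside the VPC CIDR, be aligned
# to its prefix length, and not overlap any other subnet.
import ipaddress
from itertools import combinations

vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
plan = [
    "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24",        # public
    "10.0.8.0/22", "10.0.12.0/22", "10.0.16.0/22",      # private
    "10.0.100.0/24", "10.0.101.0/24", "10.0.102.0/24",  # data
]
subnets = [ipaddress.ip_network(c) for c in plan]        # raises ValueError on a misaligned block
assert all(s.subnet_of(vpc_cidr) for s in subnets), "subnet outside the VPC CIDR"
assert not any(a.overlaps(b) for a, b in combinations(subnets, 2)), "overlapping subnets"
print("plan is valid")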
Rules:
- Internet Gateway: attached to VPC, routes public subnets to internet
- NAT Gateway: one per AZ in public subnet, routes private subnets outbound
- Route tables: separate per tier; private routes 0.0.0.0/0 → NAT GW
CDK — Production VPC
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2
class NetworkStack(cdk.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)

        self.vpc = ec2.Vpc(self, "ProductionVPC",
            ip_addresses=ec2.IpAddresses.cidr("10.0.0.0/16"),
            max_azs=3,
            nat_gateways=3,  # one per AZ — no single point of failure
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="Public",
                    subnet_type=ec2.SubnetType.PUBLIC,
                    cidr_mask=24,
                ),
                ec2.SubnetConfiguration(
                    name="Private",
                    subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS,
                    cidr_mask=22,
                ),
                ec2.SubnetConfiguration(
                    name="Data",
                    subnet_type=ec2.SubnetType.PRIVATE_ISOLATED,
                    cidr_mask=24,
                ),
            ],
            flow_logs={
                "VpcFlowLogs": ec2.FlowLogOptions(
                    destination=ec2.FlowLogDestination.to_cloud_watch_logs(),
                    traffic_type=ec2.FlowLogTrafficType.REJECT,  # log rejected traffic
                )
            },
        )

Security Groups vs NACLs
Security Groups (stateful — preferred):
- Attached to resources (EC2, RDS, Lambda)
- Allow rules only (implicit deny)
- Stateful: responses allowed automatically
- Changes take effect immediately
- Rules can reference other security groups by ID (not just CIDR)
Network ACLs (stateless):
- Attached to subnets
- Allow AND deny rules, numbered priority
- Stateless: you must allow inbound AND outbound explicitly
- Applied to all resources in the subnet
- Use for: block specific IP ranges, add defence-in-depth layer
Best practice:
- Security groups: primary access control (always)
- NACLs: add only if you need subnet-level deny rules (e.g., block known bad IPs)
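For the subnet-level deny case, a minimal CDK sketch (construct names and the blocked range are hypothetical; a custom NACL replaces the default one on the selected subnets, so the explicit allow entries are required):
# NACL on the public subnets: deny a known-bad range, then allow everything else.
# Rules are evaluated in ascending rule_number order; first match wins.
nacl = ec2.NetworkAcl(self, "PublicNacl",
    vpc=vpc,
    subnet_selection=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PUBLIC),
)
nacl.add_entry("DenyBadRange",
    rule_number=90,
    cidr=ec2.AclCidr.ipv4("203.0.113.0/24"),   # documentation range, stands in for a real bad actor
    traffic=ec2.AclTraffic.all_traffic(),
    direction=ec2.TrafficDirection.INGRESS,
    rule_action=ec2.Action.DENY,
)
nacl.add_entry("AllowAllIn",
    rule_number=100,
    cidr=ec2.AclCidr.any_ipv4(),
    traffic=ec2.AclTraffic.all_traffic(),
    direction=ec2.TrafficDirection.INGRESS,
    rule_action=ec2.Action.ALLOW,
)
nacl.add_entry("AllowAllOut",
    rule_number=100,
    cidr=ec2.AclCidr.any_ipv4(),
    traffic=ec2.AclTraffic.all_traffic(),
    direction=ec2.TrafficDirection.EGRESS,    # stateless: outbound must be allowed explicitly
    rule_action=ec2.Action.ALLOW,
)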
Security group: allow HTTPS from ALB only
Inbound: 443 from alb-sg (not the world)
Outbound: 443 to 0.0.0.0/0 (for external API calls)
Outbound: 5432 to rds-sg (for database)
# CDK security group pattern
from aws_cdk import aws_ec2 as ec2
alb_sg = ec2.SecurityGroup(self, "ALBSG",
vpc=vpc,
description="ALB — accepts internet HTTPS",
allow_all_outbound=False,
)
alb_sg.add_ingress_rule(ec2.Peer.any_ipv4(), ec2.Port.tcp(443))
alb_sg.add_ingress_rule(ec2.Peer.any_ipv4(), ec2.Port.tcp(80))
app_sg = ec2.SecurityGroup(self, "AppSG",
vpc=vpc,
description="Application — only accepts traffic from ALB",
allow_all_outbound=True,
)
app_sg.add_ingress_rule(alb_sg, ec2.Port.tcp(8000)) # only from ALB SG
rds_sg = ec2.SecurityGroup(self, "RDSSG",
vpc=vpc,
description="RDS — only accepts traffic from application tier",
allow_all_outbound=False,
)
rds_sg.add_ingress_rule(app_sg, ec2.Port.tcp(5432)) # only from app SGVPC Endpoints (Reduce NAT Cost)
# VPC Endpoints let private subnets reach AWS services WITHOUT a NAT Gateway
# Traffic stays on the AWS backbone — no internet gateway required
# Reduces NAT Gateway data processing costs significantly for S3-heavy workloads
# Gateway endpoint (S3, DynamoDB) — free
s3_endpoint = vpc.add_gateway_endpoint("S3Endpoint",
    service=ec2.GatewayVpcEndpointAwsService.S3,
    subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)],
)
# Interface endpoint (Secrets Manager, SQS, ECR, etc.) — ~$7/month per AZ per service
# Security group for interface endpoints: allow HTTPS from inside the VPC
endpoint_sg = ec2.SecurityGroup(self, "EndpointSG", vpc=vpc, allow_all_outbound=False)
endpoint_sg.add_ingress_rule(ec2.Peer.ipv4(vpc.vpc_cidr_block), ec2.Port.tcp(443))

secrets_endpoint = ec2.InterfaceVpcEndpoint(self, "SecretsManagerEndpoint",
    vpc=vpc,
    service=ec2.InterfaceVpcEndpointAwsService.SECRETS_MANAGER,
    subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
    security_groups=[endpoint_sg],
    private_dns_enabled=True,  # DNS resolves secretsmanager.region.amazonaws.com to private IP
)
# ECR endpoints (required for Fargate in private subnets with no NAT)
# Fixed construct ids: the endpoint service name includes the region, so it makes a poor id
for name, svc in {
    "EcrApi": ec2.InterfaceVpcEndpointAwsService.ECR,
    "EcrDocker": ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER,
    "CloudWatchLogs": ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_LOGS,
}.items():
    ec2.InterfaceVpcEndpoint(self, f"{name}Endpoint",
        vpc=vpc,
        service=svc,
        private_dns_enabled=True,
        subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
    )

Multi-Account Network with Transit Gateway
Single account: VPC peering is fine (one-to-one, no transitive routing)
Multi-account: Transit Gateway (hub-and-spoke, transitive routing, centrally managed)
Typical multi-account layout:
Network account: Transit Gateway (hub), Direct Connect, VPN
Shared Services: Route 53 Resolver, Internal ALBs, ECR
Production: Application VPCs attached to TGW
Dev/Staging: Isolated VPCs (no prod-to-dev path)
Route table isolation:
- Production VPCs → can reach Shared Services, cannot reach Dev
- Dev VPCs → can reach Shared Services, cannot reach Production
- All → can reach Network account (DNS, egress)
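That isolation is just Transit Gateway route tables: associate each attachment with its tier's route table and propagate only the routes that tier should see. A minimal single-stack sketch with the L1 constructs (names are hypothetical, and prod_vpc / shared_vpc stand in for VPCs that in practice live in other accounts and attach via RAM sharing):
# TGW hub with a dedicated prod route table; only Shared Services routes are
# propagated into it, so there is no prod-to-dev path.
tgw = ec2.CfnTransitGateway(self, "Tgw",
    default_route_table_association="disable",   # associate attachments explicitly instead
    default_route_table_propagation="disable",
)

prod_attach = ec2.CfnTransitGatewayAttachment(self, "ProdAttachment",
    transit_gateway_id=tgw.ref,
    vpc_id=prod_vpc.vpc_id,
    subnet_ids=[s.subnet_id for s in prod_vpc.isolated_subnets],  # one TGW ENI per AZ
)
shared_attach = ec2.CfnTransitGatewayAttachment(self, "SharedAttachment",
    transit_gateway_id=tgw.ref,
    vpc_id=shared_vpc.vpc_id,
    subnet_ids=[s.subnet_id for s in shared_vpc.isolated_subnets],
)

prod_rt = ec2.CfnTransitGatewayRouteTable(self, "ProdRouteTable",
    transit_gateway_id=tgw.ref,
)
ec2.CfnTransitGatewayRouteTableAssociation(self, "ProdAssoc",
    transit_gateway_attachment_id=prod_attach.ref,
    transit_gateway_route_table_id=prod_rt.ref,
)
ec2.CfnTransitGatewayRouteTablePropagation(self, "SharedIntoProd",
    transit_gateway_attachment_id=shared_attach.ref,
    transit_gateway_route_table_id=prod_rt.ref,
)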
Cost Optimisation
NAT Gateway: $0.045/GB data processed + $0.045/hour per NAT GW
Biggest savings opportunities:
1. VPC Endpoints for S3 (free gateway endpoint):
Large S3 usage → can save hundreds/month in NAT Gateway charges
2. Centralised NAT (single NAT in one AZ):
Saves: NAT GW cost per AZ
Risk: traffic from other AZs incurs cross-AZ data transfer ($0.01/GB)
Decision: centralise only when outbound traffic from the other AZs is low enough that the added cross-AZ transfer charge stays below the NAT Gateway hours you save
3. NAT Instance (free-tier eligible EC2):
Saves: $0.045/hour × 3 AZs = ~$100/month
Risk: you manage availability, patching, and scaling
Only worth it for dev/staging, never production
4. IPv6 Egress-Only Internet Gateway (free):
For outbound-only IPv6 traffic — no charge
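A sketch of option 2 applied to a dev/staging VPC (stack placement, names, and CIDR are hypothetical):
# One NAT Gateway for the whole VPC: all private subnets route 0.0.0.0/0 to it,
# so an outage in its AZ takes out egress everywhere. Fine for dev, not for prod.
dev_vpc = ec2.Vpc(self, "DevVPC",
    ip_addresses=ec2.IpAddresses.cidr("10.1.0.0/16"),
    max_azs=2,
    nat_gateways=1,   # roughly $33/month in hourly charges instead of one NAT GW per AZ
    subnet_configuration=[
        ec2.SubnetConfiguration(name="Public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24),
        ec2.SubnetConfiguration(name="Private", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS, cidr_mask=22),
    ],
)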
Common Failure Cases
Fargate tasks in private subnets fail to pull images with "CannotPullContainerError"
Why: the task has no network path to ECR. Either the private subnets lack a NAT Gateway (or its 0.0.0.0/0 route is missing from their route table), or the design is NAT-free but the required ECR interface endpoints and S3 gateway endpoint (image layers are pulled from S3) were never created.
Detect: ECS task stopped reason shows CannotPullContainerError: ... no such host or i/o timeout; VPC Flow Logs show REJECT on the ECR endpoint IPs.
Fix: add com.amazonaws.<region>.ecr.api and com.amazonaws.<region>.ecr.dkr interface endpoints plus an S3 gateway endpoint to the private subnets where Fargate runs; enable private_dns_enabled=True on both interface endpoints (gateway endpoints use route-table prefix lists rather than DNS).
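To confirm the diagnosis, a small boto3 sketch that prints the stop reason for recently stopped tasks (the cluster name is a placeholder):
# List stopped tasks and their stop reasons; CannotPullContainerError shows up here.
import boto3

ecs = boto3.client("ecs")
task_arns = ecs.list_tasks(cluster="my-cluster", desiredStatus="STOPPED")["taskArns"]
if task_arns:
    for task in ecs.describe_tasks(cluster="my-cluster", tasks=task_arns)["tasks"]:
        print(task["taskArn"], "->", task.get("stoppedReason", "n/a"))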
NAT Gateway becomes a single point of failure and entire AZ loses internet egress
Why: only one NAT Gateway was deployed (in one AZ) to save cost, and private subnets in the other AZs route their 0.0.0.0/0 traffic to it; when that AZ has an outage the other AZs lose outbound connectivity.
Detect: applications in specific AZs lose outbound connectivity; VPC Flow Logs show all traffic from private subnets in those AZs targeting the NAT Gateway IP being rejected; the NAT Gateway AZ is in an impaired state.
Fix: deploy one NAT Gateway per AZ and associate each private subnet's route table with the NAT Gateway in the same AZ; accept the additional ~$100/month as the cost of AZ-level resilience.
Security group rules accumulate until the 60-rule limit is reached, blocking new rules
Why: engineers add inbound rules over time without removing stale ones; when the per-security-group limit is reached, CloudFormation/CDK updates that add a new rule fail.
Detect: aws ec2 describe-security-groups shows a security group with 55+ rules in one direction; CDK/Terraform applies fail with a RulesPerSecurityGroupLimitExceeded error.
Fix: audit and remove stale rules using VPC Flow Logs to identify rules with zero accepted traffic over 30 days; refactor to reference security group IDs rather than CIDR ranges where possible (one rule covers all IPs in the referenced group).
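A boto3 sketch for the audit step, flagging groups close to the default 60-rules-per-direction quota (the threshold of 50 is arbitrary):
# Count rules per security group: each CIDR, referenced SG, and prefix list counts
# as one rule against the quota.
import boto3

def rule_count(perms):
    return sum(len(p.get("IpRanges", [])) + len(p.get("Ipv6Ranges", []))
               + len(p.get("UserIdGroupPairs", [])) + len(p.get("PrefixListIds", []))
               for p in perms)

ec2_client = boto3.client("ec2")
for page in ec2_client.get_paginator("describe_security_groups").paginate():
    for sg in page["SecurityGroups"]:
        inbound = rule_count(sg["IpPermissions"])
        outbound = rule_count(sg["IpPermissionsEgress"])
        if max(inbound, outbound) >= 50:
            print(f"{sg['GroupId']} ({sg['GroupName']}): {inbound} in / {outbound} out")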
VPC CIDR conflict blocks Transit Gateway attachment to a new VPC
Why: a new VPC was created with a CIDR that overlaps an existing VPC already attached to the Transit Gateway; Transit Gateway route tables cannot route overlapping CIDRs.
Detect: the TGW attachment is created, but the overlapping prefix never propagates into the route table (aws ec2 search-transit-gateway-routes still returns only the original VPC's route for that CIDR); connectivity between the new VPC and the others fails.
Fix: delete the conflicting VPC and recreate with a non-overlapping CIDR; maintain a CIDR allocation registry (IPAM or a spreadsheet) before provisioning any new VPC to prevent overlaps.
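A minimal pre-provisioning check, assuming the registry is the account itself: compare the proposed CIDR against every VPC CIDR the account can see (run it per account, or query IPAM instead):
# Check a proposed CIDR against existing VPC CIDRs before creating the VPC / TGW attachment.
import boto3, ipaddress

proposed = ipaddress.ip_network("10.2.0.0/16")   # hypothetical new VPC CIDR
ec2_client = boto3.client("ec2")
for vpc_desc in ec2_client.describe_vpcs()["Vpcs"]:
    for assoc in vpc_desc.get("CidrBlockAssociationSet", []):
        existing = ipaddress.ip_network(assoc["CidrBlock"])
        if proposed.overlaps(existing):
            raise SystemExit(f"{proposed} overlaps {existing} (VPC {vpc_desc['VpcId']})")
print(f"{proposed} does not overlap any VPC in this account")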
Connections
cloud/cloud-hub · cloud/cloud-networking · cloud/aws-networking-advanced · cloud/aws-eks · cloud/aws-fargate · cloud/multi-tenancy · cloud/aws-cdk
Open Questions
- What monitoring and alerting matter most when this is deployed in production?
- At what scale or workload does this approach hit its practical limits?