AWS SQS and SNS
The messaging backbone of AWS event-driven architectures. SQS = queue (point-to-point). SNS = pub/sub (one-to-many fan-out). They compose: SNS fan-out to multiple SQS queues is the most common production pattern.
SQS — Simple Queue Service
Managed message queue. Producers send messages; consumers poll and process; consumers delete after processing. At-least-once delivery.
Queue types:
- Standard — best-effort ordering, at-least-once delivery, nearly unlimited throughput
- FIFO — exactly-once processing, strict ordering, 3,000 msg/s (300 without batching)
import json

import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789/my-queue"

# Send
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"order_id": "ORD-123", "total": 49.99}),
    MessageAttributes={
        "priority": {"DataType": "String", "StringValue": "high"}
    },
)

# Receive and process
while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,  # batch up to 10
        WaitTimeSeconds=20,      # long polling: reduces empty responses
        VisibilityTimeout=60,    # other consumers can't see it for 60s
    )
    for msg in response.get("Messages", []):
        body = json.loads(msg["Body"])
        try:
            process_order(body)  # your business logic, defined elsewhere
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
        except Exception:
            pass  # not deleted: message becomes visible again after VisibilityTimeout

Visibility timeout: how long a message stays hidden from other consumers after it is received. As a rule of thumb, set it to about 6× your average processing time. If processing can run longer than that, extend it with ChangeMessageVisibility.
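Extending the visibility timeout during long-running work can be sketched as a small heartbeat. This is a minimal illustration, not a definitive implementation: `process_with_heartbeat`, `handler`, and the `interval` default are hypothetical names and values, and `sqs` is assumed to be a boto3 SQS client (anything exposing `change_message_visibility` works).

```python
import threading


def process_with_heartbeat(sqs, queue_url, msg, handler, timeout=60, interval=20):
    """Run handler(msg) while periodically extending the message's visibility timeout.

    Hypothetical helper: `sqs` is assumed to behave like a boto3 SQS client.
    """
    done = threading.Event()

    def heartbeat():
        # wait() returns False when the interval elapses without done being set,
        # which means the handler is still running and needs more time.
        while not done.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=timeout,  # seconds, counted from this call
            )

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        return handler(msg)
    finally:
        # Stop the heartbeat whether the handler succeeded or raised.
        done.set()
        worker.join()
```

Keeping `interval` well below `timeout` leaves headroom for a delayed heartbeat before the message becomes visible to other consumers again.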
Dead Letter Queue (DLQ): once a message has been received maxReceiveCount times without being deleted, SQS moves it to the DLQ. Alert on DLQ depth.
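A DLQ depth check might look like the sketch below. The function names and the threshold are hypothetical, and `sqs` is assumed to be a boto3 SQS client. In production a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric is the more usual route; this only shows the underlying query.

```python
def dlq_depth(sqs, dlq_url):
    """Return the approximate number of messages waiting in the DLQ."""
    response = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


def dlq_needs_attention(sqs, dlq_url, threshold=0):
    """True when the DLQ holds more messages than the (hypothetical) threshold."""
    return dlq_depth(sqs, dlq_url) > threshold
```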
# Set DLQ on a queue (the RedrivePolicy value is a JSON string, so it stays on one line)
aws sqs set-queue-attributes \
  --queue-url https://sqs.eu-west-1.amazonaws.com/123456789/my-queue \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-west-1:123456789:my-dlq\",\"maxReceiveCount\":\"5\"}"
  }'

SNS — Simple Notification Service
Pub/sub. Publishers send to a Topic; SNS fans out to all subscribers (SQS queues, Lambda, HTTP endpoints, email, SMS).
sns = boto3.client("sns", region_name="eu-west-1")

# Publish
sns.publish(
    TopicArn="arn:aws:sns:eu-west-1:123456789:order-events",
    Message=json.dumps({
        "event": "ORDER_PLACED",
        "orderId": "ORD-123",
        "total": 49.99,
    }),
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "ORDER_PLACED"}
    },
)

SNS message filtering: each subscription can carry a filter policy, so a subscriber receives only the messages whose attributes match it. For example, the fulfillment queue gets ORDER_PLACED, while the analytics queue (no filter policy) gets all events. The policy below matches on the event_type message attribute.
{
  "event_type": ["ORDER_PLACED", "ORDER_SHIPPED"]
}

Fan-Out Pattern
SNS topic → multiple SQS queues. Each queue drives an independent microservice. Decouples producers from consumers completely.
Order Service
      │
      ▼
SNS: order-events
  ├──► SQS: fulfillment-queue ──► Fulfillment Lambda
  ├──► SQS: email-queue       ──► Email Lambda
  ├──► SQS: analytics-queue   ──► Analytics Lambda
  └──► SQS: audit-queue       ──► Audit Lambda
All four consumers process every order event independently. Adding a new consumer is a new SQS subscription. No change to the Order Service.
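The fan-out behaviour, including optional per-subscription filter policies, can be modelled locally. The `Topic` class below is an in-memory stand-in written for illustration, not the SNS API, and it supports only exact string matching on attributes.

```python
from collections import defaultdict


class Topic:
    """In-memory stand-in for an SNS topic (illustration only, not the SNS API)."""

    def __init__(self):
        self.subscriptions = []          # (queue_name, filter_policy or None)
        self.queues = defaultdict(list)  # queue_name -> delivered messages

    def subscribe(self, queue_name, filter_policy=None):
        self.subscriptions.append((queue_name, filter_policy))

    def publish(self, message, attributes):
        # Every subscription is evaluated independently of the others.
        for queue_name, policy in self.subscriptions:
            if policy is None or self._matches(policy, attributes):
                self.queues[queue_name].append(message)

    @staticmethod
    def _matches(policy, attributes):
        # Exact string matching only: every policy key must be present
        # and its value must be one of the allowed values.
        return all(attributes.get(key) in allowed for key, allowed in policy.items())


topic = Topic()
topic.subscribe("fulfillment-queue", {"event_type": ["ORDER_PLACED"]})
topic.subscribe("analytics-queue")  # no filter policy: receives everything

topic.publish({"orderId": "ORD-123"}, {"event_type": "ORDER_PLACED"})
topic.publish({"orderId": "ORD-123"}, {"event_type": "ORDER_SHIPPED"})
# fulfillment-queue now holds 1 message, analytics-queue holds 2
```

The publisher never references a queue by name, which is why adding a consumer needs only another `subscribe` call.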
SQS as Lambda Trigger
Lambda polls SQS automatically. Configure batch size and concurrency.
aws lambda create-event-source-mapping \
  --function-name order-processor \
  --event-source-arn arn:aws:sqs:eu-west-1:123456789:fulfillment-queue \
  --batch-size 10 \
  --function-response-types ReportBatchItemFailures

Lambda scales out as the queue backlog grows, with one invocation per SQS batch. For FIFO queues, concurrency is capped by the number of active message groups. For Standard queues, it is limited by the function's available concurrency (or by the event source mapping's maximum concurrency, if you set one).
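With ReportBatchItemFailures enabled, the function reports which records failed so that only those return to the queue. A minimal handler sketch follows; `process_order` and its validation rule are hypothetical placeholders for real business logic.

```python
import json


def process_order(order):
    # Hypothetical business logic; raises on invalid input.
    if "order_id" not in order:
        raise ValueError("missing order_id")


def handler(event, context):
    """SQS-triggered Lambda handler that reports per-record failures."""
    failures = []
    for record in event["Records"]:
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Only this record is reported as failed and retried;
            # records not listed are treated as successfully processed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list marks the whole batch as successful. For FIFO queues the handler should instead stop at the first failure and report that record plus all remaining ones, so ordering within the group is preserved.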
EventBridge vs SNS vs SQS
| | SQS | SNS | EventBridge |
|---|---|---|---|
| Pattern | Queue (point-to-point) | Pub/sub fan-out | Event bus (content routing) |
| Consumers | One consumer per message | All subscribers | Rules-based routing |
| Filtering | At consumer level | Message attribute filter | Rich content-based rules |
| Replay | No (DLQ only) | No | Archive + replay (configurable retention) |
| Sources | Your code | Your code | 200+ AWS services + SaaS |
EventBridge is the modern choice when you need routing based on event content or AWS service events.
Common Failure Cases
Messages landing in DLQ immediately — maxReceiveCount too low
Why: a transient processing error (e.g., DB timeout) on a message increments its receive count, and if maxReceiveCount is set to 1 or 2, it hits the DLQ before any meaningful retry.
Detect: DLQ depth grows rapidly immediately after messages are sent; the original queue depth stays low.
Fix: set maxReceiveCount to at least 5 to allow for transient failures, and pair it with an appropriate VisibilityTimeout so retries have time to succeed.
FIFO queue throughput bottleneck — all messages use the same MessageGroupId
Why: FIFO queues process one message at a time per MessageGroupId; using a single group ID effectively serialises all processing to a single consumer.
Detect: queue depth grows despite Lambda concurrency being available; CloudWatch NumberOfMessagesSent > NumberOfMessagesDeleted consistently.
Fix: partition messages into multiple MessageGroupId values (e.g., by customer ID, order shard) to allow parallel processing across groups.
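One way to spread messages across groups is to derive the MessageGroupId from a stable key. The sketch below assumes a customer ID is available on every message; `message_group_id` and the shard count of 16 are hypothetical choices.

```python
import hashlib


def message_group_id(customer_id, shards=16):
    """Map a customer ID to one of `shards` message groups, deterministically."""
    # A cryptographic hash is used only for its stable, well-spread output;
    # Python's built-in hash() is randomised per process and would not be stable.
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return f"shard-{int(digest, 16) % shards}"
```

The result would be passed as `MessageGroupId` when sending to the FIFO queue. Messages for the same customer always land in the same group, so per-customer ordering is kept while different shards are processed in parallel. Using the customer ID directly as the group ID also works and gives finer-grained parallelism.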
SNS delivery to SQS silently drops messages — missing access policy
Why: when SNS delivers to an SQS queue, the queue's access policy must explicitly allow sqs:SendMessage for the SNS service principal, scoped to the topic ARN; without it, delivery fails and the publisher sees no error.
Detect: SNS NumberOfNotificationsDelivered is zero for the SQS subscription; nothing surfaces by default because SNS delivery failures are not automatically alarmed.
Fix: add an SQS access policy statement allowing sqs:SendMessage to the principal sns.amazonaws.com, with ArnEquals: {"aws:SourceArn": "<topic-arn>"} as the condition; alarm on the SNS NumberOfNotificationsFailed metric.
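The policy statement described in the fix can be built as a plain dictionary. This is a sketch: `sns_to_sqs_policy` is a hypothetical helper, and the ARNs reuse the placeholder account ID from the examples above.

```python
import json


def sns_to_sqs_policy(queue_arn, topic_arn):
    """Build a queue access policy that lets one SNS topic send to one queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Restrict delivery to this specific topic.
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }


policy_json = json.dumps(
    sns_to_sqs_policy(
        "arn:aws:sqs:eu-west-1:123456789:fulfillment-queue",
        "arn:aws:sns:eu-west-1:123456789:order-events",
    )
)
```

With boto3, `policy_json` would be applied through `set_queue_attributes(QueueUrl=..., Attributes={"Policy": policy_json})`.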
Message processing causes infinite retry loop — poison pill message
Why: a structurally invalid message (malformed JSON, unexpected schema) fails on every attempt, stays in flight for a full visibility timeout each time, and keeps re-entering the queue until maxReceiveCount is reached (or until the retention period expires if no DLQ is configured).
Detect: the DLQ consistently receives the same message IDs; the original queue's ApproximateNumberOfMessagesNotVisible is elevated.
Fix: wrap the message parser in a try/except that catches schema errors and explicitly deletes the message (or routes to DLQ immediately) rather than letting it time out and re-enter.
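The fix can be sketched by separating permanent failures from transient ones. All names here (`PoisonMessage`, `parse_order`, `handle`, the required `order_id` field) are hypothetical, and `sqs` is assumed to be a boto3 SQS client.

```python
import json


class PoisonMessage(Exception):
    """Raised when a message can never be processed, however often it is retried."""


def parse_order(body):
    try:
        order = json.loads(body)
    except json.JSONDecodeError as exc:
        raise PoisonMessage(f"malformed JSON: {exc}") from exc
    if not isinstance(order, dict) or "order_id" not in order:
        raise PoisonMessage("missing order_id")
    return order


def handle(sqs, queue_url, dlq_url, msg, process):
    try:
        order = parse_order(msg["Body"])
    except PoisonMessage:
        # Park the raw body in the DLQ for inspection, then remove it from the
        # source queue so it does not wait out further visibility timeouts.
        sqs.send_message(QueueUrl=dlq_url, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        return "parked"
    process(order)  # transient errors propagate, so the message is retried
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return "processed"
```

Only parsing and schema errors bypass the retry path; anything raised by `process` still goes through the normal visibility timeout and maxReceiveCount cycle.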
Connections
cloud-hub · cloud/aws-core · cloud/aws-lambda-patterns · cloud/cloud-monitoring
Open Questions
- What monitoring and alerting matter most when this is deployed in production?
- At what scale or workload does this approach hit its practical limits?
Related reading