AWS SQS and SNS

The messaging backbone of AWS event-driven architectures. SQS = queue (point-to-point). SNS = pub/sub (one-to-many fan-out). They compose: SNS fan-out to multiple SQS queues is the most common production pattern.

SQS — Simple Queue Service

Managed message queue. Producers send messages; consumers poll and process; consumers delete after processing. At-least-once delivery.

Queue types:

Standard — best-effort ordering, at-least-once delivery, nearly unlimited throughput
FIFO — exactly-once processing, strict ordering within a message group, 300 API calls/s per action by default (up to 3,000 msg/s with batches of 10)

import json

import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789/my-queue"

# Send
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"order_id": "ORD-123", "total": 49.99}),
    MessageAttributes={
        "priority": {"DataType": "String", "StringValue": "high"}
    }
)

# Receive and process
while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,       # batch up to 10
        WaitTimeSeconds=20,           # long polling — reduces empty responses
        VisibilityTimeout=60          # other consumers can't see it for 60s
    )

    for msg in response.get("Messages", []):
        body = json.loads(msg["Body"])
        try:
            process_order(body)
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"]
            )
        except Exception:
            pass  # message becomes visible again after VisibilityTimeout

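Visibility timeout — how long the message is hidden from other consumers after it's received. A common starting point is 6× your average processing time (for Lambda consumers, AWS recommends at least 6× the function timeout). If processing takes longer, extend with ChangeMessageVisibility before the timeout expires.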
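
Extending the timeout for a long-running message could look roughly like the sketch below. The function name process_with_extension, the steps list, and the injectable clock are illustrative assumptions; only change_message_visibility and delete_message are boto3 SQS client calls. The client is passed in, so the sketch has no hard dependency on boto3.

```python
import time


def process_with_extension(sqs, queue_url, msg, steps,
                           visibility_timeout=60, clock=time.monotonic):
    """Run `steps` in order, renewing the visibility window when it is half used.

    `sqs` is a boto3 SQS client (or any object exposing the same two methods).
    `steps` is a hypothetical list of callables that each take the message.
    """
    receipt = msg["ReceiptHandle"]
    deadline = clock() + visibility_timeout
    for step in steps:
        # Less than half of the window is left: ask SQS for a fresh one.
        if deadline - clock() < visibility_timeout / 2:
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt,
                VisibilityTimeout=visibility_timeout,  # counted from now
            )
            deadline = clock() + visibility_timeout
        step(msg)
    # Delete only after every step has succeeded.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
```

Checking between steps keeps the sketch single-threaded; a consumer with one long blocking call would need a background heartbeat instead.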

Dead Letter Queue (DLQ) — once a message has been received maxReceiveCount times without being deleted, SQS moves it to the DLQ. Alert on DLQ depth (ApproximateNumberOfMessagesVisible on the DLQ).

# Set DLQ on a queue
aws sqs set-queue-attributes \
  --queue-url https://sqs.eu-west-1.amazonaws.com/123456789/my-queue \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-west-1:123456789:my-dlq\",\"maxReceiveCount\":\"5\"}"}'
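
The same redrive policy can be set from Python, which avoids hand-escaping nested JSON. This is a minimal sketch: redrive_attributes is a hypothetical helper name, and the ARN is the placeholder used above.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes argument for sqs.set_queue_attributes."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # RedrivePolicy is itself a JSON document, passed as a string value.
    return {"RedrivePolicy": json.dumps(policy)}


attrs = redrive_attributes("arn:aws:sqs:eu-west-1:123456789:my-dlq")
# With a real client (QUEUE_URL as defined earlier):
# sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes=attrs)
```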

SNS — Simple Notification Service

Pub/sub. Publishers send to a Topic; SNS fans out to all subscribers (SQS queues, Lambda, HTTP endpoints, email, SMS).

sns = boto3.client("sns", region_name="eu-west-1")

# Publish
sns.publish(
    TopicArn="arn:aws:sns:eu-west-1:123456789:order-events",
    Message=json.dumps({
        "event": "ORDER_PLACED",
        "orderId": "ORD-123",
        "total": 49.99
    }),
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "ORDER_PLACED"}
    }
)

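SNS message filtering — subscribers only receive messages whose attributes match the subscription's filter policy; a subscription without a policy receives everything. With the policy below, the fulfillment queue gets only ORDER_PLACED and ORDER_SHIPPED events, while the analytics queue has no policy and gets all events.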

{
  "event_type": ["ORDER_PLACED", "ORDER_SHIPPED"]
}
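
To make the matching rule concrete, here is a simplified local model of how such a policy is evaluated: every key in the policy must be present on the message, with a value from the allowed list. matches_filter is an illustrative function, not part of boto3, and ORDER_CANCELLED is a made-up event type. Real SNS policies also support operators such as prefix and numeric matching, which this sketch ignores.

```python
def matches_filter(policy, message_attributes):
    """Exact-string matching only: AND across keys, OR within each value list."""
    for key, allowed in policy.items():
        attr = message_attributes.get(key)
        if attr is None or attr.get("StringValue") not in allowed:
            return False
    return True


policy = {"event_type": ["ORDER_PLACED", "ORDER_SHIPPED"]}
placed = {"event_type": {"DataType": "String", "StringValue": "ORDER_PLACED"}}
cancelled = {"event_type": {"DataType": "String", "StringValue": "ORDER_CANCELLED"}}

print(matches_filter(policy, placed))     # True
print(matches_filter(policy, cancelled))  # False
print(matches_filter({}, cancelled))      # True: no policy, so everything matches

# Attaching the policy to a subscription with a real client:
# sns.set_subscription_attributes(
#     SubscriptionArn=subscription_arn,          # assumed to exist
#     AttributeName="FilterPolicy",
#     AttributeValue=json.dumps(policy),
# )
```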

Fan-Out Pattern

SNS topic → multiple SQS queues. Each queue drives an independent microservice. Decouples producers from consumers completely.

Order Service
    │
    ▼
SNS: order-events
    ├──► SQS: fulfillment-queue  ──► Fulfillment Lambda
    ├──► SQS: email-queue        ──► Email Lambda
    ├──► SQS: analytics-queue    ──► Analytics Lambda
    └──► SQS: audit-queue        ──► Audit Lambda

All four consumers process every order event independently. Adding a new consumer is a new SQS subscription. No change to the Order Service.
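
One detail on the consumer side: unless raw message delivery is enabled on the subscription, the SQS message body is an SNS JSON envelope, with the publisher's payload as a string in its "Message" field. A consumer might unwrap it as sketched below; unwrap_sns_envelope is a hypothetical helper name.

```python
import json


def unwrap_sns_envelope(sqs_body):
    """Return the publisher's payload from an SQS message body."""
    outer = json.loads(sqs_body)
    if isinstance(outer, dict) and outer.get("Type") == "Notification":
        # Standard delivery: the payload is a JSON string inside the envelope.
        return json.loads(outer["Message"])
    return outer  # raw delivery: the body is already the payload


envelope = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({"event": "ORDER_PLACED", "orderId": "ORD-123"}),
})
print(unwrap_sns_envelope(envelope)["orderId"])  # ORD-123

# Subscribing a queue with raw delivery, so no unwrapping is needed:
# sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn,
#               Attributes={"RawMessageDelivery": "true"})
```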

SQS as Lambda Trigger

Lambda polls SQS automatically. Configure batch size and concurrency.

aws lambda create-event-source-mapping \
  --function-name order-processor \
  --event-source-arn arn:aws:sqs:eu-west-1:123456789:fulfillment-queue \
  --batch-size 10 \
  --function-response-types ReportBatchItemFailures

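Lambda scales the number of concurrent pollers with queue depth, with one invocation per batch. For FIFO queues, concurrency is capped by the number of active message groups; for Standard queues, it is limited by the function's and account's concurrency limits and by any maximum concurrency configured on the event source mapping.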
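
With ReportBatchItemFailures enabled as in the mapping above, the handler returns the IDs of the records that failed, so only those are retried. A minimal handler might look like this; process_order here is a stand-in for real business logic.

```python
import json


def process_order(order):
    """Hypothetical business logic: rejects orders without an order_id."""
    if "order_id" not in order:
        raise ValueError("missing order_id")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Report only this record; the successful ones are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list marks the whole batch as successful. If the handler raises instead, the whole batch is retried.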

SQS vs SNS vs EventBridge

            SQS                        SNS                        EventBridge
Pattern     Queue (point-to-point)     Pub/sub fan-out            Event bus (content routing)
Consumers   One consumer per message   All subscribers            Rules-based routing
Filtering   At consumer level          Message attribute filter   Rich content-based rules
Replay      No (DLQ redrive only)      No                         Archive + replay (configurable retention)
Sources     Your code                  Your code                  200+ AWS services + SaaS

EventBridge is the modern choice when you need routing based on event content or AWS service events.

Common Failure Cases

Messages landing in DLQ immediately — maxReceiveCount too low
Why: a transient processing error (e.g., DB timeout) increments the message's receive count, and if maxReceiveCount is set to 1 or 2, the message hits the DLQ before any meaningful retry.
Detect: DLQ depth grows shortly after messages are sent, while the original queue depth stays low.
Fix: set maxReceiveCount to at least 5 to allow for transient failures, and pair it with an appropriate VisibilityTimeout so retries have time to succeed.

FIFO queue throughput bottleneck — all messages use the same MessageGroupId
Why: FIFO queues process one message at a time per MessageGroupId; using a single group ID effectively serialises all processing to a single consumer.
Detect: queue depth grows despite Lambda concurrency being available; CloudWatch NumberOfMessagesSent > NumberOfMessagesDeleted consistently.
Fix: partition messages into multiple MessageGroupId values (e.g., by customer ID, order shard) to allow parallel processing across groups.
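
One way to partition is to hash a business key into a fixed number of groups, so that one customer's messages stay ordered while different customers proceed in parallel. In this sketch, group_id_for and the bucket count of 32 are illustrative choices, not anything SQS prescribes.

```python
import hashlib


def group_id_for(customer_id, num_groups=32):
    """Map a customer ID to a stable MessageGroupId bucket."""
    # hashlib is used instead of hash(), which is salted per process.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_groups
    return f"group-{bucket}"


# Sending to a FIFO queue with a real client:
# sqs.send_message(
#     QueueUrl=FIFO_QUEUE_URL,                    # assumed .fifo queue URL
#     MessageBody=body,
#     MessageGroupId=group_id_for(customer_id),
#     MessageDeduplicationId=order_id,            # or enable content-based dedup
# )
```

Using the customer ID itself as the MessageGroupId also works and gives the most parallelism; bucketing is only useful when you want to bound the number of distinct groups.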

SNS delivery to SQS silently drops messages — missing access policy
Why: when SNS delivers to an SQS queue, the queue's access policy must explicitly allow sqs:SendMessage from the SNS topic ARN; without it, delivery fails while Publish calls still succeed, so the producer sees no error.
Detect: SNS NumberOfNotificationsDelivered stays at zero (or NumberOfNotificationsFailed rises) while the queue receives nothing; no alarm fires by default.
Fix: add an SQS access policy statement allowing sqs:SendMessage to the sns.amazonaws.com service principal, with ArnEquals: {"aws:SourceArn": "<topic-arn>"} as the condition; alarm on the SNS NumberOfNotificationsFailed metric.
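
A sketch of the policy described in the fix, built in Python and ready to pass as the queue's Policy attribute. The helper name sns_to_sqs_policy and the Sid are arbitrary, and the ARNs are the placeholders used elsewhere on this page.

```python
import json


def sns_to_sqs_policy(queue_arn, topic_arn):
    """Return a queue access policy letting one SNS topic send to one queue."""
    statement = {
        "Sid": "AllowSnsTopicToSend",
        "Effect": "Allow",
        "Principal": {"Service": "sns.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
        # Restrict the grant to this topic only.
        "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]})


policy_json = sns_to_sqs_policy(
    "arn:aws:sqs:eu-west-1:123456789:fulfillment-queue",
    "arn:aws:sns:eu-west-1:123456789:order-events",
)
# With a real client:
# sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": policy_json})
```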

Message processing causes a retry loop — poison pill message
Why: a structurally invalid message (malformed JSON, unexpected schema) always fails processing, consumes a visibility timeout slot each time, and re-enters the queue until maxReceiveCount is reached (or, with no DLQ configured, until the retention period expires).
Detect: the DLQ consistently receives the same kind of malformed message; the original queue's ApproximateNumberOfMessagesNotVisible is elevated.
Fix: wrap the message parser in a try/except that catches schema errors and explicitly deletes the message (or routes it to the DLQ immediately) rather than letting it time out and re-enter.
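
The fix could be sketched as follows, separating permanent failures (parked at once) from transient ones (left for SQS to retry). PermanentError, parse_order, handle, and the dlq_url parameter are all illustrative names; the client is passed in and needs only send_message and delete_message.

```python
import json


class PermanentError(Exception):
    """Marks a message that can never succeed (hypothetical marker type)."""


def parse_order(body):
    try:
        order = json.loads(body)
    except json.JSONDecodeError as exc:
        raise PermanentError(f"malformed JSON: {exc}") from exc
    if not isinstance(order, dict) or "order_id" not in order:
        raise PermanentError("missing order_id")
    return order


def handle(sqs, queue_url, dlq_url, msg, process):
    receipt = msg["ReceiptHandle"]
    try:
        order = parse_order(msg["Body"])
    except PermanentError:
        # Park the message for inspection, then remove it from the source queue.
        sqs.send_message(QueueUrl=dlq_url, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
        return "parked"
    process(order)  # transient errors propagate; SQS retries after the timeout
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
    return "done"
```

Sending to the DLQ before deleting means a crash between the two calls leaves a duplicate in the DLQ; the message itself is not lost.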

Connections

cloud-hub · cloud/aws-core · cloud/aws-lambda-patterns · cloud/cloud-monitoring

Open Questions

What monitoring and alerting matter most when this is deployed in production?
At what scale or workload does this approach hit its practical limits?