AWS SQS and SNS

The messaging backbone of AWS event-driven architectures. SQS = queue (point-to-point). SNS = pub/sub (one-to-many fan-out). They compose: SNS fan-out to multiple SQS queues is the most common production pattern.


SQS — Simple Queue Service

Managed message queue. Producers send messages; consumers poll and process; consumers delete after processing. At-least-once delivery.

Queue types:

  • Standard — best-effort ordering, at-least-once delivery, nearly unlimited throughput
  • FIFO — exactly-once processing, strict ordering, 3,000 msg/s (300 without batching)
import json

import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789/my-queue"

# Send
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"order_id": "ORD-123", "total": 49.99}),
    MessageAttributes={
        "priority": {"DataType": "String", "StringValue": "high"}
    }
)

# Receive and process
while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,       # batch up to 10
        WaitTimeSeconds=20,           # long polling — reduces empty responses
        VisibilityTimeout=60          # other consumers can't see it for 60s
    )

    for msg in response.get("Messages", []):
        body = json.loads(msg["Body"])
        try:
            process_order(body)
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"]
            )
        except Exception:
            pass  # message becomes visible again after VisibilityTimeout

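Visibility timeout: how long a message stays hidden from other consumers after it is received. As a rule of thumb, set it to several times your typical processing time (for Lambda triggers, AWS recommends at least 6× the function timeout). If processing may run longer, extend it with ChangeMessageVisibility.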
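Extending the timeout from inside a consumer could look like the sketch below. It assumes a boto3 SQS client is passed in; extend_visibility and clamp_visibility are hypothetical helper names, and the cap reflects the 12-hour maximum that SQS accepts for VisibilityTimeout.

```python
MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # SQS rejects values above 12 hours


def clamp_visibility(seconds):
    """Keep a requested timeout inside the range SQS accepts (0 to 12 hours)."""
    return max(0, min(int(seconds), MAX_VISIBILITY_SECONDS))


def extend_visibility(sqs_client, queue_url, receipt_handle, seconds):
    """Reset the visibility timeout of one in-flight message.

    The new timeout counts from the moment of this call, not from the
    original receive. sqs_client is assumed to be boto3.client("sqs").
    """
    timeout = clamp_visibility(seconds)
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=timeout,
    )
    return timeout
```

A long-running handler would typically call extend_visibility periodically (a heartbeat) rather than guessing one large timeout up front.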

Dead Letter Queue (DLQ) — after N failed processing attempts, messages move to a DLQ. Alert on DLQ depth.

# Set DLQ on a queue
# The RedrivePolicy value is itself a JSON string, so it must stay on one line
# (JSON strings cannot contain raw newlines).
aws sqs set-queue-attributes \
  --queue-url https://sqs.eu-west-1.amazonaws.com/123456789/my-queue \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-west-1:123456789:my-dlq\",\"maxReceiveCount\":\"5\"}"
  }'
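
The same redrive configuration can be applied from Python, which sidesteps the nested shell quoting. This is a minimal sketch: redrive_attributes and attach_dlq are hypothetical helper names, and the client is assumed to be boto3.client("sqs").

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes argument for set_queue_attributes.

    RedrivePolicy must be a JSON string, so json.dumps handles the quoting
    that is easy to get wrong by hand on the command line.
    """
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


def attach_dlq(sqs_client, queue_url, dlq_arn, max_receive_count=5):
    """Attach a DLQ to an existing queue (sqs_client: assumed boto3 SQS client)."""
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=redrive_attributes(dlq_arn, max_receive_count),
    )
```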

SNS — Simple Notification Service

Pub/sub. Publishers send to a Topic; SNS fans out to all subscribers (SQS queues, Lambda, HTTP endpoints, email, SMS).

sns = boto3.client("sns", region_name="eu-west-1")

# Publish
sns.publish(
    TopicArn="arn:aws:sns:eu-west-1:123456789:order-events",
    Message=json.dumps({
        "event": "ORDER_PLACED",
        "orderId": "ORD-123",
        "total": 49.99
    }),
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "ORDER_PLACED"}
    }
)

SNS message filtering — subscribers only receive messages matching their filter policy. The fulfillment queue gets ORDER_PLACED; the analytics queue gets all events.

{
  "event_type": ["ORDER_PLACED", "ORDER_SHIPPED"]
}
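
A filter policy is attached to the subscription, not to the topic. The sketch below builds the arguments for sns.subscribe; filtered_subscription_kwargs is a hypothetical helper, and the ARNs in the usage comment are placeholders.

```python
import json


def filtered_subscription_kwargs(topic_arn, queue_arn, event_types):
    """Build keyword arguments for sns.subscribe with an attribute filter."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        "Attributes": {
            # Deliver only messages whose event_type attribute is in the list.
            "FilterPolicy": json.dumps({"event_type": list(event_types)}),
        },
        "ReturnSubscriptionArn": True,
    }


# Usage, assuming sns = boto3.client("sns"):
# sns.subscribe(**filtered_subscription_kwargs(topic_arn, queue_arn, ["ORDER_PLACED"]))
```

A subscription with no FilterPolicy receives every message, which is how the analytics queue in the example above would be set up.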

Fan-Out Pattern

SNS topic → multiple SQS queues. Each queue drives an independent microservice. Decouples producers from consumers completely.

Order Service
    │
    ▼
SNS: order-events
    ├──► SQS: fulfillment-queue  ──► Fulfillment Lambda
    ├──► SQS: email-queue        ──► Email Lambda
    ├──► SQS: analytics-queue    ──► Analytics Lambda
    └──► SQS: audit-queue        ──► Audit Lambda

All four consumers process every order event independently. Adding a new consumer is a new SQS subscription. No change to the Order Service.
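
On the consumer side, a message that reached SQS through SNS arrives wrapped in an SNS envelope unless raw message delivery is enabled on the subscription. A minimal sketch of unwrapping it, with unwrap_sns_envelope as a hypothetical helper name:

```python
import json


def unwrap_sns_envelope(sqs_body):
    """Return (payload, attributes) from an SQS body delivered by SNS.

    Assumes raw message delivery is off, so the body is an SNS JSON envelope
    and the publisher's payload sits in its "Message" field as a string.
    """
    envelope = json.loads(sqs_body)
    payload = json.loads(envelope["Message"])
    attributes = {
        name: attr.get("Value")
        for name, attr in envelope.get("MessageAttributes", {}).items()
    }
    return payload, attributes
```

With raw message delivery enabled, the SQS body is the published payload itself and this step is unnecessary.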


SQS as Lambda Trigger

Lambda polls SQS automatically. Configure batch size and concurrency.

aws lambda create-event-source-mapping \
  --function-name order-processor \
  --event-source-arn arn:aws:sqs:eu-west-1:123456789:fulfillment-queue \
  --batch-size 10 \
  --function-response-types ReportBatchItemFailures

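At scale, Lambda scales horizontally, with one invocation per SQS batch. For Standard queues, Lambda adds concurrent invocations as the backlog grows, bounded by the event source mapping's maximum concurrency setting (if configured) and the account's concurrency limit. For FIFO queues, concurrency is bounded by the number of active message groups, because each group is processed in order.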
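
With ReportBatchItemFailures enabled, the handler reports which messages failed so that only those are retried. A minimal sketch follows; process_order is a hypothetical stand-in for the real business logic.

```python
import json


def process_order(order):
    """Hypothetical business logic; raises on failure."""
    if "order_id" not in order:
        raise ValueError("missing order_id")


def handler(event, context):
    """Lambda entry point for an SQS event source mapping."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Report only this message; successfully processed ones are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list marks the whole batch as successful. Without this response type, one failing message causes the entire batch to be retried.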


EventBridge vs SNS vs SQS

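          | SQS                      | SNS                      | EventBridge
----------|--------------------------|--------------------------|-----------------------------------------
Pattern   | Queue (point-to-point)   | Pub/sub fan-out          | Event bus (content routing)
Consumers | One consumer per message | All subscribers          | Rules-based routing
Filtering | At consumer level        | Message attribute filter | Rich content-based rules
Replay    | No (DLQ only)            | No                       | Archive + replay (configurable retention)
Sources   | Your code                | Your code                | 200+ AWS services + SaaS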

EventBridge is the modern choice when you need routing based on event content or AWS service events.


Common Failure Cases

Messages landing in DLQ immediately: maxReceiveCount too low

  • Why: a transient processing error (e.g., a DB timeout) increments the message's receive count, and if maxReceiveCount is set to 1 or 2, the message hits the DLQ before any meaningful retry.
  • Detect: DLQ depth grows soon after messages are sent, while the original queue depth stays low.
  • Fix: set maxReceiveCount to at least 5 to allow for transient failures, and pair it with an appropriate VisibilityTimeout so retries have time to succeed.

FIFO queue throughput bottleneck: all messages use the same MessageGroupId

  • Why: FIFO queues deliver the messages of one MessageGroupId in order and hold back the rest of the group while earlier messages are in flight; a single group ID effectively serialises all processing to one consumer.
  • Detect: queue depth grows despite Lambda concurrency being available; CloudWatch NumberOfMessagesSent consistently exceeds NumberOfMessagesDeleted.
  • Fix: partition messages into multiple MessageGroupId values (e.g., by customer ID or order shard) to allow parallel processing across groups.
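
One way to partition is to derive the group ID from a business key. The sketch below is an illustration only: message_group_id and send_fifo are hypothetical helpers, the customer-ID key is an assumption, and the client is assumed to be boto3.client("sqs").

```python
import hashlib


def message_group_id(customer_id, shards=None):
    """Derive a MessageGroupId so that ordering holds per customer.

    With shards=None each customer is its own group. With an integer,
    customers are hashed into that many groups, which caps the group count
    at the cost of also ordering unrelated customers against each other.
    """
    if shards is None:
        return f"customer-{customer_id}"
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    return f"shard-{int(digest, 16) % shards}"


def send_fifo(sqs_client, queue_url, body, customer_id, dedup_id):
    """Send one message to a FIFO queue, grouped by customer."""
    sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageGroupId=message_group_id(customer_id),
        MessageDeduplicationId=dedup_id,
    )
```

The hash is deterministic, so a given customer always lands in the same group and per-customer ordering is preserved.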

SNS delivery to SQS silently drops messages: missing access policy

  • Why: when SNS delivers to an SQS queue, the queue's access policy must explicitly allow sqs:SendMessage from the SNS topic ARN; without it, delivery fails and nothing reaches the queue.
  • Detect: the topic's NumberOfNotificationsDelivered stays at zero (or NumberOfNotificationsFailed rises) while the queue stays empty; no error surfaces to the publisher, and SNS delivery failures are not alarmed by default.
  • Fix: add an SQS access policy statement allowing sqs:SendMessage with ArnEquals: {"aws:SourceArn": "<topic-arn>"} as the condition; alarm on the SNS NumberOfNotificationsFailed metric.
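
A queue policy of that shape can be generated as follows. This is a sketch: sns_to_sqs_policy is a hypothetical helper, and the ARNs are supplied by the caller.

```python
import json


def sns_to_sqs_policy(queue_arn, topic_arn):
    """Return a queue policy (JSON string) letting one topic send to one queue."""
    statement = {
        "Effect": "Allow",
        "Principal": {"Service": "sns.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
        # Restrict the grant to this topic only.
        "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]})


# Usage, assuming sqs = boto3.client("sqs"):
# sqs.set_queue_attributes(
#     QueueUrl=queue_url,
#     Attributes={"Policy": sns_to_sqs_policy(queue_arn, topic_arn)},
# )
```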

Message processing causes a retry loop: poison pill message

  • Why: a structurally invalid message (malformed JSON, unexpected schema) always fails processing, consumes a visibility timeout slot each time, and re-enters the queue until it reaches maxReceiveCount (or, without a DLQ, until the retention period expires).
  • Detect: the DLQ consistently receives messages with the same kind of defect; the original queue's ApproximateNumberOfMessagesNotVisible is elevated.
  • Fix: wrap the message parser in a try/except that catches schema errors and explicitly deletes the message (or routes it to the DLQ immediately) rather than letting it time out and re-enter.
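
The fix can be sketched as a parser that separates permanent failures from transient ones. Everything here is illustrative: REQUIRED_FIELDS is an assumed schema, PoisonMessage, parse_order and handle are hypothetical names, and the client is assumed to be boto3.client("sqs").

```python
import json

REQUIRED_FIELDS = ("order_id", "total")  # hypothetical schema


class PoisonMessage(Exception):
    """Raised when a message can never succeed, however often it is retried."""


def parse_order(body):
    """Parse and validate a message body, raising PoisonMessage on bad input."""
    try:
        order = json.loads(body)
    except json.JSONDecodeError as exc:
        raise PoisonMessage(f"malformed JSON: {exc}") from exc
    if not isinstance(order, dict):
        raise PoisonMessage("body is not a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in order]
    if missing:
        raise PoisonMessage(f"missing fields: {missing}")
    return order


def handle(sqs_client, queue_url, dlq_url, msg, process):
    """Process one received message, quarantining poison pills right away."""
    receipt = msg["ReceiptHandle"]
    try:
        order = parse_order(msg["Body"])
    except PoisonMessage:
        # Copy to the DLQ explicitly, then delete so it is never retried.
        sqs_client.send_message(QueueUrl=dlq_url, MessageBody=msg["Body"])
        sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
        return "quarantined"
    process(order)  # transient errors propagate, so the message is retried
    sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
    return "processed"
```

Only validation failures are quarantined; exceptions raised by process still follow the normal visibility timeout and maxReceiveCount path.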

Connections

cloud-hub · cloud/aws-core · cloud/aws-lambda-patterns · cloud/cloud-monitoring

Open Questions

  • What monitoring and alerting matter most when this is deployed in production?
  • At what scale or workload does this approach hit its practical limits?