AWS ElastiCache

Managed Redis and Memcached in AWS. ElastiCache handles node provisioning, patching, failure detection, and replication. Use Redis for most workloads — richer data structures, persistence, pub/sub.

Redis vs Memcached on ElastiCache

|                 | Redis                                                     | Memcached                    |
|-----------------|-----------------------------------------------------------|------------------------------|
| Data structures | Strings, Hashes, Lists, Sets, Sorted Sets, Streams        | Strings only                 |
| Persistence     | RDB snapshots + AOF                                       | None                         |
| Replication     | Multi-AZ with auto-failover                               | No replication               |
| Pub/Sub         | Yes                                                       | No                           |
| Clustering      | Redis Cluster (horizontal sharding)                       | Yes                          |
| Use case        | Session store, cache, queues, leaderboards, rate limiting | Simple high-throughput cache |

Default choice: Redis. Memcached only when you need pure simplicity and maximum throughput with no durability.


ElastiCache Redis — Terraform

# infra/elasticache.tf
resource "aws_elasticache_subnet_group" "redis" {
  name       = "myapp-redis"
  subnet_ids = var.private_subnet_ids
}

resource "aws_security_group" "redis" {
  name   = "myapp-redis"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [var.app_security_group_id]
  }
}

# Referenced by log_delivery_configuration below
resource "aws_cloudwatch_log_group" "redis" {
  name              = "/elasticache/myapp-redis"
  retention_in_days = 14
}

resource "aws_elasticache_replication_group" "redis" {
  replication_group_id       = "myapp-redis"
  description                = "MyApp Redis cluster"
  node_type                  = "cache.r7g.large"     # 13 GB RAM
  num_cache_clusters         = 2                     # primary + 1 replica
  port                       = 6379
  subnet_group_name          = aws_elasticache_subnet_group.redis.name
  security_group_ids         = [aws_security_group.redis.id]
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true                  # TLS in transit
  auth_token                 = var.redis_auth_token  # Redis AUTH password
  automatic_failover_enabled = true
  multi_az_enabled           = true
  maintenance_window         = "sun:05:00-sun:06:00"
  snapshot_window            = "03:00-04:00"
  snapshot_retention_limit   = 7

  log_delivery_configuration {
    destination      = aws_cloudwatch_log_group.redis.name
    destination_type = "cloudwatch-logs"
    log_format       = "json"
    log_type         = "slow-log"
  }
}

output "redis_endpoint" {
  value = aws_elasticache_replication_group.redis.primary_endpoint_address
}

Python — redis-py Connection

# app/cache.py
import redis.asyncio as redis
import json
import os
from typing import Any, Optional

_redis_client: Optional[redis.Redis] = None

def get_redis() -> redis.Redis:
    global _redis_client
    if _redis_client is None:
        _redis_client = redis.Redis(
            host=os.environ["REDIS_HOST"],
            port=6379,
            password=os.environ["REDIS_PASSWORD"],
            ssl=True,
            decode_responses=True,
            socket_connect_timeout=2,
            socket_timeout=2,
            retry_on_timeout=True,
            max_connections=20,
        )
    return _redis_client

class Cache:
    def __init__(self):
        self.client = get_redis()

    async def get(self, key: str) -> Optional[Any]:
        value = await self.client.get(key)
        return json.loads(value) if value else None

    async def set(self, key: str, value: Any, ttl: int = 300):
        await self.client.setex(key, ttl, json.dumps(value))

    async def delete(self, key: str):
        await self.client.delete(key)

    async def invalidate_pattern(self, pattern: str):
        """Invalidate all keys matching a glob pattern (use sparingly — O(N))."""
        keys = await self.client.keys(pattern)
        if keys:
            await self.client.delete(*keys)

Caching Patterns

# Cache-aside with ElastiCache
cache = Cache()

async def get_product(product_id: str) -> dict:
    cache_key = f"product:{product_id}"

    # Try cache first
    cached = await cache.get(cache_key)
    if cached:
        return cached

    # Miss — load from DB
    product = await db.fetch_product(product_id)
    if product:
        await cache.set(cache_key, product, ttl=600)  # cache 10 minutes
    return product

# Rate limiting (fixed window)
async def check_rate_limit(user_id: str, limit: int = 100, window: int = 3600) -> bool:
    key = f"rate_limit:{user_id}"
    count = await cache.client.incr(key)
    if count == 1:
        # Set the window only on the first hit; running EXPIRE on every request
        # would keep extending the window and the counter would never reset.
        await cache.client.expire(key, window)
    return count <= limit

# Session store
async def create_session(user_id: str, data: dict) -> str:
    import secrets
    session_id = secrets.token_urlsafe(32)
    await cache.set(f"session:{session_id}", {"user_id": user_id, **data}, ttl=86400)
    return session_id

async def get_session(session_id: str) -> Optional[dict]:
    return await cache.get(f"session:{session_id}")
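Sessions stored this way expire 24 hours after creation regardless of activity. A sliding-expiration variant refreshes the TTL on each authenticated request; a minimal sketch, where touch_session is an illustrative name and client is assumed to be the redis.asyncio.Redis instance from get_redis:

```python
async def touch_session(client, session_id: str, ttl: int = 86400) -> bool:
    """Sliding expiration: push the session TTL forward on each request.

    Returns False when the session key no longer exists (already
    expired or deleted), so callers can force a re-login.
    """
    return bool(await client.expire(f"session:{session_id}", ttl))
```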

Pub/Sub for Real-Time Events

# Publisher
from datetime import datetime, timezone

async def publish_order_event(order_id: str, status: str):
    client = get_redis()
    await client.publish("order_events", json.dumps({
        "order_id": order_id,
        "status": status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }))

# Subscriber
async def subscribe_to_order_events():
    client = get_redis()
    pubsub = client.pubsub()
    await pubsub.subscribe("order_events")

    async for message in pubsub.listen():
        if message["type"] == "message":
            event = json.loads(message["data"])
            await handle_order_event(event)

Monitoring and Alarms

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm: cache hit rate too low (caching is not effective)
cloudwatch.put_metric_alarm(
    AlarmName="ElastiCache-LowCacheHitRate",
    MetricName="CacheHitRate",
    Namespace="AWS/ElastiCache",
    Dimensions=[{"Name": "CacheClusterId", "Value": "myapp-redis-001"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.7,   # CacheHitRate = hits / (hits + misses); alert below 70%
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123:ops-alerts"],
)

Common Failure Cases

Connection refused from application — security group not open
Why: the Redis security group only allows inbound 6379 from specific sources, and the application's security group is not listed.
Detect: redis.exceptions.ConnectionError: Error 111 connecting to <host>:6379. Connection refused. in application logs.
Fix: add an inbound rule to the ElastiCache security group allowing TCP 6379 from the application's security group ID (not its IP).

TLS handshake failure after enabling transit_encryption_enabled
Why: the redis-py client is connecting without ssl=True after encryption was enabled on the cluster, or the wrong port is used (ElastiCache TLS uses 6379, not 6380, by default).
Detect: ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] or ConnectionRefusedError after enabling TLS.
Fix: set ssl=True in the redis-py connection and ensure REDIS_HOST points to the primary endpoint, not an individual node endpoint.

Cache stampede on cold start — all keys expired simultaneously
Why: all cache keys were set with the same TTL (e.g., after a deploy flush or a bulk load), so they expire at the same time and all requests hit the database at once.
Detect: CloudWatch CacheHits drops to zero while DatabaseConnections spikes; p99 latency spikes immediately after the TTL window.
Fix: add random jitter to TTLs (ttl = base_ttl + random.randint(0, 30)) so expirations are spread over time.
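The jitter fix can live in a one-line helper used wherever a TTL is passed to cache.set; a minimal sketch (names are illustrative):

```python
import random

def jittered_ttl(base_ttl: int = 600, jitter: int = 30) -> int:
    """Return a TTL in [base_ttl, base_ttl + jitter] so that keys written
    together do not all expire in the same instant."""
    return base_ttl + random.randint(0, jitter)
```

Usage: await cache.set(cache_key, product, ttl=jittered_ttl()).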

KEYS pattern scan causes latency spikes in production
Why: KEYS * or KEYS pattern:* blocks the Redis event loop for the duration of the scan; a single call on a large keyspace can freeze Redis for hundreds of milliseconds.
Detect: CloudWatch EngineCPUUtilization spikes coincide with latency spikes; SLOWLOG GET shows the KEYS command.
Fix: replace KEYS with SCAN (cursor-based, non-blocking) for all production pattern lookups; the invalidate_pattern helper in the code above is an example of this anti-pattern to fix.
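A SCAN-based replacement for the invalidate_pattern helper might look like the sketch below (the function name is illustrative; client is assumed to be the redis.asyncio.Redis instance from get_redis). redis-py's scan_iter walks the keyspace in cursor-sized batches, so Redis keeps serving other commands between iterations:

```python
async def invalidate_pattern_scan(client, pattern: str, batch: int = 500):
    """Delete all keys matching a glob pattern without blocking Redis.

    Uses SCAN (via redis-py's scan_iter) instead of KEYS; deletes are
    issued in batches so a huge match set does not become one enormous
    DEL command.
    """
    pending = []
    async for key in client.scan_iter(match=pattern, count=batch):
        pending.append(key)
        if len(pending) >= batch:
            await client.delete(*pending)
            pending.clear()
    if pending:
        await client.delete(*pending)
```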

Connections

cloud-hub · cloud/aws-rds-aurora · cs-fundamentals/caching-strategies · cloud/serverless-patterns · cs-fundamentals/distributed-systems · cloud/finops-cost-management

Open Questions

  • What monitoring and alerting matter most when this is deployed in production?
  • At what scale or workload does this approach hit its practical limits?