NoSQL Databases

NoSQL covers four distinct families (document, key-value, wide-column, graph) each with different consistency models and query trade-offs. PostgreSQL + Redis is the most common 2026 production stack; NoSQL wins when data shape is variable, write throughput is extreme, or relationship traversal dominates.

SQL vs NoSQL is not either/or. It is choosing the right tool. Most production systems use both. The vault covers relational databases in cs-fundamentals/sql, cs-fundamentals/database-design, and cs-fundamentals/database-transactions; this page covers the four NoSQL families.


The Four Families

FamilyDatabaseData ModelQuery Model
DocumentMongoDB, Firestore, CouchbaseJSON/BSON documentsRich queries, aggregation pipeline
Key-ValueRedis, DynamoDBKey → value/documentPoint lookups + range scans
Wide-ColumnCassandra, ScyllaDB, HBaseTables with dynamic columnsPartition key + clustering key
GraphNeo4j, Amazon NeptuneNodes + edges with propertiesCypher / Gremlin traversal

MongoDB

Document database. Stores data as flexible BSON documents (binary JSON). No rigid schema. A laptop document can have 40 fields while a t-shirt in the same collection has 8.

When to Use

  • Product catalogs where items have heterogeneous attributes
  • CMS content where articles, videos, and podcasts have different metadata
  • API response caching where upstream payload shape varies by provider
  • Any schema that you expect to evolve frequently

Key Concepts

Collections → Documents → Fields. No ALTER TABLE — add fields to documents without a migration.

Indexing: Create indexes before queries run at scale. Compound indexes cover multi-field queries. Text indexes for full-text search. explain() to verify index usage.

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["ecommerce"]
products = db["products"]

# Create compound index (category first for filter; price for sort)
products.create_index([("category", ASCENDING), ("price", DESCENDING)])

# Query — uses the compound index
results = products.find(
    {"category": "laptop", "price": {"$lt": 2000}},
    {"name": 1, "price": 1, "_id": 0}  # projection
).sort("price", DESCENDING).limit(10)

Aggregation Pipeline

The $match → $group → $project pattern:

pipeline = [
    {"$match": {"category": "laptop", "in_stock": True}},   # filter first — uses index
    {"$group": {
        "_id": "$brand",
        "avg_price": {"$avg": "$price"},
        "count": {"$sum": 1}
    }},
    {"$sort": {"avg_price": -1}},
    {"$limit": 10}
]

results = list(products.aggregate(pipeline))

Rules:

  • $match as early as possible — the query planner hoists it to the top to leverage indexes
  • Each stage can use up to 100MB RAM; add allowDiskUse=True for large aggregations
  • $vectorSearch (MongoDB 8.0+) enables semantic search within the aggregation pipeline

DynamoDB

AWS-managed serverless key-value/document store. Single-digit millisecond reads at any scale. Pricing by RCUs (read capacity units) and WCUs (write capacity units).

When to Use

  • AWS-native applications with known, stable access patterns
  • High-scale session storage, shopping carts, event sourcing
  • Simple key-value workloads where flexibility > query power

Partition Key Design

Each partition delivers max 3,000 RCUs/s and 1,000 WCUs/s. Partition key choice controls which node holds data.

Hot partition problem: If all ORDER items share the same PK value, all reads hit one partition. Avoid keys that concentrate similar entities.

Bad:  PK = "ORDER"          SK = "ORDER#2026-05-02#abc123"   # all orders → one partition
Good: PK = "USER#alice"     SK = "ORDER#2026-05-02#abc123"   # orders spread by user

Single Table Design

Store multiple entity types in one table using composite keys. Identify access patterns before designing keys.

Entity: User      → PK: USER#<id>        SK: PROFILE
Entity: Order     → PK: USER#<id>        SK: ORDER#<date>#<order_id>
Entity: OrderItem → PK: ORDER#<order_id> SK: ITEM#<sku>

Query "all orders for user alice" → PK = USER#alice, SK begins_with ORDER#

Global Secondary Indexes (GSIs)

A GSI creates a sidecar table with data rewritten for a secondary access pattern. Populated asynchronously (eventual consistency). GSIs have their own RCU/WCU provisioning.

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Orders")

# Query primary key
response = table.get_item(Key={"PK": "USER#alice", "SK": "ORDER#2026-05-01#ord123"})

# Query GSI — finds all orders with a given status
response = table.query(
    IndexName="StatusDateIndex",
    KeyConditionExpression="order_status = :s AND order_date > :d",
    ExpressionAttributeValues={":s": "PENDING", ":d": "2026-05-01"}
)

Project only needed attributes onto the GSI to minimise storage cost.


Cassandra / ScyllaDB

Wide-column store. Designed for hundreds of thousands of writes per second across distributed clusters with no single point of failure. Multi-datacenter replication built in.

When to Use

  • IoT telemetry, time-series data (sensor readings, metrics)
  • Event logs where you need high write throughput and simple queries
  • Multi-datacenter deployments needing tunable consistency
  • Write-heavy workloads that would create a bottleneck on a single primary

When NOT to Use

  • Complex ad-hoc queries (no joins, no aggregations without materialised views)
  • ACID transactions across multiple partitions (use Postgres)
  • Small-scale applications (operational complexity not worth it below ~millions of writes/day)

Data Model

Keyspace → Table → Row. Rows identified by a partition key (which node) and optional clustering key (order within partition).

-- Cassandra Query Language (CQL)
CREATE TABLE sensor_readings (
    device_id   UUID,
    recorded_at TIMESTAMP,
    temperature FLOAT,
    humidity    FLOAT,
    PRIMARY KEY (device_id, recorded_at)   -- device_id = partition, recorded_at = clustering
) WITH CLUSTERING ORDER BY (recorded_at DESC);

-- Query: last 100 readings for device X (single-partition — fast)
SELECT * FROM sensor_readings WHERE device_id = ? LIMIT 100;

Wide-column: each row can have different column values. Efficient for sparse data.

ScyllaDB is a drop-in C++ reimplementation of Cassandra — 10x faster throughput per node. Preferred for new deployments where Cassandra compatibility is needed.


Redis Data Structures

Redis is covered as a caching tool in infra/caching. This section covers its data structure capabilities beyond caching.

Redis stores data entirely in memory with optional persistence (RDB snapshots, AOF log). Sub-millisecond latency for all operations.

Data Structures

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# HASH — field:value pairs; good for user sessions, object caching
r.hset("user:alice", mapping={"name": "Alice", "plan": "pro", "credits": "100"})
r.hincrby("user:alice", "credits", -10)  # atomic decrement

# SORTED SET — scored members with range queries
r.zadd("leaderboard", {"alice": 9850, "bob": 7200, "carol": 11000})
top3 = r.zrevrange("leaderboard", 0, 2, withscores=True)

# SET — unique members; deduplication, tags
r.sadd("active_sessions", "sess:abc", "sess:def")
r.sismember("active_sessions", "sess:abc")  # True

# LIST — ordered; queues, activity feeds
r.lpush("job_queue", "job:123")   # enqueue
r.brpop("job_queue", timeout=30)  # blocking dequeue

# STREAM — append-only event log
r.xadd("events", {"type": "order.placed", "user": "alice", "order": "ord123"})
entries = r.xrange("events", "-", "+", count=100)

# GEOSPATIAL — proximity queries
r.geoadd("stores", (longitude, latitude, "store:london-bridge"))
nearby = r.geosearch("stores", longitude=0.0, latitude=51.5, radius=5, unit="km")

Redis as a Rate Limiter

def is_rate_limited(user_id: str, limit: int, window_seconds: int) -> bool:
    key = f"rate:{user_id}"
    current = r.incr(key)
    if current == 1:
        r.expire(key, window_seconds)
    return current > limit

Neo4j (Graph Databases)

Property graph model: nodes (entities), edges (relationships), and properties on both. Query language: Cypher.

When to Use

  • Social graphs: friends, followers, recommendations (friends-of-friends queries)
  • Fraud detection: money flow graphs, circular transaction detection
  • Knowledge graphs for AI RAG pipelines
  • Identity/access management with complex permission hierarchies
  • Any domain where "how things relate to each other" is the core query

Cypher Examples

-- Friends of friends (2 hops) that Alice is not already connected to
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH*2]-(fof)
WHERE NOT (alice)-[:FRIENDS_WITH]-(fof) AND alice <> fof
RETURN DISTINCT fof.name, fof.city

-- Shortest path between two users
MATCH path = shortestPath((a:Person {name: 'Alice'})-[:KNOWS*]-(b:Person {name: 'Carol'}))
RETURN path

-- Find all paths in a fraud ring (circular transactions)
MATCH p = (a:Account)-[:TRANSFER*1..5]->(a)
WHERE ALL(r IN relationships(p) WHERE r.amount > 10000)
RETURN p LIMIT 10

Python Driver

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FRIENDS_WITH]->(friend) RETURN friend.name",
        name="Alice"
    )
    friends = [record["friend.name"] for record in result]

Decision Guide

ScenarioBest ChoiceWhy
Variable-schema documentsMongoDBSchema flexibility, rich queries
AWS serverless, stable access patternsDynamoDBManaged, cheap at scale, single-digit ms
Write-heavy time-series, multi-datacenterCassandra/ScyllaDBLinear write scaling, no SPOF
Sub-ms caching, sessions, real-timeRedisIn-memory, rich data structures
Relationship traversalNeo4jCypher makes graph queries natural
Existing Postgres, need vectorspgvectorNo extra service, SQL + vectors
Existing Postgres, need searchPostgreSQL full-text / Elasticsearch

Default recommendation: Start with PostgreSQL. Add Redis for caching and sessions. Reach for MongoDB when schema genuinely varies. Use DynamoDB only if you are already committed to AWS and have stable access patterns you can model up front.


Key Facts

  • MongoDB 8.0 adds $vectorSearch in aggregation pipelines for AI applications
  • DynamoDB: 3,000 RCUs/s and 1,000 WCUs/s per partition — hot partition is the #1 design failure
  • Cassandra: designed for 100k+ writes/second; no joins, no ad-hoc aggregation
  • Redis: sub-1ms latency; Sorted Sets make leaderboards and rate limiting trivial
  • Neo4j: Cypher's *2 hop notation makes friend-of-friend queries a single line vs complex SQL self-joins
  • PostgreSQL + Redis is the most common production stack in 2026

Common Failure Cases

DynamoDB hot partition throttling everything Why: when most requests target the same partition key value (e.g., PK="ORDER" for all orders), a single partition absorbs all traffic and is throttled at 3,000 RCUs/s, even if the table's total provisioned capacity is much higher. Detect: ConsumedReadCapacityUnits at the table level looks fine, but ThrottledRequests is non-zero; inspecting per-partition metrics shows one partition at 100% utilisation. Fix: redesign the key to distribute load — use USER#<id> as the partition key so reads spread across users, or add a random suffix and fan out reads.

MongoDB collection scan because an index is missing Why: without an index on the queried field, MongoDB performs a full collection scan (COLLSCAN); at millions of documents, this causes seconds-long queries and high CPU. Detect: explain("executionStats") shows stage: COLLSCAN and nDocsExamined >> nReturned. Fix: create a compound index on the fields used in find() filter and sort; put equality conditions first, range conditions last in compound indexes.

Redis data lost on restart because persistence is off Why: the default Redis configuration keeps all data in memory only; a restart or crash loses everything without RDB snapshots or AOF logging enabled. Detect: after a Redis restart, all cached data and sessions are gone; application errors spike as DB is cold. Fix: enable AOF persistence (appendonly yes) for durability or RDB snapshots for a recovery point; for sessions/caching, accept loss but ensure the application gracefully handles cache misses on cold start.

Cassandra query without partition key causes full cluster scan Why: Cassandra cannot fulfill queries that don't include the partition key; it must scatter the query to every node, which is catastrophically slow and can destabilise the cluster. Detect: query latency is seconds instead of milliseconds; nodetool tpstats shows high pending reads on all nodes. Fix: deny any query to production that omits the partition key; use ALLOW FILTERING only in development and never in a hot path; redesign the table if the access pattern requires it.

Connections

Open Questions

  • When does MongoDB Atlas Vector Search compete meaningfully with dedicated vector databases like Qdrant for RAG workloads?
  • Is ScyllaDB now the default recommendation over Apache Cassandra for new deployments?
  • How does DynamoDB's PartiQL support change the "no ad-hoc queries" constraint for teams familiar with SQL?