Vector Stores

Vector stores are the storage layer of RAG systems — pgvector for existing Postgres stacks, Chroma for local dev, Qdrant for production self-hosted, Pinecone for zero-ops managed, Weaviate for built-in hybrid search.

Databases optimised for storing and searching high-dimensional embedding vectors. The storage layer of any RAG system.


How Vector Search Works

Each document chunk is embedded into a vector (e.g. 1,536 dimensions for OpenAI's text-embedding-3-small). At query time:

  1. Embed the query using the same model
  2. Find the k most similar vectors in the store using approximate nearest neighbour (ANN) search
  3. Return the corresponding documents
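The three steps above can be sketched in pure Python with an exact linear scan — a stand-in for what a vector store does internally; production stores replace the O(n) scan with an ANN index, and the hard-coded 2-d vectors stand in for real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, store, k=2):
    # store: list of (document, vector) pairs. Brute-force scan is
    # O(n) per query; ANN indexes like HNSW bring this to ~O(log n).
    scored = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

store = [
    ("refund policy", [0.9, 0.1]),
    ("shipping times", [0.1, 0.9]),
    ("returns process", [0.8, 0.3]),
]
print(top_k([1.0, 0.2], store, k=2))  # -> ['refund policy', 'returns process']
```

The same shape holds at scale: embed once at ingest, embed the query with the same model, rank by similarity.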

Similarity metrics:

  • Cosine similarity — measures angle between vectors; standard for text embeddings
  • Dot product — equivalent to cosine if vectors are normalised; faster
  • Euclidean distance — measures absolute distance; less common for text
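The cosine/dot-product equivalence is easy to verify: scale vectors to unit length once at ingest, and the dot product at query time is the cosine similarity with the normalisation work already paid for (a pure-Python sketch):

```python
import math

def normalise(v):
    # Scale to unit length so |v| = 1
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
# On raw vectors the two metrics disagree...
print(cosine(a, b), dot(a, b))
# ...but on unit vectors, dot product == cosine similarity
na, nb = normalise(a), normalise(b)
print(abs(dot(na, nb) - cosine(a, b)) < 1e-9)  # True
```

This is why many stores let you pick the dot-product metric for normalised embeddings: same ranking, cheaper per-query arithmetic.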

ANN algorithms: HNSW (Hierarchical Navigable Small World) is the current standard — O(log n) search with high recall. IVF (Inverted File index) partitions vectors into clusters and scans only the nearest clusters at query time; it is common in FAISS and pairs well with GPU acceleration.


Options

pgvector (Postgres-native)

Best for: Existing Postgres stack, transactional workloads alongside vector search.

Extension to PostgreSQL. Adds a vector type and <-> (L2), <#> (negative inner product), <=> (cosine distance) operators. HNSW and IVFFlat index support.

CREATE EXTENSION vector;
CREATE TABLE documents (id bigserial PRIMARY KEY, content text, embedding vector(1536));
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Insert
INSERT INTO documents (content, embedding) VALUES ('...', '[0.1, 0.2, ...]'::vector);

-- Search
SELECT content FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;

Managed options: Supabase (pgvector built-in), Neon, AWS RDS. Start here if you already have Postgres. No new infrastructure needed.

Limitations: Not optimised for billion-vector scale; query latency increases significantly above ~10M vectors.


Chroma

Best for: Local development, prototyping, Python-native projects.

In-memory or local persistent store. Zero config. Ships as a Python package.

import chromadb

client = chromadb.Client()  # in-memory, or chromadb.PersistentClient("./chroma_db")
collection = client.create_collection("documents")

collection.add(
    documents=["First document", "Second document"],
    ids=["doc1", "doc2"]
)

results = collection.query(query_texts=["query text"], n_results=2)

Does not require a separate server for development. Has a server mode for production (but use Qdrant or Weaviate there instead).


Qdrant

Best for: Production self-hosted, high performance, rich filtering.

Open-source, written in Rust, extremely fast. Full-featured: HNSW, scalar/binary quantisation, payload filtering (filter by metadata alongside vector similarity), sparse vectors (for hybrid search without a separate BM25 index).

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient("localhost", port=6333)
client.create_collection("docs", vectors_config=VectorParams(size=1536, distance=Distance.COSINE))

client.upsert("docs", points=[
    PointStruct(id=1, vector=[0.1, ...], payload={"source": "manual.pdf", "page": 3})
])

results = client.search("docs", query_vector=[0.1, ...], limit=5,
                         query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="manual.pdf"))]))

Managed: Qdrant Cloud. Self-hosted: Docker single node or Kubernetes cluster.


Weaviate

Best for: Hybrid search (BM25 + vector) in one database, GraphQL API.

Built-in BM25 + dense hybrid search without needing a separate keyword search layer. Module system for automatic vectorisation (call embedding model on ingest).

Good choice when you want hybrid retrieval without managing two separate systems (Elasticsearch + vector DB). See infra/weaviate for full setup and query examples.


Pinecone

Best for: Fully managed, zero ops, scale on demand.

Serverless mode: no cluster management; usage-based pricing (reads, writes, and storage rather than provisioned capacity). Fast and reliable. The most popular managed option.

Limitations: Proprietary, no self-hosting, egress costs. Vendor lock-in risk.


Redis

Best for: Low-latency cache + vector search in one; session memory for agents.

Redis Stack includes RediSearch with vector similarity search. Good for agent working memory (< 1M vectors, fast retrieval, collocated with session data).


Choosing

Situation                        Recommendation
Existing Postgres stack          pgvector
Local dev / prototyping          Chroma
Production, self-hosted          Qdrant
Production, no ops               Pinecone
Hybrid search, no extra infra    Weaviate
Agent session memory             Redis
Billion-vector scale             Pinecone or Weaviate Cloud

Hybrid Search Architecture

Production RAG typically runs BM25 (keyword) and dense (vector) in parallel and merges with reciprocal rank fusion (RRF):

BM25 results (ranked list)    ──┐
                                ├─ RRF merge → top-k → reranker → answer
Dense vector results (ranked) ──┘
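The RRF merge step is only a few lines: each ranked list contributes 1/(k + rank) per document, with k = 60 as the usual damping constant, so documents that appear high in both lists rise to the top (a sketch):

```python
def rrf_merge(ranked_lists, k=60):
    # ranked_lists: ranked lists of doc ids, best first.
    # Each list contributes 1 / (k + rank) for every doc it contains;
    # k = 60 is the conventional damping constant.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d7"]        # keyword results, best first
dense = ["d1", "d4", "d3"]       # vector results, best first
print(rrf_merge([bm25, dense]))  # -> ['d1', 'd3', 'd4', 'd7']
```

Because RRF uses only ranks, not raw scores, it needs no score calibration between the BM25 and dense retrievers — which is why it is the default merge strategy.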

Options:

  • Weaviate or Qdrant — native hybrid, no extra components
  • pgvector + pg_trgm — Postgres-native, less sophisticated
  • Elasticsearch/OpenSearch + pgvector — separate systems, complex ops

For most production RAG: Qdrant or Weaviate with their native hybrid search. Then add a reranker (Cohere Rerank) on top. See rag/pipeline.


Key Facts

  • HNSW (Hierarchical Navigable Small World): O(log n) ANN search — current standard algorithm
  • Cosine similarity is standard for text embeddings; dot product is equivalent if vectors are normalised
  • pgvector limitations: query latency increases significantly above ~10M vectors
  • Qdrant: written in Rust; includes sparse vectors for hybrid search without a separate BM25 layer
  • Production RAG hybrid search: BM25 + dense in parallel, merged with reciprocal rank fusion (RRF)
  • Weaviate: built-in BM25 + dense hybrid without needing separate Elasticsearch layer
  • Pinecone serverless: no cluster management, pay-per-query; vendor lock-in risk

Common Failure Cases

pgvector query latency degrades above ~1M rows despite HNSW index
Why: the HNSW index was built incrementally as rows were inserted; the graph structure fragments over many small inserts.
Detect: EXPLAIN ANALYZE shows HNSW scan time climbing; rebuilding the index from scratch on the same data is 5x faster.
Fix: drop the index, bulk-load all data, then CREATE INDEX CONCURRENTLY on the populated table with high maintenance_work_mem.
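That rebuild looks like this in pgvector's SQL (index name and memory setting are illustrative — tune maintenance_work_mem to the machine):

```sql
-- Drop the fragmented index; finish bulk-loading rows first
DROP INDEX CONCURRENTLY IF EXISTS documents_embedding_idx;

-- Give the build enough memory to construct the graph in one pass
SET maintenance_work_mem = '4GB';

-- Rebuild over the fully populated table without blocking writes
CREATE INDEX CONCURRENTLY documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
```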

Cosine similarity returns wrong neighbours after embedding model change
Why: new documents were embedded with a different model than the existing index; vectors from different models are not comparable.
Detect: similarity scores for known-similar pairs drop near zero; retrieval quality degrades without any code or data change.
Fix: re-embed the entire collection with the new model before deploying; never mix embeddings from different models in the same collection.
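One defence is to record the embedding model and its dimension as collection metadata at creation time and reject anything else; `ModelGuard` below is a hypothetical helper sketching the pattern (the model names are examples):

```python
class ModelGuard:
    """Record which embedding model a collection was built with,
    and reject vectors produced by any other model."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim

    def check(self, model_name, vector):
        # Reject cross-model vectors: embeddings from different
        # models live in incompatible spaces even at the same dim.
        if model_name != self.model_name:
            raise ValueError(
                f"collection embedded with {self.model_name}, "
                f"got vectors from {model_name}: re-embed, don't mix")
        if len(vector) != self.dim:
            raise ValueError(f"expected dim {self.dim}, got {len(vector)}")
        return True

guard = ModelGuard("text-embedding-3-small", 1536)
print(guard.check("text-embedding-3-small", [0.0] * 1536))  # True
# guard.check("bge-m3", [0.0] * 1024) raises ValueError
```

Running the same check at both ingest and query time turns a silent retrieval-quality regression into a loud error.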

Chroma in-memory client loses all data on process restart
Why: chromadb.Client() was used instead of PersistentClient; data exists only in RAM.
Detect: collection is empty after server restart; queries return no results on a collection that appeared populated.
Fix: use chromadb.PersistentClient("./chroma_db") in any environment where data must survive process restarts.

Qdrant filtered search is 10x slower than unfiltered
Why: the payload field being filtered is not indexed; Qdrant rescores every matching payload point instead of using the HNSW graph.
Detect: filtered query latency is much higher than unfiltered for the same k; Qdrant logs show a full payload scan.
Fix: call create_payload_index on frequently filtered fields; Qdrant then uses a combined vector+payload index.

Pinecone upsert silently succeeds but queries return no results
Why: vectors were upserted to the wrong namespace, or the index dimension doesn't match the embedding model used at query time.
Detect: upsert response shows success but query returns 0 matches; check namespace parameter and vector dimension in both calls.
Fix: always specify namespace explicitly; verify index dimension matches embedding model output dimension at creation time.

Connections

  • rag/pipeline — how vector stores fit into the full RAG pipeline end-to-end
  • rag/embeddings — which embedding model to use when populating the store
  • rag/hybrid-retrieval — BM25 + dense hybrid search implementation patterns
  • rag/reranking — reranking results from the vector store before passing to the LLM
  • infra/caching — Redis also used for semantic cache backed by vector similarity
  • infra/huggingface — BGE-M3 and other embedding models for populating the store

Open Questions

  • At what vector count does pgvector performance degrade enough to warrant migrating to Qdrant?
  • How does Qdrant's native sparse vector support compare to Weaviate's BM25 for pure keyword retrieval?
  • What is the replication and backup story for self-hosted Qdrant in production?