Build a RAG pipeline from scratch
Build a complete retrieval-augmented generation pipeline without a framework. You will chunk a PDF document, embed each chunk with a local model, store the vectors in Chroma, and then answer user questions by retrieving the most relevant chunks and passing them to Claude with the query.
Why this matters
RAG is how most production AI systems get factual, up-to-date answers without hallucinating. Understanding the pipeline end-to-end (chunk-size trade-offs, embedding choice, retrieval strategy) makes you dangerous in any AI engineering role. Frameworks hide this; building it yourself exposes exactly where things go wrong.
Before you start
- Python basics; you should be comfortable writing functions and handling files
- Basic understanding of what an LLM is and how prompting works
- Anthropic API key or access to a local model via Ollama
- pip-installable environment (uv or venv); the steps below use pymupdf, sentence-transformers, chromadb, and anthropic
Step-by-step guide
Step 1: Pick a PDF and chunk it
Choose any long PDF (a research paper works well). Use PyMuPDF or pdfplumber to extract raw text, then split it into overlapping chunks of roughly 512 words. Production systems often use a token-aware recursive character splitter; the simple word-window splitter below is a stand-in that is enough to see the trade-offs. Print the first three chunks so you can see what the model will actually receive.
import fitz  # pip install pymupdf

def load_pdf(path: str) -> str:
    doc = fitz.open(path)
    return "\n".join(page.get_text() for page in doc)

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Slide a fixed-size window over the word list, stepping by
    # chunk_size - overlap so consecutive chunks share `overlap` words.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i : i + chunk_size])
        chunks.append(chunk)
    return chunks

text = load_pdf("paper.pdf")
chunks = chunk_text(text)
print(f"{len(chunks)} chunks created")
for i, c in enumerate(chunks[:3]):
    print(f"--- chunk {i} ---\n{c[:200]}\n")
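As a quick sanity check on the overlap logic, you can confirm that the tail of one chunk reappears at the head of the next. This is a minimal sketch assuming the defaults above (chunk_size=512, overlap=64) and a document long enough for at least two full chunks:

# The last 64 words of chunk 0 should equal the first 64 words of chunk 1
tail = chunks[0].split()[-64:]
head = chunks[1].split()[:64]
assert tail == head, "overlap mismatch; check the step size in chunk_text"
print("overlap verified: adjacent chunks share a 64-word window")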
Step 2: Embed each chunk locally
Install sentence-transformers and load the all-MiniLM-L6-v2 model. Run each chunk through the model to produce a 384-dimensional embedding vector. This step happens entirely on your machine: no API call, no cost.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Batch embed all chunks (much faster than one-by-one)
embeddings = model.encode(chunks, show_progress_bar=True)
print(f"Embedding shape: {embeddings.shape}")  # (n_chunks, 384)
print(f"First vector sample: {embeddings[0][:5].tolist()}")
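Before indexing anything, it is worth checking that the embeddings behave sensibly: chunks that share text should score as more similar than chunks from distant parts of the document. A minimal sketch using the util helper that ships with sentence-transformers; exact scores vary by document, but adjacent overlapping chunks should usually win:

from sentence_transformers import util

# Adjacent chunks share a 64-word overlap, so they should usually be
# more similar than the first and last chunks of the document.
sim_adjacent = util.cos_sim(embeddings[0], embeddings[1]).item()
sim_distant = util.cos_sim(embeddings[0], embeddings[-1]).item()
print(f"adjacent: {sim_adjacent:.3f}, distant: {sim_distant:.3f}")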
Step 3: Store vectors in Chroma
Start a persistent Chroma client, create a collection configured for cosine distance (Chroma defaults to L2, which would make the distances in the next step harder to interpret), and add each chunk along with its embedding and the chunk index as metadata. Verify the collection size matches your chunk count before moving on.
import chromadb  # pip install chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "rag_demo",
    metadata={"hnsw:space": "cosine"},  # default is L2; use cosine to match step 4
)
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    embeddings=embeddings.tolist(),
    documents=chunks,
    metadatas=[{"chunk_index": i} for i in range(len(chunks))],
)
print(f"Collection size: {collection.count()}")  # should match len(chunks)
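A quick round-trip read confirms the documents landed where you expect. This sketch assumes the string ids assigned above:

# Fetch the first stored record back by id and compare it to the source chunk
record = collection.get(ids=["0"], include=["documents", "metadatas"])
assert record["documents"][0] == chunks[0]
print(record["metadatas"][0])  # {'chunk_index': 0}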
Step 4: Retrieve on query
Embed the user's question with the same model, then query Chroma for the top-3 most similar chunks by cosine distance. Print the retrieved chunks; this is your context window budget before you hand anything to Claude.
def retrieve(query: str, top_k: int = 3) -> list[str]:
    query_embedding = model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    docs = results["documents"][0]
    for i, (doc, dist) in enumerate(zip(docs, results["distances"][0])):
        print(f"[{i}] distance={dist:.3f}\n{doc[:200]}\n")
    return docs

query = "What is the main contribution of this paper?"
context_chunks = retrieve(query)
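Top-k alone will happily return weak matches for off-topic questions. One common refinement is to drop anything above a distance cutoff. The retrieve_filtered helper below is a hypothetical sketch, and the 0.6 threshold is an arbitrary placeholder you should tune against your own data:

def retrieve_filtered(query: str, top_k: int = 3, max_distance: float = 0.6) -> list[str]:
    results = collection.query(
        query_embeddings=model.encode([query]).tolist(),
        n_results=top_k,
        include=["documents", "distances"],
    )
    # Keep only chunks whose cosine distance falls below the cutoff
    return [
        doc
        for doc, dist in zip(results["documents"][0], results["distances"][0])
        if dist <= max_distance
    ]

print(len(retrieve_filtered("What is the capital of France?")))  # likely 0 for an off-topic query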
Step 5: Answer with Claude
Build a system prompt that instructs Claude to answer only from the provided context. Concatenate the retrieved chunks into a user message alongside the question, call the Anthropic Messages API, and print the response. Then ask a question your PDF does not answer and observe what happens.
import anthropic  # pip install anthropic

# Use a distinct name so we don't clobber the Chroma client from step 3
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def answer(query: str, context: list[str]) -> str:
    context_text = "\n\n".join(
        f"[Chunk {i}]\n{c}" for i, c in enumerate(context)
    )
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "Answer the user's question using ONLY the provided context chunks. "
            "If the answer is not in the context, say so explicitly."
        ),
        messages=[{
            "role": "user",
            "content": f"<context>\n{context_text}\n</context>\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text

print(answer(query, context_chunks))
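To run the failure-mode probe this step describes, retrieve and answer for a question your PDF cannot address (the question below is an arbitrary example). With the system prompt above, Claude should decline rather than guess:

off_topic = "What year did the French Revolution begin?"
print(answer(off_topic, retrieve(off_topic)))
# Expected: an explicit statement that the context does not contain the answer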
Step 6: Add source citations
Update the prompt to ask Claude to cite which chunk index it drew each claim from. Verify the citations match the retrieved content. This is the minimum viable attribution that makes RAG outputs auditable.
def answer_with_citations(query: str, context: list[str]) -> str:
    context_text = "\n\n".join(
        f"[Chunk {i}]\n{c}" for i, c in enumerate(context)
    )
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "Answer using ONLY the provided context chunks. "
            "After each factual claim, add a citation in the format [Chunk N]. "
            "If information is not in the context, write 'Not found in context.'"
        ),
        messages=[{
            "role": "user",
            "content": f"<context>\n{context_text}\n</context>\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text

print(answer_with_citations(query, context_chunks))
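To verify citations mechanically rather than by eye, extract every [Chunk N] marker and check that each index points at a chunk you actually provided. A minimal sketch using only the standard library:

import re

answer_text = answer_with_citations(query, context_chunks)
cited = {int(n) for n in re.findall(r"\[Chunk (\d+)\]", answer_text)}

# Every cited index must point at a chunk that was actually passed in
out_of_range = sorted(n for n in cited if n >= len(context_chunks))
print(f"cited chunks: {sorted(cited)}")
if out_of_range:
    print(f"WARNING: citations reference chunks that were never provided: {out_of_range}")

Note that this only catches out-of-range citations; confirming that each claim is actually supported by the cited chunk still requires reading the output, or a second model pass.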