Build a RAG pipeline from scratch
Build a complete retrieval-augmented generation pipeline without a framework. You will chunk a PDF document, embed each chunk with a local model, store the vectors in Chroma, and then answer user questions by retrieving the most relevant chunks and passing them to Claude with the query.
Why this matters
RAG is how most production AI systems get factual, up-to-date answers without hallucinating. Understanding the pipeline end-to-end (chunk-size trade-offs, embedding choice, retrieval strategy) makes you dangerous in any AI engineering role. Frameworks hide this; building it yourself exposes exactly where things go wrong.
Before you start
- Python basics; you should be comfortable writing functions and handling files
- Basic understanding of what an LLM is and how prompting works
- Anthropic API key or access to a local model via Ollama
- pip-installable environment (uv or venv); the steps below use pymupdf, sentence-transformers, chromadb, and anthropic
Step-by-step guide
Step 1: Pick a PDF and chunk it
Choose any long PDF (a research paper works well). Use PyMuPDF or pdfplumber to extract raw text, then split it into overlapping chunks of roughly 512 words. Production systems often use a token-aware recursive character splitter; the simple word-window splitter below is a stand-in that is enough to see the trade-offs. Print the first three chunks so you can see what the model will actually receive.
import fitz  # pip install pymupdf

def load_pdf(path: str) -> str:
    doc = fitz.open(path)
    return "\n".join(page.get_text() for page in doc)

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Slide a fixed-size window over the word list, stepping by
    # chunk_size - overlap so consecutive chunks share `overlap` words.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i : i + chunk_size])
        chunks.append(chunk)
    return chunks

text = load_pdf("paper.pdf")
chunks = chunk_text(text)
print(f"{len(chunks)} chunks created")
for i, c in enumerate(chunks[:3]):
    print(f"--- chunk {i} ---\n{c[:200]}\n")
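As a quick sanity check on the overlap logic, you can confirm that the tail of one chunk reappears at the head of the next. This is a minimal sketch assuming the defaults above (chunk_size=512, overlap=64) and a document long enough for at least two full chunks:

# The last 64 words of chunk 0 should equal the first 64 words of chunk 1
tail = chunks[0].split()[-64:]
head = chunks[1].split()[:64]
assert tail == head, "overlap mismatch; check the step size in chunk_text"
print("overlap verified: adjacent chunks share a 64-word window")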
Step 2: Embed each chunk locally
Install sentence-transformers and load the all-MiniLM-L6-v2 model. Run each chunk through the model to produce a 384-dimensional embedding vector. This step happens entirely on your machine: no API call, no cost.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Batch embed all chunks (much faster than one-by-one)
embeddings = model.encode(chunks, show_progress_bar=True)
print(f"Embedding shape: {embeddings.shape}")  # (n_chunks, 384)
print(f"First vector sample: {embeddings[0][:5].tolist()}")
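Before indexing anything, it is worth checking that the embeddings behave sensibly: chunks that share text should score as more similar than chunks from distant parts of the document. A minimal sketch using the util helper that ships with sentence-transformers; exact scores vary by document, but adjacent overlapping chunks should usually win:

from sentence_transformers import util

# Adjacent chunks share a 64-word overlap, so they should usually be
# more similar than the first and last chunks of the document.
sim_adjacent = util.cos_sim(embeddings[0], embeddings[1]).item()
sim_distant = util.cos_sim(embeddings[0], embeddings[-1]).item()
print(f"adjacent: {sim_adjacent:.3f}, distant: {sim_distant:.3f}")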
Step 3: Store vectors in Chroma
Start a persistent Chroma client, create a collection configured for cosine distance (Chroma defaults to L2, which would make the distances in the next step harder to interpret), and add each chunk along with its embedding and the chunk index as metadata. Verify the collection size matches your chunk count before moving on.
import chromadb  # pip install chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "rag_demo",
    metadata={"hnsw:space": "cosine"},  # default is L2; use cosine to match step 4
)
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    embeddings=embeddings.tolist(),
    documents=chunks,
    metadatas=[{"chunk_index": i} for i in range(len(chunks))],
)
print(f"Collection size: {collection.count()}")  # should match len(chunks)
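A quick round-trip read confirms the documents landed where you expect. This sketch assumes the string ids assigned above:

# Fetch the first stored record back by id and compare it to the source chunk
record = collection.get(ids=["0"], include=["documents", "metadatas"])
assert record["documents"][0] == chunks[0]
print(record["metadatas"][0])  # {'chunk_index': 0}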
Step 4: Retrieve on query
Embed the user's question with the same model, then query Chroma for the top-3 most similar chunks by cosine distance. Print the retrieved chunks; this is your context window budget before you hand anything to Claude.
def retrieve(query: str, top_k: int = 3) -> list[str]:
    query_embedding = model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    docs = results["documents"][0]
    for i, (doc, dist) in enumerate(zip(docs, results["distances"][0])):
        print(f"[{i}] distance={dist:.3f}\n{doc[:200]}\n")
    return docs

query = "What is the main contribution of this paper?"
context_chunks = retrieve(query)
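Top-k alone will happily return weak matches for off-topic questions. One common refinement is to drop anything above a distance cutoff. The retrieve_filtered helper below is a hypothetical sketch, and the 0.6 threshold is an arbitrary placeholder you should tune against your own data:

def retrieve_filtered(query: str, top_k: int = 3, max_distance: float = 0.6) -> list[str]:
    results = collection.query(
        query_embeddings=model.encode([query]).tolist(),
        n_results=top_k,
        include=["documents", "distances"],
    )
    # Keep only chunks whose cosine distance falls below the cutoff
    return [
        doc
        for doc, dist in zip(results["documents"][0], results["distances"][0])
        if dist <= max_distance
    ]

print(len(retrieve_filtered("What is the capital of France?")))  # likely 0 for an off-topic query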
Step 5: Answer with Claude
Build a system prompt that instructs Claude to answer only from the provided context. Concatenate the retrieved chunks into a user message alongside the question, call the Anthropic Messages API, and print the response. Then ask a question your PDF does not answer and observe what happens.
import anthropic  # pip install anthropic

# Use a distinct name so we don't clobber the Chroma client from step 3
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def answer(query: str, context: list[str]) -> str:
    context_text = "\n\n".join(
        f"[Chunk {i}]\n{c}" for i, c in enumerate(context)
    )
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "Answer the user's question using ONLY the provided context chunks. "
            "If the answer is not in the context, say so explicitly."
        ),
        messages=[{
            "role": "user",
            "content": f"<context>\n{context_text}\n</context>\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text

print(answer(query, context_chunks))
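To run the failure-mode probe this step describes, retrieve and answer for a question your PDF cannot address (the question below is an arbitrary example). With the system prompt above, Claude should decline rather than guess:

off_topic = "What year did the French Revolution begin?"
print(answer(off_topic, retrieve(off_topic)))
# Expected: an explicit statement that the context does not contain the answer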
Step 6: Add source citations
Update the prompt to ask Claude to cite which chunk index it drew each claim from. Verify the citations match the retrieved content. This is the minimum viable attribution that makes RAG outputs auditable.
def answer_with_citations(query: str, context: list[str]) -> str:
    context_text = "\n\n".join(
        f"[Chunk {i}]\n{c}" for i, c in enumerate(context)
    )
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=(
            "Answer using ONLY the provided context chunks. "
            "After each factual claim, add a citation in the format [Chunk N]. "
            "If information is not in the context, write 'Not found in context.'"
        ),
        messages=[{
            "role": "user",
            "content": f"<context>\n{context_text}\n</context>\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text

print(answer_with_citations(query, context_chunks))
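To verify citations mechanically rather than by eye, extract every [Chunk N] marker and check that each index points at a chunk you actually provided. A minimal sketch using only the standard library:

import re

answer_text = answer_with_citations(query, context_chunks)
cited = {int(n) for n in re.findall(r"\[Chunk (\d+)\]", answer_text)}

# Every cited index must point at a chunk that was actually passed in
out_of_range = sorted(n for n in cited if n >= len(context_chunks))
print(f"cited chunks: {sorted(cited)}")
if out_of_range:
    print(f"WARNING: citations reference chunks that were never provided: {out_of_range}")

Note that this only catches out-of-range citations; confirming that each claim is actually supported by the cited chunk still requires reading the output, or a second model pass.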