Beginner · AI Engineer

Build a RAG pipeline from scratch

Build a complete retrieval-augmented generation pipeline without a framework. You will chunk a PDF document, embed each chunk with a local model, store the vectors in Chroma, and then answer user questions by retrieving the most relevant chunks and passing them to Claude with the query.

Why this matters

RAG is how most production AI systems get factual, up-to-date answers without hallucinating. Understanding the pipeline end to end (chunk-size trade-offs, embedding choice, retrieval strategy) makes you dangerous in any AI engineering role. Frameworks hide this; building it yourself exposes exactly where things go wrong.

Before you start

You need Python 3.9 or newer, a long PDF to work with, and the packages used below: pymupdf, sentence-transformers, chromadb, and anthropic. The last two steps call the Anthropic Messages API, so set ANTHROPIC_API_KEY in your environment.

Step-by-step guide

  1. Pick a PDF and chunk it

    Choose any long PDF (a research paper works well). Use PyMuPDF or pdfplumber to extract the raw text, then split it into overlapping chunks of roughly 512 words with a 64-word overlap; a plain whitespace splitter is enough to start with, and you can swap in a recursive character splitter later. Print the first three chunks so you can see what the model will actually receive.

    import fitz  # pip install pymupdf
    
    def load_pdf(path: str) -> str:
        doc = fitz.open(path)
        return "\n".join(page.get_text() for page in doc)
    
    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunk = " ".join(words[i : i + chunk_size])
            chunks.append(chunk)
        return chunks
    
    text = load_pdf("paper.pdf")
    chunks = chunk_text(text)
    print(f"{len(chunks)} chunks created")
    for i, c in enumerate(chunks[:3]):
        print(f"--- chunk {i} ---\n{c[:200]}\n")
  2. Embed each chunk locally

    Install sentence-transformers and load the all-MiniLM-L6-v2 model. Run each chunk through the model to produce a 384-dimensional embedding vector. This step happens entirely on your machine; no API call, no cost.

    from sentence_transformers import SentenceTransformer
    # pip install sentence-transformers
    
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    # Batch embed all chunks (much faster than one-by-one)
    embeddings = model.encode(chunks, show_progress_bar=True)
    
    print(f"Embedding shape: {embeddings.shape}")  # (n_chunks, 384)
    print(f"First vector sample: {embeddings[0][:5].tolist()}")
  3. Store vectors in Chroma

    Start a persistent Chroma client, create a collection configured for cosine similarity (so the distances in the next step behave as expected), and add each chunk along with its embedding and the chunk index as metadata. Verify the collection size matches your chunk count before moving on.

    import chromadb
    # pip install chromadb
    
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(
        "rag_demo",
        metadata={"hnsw:space": "cosine"},  # use cosine distance, matching step 4
    )
    
    collection.add(
        ids=[str(i) for i in range(len(chunks))],
        embeddings=embeddings.tolist(),
        documents=chunks,
        metadatas=[{"chunk_index": i} for i in range(len(chunks))],
    )
    
    print(f"Collection size: {collection.count()}")  # should match len(chunks)
  4. Retrieve on query

    Embed the user's question with the same model, then query Chroma for the top-3 most similar chunks by cosine distance. Print the retrieved chunks: together they are the context you will hand to Claude, so this is where your context-window budget gets spent.

    def retrieve(query: str, top_k: int = 3) -> list[str]:
        query_embedding = model.encode([query]).tolist()
        results = collection.query(
            query_embeddings=query_embedding,
            n_results=top_k,
            include=["documents", "metadatas", "distances"],
        )
        docs = results["documents"][0]
        for i, (doc, dist) in enumerate(zip(docs, results["distances"][0])):
            print(f"[{i}] distance={dist:.3f}\n{doc[:200]}\n")
        return docs
    
    query = "What is the main contribution of this paper?"
    context_chunks = retrieve(query)
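
    The query above already asks Chroma for metadatas, so the original chunk index of each hit is available too. If you later want citations to point at positions in the source document rather than positions in the retrieved list, you can surface it like this (a small optional sketch):

    # Show where each retrieved chunk sits in the original document
    results = collection.query(
        query_embeddings=model.encode([query]).tolist(),
        n_results=3,
        include=["documents", "metadatas"],
    )
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        print(f"chunk_index={meta['chunk_index']}: {doc[:80]}")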
  5. Answer with Claude

    Build a system prompt that instructs Claude to answer only from the provided context. Concatenate the retrieved chunks into a user message alongside the question, call the Anthropic Messages API, and print the response. Then ask a question your PDF does not answer and observe what happens.

    import anthropic
    # pip install anthropic
    
    claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env; a distinct name avoids shadowing the Chroma client
    
    def answer(query: str, context: list[str]) -> str:
        context_text = "\n\n".join(
            f"[Chunk {i}]\n{c}" for i, c in enumerate(context)
        )
        response = claude.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=(
                "Answer the user's question using ONLY the provided context chunks. "
                "If the answer is not in the context, say so explicitly."
            ),
            messages=[{
                "role": "user",
                "content": f"<context>\n{context_text}\n</context>\n\nQuestion: {query}",
            }],
        )
        return response.content[0].text
    
    print(answer(query, context_chunks))
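
    Now do the last part of this step: ask something the PDF cannot answer and confirm Claude says so rather than guessing. Any unrelated question works, for example:

    off_topic = "What were the company's Q3 revenue figures?"
    print(answer(off_topic, retrieve(off_topic)))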
  6. Add source citations

    Update the prompt to ask Claude to cite which chunk index it drew each claim from. Verify the citations match the retrieved content. This is the minimum viable attribution that makes RAG outputs auditable.

    def answer_with_citations(query: str, context: list[str]) -> str:
        context_text = "\n\n".join(
            f"[Chunk {i}]\n{c}" for i, c in enumerate(context)
        )
        response = claude.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=(
                "Answer using ONLY the provided context chunks. "
                "After each factual claim, add a citation in the format [Chunk N]. "
                "If information is not in the context, write 'Not found in context.'"
            ),
            messages=[{
                "role": "user",
                "content": f"<context>\n{context_text}\n</context>\n\nQuestion: {query}",
            }],
        )
        return response.content[0].text
    
    print(answer_with_citations(query, context_chunks))
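
    With both helpers in place, the whole pipeline collapses into one call. A minimal wrapper using only the functions defined above (the ask name and the example question are just illustrations):

    def ask(question: str, top_k: int = 3) -> str:
        # Retrieve relevant chunks, then answer with citations
        context = retrieve(question, top_k=top_k)
        return answer_with_citations(question, context)

    print(ask("What datasets were used in the experiments?"))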
