Intermediate · AI Engineer

Write an LLM-as-judge eval

Design a repeatable evaluation harness for a RAG or chatbot system. You will curate 10 question-answer pairs as a golden set, write a faithfulness scorer that sends each answer to Claude with the source context and a rubric, and produce a pass/fail result with an aggregate score you can track over time.

Why this matters

LLM-as-judge is how teams catch regressions without a human in the loop for every change. A 10-case golden set sounds trivial but already catches model swaps, prompt regressions, and chunking bugs that manual inspection misses. The discipline of defining a rubric before looking at outputs is the hardest part and the most valuable skill to build.

Step-by-step guide

  1. Define your rubric first

    Write out what a faithful answer means before touching code. A minimal rubric: every factual claim in the answer must be traceable to the provided context; the answer must not introduce facts not present in the context. Write this as plain English; you will paste it into the judge prompt.
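    Keeping the rubric as a string constant in the eval code itself is one way to make sure the prompt and the documented rubric stay in sync. A minimal sketch (the constant name is an arbitrary choice):

```python
# Minimal faithfulness rubric kept as a constant so it can be pasted
# verbatim into the judge prompt and versioned with the eval code.
FAITHFULNESS_RUBRIC = """\
A faithful answer satisfies both conditions:
1. Every factual claim in the answer is traceable to the provided context.
2. The answer introduces no facts that are absent from the context.
"""
```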

  2. Curate 10 golden cases

    Write 10 questions you know the answer to from your source documents. For each, record the expected answer (ground truth) and the context chunk it comes from. Include 2 edge cases: one question the document does not answer, and one where the answer requires combining two chunks.
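    The two edge cases might look like this; the questions, contexts, and expected answers are hypothetical, but the field names match the `golden_cases` shape used in the runner below:

```python
# Hypothetical examples of the two recommended edge cases.
edge_cases = [
    {
        # Edge case 1: the document does not answer the question.
        # A faithful answer must say so rather than fabricate a figure.
        "question": "What is the population of France?",
        "context": "France is a country in Western Europe. Its capital city is Paris.",
        "expected": "The provided context does not state France's population.",
    },
    {
        # Edge case 2: the answer requires combining two chunks.
        "question": "In which part of Europe is the capital of France located?",
        "context": "France is a country in Western Europe.\n\nIts capital city is Paris.",
        "expected": "Paris, the capital of France, is in Western Europe.",
    },
]
```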

  3. Build the judge prompt

    Write a system prompt for Claude that presents the question, the retrieved context, and the model's answer, then asks Claude to score faithfulness from 1-5 and explain its reasoning. Use XML tags to separate the sections clearly; Claude follows structured prompts more reliably.

    import re

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    JUDGE_SYSTEM = """You are an evaluation judge assessing whether an AI answer is faithful to the provided context.

    Score from 1 to 5:
    5 = Every factual claim is directly supported by the context.
    4 = Nearly all claims are supported; minor extrapolation.
    3 = Most claims supported but one unsupported inference.
    2 = Several claims not found in the context.
    1 = Answer contradicts the context or is mostly fabricated.

    Respond with exactly this format:
    <score>N</score>
    <reasoning>one sentence explanation</reasoning>"""

    def judge(question: str, context: str, answer: str) -> tuple[int, str]:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=256,
            temperature=0,   # greedy decoding for stable scoring
            system=JUDGE_SYSTEM,
            messages=[{"role": "user", "content": (
                f"<question>{question}</question>\n"
                f"<context>{context}</context>\n"
                f"<answer>{answer}</answer>"
            )}],
        )
        text = response.content[0].text
        score_match = re.search(r"<score>(\d)</score>", text)
        reasoning_match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.S)
        if not score_match or not reasoning_match:
            raise ValueError(f"Judge response did not match expected format: {text!r}")
        return int(score_match.group(1)), reasoning_match.group(1).strip()
  4. Run each case through the judge

    For each golden case, call your system to get an answer, then call Claude with the judge prompt. Parse the score from the response and log the judge's reasoning alongside it; that reasoning is where the signal is.

    golden_cases = [
        {
            "question": "What is the capital of France?",
            "context": "France is a country in Western Europe. Its capital city is Paris.",
            "expected": "Paris",
        },
        # ... add 9 more cases
    ]
    
    results = []
    for case in golden_cases:
        answer = your_system(case["question"], case["context"])
        score, reasoning = judge(case["question"], case["context"], answer)
        results.append({
            **case,
            "answer": answer,
            "score": score,
            "reasoning": reasoning,
            "pass": score >= 4,
        })
        print(f"Q: {case['question'][:60]}... | score={score} | {'PASS' if score >= 4 else 'FAIL'}")
        if score < 4:
            print(f"  Reasoning: {reasoning}")
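    To make the aggregate score trackable over time, one lightweight option is appending each run to a JSONL log; the file name and record shape here are assumptions, not something the guide prescribes:

```python
import json
import time

def log_run(results: list[dict], path: str = "eval_log.jsonl") -> None:
    """Append one eval run's per-case scores to a JSONL log.

    Each line in the log is one run, so runs can be diffed over time.
    """
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "results": [
            {"question": r["question"], "score": r["score"], "pass": r["pass"]}
            for r in results
        ],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

    Calling `log_run(results)` after the loop above records the run without changing the printed report.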
  5. Aggregate and report

    Calculate pass rate (score >= 4) and mean score across all 10 cases. Print a table: question, expected, actual, score, pass/fail. Run the eval twice and confirm the scores are stable; the judge call already sets temperature=0, so significant variation points to a prompt or parsing problem rather than sampling noise.

    import statistics
    
    pass_rate = sum(1 for r in results if r["pass"]) / len(results)
    mean_score = statistics.mean(r["score"] for r in results)
    
    print("\n=== Eval Results ===")
    print(f"Pass rate:  {pass_rate:.0%}  ({sum(r['pass'] for r in results)}/{len(results)})")
    print(f"Mean score: {mean_score:.2f} / 5.0")
    
    # Print failures for inspection
    failures = [r for r in results if not r["pass"]]
    if failures:
        print(f"\nFailures ({len(failures)}):")
        for r in failures:
            print(f"  [{r['score']}] {r['question'][:60]}")
            print(f"       Answer: {r['answer'][:80]}")
            print(f"       Reason: {r['reasoning']}")
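    One way to turn the aggregate into a regression gate is to compare the pass rate against a stored baseline; the baseline value and function name here are arbitrary examples, not part of the guide:

```python
# Sketch of a CI regression gate: fail the build when this run's pass
# rate drops below a baseline recorded from earlier runs.
BASELINE_PASS_RATE = 0.8  # arbitrary example; track your own baseline

def gate(pass_rate: float, baseline: float = BASELINE_PASS_RATE) -> int:
    """Return 0 when the run meets the baseline, 1 (fail the build) otherwise."""
    if pass_rate < baseline:
        print(f"REGRESSION: pass rate {pass_rate:.0%} is below baseline {baseline:.0%}")
        return 1
    print(f"OK: pass rate {pass_rate:.0%} meets baseline {baseline:.0%}")
    return 0
```

    In CI you would end the script with `sys.exit(gate(pass_rate))` so a regression fails the job.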
