Write an LLM-as-judge eval
Design a repeatable evaluation harness for a RAG or chatbot system. You will curate 10 question-answer pairs as a golden set, write a faithfulness scorer that sends each answer to Claude with the source context and a rubric, and produce a pass/fail result with an aggregate score you can track over time.
Why this matters
LLM-as-judge is how teams catch regressions without a human in the loop for every change. A 10-case golden set sounds trivial but already catches model swaps, prompt regressions, and chunking bugs that manual inspection misses. The discipline of defining a rubric before looking at outputs is the hardest part and the most valuable skill to build.
Before you start
- Working RAG pipeline or chatbot you can call programmatically
- Anthropic API access
- Understanding of what faithfulness means in a RAG context (the answer is grounded in the retrieved documents)
- Python with the anthropic SDK installed
Step-by-step guide
- 1
Define your rubric first
Write out what a faithful answer means before touching code. A minimal rubric: every factual claim in the answer must be traceable to the provided context; the answer must not introduce facts not present in the context. Write this as plain English; you will paste it into the judge prompt.
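For example, a minimal faithfulness rubric might read as follows (the wording is illustrative; adapt it to your documents):

Every factual claim in the answer must be traceable to a specific sentence in the provided context. The answer must not introduce facts, numbers, or names that are absent from the context. If the context does not contain the answer, a faithful answer says so rather than guessing.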
- 2
Curate 10 golden cases
Write 10 questions you know the answer to from your source documents. For each, record the expected answer (ground truth) and the context chunk it comes from. Include 2 edge cases: one question the document does not answer, and one where the answer requires combining two chunks.
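The edge cases fit the same structure as the rest of the set. A sketch, assuming the same question/context/expected fields used in step 4 (the example questions are illustrative):

edge_cases = [
    {
        # The document does not answer this; a faithful answer should say so.
        "question": "What is the population of France?",
        "context": "France is a country in Western Europe. Its capital city is Paris.",
        "expected": "The context does not say.",
    },
    {
        # Answering requires combining two chunks; store both in the context field.
        "question": "What is the capital of the country where the Louvre is located?",
        "context": (
            "The Louvre is an art museum located in France.\n"
            "France is a country in Western Europe. Its capital city is Paris."
        ),
        "expected": "Paris",
    },
]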
- 3
Build the judge prompt
Write a system prompt for Claude that presents the question, the retrieved context, and the model's answer, then asks Claude to score faithfulness from 1-5 and explain its reasoning. Use XML tags to separate the sections clearly; Claude follows structured prompts more reliably.
JUDGE_SYSTEM = """You are an evaluation judge assessing whether an AI answer is faithful to the provided context. Score from 1 to 5: 5 = Every factual claim is directly supported by the context. 4 = Nearly all claims are supported; minor extrapolation. 3 = Most claims supported but one unsupported inference. 2 = Several claims not found in the context. 1 = Answer contradicts the context or mostly fabricated. Respond with exactly this format: <score>N</score> <reasoning>one sentence explanation</reasoning>""" def judge(question: str, context: str, answer: str) -> tuple[int, str]: response = client.messages.create( model="claude-sonnet-4-6", max_tokens=256, temperature=0, # deterministic scoring system=JUDGE_SYSTEM, messages=[{"role": "user", "content": ( f"<question>{question}</question>\n" f"<context>{context}</context>\n" f"<answer>{answer}</answer>" )}], ) text = response.content[0].text import re score = int(re.search(r"<score>(\d)</score>", text).group(1)) reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.S).group(1).strip() return score, reasoning - 4
Run each case through the judge
For each golden case, call your system to get an answer, then call Claude with the judge prompt. Parse the score from the response. If Claude returns reasoning alongside the score, log it; that reasoning is where the signal is.
golden_cases = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital city is Paris.",
        "expected": "Paris",
    },
    # ... add 9 more cases
]

results = []
for case in golden_cases:
    answer = your_system(case["question"], case["context"])  # call your RAG pipeline or chatbot here
    score, reasoning = judge(case["question"], case["context"], answer)
    results.append({
        **case,
        "answer": answer,
        "score": score,
        "reasoning": reasoning,
        "pass": score >= 4,
    })
    print(f"Q: {case['question'][:60]}... | score={score} | {'PASS' if score >= 4 else 'FAIL'}")
    if score < 4:
        print(f"  Reasoning: {reasoning}")
- 5
Aggregate and report
Calculate the pass rate (score >= 4) and mean score across all 10 cases. Print a table: question, expected, actual, score, pass/fail. Run the eval twice and confirm the scores are stable; the judge call above already sets temperature=0 for this reason, so if scores still vary, the rubric wording is probably ambiguous.
import statistics

pass_rate = sum(1 for r in results if r["pass"]) / len(results)
mean_score = statistics.mean(r["score"] for r in results)

print("\n=== Eval Results ===")
print(f"Pass rate: {pass_rate:.0%} ({sum(r['pass'] for r in results)}/{len(results)})")
print(f"Mean score: {mean_score:.2f} / 5.0")

# Print failures for inspection
failures = [r for r in results if not r["pass"]]
if failures:
    print(f"\nFailures ({len(failures)}):")
    for r in failures:
        print(f"  [{r['score']}] {r['question'][:60]}")
        print(f"    Answer: {r['answer'][:80]}")
        print(f"    Reason: {r['reasoning']}")
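To track the aggregate score over time, append each run's summary to a log file and compare across commits or model versions. A minimal sketch, assuming a local eval_history.jsonl file (the filename and fields are illustrative):

import json
from datetime import datetime, timezone

# One summary line per run; diff pass_rate and mean_score between runs.
with open("eval_history.jsonl", "a") as f:
    f.write(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pass_rate": pass_rate,
        "mean_score": mean_score,
        "num_cases": len(results),
    }) + "\n")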