BFCL — Berkeley Function Calling Leaderboard
Berkeley Function Calling Leaderboard — the de facto standard benchmark for LLM tool-use capability; v4 adds agentic evaluation (web search, memory, format sensitivity) on top of the classic single-turn categories.
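A minimal sketch of the AST-style matching idea behind BFCL's single-turn categories: compare the model's emitted tool call against an expected call. This is illustrative only, not BFCL's actual harness; the dict shapes and the `is_correct_call` helper are assumptions for this example.

```python
def is_correct_call(predicted: dict, expected: dict) -> bool:
    """Return True if the function name and every expected argument match."""
    if predicted.get("name") != expected["name"]:
        return False
    pred_args = predicted.get("arguments", {})
    for arg, allowed_values in expected["arguments"].items():
        # Each expected argument lists its acceptable values, mirroring how
        # BFCL-style ground truth allows several valid variants per parameter.
        if pred_args.get(arg) not in allowed_values:
            return False
    return True

expected = {
    "name": "get_weather",
    "arguments": {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius"]},
}
predicted = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}}
assert is_correct_call(predicted, expected)
```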
LLM Benchmarks
Standard LLM benchmarks and what they actually measure — knowing which are saturated, contaminated, or misused prevents making the wrong production decisions based on benchmark scores.
LLM Evaluation
LLM evaluation methodology — only 52% of AI orgs have evals in place, making this the most common gap; covers offline/online/agent/RAG eval types, framework selection, golden set construction, and CI integration.
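A hedged sketch of the offline, threshold-gated golden-set check described above, the kind of script that can run in CI. The `golden_set.jsonl` path, the `answer_question` placeholder, and the 0.9 pass-rate gate are assumptions for illustration, not any particular framework's API.

```python
import json
import sys


def answer_question(question: str) -> str:
    """Placeholder for the LLM-backed system under test."""
    raise NotImplementedError


def run_golden_set(path: str, threshold: float = 0.9) -> None:
    # Golden set: one JSON object per line with "question" and "expected" keys.
    with open(path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    passed = sum(
        1
        for case in cases
        if case["expected"].lower() in answer_question(case["question"]).lower()
    )
    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.2%} ({passed}/{len(cases)})")
    if pass_rate < threshold:
        sys.exit(1)  # fail the CI job when quality regresses below the gate


if __name__ == "__main__":
    run_golden_set("golden_set.jsonl")
```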
LLM-as-Judge
Using an LLM to evaluate another LLM's outputs is the standard approach for open-ended tasks — calibrate against human labels (target Spearman > 0.8), use explicit rubrics, and account for position/verbosity/self-enhancement biases.
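A sketch of the calibration step: score the same outputs with humans and with the judge, then check rank agreement. `scipy.stats.spearmanr` is a real call; the score lists are made-up illustrations.

```python
from scipy.stats import spearmanr

human_scores = [5, 3, 4, 1, 2, 5, 4, 2, 3, 1]  # human rubric ratings (1-5)
judge_scores = [5, 2, 4, 1, 3, 4, 4, 2, 3, 1]  # LLM-judge ratings on the same outputs

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# Following the guidance above, treat the judge as usable only once rho > 0.8
# on a held-out set of human-labeled examples.
if rho <= 0.8:
    print("Judge not calibrated yet: tighten the rubric or add few-shot anchors.")
```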
OpenAI Evals
Open-source framework for evaluating LLMs and LLM-powered systems, plus a registry of community benchmarks.
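A hedged sketch of preparing a samples file for a simple match-style eval; the `input`/`ideal` JSONL schema and the `oaieval` CLI are from my reading of the openai/evals repo and should be verified against its registry docs.

```python
import json

# Each line pairs a chat-format prompt with the ideal answer the eval matches against.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the capital city only."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]

with open("samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# A registry YAML entry then points an eval class at this file, and the eval is
# run from the CLI (e.g. `oaieval <model> <eval-name>`).
```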
RAGAS — RAG Evaluation Framework
Reference-free evaluation framework for RAG pipelines; scores dimensions such as faithfulness and answer relevancy without requiring hand-labeled reference answers.
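A minimal sketch using the classic `ragas.evaluate()` API (0.1-style); the column names and metric imports follow that version and may differ in newer releases, so treat them as assumptions to check against the current docs.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One evaluation row: the user question, the pipeline's answer, and the
# retrieved contexts. No ground-truth answer is needed for these two metrics.
rows = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
}

result = evaluate(Dataset.from_dict(rows), metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the RAG pipeline
```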
DeepEval
Open-source pytest-style LLM eval framework by Confident AI with 50+ research-backed metrics, G-Eval custom criteria scoring, and threshold-gated CI integration.
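A hedged sketch of a pytest-style DeepEval check with a G-Eval criterion; the imports and class names follow deepeval's documented API as I recall it, so verify them against the current docs before relying on this.

```python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_answer_correctness():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris",
    )
    correctness = GEval(
        name="Correctness",
        criteria="Does the actual output answer the question and agree with the expected output?",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.7,  # the test, and any CI job running it, fails below this score
    )
    assert_test(test_case, [correctness])
```

Because the check is an ordinary pytest test, the same threshold acts as the CI quality gate mentioned above.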