BFCL — Berkeley Function Calling Leaderboard
Berkeley Function Calling Leaderboard — the de facto standard benchmark for LLM tool-use capability; v4 adds agentic evaluation (web search, memory, format sensitivity) on top of the classic single-turn categories.
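A minimal sketch of the AST-style matching idea behind BFCL's single-turn categories: compare the model's emitted tool call against an expected call. This is illustrative only, not BFCL's actual harness; the dict shapes and the `is_correct_call` helper are assumptions for this example.

```python
def is_correct_call(predicted: dict, expected: dict) -> bool:
    """Return True if the function name and every expected argument match."""
    if predicted.get("name") != expected["name"]:
        return False
    pred_args = predicted.get("arguments", {})
    for arg, allowed_values in expected["arguments"].items():
        # Each expected argument lists its acceptable values, mirroring how
        # BFCL-style ground truth allows several valid variants per parameter.
        if pred_args.get(arg) not in allowed_values:
            return False
    return True

expected = {
    "name": "get_weather",
    "arguments": {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius"]},
}
predicted = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}}
assert is_correct_call(predicted, expected)
```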
LLM Benchmarks
Standard LLM benchmarks and what they actually measure — knowing which are saturated, contaminated, or misused prevents making the wrong production decisions based on benchmark scores.
LLM Evaluation
LLM evaluation methodology — only 52% of AI orgs have evals in place, making this the most common gap; covers offline/online/agent/RAG eval types, framework selection, golden set construction, and CI integration.
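A hedged sketch of the offline, threshold-gated golden-set check described above, the kind of script that can run in CI. The `golden_set.jsonl` path, the `answer_question` placeholder, and the 0.9 pass-rate gate are assumptions for illustration, not any particular framework's API.

```python
import json
import sys


def answer_question(question: str) -> str:
    """Placeholder for the LLM-backed system under test."""
    raise NotImplementedError


def run_golden_set(path: str, threshold: float = 0.9) -> None:
    # Golden set: one JSON object per line with "question" and "expected" keys.
    with open(path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    passed = sum(
        1
        for case in cases
        if case["expected"].lower() in answer_question(case["question"]).lower()
    )
    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.2%} ({passed}/{len(cases)})")
    if pass_rate < threshold:
        sys.exit(1)  # fail the CI job when quality regresses below the gate


if __name__ == "__main__":
    run_golden_set("golden_set.jsonl")
```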
LLM-as-Judge
Using an LLM to evaluate another LLM's outputs is the standard approach for open-ended tasks — calibrate against human labels (target Spearman > 0.8), use explicit rubrics, and account for position/verbosity/self-enhancement biases.
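A sketch of the calibration step: score the same outputs with humans and with the judge, then check rank agreement. `scipy.stats.spearmanr` is a real call; the score lists are made-up illustrations.

```python
from scipy.stats import spearmanr

human_scores = [5, 3, 4, 1, 2, 5, 4, 2, 3, 1]  # human rubric ratings (1-5)
judge_scores = [5, 2, 4, 1, 3, 4, 4, 2, 3, 1]  # LLM-judge ratings on the same outputs

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# Following the guidance above, treat the judge as usable only once rho > 0.8
# on a held-out set of human-labeled examples.
if rho <= 0.8:
    print("Judge not calibrated yet: tighten the rubric or add few-shot anchors.")
```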
OpenAI Evals
Open-source framework for evaluating LLMs and LLM-powered systems, plus a registry of community benchmarks.
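A hedged sketch of preparing a samples file for a simple match-style eval; the `input`/`ideal` JSONL schema and the `oaieval` CLI are from my reading of the openai/evals repo and should be verified against its registry docs.

```python
import json

# Each line pairs a chat-format prompt with the ideal answer the eval matches against.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the capital city only."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]

with open("samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# A registry YAML entry then points an eval class at this file, and the eval is
# run from the CLI (e.g. `oaieval <model> <eval-name>`).
```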
RAGAS — RAG Evaluation Framework
Reference-free evaluation framework for RAG pipelines; scores dimensions such as faithfulness and answer relevancy without requiring hand-labeled reference answers.
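A minimal sketch using the classic `ragas.evaluate()` API (0.1-style); the column names and metric imports follow that version and may differ in newer releases, so treat them as assumptions to check against the current docs.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One evaluation row: the user question, the pipeline's answer, and the
# retrieved contexts. No ground-truth answer is needed for these two metrics.
rows = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
}

result = evaluate(Dataset.from_dict(rows), metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the RAG pipeline
```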
DeepEval
Open-source pytest-style LLM eval framework by Confident AI with 50+ research-backed metrics, G-Eval custom criteria scoring, and threshold-gated CI integration.
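A hedged sketch of a pytest-style DeepEval check with a G-Eval criterion; the imports and class names follow deepeval's documented API as I recall it, so verify them against the current docs before relying on this.

```python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_answer_correctness():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris",
    )
    correctness = GEval(
        name="Correctness",
        criteria="Does the actual output answer the question and agree with the expected output?",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.7,  # the test, and any CI job running it, fails below this score
    )
    assert_test(test_case, [correctness])
```

Because the check is an ordinary pytest test, the same threshold acts as the CI quality gate mentioned above.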