Eval benchmarks — reference
Sr Staff candidates fluently discuss MMLU vs MMLU-Pro vs GPQA. They know which benchmarks are saturated, which are contaminated, and which still differentiate frontier models. This page is the cheat sheet — bookmark it, drill it before any onsite.
What you'll learn
- Why benchmarks matter (and why they all become saturated)
- Knowledge / general capability — MMLU, MMLU-Pro, GPQA, HLE
- Math benchmarks — GSM8K → MATH → AIME → FrontierMath → HLE
- Code benchmarks — HumanEval → LiveCodeBench → SWE-Bench
- Reasoning & abstract — ARC-AGI and friends
- Long-context evals — NIAH → RULER → BABILong
- Multimodal — MMMU, MathVista, VideoMME
- Instruction following — IFEval, MT-Bench, Chatbot Arena
- Agents & tool use — SWE-Bench Verified, GAIA, OSWorld, BFCL
- Safety & red-teaming — HarmBench, XSTest, JailbreakBench, WMDP
- How to read a benchmark score (the meta-skill)
Benchmarks let you compare models on a fixed measuring stick. The catch: every benchmark eventually saturates (frontier models score > 90%) and eventually leaks into pretraining data (contamination). The good 2026 ones are hand-written by experts, continuously refreshed, or both.
A high score doesn't always mean the model can do the task
Many benchmarks (HumanEval, MATH, GSM8K) appear in pretraining data, so the model may have memorized the test set. Modern benchmarks defend against this in three ways (a toy contamination check follows the list):
- Continuously refreshed — LiveCodeBench (monthly Codeforces), LiveBench.
- Hand-curated & held out — GPQA-Diamond, FrontierMath, HLE.
- Post-cutoff data — AIME 2024/2025 dropped after most model cutoffs.
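To make the contamination question concrete: most decontamination reports use an n-gram overlap heuristic between eval items and the training corpus. A toy sketch of that idea (real pipelines index tokenized corpora with suffix arrays or Bloom filters; the function names here are illustrative):

```python
# Toy contamination check: flag eval items sharing any long n-gram with
# the training corpus. Real pipelines index tokenized text with suffix
# arrays or Bloom filters; this only illustrates the idea.

def ngrams(text: str, n: int = 13) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(training_docs: list[str], n: int = 13) -> set[str]:
    index: set[str] = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(eval_item: str, index: set[str], n: int = 13) -> bool:
    # Any shared 13-gram suggests a verbatim leak of this test item.
    return bool(ngrams(eval_item, n) & index)
```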
What "saturated" means
A benchmark is saturated when frontier models score > 90-95%. At that point, score differences are noise; you can't distinguish models. Move to harder ones (MMLU-Pro, GPQA-Diamond, FrontierMath, HLE). Saturated benchmarks aren't useless — they remain reasonable lower-bound sanity checks — but they don't differentiate frontier work.
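Concretely, why score gaps at saturation are noise: accuracy over N questions carries a binomial standard error of sqrt(p(1-p)/N). A quick check, assuming a 500-item benchmark:

```python
import math

def accuracy_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for accuracy p measured over n items."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Two "different" models on a 500-question benchmark:
print(accuracy_ci(0.92, 500))  # ~(0.896, 0.944)
print(accuracy_ci(0.93, 500))  # ~(0.908, 0.952) -> intervals overlap: noise
```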
- Benchmarks are tools, not truth. They saturate and leak.
- 2026 frontier evals: GPQA-Diamond, FrontierMath, HLE, SWE-Bench Verified, LiveCodeBench, ARC-AGI-2.
- Always ask: contamination? saturation? what's the human baseline?
MMLU is saturated; stop using it as the headline number. MMLU-Pro is the harder mid-tier (10-option MCQ). GPQA-Diamond is contamination-resistant, PhD-level science. HLE (2025) is the new "next MMLU": very low SOTA, designed to differentiate frontier models for years. A minimal MCQ harness is sketched after the table.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| MMLU | 57-subject multiple choice | Broad academic knowledge | Saturated (~90%+ frontier; ~88% Claude 3.5) |
| MMLU-Pro | 10-option MCQ, harder distractors | Same domains, less guessable | Less saturated (~80% top, ~70% Llama 3.1 405B) |
| GPQA-Diamond | Hand-crafted PhD-level science | Contamination-resistant reasoning | ~85% SOTA (o3); ~50% GPT-4o; ~40% Llama 3.1 405B |
| HLE (Humanity's Last Exam) | ~3,000 expert-written questions across many domains | 2025; "the next MMLU" | Very low SOTA still (~25% top) |
| BBH (BIG-Bench Hard) | 23 hard tasks from BIG-Bench | Reasoning + symbolic | Saturated for top models |
| HellaSwag | Sentence completion | Commonsense | Fully saturated |
| ARC-Challenge | Grade-school science MCQ | Easy reasoning | Saturated |
| WinoGrande | Pronoun resolution | Coreference / commonsense | Saturated |
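If you've never run one of these yourself: MCQ evals like MMLU reduce to a prompt template, an answer-letter extractor, and exact match. A minimal sketch (`ask_model` is a stand-in for your inference client; the template and extraction regex are assumptions, and both move scores):

```python
import re

CHOICES = "ABCD"

def format_prompt(q: dict) -> str:
    options = "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, q["options"]))
    return f"{q['question']}\n{options}\nAnswer with a single letter."

def extract_letter(completion: str) -> str | None:
    m = re.search(r"\b([ABCD])\b", completion)
    return m.group(1) if m else None

def mcq_accuracy(questions: list[dict], ask_model) -> float:
    # Exact match on the extracted letter; ties into the "prompt template
    # changes scores by 5-15%" point later in this page.
    correct = sum(
        extract_letter(ask_model(format_prompt(q))) == q["answer"]
        for q in questions
    )
    return correct / len(questions)
```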
- MMLU = stale. Cite MMLU-Pro for capacity differentiation.
- GPQA-Diamond = the contamination-resistant PhD eval.
- HLE = "next MMLU." Watch this one over the next 2 years.
The math benchmark ladder mirrors the reasoning-model arc. GSM8K and MATH are saturated. AIME 2024/2025 is the current high-signal eval. FrontierMath and HLE are the hard frontier — very low SOTA, designed to last.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| GSM8K | Grade-school word problems | Multi-step arithmetic | Saturated (>95%); contamination concern |
| MATH | Competition problems | Algebra / geometry / calculus | ~95%+ SOTA reasoning models |
| AIME 2024 / 2025 | 15 problems, integer answer | Olympiad-level | Current high-signal eval. o3: ~96% AIME 2024 |
| FrontierMath | Hand-crafted research-level math | Hard novel math problems | Was <3% pre-o3; o3 jumped to ~25% |
| OlympiadBench | Olympiad math + physics | Multi-modal (some have figures) | ~50% top reasoning models |
| Omni-Math | Comprehensive math eval | Comprehensive coverage | Newer; less contaminated |
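One reason AIME is such a clean eval: answers are integers 0-999, so grading is pure exact match after normalization. A sketch, assuming the common \boxed{} answer convention (the fallback heuristic is my own):

```python
import re

def extract_final_answer(completion: str) -> int | None:
    """Pull the last \\boxed{...} integer, the usual convention in math evals."""
    boxed = re.findall(r"\\boxed\{(\d+)\}", completion)
    if boxed:
        return int(boxed[-1])
    nums = re.findall(r"\d+", completion)  # fallback: last bare integer
    return int(nums[-1]) if nums else None

def grade_aime(completion: str, gold: int) -> bool:
    ans = extract_final_answer(completion)
    return ans is not None and 0 <= ans <= 999 and ans == gold
```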
- Saturated: GSM8K, MATH. Don't lead with these.
- Lead with: AIME 2024/2025, FrontierMath.
- o3's FrontierMath jump (3% → 25%) was the headline 2024 reasoning result.
HumanEval and MBPP are saturated and contaminated. LiveCodeBench (continuously refreshed Codeforces/LeetCode) is the contamination-resistant alternative. SWE-Bench Verified is the gold standard for code agents — real GitHub issues with real test suites.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| HumanEval | 164 hand-written Python functions | Function-level code gen | Saturated (>95%); heavy contamination |
| MBPP | 974 simple Python problems | Function-level | Saturated |
| HumanEval+ / MBPP+ (EvalPlus) | Same problems, more tests | Catches edge-case failures | Still useful; ~80% top |
| LiveCodeBench | Continuously-updated Codeforces / LeetCode | Contamination-resistant code gen | ~70% SOTA on hardest segments |
| SWE-Bench (full) | 2294 real GitHub issues | Multi-file code edits with tests | ~30-40% top |
| SWE-Bench Verified | 500 hand-verified subset | Same, cleaned (no flaky tests) | ~70%+ top agents (Devin, Claude Code) |
| SWE-Bench Lite | Smaller, easier subset | Faster eval iteration | ~50%+ |
| BigCodeBench | Library API usage tasks | Practical code with stdlib + libs | ~50% |
| RepoBench | Repository-level completion | Long-context code understanding | Active |
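Function-level code evals are execution-based: a sample passes iff the benchmark's unit tests run clean against the generated code. A stripped-down sketch of the idea (no sandboxing here; real harnesses run untrusted code in isolated subprocesses with timeouts):

```python
# HumanEval-style grading: execute the candidate, then the tests.
# WARNING: exec on untrusted model output is unsafe; real harnesses
# isolate this in a subprocess with time and resource limits.

def passes_tests(candidate_code: str, test_code: str) -> bool:
    env: dict = {}
    try:
        exec(candidate_code, env)  # defines the generated function
        exec(test_code, env)       # test asserts raise on failure
        return True
    except Exception:
        return False

def pass_at_1(samples: list[str], test_code: str) -> float:
    # Average pass rate over independent samples = pass@1.
    return sum(passes_tests(s, test_code) for s in samples) / len(samples)
```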
- HumanEval = saturated + leaked. Stop citing it as the headline.
- LiveCodeBench = contamination-resistant function-level eval.
- SWE-Bench Verified = the gold standard for code agents.
ARC-AGI is Chollet's "novel reasoning" test — visual abstract puzzles where the rule must be inferred from a few examples. o3 broke ARC-AGI (76%) in late 2024; ARC-AGI-2 reset the bar.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| ARC-AGI / ARC-AGI-2 | Visual abstract reasoning grids | "Few-shot" novel reasoning | Chollet's eval. o3 breakthrough late 2024 (76%); ARC-AGI-2 reset bar |
| BIG-Bench | 200+ tasks | Broad capability sampling | Most subtasks saturated; superseded by BBH and others |
| DROP | Reading comprehension w/ math | Discrete reasoning over paragraphs | Saturated |
- ARC-AGI = the novel-reasoning test. ARC-AGI-2 is the current frontier.
- If asked about Chollet's evals, mention his Ndea program-synthesis lab too.
NIAH (needle-in-a-haystack) is saturated for any modern long-context model. RULER's multi-needle / multi-hop variants are the realistic test: most models drop sharply past 32k context, regardless of their advertised window.
| Benchmark | Format | What it measures |
|---|---|---|
| NIAH (Needle in a Haystack) | Plant a fact, ask for retrieval at varying depths | Basic long-context retrieval. Saturated for any modern long-context model. |
| RULER | Multi-needle, multi-hop, aggregation | Realistic long-context use; better than NIAH. Most models drop sharply past 32k. |
| BABILong | bAbI tasks padded to long context | Reasoning over long context |
| Loong / Loogle / ZeroSCROLLS | Long-doc QA | Real long-context tasks |
| LongBench | 21 task suite | Comprehensive but somewhat saturated |
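The NIAH recipe is mechanical, which is part of why it saturated: plant one fact at a fractional depth in filler text, ask for it back, sweep depth and context length. A sketch (`ask_model` and the needle text are placeholders):

```python
# Needle-in-a-haystack probe. RULER hardens this with multiple needles,
# distractors, and aggregation questions; the skeleton is the same.

NEEDLE = "The magic number for project Osprey is 7421."  # invented fact

def build_haystack(filler: str, depth: float) -> str:
    cut = int(len(filler) * depth)
    return filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]

def niah_sweep(filler: str, ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    question = "\nWhat is the magic number for project Osprey?"
    return {
        d: "7421" in ask_model(build_haystack(filler, d) + question)
        for d in depths
    }  # retrieval success by insertion depth
```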
- NIAH is too easy. Cite RULER for real long-context.
- Most "1M context" models are not actually usable past ~64k. Verify.
MMMU is the headline multimodal eval (college-level reasoning across 30 subjects with images). MathVista probes visual + math reasoning. VideoMME is the harder long-video eval. Specialty benchmarks (OCR, charts, docs) cover production use cases.
| Benchmark | Format | What it measures |
|---|---|---|
| MMMU | 11k MCQ across 30 subjects, multi-image | College-level multimodal reasoning. ~70% SOTA; the harder MMMU-Pro variant scores well below that. |
| MathVista | Math with visual context | Visual + math reasoning |
| VQAv2 | Open-ended visual QA | Saturated (~85%+) |
| OCR-Bench, ChartQA, DocVQA | Specific multimodal tasks | OCR / chart / doc understanding |
| VideoMME | Video understanding QA | Long-video QA; harder than image-only |
- MMMU = college-level multimodal headline eval.
- For production: use the specialty evals (OCR, chart, doc, video) that match your domain.
IFEval (verifiable format constraints) is the cleanest instruction-following eval. MT-Bench and Arena are LLM-as-judge / human-preference — fuzzier signal. Chatbot Arena is the most user-facing leaderboard but is biased by stylistic preference.
| Benchmark | Format | What it measures |
|---|---|---|
| IFEval | Verifiable format constraints | "Answer in exactly 50 words", "include the word X 3 times". Programmatic check. ~85% top. |
| MT-Bench | 80 multi-turn prompts, GPT-4 judge | General chat quality. ~9/10 top. |
| AlpacaEval 2 / Arena-Hard | LLM-as-judge winrate | Pairwise preference vs reference |
| Chatbot Arena | Crowdsourced pairwise (lmsys) | Real-user preferences. Most-cited "feel" leaderboard. Caveats around stylistic preference bias. |
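IFEval's defining property: every constraint ships with a programmatic verifier, so no judge model is needed. Two toy verifiers in that spirit (my own simplified versions, not the official implementations):

```python
import re

def check_exact_word_count(response: str, n: int = 50) -> bool:
    """Verifier for 'answer in exactly N words'."""
    return len(response.split()) == n

def check_keyword_count(response: str, word: str, times: int = 3) -> bool:
    """Verifier for 'include the word X exactly K times' (case-insensitive)."""
    return len(re.findall(rf"\b{re.escape(word)}\b", response, re.I)) == times

# Score = fraction of constraints whose verifier returns True.
```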
- IFEval = clean programmatic instruction-following metric.
- Arena = human preference + style bias. Useful but with caveats.
SWE-Bench Verified is the gold standard for software-engineering agents. GAIA tests general assistant agents on real-world tasks. OSWorld and WebArena test computer use. BFCL is the standard for tool/function-calling accuracy.
| Benchmark | Format | What it measures |
|---|---|---|
| SWE-Bench (Verified) | GitHub issue → patch | Software engineering agent. Currently the gold standard. |
| AgentBench | 8 environments | Multi-domain agent eval |
| GAIA | Real-world tasks needing web + tools | General assistant agent. Hard. |
| WebArena / WebShop / VisualWebArena | Browser-driven tasks | Computer-use / browser-agent eval |
| OSWorld | Computer-use desktop tasks | Multi-app workflows |
| tau-bench | Multi-turn customer-service tool use | Conversational agent w/ tools |
| BFCL (Berkeley Function Calling Leaderboard) | Function-calling accuracy | Tool selection + arg extraction |
| ToolBench / API-Bank | Multi-step API use | End-to-end tool-using agent |
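Function-calling scoring boils down to two checks per turn: right tool, right arguments. BFCL does AST-level structural matching with type coercion and allowed-value sets; this minimal sketch only does exact structural match:

```python
# Minimal function-calling grader: tool selection + argument match.
# A simplification of BFCL-style AST matching (no type coercion here).

def call_matches(predicted: dict, gold: dict) -> bool:
    if predicted.get("name") != gold["name"]:
        return False  # wrong tool selected
    # Arguments must match exactly; missing or extra keys are a miss.
    return predicted.get("arguments", {}) == gold["arguments"]

def call_accuracy(preds: list[dict], golds: list[dict]) -> float:
    return sum(call_matches(p, g) for p, g in zip(preds, golds)) / len(golds)
```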
- SWE-Bench Verified = the SWE agent leaderboard.
- BFCL = the function-calling correctness leaderboard.
- OSWorld / WebArena = the computer-use frontier.
HarmBench measures attack success on harmful behaviors. XSTest catches over-refusal (false positives). WMDP probes weapons-of-mass-destruction-proxy knowledge. TruthfulQA probes whether models parrot common misconceptions.
| Benchmark | What it measures |
|---|---|
| HarmBench | 200+ harmful behaviors with jailbreak attempts. Measures attack success. |
| XSTest | Over-refusal: 250 safe prompts that look unsafe. Measures false-refusal rate. |
| JailbreakBench | Standard jailbreak corpus (PAIR, AutoDAN, etc.) |
| WMDP | Weapons-of-mass-destruction proxy. Probes dangerous knowledge. |
| BBQ | Bias in QA. Measures stereotype reliance. |
| TruthfulQA | Misconceptions / falsehoods. Measures whether model parrots common errors. |
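The HarmBench/XSTest pairing is two rates in tension, and you should report both. A sketch, with `is_harmful` and `is_refusal` standing in for trained classifiers or judge models (assumed, not real APIs):

```python
# Safety as two numbers that trade off against each other.

def attack_success_rate(responses: list[str], is_harmful) -> float:
    """HarmBench-style ASR: fraction of harmful prompts eliciting harmful output."""
    return sum(map(is_harmful, responses)) / len(responses)

def false_refusal_rate(responses: list[str], is_refusal) -> float:
    """XSTest-style: fraction of safe-but-scary prompts the model refuses."""
    return sum(map(is_refusal, responses)) / len(responses)

# A model can zero out ASR by refusing everything, which is why the
# over-refusal number has to be reported alongside it.
```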
- HarmBench + XSTest pair: attack success vs over-refusal — both matter.
- WMDP = the dangerous-capability eval that affects RSP / Preparedness levels.
A score in isolation is meaningless. Always ask: prompt template, sampling strategy (pass@1 vs N), test vs holdout, contamination risk, human baseline. Sr Staff candidates question every score reflexively.
- What's the prompt template? Few-shot vs zero-shot, CoT vs no-CoT, system prompt — all change scores by 5-15%.
- Pass@1, pass@k, maj@N, or best-of-N? Wildly different compute footprints; sampling strategy can swing scores 20+ points (estimator sketched after this list).
- Test set or private holdout? If test, contamination risk applies.
- Was it contaminated in pretraining? Check the data cutoff vs the benchmark publication date.
- What's the human baseline? A model scoring 60% might be above expert level on a hard eval or far below humans on an easy one.
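The estimator behind that pass@k bullet, from the HumanEval paper (Chen et al., 2021): draw n samples per problem, count c correct, and compute an unbiased pass@k rather than naively resampling:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=100, c=20, k=1))             # 0.2 (pass@1 = c/n)
print(round(pass_at_k(n=100, c=20, k=10), 3))  # ~0.905: same model, +70 points
```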
Holistic / aggregate evals worth knowing
- HELM (Stanford) — multi-metric, multi-benchmark holistic eval. Slow but thorough.
- OpenLLM Leaderboard v2 (HF) — community-run aggregate; includes GPQA, MMLU-Pro, MuSR, BBH, IFEval, MATH-Hard.
- LiveBench — contamination-free, monthly refresh.
- Chatbot Arena — community winrate. The most user-facing leaderboard; a rating-model sketch follows.
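Under the hood, Arena-style leaderboards fit pairwise votes to a rating model. lmsys fits Bradley-Terry via logistic regression over all battles; a minimal online Elo update conveys the same intuition:

```python
# Online Elo over pairwise votes: the intuition behind Arena rankings.
# The real leaderboard fits Bradley-Terry (order-independent); Elo is
# the sequential approximation.
from collections import defaultdict

ratings: dict[str, float] = defaultdict(lambda: 1000.0)

def expected_win(ra: float, rb: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def record_battle(winner: str, loser: str, k: float = 4.0) -> None:
    e = expected_win(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e)  # upset wins move ratings more
    ratings[loser] -= k * (1 - e)

for w, l in [("model_a", "model_b"), ("model_a", "model_c")]:
    record_battle(w, l)
```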
- Always interrogate the methodology before quoting a score.
- Pass@1 is the production-relevant metric. Pass@N shows ceiling.
- If a paper doesn't disclose prompt template + sampling, the score is suspect.
Sample interview Qs
- "What's the difference between MMLU and MMLU-Pro?" → MMLU-Pro: 10 options vs 4, harder distractors, more reasoning required, less saturated.
- "Why is GPQA-Diamond contamination-resistant?" → Hand-written by domain PhDs; many problems require multi-step reasoning that's not memorizable; held out from web crawl.
- "Pass@1 vs Pass@10 — when each?" → Pass@1: production-relevant (one-shot quality). Pass@10: capability ceiling under sampling. Difference shows how much test-time compute helps.
- "Why is Chatbot Arena criticized?" → Stylistic preferences (emoji use, response length, formatting) influence votes more than capability; not a clean capability eval. Style-controlled variants exist.
- "What does SWE-Bench measure that HumanEval doesn't?" → Real multi-file code edits in actual repos with real test suites; agent loop (read → patch → test); long-context understanding; agentic planning. HumanEval = single-function gen.
- "How would you build an internal eval for your team?" → Curated holdout from real prod traffic; LLM-as-judge with calibration; human spot-checks; per-segment slicing (easy/hard/by topic); regression gates; track over time.