Eval benchmarks — reference
Sr Staff candidates fluently discuss MMLU vs MMLU-Pro vs GPQA. They know which benchmarks are saturated, which are contaminated, and which still differentiate frontier models. This page is the cheat sheet — bookmark it, drill it before any onsite.
What you'll learn
- Why benchmarks matter (and why they all become saturated)
- Knowledge / general capability — MMLU, MMLU-Pro, GPQA, HLE
- Math benchmarks — GSM8K → MATH → AIME → FrontierMath → HLE
- Code benchmarks — HumanEval → LiveCodeBench → SWE-Bench
- Reasoning & abstract — ARC-AGI and friends
- Long-context evals — NIAH → RULER → BABILong
- Multimodal — MMMU, MathVista, VideoMME
- Instruction following — IFEval, MT-Bench, Chatbot Arena
- Agents & tool use — SWE-Bench Verified, GAIA, OSWorld, BFCL
- Safety & red-teaming — HarmBench, XSTest, JailbreakBench, WMDP
- How to read a benchmark score (the meta-skill)
Benchmarks let you compare models on a fixed measuring stick. The catch: every benchmark eventually saturates (frontier models score > 90%) and eventually leaks into pretraining data (contamination). The good 2026 ones are hand-written by experts, continuously refreshed, or both.
A high score doesn't always mean the model can do the task
Many benchmarks (HumanEval, MATH, GSM8K) appear in pretraining data, so the model may have memorized the test set. Modern benchmarks defend against this in three ways (a toy contamination check follows the list):
- Continuously refreshed — LiveCodeBench (monthly Codeforces), LiveBench.
- Hand-curated & held out — GPQA-Diamond, FrontierMath, HLE.
- Post-cutoff data — AIME 2024/2025 dropped after most model cutoffs.
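To make the contamination question concrete: most decontamination reports use an n-gram overlap heuristic between eval items and the training corpus. A toy sketch of that idea (real pipelines index tokenized corpora with suffix arrays or Bloom filters; the function names here are illustrative):

```python
# Toy contamination check: flag eval items sharing any long n-gram with
# the training corpus. Real pipelines index tokenized text with suffix
# arrays or Bloom filters; this only illustrates the idea.

def ngrams(text: str, n: int = 13) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(training_docs: list[str], n: int = 13) -> set[str]:
    index: set[str] = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(eval_item: str, index: set[str], n: int = 13) -> bool:
    # Any shared 13-gram suggests a verbatim leak of this test item.
    return bool(ngrams(eval_item, n) & index)
```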
What "saturated" means
A benchmark is saturated when frontier models score > 90-95%. At that point, score differences are noise; you can't distinguish models. Move to harder ones (MMLU-Pro, GPQA-Diamond, FrontierMath, HLE). Saturated benchmarks aren't useless — they remain reasonable lower-bound sanity checks — but they don't differentiate frontier work.
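Concretely, why score gaps at saturation are noise: accuracy over N questions carries a binomial standard error of sqrt(p(1-p)/N). A quick check, assuming a 500-item benchmark:

```python
import math

def accuracy_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for accuracy p measured over n items."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Two "different" models on a 500-question benchmark:
print(accuracy_ci(0.92, 500))  # ~(0.896, 0.944)
print(accuracy_ci(0.93, 500))  # ~(0.908, 0.952) -> intervals overlap: noise
```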
- Benchmarks are tools, not truth. They saturate and leak.
- 2026 frontier evals: GPQA-Diamond, FrontierMath, HLE, SWE-Bench Verified, LiveCodeBench, ARC-AGI-2.
- Always ask: contamination? saturation? what's the human baseline?
MMLU is saturated; stop using it as the headline number. MMLU-Pro is the harder mid-tier (10-option MCQ). GPQA-Diamond is contamination-resistant, PhD-level science. HLE (2025) is the new "next MMLU": very low SOTA, designed to differentiate frontier models for years. A minimal MCQ harness is sketched after the table.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| MMLU | 57-subject multiple choice | Broad academic knowledge | Saturated (~90%+ frontier; ~88% Claude 3.5) |
| MMLU-Pro | 10-option MCQ, harder distractors | Same domains, less guessable | Less saturated (~80% top, ~70% Llama 3.1 405B) |
| GPQA-Diamond | Hand-crafted PhD-level science | Contamination-resistant reasoning | ~85% SOTA (o3); ~50% GPT-4o; ~40% Llama 3.1 405B |
| HLE (Humanity's Last Exam) | ~3,000 expert-written questions across many domains | 2025; "the next MMLU" | Very low SOTA still (~25% top) |
| BBH (BIG-Bench Hard) | 23 hard tasks from BIG-Bench | Reasoning + symbolic | Saturated for top models |
| HellaSwag | Sentence completion | Commonsense | Fully saturated |
| ARC-Challenge | Grade-school science MCQ | Easy reasoning | Saturated |
| WinoGrande | Pronoun resolution | Coreference / commonsense | Saturated |
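If you've never run one of these yourself: MCQ evals like MMLU reduce to a prompt template, an answer-letter extractor, and exact match. A minimal sketch (`ask_model` is a stand-in for your inference client; the template and extraction regex are assumptions, and both move scores):

```python
import re

CHOICES = "ABCD"

def format_prompt(q: dict) -> str:
    options = "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, q["options"]))
    return f"{q['question']}\n{options}\nAnswer with a single letter."

def extract_letter(completion: str) -> str | None:
    m = re.search(r"\b([ABCD])\b", completion)
    return m.group(1) if m else None

def mcq_accuracy(questions: list[dict], ask_model) -> float:
    # Exact match on the extracted letter; ties into the "prompt template
    # changes scores by 5-15%" point later in this page.
    correct = sum(
        extract_letter(ask_model(format_prompt(q))) == q["answer"]
        for q in questions
    )
    return correct / len(questions)
```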
- MMLU = stale. Cite MMLU-Pro for capacity differentiation.
- GPQA-Diamond = the contamination-resistant PhD eval.
- HLE = "next MMLU." Watch this one over the next 2 years.
The math benchmark ladder mirrors the reasoning-model arc. GSM8K and MATH are saturated. AIME 2024/2025 is the current high-signal eval. FrontierMath and HLE are the hard frontier — very low SOTA, designed to last.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| GSM8K | Grade-school word problems | Multi-step arithmetic | Saturated (>95%); contamination concern |
| MATH | Competition problems | Algebra / geometry / calculus | ~95%+ SOTA reasoning models |
| AIME 2024 / 2025 | 15 problems, integer answer | Olympiad-level | Current high-signal eval. o3: ~96% AIME 2024 |
| FrontierMath | Hand-crafted research-level math | Hard novel math problems | Was <3% pre-o3; o3 jumped to ~25% |
| OlympiadBench | Olympiad math + physics | Multi-modal (some have figures) | ~50% top reasoning models |
| Omni-Math | Comprehensive math eval | Comprehensive coverage | Newer; less contaminated |
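One reason AIME is such a clean eval: answers are integers 0-999, so grading is pure exact match after normalization. A sketch, assuming the common \boxed{} answer convention (the fallback heuristic is my own):

```python
import re

def extract_final_answer(completion: str) -> int | None:
    """Pull the last \\boxed{...} integer, the usual convention in math evals."""
    boxed = re.findall(r"\\boxed\{(\d+)\}", completion)
    if boxed:
        return int(boxed[-1])
    nums = re.findall(r"\d+", completion)  # fallback: last bare integer
    return int(nums[-1]) if nums else None

def grade_aime(completion: str, gold: int) -> bool:
    ans = extract_final_answer(completion)
    return ans is not None and 0 <= ans <= 999 and ans == gold
```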
- Saturated: GSM8K, MATH. Don't lead with these.
- Lead with: AIME 2024/2025, FrontierMath.
- o3's FrontierMath jump (3% → 25%) was the headline 2024 reasoning result.
HumanEval and MBPP are saturated and contaminated. LiveCodeBench (continuously refreshed Codeforces/LeetCode) is the contamination-resistant alternative. SWE-Bench Verified is the gold standard for code agents — real GitHub issues with real test suites.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| HumanEval | 164 hand-written Python functions | Function-level code gen | Saturated (>95%); heavy contamination |
| MBPP | 974 simple Python problems | Function-level | Saturated |
| HumanEval+ / MBPP+ (EvalPlus) | Same problems, more tests | Catches edge-case failures | Still useful; ~80% top |
| LiveCodeBench | Continuously-updated Codeforces / LeetCode | Contamination-resistant code gen | ~70% SOTA on hardest segments |
| SWE-Bench (full) | 2294 real GitHub issues | Multi-file code edits with tests | ~30-40% top |
| SWE-Bench Verified | 500 hand-verified subset | Same, cleaned (no flaky tests) | ~70%+ top agents (Devin, Claude Code) |
| SWE-Bench Lite | Smaller, easier subset | Faster eval iteration | ~50%+ |
| BigCodeBench | Library API usage tasks | Practical code with stdlib + libs | ~50% |
| RepoBench | Repository-level completion | Long-context code understanding | Active |
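Function-level code evals are execution-based: a sample passes iff the benchmark's unit tests run clean against the generated code. A stripped-down sketch of the idea (no sandboxing here; real harnesses run untrusted code in isolated subprocesses with timeouts):

```python
# HumanEval-style grading: execute the candidate, then the tests.
# WARNING: exec on untrusted model output is unsafe; real harnesses
# isolate this in a subprocess with time and resource limits.

def passes_tests(candidate_code: str, test_code: str) -> bool:
    env: dict = {}
    try:
        exec(candidate_code, env)  # defines the generated function
        exec(test_code, env)       # test asserts raise on failure
        return True
    except Exception:
        return False

def pass_at_1(samples: list[str], test_code: str) -> float:
    # Average pass rate over independent samples = pass@1.
    return sum(passes_tests(s, test_code) for s in samples) / len(samples)
```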
- HumanEval = saturated + leaked. Stop citing it as the headline.
- LiveCodeBench = contamination-resistant function-level eval.
- SWE-Bench Verified = the gold standard for code agents.
ARC-AGI is Chollet's "novel reasoning" test — visual abstract puzzles where the rule must be inferred from a few examples. o3 broke ARC-AGI (76%) in late 2024; ARC-AGI-2 reset the bar.
| Benchmark | Format | What it measures | Status |
|---|---|---|---|
| ARC-AGI / ARC-AGI-2 | Visual abstract reasoning grids | "Few-shot" novel reasoning | Chollet's eval. o3 breakthrough late 2024 (76%); ARC-AGI-2 reset bar |
| BIG-Bench | 200+ tasks | Broad capability sampling | Most subtasks saturated; superseded by BBH and others |
| DROP | Reading comprehension w/ math | Discrete reasoning over paragraphs | Saturated |
- ARC-AGI = the novel-reasoning test. ARC-AGI-2 is the current frontier.
- If asked about Chollet's evals, mention his Ndea program-synthesis lab too.
NIAH (needle-in-a-haystack) is saturated for any modern long-context model. RULER's multi-needle / multi-hop variants are the realistic test: most models drop sharply past 32k context, regardless of their advertised window.
| Benchmark | Format | What it measures |
|---|---|---|
| NIAH (Needle in a Haystack) | Plant a fact, ask for retrieval at varying depths | Basic long-context retrieval. Saturated for any modern long-context model. |
| RULER | Multi-needle, multi-hop, aggregation | Realistic long-context use; better than NIAH. Most models drop sharply past 32k. |
| BABILong | bAbI tasks padded to long context | Reasoning over long context |
| Loong / Loogle / ZeroSCROLLS | Long-doc QA | Real long-context tasks |
| LongBench | 21 task suite | Comprehensive but somewhat saturated |
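The NIAH recipe is mechanical, which is part of why it saturated: plant one fact at a fractional depth in filler text, ask for it back, sweep depth and context length. A sketch (`ask_model` and the needle text are placeholders):

```python
# Needle-in-a-haystack probe. RULER hardens this with multiple needles,
# distractors, and aggregation questions; the skeleton is the same.

NEEDLE = "The magic number for project Osprey is 7421."  # invented fact

def build_haystack(filler: str, depth: float) -> str:
    cut = int(len(filler) * depth)
    return filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]

def niah_sweep(filler: str, ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    question = "\nWhat is the magic number for project Osprey?"
    return {
        d: "7421" in ask_model(build_haystack(filler, d) + question)
        for d in depths
    }  # retrieval success by insertion depth
```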
- NIAH is too easy. Cite RULER for real long-context.
- Most "1M context" models are not actually usable past ~64k. Verify.
MMMU is the headline multimodal eval (college-level reasoning across 30 subjects with images). MathVista probes visual + math reasoning. VideoMME is the harder long-video eval. Specialty benchmarks (OCR, charts, docs) cover production use cases.
| Benchmark | Format | What it measures |
|---|---|---|
| MMMU | 11k MCQ across 30 subjects, multi-image | College-level multimodal reasoning. ~70% SOTA; the harder MMMU-Pro variant scores well below that. |
| MathVista | Math with visual context | Visual + math reasoning |
| VQAv2 | Open-ended visual QA | Saturated (~85%+) |
| OCR-Bench, ChartQA, DocVQA | Specific multimodal tasks | OCR / chart / doc understanding |
| VideoMME | Video understanding QA | Long-video QA; harder than image-only |
- MMMU = college-level multimodal headline eval.
- For production: use the specialty evals (OCR, chart, doc, video) that match your domain.
IFEval (verifiable format constraints) is the cleanest instruction-following eval. MT-Bench and Arena are LLM-as-judge / human-preference — fuzzier signal. Chatbot Arena is the most user-facing leaderboard but is biased by stylistic preference.
| Benchmark | Format | What it measures |
|---|---|---|
| IFEval | Verifiable format constraints | "Answer in exactly 50 words", "include the word X 3 times". Programmatic check. ~85% top. |
| MT-Bench | 80 multi-turn prompts, GPT-4 judge | General chat quality. ~9/10 top. |
| AlpacaEval 2 / Arena-Hard | LLM-as-judge winrate | Pairwise preference vs reference |
| Chatbot Arena | Crowdsourced pairwise (lmsys) | Real-user preferences. Most-cited "feel" leaderboard. Caveats around stylistic preference bias. |
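IFEval's defining property: every constraint ships with a programmatic verifier, so no judge model is needed. Two toy verifiers in that spirit (my own simplified versions, not the official implementations):

```python
import re

def check_exact_word_count(response: str, n: int = 50) -> bool:
    """Verifier for 'answer in exactly N words'."""
    return len(response.split()) == n

def check_keyword_count(response: str, word: str, times: int = 3) -> bool:
    """Verifier for 'include the word X exactly K times' (case-insensitive)."""
    return len(re.findall(rf"\b{re.escape(word)}\b", response, re.I)) == times

# Score = fraction of constraints whose verifier returns True.
```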
- IFEval = clean programmatic instruction-following metric.
- Arena = human preference + style bias. Useful but with caveats.
SWE-Bench Verified is the gold standard for software-engineering agents. GAIA tests general assistant agents on real-world tasks. OSWorld and WebArena test computer use. BFCL is the standard for tool/function-calling accuracy.
| Benchmark | Format | What it measures |
|---|---|---|
| SWE-Bench (Verified) | GitHub issue → patch | Software engineering agent. Currently the gold standard. |
| AgentBench | 8 environments | Multi-domain agent eval |
| GAIA | Real-world tasks needing web + tools | General assistant agent. Hard. |
| WebArena / WebShop / VisualWebArena | Browser-driven tasks | Computer-use / browser-agent eval |
| OSWorld | Computer-use desktop tasks | Multi-app workflows |
| tau-bench | Multi-turn customer-service tool use | Conversational agent w/ tools |
| BFCL (Berkeley Function Calling Leaderboard) | Function-calling accuracy | Tool selection + arg extraction |
| ToolBench / API-Bank | Multi-step API use | End-to-end tool-using agent |
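Function-calling scoring boils down to two checks per turn: right tool, right arguments. BFCL does AST-level structural matching with type coercion and allowed-value sets; this minimal sketch only does exact structural match:

```python
# Minimal function-calling grader: tool selection + argument match.
# A simplification of BFCL-style AST matching (no type coercion here).

def call_matches(predicted: dict, gold: dict) -> bool:
    if predicted.get("name") != gold["name"]:
        return False  # wrong tool selected
    # Arguments must match exactly; missing or extra keys are a miss.
    return predicted.get("arguments", {}) == gold["arguments"]

def call_accuracy(preds: list[dict], golds: list[dict]) -> float:
    return sum(call_matches(p, g) for p, g in zip(preds, golds)) / len(golds)
```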
- SWE-Bench Verified = the SWE agent leaderboard.
- BFCL = the function-calling correctness leaderboard.
- OSWorld / WebArena = the computer-use frontier.
HarmBench measures attack success on harmful behaviors. XSTest catches over-refusal (false positives). WMDP probes weapons-of-mass-destruction-proxy knowledge. TruthfulQA probes whether models parrot common misconceptions.
| Benchmark | What it measures |
|---|---|
| HarmBench | 200+ harmful behaviors with jailbreak attempts. Measures attack success. |
| XSTest | Over-refusal: 250 safe prompts that look unsafe. Measures false-refusal rate. |
| JailbreakBench | Standard jailbreak corpus (PAIR, AutoDAN, etc.) |
| WMDP | Weapons-of-mass-destruction proxy. Probes dangerous knowledge. |
| BBQ | Bias in QA. Measures stereotype reliance. |
| TruthfulQA | Misconceptions / falsehoods. Measures whether model parrots common errors. |
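The HarmBench/XSTest pairing is two rates in tension, and you should report both. A sketch, with `is_harmful` and `is_refusal` standing in for trained classifiers or judge models (assumed, not real APIs):

```python
# Safety as two numbers that trade off against each other.

def attack_success_rate(responses: list[str], is_harmful) -> float:
    """HarmBench-style ASR: fraction of harmful prompts eliciting harmful output."""
    return sum(map(is_harmful, responses)) / len(responses)

def false_refusal_rate(responses: list[str], is_refusal) -> float:
    """XSTest-style: fraction of safe-but-scary prompts the model refuses."""
    return sum(map(is_refusal, responses)) / len(responses)

# A model can zero out ASR by refusing everything, which is why the
# over-refusal number has to be reported alongside it.
```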
- HarmBench + XSTest pair: attack success vs over-refusal — both matter.
- WMDP = the dangerous-capability eval that affects RSP / Preparedness levels.
A score in isolation is meaningless. Always ask: prompt template, sampling strategy (pass@1 vs N), test vs holdout, contamination risk, human baseline. Sr Staff candidates question every score reflexively.
- What's the prompt template? Few-shot vs zero-shot, CoT vs no-CoT, system prompt — all change scores by 5-15%.
- Pass@1, pass@k, maj@N, or best-of-N? Wildly different compute footprints; sampling strategy can swing scores 20+ points (estimator sketched after this list).
- Test set or private holdout? If test, contamination risk applies.
- Was it contaminated in pretraining? Check the data cutoff vs the benchmark publication date.
- What's the human baseline? A model scoring 60% might be above expert level on a hard eval or far below humans on an easy one.
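The estimator behind that pass@k bullet, from the HumanEval paper (Chen et al., 2021): draw n samples per problem, count c correct, and compute an unbiased pass@k rather than naively resampling:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=100, c=20, k=1))             # 0.2 (pass@1 = c/n)
print(round(pass_at_k(n=100, c=20, k=10), 3))  # ~0.905: same model, +70 points
```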
Holistic / aggregate evals worth knowing
- HELM (Stanford) — multi-metric, multi-benchmark holistic eval. Slow but thorough.
- OpenLLM Leaderboard v2 (HF) — community-run aggregate; includes GPQA, MMLU-Pro, MuSR, BBH, IFEval, MATH-Hard.
- LiveBench — contamination-free, monthly refresh.
- Chatbot Arena — community winrate. The most user-facing leaderboard; a rating-model sketch follows.
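Under the hood, Arena-style leaderboards fit pairwise votes to a rating model. lmsys fits Bradley-Terry via logistic regression over all battles; a minimal online Elo update conveys the same intuition:

```python
# Online Elo over pairwise votes: the intuition behind Arena rankings.
# The real leaderboard fits Bradley-Terry (order-independent); Elo is
# the sequential approximation.
from collections import defaultdict

ratings: dict[str, float] = defaultdict(lambda: 1000.0)

def expected_win(ra: float, rb: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def record_battle(winner: str, loser: str, k: float = 4.0) -> None:
    e = expected_win(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e)  # upset wins move ratings more
    ratings[loser] -= k * (1 - e)

for w, l in [("model_a", "model_b"), ("model_a", "model_c")]:
    record_battle(w, l)
```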
- Always interrogate the methodology before quoting a score.
- Pass@1 is the production-relevant metric. Pass@N shows ceiling.
- If a paper doesn't disclose prompt template + sampling, the score is suspect.
Sample interview Qs
- "What's the difference between MMLU and MMLU-Pro?" → MMLU-Pro: 10 options vs 4, harder distractors, more reasoning required, less saturated.
- "Why is GPQA-Diamond contamination-resistant?" → Hand-written by domain PhDs; many problems require multi-step reasoning that's not memorizable; held out from web crawl.
- "Pass@1 vs Pass@10 — when each?" → Pass@1: production-relevant (one-shot quality). Pass@10: capability ceiling under sampling. Difference shows how much test-time compute helps.
- "Why is Chatbot Arena criticized?" → Stylistic preferences (emoji use, response length, formatting) influence votes more than capability; not a clean capability eval. Style-controlled variants exist.
- "What does SWE-Bench measure that HumanEval doesn't?" → Real multi-file code edits in actual repos with real test suites; agent loop (read → patch → test); long-context understanding; agentic planning. HumanEval = single-function gen.
- "How would you build an internal eval for your team?" → Curated holdout from real prod traffic; LLM-as-judge with calibration; human spot-checks; per-segment slicing (easy/hard/by topic); regression gates; track over time.