REFERENCE · CHEAT SHEET

Eval benchmarks — reference

Sr Staff candidates fluently discuss MMLU vs MMLU-Pro vs GPQA. They know which benchmarks are saturated, which are contaminated, and which are still at the frontier. This page is the cheat sheet: bookmark it and drill it before any onsite.

Read: ~20 min · Use as: pre-onsite reference · Asked at: all frontier labs
01
FOUNDATIONS · WHY BENCHMARKS

Why benchmarks matter (and why they all become saturated)

TL;DR

Benchmarks let you compare models on a fixed measuring stick. The catch: every benchmark eventually saturates (frontier models score > 90%) AND eventually leaks into pretraining data (contamination). The good 2026 ones are hand-written by experts AND continuously refreshed.

THE INSIGHT — contamination is everywhere

A high score doesn't always mean the model can do the task

Many benchmarks (HumanEval, MATH, GSM8K) appear in pretraining data. The model may have memorized the test set. Modern benchmarks defend against this:

  • Continuously refreshed — LiveCodeBench (monthly Codeforces), LiveBench.
  • Hand-curated & held out — GPQA-Diamond, FrontierMath, HLE.
  • Post-cutoff data — AIME 2024/2025 dropped after most model cutoffs.
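A quick gut-check for contamination is an n-gram overlap scan between benchmark items and pretraining text. A minimal sketch, assuming a 13-gram window (a commonly used decontamination choice) and a pre-built corpus_ngrams index; both are illustrative, not any lab's actual pipeline:

```python
# Illustrative n-gram overlap contamination check (13-gram window is an assumption;
# real decontamination pipelines index the full pretraining corpus offline).
def ngrams(text: str, n: int = 13) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_ngrams: set, n: int = 13) -> bool:
    # Flag the item if any of its n-grams also appears in the pretraining index.
    return not ngrams(benchmark_item, n).isdisjoint(corpus_ngrams)

# Usage: corpus_ngrams would be built from (a sample of) pretraining documents.
corpus_ngrams = ngrams("example pretraining document text goes here " * 5)
print(is_contaminated("example pretraining document text goes here " * 3, corpus_ngrams))
```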

What "saturated" means

A benchmark is saturated when frontier models score > 90-95%. At that point, score differences are noise; you can't distinguish models. Move to harder ones (MMLU-Pro, GPQA-Diamond, FrontierMath, HLE). Saturated benchmarks aren't useless — they remain reasonable lower-bound sanity checks — but they don't differentiate frontier work.

REMEMBER
  • Benchmarks are tools, not truth. They saturate and leak.
  • 2026 frontier evals: GPQA-Diamond, FrontierMath, HLE, SWE-Bench Verified, LiveCodeBench, ARC-AGI-2.
  • Always ask: contamination? saturation? what's the human baseline?
02
KNOWLEDGE · GENERAL CAPABILITY

Knowledge benchmarks — MMLU, MMLU-Pro, GPQA, HLE

TL;DR

MMLU is saturated; stop citing it as the headline number. MMLU-Pro is the harder mid-tier (10-option MCQ). GPQA-Diamond is contamination-resistant PhD-level science. HLE (2025) is the new "next MMLU": very low SOTA, designed to differentiate frontier models for years.

Benchmark | Format | What it measures | Status
MMLU | 57-subject multiple choice | Broad academic knowledge | Saturated (~90%+ frontier; ~88% Claude 3.5)
MMLU-Pro | 10-option MCQ, harder distractors | Same domains, less guessable | Less saturated (~80% top, ~70% Llama 3.1 405B)
GPQA-Diamond | Hand-crafted PhD-level science | Contamination-resistant reasoning | ~85% SOTA (o3); ~50% GPT-4o; ~40% Llama 3.1 405B
HLE (Humanity's Last Exam) | ~3,000 expert-written hard questions | Polymath frontier knowledge; 2025, "the next MMLU" | Very low SOTA still (~25% top)
BBH (BIG-Bench Hard) | 23 hard tasks from BIG-Bench | Reasoning + symbolic | Saturated for top models
HellaSwag | Sentence completion | Commonsense | Fully saturated
ARC-Challenge | Grade-school science MCQ | Easy reasoning | Saturated
WinoGrande | Pronoun resolution | Coreference / commonsense | Saturated
REMEMBER
  • MMLU = stale. Cite MMLU-Pro for capability differentiation.
  • GPQA-Diamond = the contamination-resistant PhD eval.
  • HLE = "next MMLU." Watch this one over the next 2 years.
03
MATH · COMPETITION → RESEARCH

Math benchmarks — GSM8K → MATH → AIME → FrontierMath → HLE

TL;DR

The math benchmark ladder mirrors the reasoning-model arc. GSM8K and MATH are saturated. AIME 2024/2025 is the current high-signal eval. FrontierMath and HLE are the hard frontier — very low SOTA, designed to last.

Benchmark | Format | What it measures | Status
GSM8K | Grade-school word problems | Multi-step arithmetic | Saturated (>95%); contamination concern
MATH | Competition problems | Algebra / geometry / calculus | ~95%+ SOTA reasoning models
AIME 2024 / 2025 | 15 problems, integer answers | Olympiad-level problem solving | Current high-signal eval; o3 ~96% on AIME 2024
FrontierMath | Hand-crafted research-level math | Hard, novel math problems | Was <3% pre-o3; o3 jumped to ~25%
OlympiadBench | Olympiad math + physics | Multi-modal (some problems have figures) | ~50% top reasoning models
Omni-Math | Comprehensive math eval | Comprehensive coverage | Newer; less contaminated
REMEMBER
  • Saturated: GSM8K, MATH. Don't lead with these.
  • Lead with: AIME 2024/2025, FrontierMath.
  • o3's FrontierMath jump (3% → 25%) was the headline 2024 reasoning result.
04
CODE · FUNCTION → AGENT

Code benchmarks — HumanEval → LiveCodeBench → SWE-Bench

TL;DR

HumanEval and MBPP are saturated and contaminated. LiveCodeBench (continuously refreshed Codeforces/LeetCode) is the contamination-resistant alternative. SWE-Bench Verified is the gold standard for code agents — real GitHub issues with real test suites.

Benchmark | Format | What it measures | Status
HumanEval | 164 hand-written Python functions | Function-level code gen | Saturated (>95%); heavy contamination
MBPP | 974 simple Python problems | Function-level code gen | Saturated
HumanEval+ / MBPP+ (EvalPlus) | Same problems, more tests | Catches edge-case failures | Still useful; ~80% top
LiveCodeBench | Continuously updated Codeforces / LeetCode problems | Contamination-resistant code gen | ~70% SOTA on hardest segments
SWE-Bench (full) | 2,294 real GitHub issues | Multi-file code edits with tests | ~30-40% top
SWE-Bench Verified | 500 hand-verified subset | Same, cleaned (no flaky tests) | ~70%+ top agents (Devin, Claude Code)
SWE-Bench Lite | Smaller, easier subset | Faster eval iteration | ~50%+
BigCodeBench | Library API usage tasks | Practical code with stdlib + libs | ~50%
RepoBench | Repository-level completion | Long-context code understanding | Active
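Under the hood, function-level benchmarks like HumanEval reduce to: execute the generated function against hidden unit tests and count a pass only if every assertion holds. A minimal sketch; real harnesses run this in a sandbox with timeouts, and the candidate/test strings here are made-up examples:

```python
# Minimal HumanEval-style check: run generated code against unit tests.
# Real harnesses isolate execution (subprocess, container, timeout); this sketch does not.
def passes_tests(candidate_code: str, test_code: str) -> bool:
    env: dict = {}
    try:
        exec(candidate_code, env)   # define the candidate function
        exec(test_code, env)        # run the assertions against it
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(candidate, tests))  # True -> this sample counts toward pass@1
```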
REMEMBER
  • HumanEval = saturated + leaked. Stop citing it as the headline.
  • LiveCodeBench = contamination-resistant function-level eval.
  • SWE-Bench Verified = the gold standard for code agents.
05
REASONING · ABSTRACT

Reasoning & abstract — ARC-AGI and friends

TL;DR

ARC-AGI is Chollet's "novel reasoning" test — visual abstract puzzles where the rule must be inferred from a few examples. o3 broke ARC-AGI (76%) in late 2024; ARC-AGI-2 reset the bar.

Benchmark | Format | What it measures | Status
ARC-AGI / ARC-AGI-2 | Visual abstract reasoning grids | Few-shot novel reasoning | Chollet's eval; o3 breakthrough late 2024 (76%); ARC-AGI-2 reset the bar
BIG-Bench | 200+ tasks | Broad capability sampling | Most subtasks saturated; superseded by BBH and others
DROP | Reading comprehension w/ math | Discrete reasoning over paragraphs | Saturated
REMEMBER
  • ARC-AGI = the novel-reasoning test. ARC-AGI-2 is the current frontier.
  • If asked about Chollet's evals, mention his Ndea program-synthesis lab too.
06
LONG CONTEXT · RETRIEVAL

Long-context evals — NIAH → RULER → BABILong

TL;DR

NIAH (needle-in-a-haystack) is saturated for any modern long-context model. RULER's multi-needle / multi-hop variants are the realistic test — most models drop sharply past 32k context, no matter what their advertised window is.

Benchmark | Format | What it measures
NIAH (Needle in a Haystack) | Plant a fact, ask for retrieval at varying depths | Basic long-context retrieval; saturated for any modern long-context model
RULER | Multi-needle, multi-hop, aggregation | Realistic long-context use; better than NIAH; most models drop sharply past 32k
BABILong | bAbI tasks padded to long context | Reasoning over long context
Loong / LooGLE / ZeroSCROLLS | Long-doc QA | Real long-context tasks
LongBench | 21-task suite | Comprehensive, but somewhat saturated
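NIAH is simple enough to reproduce in a few lines: plant a "needle" sentence at a chosen depth inside filler text, ask for it back, and grade by substring match. A minimal sketch; the filler, needle, and depth grid are illustrative assumptions:

```python
# Toy needle-in-a-haystack builder: insert the needle at a relative depth,
# then grade the model's answer by checking for the planted fact.
FILLER = "The sky was clear and the market was quiet that day. " * 2000
NEEDLE = "The secret passcode for the archive is 7341."
QUESTION = "\n\nQuestion: What is the secret passcode for the archive?"

def build_prompt(depth: float) -> str:
    cut = int(len(FILLER) * depth)          # 0.0 = start of context, 1.0 = end
    return FILLER[:cut] + " " + NEEDLE + " " + FILLER[cut:] + QUESTION

def is_correct(model_answer: str) -> bool:
    return "7341" in model_answer

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_prompt(depth)            # send to the model under test
    print(depth, len(prompt))               # accuracy is tabulated per (length, depth) cell
```

RULER's multi-needle and aggregation variants extend the same idea with several planted facts and questions that require combining them.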
PITFALL — context length advertised ≠ usable
A model claiming 1M context window can still degrade past 32k on RULER. Real long-context performance requires both architecture (RoPE extension, MLA) and continued-training on long-doc data. Always check RULER, not just NIAH.
REMEMBER
  • NIAH is too easy. Cite RULER for real long-context.
  • Most "1M context" models are not actually usable past ~64k. Verify.
07
MULTIMODAL · VISION + TEXT

Multimodal — MMMU, MathVista, VideoMME

TL;DR

MMMU is the headline multimodal eval (college-level reasoning across 30 subjects with images). MathVista probes visual + math reasoning. VideoMME is the harder long-video eval. Specialty subsets (OCR, charts, docs) matter for production use cases.

Benchmark | Format | What it measures
MMMU | 11k MCQ across 30 subjects, multi-image | College-level multimodal reasoning; ~70% SOTA, lower on the harder Pro variant
MathVista | Math with visual context | Visual + math reasoning
VQAv2 | Open-ended visual QA | Saturated (~85%+)
OCR-Bench, ChartQA, DocVQA | Specific multimodal tasks | OCR / chart / doc understanding
VideoMME | Video understanding QA | Long-video QA; harder than image-only
REMEMBER
  • MMMU = college-level multimodal headline eval.
  • For production: use the specialty evals (OCR, chart, doc, video) that match your domain.
08
INSTRUCTION FOLLOWING · CHAT

Instruction following — IFEval, MT-Bench, Chatbot Arena

TL;DR

IFEval (verifiable format constraints) is the cleanest instruction-following eval. MT-Bench and Arena are LLM-as-judge / human-preference — fuzzier signal. Chatbot Arena is the most user-facing leaderboard but is biased by stylistic preference.

Benchmark | Format | What it measures
IFEval | Verifiable format constraints | "Answer in exactly 50 words", "include the word X 3 times"; programmatic check; ~85% top
MT-Bench | 80 multi-turn prompts, GPT-4 judge | General chat quality; ~9/10 top
AlpacaEval 2 / Arena-Hard | LLM-as-judge win rate | Pairwise preference vs a reference model
Chatbot Arena | Crowdsourced pairwise votes (LMSYS) | Real-user preferences; most-cited "feel" leaderboard; caveats around stylistic preference bias
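What makes IFEval "clean" is that each constraint is checked by plain code rather than a judge model. A minimal sketch of two checkers in that spirit (function names are mine, not the IFEval codebase):

```python
# IFEval-style verifiable checks: each constraint maps to a deterministic predicate.
def exactly_n_words(answer: str, n: int) -> bool:
    return len(answer.split()) == n

def contains_word_k_times(answer: str, word: str, k: int) -> bool:
    return answer.lower().split().count(word.lower()) == k

answer = "..."  # model output to be scored
results = {
    "exactly_50_words": exactly_n_words(answer, 50),
    "includes_x_3_times": contains_word_k_times(answer, "x", 3),
}
score = sum(results.values()) / len(results)  # fraction of constraints satisfied
print(results, score)
```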
PITFALL — Chatbot Arena style bias
Stylistic preferences (emoji use, response length, formatting) influence votes more than capability — voters reward "looks confident and structured." Style-controlled variants exist. Don't quote raw Arena ELO as a pure-capability metric.
REMEMBER
  • IFEval = clean programmatic instruction-following metric.
  • Arena = human preference + style bias. Useful but with caveats.
09
AGENTS · TOOL USE

Agents & tool use — SWE-Bench Verified, GAIA, OSWorld, BFCL

TL;DR

SWE-Bench Verified is the gold standard for software-engineering agents. GAIA tests general assistant agents on real-world tasks. OSWorld and WebArena test computer use. BFCL is the standard for tool/function-calling accuracy.

Benchmark | Format | What it measures
SWE-Bench (Verified) | GitHub issue → patch | Software-engineering agent; currently the gold standard
AgentBench | 8 environments | Multi-domain agent eval
GAIA | Real-world tasks needing web + tools | General assistant agent; hard
WebArena / WebShop / VisualWebArena | Browser-driven tasks | Computer-use / browser-agent eval
OSWorld | Computer-use desktop tasks | Multi-app workflows
tau-bench | Multi-turn customer-service tool use | Conversational agent w/ tools
BFCL (Berkeley Function Calling Leaderboard) | Function-calling accuracy | Tool selection + argument extraction
ToolBench / API-Bank | Multi-step API use | End-to-end tool-using agent
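Function-calling evals reduce to two questions: did the model pick the right tool, and do the extracted arguments match the gold call? BFCL itself does AST-level matching with type checks; the exact-match version below is a simplified stand-in:

```python
# Simplified function-call scoring: correct tool name + exact argument match.
# BFCL uses AST-based comparison with per-parameter type/constraint checks.
def call_is_correct(predicted: dict, gold: dict) -> bool:
    return (
        predicted.get("name") == gold["name"]
        and predicted.get("arguments") == gold["arguments"]
    )

gold = {"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}
pred = {"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}
print(call_is_correct(pred, gold))  # True -> counts toward function-calling accuracy
```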
REMEMBER
  • SWE-Bench Verified = the SWE agent leaderboard.
  • BFCL = the function-calling correctness leaderboard.
  • OSWorld / WebArena = the computer-use frontier.
10
SAFETY · RED-TEAMING

Safety & red-teaming — HarmBench, XSTest, JailbreakBench, WMDP

TL;DR

HarmBench measures attack success on harmful behaviors. XSTest catches over-refusal (false positives). WMDP probes weapons-of-mass-destruction-proxy knowledge. TruthfulQA probes whether models parrot common misconceptions.

Benchmark | What it measures
HarmBench | 200+ harmful behaviors with jailbreak attempts; measures attack success
XSTest | Over-refusal: 250 safe prompts that look unsafe; measures false-refusal rate
JailbreakBench | Standard jailbreak corpus (PAIR, AutoDAN, etc.)
WMDP | Weapons-of-mass-destruction proxy; probes dangerous knowledge
BBQ | Bias in QA; measures stereotype reliance
TruthfulQA | Misconceptions / falsehoods; measures whether the model parrots common errors
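The two headline safety numbers are computed the same simple way from judged outcomes: attack-success rate over harmful prompts (HarmBench-style) and false-refusal rate over safe-but-scary prompts (XSTest-style). A toy sketch with made-up judgments:

```python
# Toy safety metrics from judged outcomes (the lists below are fabricated examples).
harmful_complied = [True, False, False, True]   # True = model complied with a harmful prompt
benign_refused   = [False, False, True, False]  # True = model refused a safe prompt

attack_success_rate = sum(harmful_complied) / len(harmful_complied)  # lower is safer
false_refusal_rate  = sum(benign_refused) / len(benign_refused)      # lower is more helpful

print(f"ASR: {attack_success_rate:.0%}, over-refusal: {false_refusal_rate:.0%}")
```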
REMEMBER
  • HarmBench + XSTest pair: attack success vs over-refusal — both matter.
  • WMDP = the dangerous-capability eval that affects RSP / Preparedness levels.
11
META · HOW TO READ

How to read a benchmark score (the meta-skill)

TL;DR

A score in isolation is meaningless. Always ask: prompt template, sampling strategy (pass@1 vs N), test vs holdout, contamination risk, human baseline. Sr Staff candidates question every score reflexively.

  1. What's the prompt template? Few-shot vs zero-shot, CoT vs no-CoT, system prompt — all change scores by 5-15%.
  2. Pass@1, pass@k, maj@N, or best-of-N? Wildly different compute footprints; sampling strategy can swing scores 20+ points (see the pass@k sketch after this list).
  3. Test set or private holdout? If test, contamination risk applies.
  4. Was it contaminated in pretraining? Check the data cutoff vs the benchmark publication date.
  5. What's the human baseline? Without it, a 60% score is uninterpretable: it could be far above expert level on a hard eval or well below human performance on an easy one.
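For item 2, the standard formulation (from the HumanEval/Codex paper) is: draw n samples per problem, count c correct, and compute the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem, c of them correct."""
    if n - c < k:              # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 correct.
print(pass_at_k(20, 5, 1))    # 0.25  -> production-relevant single-shot quality
print(pass_at_k(20, 5, 10))   # ~0.98 -> ceiling under heavier sampling
```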

Holistic / aggregate evals worth knowing

REMEMBER
  • Always interrogate the methodology before quoting a score.
  • Pass@1 is the production-relevant metric. Pass@N shows ceiling.
  • If a paper doesn't disclose prompt template + sampling, the score is suspect.

Sample interview Qs

  1. "What's the difference between MMLU and MMLU-Pro?" → MMLU-Pro: 10 options vs 4, harder distractors, more reasoning required, less saturated.
  2. "Why is GPQA-Diamond contamination-resistant?" → Hand-written by domain PhDs; many problems require multi-step reasoning that's not memorizable; held out from web crawl.
  3. "Pass@1 vs Pass@10 — when each?" → Pass@1: production-relevant (one-shot quality). Pass@10: capability ceiling under sampling. Difference shows how much test-time compute helps.
  4. "Why is Chatbot Arena criticized?" → Stylistic preferences (emoji use, response length, formatting) influence votes more than capability; not a clean capability eval. Style-controlled variants exist.
  5. "What does SWE-Bench measure that HumanEval doesn't?" → Real multi-file code edits in actual repos with real test suites; agent loop (read → patch → test); long-context understanding; agentic planning. HumanEval = single-function gen.
  6. "How would you build an internal eval for your team?" → Curated holdout from real prod traffic; LLM-as-judge with calibration; human spot-checks; per-segment slicing (easy/hard/by topic); regression gates; track over time.