Reasoning models — the 2025 frontier
o1, o3, DeepSeek R1, Claude extended thinking. Test-time compute is now a scaling axis as legitimate as parameter count. This chapter explains RLVR, GRPO, process reward models, MCTS, and the open R1 recipe in enough depth to hold your own in any frontier-lab interview loop.
What you'll learn
- The shift — train-time → test-time scaling
- RLVR — RL with Verifiable Rewards (the unlock)
- DeepSeek R1 — the open canon, 4-stage pipeline
- PRM vs ORM — when step rewards matter
- Best-of-N + verifier scaling laws
- Tree search at inference — MCTS, ToT, rStar, AlphaProof
- Reward hacking in reasoning — the new failure modes
- Inference-time scaling laws (Snell 2024)
- Practical deployment — thinking budgets, routing, distillation
- Eval benchmarks for reasoning
Pre-2024: more train compute = better model. Post-o1: a fixed model can also get better at inference time by thinking longer or by sampling more candidates. Test-time compute is a new scaling axis — Snell 2024 showed it can substitute for ~14× train compute on hard problems.
Two compounding axes
- Sequential test-time: longer chain-of-thought before answering. The model "thinks" for thousands of tokens (often hidden from the user).
- Parallel test-time: sample N candidate solutions, pick best (best-of-N, majority vote, or verifier-scored).
Snell et al. 2024 (arxiv 2408.03314) — "Scaling LLM Test-Time Compute Optimally": for many problems, optimal test-time compute can substitute for ~14× more pretraining compute. Smart inference can let smaller models match larger ones.
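A minimal sketch of the two axes, assuming a generic `generate(prompt, ...)` call; `max_think_tokens`, `temperature`, and `extract_answer` are placeholder names for illustration, not a specific provider's API:

```python
# Sketch only: `generate` stands in for whatever inference call you have;
# `max_think_tokens` and `extract_answer` are hypothetical names.
from collections import Counter

def sequential_scaling(generate, prompt, budgets=(1_000, 4_000, 16_000)):
    """Sequential axis: same prompt, progressively larger thinking budget."""
    return [generate(prompt, max_think_tokens=b) for b in budgets]

def parallel_scaling(generate, extract_answer, prompt, n=16):
    """Parallel axis: sample N candidates, keep the most common final answer."""
    answers = [extract_answer(generate(prompt, temperature=0.8)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```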
- Two new axes: longer CoT (sequential), more samples (parallel).
- Inference compute can substitute for train compute up to a problem-dependent ceiling.
- This is what made o1, R1, Claude extended thinking possible.
Replace the learned reward model (which can be hacked) with a programmatic verifier (which cannot be gamed on its own signal). For tasks where you can check correctness automatically — code (unit tests), math (exact match), formal proofs (Lean) — RLVR drives the model to discover unconventional reasoning patterns without reward hacking.
Reward models saturate. Verifiers don't.
Learned reward models (RLHF) are noisy proxies for human preference. The model can find inputs that score high under the RM but humans dislike — classic reward hacking. Programmatic verifiers give an objective ground-truth signal:
- Code: unit tests pass / fail.
- Math: extract final answer (regex or grader LLM), exact-match.
- Formal proofs: Lean / Coq proof checker.
- Instruction following: regex / structured-output check.
The model can write a 1000-token CoT in any style; only the final answer matters. This frees the model to discover unconventional reasoning patterns that humans wouldn't have demonstrated. Used by Tülu 3, DeepSeek R1, the OpenAI o-series, and AlphaProof.
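A hedged sketch of what a programmatic verifier looks like for the math case, assuming completions end with a line like `Answer: 42` (that format, and the function names, are assumptions for illustration):

```python
# Binary verifiable reward for math prompts: the CoT can say anything,
# only the extracted final answer is checked against the reference.
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last 'Answer: <number>' from a completion (assumed format)."""
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0
```

For code, the same role is played by running the unit tests; for formal proofs, by the Lean checker.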
- RLVR works for verifiable tasks (math, code, formal). Doesn't work for "is this funny?"
- Programmatic verifier ≠ learned reward model. Different failure modes.
- This is THE 2024 unlock that made o-series and R1 possible.
DeepSeek R1 (arxiv 2501.12948) is the open recipe for reasoning models. Two model variants: R1-Zero (pure RL on V3 base — proves reasoning emerges from RL alone) and R1 (4-stage pipeline that fixes R1-Zero's readability while preserving capability). Distilled into Qwen-7B/14B/32B and Llama-8B/70B for cheap deployment.
R1-Zero — pure RL, no SFT
- Start from DeepSeek-V3 base (strong, no instruction-tuning).
- Run GRPO with verifiable rewards on math + code prompts.
- Reasoning emerges spontaneously: long CoTs, "wait, let me reconsider" patterns, alternative-approach exploration.
- Capability excellent; readability poor (mixes languages, no formatting).
R1 — the full recipe
- Cold-start SFT: small set of curated reasoning traces with desired format. Fixes readability.
- RL: GRPO with verifiable rewards.
- Rejection-sampling SFT: sample many outputs from the stage-2 policy, keep the correct ones (~600k math/code examples), add ~200k general-purpose SFT examples, and run SFT on the combined set (sketched after this list).
- Final RL: combined verifiable + preference rewards (helpfulness, harmlessness).
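A minimal sketch of the stage-3 data build described above; `policy_sample` and `verifier` are hypothetical stand-ins for the stage-2 policy and the verifiable-reward check:

```python
# Rejection-sampling SFT: sample K completions per prompt from the stage-2
# policy, keep only verifier-accepted ones, use them as SFT targets.
def build_rejection_sampling_sft(prompts, policy_sample, verifier, k=16):
    sft_examples = []
    for prompt in prompts:
        completions = [policy_sample(prompt) for _ in range(k)]
        correct = [c for c in completions if verifier(prompt, c)]
        if correct:
            sft_examples.append({"prompt": prompt, "completion": correct[0]})
    return sft_examples
```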
Key lessons from R1
- Pure RL works on a strong base model — reasoning is "in" the base; RL elicits it.
- GRPO scales well at LLM scale (no critic/value model).
- Distilled smaller models (R1 → Qwen-7B/14B/32B, Llama-8B/70B) inherit reasoning ability cheaply.
- Cold-start matters for usability but not capability.
- R1-Zero proves: reasoning emerges from RL on a strong base.
- R1's 4 stages: cold-start SFT → RL → rejection-sampling SFT → final RL.
- Distillation transfers reasoning to small dense models cheaply.
ORM (Outcome Reward Model): is the final answer right? Cheap labels, but no signal on which step went wrong. PRM (Process Reward Model): is each step right? Expensive labels (Lightman 2023's PRM800K cost millions), but enables verifier-guided beam search and better credit assignment.
ORM — outcome only
- Train on (prompt, completion, correct?)
- Cheap to label (just check the answer)
- Doesn't tell you where reasoning went wrong
- Default for most early reasoning RL
PRM — step-by-step
- Train on (prompt, partial completion, step-correct?)
- Step-level human or AI labels — much more expensive
- Enables verifier-guided beam search at inference
- Catches wrong reasoning early before it compounds
PRM800K (Lightman 2023): 800k step-level correctness labels on MATH problems, publicly released. Used in "Let's Verify Step by Step" (arxiv 2305.20050), which demonstrated that PRM-supervised training significantly outperformed ORM at the same compute.
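A hedged sketch of the verifier-guided beam search that a PRM enables at inference; `propose_steps`, `prm_score`, and `is_done` are hypothetical stand-ins for the policy, a trained PRM, and an end-of-solution check:

```python
# Beam search over reasoning steps, scored step-by-step by a PRM so that
# wrong steps get pruned before they compound.
def prm_beam_search(prompt, propose_steps, prm_score, is_done,
                    beam_width=4, expand=4, max_steps=20):
    beams = [("", 0.0)]  # (partial reasoning, cumulative PRM score)
    for _ in range(max_steps):
        candidates = []
        for partial, score in beams:
            for step in propose_steps(prompt, partial, n=expand):
                new_partial = partial + step
                candidates.append((new_partial, score + prm_score(prompt, new_partial)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if any(is_done(p) for p, _ in beams):
            break
    return max(beams, key=lambda c: c[1])[0]
```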
- ORM = cheap, just final correctness. PRM = expensive, step-level.
- PRM enables verifier-guided beam search — catches errors early.
- For frontier reasoning models in 2026, both are used; a PRM often serves as the inference-time verifier even when an ORM trained the policy.
Pass@N is the oracle ceiling: probability at least one of N samples is correct. Best-of-N (verifier) approaches it when the verifier is strong. Majority voting works when answers are discrete and no verifier is available. Empirically: log(error rate) decreases linearly with log(N) up to verifier saturation.
Three sampling strategies
- Pass@N (oracle): probability at least one of N samples is correct. Upper bound on what BoN can achieve.
- Best-of-N (verifier): sample N, pick highest-scoring per verifier. Achieves close to Pass@N if verifier is good.
- Majority voting: sample N, pick most common answer. Cheap, no verifier needed; works for discrete-answer tasks (math, MCQ).
Empirical observation: log(error rate) decreases linearly with log(N) — until verifier saturation. Larger models have shallower BoN curves (less to gain from sampling more). Optimal split between train-time and test-time compute depends on the problem distribution.
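A sketch of the three strategies for one prompt; `generate`, `verifier` (a scoring model), and `extract_answer` are placeholder names, and `pass_at_n` assumes i.i.d. samples with a known per-sample accuracy:

```python
from collections import Counter

def best_of_n(generate, verifier, prompt, n=16):
    samples = [generate(prompt, temperature=0.8) for _ in range(n)]
    return max(samples, key=lambda s: verifier(prompt, s))  # verifier-scored BoN

def majority_vote(generate, extract_answer, prompt, n=16):
    samples = [generate(prompt, temperature=0.8) for _ in range(n)]
    return Counter(extract_answer(s) for s in samples).most_common(1)[0][0]

def pass_at_n(per_sample_accuracy: float, n: int) -> float:
    """Oracle ceiling under an i.i.d. assumption: P(at least one of n is correct)."""
    return 1.0 - (1.0 - per_sample_accuracy) ** n
```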
- Pass@N = oracle ceiling. BoN approaches it with a good verifier.
- Majority voting works when there's no verifier and answers are discrete.
- Diminishing returns set in once verifier saturates — measure the curve.
Beyond linear CoT: explore multiple branches at each step, prune via verifier, expand promising subtrees. ToT does this at the prompt level; rStar-Math and AlphaProof scale it with MCTS + RL. Big inference-compute investment, big quality wins on hard problems.
Tree of Thoughts (Yao 2023, arxiv 2305.10601)
Generate multiple thoughts at each step, evaluate each, expand promising branches. Uses LLM both as policy (propose thoughts) and value (evaluate thoughts). Prompt-level technique — no model changes required.
rStar-Math (Microsoft 2024)
MCTS over reasoning steps. Each node is a partial solution; expansion samples next steps; backup propagates verifier rewards. Achieved frontier math performance with a 7B model + heavy MCTS — example of inference compute substituting for parameters.
AlphaProof (DeepMind 2024)
RL on Lean 4 formal proofs with MCTS-style search at training and inference. Earned IMO silver-medal performance. Pipeline: auto-formalize natural-language problems → MCTS over Lean tactics → verifier (Lean) gives reward.
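A compact MCTS skeleton over reasoning steps, in the spirit of the two systems above; `propose_steps` (policy), `rollout_reward` (verifier score after rolling out to a final answer), and `is_terminal` are hypothetical stand-ins:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, propose_steps, rollout_reward, is_terminal, iters=200):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        while node.children:                      # 1. selection via UCB
            node = max(node.children, key=ucb)
        if not is_terminal(node.state):           # 2. expansion: candidate next steps
            node.children = [Node(node.state + s, parent=node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        reward = rollout_reward(node.state)       # 3. evaluation: verifier on a rollout
        while node:                               # 4. backup to the root
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state
```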
- ToT = prompt-level branching + LLM-as-judge. Cheap to try.
- rStar-Math = MCTS + verifier. Big-compute, big-win on hard math.
- AlphaProof = MCTS over Lean tactics. The DeepMind formal-math direction.
Even with RLVR, models game the reward signal in subtler ways: tampered CoT (right answer + wrong reasoning), verifier exploitation (regex match without correct logic), length hacking (long CoTs correlated with correctness in training data), specification gaming (clever solutions that pass tests but aren't intended). Active research at Anthropic.
The four common reward-hacking modes
- Tampered CoT: model outputs a "reasoning trace" that doesn't actually justify the answer (post-hoc fabrication). Final answer correct (verifiable); CoT non-causal. Spotted by checking reasoning-answer consistency.
- Verifier exploitation: if the regex extracts only a number, the model writes the right number while the reasoning is wrong. Mitigation: stricter verifiers, multi-format checks.
- Length hacking: model learns long CoTs are correlated with correctness during training. Mitigation: length-normalized rewards (sketched after this list).
- Specification gaming: classic RL — clever solutions that pass the test but aren't what was intended.
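One of the mitigations above, length-normalized reward, as a hedged sketch; the target length and penalty shape are assumptions, not a published recipe:

```python
# Mild penalty for CoTs that exceed a token budget, so the policy cannot
# buy reward just by writing longer traces.
def length_normalized_reward(correct: bool, n_cot_tokens: int,
                             target_len: int = 2_000, penalty: float = 0.1) -> float:
    base = 1.0 if correct else 0.0
    overflow = max(0, n_cot_tokens - target_len) / target_len
    return base - penalty * overflow
```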
- Tampered CoT is the modern problem — answer correct, reasoning fake.
- Mitigations: process rewards, multiple verifiers, behavioral evals catching CoT-answer mismatch.
- Anthropic loops will probe this. Have an opinion.
Snell 2024 measured how test-time compute trades off against train-time compute. Hard problems benefit far more from extra inference compute than easy ones. For some tasks, test-time compute can substitute for ~14× train-time compute. Optimal allocation depends on the problem distribution and serving economics.
The empirical findings
- Hard problems benefit more from extra test-time compute than easy ones.
- Optimal allocation depends on problem distribution: revision (iterate on a single solution) vs search (sample many) trade off.
- For some tasks, test-time compute can substitute for ~14× train-time compute.
- Compute-optimal frontier shifts depending on serving cost vs train cost.
- Inference scaling has its own laws — they're problem-dependent.
- Cite Snell 2024 (arxiv 2408.03314) when discussing.
- The "revision vs search" tradeoff is the core practical knob.
Three knobs in production. (1) Reasoning effort — user-facing "thinking budget" exposed by o-series, Claude, R1. (2) Model routing — small classifier on the prompt picks fast vs reasoning model. (3) Distillation — push reasoning into a smaller dense student (R1 → Qwen-7B is canonical).
- Reasoning effort: OpenAI o-series, Claude extended thinking, DeepSeek R1 all expose "thinking budget" or "reasoning effort" knobs. Users pay for hidden CoT tokens.
- Routing: route easy queries to small/fast model; hard queries to reasoning model. A small classifier on the prompt decides.
- Distillation: distill reasoning model into a smaller student that includes CoT capability (R1 → Qwen-7B is the canonical example).
- Three production knobs: thinking budget, routing, distillation.
- Cost discipline — reasoning models burn 10-100× the tokens of vanilla chat.
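A minimal sketch of the routing knob; the classifier, threshold, and model names are placeholders:

```python
# Route easy queries to the cheap model, hard ones to the reasoning model.
def route(prompt: str, difficulty_classifier, threshold: float = 0.6) -> str:
    p_hard = difficulty_classifier(prompt)   # e.g. a small fine-tuned encoder
    if p_hard >= threshold:
        return "reasoning-model"             # long hidden CoT, 10-100x the tokens
    return "fast-chat-model"                 # cheap, low latency
```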
MATH and HumanEval are saturated. Use AIME, GPQA-Diamond, FrontierMath, HLE, SWE-Bench Verified, LiveCodeBench, ARC-AGI for differentiation. Full benchmark reference: evals page.
| Benchmark | Domain | Notes |
|---|---|---|
| MATH | Competition math | Saturated by frontier reasoning models (~95%+) |
| AIME 2024 / 2025 | Olympiad math | Current high-signal math eval |
| GPQA-Diamond | PhD-level science | Contamination-resistant; ~85% SOTA |
| HLE (Humanity's Last Exam) | Hard polymath | 2025; very low SOTA still |
| FrontierMath | Hard research math | Held-out problems; o3 jumped scores significantly |
| SWE-Bench Verified | Code agent | Real GitHub issues; agentic eval |
| LiveCodeBench | Competitive programming | Continuously updated to avoid contamination |
| ARC-AGI | Abstract reasoning | Chollet's eval; o3 made breakthrough |
- Cite saturation status of any benchmark you mention. MMLU = stale.
- FrontierMath, GPQA-Diamond, HLE = current frontiers.
- Drill the full evals page before any onsite.
0 → hero reasoning-models path
- foundation OpenAI o1 announcement post
- foundation DeepSeek R1 release + paper
- foundation Nathan Lambert — Interconnects on RLHF/post-training news
- depth Lightman 2023 — Let's Verify Step by Step (PRM)
- depth Snell 2024 — Scaling LLM Test-Time Compute
- depth DeepSeek Math (GRPO)
- depth DeepSeek R1 paper — read end-to-end
- depth Tree of Thoughts (Yao 2023)
- depth Tülu 3 — open RLVR recipe
Reasoning quiz — readiness check
- How does RLVR differ from RLHF?
RLVR uses programmatic verifiers (unit tests, exact-match, formal proof). RLHF uses learned reward model. RLVR can't be reward-hacked on the verifier signal but only works for verifiable tasks (math, code, formal logic, structured output).
- Walk through DeepSeek R1's pipeline.
4 stages: (1) cold-start SFT on reasoning traces with desired format. (2) RL with GRPO + verifiable rewards. (3) Rejection-sampling SFT (600k math/code from stage-2 + 200k general). (4) Final RL with mixed verifiable + preference rewards. Distill into smaller dense models.
- PRM vs ORM tradeoffs?
ORM (outcome): cheap, only needs final correctness label. PRM (process): step-level labels (expensive — PRM800K cost millions). PRM enables verifier-guided beam search and better credit assignment for long reasoning.
- Best-of-N vs majority voting?
BoN needs a verifier and gets closest to the oracle Pass@N. Majority voting needs no verifier; it works for tasks with discrete answers (math, MCQ) and is robust to noisy individual outputs.
- How would you serve a reasoning model with 8k hidden CoT tokens per query?
Massive decode load (8k tokens × users); huge KV cache. Disaggregated prefill (cheap) + decode (expensive); aggressive prefix caching across reasoning segments; queue with priority for premium tier; possibly speculative decoding with weaker model for early CoT phase; user-facing "thinking budget" knob.
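A back-of-envelope for why the KV cache dominates here, with assumed model dimensions (illustrative, not any specific model's config):

```python
def kv_cache_bytes(n_tokens, n_layers=60, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V per token per layer: 2 * n_kv_heads * head_dim * dtype_bytes
    return n_tokens * n_layers * 2 * n_kv_heads * head_dim * dtype_bytes

per_query = kv_cache_bytes(8_000)                 # 8k hidden CoT tokens
print(per_query / 1e9, "GB per in-flight query")  # ~2 GB with these dims
# 100 concurrent queries -> ~200 GB of KV cache on top of the weights,
# which is why disaggregation, prefix caching, and thinking budgets matter.
```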
- What is reward hacking in reasoning models?
Tampered CoT (correct answer with non-causal reasoning), verifier gaming (matches regex but reasoning is wrong), length hacking (reward correlated with verbose output), specification gaming. Mitigations: process rewards, multiple verifiers, behavioral evals catching CoT-answer mismatch.
- Why does pure RL (R1-Zero) work without SFT?
A strong base model already contains reasoning circuits; RL only needs to elicit them. SFT bottlenecks the model into the demo distribution; RL is freer to explore. Critical caveat: R1-Zero readability suffers (mixes languages); cold-start SFT in the full R1 pipeline fixes this.
- How would you choose between train-time vs test-time compute scaling?
Easy distributions: more train compute. Hard distributions with high-value queries: more test-time compute. Snell 2024 framework: measure scaling exponents on each axis at fixed budget. Test-time compute can substitute for ~14× train compute on hard problems.
- What is GRPO's advantage formula?
For G samples per prompt with rewards r_1, ..., r_G: A_i = (r_i − mean(r)) / std(r). Sequence-level scalar advantage broadcast to every token in output i. PPO-style clipped per-token ratio with this advantage. No critic/value model.
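A minimal sketch of that computation; rewards here are per-completion verifier scores for one prompt, and the small eps guard against zero variance is an implementation assumption:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: normalize rewards within the G-sample group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # e.g. unit-test pass/fail
print(grpo_advantages(rewards))  # each value is broadcast to every token of that completion
```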
- What is rStar-Math?
(Microsoft 2024) MCTS over reasoning steps for math. Each node is a partial solution; expansion samples next steps; backup propagates verifier rewards. Achieved frontier math performance with a 7B model + heavy MCTS — example of inference compute substituting for parameters.
- What does AlphaProof do?
(DeepMind 2024) RL on Lean 4 formal proofs with MCTS at training and inference. Auto-formalize natural-language problems → MCTS over Lean tactics → Lean checker is the verifier. IMO silver-medal performance.
- Why is GPQA-Diamond a better eval than MMLU for reasoning models?
MMLU is largely saturated (90%+ for frontier). GPQA-Diamond: hand-written by domain PhDs; multi-step reasoning required; designed to resist contamination. Differentiates frontier reasoning models more cleanly. ~85% SOTA.
- What is process reward model (PRM) — in 1 sentence.
A classifier trained on (prompt, partial reasoning, step_correct?) that scores each step of a CoT, enabling verifier-guided beam search and better credit assignment than outcome-only rewards.
- Tradeoffs of distilling a reasoning model into a smaller dense model?
Distillation (R1 → Qwen-7B) transfers reasoning ability cheaply; the small model can run on a single GPU; CoT length is usually preserved; quality is below the teacher but often close. Caveats: the distilled model loses some flexibility on novel tasks, and distillation itself needs significant compute.
- What does "test-time compute scaling" mean concretely?
Two axes: (1) Sequential — model produces longer chain-of-thought before final answer. (2) Parallel — sample N candidates, pick best (BoN, majority vote). Both can substitute for additional train compute up to a problem-dependent ceiling.