Reasoning models — the 2025 frontier
o1, o3, DeepSeek R1, Claude extended thinking. Test-time compute is now a scaling axis as legitimate as parameter count. This chapter explains RLVR, GRPO, process reward models, MCTS, and the open R1 recipe in enough depth to hold your own in any frontier-lab interview loop.
What you'll learn
- The shift — train-time → test-time scaling
- RLVR — RL with Verifiable Rewards (the unlock)
- DeepSeek R1 — the open canon, 4-stage pipeline
- PRM vs ORM — when step rewards matter
- Best-of-N + verifier scaling laws
- Tree search at inference — MCTS, ToT, rStar, AlphaProof
- Reward hacking in reasoning — the new failure modes
- Inference-time scaling laws (Snell 2024)
- Practical deployment — thinking budgets, routing, distillation
- Eval benchmarks for reasoning
Pre-2024: more train compute = better model. Post-o1: a fixed model can also get better at inference time by thinking longer or by sampling more candidates. Test-time compute is a new scaling axis — Snell 2024 showed it can substitute for ~14× train compute on hard problems.
Two compounding axes
- Sequential test-time: longer chain-of-thought before answering. The model "thinks" for thousands of tokens (often hidden from the user).
- Parallel test-time: sample N candidate solutions, pick best (best-of-N, majority vote, or verifier-scored).
Snell et al. 2024 (arxiv 2408.03314) — "Scaling LLM Test-Time Compute Optimally": for many problems, optimal test-time compute can substitute for ~14× more pretraining compute. Smart inference can let smaller models match larger ones.
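A minimal sketch of the two axes, assuming a generic `generate(prompt, ...)` call; `max_think_tokens`, `temperature`, and `extract_answer` are placeholder names for illustration, not a specific provider's API:

```python
# Sketch only: `generate` stands in for whatever inference call you have;
# `max_think_tokens` and `extract_answer` are hypothetical names.
from collections import Counter

def sequential_scaling(generate, prompt, budgets=(1_000, 4_000, 16_000)):
    """Sequential axis: same prompt, progressively larger thinking budget."""
    return [generate(prompt, max_think_tokens=b) for b in budgets]

def parallel_scaling(generate, extract_answer, prompt, n=16):
    """Parallel axis: sample N candidates, keep the most common final answer."""
    answers = [extract_answer(generate(prompt, temperature=0.8)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```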
- Two new axes: longer CoT (sequential), more samples (parallel).
- Inference compute can substitute for train compute up to a problem-dependent ceiling.
- This is what made o1, R1, Claude extended thinking possible.
Replace the learned reward model (which can be hacked) with a programmatic verifier (which cannot be gamed on its own signal). For tasks where you can check correctness automatically — code (unit tests), math (exact match), formal proofs (Lean) — RLVR drives the model to discover unconventional reasoning patterns without reward hacking.
Reward models saturate. Verifiers don't.
Learned reward models (RLHF) are noisy proxies for human preference. The model can find inputs that score high under the RM but humans dislike — classic reward hacking. Programmatic verifiers give an objective ground-truth signal:
- Code: unit tests pass / fail.
- Math: extract final answer (regex or grader LLM), exact-match.
- Formal proofs: Lean / Coq proof checker.
- Instruction following: regex / structured-output check.
The model can write a 1000-token CoT in any style; only the final answer matters. This frees the model to discover unconventional reasoning patterns that humans wouldn't have demonstrated. Used by Tülu 3, DeepSeek R1, the OpenAI o-series, and AlphaProof.
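A hedged sketch of what a programmatic verifier looks like for the math case, assuming completions end with a line like `Answer: 42` (that format, and the function names, are assumptions for illustration):

```python
# Binary verifiable reward for math prompts: the CoT can say anything,
# only the extracted final answer is checked against the reference.
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last 'Answer: <number>' from a completion (assumed format)."""
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0
```

For code, the same role is played by running the unit tests; for formal proofs, by the Lean checker.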
- RLVR works for verifiable tasks (math, code, formal). Doesn't work for "is this funny?"
- Programmatic verifier ≠ learned reward model. Different failure modes.
- This is THE 2024 unlock that made o-series and R1 possible.
DeepSeek R1 (arxiv 2501.12948) is the open recipe for reasoning models. Two model variants: R1-Zero (pure RL on V3 base — proves reasoning emerges from RL alone) and R1 (4-stage pipeline that fixes R1-Zero's readability while preserving capability). Distilled into Qwen-7B/14B/32B and Llama-8B/70B for cheap deployment.
R1-Zero — pure RL, no SFT
- Start from DeepSeek-V3 base (strong, no instruction-tuning).
- Run GRPO with verifiable rewards on math + code prompts.
- Reasoning emerges spontaneously: long CoTs, "wait, let me reconsider" patterns, alternative-approach exploration.
- Capability excellent; readability poor (mixes languages, no formatting).
R1 — the full recipe
- Cold-start SFT: small set of curated reasoning traces with desired format. Fixes readability.
- RL: GRPO with verifiable rewards.
- Rejection-sampling SFT: sample many outputs from the stage-2 policy, keep the correct ones (~600k math/code examples), add ~200k general-purpose SFT examples, and run SFT on the combined set (sketched after this list).
- Final RL: combined verifiable + preference rewards (helpfulness, harmlessness).
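A minimal sketch of the stage-3 data build described above; `policy_sample` and `verifier` are hypothetical stand-ins for the stage-2 policy and the verifiable-reward check:

```python
# Rejection-sampling SFT: sample K completions per prompt from the stage-2
# policy, keep only verifier-accepted ones, use them as SFT targets.
def build_rejection_sampling_sft(prompts, policy_sample, verifier, k=16):
    sft_examples = []
    for prompt in prompts:
        completions = [policy_sample(prompt) for _ in range(k)]
        correct = [c for c in completions if verifier(prompt, c)]
        if correct:
            sft_examples.append({"prompt": prompt, "completion": correct[0]})
    return sft_examples
```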
Key lessons from R1
- Pure RL works on a strong base model — reasoning is "in" the base; RL elicits it.
- GRPO scales well at LLM scale (no critic/value model).
- Distilled smaller models (R1 → Qwen-7B/14B/32B, Llama-8B/70B) inherit reasoning ability cheaply.
- Cold-start matters for usability but not capability.
- R1-Zero proves: reasoning emerges from RL on a strong base.
- R1's 4 stages: cold-start SFT → RL → rejection-sampling SFT → final RL.
- Distillation transfers reasoning to small dense models cheaply.
ORM (Outcome Reward Model): is the final answer right? Cheap labels, but no signal on which step went wrong. PRM (Process Reward Model): is each step right? Expensive labels (Lightman 2023's PRM800K cost millions), but enables verifier-guided beam search and better credit assignment.
ORM — outcome only
- Train on (prompt, completion, correct?)
- Cheap to label (just check the answer)
- Doesn't tell you where reasoning went wrong
- Default for most early reasoning RL
PRM — step-by-step
- Train on (prompt, partial completion, step-correct?)
- Step-level human or AI labels — much more expensive
- Enables verifier-guided beam search at inference
- Catches wrong reasoning early before it compounds
PRM800K (Lightman 2023): 800k step-level correctness labels on MATH problems, publicly released. Used in "Let's Verify Step by Step" (arxiv 2305.20050), which demonstrated that PRM-supervised training significantly outperformed ORM at the same compute.
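A hedged sketch of the verifier-guided beam search that a PRM enables at inference; `propose_steps`, `prm_score`, and `is_done` are hypothetical stand-ins for the policy, a trained PRM, and an end-of-solution check:

```python
# Beam search over reasoning steps, scored step-by-step by a PRM so that
# wrong steps get pruned before they compound.
def prm_beam_search(prompt, propose_steps, prm_score, is_done,
                    beam_width=4, expand=4, max_steps=20):
    beams = [("", 0.0)]  # (partial reasoning, cumulative PRM score)
    for _ in range(max_steps):
        candidates = []
        for partial, score in beams:
            for step in propose_steps(prompt, partial, n=expand):
                new_partial = partial + step
                candidates.append((new_partial, score + prm_score(prompt, new_partial)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if any(is_done(p) for p, _ in beams):
            break
    return max(beams, key=lambda c: c[1])[0]
```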
- ORM = cheap, just final correctness. PRM = expensive, step-level.
- PRM enables verifier-guided beam search — catches errors early.
- For frontier reasoning models in 2026, both are used; a PRM often serves as the inference-time verifier even when an ORM trained the policy.
Pass@N is the oracle ceiling: probability at least one of N samples is correct. Best-of-N (verifier) approaches it when the verifier is strong. Majority voting works when answers are discrete and no verifier is available. Empirically: log(error rate) decreases linearly with log(N) up to verifier saturation.
Three sampling strategies
- Pass@N (oracle): probability at least one of N samples is correct. Upper bound on what BoN can achieve.
- Best-of-N (verifier): sample N, pick highest-scoring per verifier. Achieves close to Pass@N if verifier is good.
- Majority voting: sample N, pick most common answer. Cheap, no verifier needed; works for discrete-answer tasks (math, MCQ).
Empirical observation: log(error rate) decreases linearly with log(N) — until verifier saturation. Larger models have shallower BoN curves (less to gain from sampling more). Optimal split between train-time and test-time compute depends on the problem distribution.
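A sketch of the three strategies for one prompt; `generate`, `verifier` (a scoring model), and `extract_answer` are placeholder names, and `pass_at_n` assumes i.i.d. samples with a known per-sample accuracy:

```python
from collections import Counter

def best_of_n(generate, verifier, prompt, n=16):
    samples = [generate(prompt, temperature=0.8) for _ in range(n)]
    return max(samples, key=lambda s: verifier(prompt, s))  # verifier-scored BoN

def majority_vote(generate, extract_answer, prompt, n=16):
    samples = [generate(prompt, temperature=0.8) for _ in range(n)]
    return Counter(extract_answer(s) for s in samples).most_common(1)[0][0]

def pass_at_n(per_sample_accuracy: float, n: int) -> float:
    """Oracle ceiling under an i.i.d. assumption: P(at least one of n is correct)."""
    return 1.0 - (1.0 - per_sample_accuracy) ** n
```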
- Pass@N = oracle ceiling. BoN approaches it with a good verifier.
- Majority voting works when there's no verifier and answers are discrete.
- Diminishing returns set in once verifier saturates — measure the curve.
Beyond linear CoT: explore multiple branches at each step, prune via verifier, expand promising subtrees. ToT does this at the prompt level; rStar-Math and AlphaProof scale it with MCTS + RL. Big inference-compute investment, big quality wins on hard problems.
Tree of Thoughts (Yao 2023, arxiv 2305.10601)
Generate multiple thoughts at each step, evaluate each, expand promising branches. Uses LLM both as policy (propose thoughts) and value (evaluate thoughts). Prompt-level technique — no model changes required.
rStar-Math (Microsoft 2024)
MCTS over reasoning steps. Each node is a partial solution; expansion samples next steps; backup propagates verifier rewards. Achieved frontier math performance with a 7B model + heavy MCTS — example of inference compute substituting for parameters.
AlphaProof (DeepMind 2024)
RL on Lean 4 formal proofs with MCTS-style search at training and inference. Earned IMO silver-medal performance. Pipeline: auto-formalize natural-language problems → MCTS over Lean tactics → verifier (Lean) gives reward.
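A compact MCTS skeleton over reasoning steps, in the spirit of the two systems above; `propose_steps` (policy), `rollout_reward` (verifier score after rolling out to a final answer), and `is_terminal` are hypothetical stand-ins:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, propose_steps, rollout_reward, is_terminal, iters=200):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        while node.children:                      # 1. selection via UCB
            node = max(node.children, key=ucb)
        if not is_terminal(node.state):           # 2. expansion: candidate next steps
            node.children = [Node(node.state + s, parent=node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        reward = rollout_reward(node.state)       # 3. evaluation: verifier on a rollout
        while node:                               # 4. backup to the root
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state
```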
- ToT = prompt-level branching + LLM-as-judge. Cheap to try.
- rStar-Math = MCTS + verifier. Big-compute, big-win on hard math.
- AlphaProof = MCTS over Lean tactics. The DeepMind formal-math direction.
Even with RLVR, models game the reward signal in subtler ways: tampered CoT (right answer + wrong reasoning), verifier exploitation (regex match without correct logic), length hacking (long CoTs correlated with correctness in training data), specification gaming (clever solutions that pass tests but aren't intended). Active research at Anthropic.
The four common reward-hacking modes
- Tampered CoT: model outputs a "reasoning trace" that doesn't actually justify the answer (post-hoc fabrication). Final answer correct (verifiable); CoT non-causal. Spotted by checking reasoning-answer consistency.
- Verifier exploitation: if the regex extracts only a number, the model writes the right number while the reasoning is wrong. Mitigation: stricter verifiers, multi-format checks.
- Length hacking: model learns long CoTs are correlated with correctness during training. Mitigation: length-normalized rewards (sketched after this list).
- Specification gaming: classic RL — clever solutions that pass the test but aren't what was intended.
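One of the mitigations above, length-normalized reward, as a hedged sketch; the target length and penalty shape are assumptions, not a published recipe:

```python
# Mild penalty for CoTs that exceed a token budget, so the policy cannot
# buy reward just by writing longer traces.
def length_normalized_reward(correct: bool, n_cot_tokens: int,
                             target_len: int = 2_000, penalty: float = 0.1) -> float:
    base = 1.0 if correct else 0.0
    overflow = max(0, n_cot_tokens - target_len) / target_len
    return base - penalty * overflow
```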
- Tampered CoT is the modern problem — answer correct, reasoning fake.
- Mitigations: process rewards, multiple verifiers, behavioral evals catching CoT-answer mismatch.
- Anthropic loops will probe this. Have an opinion.
Snell 2024 measured how test-time compute trades off against train-time compute. Hard problems benefit far more from extra inference compute than easy ones. For some tasks, test-time compute can substitute for ~14× train-time compute. Optimal allocation depends on the problem distribution and serving economics.
The empirical findings
- Hard problems benefit more from extra test-time compute than easy ones.
- Optimal allocation depends on problem distribution: revision (iterate on a single solution) vs search (sample many) trade off.
- For some tasks, test-time compute can substitute for ~14× train-time compute.
- Compute-optimal frontier shifts depending on serving cost vs train cost.
- Inference scaling has its own laws — they're problem-dependent.
- Cite Snell 2024 (arxiv 2408.03314) when discussing.
- The "revision vs search" tradeoff is the core practical knob.
Three knobs in production. (1) Reasoning effort — user-facing "thinking budget" exposed by o-series, Claude, R1. (2) Model routing — small classifier on the prompt picks fast vs reasoning model. (3) Distillation — push reasoning into a smaller dense student (R1 → Qwen-7B is canonical).
- Reasoning effort: OpenAI o-series, Claude extended thinking, DeepSeek R1 all expose "thinking budget" or "reasoning effort" knobs. Users pay for hidden CoT tokens.
- Routing: route easy queries to small/fast model; hard queries to reasoning model. A small classifier on the prompt decides.
- Distillation: distill reasoning model into a smaller student that includes CoT capability (R1 → Qwen-7B is the canonical example).
- Three production knobs: thinking budget, routing, distillation.
- Cost discipline — reasoning models burn 10-100× the tokens of vanilla chat.
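A minimal sketch of the routing knob; the classifier, threshold, and model names are placeholders:

```python
# Route easy queries to the cheap model, hard ones to the reasoning model.
def route(prompt: str, difficulty_classifier, threshold: float = 0.6) -> str:
    p_hard = difficulty_classifier(prompt)   # e.g. a small fine-tuned encoder
    if p_hard >= threshold:
        return "reasoning-model"             # long hidden CoT, 10-100x the tokens
    return "fast-chat-model"                 # cheap, low latency
```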
MATH and HumanEval are saturated. Use AIME, GPQA-Diamond, FrontierMath, HLE, SWE-Bench Verified, LiveCodeBench, ARC-AGI for differentiation. Full benchmark reference: evals page.
| Benchmark | Domain | Notes |
|---|---|---|
| MATH | Competition math | Saturated by frontier reasoning models (~95%+) |
| AIME 2024 / 2025 | Olympiad math | Current high-signal math eval |
| GPQA-Diamond | PhD-level science | Contamination-resistant; ~85% SOTA |
| HLE (Humanity's Last Exam) | Hard polymath | 2025; very low SOTA still |
| FrontierMath | Hard research math | Held-out problems; o3 jumped scores significantly |
| SWE-Bench Verified | Code agent | Real GitHub issues; agentic eval |
| LiveCodeBench | Competitive programming | Continuously updated to avoid contamination |
| ARC-AGI | Abstract reasoning | Chollet's eval; o3 made breakthrough |
- Cite saturation status of any benchmark you mention. MMLU = stale.
- FrontierMath, GPQA-Diamond, HLE = current frontiers.
- Drill the full evals page before any onsite.
0 → hero reasoning-models path
- foundation OpenAI o1 announcement post
- foundation DeepSeek R1 release + paper
- foundation Nathan Lambert — Interconnects on RLHF/post-training news
- depth Lightman 2023 — Let's Verify Step by Step (PRM)
- depth Snell 2024 — Scaling LLM Test-Time Compute
- depth DeepSeek Math (GRPO)
- depth DeepSeek R1 paper — read end-to-end
- depth Tree of Thoughts (Yao 2023)
- depth Tülu 3 — open RLVR recipe
Reasoning quiz — readiness check
- How does RLVR differ from RLHF?
RLVR uses programmatic verifiers (unit tests, exact-match, formal proof). RLHF uses learned reward model. RLVR can't be reward-hacked on the verifier signal but only works for verifiable tasks (math, code, formal logic, structured output).
- Walk through DeepSeek R1's pipeline.
4 stages: (1) cold-start SFT on reasoning traces with desired format. (2) RL with GRPO + verifiable rewards. (3) Rejection-sampling SFT (600k math/code from stage-2 + 200k general). (4) Final RL with mixed verifiable + preference rewards. Distill into smaller dense models.
- PRM vs ORM tradeoffs?
ORM (outcome): cheap, only needs final correctness label. PRM (process): step-level labels (expensive — PRM800K cost millions). PRM enables verifier-guided beam search and better credit assignment for long reasoning.
- Best-of-N vs majority voting?
BoN needs a verifier and gets closest to the oracle Pass@N. Majority voting needs no verifier; it works for tasks with discrete answers (math, MCQ) and is robust to noisy individual outputs.
- How would you serve a reasoning model with 8k hidden CoT tokens per query?
Massive decode load (8k tokens × users); huge KV cache. Disaggregated prefill (cheap) + decode (expensive); aggressive prefix caching across reasoning segments; queue with priority for premium tier; possibly speculative decoding with weaker model for early CoT phase; user-facing "thinking budget" knob.
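A back-of-envelope for why the KV cache dominates here, with assumed model dimensions (illustrative, not any specific model's config):

```python
def kv_cache_bytes(n_tokens, n_layers=60, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V per token per layer: 2 * n_kv_heads * head_dim * dtype_bytes
    return n_tokens * n_layers * 2 * n_kv_heads * head_dim * dtype_bytes

per_query = kv_cache_bytes(8_000)                 # 8k hidden CoT tokens
print(per_query / 1e9, "GB per in-flight query")  # ~2 GB with these dims
# 100 concurrent queries -> ~200 GB of KV cache on top of the weights,
# which is why disaggregation, prefix caching, and thinking budgets matter.
```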
- What is reward hacking in reasoning models?
Tampered CoT (correct answer with non-causal reasoning), verifier gaming (matches regex but reasoning is wrong), length hacking (reward correlated with verbose output), specification gaming. Mitigations: process rewards, multiple verifiers, behavioral evals catching CoT-answer mismatch.
- Why does pure RL (R1-Zero) work without SFT?
A strong base model already contains reasoning circuits; RL only needs to elicit them. SFT bottlenecks the model into the demo distribution; RL is freer to explore. Critical caveat: R1-Zero readability suffers (mixes languages); cold-start SFT in the full R1 pipeline fixes this.
- How would you choose between train-time vs test-time compute scaling?
Easy distributions: more train compute. Hard distributions with high-value queries: more test-time compute. Snell 2024 framework: measure scaling exponents on each axis at fixed budget. Test-time compute can substitute for ~14× train compute on hard problems.
- What is GRPO's advantage formula?
For G samples per prompt with rewards r_1, ..., r_G: A_i = (r_i − mean(r)) / std(r). Sequence-level scalar advantage broadcast to every token in output i. PPO-style clipped per-token ratio with this advantage. No critic/value model.
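A minimal sketch of that computation; rewards here are per-completion verifier scores for one prompt, and the small eps guard against zero variance is an implementation assumption:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: normalize rewards within the G-sample group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # e.g. unit-test pass/fail
print(grpo_advantages(rewards))  # each value is broadcast to every token of that completion
```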
- What is rStar-Math?
(Microsoft 2024) MCTS over reasoning steps for math. Each node is a partial solution; expansion samples next steps; backup propagates verifier rewards. Achieved frontier math performance with a 7B model + heavy MCTS — example of inference compute substituting for parameters.
- What does AlphaProof do?
(DeepMind 2024) RL on Lean 4 formal proofs with MCTS at training and inference. Auto-formalize natural-language problems → MCTS over Lean tactics → Lean checker is the verifier. IMO silver-medal performance.
- Why is GPQA-Diamond a better eval than MMLU for reasoning models?
MMLU is largely saturated (90%+ for frontier). GPQA-Diamond: hand-written by domain PhDs; multi-step reasoning required; designed to resist contamination. Differentiates frontier reasoning models more cleanly. ~85% SOTA.
- What is process reward model (PRM) — in 1 sentence.
A classifier trained on (prompt, partial reasoning, step_correct?) that scores each step of a CoT, enabling verifier-guided beam search and better credit assignment than outcome-only rewards.
- Tradeoffs of distilling a reasoning model into a smaller dense model?
Distillation (R1 → Qwen-7B) transfers reasoning ability cheaply; the small model can run on a single GPU; CoT length is usually preserved; quality is below the teacher but often close. Caveats: the distilled model loses some flexibility on novel tasks, and distillation itself needs significant compute.
- What does "test-time compute scaling" mean concretely?
Two axes: (1) Sequential — model produces longer chain-of-thought before final answer. (2) Parallel — sample N candidates, pick best (BoN, majority vote). Both can substitute for additional train compute up to a problem-dependent ceiling.