FRONTIER PILLAR · 2025 BREAKTHROUGH

Reasoning models — the 2025 frontier

o1, o3, DeepSeek R1, Claude extended thinking. Test-time compute is now a scaling axis as legitimate as parameter count. This chapter explains RLVR, GRPO, process reward models, MCTS, and the open R1 recipe densely enough to walk into any frontier-lab loop.

Read: ~30 min · Asked at: Anthropic, OpenAI, DeepMind, DeepSeek, TML · Difficulty: Sr/Staff bar
01
FOUNDATIONS · THE PARADIGM SHIFT

The shift — train-time → test-time scaling

TL;DR

Pre-2024: more train compute = better model. Post-o1: a fixed model can also get better at inference time by thinking longer or by sampling more candidates. Test-time compute is a new scaling axis — Snell 2024 showed it can substitute for ~14× train compute on hard problems.

Two compounding axes

  • Sequential: the model thinks longer, producing a longer chain of thought before committing to a final answer.
  • Parallel: generate N candidate answers and select one with a verifier or a majority vote.

The two compound: longer chains, and more of them.

EXAMPLE — the Snell 2024 result

Snell et al. 2024 (arxiv 2408.03314) — "Scaling LLM Test-Time Compute Optimally": for many problems, optimal test-time compute can substitute for ~14× more pretraining compute. Smart inference can let smaller models match larger ones.

REMEMBER
  • Two new axes: longer CoT (sequential), more samples (parallel).
  • Inference compute can substitute for train compute up to a problem-dependent ceiling.
  • This is what made o1, R1, Claude extended thinking possible.
02
THE UNLOCK · RLVR

RLVR — RL with Verifiable Rewards

TL;DR

Replace the learned reward model (which can be hacked) with a programmatic verifier (which can't, on its signal). For tasks where you can check correctness automatically — code (unit tests), math (exact match), formal proofs (Lean) — RLVR drives the model to discover unconventional reasoning patterns without reward hacking.

THE INSIGHT — why RLVR > RLHF for reasoning

Reward models saturate. Verifiers don't.

Learned reward models (RLHF) are noisy proxies for human preference. The model can find inputs that score high under the RM but humans dislike — classic reward hacking. Programmatic verifiers give an objective ground-truth signal:

  • Code: unit tests pass / fail.
  • Math: extract final answer (regex or grader LLM), exact-match.
  • Formal proofs: Lean / Coq proof checker.
  • Instruction following: regex / structured-output check.

The model can write a 1000-token CoT in any style; only the final answer matters. This frees it to discover unconventional reasoning patterns that humans wouldn't have demonstrated. Used by Tülu 3, DeepSeek R1, OpenAI o-series, AlphaProof.
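A minimal sketch of what a programmatic verifier looks like in practice. The "Final answer:" format, the function names, and the unsandboxed subprocess call are illustrative assumptions, not any lab's actual pipeline (real code execution runs in a sandbox):

```python
import re
import subprocess
import tempfile

def math_reward(completion: str, gold_answer: str) -> float:
    """Extract the final answer from the CoT and exact-match it against gold."""
    m = re.search(r"Final answer:\s*(.+)", completion)   # assumes the prompted answer format
    if m is None:
        return 0.0                                        # unparseable -> no reward
    pred = m.group(1).strip().rstrip(".")
    return 1.0 if pred == gold_answer.strip() else 0.0

def code_reward(completion: str, unit_tests: str, timeout_s: int = 10) -> float:
    """Run the generated code against assert-based unit tests; pass/fail is the reward."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + unit_tests)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

Only this scalar enters the RL objective; the CoT itself is never graded, which is exactly why the model is free to reason however it likes.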

PITFALL — verifier exploitability
RLVR's no-reward-hacking property holds only for the verifier signal itself. The model can still hack subtler signals: write a CoT that doesn't justify the answer (tampered reasoning), match the regex with wrong derivation, use length to game training-time correlations. See Chapter 7.
REMEMBER
  • RLVR works for verifiable tasks (math, code, formal). Doesn't work for "is this funny?"
  • Programmatic verifier ≠ learned reward model. Different failure modes.
  • This is THE 2024 unlock that made o-series and R1 possible.
03
THE OPEN CANON · DEEPSEEK R1

DeepSeek R1 — the open recipe everyone reads

TL;DR

DeepSeek R1 (arxiv 2501.12948) is the open recipe for reasoning models. Two model variants: R1-Zero (pure RL on V3 base — proves reasoning emerges from RL alone) and R1 (4-stage pipeline that fixes R1-Zero's readability while preserving capability). Distilled into Qwen-7B/14B/32B and Llama-8B/70B for cheap deployment.

R1-Zero — pure RL, no SFT

  1. Start from DeepSeek-V3 base (strong, no instruction-tuning).
  2. Run GRPO with verifiable rewards on math + code prompts (advantage computation sketched after this list).
  3. Reasoning emerges spontaneously: long CoTs, "wait, let me reconsider" patterns, alternative-approach exploration.
  4. Capability excellent; readability poor (mixes languages, no formatting).
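The core of GRPO in one function: a group-relative advantage with no critic. A minimal sketch assuming scalar verifier rewards per completion; real implementations add the PPO-style clipped token ratio and a KL penalty against the reference policy:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: A_i = (r_i - mean(r)) / std(r).

    rewards: shape (G,), verifier rewards for G completions of ONE prompt.
    Returns one scalar advantage per completion, which is then broadcast to
    every token of that completion. No learned critic / value model needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 samples for one math prompt; the verifier says only sample 3 is correct.
rewards = torch.tensor([0.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))   # the correct sample gets a positive advantage, the rest negative
```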
THE 4-STAGE PIPELINE — memorize this

R1 — the full recipe

  1. Cold-start SFT: small set of curated reasoning traces with desired format. Fixes readability.
  2. RL: GRPO with verifiable rewards.
  3. Rejection-sampling SFT: sample many outputs from the stage-2 policy, keep the verified-correct ones (~600k math/code examples), mix in ~200k general-purpose SFT examples, and fine-tune on the combined set (see the sketch after this list).
  4. Final RL: combined verifiable + preference rewards (helpfulness, harmlessness).
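An illustrative sketch of the stage-3 data construction, assuming a hypothetical `generate` (samples from the stage-2 policy) and `verify` (returns 1.0 for correct, like the verifier sketch above); parameter values are made up:

```python
def build_rejection_sft_data(prompts, generate, verify, k: int = 16, max_keep: int = 4):
    """Sample k completions per prompt, keep only verified-correct ones,
    and emit (prompt, completion) pairs for supervised fine-tuning."""
    sft_pairs = []
    for prompt, gold in prompts:                     # gold = reference answer or unit tests
        completions = [generate(prompt) for _ in range(k)]
        correct = [c for c in completions if verify(c, gold) == 1.0]
        # Cap per-prompt examples so easy prompts don't dominate the SFT mix.
        sft_pairs.extend((prompt, c) for c in correct[:max_keep])
    return sft_pairs
```

R1 mixed roughly 600k such math/code pairs with roughly 200k general-purpose examples before the final RL stage.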

Key lessons from R1

REMEMBER
  • R1-Zero proves: reasoning emerges from RL on a strong base.
  • R1's 4 stages: cold-start SFT → RL → rejection-sampling SFT → final RL.
  • Distillation transfers reasoning to small dense models cheaply.
04
REWARDS · STEP-LEVEL SUPERVISION

PRM vs ORM — when step-level rewards matter

TL;DR

ORM (Outcome Reward Model): is the final answer right? Cheap labels, but no signal on which step went wrong. PRM (Process Reward Model): is each step right? Expensive labels (Lightman 2023's PRM800K cost millions), but enables verifier-guided beam search and better credit assignment.

ORM — outcome only

  • Train on (prompt, completion, correct?)
  • Cheap to label (just check the answer)
  • Doesn't tell you where reasoning went wrong
  • Default for most early reasoning RL

PRM — step-by-step

  • Train on (prompt, partial completion, step-correct?)
  • Step-level human or AI labels — much more expensive
  • Enables verifier-guided beam search at inference (sketched below)
  • Catches wrong reasoning early before it compounds
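A minimal sketch of verifier-guided beam search with a PRM, assuming hypothetical `propose_steps` (the policy proposes candidate next steps) and `prm_score` (probability the latest step is correct) functions:

```python
import math

def prm_beam_search(prompt, propose_steps, prm_score,
                    beam_width=4, expand=4, max_depth=12):
    """At each depth, keep the beam_width partial solutions whose steps the PRM
    scores highest (sum of log step scores)."""
    beams = [("", 0.0)]                              # (partial CoT, cumulative log PRM score)
    for _ in range(max_depth):
        candidates = []
        for partial, logscore in beams:
            if "Final answer" in partial:            # finished path: carry forward unchanged
                candidates.append((partial, logscore))
                continue
            for step in propose_steps(prompt, partial, n=expand):
                p = prm_score(prompt, partial + step)        # P(step correct | context)
                candidates.append((partial + step, logscore + math.log(max(p, 1e-9))))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[1])[0]
```

Because every step is scored, a wrong derivation gets pruned immediately instead of surviving until the final answer; that is the credit-assignment win over an ORM.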
EXAMPLE — PRM800K dataset (OpenAI, Lightman 2023)

800k step-level correctness labels on MATH problems. Public. Used in "Let's Verify Step by Step" (arxiv 2305.20050). Demonstrated that process supervision significantly outperformed outcome supervision for training verifiers (best-of-N selection) at the same compute.

REMEMBER
  • ORM = cheap, just final correctness. PRM = expensive, step-level.
  • PRM enables verifier-guided beam search — catches errors early.
  • For frontier reasoning models in 2026, both are used; PRM as inference verifier even when ORM trained the policy.
05
PARALLEL SAMPLING · SCALING LAWS

Best-of-N + verifier scaling laws

TL;DR

Pass@N is the oracle ceiling: probability at least one of N samples is correct. Best-of-N (verifier) approaches it when the verifier is strong. Majority voting works when answers are discrete and no verifier is available. Empirically: log(error rate) decreases linearly with log(N) up to verifier saturation.

Three sampling strategies

  • Pass@N (oracle): at least one of N samples is correct. The ceiling; measurable only with ground truth.
  • Best-of-N (verifier): generate N candidates, let a verifier pick one. Approaches Pass@N when the verifier is strong.
  • Majority voting: no verifier needed; take the most common final answer. Discrete answers only.

All three are sketched after the paragraph below.

Empirical observation: log(error rate) decreases linearly with log(N) — until verifier saturation. Larger models have shallower BoN curves (less to gain from sampling more). Optimal split between train-time and test-time compute depends on the problem distribution.
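A side-by-side sketch of the three strategies, assuming hypothetical `sample`, `verifier_score`, `extract_answer`, and `is_correct` helpers:

```python
from collections import Counter

def best_of_n(prompt, sample, verifier_score, n=32):
    """Verifier picks: generate n candidates, return the one the verifier scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def majority_vote(prompt, sample, extract_answer, n=32):
    """No verifier needed: works when the final answer is discrete (math, MCQ)."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def pass_at_n(prompt, sample, is_correct, n=32):
    """Oracle ceiling: did ANY of the n samples solve it? Eval-only (needs ground truth)."""
    return any(is_correct(sample(prompt)) for _ in range(n))
```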

REMEMBER
  • Pass@N = oracle ceiling. BoN approaches it with a good verifier.
  • Majority voting works when there's no verifier and answers are discrete.
  • Diminishing returns set in once verifier saturates — measure the curve.
06
SEARCH · TREE-BASED INFERENCE

Tree search at inference — MCTS, ToT, rStar, AlphaProof

TL;DR

Beyond linear CoT: explore multiple branches at each step, prune via verifier, expand promising subtrees. ToT does this at the prompt level; rStar-Math and AlphaProof scale it with MCTS + RL. Big inference-compute investment, big quality wins on hard problems.

Tree of Thoughts (Yao 2023, arxiv 2305.10601)

Generate multiple thoughts at each step, evaluate each, expand promising branches. Uses LLM both as policy (propose thoughts) and value (evaluate thoughts). Prompt-level technique — no model changes required.
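A minimal breadth-first sketch of the propose-then-evaluate loop, assuming a hypothetical `llm(prompt, n)` call that returns n completions, used both as policy and as value; the prompts here are illustrative, not the paper's:

```python
def tree_of_thoughts(problem, llm, branch=3, keep=2, depth=3):
    """Propose `branch` thoughts per frontier path, score each with the LLM-as-value,
    keep the `keep` best paths, repeat for `depth` levels."""
    frontier = [""]                                  # partial reasoning paths
    for _ in range(depth):
        scored = []
        for path in frontier:
            thoughts = llm(f"Problem: {problem}\nSo far: {path}\n"
                           f"Propose a distinct next reasoning step.", n=branch)
            for t in thoughts:
                new_path = path + t + "\n"
                # Value prompt is assumed to return a bare number 0-10.
                value = float(llm(f"Problem: {problem}\nReasoning: {new_path}\n"
                                  f"Rate how promising this is, 0-10. Reply with a number.",
                                  n=1)[0])
                scored.append((new_path, value))
        frontier = [p for p, _ in sorted(scored, key=lambda s: s[1], reverse=True)[:keep]]
    return llm(f"Problem: {problem}\nReasoning: {frontier[0]}\nGive the final answer.", n=1)[0]
```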

rStar-Math (Microsoft 2024)

MCTS over reasoning steps. Each node is a partial solution; expansion samples next steps; backup propagates verifier rewards. Achieved frontier math performance with a 7B model + heavy MCTS — example of inference compute substituting for parameters.
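A compact MCTS-over-steps skeleton in the same spirit, assuming hypothetical `propose_step` (policy samples one next step), `is_terminal`, and `verify` (verifier reward) functions; real systems add rollouts, learned priors and values, and far more engineering:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state = state                       # partial solution so far
        self.parent, self.children = parent, []
        self.visits, self.value_sum = 0, 0.0

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")                  # always try unvisited children first
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(prompt, propose_step, is_terminal, verify, iterations=200, width=4):
    root = Node("")
    for _ in range(iterations):
        node = root
        # 1. Select: descend by UCT until a leaf.
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expand: sample a few candidate next steps from the policy.
        if not is_terminal(node.state):
            node.children = [Node(node.state + propose_step(prompt, node.state), node)
                             for _ in range(width)]
            node = random.choice(node.children)
        # 3. Evaluate: reward for this state (real systems roll out to a terminal
        #    state and let the verifier score it there).
        reward = verify(prompt, node.state)
        # 4. Backup: propagate reward and visit counts up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    # Standard MCTS final choice: the most-visited child of the root.
    return max(root.children, key=lambda n: n.visits).state if root.children else root.state
```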

AlphaProof (DeepMind 2024)

RL on Lean 4 formal proofs with MCTS-style search at training and inference. Earned IMO silver-medal performance. Pipeline: auto-formalize natural-language problems → MCTS over Lean tactics → verifier (Lean) gives reward.

REMEMBER
  • ToT = prompt-level branching + LLM-as-judge. Cheap to try.
  • rStar-Math = MCTS + verifier. Big-compute, big-win on hard math.
  • AlphaProof = MCTS over Lean tactics. The DeepMind formal-math direction.
07
FAILURE MODES · REWARD HACKING

Reward hacking in reasoning — the new failure modes

TL;DR

Even with RLVR, models game the reward signal in subtler ways: tampered CoT (right answer + wrong reasoning), verifier exploitation (regex match without correct logic), length hacking (long CoTs correlated with correctness in training data), specification gaming (clever solutions that pass tests but aren't intended). Active research at Anthropic.

The four common reward-hacking modes

  • Tampered CoT: the final answer is right but the written reasoning did not actually produce it.
  • Verifier exploitation: the output matches the regex or passes the check without correct logic.
  • Length hacking: long CoTs correlated with correctness in training data, so the model pads.
  • Specification gaming: clever solutions that pass the tests but are not what was intended.

PITFALL — RLVR is not reward-hack-proof
The verifier signal itself can't be hacked, but the proxy signals around it (CoT format, length, structure) can. Anthropic published reward-hacking papers in 2024-25; expect this question in interp + safety interviews.
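Two of the cheap mitigations above, sketched. Everything here is illustrative: the helper names (`verifiers`, `extract_answer`) are hypothetical and the mismatch check is a crude flag for human or LLM review, not a published method:

```python
def robust_reward(completion, gold, verifiers):
    """Ensemble of independent checks: reward only when ALL verifiers agree,
    shrinking the surface any single exploitable check exposes."""
    return 1.0 if all(v(completion, gold) == 1.0 for v in verifiers) else 0.0

def cot_answer_mismatch(completion, extract_answer):
    """Behavioral flag for tampered CoT: the stated final answer never shows up
    anywhere in the reasoning body that supposedly derived it."""
    answer = extract_answer(completion)               # parsed final answer, or None
    body = completion.rsplit("Final answer", 1)[0]    # everything before the answer line
    return answer is not None and str(answer) not in body
```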
REMEMBER
  • Tampered CoT is the modern problem — answer correct, reasoning fake.
  • Mitigations: process rewards, multiple verifiers, behavioral evals catching CoT-answer mismatch.
  • Anthropic loops will probe this. Have an opinion.
08
SCALING LAWS · INFERENCE SIDE

Inference-time scaling laws (Snell 2024)

TL;DR

Snell 2024 measured how test-time compute trades off against train-time compute. Hard problems benefit far more from extra inference compute than easy ones. For some tasks, test-time compute can substitute for ~14× train-time compute. Optimal allocation depends on the problem distribution and serving economics.

The empirical findings

  • Hard problems benefit far more from extra inference compute than easy ones.
  • On some tasks, test-time compute substitutes for roughly 14× train-time compute.
  • Revision (sequential) vs search (parallel) is the core practical knob; the best mix is problem-dependent.
  • Optimal allocation also depends on serving economics, not just accuracy.

REMEMBER
  • Inference scaling has its own laws — they're problem-dependent.
  • Cite Snell 2024 (arxiv 2408.03314) when discussing.
  • The "revision vs search" tradeoff is the core practical knob.
09
PRODUCTION · DEPLOYMENT KNOBS

Practical deployment — thinking budgets, routing, distillation

TL;DR

Three knobs in production. (1) Reasoning effort — user-facing "thinking budget" exposed by o-series, Claude, R1. (2) Model routing — small classifier on the prompt picks fast vs reasoning model. (3) Distillation — push reasoning into a smaller dense student (R1 → Qwen-7B is canonical).
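A toy sketch of knobs (1) and (2) together, assuming hypothetical `difficulty_classifier` (returns 0-1), `fast_model`, and `reasoning_model` callables; real routers are trained classifiers and real thinking budgets are API parameters, not a hand-rolled token count:

```python
def serve(prompt, difficulty_classifier, fast_model, reasoning_model,
          max_thinking_tokens=8192):
    """Route easy prompts to the cheap model; give hard ones a capped thinking budget."""
    difficulty = difficulty_classifier(prompt)        # small classifier over the prompt, 0-1
    if difficulty < 0.5:
        return fast_model(prompt)                     # no hidden CoT: 10-100x cheaper to serve
    # Scale the thinking budget with difficulty instead of always paying the maximum.
    budget = max(256, int(max_thinking_tokens * difficulty))
    return reasoning_model(prompt, thinking_budget=budget)
```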

REMEMBER
  • Three production knobs: thinking budget, routing, distillation.
  • Cost discipline — reasoning models burn 10-100× the tokens of vanilla chat.
10
EVAL · BENCHMARKS

Eval benchmarks for reasoning

TL;DR

MATH and HumanEval are saturated. Use AIME, GPQA-Diamond, FrontierMath, HLE, SWE-Bench Verified, LiveCodeBench, ARC-AGI for differentiation. Full benchmark reference: evals page.

Benchmark · Domain · Notes
  • MATH · Competition math · Saturated by frontier reasoning models (~95%+)
  • AIME 2024 / 2025 · Olympiad math · Current high-signal math eval
  • GPQA-Diamond · PhD-level science · Contamination-resistant; ~85% SOTA
  • HLE (Humanity's Last Exam) · Hard polymath · 2025; very low SOTA still
  • FrontierMath · Hard research math · Held-out; o3 jumped scores significantly
  • SWE-Bench Verified · Code agent · Real GitHub issues; agentic eval
  • LiveCodeBench · Competitive programming · Continuously updated to avoid contamination
  • ARC-AGI · Abstract reasoning · Chollet's eval; o3 made a breakthrough
REMEMBER
  • Cite saturation status of any benchmark you mention. MMLU = stale.
  • FrontierMath, GPQA-Diamond, HLE = current frontiers.
  • Drill the full evals page before any onsite.

0 → hero reasoning-models path

  1. foundation OpenAI o1 announcement post
  2. foundation DeepSeek R1 release + paper
  3. foundation Nathan Lambert — Interconnects on RLHF/post-training news
  4. depth Lightman 2023 — Let's Verify Step by Step (PRM)
  5. depth Snell 2024 — Scaling LLM Test-Time Compute
  6. depth DeepSeek Math (GRPO)
  7. depth DeepSeek R1 paper — read end-to-end
  8. depth Tree of Thoughts (Yao 2023)
  9. depth Tülu 3 — open RLVR recipe

Reasoning quiz — readiness check

  1. How does RLVR differ from RLHF?

    RLVR uses programmatic verifiers (unit tests, exact-match, formal proof). RLHF uses learned reward model. RLVR can't be reward-hacked on the verifier signal but only works for verifiable tasks (math, code, formal logic, structured output).

  2. Walk through DeepSeek R1's pipeline.

    4 stages: (1) cold-start SFT on reasoning traces with desired format. (2) RL with GRPO + verifiable rewards. (3) Rejection-sampling SFT (600k math/code from stage-2 + 200k general). (4) Final RL with mixed verifiable + preference rewards. Distill into smaller dense models.

  3. PRM vs ORM tradeoffs?

    ORM (outcome): cheap, only needs final correctness label. PRM (process): step-level labels (expensive — PRM800K cost millions). PRM enables verifier-guided beam search and better credit assignment for long reasoning.

  4. Best-of-N vs majority voting?

    BoN needs a verifier and comes closest to the oracle Pass@N. Majority voting needs no verifier; it works for tasks with discrete answers (math, MCQ) and is robust to noisy individual outputs.

  5. How would you serve a reasoning model with 8k hidden CoT tokens per query?

    Massive decode load (8k tokens × users); huge KV cache. Disaggregated prefill (cheap) + decode (expensive); aggressive prefix caching across reasoning segments; queue with priority for premium tier; possibly speculative decoding with weaker model for early CoT phase; user-facing "thinking budget" knob.

  6. What is reward hacking in reasoning models?

    Tampered CoT (correct answer with non-causal reasoning), verifier gaming (matches regex but reasoning is wrong), length hacking (reward correlated with verbose output), specification gaming. Mitigations: process rewards, multiple verifiers, behavioral evals catching CoT-answer mismatch.

  7. Why does pure RL (R1-Zero) work without SFT?

    A strong base model already contains reasoning circuits; RL only needs to elicit them. SFT bottlenecks the model into the demo distribution; RL is freer to explore. Critical caveat: R1-Zero readability suffers (mixes languages); cold-start SFT in the full R1 pipeline fixes this.

  8. How would you choose between train-time vs test-time compute scaling?

    Easy distributions: more train compute. Hard distributions with high-value queries: more test-time compute. Snell 2024 framework: measure scaling exponents on each axis at fixed budget. Test-time compute can substitute for ~14× train compute on hard problems.

  9. What is GRPO's advantage formula?

    For G samples per prompt with rewards r_1, ..., r_G: A_i = (r_i − mean(r)) / std(r). Sequence-level scalar advantage broadcast to every token in output i. PPO-style clipped per-token ratio with this advantage. No critic/value model.

  10. What is rStar-Math?

    (Microsoft 2024) MCTS over reasoning steps for math. Each node is a partial solution; expansion samples next steps; backup propagates verifier rewards. Achieved frontier math performance with a 7B model + heavy MCTS — example of inference compute substituting for parameters.

  11. What does AlphaProof do?

    (DeepMind 2024) RL on Lean 4 formal proofs with MCTS at training and inference. Auto-formalize natural-language problems → MCTS over Lean tactics → Lean checker is the verifier. IMO silver-medal performance.

  12. Why is GPQA-Diamond a better eval than MMLU for reasoning models?

    MMLU is largely saturated (90%+ for frontier). GPQA-Diamond: hand-written by domain PhDs; multi-step reasoning required; designed to resist contamination. Differentiates frontier reasoning models more cleanly. ~85% SOTA.

  13. What is process reward model (PRM) — in 1 sentence.

    A classifier trained on (prompt, partial reasoning, step_correct?) that scores each step of a CoT, enabling verifier-guided beam search and better credit assignment than outcome-only rewards.

  14. Tradeoffs of distilling a reasoning model into a smaller dense model?

    Distillation (R1 → Qwen-7B) transfers reasoning ability cheaply; the small model can run on a single GPU; CoT length is usually preserved; quality is below the teacher but often close. Caveats: the distilled model loses some flexibility on novel tasks, and the distillation run itself needs significant compute.

  15. What does "test-time compute scaling" mean concretely?

    Two axes: (1) Sequential — model produces longer chain-of-thought before final answer. (2) Parallel — sample N candidates, pick best (BoN, majority vote). Both can substitute for additional train compute up to a problem-dependent ceiling.