PILLAR · LLM SYSTEMS

LLM training & RLHF

Pretraining, mid-training, SFT, preference optimization, RLVR, reasoning models. The 2025-26 frontier-lab interview probes data quality, GRPO/DPO, and the DeepSeek R1 recipe — be ready to design a Llama-scale run end-to-end.

Read: ~35 min · Asked at: Anthropic, OpenAI, DeepMind, DeepSeek, Together · Difficulty: Sr/Staff bar
01
FOUNDATIONS · PRETRAINING

The pretraining recipe — data is the moat

TL;DR

Frontier 2026 pretraining = 15–30T+ tokens across web + code + books + math + multilingual + synthetic. Quality classifier and MinHash dedup do most of the heavy lifting. Mixing weights up-sample high-quality sources; phased curricula anneal toward reasoning data near the end.

Data sources — what goes in

Frontier mixtures pull from a handful of buckets — web crawl, code, books/academic text, math, multilingual, and synthetic data — in roughly decreasing volume.

Quality filtering — six-stage pipeline

  1. URL/domain filtering — block adult content and low-quality TLDs.
  2. Language ID via fastText or CLD3 — keep target languages only.
  3. Heuristic (Gopher) rules — avg word length 3–10, % lines ending in punctuation, % stop words, % alphabetic chars.
  4. Repetition removal — drop docs with excessive line/paragraph repetition.
  5. Toxicity / PII filters.
  6. Quality classifier — small classifier trained on "high quality" (Wikipedia, books) vs random web; score every doc.
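A minimal sketch of stages 3 and 6, with illustrative thresholds and a hypothetical quality_clf scoring interface (neither is any lab's actual config):

def passes_gopher_heuristics(doc: str) -> bool:
    """Stage 3: Gopher-style heuristic rules (thresholds are illustrative)."""
    words = doc.split()
    if not words:
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_word_len <= 10):
        return False
    lines = [l for l in doc.splitlines() if l.strip()]
    if lines:
        frac_end_punct = sum(l.rstrip()[-1] in ".!?\"'" for l in lines) / len(lines)
        if frac_end_punct < 0.3:              # too few sentence-like lines
            return False
    alpha_frac = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_frac < 0.7:                      # symbol/boilerplate heavy
        return False
    stop_words = {"the", "and", "of", "to", "a", "in", "is", "that"}
    if sum(w.lower() in stop_words for w in words) < 2:
        return False
    return True

def keep_document(doc: str, quality_clf, threshold: float = 0.9) -> bool:
    """Stage 6: score every surviving doc with a small quality classifier."""
    if not passes_gopher_heuristics(doc):
        return False
    return quality_clf.predict_proba(doc) >= threshold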
EXAMPLE — FineWeb-Edu's classifier filter

HF prompted Llama-3-70B to score pages' educational value, trained a small classifier with a regression head on those labels, then filtered 15T → 1.3T high-quality tokens. Models trained on the 1.3T-token subset beat models trained on the full 15T at the same compute — quality > quantity once you cross a threshold.

Deduplication — three layers stacked

  1. Exact dedup — hash full documents (or URLs) and drop byte-identical copies.
  2. MinHash LSH — fuzzy dedup over n-gram shingles; catches near-duplicates.
  3. SemDeDup — embedding-based dedup within k-means clusters; catches paraphrases, translations, and reformattings that MinHash misses.

EXAMPLE — MinHash LSH parameters in practice

For a web-scale corpus (~5T tokens), a typical setup is 10 bands × 9 hashes per band (90 hashes total). The (b, r) tuple controls precision/recall: more bands → higher recall (catch more duplicates) but more false positives at the band-collision step; more hashes per band → higher precision but lower recall. The Jaccard verification step at the end catches the false positives.
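To sanity-check a (b, r) choice, compute the probability that a pair with Jaccard similarity s collides in at least one band — a minimal sketch:

def candidate_probability(s: float, bands: int = 10, rows: int = 9) -> float:
    """P(pair with Jaccard similarity s lands in at least one common band)
    = 1 - (1 - s^r)^b. The S-curve threshold sits near (1/b)^(1/r)."""
    return 1.0 - (1.0 - s ** rows) ** bands

for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"jaccard={s:.2f}  P(candidate)={candidate_probability(s):.3f}")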

Mixing weights & curricula

Up-sample high-quality (Wikipedia, books, code), down-sample bulk web. DoReMi (Xie 2023, arxiv 2305.10429) trains a small reference model + proxy model and uses Group DRO to find domain weights minimizing worst-case excess loss. DataComp-LM sweeps mixtures and reports compute-optimal blends.

Most labs use a fixed mixture throughout training, but some run phased curricula: bulk web first, then up-weight high-quality + math + code in a late-stage anneal.
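A sketch of what phased mixing looks like operationally — the domain weights and anneal fraction below are illustrative, not any published recipe:

import random

PHASE_WEIGHTS = {
    "bulk":   {"web": 0.70, "code": 0.12, "books": 0.08, "math": 0.04, "multilingual": 0.06},
    "anneal": {"web": 0.30, "code": 0.25, "books": 0.15, "math": 0.20, "multilingual": 0.10},
}

def sample_domain(tokens_seen: int, total_tokens: int, anneal_frac: float = 0.01) -> str:
    """Switch to the reasoning-heavy mixture for the last `anneal_frac` of training."""
    phase = "anneal" if tokens_seen >= total_tokens * (1 - anneal_frac) else "bulk"
    weights = PHASE_WEIGHTS[phase]
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]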

EXAMPLE — Llama 3 & DeepSeek V3 annealing

Llama 3 pretrained on 15T tokens total and annealed on a curated mixture in the last 40B tokens. DeepSeek V3 used a separate mid-training phase emphasizing math and code before SFT. Both shifts boost reasoning evals at almost zero cost.

PITFALL — eval contamination
Common dedup pipelines miss test-set leakage from the web. MMLU questions appear verbatim on tutoring sites, GSM8K problems get reposted on Reddit. Always run an explicit n-gram contamination check against your eval suites and remove matches before training. A few percentage points of "improvement" on MMLU often turns out to be leakage.
REMEMBER
  • Quality classifier + MinHash dedup do 90% of the work — get them right.
  • Phased curricula (anneal toward reasoning data) are nearly free wins.
  • Run explicit eval-contamination checks; trust no dedup pipeline by default.
02
PRETRAINING · LIFECYCLE

Mid-training & continual pretraining — when and why

TL;DR

Mid-training is a 2024-25 term for a phase between pretraining and SFT — same causal LM loss, but heavily up-weighted on reasoning, math, code, and synthetic CoT. Continual pretraining (CPT) is the same idea applied to a frozen base for new domains/languages. Both fight catastrophic forgetting with replay + low LR.

Mid-training — the new norm

Mid-training sits between pretraining and SFT. You continue causal LM training but with a different mixture — heavy up-weight on reasoning, math, code, possibly synthetic chain-of-thought data. Often combined with a WSD (warmup-stable-decay) decay phase. The model becomes "reasoning-ready" without yet being instruction-tuned.
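A minimal sketch of the WSD schedule (step counts and shape are illustrative):

def wsd_lr(step: int, peak_lr: float, warmup: int, stable: int, decay: int) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay toward zero.
    The decay leg is where the reasoning-heavy annealing mixture is usually applied."""
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    if step < warmup + stable:
        return peak_lr
    frac = (step - warmup - stable) / max(decay, 1)
    return peak_lr * max(0.0, 1.0 - frac)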

Continual pretraining (CPT) — adapting a frozen base

Take a pretrained model and continue training on new data — a new language, domain, or time period. The risk is catastrophic forgetting; the standard mitigations are:

  • Replay — mix 5–30% of the original pretraining data back in.
  • Much lower LR — 1/10 to 1/100 of the pretraining peak.
  • PEFT — train LoRA adapters instead of all weights.
  • Regularization — elastic weight consolidation or an explicit pull toward the original weights.

PITFALL — over-aggressive CPT learning rate
Restarting training at the original peak LR is the #1 cause of collapse during CPT. The optimizer state is gone, and a full-size peak LR blasts the carefully converged weights out of their basin before the new data teaches anything. Use 1/10 to 1/100 of the pretraining peak.
REMEMBER
  • Mid-training = reasoning-focused causal LM phase; cheap, high-leverage.
  • CPT requires replay + low LR or you'll forget everything.
  • WSD decay pairs naturally with mid-training annealing.
03
POST-TRAINING · SFT

SFT — instruction tuning that doesn't blunt capability

TL;DR

SFT trains on (instruction, response) pairs with cross-entropy on response tokens (mask the prompt). It teaches format, not capability — capability comes from pretraining. LIMA showed diversity beats scale beyond ~10k–100k examples; Tülu 3 showed 1M+ helps when domains are diverse. Get the chat template right or everything breaks.

The mechanics

Single- or multi-turn instruction-response pairs, trained with cross-entropy on response tokens (mask the prompt). Modern SFT datasets fall into three buckets:

  • Human-written demonstrations — expert or crowd-sourced responses to curated prompts.
  • Synthetic / distilled data — responses generated by a stronger model, usually filtered.
  • Converted academic datasets — existing NLP tasks templated into instruction format.
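In code, "mask the prompt" just means labeling prompt positions with −100 (the index PyTorch's cross-entropy ignores) — a minimal sketch:

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # ignored by F.cross_entropy

def build_labels(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt + response; mask prompt positions so loss hits response tokens only."""
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + response_ids)
    return input_ids, labels

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy with the prompt masked out.
    logits: [seq, vocab], labels: [seq] — shift so position t predicts token t+1."""
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)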

LIMA hypothesis vs Tülu 3

LIMA (arxiv 2305.11206) — "less is more for alignment" — diversity > scale beyond ~10k–100k examples. Capabilities live in the base model; SFT only teaches format/instruction-following.

Tülu 3 (arxiv 2411.15124) updated this: up to 1M+ examples helps when domains are diverse (math, code, IF, safety, multilingual). The practical takeaway: diversity matters more than raw count, but covering more domains inevitably means more examples.

Few-shot vs zero-shot tradeoff

Pretraining gives ICL (in-context learning) ability for free. SFT collapses the model toward instruction-following at zero-shot, which can sometimes hurt few-shot performance.

PITFALL — chat template drift
Formatting consistency is critical: system/user/assistant tokens, ChatML, Llama 3 special tokens. A model trained with one template and served or evaluated with another silently degrades by 5–20 points on benchmarks. Always pin the template; never rely on a tokenizer's default during eval.
REMEMBER
  • SFT teaches format, not capability — keep examples diverse.
  • Mask the prompt; loss on response tokens only.
  • Pin the chat template across train + eval + serve.
04
POST-TRAINING · PREFERENCE OPT

The preference-optimization zoo — RLHF, DPO, GRPO, and friends

TL;DR

RLHF (PPO + reward model + KL penalty) was the original (Christiano 2017, then InstructGPT 2022). DPO (2023) eliminated the RM and the RL loop with a one-step closed form. GRPO (DeepSeek 2024) dropped the value model — group statistics replace the baseline. IPO/KTO/ORPO are variants. Know the DPO derivation cold.

Original deep RLHF — Christiano 2017

The seminal "Deep RL from Human Preferences" paper (Christiano et al. 2017, arxiv 1706.03741) — predates InstructGPT by 5 years. Demonstrated RLHF on Atari and MuJoCo. Frontier-lab interviewers ask "who invented this?" — name Christiano.

RLHF / PPO — the LLM-era recipe

(Ouyang 2022, arxiv 2203.02155 — InstructGPT). Three stages:

  1. Train a reward model RM on (prompt, chosen, rejected) pairs with Bradley-Terry: P(chosen ≻ rejected) = σ(r(chosen) − r(rejected)).
  2. RL fine-tune policy π with PPO against RM, with KL penalty to reference (SFT) policy: r_total = RM(x,y) − β·KL(π||π_ref).
  3. PPO clipped objective, value head, GAE for advantage estimation.
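A minimal sketch of the stage-1 Bradley-Terry RM loss, assuming the reward model has already produced scalar scores for the chosen and rejected responses:

import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """RM training loss: -log sigma(r_chosen - r_rejected), averaged over the batch.
    r_chosen / r_rejected: [B] scalar rewards for chosen / rejected responses."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()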
PITFALL — reward hacking & RM overoptimization
RM correlates with true preference up to a point, then diverges (Goodhart's law). PPO will happily exploit that gap — outputs "look great" to the RM but a human rates them worse. Mitigations: stronger KL coefficient, RM ensembles, PPO early stopping by held-out human eval, or move to RLVR (verifiable rewards — see Ch. 6).

DPO — the simpler one (most popular 2023-24)

(Rafailov 2023, arxiv 2305.18290). Reformulates RLHF as a classification loss over preference pairs, anchored to the reference policy. Loss:

L_DPO(π_θ; π_ref) = −E_(x, y_w, y_l) [
    log σ( β · [
        (log π_θ(y_w|x) − log π_ref(y_w|x))
      − (log π_θ(y_l|x) − log π_ref(y_l|x))
    ])
  ]

Eliminates the RM and the RL loop. Trains directly on preference pairs (y_w = winner, y_l = loser).

THE INSIGHT — DPO closed-form derivation

Why DPO works without an RM

Start from the KL-constrained RL objective:

max E[r(x,y)] − β · KL(π || π_ref)

The closed-form optimal policy is:

π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp( r(x,y) / β )

Solving for the implicit reward:

r(x,y) = β · log( π*(y|x) / π_ref(y|x) ) + β · log Z(x)

Plug into the Bradley-Terry preference probability — the partition function Z(x) cancels because the chosen and rejected responses share the same prompt — and you get the DPO loss above. The point: DPO trains the policy directly while implicitly learning the reward model. No separate RM. No PPO loop.
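Putting the derivation into code — a minimal sketch of the DPO loss, assuming per-sequence log-probs summed over response tokens under the policy and the frozen reference:

import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed response log-probs under policy and frozen reference.
    The margins are the implicit rewards beta*log(pi/pi_ref); Z(x) has already cancelled."""
    chosen_margin = logp_w - ref_logp_w      # implicit reward of the winner
    rejected_margin = logp_l - ref_logp_l    # implicit reward of the loser
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()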

PITFALL — mode collapse in DPO
DPO can simultaneously decrease the likelihood of both chosen and rejected responses — only the gap between log-probs matters. The policy may drift far from π_ref and outputs degrade. Mitigations: stronger β, IPO (squared loss), mix in SFT loss as regularizer, conservative sampling temperature during training, or anchor with a small SFT replay set.

IPO, KTO, ORPO — the variants

  • IPO — replaces DPO's log-sigmoid with a squared loss on the margin; less prone to over-optimizing the preference gap.
  • KTO (Ethayarajh 2024) — needs only binary good/bad labels per response, no pairs; models prospect-theoretic utility.
  • ORPO — folds an odds-ratio preference penalty into the SFT loss; no reference model needed.

GRPO — the 2024 winner (DeepSeek)

(DeepSeek Math, arxiv 2402.03300; key role in DeepSeek R1.) For each prompt, sample G outputs from the current policy. Compute rewards r₁..r_G. Compute the sequence-level advantage:

A_i = (r_i − mean(r_1..r_G)) / std(r_1..r_G)

No critic / value model needed — group statistics replace the value baseline. PPO-style clipped objective uses this advantage broadcast to every token in output i (per-token ratio, scalar advantage). Removes the critic, halves optimizer memory.
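A minimal sketch of the group-relative advantage and the clipped per-token surrogate (GRPO's KL term to the reference policy is omitted here):

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [G] scalar rewards for G samples of the same prompt.
    Returns one scalar advantage per sample, broadcast to every token of that sample."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_token_loss(logratio: torch.Tensor, adv: float, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate over the tokens of one sampled output.
    logratio: [T] per-token log(pi_theta / pi_old); adv: scalar advantage for this output."""
    ratio = logratio.exp()
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv
    return -torch.minimum(unclipped, clipped).mean()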

EXAMPLE — per-token credit assignment in GRPO (or lack thereof)

Common follow-up: "What about per-token credit assignment in GRPO?" Answer: there isn't any — every token in output i gets the same scalar A_i. This is GRPO's strength on verifiable-reward tasks (the reward is sequence-level anyway — pass/fail unit tests, math correctness) and its weakness on dense-reward problems where the signal is genuinely token-local.

RLOO (Ahmadian 2024) and REINFORCE++ are competing approaches that also drop the critic but use different baselines (leave-one-out mean, moving-average baseline).

PPO (RLHF)

  • Needs RM + value model
  • ~4x base model in memory
  • Per-token credit via GAE
  • Reward hacking risk

GRPO

  • No critic; group baseline
  • ~2x base model
  • Sequence-level advantage
  • Pairs naturally with verifiable rewards
REMEMBER
  • Christiano 2017 = origin of deep RLHF; InstructGPT 2022 = LLM application.
  • DPO derivation: KL-constrained RL → closed-form policy → solve for reward → Bradley-Terry → Z(x) cancels.
  • GRPO drops the critic; advantage is sequence-level.
  • Mode collapse in DPO is real; β + SFT mix mitigates.
05
POST-TRAINING · ANTHROPIC STACK

Constitutional AI & RLAIF — preferences from a model, not a crowd

TL;DR

RLAIF replaces human preference labels with AI-judge labels — cheaper, scales further, often comparable quality. Constitutional AI (Anthropic) makes the AI judge follow explicit principles: SL phase trains the model to self-critique and revise; RL phase trains an RM on AI-generated preference pairs. Subjectivity moves from crowd to constitution.

RLAIF — the cheap version

Like RLHF but preference labels come from an AI judge (often a stronger model) instead of humans. Cheaper, scales further, often comparable quality on most tasks. The labeling distribution is more consistent than humans (less noise) but inherits the judge's biases.

Constitutional AI (Anthropic)

(Bai 2022, arxiv 2212.08073). Two phases:

  1. SL (supervised) phase — the model critiques its own responses against the constitution, revises them, and is fine-tuned on the revisions.
  2. RL phase — the model generates response pairs, an AI judge picks the preferred one per the constitution, an RM is trained on those AI labels, and the policy is optimized with PPO (RLAIF).

The constitution itself is a small set of natural-language principles (be helpful, avoid harm, refuse illegal requests, etc.). Subjectivity shifts from crowd workers to the principles you write.

REMEMBER
  • RLAIF = labels from an AI judge instead of humans.
  • Constitutional AI = RLAIF where the judge applies explicit written principles.
  • Cheaper than RLHF; subjectivity moves from crowd to constitution.
06
POST-TRAINING · REASONING

RLVR & reasoning models — the 2024-25 unlock

TL;DR

RLVR (RL with Verifiable Rewards) replaces a learned RM with a programmatic verifier — unit tests, exact-match math, formal proof checker. No reward hacking on the verifier signal. This is what made o-series, R1, and Tülu 3's reasoning capability possible. Test-time compute (length of reasoning) becomes a knob.

RLVR — the idea

Instead of a learned RM, use a programmatic verifier:

  • Math — exact-match or symbolic check of the final answer against the gold solution.
  • Code — run the completion against unit tests; reward = pass/fail.
  • Formal proofs — a proof checker accepts or rejects.
  • Instruction-following — programmatic constraint checks (length, format, keywords).

No reward hacking on verifiable tasks — the verifier is the ground truth. The "RL revolution of 2024-25" is largely RLVR. Used by Tülu 3 (math, code, IF), DeepSeek R1, OpenAI o-series.
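A minimal sketch of two such verifiers — a GSM8K-style final-answer matcher and a run-the-unit-tests reward. Function names and the "####" answer format are illustrative, and real systems sandbox code execution far more carefully:

import re
import subprocess
import tempfile

def math_reward(completion: str, gold_answer: str) -> float:
    """Exact-match math verifier: find the final '#### <answer>' line and compare."""
    match = re.search(r"####\s*([^\n]+)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def code_reward(solution: str, test_code: str, timeout_s: int = 5) -> float:
    """Unit-test verifier: run the candidate solution against hidden tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0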

PITFALL — reward hacking on the verifier
Even verifiable rewards can be gamed: tampered CoT (correct answer with non-causal reasoning), verifier exploitation (matches the regex but reasoning is wrong), length hacking (rambling earns partial credit somewhere). Mitigations: process rewards (PRM), multiple verifiers, behavioral evals catching CoT-answer mismatch.

OpenAI o1 / o3 — what we know

Trained with large-scale RL on reasoning traces. The model learns to produce long chains of thought, including backtracking, self-verification, and alternative approaches. Test-time compute (length of reasoning) becomes a knob — more thinking → better answers, on a power-law curve.

REMEMBER
  • RLVR = programmatic verifier replaces RM. No (RM-style) reward hacking.
  • Reasoning emerges from RL on verifiable tasks; the base must be strong.
  • Test-time compute is a real, monotone knob — power law.
07
CASE STUDY · OPEN RECIPE

DeepSeek R1 deep dive — the recipe everyone copies

TL;DR

DeepSeek R1 (Jan 2025, arxiv 2501.12948) showed that pure RL with verifiable rewards on a strong base elicits reasoning — no SFT required (R1-Zero). The full R1 wraps that with cold-start SFT, rejection-sampled SFT, and a final mixed-reward RL. Distilled smaller models inherit reasoning cheaply.

THE INSIGHT — the R1 4-stage pipeline

How DeepSeek built a reasoning model in the open

  1. Cold-start SFT — small set of curated reasoning traces with desired format. Establishes structure.
  2. RL with GRPO + verifiable rewards — math, code. Reasoning emerges.
  3. Rejection-sampling SFT — sample many completions from stage-2 model; keep the correct ones (~600k math/code) and add ~200k general data; SFT on this.
  4. Final RL — mixed verifiable + preference rewards. Polish helpfulness and safety while preserving reasoning.

Then distill into smaller dense models (Qwen-7B/14B/32B, Llama-8B/70B) by training them on R1's outputs.

R1-Zero — the proof of concept

R1-Zero is pure RL (GRPO) with verifiable rewards on a base model (DeepSeek V3 base). No SFT. Reasoning emerges — the model learns long CoT, "aha moments" of self-correction. Readability is poor (it mixes languages) but capability is strong. The full R1 fixes readability with the cold-start and rejection-sampling stages.

Lessons

  • A strong base model is the prerequisite — RL elicits reasoning, it doesn't create it (R1-Zero).
  • Verifiable rewards + GRPO scale without a reward model to hack.
  • Cold-start SFT and rejection-sampling SFT fix readability and lock in the RL gains.
  • Distillation transfers the reasoning cheaply to smaller dense models.

REMEMBER
  • R1 pipeline: cold-start SFT → RLVR/GRPO → rejection-sampling SFT → mixed-reward RL → distill.
  • R1-Zero proved pure RL on a strong base elicits reasoning without SFT.
  • Distillation transfers reasoning cheaply to smaller dense models.
08
POST-TRAINING · TRANSFER

Distillation — soft labels, hard labels, on-policy

TL;DR

Hinton-style distillation: minimize KL between teacher and student softmaxes at temperature T. Soft labels carry more info than argmax. Hard-label distillation (just teacher's argmax) is easier and widely used. On-policy distillation (GKD, MiniLLM) has the student sample from itself and the teacher correct — fixes exposure bias.

Soft-label (Hinton) distillation

Minimize KL(p_teacher || p_student) where p_teacher = softmax(logits_T / T) at temperature T. Student learns from "soft labels" — full distribution carries more info than argmax (e.g., teacher might be 60% A, 30% B, 10% C — student learns the relative ranking, not just A).
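A minimal sketch of the temperature-scaled KL, assuming teacher and student logits over the same vocabulary:

import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 2.0) -> torch.Tensor:
    """KL(p_teacher || p_student) at temperature T, per Hinton et al.
    Multiplying by T^2 keeps gradient magnitudes comparable as T changes."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)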

Hard-label distillation

Train student on teacher's argmax outputs. Used widely (DeepSeek R1 → Qwen distill works this way). Simpler, doesn't need teacher's logits, equivalent to "teacher generates SFT data."

On-policy distillation (GKD, MiniLLM)

Student samples from itself during training; teacher provides corrections on the student's own trajectory. Avoids exposure bias — the gap between training distribution (teacher's) and inference distribution (student's). More expensive but higher quality on long generations.

REMEMBER
  • Soft labels carry more info than hard labels, but you need access to the teacher's logits.
  • Hard-label distill = "teacher generates SFT data" — simple and used everywhere.
  • On-policy distill fixes exposure bias for long generations.
09
SYSTEMS · LOW-PRECISION

Quantization-aware & FP8 training — the H100/Blackwell era

TL;DR

QAT inserts fake quantization in the forward pass and uses straight-through estimators backward — model becomes robust to INT4/INT8 inference. FP8 training (E4M3 forward, E5M2 backward, FP32 master weights) is now standard on H100/Blackwell. DeepSeek V3's per-tile FP8 recipe is the open reference.

QAT — train through the quantizer

Insert fake quantization (quantize to the low-precision grid, then dequantize back to float) in the forward pass; use the straight-through estimator on the backward pass (the gradient passes through the rounding as if it were identity). Train so the model is robust to quantization error. Used to ship clean INT4 / INT8 inference checkpoints.
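A minimal sketch of symmetric per-tensor fake quantization with an STE, using the detach trick (bit-width and scaling scheme are illustrative):

import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Forward: quantize to an INT grid and dequantize back.
    Backward: straight-through estimator — gradient of the identity."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_q = (x / scale).round().clamp(-qmax - 1, qmax) * scale
    return x + (x_q - x).detach()  # forward uses x_q, backward sees d(x)/dx = 1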

FP8 training (H100/Blackwell)

Two formats:

  • E4M3 (4 exponent, 3 mantissa bits) — more precision, less range; used for the forward pass, weights, and activations.
  • E5M2 (5 exponent, 2 mantissa bits) — more dynamic range; used for gradients in the backward pass.

Per-tensor or per-channel scaling factors. Maintain FP32 master weights so the optimizer state has full precision.

EXAMPLE — DeepSeek V3 FP8 recipe

Per-tile scaling, FP32 promotion, BF16 fallback

From the DeepSeek V3 technical report (arxiv 2412.19437):

  • Fine-grained quantization — per-tile scaling: 1×128 tiles for activations, 128×128 tiles for weights. Each tile gets its own scale factor.
  • Online scaling — compute the scale on-the-fly from current statistics, not from a calibration set.
  • FP32 accumulation promotion — partial sums get promoted up to FP32 every 128 elements during the matmul accumulation. Prevents accumulation drift.
  • BF16 fallback — embedding, output head, and normalization stay in BF16. They're a small fraction of compute but quality-critical.

Result: ~2× throughput vs BF16 with no measurable quality loss across DeepSeek V3 evals.
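A sketch of per-tile activation scaling in this style (1×128 tiles, online absmax scales). The tiling layout here is illustrative rather than DeepSeek's actual kernel, and the float8_e4m3fn dtype needs a recent PyTorch:

import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_activation_tiles(x: torch.Tensor, tile: int = 128):
    """Each tile of 128 contiguous activations gets its own scale,
    computed online from the tile's absmax (no calibration set)."""
    rows, cols = x.shape
    assert cols % tile == 0
    x_tiles = x.view(rows, cols // tile, tile)
    scales = x_tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x_tiles / scales).to(torch.float8_e4m3fn)   # scaled values now fit in E4M3
    return x_fp8, scales                                  # the matmul kernel folds scales back in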

MX formats (Microscaling)

OCP standard 2024 — block of 32 elements share a power-of-2 scale. MXFP8, MXFP6, MXFP4. Hardware support shipping in Blackwell. Better outlier handling than per-tensor and lower metadata overhead than per-channel.

REMEMBER
  • QAT = fake quant forward + STE backward → robust to INT inference.
  • FP8 split: E4M3 forward/weights, E5M2 gradients, FP32 master weights.
  • DeepSeek V3 recipe: per-tile scaling (1×128 / 128×128), online scaling, FP32 accum every 128, BF16 fallback for embed/output/LN.
  • MX formats are the next step — Blackwell has hardware support.
10
SYSTEMS · LONG CONTEXT

Long-context extension — YaRN, ring attention, training tricks

TL;DR

You don't pretrain on 128k. You pretrain on 4-8k, then extend with Position Interpolation, NTK-aware scaling, or YaRN, with a brief continued-training pass. Ring attention partitions sequence across GPUs for the actual long-context training. Then evaluate carefully — needle-in-haystack passing doesn't mean multi-hop reasoning over the full window works.

Position interpolation family

  • Position Interpolation (PI) — linearly rescale positions so the extended window maps back into the trained range; a brief fine-tune recovers quality.
  • NTK-aware scaling — rescale the RoPE base instead, so high-frequency (local) dimensions are barely interpolated while low-frequency ones stretch.
  • YaRN — NTK-by-parts interpolation plus an attention-temperature adjustment; the standard choice for 128k+ extensions with a short continued-training pass.
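A minimal sketch of the first two knobs — linear PI over positions and the NTK-aware base rescale (the exponent d/(d−2) is the standard NTK-aware formula; YaRN's per-band interpolation is omitted):

import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def positions_linear_pi(seq_len: int, train_len: int) -> torch.Tensor:
    """Position Interpolation: squeeze positions so seq_len maps back into the trained range."""
    return torch.arange(seq_len).float() * (train_len / seq_len)

def ntk_aware_base(base: float, dim: int, scale: float) -> float:
    """NTK-aware scaling: enlarge the RoPE base so low-frequency dims stretch
    while high-frequency (local) dims are barely touched."""
    return base * scale ** (dim / (dim - 2))

# Usage: inv_freq = rope_inv_freq(128, base=ntk_aware_base(10000.0, 128, scale=16.0))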

Ring & striped attention — sequence-parallel for the actual training

Ring attention shards the sequence across GPUs: each device runs blockwise (Flash)attention on its local queries while KV blocks circulate around the ring, overlapping communication with compute, so per-device memory stays flat as context grows. Striped attention permutes tokens across devices so the causal mask doesn't leave the later-ranked devices idle.

Long-context evaluation

Standard suite: needle-in-haystack (planted fact), RULER (multiple needles, multi-hop), LongBench, BABILong, multi-doc QA.

PITFALL — needle-passing ≠ long-context reasoning
A model that gets needle-in-haystack 100% can still fail at multi-hop reasoning over long context. Single-needle retrieval is the easiest possible long-context task. Always evaluate with RULER or BABILong as well, and probe with multi-doc reasoning.

Curriculum / rejection sampling / best-of-N (bonus toolkit)

Length curricula (train on progressively longer sequences), rejection-sampling fine-tuning (sample many completions, keep the verified-correct ones, SFT on those), and best-of-N at inference (generate N candidates, pick the best by RM or verifier) round out the toolkit.

REMEMBER
  • Pretrain short, extend with YaRN (or NTK-aware), continue-train briefly.
  • Ring + Striped attention + FlashAttention enables 1M+ token training.
  • Needle-in-haystack is necessary but not sufficient — evaluate multi-hop.

0 → hero reading path for LLM training + RLHF

  1. foundation OpenAI Spinning Up — RL primer; the Vanilla PG and PPO chapters are crucial
  2. foundation Chip Huyen — RLHF: Reinforcement Learning from Human Feedback
  3. foundation Hugging Face — Illustrating RLHF
  4. build TRL library — try DPO / PPO on a small model
  5. build GPT-NeoX or Megatron-LM — read pretraining infra code
  6. depth InstructGPT (Ouyang 2022)
  7. depth Christiano 2017 — Deep RLHF (the original)
  8. depth DPO (Rafailov 2023)
  9. depth DeepSeek Math / GRPO
  10. depth DeepSeek R1 paper — read end-to-end
  11. depth Constitutional AI (Bai 2022)
  12. depth Llama 3 paper
  13. depth DeepSeek V3 technical report
  14. depth Tülu 3 — open SFT/DPO recipe
  15. depth Nathan Lambert's Interconnects blog — RLHF / post-training news

LLM training quiz — readiness check

  1. Walk through Llama 3's training pipeline.
    Show answer

    15T pretraining tokens with quality classifier on web; MinHash dedup; phased annealing in last 40B tokens. Then SFT + rejection sampling + DPO. Architecture: GQA + RoPE + RMSNorm + SwiGLU. Scaling: 8B, 70B, 405B variants. Multimodal added in 3.2.

  2. Why GRPO over PPO?
    Show answer

    No critic/value model → halves optimizer memory; group-relative advantage is well-conditioned for verifiable rewards; simpler to scale. Tradeoff: only sequence-level credit assignment.

  3. How would you train a reasoning model from scratch?
    Show answer

    Strong base → RLVR with GRPO + verifiable rewards (math, code) → SFT on rejection-sampled good traces → final RL with mixed verifiable + preference rewards. Distill into smaller dense models.

  4. How does Constitutional AI differ from RLHF?
    Show answer

    Preferences come from an AI judge applying explicit principles, not humans. SL phase: model self-critiques + revises per principles. RL phase: AI-generated preference pairs train RM, then PPO. Reduces human labeling cost; shifts subjectivity to constitution.

  5. What is mode collapse in DPO?
    Show answer

    DPO can decrease likelihood of both chosen and rejected (only the gap matters); policy may drift far from reference. Mitigations: stronger β, IPO (squared loss), mix in SFT loss, conservative sampling.

  6. Walk through DeepSeek V3's FP8 recipe.
    Show answer

    E4M3 forward + weights, E5M2 backward. Per-tile (1×128 activations / 128×128 weights) scaling. FP32 master weights. FP32 partial-sum promotion every 128 elements. BF16 fallback for embeddings, output head, normalization. ~2× speedup vs BF16 with no quality loss.

  7. Explain Chinchilla and why models today violate it.
    Show answer

    Chinchilla: D ≈ 20·N is compute-optimal during training. But inference cost dominates total cost when serving for years → over-train smaller models past Chinchilla optimum (Llama 3 8B on 15T tokens = 1875 tokens/param) for cheaper per-query inference.

  8. What's the closed-form derivation of DPO?
    Show answer

    Start from KL-constrained RL: max E[r] − β KL(π||π_ref). Closed-form optimum: π* = (1/Z) π_ref · exp(r/β). Solve for r: r = β log(π*/π_ref) + β log Z. Plug into Bradley-Terry preference; Z(x) cancels. Result: DPO loss equals NLL of preferences under the implicit reward model induced by the policy.

  9. What's RFT (rejection sampling fine-tuning)?
    Show answer

    Sample many completions per prompt; keep only correct (verified or RM-scored); SFT on those. Llama 3 RLHF used this. Cheaper than RL; often nearly as good. Used as a stage in DeepSeek R1's pipeline.

  10. What is RLVR and what's its main advantage over RLHF?
    Show answer

    RL with verifiable rewards: programmatic verifier (unit tests, math grader, formal proof checker) instead of learned RM. No reward hacking on the verifier signal. Drives reasoning-model training (o-series, R1, Tülu 3).

  11. Walk through DeepSeek R1's pipeline (4 stages).
    Show answer

    (1) Cold-start SFT: small set of curated reasoning traces with desired format. (2) RL with GRPO + verifiable rewards. (3) Rejection-sampling SFT: 600k math/code from stage-2 + 200k general. (4) Final RL: mixed verifiable + preference rewards. Distill into dense models.

  12. What is MinHash LSH for dedup?
    Show answer

Compute MinHash signatures over n-gram shingles (5-grams typical). Bucket into LSH bands; candidate pairs share at least one band. Verify with Jaccard. For a ~5T-token corpus, typically 10 bands × 9 hashes per band (the (b, r) choice controls precision/recall). Used by DeepSeek, Llama, FineWeb.

  13. What is SemDeDup?
    Show answer

    Cluster embeddings with k-means; dedup within clusters by cosine similarity. Catches near-duplicates that MinHash misses (paraphrases, translations, reformatting). Used as a complement to MinHash.

  14. Why mid-training before SFT?
    Show answer

    A phase between pretraining and SFT: continue causal LM but heavy upweight on reasoning, math, code, possibly synthetic CoT. Combined with WSD decay phase. Model becomes "reasoning-ready" without yet being instruction-tuned. DeepSeek V3 used this.

  15. What's the LIMA hypothesis?
    Show answer

    (arxiv 2305.11206) "Less is more for alignment" — diversity > scale beyond ~10k–100k SFT examples. The model's capabilities come from pretraining; SFT only teaches format/instruction-following. Recent work (Tülu 3) shows up to 1M+ helps when domains are diverse.

  16. What is REINFORCE and why mention it now?
    Show answer

Original policy gradient: ∇J = E[∇log π(a|s) · A] (as a loss, L = −E[log π(a|s) · A]). PPO adds clipping + a value baseline. RLOO (Ahmadian 2024) and REINFORCE++ skip the critic (like GRPO) but use leave-one-out or moving-average baselines. Coming back into fashion for LLM RL.

  17. What is KTO?
    Show answer

    Kahneman-Tversky Optimization (Ethayarajh 2024). Doesn't require pairwise data — only binary "good"/"bad" labels per response. Models prospect-theoretic utility. Easier data collection.

  18. What's catastrophic forgetting and how do you mitigate?
    Show answer

    Continued training on new data degrades performance on old data. Mitigations: (1) replay (mix old data, 5-30%); (2) much lower LR; (3) PEFT (LoRA — train only adapters); (4) elastic weight consolidation; (5) regularize toward original weights.

  19. What's the difference between in-context learning and fine-tuning?
    Show answer

    ICL: no weight updates; few-shot examples in prompt steer behavior. Fast, no infra, but limited context, no persistence. Fine-tuning: weight updates persist; better quality on narrow domain; risks catastrophic forgetting; needs infra. 2026 guidance: try prompt → RAG → SFT/LoRA → full FT → RL in order.

  20. What's reward hacking in reasoning models?
    Show answer

    Model gaming the reward signal: tampered CoT (correct answer with non-causal reasoning); verifier exploitation (matches the regex but reasoning is wrong); length hacking. Mitigations: process rewards (PRM), multiple verifiers, behavioral evals catching CoT-answer mismatch.