Transformer internals
Every modern LLM — Llama 4, DeepSeek V3, Claude, GPT — is a stack of one block repeated. Master the block and you master the model. This chapter walks the block end-to-end at Sr Staff depth: attention math, the MHA→MLA evolution, RoPE/YaRN, KV cache, MoE routing, normalization, tokenization, parallelism.
What you'll learn
- The transformer block — anatomy of one decoder layer
- Self-attention — math, complexity, masking
- The KV evolution — MHA → MQA → GQA → MLA
- Positional encodings — sinusoidal, learned, ALiBi, RoPE, YaRN, NoPE
- KV cache mechanics — what to cache, how big it gets
- Feed-forward block — MLPs, gated activations, SwiGLU
- Mixture of Experts — routing, balancing, DeepSeek's tricks
- Normalization & residuals — pre- vs post-norm, the residual stream
- Tokenization — BPE, SentencePiece, byte fallback
- Sequence & context parallelism — when training breaks the GPU
A modern decoder-only transformer is one block repeated L times. Each block has two sub-layers — attention and FFN — each wrapped in a pre-norm + residual sandwich. The residual stream is the highway; sub-layers are tributaries that read from it and write back to it. Get this picture right and every other detail snaps into place.
The 30,000-ft picture
One pre-norm block, in two equations:
x' = x + Attention(LayerNorm(x))
y = x' + FFN(LayerNorm(x'))
Repeat L times (Llama 3 70B: L=80; DeepSeek V3: L=61). Project to vocab logits at the end. That's the entire model.
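A minimal PyTorch sketch of those two equations (dims are illustrative; `nn.MultiheadAttention` and a GELU MLP stand in for the GQA/RoPE attention and SwiGLU FFN covered later in the chapter):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One decoder block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, d: int, n_heads: int):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.shape[1]
        # bool mask, True = not allowed to attend (future positions)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x
```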
Why this shape — the residual stream as a bus
The mech-interp framing (Anthropic, Elhage 2021): treat the residual stream as a high-dimensional communication bus. Every sub-layer reads from it (via LN), computes a delta, and writes the delta back. The model never overwrites the stream — it only adds to it. This is why residuals are non-negotiable: they preserve a clean identity gradient path through L layers and let later layers re-use information from any earlier layer.
The hidden dimension d is the width of this bus.
Sizing one block
Parameter count of a pre-norm block at hidden dim d with FFN expansion 4d:
- Attention (Q, K, V, O projections): 4·d²
- FFN (W₁, W₂): 2·(4d·d) = 8·d²
- Two LayerNorms: 2·2d (negligible)
- Total ≈ 12·d² per block — FFN is 2× the attention cost.
d = 8192, L = 80. One block ≈ 12 · 8192² ≈ 805M params. Stack 80 → ~64B params just from blocks; the rest is embeddings (vocab 128k × 8192 ≈ 1B) and a few odds and ends. Add GQA, SwiGLU and you land at 70B.
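The same Fermi estimate as a few lines of Python (plain MHA with a 4d FFN; real configs shift the split with GQA, SwiGLU and a wider FFN):

```python
def block_params(d: int) -> int:
    attn = 4 * d * d           # W_Q, W_K, W_V, W_O
    ffn = 2 * (4 * d) * d      # W_1: d -> 4d, W_2: 4d -> d
    norms = 2 * 2 * d          # two norm gains/biases, negligible
    return attn + ffn + norms

d, L, vocab = 8192, 80, 128_000
total = L * block_params(d) + vocab * d        # blocks + embedding table
print(f"{block_params(d)/1e6:.0f}M per block, {total/1e9:.1f}B total")
# ≈ 805M per block, ≈ 65B with embeddings; the published 70B configs
# differ in FFN width and head layout, which closes the gap.
```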
- Modern blocks are pre-norm: x + Sublayer(LN(x)), twice (attention then FFN).
- The residual stream is a bus: sub-layers read, compute a delta, write back.
- Per-block params ≈ 12·d²; FFN is 2× attention.
Attention is softmax(QKᵀ/√d_k)·V. The √d_k keeps softmax from saturating; the causal mask makes it autoregressive; the L² in compute and memory is why long context is hard. Every attention variant in the rest of the chapter is a reaction to that L².
The formula, with shapes
Given input X ∈ ℝL×d (L tokens, hidden dim d), one attention head computes:
Q = X W_Q # (L, d_k)
K = X W_K # (L, d_k)
V = X W_V # (L, d_v)
scores = Q Kᵀ / √d_k # (L, L)
scores = scores + causal_mask
attn = softmax(scores) # rows sum to 1
out = attn V # (L, d_v)
Why divide by √d_k
If Q and K entries are unit-variance, the dot product Q·K has variance d_k. For d_k = 128 that's a stdev of ~11 — softmax of values that large saturates to a one-hot, killing gradients. Dividing by √d_k restores unit variance so softmax stays in its useful regime.
Causal masking
For an autoregressive LM, token q_i must not see future tokens k_j for j > i. Implement by adding −∞ (in practice -1e9) to the upper triangular of the score matrix before softmax. After softmax those entries are 0 and the row still sums to 1.
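The same forward pass as runnable NumPy (one head, no batching; -1e9 stands in for −∞ as above):

```python
import numpy as np

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head causal self-attention. X: (L, d); W_*: (d, d_k)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # (L, d_k) each
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (L, L)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)            # mask future positions
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # rows sum to 1
    return attn @ V                                  # (L, d_v)

rng = np.random.default_rng(0)
L, d, d_k = 6, 32, 8
X = rng.normal(size=(L, d))
out = causal_attention(X, *(rng.normal(size=(d, d_k)) / np.sqrt(d) for _ in range(3)))
print(out.shape)  # (6, 8)
```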
Complexity — the L² problem
| Quantity | Cost | Why it hurts |
|---|---|---|
| Compute | O(L²·d) | Doubling context = 4× FLOPs |
| Activation memory | O(L²) | Materialized scores blow up VRAM |
| KV cache (decode) | O(L·d) | Linear, but big at long context |
FlashAttention sidesteps the O(L²) memory by tiling the score matrix and recomputing on the fly — never materializes the full scores in HBM. This is implementation, not algorithm; the compute is still O(L²·d).
L = 128k, d = 8k. Score matrix L² ≈ 16B entries × 2 bytes = 32 GB per head per layer if you naively materialize. Across 64 heads × 80 layers, naive attention is impossible. FlashAttention drops the materialization; the FLOPs (L²·d ≈ 16B × 8k ≈ 130 TFLOPs of multiply-adds per layer, summed over heads) you still pay.
- softmax(QKᵀ/√d_k)·V — the formula and the √d_k justification.
- Causal mask = upper-triangular −∞ added before softmax.
- Attention is O(L²·d) compute and O(L²) memory; FlashAttention tames the memory, not the compute.
Multi-head attention's KV cache is the inference bottleneck at long context. The four-step evolution shrinks it: MHA (one KV per head) → MQA (one KV total, 8× smaller, quality dip) → GQA (a few KV groups, the 2024 default) → MLA (low-rank latent KV, ~7% the size of MHA, MHA-quality). Each step trades a different kind of redundancy for memory.
MHA — the original
MHA: H independent (Q, K, V) projections of dim d/H; concatenate H outputs and project back to d. Each head learns different relations (syntactic, positional, semantic). KV cache = 2 · H · d_head per token per layer.
MQA — one KV, H queries
MQA (Shazeer 2019): all H query heads share one K and V projection. KV cache shrinks H× (typically 8× or more). Cost: small but real quality drop, especially on harder tasks.
GQA — the 2024 default
GQA (Ainslie 2023, arxiv 2305.13245): an interpolation. G groups of K/V heads, each shared by H/G query heads. Llama 2/3 70B: H=64 query heads, G=8 KV heads → 8× cache reduction with negligible quality loss.
MLA — the DeepSeek leap
MLA (Multi-head Latent Attention) (DeepSeek V2/V3, arxiv 2405.04434) compresses K and V into a low-rank latent vector c (dim ~512). At attention time, K and V are reconstructed from c via up-projection. The killer trick: the up-projection can be absorbed into Q's matrix during inference, so you cache only the small c (~7% the size of MHA cache) and retain MHA-level quality.
Why MLA needs a split between content and rotary paths
The standard Anthropic follow-up: "If MLA absorbs the K up-projection into Q, how does it apply RoPE?" The answer is the load-bearing detail of MLA.
RoPE rotates K by an angle that depends on position, not on the projection matrices. So you cannot pre-multiply the rotation into Q's matrix — the rotation is data-dependent at inference time. MLA solves this by splitting Q and K into two parts:
- Q^C, K^C — content path: low-rank, reconstructed from latent c; no RoPE; the up-projection absorbs cleanly into Q.
- Q^R, K^R — rotary path: small dim (~64); RoPE applied; K^R shared across all heads (MQA-style for the rotary part).
Final attention score = Q^C·K^C + Q^R·K^R. The cache stores only c + the small K^R. This is the architectural trick that lets MLA be both MHA-quality and tiny-cache.
Side-by-side
| Variant | KV cache / token / layer | Quality | Used by |
|---|---|---|---|
| MHA | 2·H·d_head | baseline | GPT-2/3, original Llama |
| MQA | 2·d_head (8× smaller for H=8) | small drop | PaLM, Falcon |
| GQA(g=8) | 2·8·d_head (typical 8× smaller than MHA at H=64) | ≈ MHA | Llama 2/3/4, Mistral, Qwen |
| MLA | ~d_latent + d_R (~7% of MHA) | ≈ MHA | DeepSeek V2/V3 |
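A quick calculator for the table's middle column, with assumed Llama-like dims (H=64, d_head=128, bf16) and a DeepSeek-like latent (512 content + 64 rotary); exact ratios depend on the baseline config, the ordering is the point:

```python
def kv_bytes_per_token_per_layer(variant: str, H=64, d_head=128, groups=8,
                                 d_latent=512, d_rope=64, bytes_per=2) -> int:
    if variant == "MHA":
        return 2 * H * d_head * bytes_per
    if variant == "MQA":
        return 2 * d_head * bytes_per
    if variant == "GQA":
        return 2 * groups * d_head * bytes_per
    if variant == "MLA":
        return (d_latent + d_rope) * bytes_per   # one shared latent + small rotary key
    raise ValueError(variant)

for v in ("MHA", "MQA", "GQA", "MLA"):
    b = kv_bytes_per_token_per_layer(v)
    print(f"{v}: {b} B ({b / kv_bytes_per_token_per_layer('MHA'):.1%} of MHA)")
# MHA ≫ GQA > MLA ≳ MQA; DeepSeek's ~7% figure is quoted against its own MHA dims.
```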
- The progression MHA → MQA → GQA → MLA is a series of cache-shrinking tricks; quality is mostly preserved.
- GQA(g=k) means k KV heads, not k query heads — the naming traps people.
- MLA's content path is absorbable into Q; the rotary path is decoupled because RoPE is position-dependent. This is the Anthropic question.
Self-attention is permutation-invariant — without positional info, "the dog bit the man" equals "the man bit the dog." Six families have been tried; in 2026 the winning recipe is RoPE for pretraining and YaRN to extend to long context.
The six families
| Method | Mechanism | Extrapolation | Used by |
|---|---|---|---|
| Sinusoidal | Add sin(pos/10000^(2i/d)) to embeddings | weak in practice | Original Transformer |
| Learned absolute | Lookup table indexed by position | none beyond train length | GPT-2, GPT-3 |
| ALiBi | Add linear bias −m·|i−j| to scores; no embedding | strong | MPT, BLOOM |
| RoPE | Rotate Q,K in 2D subspaces by θ_i·pos | weak alone, strong with YaRN | Llama, Mistral, DeepSeek, Qwen |
| YaRN | Frequency-categorized RoPE rescaling + temperature | extends 4k → 128k | Llama 3, Qwen 2 |
| NoPE | No positional info; causal mask provides position implicitly | surprising on some tasks | research |
RoPE in one paragraph
RoPE (Su 2021, arxiv 2104.09864) groups Q and K dimensions into 2D pairs and rotates each pair by an angle θ_i · pos, with θ_i = base^(−2i/d) (default base 10000). Because the rotations compose — R(a)ᵀR(b) = R(b−a) — the rotated Q·K depends only on relative position (pos_q − pos_k). This gives you relative-position semantics without learning a relative bias matrix.
# Per 2D pair (x, y), at position p, with base frequency θ:
[x'] [cos(θp) -sin(θp)] [x]
[y'] = [sin(θp) cos(θp)] [y]
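The same rotation as code — a minimal sketch in the interleaved (0,1),(2,3),... pairing (see the convention warning at the end of this section), verifying that scores depend only on the offset:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate vector x (even dim d) at position pos, interleaved-pair convention."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)          # per-pair frequency θ_i
    angle = theta * pos
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]             # the (x, y) of each 2D pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
a = rope(q, pos=3) @ rope(k, pos=7)
b = rope(q, pos=103) @ rope(k, pos=107)
print(np.allclose(a, b))  # True — same offset (4), same score
```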
Llama 2 trained at 4k with base=10000. Llama 3 raised the base to 500000 to push the wavelength spectrum out to 8k. To go further (8k → 128k), Llama 3 used YaRN-style rescaling during a long-context fine-tune. The base controls how fast the slowest-frequency dim rotates; longer wavelengths support longer context.
YaRN — extending RoPE to long context
YaRN (Peng 2023) categorizes RoPE frequencies into three bands and rescales them differently:
- High-freq dims (fast oscillation, position-sensitive): minimal interpolation — they extrapolate.
- Low-freq dims (slow, semantic): full positional interpolation (compress positions into the trained range).
- Mid-freq dims: smoothly interpolated between the two.
Plus an attention-temperature factor that compensates for the lower entropy at long context. Result: extend a 4k pretrained model to 128k with a brief continued-training pass on long-context data.
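A simplified sketch of the band logic (the NTK-by-parts core of YaRN), with assumed default thresholds α=1 and β=32 rotations; the attention-temperature factor is omitted:

```python
import numpy as np

def yarn_frequencies(d, base=10000.0, train_len=4096, target_len=131072,
                     alpha=1.0, beta=32.0):
    """Rescale per-pair RoPE frequencies θ_i for an extended context window."""
    s = target_len / train_len                     # context scale factor
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)                   # original frequencies
    rotations = train_len * theta / (2 * np.pi)    # full turns inside the trained window
    # ramp: 0 = slow/low-freq dims (interpolate fully), 1 = fast/high-freq (leave alone)
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return (1 - ramp) * theta / s + ramp * theta

theta_new = yarn_frequencies(d=128)
# low-freq dims end up divided by s (positions compressed into the trained range),
# high-freq dims are untouched, mid-band dims are blended.
```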
ALiBi vs RoPE — pick one and defend
ALiBi
- Linear bias added to attention scores — no embeddings to learn.
- Per-head slope m (geometric sequence).
- Excellent length extrapolation out of the box.
- Slightly worse in-distribution quality than RoPE.
RoPE
- Rotates Q,K — encodes relative position via dot product.
- Better quality at training length; weak naive extrapolation.
- Pairs cleanly with YaRN/NTK rescaling for long context.
- The 2026 default.
Watch out for the two RoPE pairing conventions: GPT-J/interleaved style (pairs (0,1),(2,3),...) and GPT-NeoX/Llama style (split halves (0, d/2), (1, d/2+1),...). Loading weights across implementations without remapping gives garbage outputs that look almost-right. Always check.
- RoPE rotates Q,K in 2D pairs by θ_i·pos; dot product captures relative position.
- Modern recipe: train at moderate length with RoPE, then YaRN-extend with continued long-context training.
- ALiBi is the simpler alternative with strong extrapolation but weaker peak quality.
During autoregressive decode, K and V for past tokens never change — cache them, recompute only Q for the new token. Cache size scales with 2 · L · n_kv_heads · d_head · n_bytes per token. At long context this dominates VRAM, which is why GQA and MLA exist.
Why a cache exists
For token t, attention reads K and V for all tokens ≤ t. None of those K, V vectors depend on t's query — they were computed when each past token was processed. Recomputing them per step would make decode O(L²) per token; caching makes it O(L) per token (reading the cache).
The size formula
cache_size = 2 · L_layers · n_kv_heads · d_head · seq_len · bytes_per_value
The 2 is for K + V. n_kv_heads is where MHA → MLA shrinks the bill. bytes_per_value is 2 for bf16/fp16, 1 for int8 quantized cache.
L=80, n_kv_heads=8 (GQA), d_head=128, seq=128k, bf16. Cache = 2 · 80 · 8 · 128 · 128k · 2 ≈ 40 GB per request. One such request eats half an 80 GB H100 before you count a single weight. Single-prompt long-context inference is largely a KV-cache problem.
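The same arithmetic as code:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2) -> int:
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value  # 2 = K and V

gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=128 * 1024) / 1e9
print(f"{gb:.1f} GB per 128k-token request")   # ≈ 42.9 GB (= 40 GiB, the ballpark above)
```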
For a deeper dive on paged KV, prefix sharing and the prefill vs decode distinction, see LLM inference.
- Cache K,V because they never change after the token is produced.
- Cache size = 2 · L · n_kv_heads · d_head · seq · bytes.
- At long context the cache, not the weights, dominates VRAM — this is why GQA/MLA matter.
The FFN is a per-token MLP — usually 4× wider than the residual stream — that holds most of the model's parameters and (per mech-interp) most of the model's "knowledge." SwiGLU replaced ReLU/GELU around 2022 and is now universal: it splits the input into a gate and a value, modulates them with Swish, and trades one matrix for measurable quality gains.
Standard FFN — the original
y = W_2 · σ(W_1 · x + b_1) + b_2 # σ ∈ {ReLU, GELU}
Hidden dim 4d. Two matrices, one nonlinearity. GPT-2/3, BERT.
SwiGLU — the 2024+ default
SwiGLU (Shazeer 2020) introduces a gate:
y = W_2 · ( Swish(W_1 · x) ⊙ (W_3 · x) )
where Swish(z) = z · sigmoid(z)
Three matrices. To keep parameter count constant relative to a 4d standard FFN, hidden dim is reduced to ~2.67d (= 8d/3). Used by Llama, Mistral, DeepSeek, PaLM, Qwen.
Standard FFN at d=8192: 2 · (4·8192·8192) = 537M params. SwiGLU at hidden = 8/3 · 8192 ≈ 21845: 3 · (21845·8192) ≈ 537M. Same parameter count, gated nonlinearity, ~1% perplexity win on equal-data training.
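A minimal SwiGLU module matching the equation and the parameter accounting above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        hidden = int(8 * d / 3)                     # ≈ 2.67d to match a 4d two-matrix FFN
        self.w1 = nn.Linear(d, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, d, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))  # Swish(W1·x) ⊙ W3·x

ffn = SwiGLU(8192)
print(sum(p.numel() for p in ffn.parameters()) / 1e6)  # ≈ 537M, same as the 4d FFN
```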
Why gating works (intuition)
The gate Swish(W_1·x) learns which dimensions of the value W_3·x to suppress per-token. It's a soft, data-dependent feature selector — close in spirit to LSTM gates and to multiplicative interactions. Empirically the gain over plain ReLU/GELU is small but consistent and free, so everyone adopted it.
- FFN is a per-token MLP at hidden dim ~4d (or 8d/3 for SwiGLU); 2/3 of block params live here.
- SwiGLU = W_2(Swish(W_1·x) ⊙ W_3·x), three matrices, hidden dim 8d/3 to match params.
- Gated activations beat pure activations by a small but free margin — universal in 2026.
Replace each FFN with N expert FFNs and a learned router that picks k experts per token. You get the parameter capacity of a giant model with the FLOPs of a small one. The whole design problem is keeping experts balanced — three balancing tricks (aux loss, router Z-loss, aux-loss-free bias) are all on the interview menu.
The forward pass
Per token x:
- Router: logits = W_r · x over N experts.
- Top-k: pick the k experts with highest logits; softmax their logits → gate weights g_i.
- Expert forward: each picked expert i computes FFN_i(x).
- Combine: y = Σ_i g_i · FFN_i(x) (see the sketch below).
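A minimal token-choice top-k sketch (looping over experts for clarity; production kernels group tokens by expert instead):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    """x: (tokens, d). router: Linear(d, N). experts: list of N FFN modules."""
    logits = router(x)                                   # (tokens, N)
    gate_vals, idx = logits.topk(k, dim=-1)              # top-k experts per token
    gates = F.softmax(gate_vals, dim=-1)                 # renormalize over the chosen k
    y = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slot = (idx == e).nonzero(as_tuple=True)   # tokens that picked expert e
        if rows.numel():
            y[rows] += gates[rows, slot].unsqueeze(-1) * expert(x[rows])
    return y, logits                                     # logits reused for balancing losses

d, N = 64, 8
experts = nn.ModuleList(nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
                        for _ in range(N))
y, logits = moe_forward(torch.randn(16, d), nn.Linear(d, N), experts, k=2)
```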
Routing variants
- Top-1 (Switch Transformer, Fedus 2021): one expert per token. Cheapest; quality dip vs top-2.
- Top-2 (Mixtral): two experts. Mixtral-8x7B → 47B total params, ~13B active per token.
- Top-K with shared experts (DeepSeek-MoE, arxiv 2401.06066): some experts always activate, routed experts specialize. DeepSeek V3 has 1 shared + 256 routed, top-8 routed.
- Expert Choice (Zhou 2022): experts pick tokens, not vice versa — guarantees load balance by construction. Tradeoff: tokens not chosen are dropped or sent to a default.
Load balancing — get the formula right
Without intervention the router collapses onto a few favorite experts. Three remedies, in chronological order:
1. Switch Transformer auxiliary loss
L_aux = α · N · Σ_{i=1}^{N} f_i · P_i
- N = number of experts
- f_i = fraction of tokens routed to expert i (per batch)
- P_i = mean router probability for expert i (per batch)
- α = scalar coefficient (typically 0.01)
- Why the N multiplier? Under perfectly uniform routing (f_i = P_i = 1/N), N·Σ f_i·P_i = 1 regardless of N — cleanly normalized.
2. Router Z-loss (memorize alongside)
(ST-MoE / PaLM, also DeepSeek) Penalize the log-partition of router logits to keep them bounded:
L_z = (1/B) · Σ_i ( log Σ_j exp(x_{ij}) )²
Where x_ij is the router logit for expert j on token i. Without it, router logits drift in bf16/fp8 and softmax becomes numerically unstable.
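Both losses in a few lines, using top-1 routing for f_i as in Switch (router_logits is a hypothetical (tokens × N) tensor):

```python
import torch
import torch.nn.functional as F

def switch_aux_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """router_logits: (tokens, N). L_aux = α · N · Σ_i f_i · P_i."""
    N = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    f = F.one_hot(probs.argmax(dim=-1), N).float().mean(dim=0)  # fraction routed per expert
    P = probs.mean(dim=0)                                       # mean router prob per expert
    return alpha * N * torch.sum(f * P)

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """L_z = mean over tokens of (log Σ_j exp(x_ij))²."""
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

logits = torch.randn(1024, 8)            # 1024 tokens, 8 experts
print(switch_aux_loss(logits), router_z_loss(logits))
```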
3. Auxiliary-loss-free balancing (DeepSeek V3)
Drop the auxiliary loss entirely. Maintain a per-expert bias b_i that's added to the router logits at routing time only (not used for the gate weights). When an expert is over-subscribed, decrement its bias; under-subscribed, increment. The bias is a hyperparam-light PI controller on the routing distribution, with no gradient interference from an auxiliary loss term.
Note that f_i comes from an argmax and isn't itself differentiable — that's why the loss multiplies it by P_i, which is differentiable and carries the gradient.
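And a toy version of the V3-style bias controller — the update rate γ is an assumed knob; note the bias enters routing only, never the gate weights:

```python
import torch

def update_routing_bias(bias, router_logits, k, gamma=1e-3):
    """bias: (N,). Nudge each expert's routing bias toward balanced top-k load."""
    N = router_logits.shape[-1]
    _, idx = (router_logits + bias).topk(k, dim=-1)       # routing uses logits + bias
    load = torch.bincount(idx.flatten(), minlength=N).float()
    load /= load.sum()                                    # fraction of routed slots per expert
    target = 1.0 / N
    return bias - gamma * torch.sign(load - target)       # over-loaded ↓, under-loaded ↑

bias = torch.zeros(8)
for _ in range(100):                                      # one update per batch
    bias = update_routing_bias(bias, torch.randn(1024, 8), k=2)
```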
When MoE pays off
MoE buys parameter capacity for free FLOPs at training time, but the full model must fit in memory at inference (all experts must be loaded; only k run per token). On a single GPU MoE is rarely worth it. Across many GPUs with expert parallelism (each GPU holds different experts), MoE wins because compute scales with active params not total params.
- Router → top-k → gate-weighted sum. Mixtral top-2; DeepSeek top-8 with 1 shared expert.
- Switch aux loss: L_aux = α·N·Σ f_i·P_i; N normalizes uniform routing to 1.
- Router Z-loss penalizes the log-partition for fp8/bf16 stability.
- DeepSeek V3 replaces aux loss with a per-expert bias updated as a PI controller — no gradient interference.
Pre-norm puts the LayerNorm inside the residual branch — gradients flow through the residual unchanged, which is why pre-norm trains deep models stably. Post-norm gives slightly better quality if you can train it. RMSNorm replaced LayerNorm in most modern stacks; QK-Norm is the new stability trick.
Pre-norm vs post-norm
Pre-norm (modern default)
x_{l+1} = x_l + Sublayer(LN(x_l))
- Identity gradient path through residual.
- Stable for L > 100; requires no warmup tricks.
- Used by GPT-2, Llama, Mistral, DeepSeek, PaLM.
Post-norm (original)
x_{l+1} = LN(x_l + Sublayer(x_l))
- Slightly better quality if it trains.
- Gradient passes through LN's reciprocal-std → harder to train deep.
- Used by original Transformer, BERT.
RMSNorm — LayerNorm minus the mean
RMSNorm (Zhang & Sennrich 2019) drops the mean-subtraction step of LayerNorm:
RMSNorm(x) = x / sqrt(mean(x²) + ε) · γ
Faster (no mean), no bias parameter, slightly better empirically. Used in Llama, Mistral, DeepSeek. LayerNorm is now legacy in LLMs.
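RMSNorm is a few lines of PyTorch (eps and the learned gain γ as in the formula):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # learned gain, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma                # no mean subtraction, unlike LayerNorm
```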
Sandwich norm and QK-Norm
- Sandwich norm / double norm (Gemma 2, OLMo): LN both before and after a sub-layer. Adds stability at scale.
- QK-Norm (OLMo, some Gemma variants): apply LN to Q and K before the dot product. Bounds attention score magnitudes, prevents softmax saturation, key for training very large/wide models.
Why residuals are non-negotiable
Without residuals, the gradient through L sub-layers is a product of L Jacobians — vanishes or explodes. With residuals, ∂(x + f(x))/∂x = I + ∂f/∂x, preserving an identity component. This is the architectural enabler that lets transformers go to L=80, 100, or beyond.
The original Transformer (post-norm, 6 encoder + 6 decoder layers) needed careful warmup or it diverged. GPT-2 swapped to pre-norm and trained 48 layers stably without exotic tricks. Every major LLM since has been pre-norm; the cost (small quality hit) is dominated by the gain (you can scale depth).
- Pre-norm = stable scaling; post-norm = slightly better quality if it trains. Use pre-norm.
- RMSNorm has replaced LayerNorm in modern stacks (no mean, no bias, faster).
- QK-Norm bounds attention scores — increasingly common at scale.
- Residuals give the identity gradient path that makes L=80 trainable.
Tokenization is the un-glamorous step that quietly governs cost, multilingual fairness, and arithmetic. The 2026 default is byte-level BPE (OpenAI tiktoken) or SentencePiece-BPE (Llama/Mistral) with byte fallback for robustness. Larger vocabs (200k+) reduce tokens-per-text but bloat the embedding table.
The four families
| Algorithm | Idea | Used by |
|---|---|---|
| BPE | Start from chars; iteratively merge most frequent pair. | GPT-2/3, Llama (via SentencePiece-BPE) |
| Byte-level BPE | BPE on UTF-8 bytes; vocab includes 256 byte tokens; handles any input. | GPT-2, tiktoken |
| SentencePiece (BPE or Unigram) | Operates on raw text — no whitespace pretokenization. | Llama, Mistral, T5 |
| Unigram LM | Start with large vocab; iteratively prune tokens that least decrease likelihood. Probabilistic segmentation. | SentencePiece-Unigram, mBART |
Byte fallback — the robustness trick
When the tokenizer encounters a character it can't segment (rare emoji, unusual script), byte fallback emits the raw UTF-8 bytes as tokens instead of an <unk>. Critical for production: the model can always generate something, never crashes on input.
Vocab size — the tradeoff
Larger vocab → fewer tokens per text → faster inference per character; but larger embedding table and more compute per softmax. tiktoken cl100k_base (GPT-3.5/4) uses ~100k vocab; o200k (GPT-4o) uses 200k. Llama 3 jumped to 128k from Llama 2's 32k for the same reason.
- Numbers: GPT-2 splits "1234" oddly. Llama 2 forces single-digit tokenization to help arithmetic; Llama 3's tiktoken-style vocab groups digits in chunks of up to three.
- Spaces: " hello" ≠ "hello" — leading space is part of the token. Many prompt bugs trace to this (see the snippet below).
- Non-English asymmetry: a sentence in Thai may use 4× more tokens than the same in English → 4× the API cost. Larger multilingual vocabs and byte fallback partly fix this.
- Glitch tokens: rare-vocab tokens that the model never trained on can produce hallucinations or repetition (SolidGoldMagikarp).
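Two of these traps are easy to see directly with the tiktoken package (exact token ids vary by encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("hello"))    # one token id
print(enc.encode(" hello"))   # a different token — the leading space is part of it
print(enc.encode("1234"))     # digits split into chunks, not one token per number
```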
BPE vs Unigram — when each
BPE is deterministic — given a tokenizer, every text has one segmentation. Unigram supports subword regularization: at training time, sample different segmentations of the same text per epoch as data augmentation. Used in some multilingual setups; BPE is more common.
- Byte-level BPE (tiktoken) and SentencePiece-BPE (Llama) dominate; both with byte fallback.
- Vocab size is a cost / table-size tradeoff; 100k–200k is the modern range.
- Unigram LM enables probabilistic segmentation (subword regularization); BPE is deterministic.
- Spaces are part of tokens; non-English costs more tokens — both real-world traps.
Tensor parallelism shards weights but leaves the full sequence on every GPU — at long context the activations overflow VRAM. Sequence parallelism shards the sequence in LN/dropout regions; context parallelism shards it inside attention via ring exchange. Together they make >1M context training feasible.
The problem TP alone doesn't solve
With tensor parallelism (TP), Q/K/V projections are sharded across GPUs, but LayerNorm, residuals and dropout all see the full hidden dim and the full sequence on each GPU. At long context the activations from these regions blow VRAM. The shared region between sub-layers is what TP can't help.
Sequence parallelism
Sequence parallelism (Korthikanti 2022, arxiv 2205.05198) shards the sequence dimension during the LN/dropout regions where TP can't help:
- TP region (attention/FFN matmul): full sequence, sharded hidden.
- SP region (LN, dropout, residual): sharded sequence, full hidden.
- Boundaries: all-gather (going into TP region) and reduce-scatter (going out).
Net effect: dramatic activation memory reduction at the cost of two extra collective ops per layer. Combined with TP it gives much higher effective context per GPU.
Context parallelism — ring attention
Context parallelism shards the sequence across GPUs inside attention itself. Each GPU holds a chunk of the sequence; K and V chunks rotate around a ring (Ring Attention, Liu 2023, arxiv 2310.01889) so every Q chunk eventually sees every K,V chunk. With FlashAttention-style online softmax, you accumulate the partial outputs without materializing the full score matrix anywhere. This is what enables >1M context training and inference.
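The numerical core is the same online softmax FlashAttention uses, applied across K/V chunks that arrive one at a time — a single-process sketch (no causal mask, no ring communication):

```python
import numpy as np

def attention_over_chunks(Q, kv_chunks):
    """Q: (Lq, d). kv_chunks: iterable of (K_chunk, V_chunk) pairs."""
    d = Q.shape[-1]
    m = np.full((Q.shape[0], 1), -np.inf)      # running row max
    denom = np.zeros((Q.shape[0], 1))          # running softmax denominator
    acc = None                                 # running unnormalized output
    for K, V in kv_chunks:
        s = Q @ K.T / np.sqrt(d)               # scores against this chunk only
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)              # rescale previous accumulators
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum(axis=-1, keepdims=True)
        acc = p @ V if acc is None else acc * scale + p @ V
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 64)) for _ in range(3))
chunked = attention_over_chunks(Q, zip(np.split(K, 4), np.split(V, 4)))
full = np.exp(Q @ K.T / 8 - (Q @ K.T / 8).max(-1, keepdims=True))
full = full / full.sum(-1, keepdims=True) @ V
print(np.allclose(chunked, full))   # True — same result, never the full L×L matrix
```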
32k context: TP alone is fine. 128k context: TP + SP — SP cuts the activation memory of the LN regions enough to fit. 1M+ context: TP + SP + context parallelism (ring attention) — the only way the full attention matrix doesn't have to live on any single GPU.
- TP shards weights; SP shards activations in LN/dropout/residual regions.
- SP boundaries: all-gather in, reduce-scatter out.
- Context parallelism (ring attention) shards the sequence inside attention itself — enables >1M context.
Common interview questions
- "Why divide attention scores by √d_k?" → Without scaling, dot product variance grows with d_k → softmax saturates → vanishing gradients.
- "Walk through MHA → MQA → GQA → MLA." → Reduces KV cache size step-by-step. MLA goes furthest by storing only a low-rank latent and reconstructing K,V on the fly.
- "Why RoPE over learned positional?" → Encodes relative position naturally via rotation in 2D subspaces; extrapolates better with PI/YaRN tricks.
- "Pre-norm vs post-norm — pick one and defend." → Pre-norm: gradients flow through residual cleanly, stable for depth, easier to train at scale. Post-norm: slightly better quality if stable, used in original Transformer, harder to train deep.
- "Walk through one MoE forward pass." → Token x → router → softmax over E experts → top-k selection (and gating weights) → forward through k experts → weighted sum → output. Plus aux load-balance loss during training.
- "Why does DeepSeek V3 use shared + routed experts?" → Shared experts capture common knowledge across all tokens; routed experts specialize on niches. Reduces redundant routed-expert capacity.
- "How does QK-Norm help?" → LN on Q and K before dot product bounds attention score magnitudes; prevents softmax saturation at large scales; improves training stability for very deep / wide models.
- "What changes if you switch from BPE to Unigram LM?" → Unigram supports probabilistic segmentation (sample different segmentations per training pass) → subword regularization. BPE is deterministic.
0 → hero reading path for transformer internals
- foundation Karpathy — Let's build GPT from scratch (2 hours; the canonical lecture)
- foundation The Annotated Transformer (Harvard NLP; line-by-line walkthrough of the original paper)
- foundation The Illustrated Transformer (Jay Alammar)
- build nanoGPT — read it, train it, modify it
- build Lilian Weng — The Transformer Family v2
- depth Attention Is All You Need (Vaswani 2017)
- depth RoPE (Su 2021)
- depth GQA (Ainslie 2023)
- depth DeepSeek-V2 / MLA (Liu 2024)
- depth DeepSeek-MoE
- depth SwiGLU / GLU variants (Shazeer 2020)
- depth YaRN (Peng 2023) for long context extension
- depth Ring Attention (Liu 2023)
Transformer quiz — readiness check
- Walk through one self-attention forward pass with shapes.
x: (B, T, d). Q = xW_Q, K = xW_K, V = xW_V → each (B, T, d_k). scores = Q K^T / √d_k → (B, T, T). Apply causal mask (set upper triangle to -∞). attn = softmax(scores, dim=-1). out = attn V → (B, T, d_v).
- Memory complexity of attention?
O(T²) for the attention matrix. FLOPs O(T² d). FlashAttention reduces memory to O(T) via tiling + online softmax — never materializes the full attn matrix in HBM.
- Difference between MHA, MQA, GQA, MLA in KV cache size.
Per token per layer: MHA: 2·n_heads·d_head. MQA: 2·d_head (8× smaller for 8-head MHA). GQA(g): 2·g·d_head (intermediate). MLA: d_latent + d_R (much smaller — DeepSeek-V2 latent ~512 vs MHA's ~8192).
- Why does MLA need decoupled RoPE?
RoPE is non-linear in position — you cannot absorb a position-rotated K reconstruction into Q's projection. So MLA splits Q,K into content (low-rank, no RoPE, absorbable) and rotary (small dim, RoPE-applied, MQA-style shared) parts. Final score = Q^C K^C + Q^R K^R.
- Sinusoidal vs RoPE vs ALiBi — when each?
Sinusoidal: original; can extrapolate weakly. Learned absolute: GPT-2/3; bad extrapolation. ALiBi: linear attention bias; good extrapolation; no positional embedding. RoPE: rotation in 2D subspaces; relative position via dot product; current default. RoPE + YaRN: best for long-context extension.
- Why scale FFN hidden dim 4d?
Empirically optimal width-to-depth ratio at fixed parameters. 4d gives the FFN ~2/3 of the block's parameters; the remaining 1/3 are attention. SwiGLU uses ~2.67d × 3 matrices to keep param count constant.
- Switch Transformer aux loss formula?
L_aux = α · N · Σ f_i · P_i where N = num experts, f_i = fraction of tokens routed to expert i, P_i = mean router prob for expert i, α ≈ 0.01. The N multiplier normalizes so uniform routing → loss = 1.
- What is auxiliary-loss-free balancing (DeepSeek V3)?
Drop the aux loss; instead maintain per-expert bias added to router logits at routing time only. Decrement bias for over-subscribed experts, increment for under-subscribed. Hyperparam-free PI controller; avoids the gradient interference aux losses introduce.
- What does router Z-loss do?
L_z = (1/B) Σ (log Σ exp(x_ij))² where x_ij are router logits. Penalizes the partition function — keeps router logits bounded so softmax stays stable in bf16/fp8. Used by ST-MoE / PaLM / DeepSeek.
- Why softmax not sigmoid in attention?
Each row should sum to 1 — attention is a distribution over keys. Sigmoid would let multiple keys be "fully" attended. Some recent work explores sigmoid attention for length generalization but softmax is the default.
- Why do tokenizers matter for non-English?
BPE merges trained on English-heavy corpora produce many tokens per word for non-English languages → cost asymmetry (more tokens to encode same content) and worse inductive bias for those languages. Fix: multilingual training, byte fallback, larger vocab.
- What's the difference between byte-level BPE and SentencePiece?
Byte-level BPE (GPT-2): operates on UTF-8 bytes; vocab includes all 256 bytes; handles any input. SentencePiece (Kudo): operates on raw text (no whitespace pretokenization); supports BPE or Unigram; standard for Llama / Mistral.
- How does Mixtral 8x7B work?
Each FFN block has 8 expert FFNs; router selects top-2 per token; output is gate-weighted sum of 2 experts. Total params 47B; active per token ~13B (because only 2/8 experts run + the shared attention). Inference is faster than 47B dense; quality close to it.
- Why does the residual connection matter for deep transformers?
Without residuals, gradient through L sublayers is a product of L Jacobians → vanishes or explodes. Residual ∂(x + f(x))/∂x = I + ∂f/∂x preserves a clean identity gradient path. Lets you train 100+ layer networks.
- What's QK-Norm and when is it used?
LayerNorm applied to Q and K before the dot product. Bounds attention score magnitudes → prevents softmax saturation → improves training stability at very large scale. Used in OLMo and some recent Gemma variants.
- Why is GPT-style decoder-only the dominant architecture in 2026?
(1) Generative tasks need autoregressive decoding. (2) Encoder-decoder requires task split (BERT-style filling vs GPT-style generation); decoder-only with causal attention does both. (3) Scaling has worked exceptionally well; switching costs aren't worth it. (4) Inference can use KV cache straightforwardly.
- What does sequence parallelism shard, and what's the comm pattern?
Shards the sequence dim during LN/dropout/residual regions (where TP can't help). All-gather + reduce-scatter on TP/SP boundaries. Cuts activation memory significantly. Combined with TP, gives much higher effective context per GPU.
- What is "expert choice" routing vs token-choice?
Token-choice (standard): each token picks top-k experts. Can produce imbalance. Expert-choice (Zhou 2022): each expert picks top-k tokens (capacity per expert is fixed). Guarantees load balance — no aux loss needed. Tradeoff: tokens not picked are dropped or routed to a default.
- Why pre-norm enables deeper networks?
Pre-norm: x + Sublayer(LN(x)) — residual is unnormalized, so gradients flow through it cleanly via the identity path. Post-norm: LN(x + Sublayer(x)) — gradient passes through the LN's reciprocal-std term, which can shrink or amplify; harder to train deep without careful warmup.
- How does YaRN extend RoPE to long context?
Frequency-categorized RoPE rescaling: high-freq RoPE dims (position-sensitive, fast oscillation) get less interpolation; low-freq dims (semantic, slow oscillation) get full interpolation. Plus an attention temperature factor. Allows extending pretrained 4k models to 128k with brief continued training.