LLM inference
vLLM internals, FlashAttention, PagedAttention, speculative decoding, quantization, disaggregated serving. The OpenAI / vLLM / Together / Anthropic loops probe deeply here — know throughput vs TTFT tradeoffs cold and be ready to design a serving stack on the whiteboard.
What you'll learn
- The decode bottleneck — why inference is memory-bandwidth-bound
- KV cache — size formulas, real-world numbers
- PagedAttention & vLLM — the OS-VM analogy that fixed inference
- Continuous batching — the per-iteration scheduler
- Speculative decoding — exact, fast, mathematically clean
- Quantization for inference — INT4 / FP8 / SmoothQuant / GGUF
- FlashAttention — IO-aware attention v1 → v2 → v3
- Parallelism in inference — TP, PP, EP for serving
- Sampling — greedy through nucleus and beyond
- The latency vocabulary — TTFT, ITL, TPS, throughput
- Disaggregated serving — Splitwise, DistServe, Mooncake
- Prefix & KV-cache compression — beyond MLA
- Inference engines compared — vLLM, SGLang, TRT-LLM, TGI
Prefill is compute-bound (you process the whole prompt in one big matmul). Decode is memory-bandwidth-bound (every step reads all weights + the KV cache to emit one token). Batching helps decode by amortizing the weight read across requests. Almost every inference optimization follows from this one fact.
Two phases, two bottlenecks
An LLM serves a request in two distinct phases:
- Prefill — process the prompt (N tokens) in one forward pass. Compute scales as N². FLOP-bound for long prompts.
- Decode — generate one token at a time. Each step recomputes one query against the entire KV cache. Tiny compute per step but enormous memory reads.
Why decode is bandwidth-bound
Each decoded token requires reading all model weights (e.g., ~140 GB for a 70B model in BF16) plus the full KV cache for that request — every single step. Compute is one Q vector against the cache; the dominant cost is memory traffic. Batching helps because you read the weights once and reuse them across many requests' Q vectors.
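To make the bandwidth argument concrete, here is a back-of-envelope estimate in Python. The hardware and model numbers (H100 HBM bandwidth ≈ 3.35 TB/s, ~140 GB of BF16 weights, ~10 GB of KV per request) are illustrative round figures, not measurements.

```python
# Back-of-envelope: decode speed is bounded by how fast bytes can be streamed from HBM.
WEIGHT_BYTES = 140e9        # ~70B params in BF16 (illustrative)
KV_BYTES_PER_REQ = 10e9     # ~32k-token KV cache for one request (illustrative)
HBM_BW = 3.35e12            # H100 SXM HBM3 bandwidth in bytes/s (nominal)

def min_step_time(batch_size: int) -> float:
    """Lower bound on one decode step: weights read once, KV cache read per request."""
    bytes_moved = WEIGHT_BYTES + batch_size * KV_BYTES_PER_REQ
    return bytes_moved / HBM_BW

for b in (1, 8, 64):
    t = min_step_time(b)
    # Per-stream tokens/sec vs aggregate tokens/sec across the whole batch.
    print(f"batch={b:3d}  step >= {t*1e3:6.1f} ms  per-stream <= {1/t:6.1f} tok/s  "
          f"aggregate <= {b/t:7.1f} tok/s")
```

The output shows the core effect: the weight read is amortized, so aggregate throughput climbs steeply with batch size while per-stream speed degrades only gradually.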
- Prefill = compute-bound. Decode = memory-bandwidth-bound.
- Batching amortizes the weight read; it's the #1 throughput lever.
- Everything in this chapter follows from arithmetic intensity.
Per-token-per-layer KV memory is 2 · n_kv_heads · d_head · n_bytes. For Llama 3 70B at 128k context with GQA: ~40 GB per request. With MHA it would be ~320 GB per request — totally unservable, which is exactly why GQA / MLA exist. KV cache is the dominant memory cost during long-context inference and the dominant bandwidth cost during decode.
The formula — get it right under interview pressure
During autoregressive decode, K and V for previous tokens don't change — cache them. At step t, only compute Q, K, V for the new token; append K,V to the cache; attention is Q (1×d) attending to cache (t×d) → O(t·d) per step.
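A minimal single-head sketch of the caching mechanism in NumPy, in the spirit of the "KV cache from scratch" exercise in the reading path. Shapes and names (d_head, the append-only cache) are illustrative; a real implementation works per layer and per head and applies RoPE, masking, etc.

```python
import numpy as np

d_head = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_head, d_head)) * 0.02 for _ in range(3))

K_cache = np.zeros((0, d_head))   # grows by one row per decoded token
V_cache = np.zeros((0, d_head))

def decode_step(x):
    """x: hidden state of the newest token, shape (d_head,)."""
    global K_cache, V_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])          # append; old K/V are never recomputed
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d_head)     # one query against the whole cache: O(t·d)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V_cache                     # attention output for this token

out = decode_step(rng.standard_normal(d_head))
```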
Per token, per layer: 2 · n_kv_heads · d_head · n_bytes (the leading 2 is K + V).
Total: n_layers · seq_len · 2 · n_kv_heads · d_head · n_bytes.
The number you'll be asked to derive on the whiteboard
Llama 3 70B uses GQA with 8 KV heads, d_head = 128, 80 layers, BF16 (2 bytes).
- Per token per layer: 2 · 8 · 128 · 2 = 4,096 bytes.
- Per token (all 80 layers): 4,096 · 80 = 327,680 bytes ≈ 320 KB.
- At 128k tokens: 320 KB · 131,072 ≈ 40 GB per request.
- If built with MHA (64 KV heads): 8× larger → ~320 GB per request — totally unservable.
- MLA (DeepSeek-style, latent dim ~512): ~3 GB at the same context — >10× smaller than GQA
This is exactly why GQA / MQA / MLA exist. The KV cache, not the weights, is what limits concurrent context capacity in long-context serving.
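A small calculator for the formula above; the Llama 3 70B inputs reproduce the ~40 GB and ~320 GB figures. Treat it as a sketch: it ignores block rounding and per-framework overheads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    # 2 · n_kv_heads · d_head · n_bytes per token per layer (the leading 2 is K + V)
    return n_layers * seq_len * 2 * n_kv_heads * d_head * bytes_per_elem

CTX = 131_072  # 128k tokens
gqa = kv_cache_bytes(80, 8,  128, CTX)   # Llama 3 70B with GQA (8 KV heads)
mha = kv_cache_bytes(80, 64, 128, CTX)   # same model if it used full MHA (64 KV heads)
print(f"GQA: {gqa / 2**30:.0f} GiB   MHA: {mha / 2**30:.0f} GiB per request")
# -> GQA: 40 GiB, MHA: 320 GiB
```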
- MHA: 2 · n_heads · d_head per token per layer.
- MQA: 2 · d_head (single shared KV head).
- GQA(g): 2 · g · d_head (g KV heads, each shared by a group of query heads).
- MLA: 2 · d_latent (e.g., 512 vs 8192 for d_model = 8192). Quality matches MHA at ~7% cache size.
Quality ranking: MHA ≥ GQA ≥ MQA, but GQA matches MHA at much lower memory.
- Llama 3 70B GQA at 128k = ~40 GB per request. MHA equivalent = ~320 GB.
- Memorize the formula: 2 · n_kv_heads · d_head · n_bytes per token per layer.
- KV cache is the dominant memory cost at long context and the dominant bandwidth cost during decode.
Naive KV cache allocates one contiguous block per request, sized for max context. Result: massive internal fragmentation, low concurrency. PagedAttention (Kwon 2023) splits the KV cache into fixed-size blocks and gives each request a "block table" — exactly like OS virtual memory. Eliminates fragmentation, enables prefix sharing, supports beam search cheaply.
The OS-VM analogy
(Kwon 2023, arxiv 2309.06180.) Inspired by OS virtual memory:
- KV cache is split into fixed-size blocks (e.g., 16 tokens).
- Each request has a "block table" mapping logical → physical blocks.
- New tokens allocate new blocks on demand.
- Different requests can share blocks (e.g., common prefix).
Eliminates fragmentation, enables prefix sharing, supports complex sampling (beam search, parallel sampling) cheaply.
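A toy block-table allocator to make the virtual-memory analogy concrete. The names (BLOCK_SIZE, free list, per-request table) are illustrative, not vLLM's actual data structures.

```python
BLOCK_SIZE = 16                        # tokens per KV block (vLLM default)
NUM_BLOCKS = 1024                      # physical blocks that fit in GPU memory

free_blocks = list(range(NUM_BLOCKS))  # free list of physical block ids
block_tables = {}                      # request id -> list of physical block ids
lengths = {}                           # request id -> tokens written so far

def append_token(req_id: str):
    """Reserve KV space for one new token, allocating a block only on demand."""
    table = block_tables.setdefault(req_id, [])
    n = lengths.get(req_id, 0)
    if n % BLOCK_SIZE == 0:            # current block is full (or this is token 0)
        table.append(free_blocks.pop())
    lengths[req_id] = n + 1

def free_request(req_id: str):
    """Return all physical blocks to the pool when a request finishes."""
    free_blocks.extend(block_tables.pop(req_id, []))
    lengths.pop(req_id, None)

for _ in range(40):                    # 40 tokens -> ceil(40 / 16) = 3 blocks
    append_token("req-A")
print(block_tables["req-A"])           # logical -> physical mapping for this request
```

Prefix sharing falls out of the same indirection: two requests whose tables point at the same physical blocks share that prefix's KV without copying.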
- Smaller (16, vLLM default) — less internal fragmentation, more granular sharing across requests, more block-table overhead.
- Larger (128) — less metadata, fewer GPU memory operations per attention, more wasted space at sequence ends.
16 is the vLLM default; experiment for your specific workload.
- PagedAttention = OS virtual memory for KV cache.
- Block table indirection = prefix sharing + low fragmentation + beam search for free.
- This is the single most impactful inference paper of 2023.
Static batching collects N requests, runs all to completion, ships the batch. Short requests sit and wait for long ones — terrible utilization. Continuous batching (Orca, Yu 2022) reschedules per iteration: finished requests leave, new ones join. vLLM, TGI, TensorRT-LLM all do this. It's the second-most-important scheduling idea after PagedAttention.
Static vs continuous
Static batching
- Collect N requests, run all to completion.
- Short requests wait for long ones.
- Throughput limited by tail of batch.
- Good for offline / batch jobs only.
Continuous batching (Orca)
- Scheduler operates per-iteration.
- After each forward pass, finished requests leave; new ones join.
- Maximizes GPU utilization.
- Standard in vLLM, TGI, TRT-LLM.
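A skeletal per-iteration scheduler loop, under simplifying assumptions (no memory accounting, no chunked prefill, prefill treated as instantaneous). The only point it demonstrates is that admission and eviction happen at every step, not once per batch.

```python
from collections import deque
import random

waiting = deque(f"req-{i}" for i in range(8))   # queued requests
running = {}                                    # req_id -> tokens left to decode
MAX_BATCH = 4

def forward_pass(batch):
    """Stand-in for one model step: each running request emits one token."""
    for req in batch:
        running[req] -= 1

step = 0
while waiting or running:
    # Admit new requests at every iteration, not once per batch.
    while waiting and len(running) < MAX_BATCH:
        req = waiting.popleft()
        running[req] = random.randint(1, 6)     # pretend prefill is done; tokens to emit
    forward_pass(list(running))
    # Finished requests leave immediately, freeing their slot for the queue.
    for req in [r for r, left in running.items() if left == 0]:
        del running[req]
    step += 1

print(f"served 8 requests in {step} iterations")
```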
- Continuous = per-iteration scheduling; finished out, new in.
- Static batching is fine for offline; never for serving.
- Long prefills still cause head-of-line blocking — chunk them.
A small, fast draft model proposes K tokens. The big target model verifies all K in parallel with one forward pass. Accept tokens up to the first rejection; sample one fresh token from a residual distribution at the rejection point. The math is constructed so the marginal output distribution exactly equals p_target. ~2-3× speedup when draft acceptance is high and draft cost is low.
The algorithm
(Leviathan 2023, arxiv 2211.17192.) A small draft model generates K tokens; the target model verifies all K in parallel with one forward pass; accept tokens up to first rejection, sample one more from corrected distribution. Always exact (samples from target distribution).
Why the output distribution is exactly p_target
For each drafted token t:
- Accept with probability min(1, p_target(t) / p_draft(t)).
- If rejected, sample one fresh token from the residual distribution p_residual(t') = max(0, p_target(t') − p_draft(t')) / Z (truncated to non-negative, normalized), and stop processing further drafted tokens.
- If all K drafts accepted, sample one bonus token from p_target.
Why exact? Condition on accepted vs rejected and sum the two paths. The marginal distribution of each emitted token is:
p_emit(t) = p_draft(t) · min(1, p_target(t)/p_draft(t)) [accepted path]
+ (Σ_t' max(0, p_target(t') − p_draft(t'))) · p_residual(t) [rejected path]
The first term equals min(p_draft(t), p_target(t)). The total rejection probability is 1 − Σ_t' min(p_draft(t'), p_target(t')) = Σ_t' max(0, p_target(t') − p_draft(t')) = Z, so the second term equals max(0, p_target(t) − p_draft(t)). Summing: min(p_draft(t), p_target(t)) + max(0, p_target(t) − p_draft(t)) = p_target(t).
Frontier-lab probe: "Prove speculative decoding samples from p_target." Be ready to write this on a whiteboard.
Speedup ~2–3× depending on draft acceptance rate. Best when draft is fast (~5% of target cost) and aligned (high acceptance).
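A NumPy sketch of the accept/reject rule over a toy vocabulary, the same exercise the reading path suggests drilling. The distributions are made up; the structure (accept probability, residual sampling) follows the algorithm above.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size

def norm(x):
    return x / x.sum()

p_draft  = norm(rng.random(V))   # q: draft model's distribution at this position
p_target = norm(rng.random(V))   # p: target model's distribution at this position

# Draft proposes a token, target verifies it.
t = rng.choice(V, p=p_draft)
accept_prob = min(1.0, p_target[t] / p_draft[t])

if rng.random() < accept_prob:
    emitted = t                                    # accepted: keep the drafted token
else:
    residual = np.maximum(p_target - p_draft, 0)   # (p - q)_+, then renormalize
    emitted = rng.choice(V, p=norm(residual))      # rejected: resample from the residual

# Marginal over both paths: min(q, p) + Z * residual = p_target, as derived above.
```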
EAGLE / EAGLE-2/3
(Li 2024, arxiv 2401.15077.) Uses target model's hidden states (not just tokens) as input to a small auto-regressive head that drafts. Higher acceptance because the draft sees richer context.
Medusa
(Cai 2024.) Multiple parallel decoding heads on the target model; each head predicts token at position +1, +2, +3, etc. Tree-based verification. No separate draft model. Simpler deployment.
Lookahead decoding
(Fu 2024.) Generates n-grams via Jacobi iteration on the target model itself; verifies through a unified attention pattern. Draft-free.
- Speculative decoding is exact — output distribution = p_target by construction.
- Accept prob = min(1, p_t/p_d); on reject, sample from (p_t − p_d)_+ / Z.
- EAGLE leads on acceptance rate; Medusa is simpler; Lookahead is draft-free.
The 2026 production sweet spot: W4A16 (INT4 weights via GPTQ/AWQ, BF16 activations) for cost-sensitive serving; FP8 W8A8 on H100/Blackwell for max throughput. SmoothQuant migrates outliers from activations to weights so W8A8 actually works. NF4 is for QLoRA. GGUF is llama.cpp's format for CPU/edge.
INT8 / FP8 — the W8A8 family
Per-channel weight quantization; activations either quantized too (W8A8) or kept in FP16 (W8A16). LLM.int8() (Dettmers) handles outliers via mixed precision. FP8 is native in H100 — generally W8A8 with per-tensor or per-channel scales.
SmoothQuant (Xiao 2023, arxiv 2211.10438): activations have outlier channels that ruin quantization; migrate the difficulty from activations to weights via a diagonal scale s:
(X · diag(s)⁻¹) · (diag(s) · W)
Now activations have smaller outliers and weights absorb the scale. This is the standard recipe for production W8A8 to actually work.
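A NumPy illustration of the identity behind SmoothQuant: scaling activations down per channel and weights up by the same factor leaves the matmul output unchanged while taming activation outliers. The α = 0.5 migration strength follows the paper's recipe; everything else here is a toy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
X[:, 3] *= 50.0                        # one outlier activation channel
W = rng.standard_normal((8, 16))

alpha = 0.5                            # migration strength
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                       # X · diag(s)^-1 : activation outliers shrink
W_smooth = W * s[:, None]              # diag(s) · W    : weights absorb the scale

assert np.allclose(X @ W, X_smooth @ W_smooth)        # mathematically identical output
print(np.abs(X).max(), "->", np.abs(X_smooth).max())  # activation range is far tamer
```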
W4A16 — most common production deployment
INT4 weights (GPTQ/AWQ), bf16 activations. Smaller than W8A8 in storage but equal compute (matmul still runs in bf16). Pure W4A4 is much harder due to activation outliers.
- GPTQ (Frantar 2022, arxiv 2210.17323) — post-training, layer-by-layer quantization. Uses second-order info (approximate Hessian via calibration set) to minimize MSE on activations. Per-group quantization (group size 128).
- AWQ (Lin 2023, arxiv 2306.00978) — activation-aware: protect "salient" weight channels (those seeing the largest activation magnitudes) by scaling them up before quantization, so the relative quantization error on the channels that matter most is reduced.
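The sketch below shows only the per-group INT4 storage format both methods share (symmetric, group size 128), not either paper's error-minimizing algorithm; function names are illustrative.

```python
import numpy as np

def quantize_w4_groups(w: np.ndarray, group_size: int = 128):
    """Symmetric INT4 quantization of a weight vector, one scale per group of 128."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0    # int4 range is [-8, 7]
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_w4_groups(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs error: {err:.4f}")   # small for roughly Gaussian weights
```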
NF4 (QLoRA)
(Dettmers, arxiv 2305.14314.) "NormalFloat" — quantization levels chosen to match a normal distribution's quantiles. Good for weights (which are roughly Gaussian).
Microscaling (MX)
Block-wise scales — better outlier handling. MXFP8 / MXFP6 / MXFP4. Hardware support in Blackwell.
GGUF
llama.cpp's K-quants (Q4_K_M, Q5_K_S, etc.) — block-wise non-uniform schemes good for CPU/edge inference. Designed for the GGUF container format.
Typical deployments put weights in INT4 (GPTQ/AWQ) or FP4 (Blackwell) and activations in FP8 or BF16. Tensor cores actually accelerate INT8 / FP8 matmuls, so W8A8 gets both storage and compute wins on H100/Blackwell; W4A16 wins on memory bandwidth (smaller weight read) but still runs its matmuls on BF16 tensor cores.
- W4A16 = most common production. FP8 W8A8 = max throughput on H100+.
- SmoothQuant is the trick that makes W8A8 actually work in production.
- GPTQ uses a Hessian; AWQ protects salient channels.
Standard softmax-attention materializes an L×L matrix in HBM — IO-bound at long sequence length. FlashAttention tiles the computation and uses online softmax to keep partial state in SRAM. v1 (Dao 2022) showed ~3× speedup. v2 improved work partitioning. v3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao 2024) exploits H100 features (TMA, warp specialization, FP8) to hit ~75% of H100 peak.
FA1 — the IO-aware idea
(Dao 2022, arxiv 2205.14135.) Computes attention in tiles. Standard softmax requires the full row to normalize. Online softmax: maintain running max and sum, rescale as new tiles arrive. Avoids materializing the L×L attention matrix in HBM. IO-aware; ~3× faster for typical seq lengths, much more for long.
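A NumPy sketch of the online-softmax accumulation FlashAttention relies on: process K/V in tiles, keep a running max and running sum, and rescale the partial output as each new tile arrives. No L×L matrix is ever formed. Tile size and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, TILE = 1024, 64, 128
q = rng.standard_normal(d)
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

m_run = -np.inf        # running max of scores seen so far
denom = 0.0            # running sum of exp(score - m_run)
acc = np.zeros(d)      # running (unnormalized) weighted sum of V rows

for start in range(0, L, TILE):
    Kt, Vt = K[start:start + TILE], V[start:start + TILE]
    s = Kt @ q / np.sqrt(d)              # scores for this tile only
    m_new = max(m_run, s.max())
    scale = np.exp(m_run - m_new)        # rescale previously accumulated state
    p = np.exp(s - m_new)
    denom = denom * scale + p.sum()
    acc = acc * scale + p @ Vt
    m_run = m_new

# Reference: standard softmax attention with the full score vector materialized.
scores = K @ q / np.sqrt(d)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
assert np.allclose(acc / denom, probs @ V)   # identical result, tile by tile
```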
FA2 — better partitioning
(Dao 2023, arxiv 2307.08691.) Improved work partitioning, reduced non-matmul FLOPs, parallelizes across sequence dim. Better GPU occupancy.
FA3 — H100-specific
FA3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao 2024, arxiv 2407.08608). Note: FA3 lead author is Jay Shah, not Tri Dao (who is on the paper). Frontier-lab interviewers actually probe the attribution.
- TMA (Tensor Memory Accelerator) — async memory copy from HBM → SMEM, freeing the warp scheduler.
- Warp specialization — producer/consumer warps overlap data movement and compute.
- FP8 with two-stage scaling — preserves precision through the matmul.
Result: up to ~75% of H100 peak (~740 TFLOPS BF16).
- FA = tile + online softmax → never materialize the L×L matrix.
- FA3 lead author = Jay Shah; Tri Dao is on the paper. Get the attribution right.
- FA3 features: TMA, warp specialization, FP8 two-stage scaling.
Tensor Parallel shards weight matrices across GPUs (within NVLink, two all-reduces per layer, TP ≤ 8). Pipeline Parallel shards layers (lower BW, pipeline bubbles). Expert Parallel places different MoE experts on different GPUs (all-to-all dispatch). Serving combines them — DeepSeek V3 ran EP across 64-256 GPUs with custom DualPipe overlap kernels.
Tensor Parallel (TP)
Shard each weight matrix across GPUs. For attention: shard heads. For FFN: shard hidden dim. Two all-reduces per layer (after attention output proj, after FFN W₂). Bandwidth-hungry → keep within NVLink domain (TP ≤ 8 typically).
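A NumPy simulation of the sharding pattern on one FFN block: W1 is column-sharded, W2 is row-sharded, and the final sum over shards plays the role of the all-reduce. Shapes are illustrative and there is no real communication here, just the math that makes the sharding correct.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, TP = 16, 64, 4
x = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

W1_shards = np.split(W1, TP, axis=1)   # column-parallel: each rank holds a slice of d_ff
W2_shards = np.split(W2, TP, axis=0)   # row-parallel: matching slice of the second matmul

partials = []
for W1_i, W2_i in zip(W1_shards, W2_shards):
    h_i = np.maximum(x @ W1_i, 0)      # local activation slice (elementwise ReLU is safe)
    partials.append(h_i @ W2_i)        # partial output of shape (d_model,)

y_tp = np.sum(partials, axis=0)        # the all-reduce: sum partial outputs across ranks
y_ref = np.maximum(x @ W1, 0) @ W2
assert np.allclose(y_tp, y_ref)
```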
Pipeline Parallel (PP)
Shard layers across GPUs. Lower bandwidth; introduces pipeline bubbles. Mostly used for very large models when TP within node isn't enough.
Expert Parallel (EP)
For MoE, place different experts on different GPUs. Token dispatch via all-to-all. Combines with TP and DP. DeepSeek V3 used EP across 64-256 GPUs with custom comm/compute overlap kernels (DualPipe).
- TP within NVLink (≤ 8). PP across nodes if needed. EP for MoE.
- Two all-reduces per layer in TP — bandwidth is the bottleneck.
- DualPipe (DeepSeek) overlaps comm and compute across pipeline stages.
Greedy is deterministic and prone to repetition. Top-k clips to a fixed count; top-p (nucleus) clips to a cumulative probability mass; min-p adapts to distribution flatness. Beam search is great for translation, terrible for creative gen. Repetition / frequency / presence penalties are hacky but ubiquitous in production.
The full menu
- Greedy — argmax. Deterministic. Boring; prone to repetition.
- Temperature — divide logits by T. T<1 sharper, T>1 flatter. T=0 = greedy.
- Top-k — keep top k logits, renormalize, sample.
- Top-p (nucleus) (Holtzman 2019) — keep smallest set whose cumulative prob ≥ p, renormalize, sample. p=0.9 or 0.95 typical.
- Min-p — keep tokens with prob ≥ p · max_prob. Adaptive — wider when distribution flat.
- Beam search — maintain k beams; expand all; keep top k by total log-prob. Good for translation, bad for creative gen (mode-collapsed).
- Contrastive search (Su 2022) — pick the candidate maximizing (1−α)·model confidence − α·max similarity to previous hidden states. Reduces repetition.
- Repetition penalty / frequency / presence penalties (OpenAI-style) — hacky but effective.
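The three truncation rules side by side, as NumPy filters over a probability vector. Thresholds are the typical values named above; exact tie-breaking details vary by implementation.

```python
import numpy as np

def top_k(probs, k=50):
    keep = np.argsort(probs)[-k:]                        # k most probable tokens
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

def top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1                # smallest set with mass >= p
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

def min_p(probs, p=0.05):
    mask = np.where(probs >= p * probs.max(), probs, 0)  # threshold adapts to the peak
    return mask / mask.sum()

probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
for f in (top_k, top_p, min_p):
    print(f.__name__, np.round(f(probs), 3))
```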
- Top-p with p ≈ 0.9-0.95 + temperature ≈ 0.7-1.0 = the safe production default.
- Min-p is an under-rated upgrade: adapts to distribution flatness.
- Beam search for translation; never for chat.
TTFT = how long until the first token streams (prefill-bound, FLOP-heavy). ITL = per-token decode time (memory-bandwidth-bound). TPS per stream = 1/ITL. Aggregate throughput = sum across streams (higher batch = higher aggregate but worse per-stream ITL). SLA-driven scheduling balances these.
The four numbers everyone confuses
| Metric | What it is | What it's bound by |
|---|---|---|
| TTFT (Time to First Token) | Latency from request → first token | Prefill (compute the prompt's KV in one pass). FLOP-bound for long prompts. |
| ITL (Inter-Token Latency) | Per-token decode time | Memory-bandwidth-bound (must read weights + KV every step). |
| TPS (Tokens Per Second) | Per-stream throughput = 1/ITL | Same as ITL. |
| Aggregate throughput | Tokens/sec across all concurrent streams | Higher batch → higher aggregate but worse per-stream ITL. |
Tradeoff: low TTFT + low ITL = expensive (small batches, lots of GPUs idle). High aggregate throughput = large batches, slower per-user. SLA-driven scheduling balances these.
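The four metrics as plain arithmetic, using made-up but realistic numbers; the batch loop at the end is a toy degradation model whose only purpose is to show the per-stream vs aggregate tradeoff.

```python
# Illustrative numbers, not measurements.
ttft_s = 0.8            # prefill latency for a long prompt (compute-bound)
itl_s = 0.025           # inter-token latency at batch size 1 (bandwidth-bound)

tps_per_stream = 1 / itl_s
print(f"TTFT {ttft_s*1e3:.0f} ms, ITL {itl_s*1e3:.0f} ms, per-stream TPS {tps_per_stream:.0f}")

for batch in (1, 16, 64):
    itl_b = itl_s * (1 + 0.02 * batch)          # toy model: ITL degrades as batch grows
    print(f"batch {batch:2d}: per-stream {1/itl_b:5.1f} tok/s, "
          f"aggregate {batch/itl_b:7.1f} tok/s")
```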
- TTFT = prefill (compute). ITL = decode (bandwidth). Don't conflate.
- Higher batch → better aggregate throughput, worse per-stream ITL.
- Autoscale on queue depth, provision for p95.
Prefill is compute-bound, decode is memory-bandwidth-bound. Co-locating them causes interference (decode latency spikes when prefill runs). Disaggregate them onto separate GPU pools and transfer KV via NVLink/RDMA. Splitwise (Microsoft 2023), DistServe (Berkeley 2024), Mooncake (Kimi 2024) are the canon. Sarathi-Serve's chunked prefill is the alternative for single-pool deployments.
Why disaggregate
Prefill is compute-bound, decode is memory-bandwidth-bound. Running them on the same GPU causes interference — decode latency spikes during prefill (head-of-line blocking).
Disaggregation: separate prefill and decode pools. Prefill GPUs do bulk compute, then transfer KV cache to decode GPUs over fast interconnect (NVLink, RDMA). Independent scaling.
The canon
- Splitwise (Patel 2024, arxiv 2311.18677) — Microsoft's original disaggregation paper.
- DistServe (Zhong 2024, arxiv 2401.09670) — Berkeley, similar idea, careful cluster sizing analysis.
- Mooncake (Qin 2024, arxiv 2407.00079) — Kimi's open-source disagg + KVCache-store design (separates KV cache as a tier between RAM and HBM).
DeepSeek's open-source serving stack uses this pattern.
Chunked prefill (Sarathi-Serve) — the single-pool alternative
Sarathi-Serve (Agrawal 2024, arxiv 2403.02310): instead of running long prefills as one big forward pass (which freezes decode for everyone), chunk the prefill into smaller pieces and interleave them with decode iterations. Removes TTFT spikes during sustained traffic. Now standard in vLLM and TensorRT-LLM.
Prefix caching
When many requests share a prompt prefix (system prompts, few-shot examples, RAG context), cache the KV for that prefix and reuse. Hash the prefix tokens, look up in a KV-cache pool. vLLM's PagedAttention enables sharing at block granularity.
Massive throughput wins for chatbots with shared system prompts and for agentic workloads with growing transcripts.
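A toy hash-based prefix cache over fixed-size token blocks: each block's key is the hash of the entire token prefix ending at that block, so a lookup walks blocks until the first miss. This mirrors the block-granular sharing described above; it is not any engine's exact scheme.

```python
BLOCK = 16
kv_store = {}        # hash of token prefix -> (placeholder for the cached KV block)

def block_keys(tokens):
    """One key per full block, each covering the whole prefix up to that block."""
    return [hash(tuple(tokens[:i + BLOCK]))
            for i in range(0, len(tokens) - BLOCK + 1, BLOCK)]

def insert(tokens):
    for key in block_keys(tokens):
        kv_store.setdefault(key, "KV-block")

def cached_prefix_len(tokens):
    hits = 0
    for key in block_keys(tokens):
        if key not in kv_store:
            break
        hits += 1
    return hits * BLOCK          # tokens of prefill that can be skipped

system_prompt = list(range(48))                         # shared 48-token system prompt
insert(system_prompt + [100, 101, 102])                 # first request fills the cache
print(cached_prefix_len(system_prompt + [200, 201]))    # -> 48 tokens reused
```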
- Disaggregate prefill from decode → independent scaling, no interference.
- Splitwise / DistServe / Mooncake are the canon. Mooncake adds a KV-cache tier.
- Sarathi-Serve chunks prefill — alternative for single-pool deployments.
- Prefix caching is free throughput when system prompts are shared.
Once you've adopted GQA or MLA architecturally, the next lever is on-the-fly KV compression. KIVI quantizes K per-channel and V per-token to INT2 with little quality loss. H2O / Heavy Hitters evict everything but the top-attention tokens. ChunkAttention dedups shared prefix chunks across requests via prefix tree. CLA shares KV across consecutive layers.
The KV-compression toolkit
- KIVI (Liu 2024) — per-channel quantization of K, per-token quantization of V → INT2 KV cache with minimal quality loss.
- H2O / Heavy Hitters (Zhang 2023, arxiv 2306.14048) — identify the small set of "heavy hitter" tokens that dominate attention; evict the rest. Drastically smaller cache, ~5% quality drop.
- SnapKV, PyramidKV — variants on eviction with per-layer policies.
- ChunkAttention — identify and dedup shared prefix chunks across distinct requests via prefix tree.
- CLA / MLKV — cross-layer KV sharing (use the same KV across multiple consecutive transformer layers — DeepSeek-V2 also explores this).
SGLang's prefix-cache structure: a radix trie over the KV cache. When a new request shares a prefix with cached requests, a tree walk finds the longest matching prefix in time proportional to that prefix's length. Enables aggressive prefix-cache reuse with low overhead.
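A minimal token-level trie for longest-prefix matching, to show the shape of the lookup; SGLang's real structure is a compressed radix tree over KV blocks with an eviction policy, which this sketch omits.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # token id -> TrieNode

root = TrieNode()

def insert(tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def longest_cached_prefix(tokens):
    """Walk the trie token by token; stop at the first miss."""
    node, depth = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, depth = node.children[t], depth + 1
    return depth                # leading tokens whose KV is already cached

insert([1, 2, 3, 4, 5, 6])                   # request A's prompt
print(longest_cached_prefix([1, 2, 3, 9]))   # request B shares a 3-token prefix -> 3
```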
- KIVI: INT2 KV with per-channel K, per-token V — almost free quality.
- H2O: keep heavy hitters, evict the rest. ~5% quality cost.
- ChunkAttention + CLA / MLKV cut cache further. Stack with GQA / MLA.
- SGLang's RadixAttention is the prefix-cache reference impl.
vLLM is the open-source throughput champion (PagedAttention, continuous batching). SGLang is catching up fast with RadixAttention and great structured-output performance. TensorRT-LLM is the NVIDIA-only speed king but model-conversion friction is real. TGI is the easy-deploy option. llama.cpp / MLC-LLM / ExecuTorch own the edge.
The matrix
| Engine | Strengths | Weak/Notes |
|---|---|---|
| vLLM | PagedAttention, continuous batching, broad model support, open | Throughput champion in 2024; SGLang catching up |
| SGLang | RadixAttention (prefix tree caching), fast structured generation, very fast tool-call decoding | Newer; growing rapidly in 2025 |
| TensorRT-LLM | Best-in-class on Nvidia HW (kernel fusion, paged kv, in-flight batching) | Nvidia-only, model conversion friction |
| TGI (Hugging Face) | Easy deployment, broad model coverage | Throughput trails vLLM/SGLang |
| llama.cpp | CPU + Metal + CUDA, GGUF format, edge-friendly | Throughput-limited at scale |
| MLC-LLM, ExecuTorch | On-device (mobile, browser via WebGPU) | Edge / consumer product |
- vLLM = open default. SGLang = catching up, best for structured output.
- TRT-LLM = peak Nvidia perf, friction tax to convert.
- llama.cpp / MLC-LLM / ExecuTorch = edge.
0 → hero reading path for LLM inference
- foundation vLLM blog — start with PagedAttention post, then continuous batching, then prefix caching
- foundation TGI docs
- foundation Sebastian Raschka — Coding the KV Cache from scratch
- build Walk through vLLM source — start with vllm/model_executor/layers/attention/
- build Implement speculative decoding in numpy — drill until you can do it in 30 min
- depth PagedAttention / vLLM paper (Kwon 2023)
- depth FlashAttention v1 (Dao 2022)
- depth FlashAttention v2
- depth FlashAttention v3 (Shah et al.)
- depth Speculative Decoding (Leviathan 2023)
- depth EAGLE
- depth Splitwise — disaggregated serving
- depth Mooncake (Kimi)
- depth Sarathi-Serve — chunked prefill
- depth GPTQ + AWQ + SmoothQuant for quantization
- depth LMSYS blog (SGLang authors) — RadixAttention etc.
LLM inference quiz — readiness check
- Walk through one decode step inside vLLM.
Show answer
Scheduler picks ready requests up to memory limit. For each: compute Q, K, V from previous token's hidden state; write K, V to next free block in the request's block table. PagedAttention kernel: attention over all blocks for that request. FFN. Sample next token. Update block table. New requests joining first do prefill (one big forward over their prompt).
- Why is decode memory-bandwidth-bound?
Show answer
Each decoded token requires reading all weights (~140 GB for 70B fp16) plus the full KV cache. Compute is small (one new Q vector). Batching helps because you reuse one weight read across many requests in the batch.
- How would you reduce TTFT for 100k-token prompts?
Show answer
Chunked prefill (Sarathi-Serve), prefix caching, sequence/context parallel for prefill, more GPUs allocated to that request. The bottleneck is FLOPs.
- Speculative decoding tradeoff?
Show answer
Wins when draft acceptance rate is high (similar distribution to target) and draft is fast (≤ ~5% target cost). Loses if draft too divergent (low acceptance) or too slow (overhead dominates).
- Compare GQA vs MQA vs MLA on KV cache size.
Show answer
MHA: 2·n_heads·d_head per token per layer. MQA: 2·d_head. GQA(g): 2·g·d_head. MLA: 2·d_latent (much smaller, e.g., 512 vs 8192 for d_model=8192). Quality: MHA ≥ GQA ≥ MQA, MLA matches MHA at 7% cache size.
- What's the bottleneck for serving 10k QPS chat?
Show answer
Memory bandwidth (decode), KV cache size (concurrent context), GPU count (cost). Mitigations: GQA/MLA, FP8 weights, prefix caching, continuous batching, speculative decoding, model routing (cheap → reasoning).
- Design a serving stack for a reasoning model with 8k hidden CoT tokens per request.
Show answer
Heavy decode load (8k tokens × users); huge KV cache. Disaggregated prefill (cheap) and decode (expensive); aggressive prefix caching across reasoning segments; queue with priority for premium tier; possibly spec-decoding with a weaker model for early CoT phase.
- How does FA3 use H100 features?
Show answer
TMA (Tensor Memory Accelerator) for async memory copy from HBM → SMEM. Warp-specialized producer/consumer pattern overlapping data move with compute. FP8 matmuls with two-stage scaling. Up to 75% of H100 peak (~740 TFLOPS bf16).
- Prove speculative decoding is exact.
Show answer
For drafted token t: accept with prob min(1, p_t/p_d). On reject: sample from residual (p_t − p_d)_+ / Z. Marginal of the emitted token: min(p_d, p_t) from the accepted path, plus (p_t − p_d)_+ from the rejected path (total rejection mass is Z), which sums to exactly p_t.
- Worked example: KV cache for Llama 3 70B at 128k context?
Show answer
GQA: 8 KV heads, d_head=128, 80 layers, bf16. Per token per layer: 2 · 8 · 128 · 2 = 4096 B. Per token: × 80 = ~320 KB. At 128k tokens: ~40 GB per request. MHA equivalent (64 KV heads): 8× = ~320 GB.
- Why disaggregate prefill and decode?
Show answer
Prefill is compute-bound; decode is memory-bandwidth-bound. Same GPU running both → decode latency spikes during prefill. Disaggregation: separate pools, transfer KV cache via NVLink/RDMA. Independent scaling. Splitwise / DistServe / Mooncake.
- Difference between PagedAttention block size 16 vs 128?
Show answer
Smaller (16): less internal fragmentation, more granular sharing across requests, more block-table overhead. Larger (128): less metadata, fewer GPU memory operations per attention, more wasted space at sequence ends. 16 is the vLLM default; experimentation needed for specific workloads.
- What is RadixAttention (SGLang)?
Show answer
Prefix tree (radix trie) over the KV cache. When a new request shares a prefix with cached requests, a tree walk finds the longest matching prefix (cost proportional to the prefix length). Enables aggressive prefix-cache reuse with low overhead. SGLang's contribution.
- Top-p vs top-k vs min-p?
Show answer
Top-k: keep top k probabilities. Top-p (nucleus): keep smallest set with cumulative prob ≥ p. Min-p: keep tokens with p ≥ p_threshold · max(p) — adaptive (wider in flat distributions).
- EAGLE vs Medusa for speculative decoding?
Show answer
EAGLE: small auto-regressive head conditioned on target's hidden states (richer context → high acceptance). Medusa: multiple parallel heads predicting +1, +2, +3 with tree verification. EAGLE-2/3 push further. Medusa simpler deployment; EAGLE higher acceptance.
- What is SmoothQuant?
Show answer
Migrate quantization difficulty from activations to weights via diagonal scaling: (X · diag(s)⁻¹) · (diag(s) · W). Now activations have smaller outliers, weights absorb the scale. Standard recipe for production W8A8.
- What's the difference between W4A16 and W8A8?
Show answer
W4A16: INT4 weights, bf16 activations. Smaller storage; matmul still in bf16. Most common production. W8A8: INT8 weights AND activations. Faster matmul (uses int8 tensor cores) but harder due to activation outliers. Need SmoothQuant.
- What is chunked prefill (Sarathi-Serve)?
Show answer
Instead of running long prefills as one big forward (freezing decode for all users), chunk the prefill and interleave with decode iterations. Removes TTFT spikes during sustained traffic. Now standard in vLLM and TensorRT-LLM.
- What is KIVI / H2O?
Show answer
KIVI: per-channel quant of K, per-token quant of V → INT2 KV cache with minimal quality loss. H2O (Heavy Hitters): identify the small set of tokens that dominate attention; evict the rest. Drastically smaller cache; ~5% quality drop.
- Continuous batching vs static batching — explain.
Show answer
Static: collect N requests, run all to completion. Short requests wait for long ones. Continuous: per-iteration scheduling. After each forward pass, finished requests leave; new requests join. Maximizes GPU utilization. vLLM, TGI, TensorRT-LLM all do this.