LLM inference
vLLM internals, FlashAttention, PagedAttention, speculative decoding, quantization, disaggregated serving. The OpenAI / vLLM / Together / Anthropic loops probe deeply here — know throughput vs TTFT tradeoffs cold and be ready to design a serving stack on the whiteboard.
What you'll learn
- The decode bottleneck — why inference is memory-bandwidth-bound
- KV cache — size formulas, real-world numbers
- PagedAttention & vLLM — the OS-VM analogy that fixed inference
- Continuous batching — the per-iteration scheduler
- Speculative decoding — exact, fast, mathematically clean
- Quantization for inference — INT4 / FP8 / SmoothQuant / GGUF
- FlashAttention — IO-aware attention v1 → v2 → v3
- Parallelism in inference — TP, PP, EP for serving
- Sampling — greedy through nucleus and beyond
- The latency vocabulary — TTFT, ITL, TPS, throughput
- Disaggregated serving — Splitwise, DistServe, Mooncake
- Prefix & KV-cache compression — beyond MLA
- Inference engines compared — vLLM, SGLang, TRT-LLM, TGI
Prefill is compute-bound (you process the whole prompt in one big matmul). Decode is memory-bandwidth-bound (every step reads all weights + the KV cache to emit one token). Batching helps decode by amortizing the weight read across requests. Almost every inference optimization follows from this one fact.
Two phases, two bottlenecks
An LLM serves a request in two distinct phases:
- Prefill — process the prompt (N tokens) in one forward pass. Compute scales as N². FLOP-bound for long prompts.
- Decode — generate one token at a time. Each step recomputes one query against the entire KV cache. Tiny compute per step but enormous memory reads.
Why decode is bandwidth-bound
Each decoded token requires reading all model weights (e.g., ~140 GB for a 70B model in BF16) plus the full KV cache for that request — every single step. Compute is one Q vector against the cache; the dominant cost is memory traffic. Batching helps because you read the weights once and reuse them across many requests' Q vectors.
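To make the bandwidth argument concrete, here is a back-of-envelope estimate in Python. The hardware and model numbers (H100 HBM bandwidth ≈ 3.35 TB/s, ~140 GB of BF16 weights, ~10 GB of KV per request) are illustrative round figures, not measurements.

```python
# Back-of-envelope: decode speed is bounded by how fast bytes can be streamed from HBM.
WEIGHT_BYTES = 140e9        # ~70B params in BF16 (illustrative)
KV_BYTES_PER_REQ = 10e9     # ~32k-token KV cache for one request (illustrative)
HBM_BW = 3.35e12            # H100 SXM HBM3 bandwidth in bytes/s (nominal)

def min_step_time(batch_size: int) -> float:
    """Lower bound on one decode step: weights read once, KV cache read per request."""
    bytes_moved = WEIGHT_BYTES + batch_size * KV_BYTES_PER_REQ
    return bytes_moved / HBM_BW

for b in (1, 8, 64):
    t = min_step_time(b)
    # Per-stream tokens/sec vs aggregate tokens/sec across the whole batch.
    print(f"batch={b:3d}  step >= {t*1e3:6.1f} ms  per-stream <= {1/t:6.1f} tok/s  "
          f"aggregate <= {b/t:7.1f} tok/s")
```

The output shows the core effect: the weight read is amortized, so aggregate throughput climbs steeply with batch size while per-stream speed degrades only gradually.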
- Prefill = compute-bound. Decode = memory-bandwidth-bound.
- Batching amortizes the weight read; it's the #1 throughput lever.
- Everything in this chapter follows from arithmetic intensity.
Per-token-per-layer KV memory is 2 · n_kv_heads · d_head · n_bytes. For Llama 3 70B at 128k context with GQA: ~40 GB per request. With MHA it would be ~320 GB per request — totally unservable, which is exactly why GQA / MLA exist. KV cache is the dominant memory cost during long-context inference and the dominant bandwidth cost during decode.
The formula — get it right under interview pressure
During autoregressive decode, K and V for previous tokens don't change — cache them. At step t, only compute Q, K, V for the new token; append K,V to the cache; attention is Q (1×d) attending to cache (t×d) → O(t·d) per step.
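A minimal single-head sketch of the caching mechanism in NumPy, in the spirit of the "KV cache from scratch" exercise in the reading path. Shapes and names (d_head, the append-only cache) are illustrative; a real implementation works per layer and per head and applies RoPE, masking, etc.

```python
import numpy as np

d_head = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_head, d_head)) * 0.02 for _ in range(3))

K_cache = np.zeros((0, d_head))   # grows by one row per decoded token
V_cache = np.zeros((0, d_head))

def decode_step(x):
    """x: hidden state of the newest token, shape (d_head,)."""
    global K_cache, V_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])          # append; old K/V are never recomputed
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d_head)     # one query against the whole cache: O(t·d)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V_cache                     # attention output for this token

out = decode_step(rng.standard_normal(d_head))
```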
Per token, per layer: 2 · n_kv_heads · d_head · n_bytes (the leading 2 is K + V).
Total: n_layers · seq_len · 2 · n_kv_heads · d_head · n_bytes.
The number you'll be asked to derive on the whiteboard
Llama 3 70B uses GQA with 8 KV heads, d_head = 128, 80 layers, BF16 (2 bytes).
- Per token per layer: 2 · 8 · 128 · 2 = 4,096 bytes.
- Per token (all 80 layers): 4,096 · 80 = 327,680 bytes ≈ 320 KB.
- At 128k tokens: 320 KB · 131,072 ≈ 40 GB per request.
- If built with MHA (64 KV heads): 8× larger → ~320 GB per request — totally unservable.
- MLA (DeepSeek-style, latent dim ~512): ~3 GB at the same context — >10× smaller than GQA
This is exactly why GQA / MQA / MLA exist. The KV cache, not the weights, is what limits concurrent context capacity in long-context serving.
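A small calculator for the formula above; the Llama 3 70B inputs reproduce the ~40 GB and ~320 GB figures. Treat it as a sketch: it ignores block rounding and per-framework overheads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    # 2 · n_kv_heads · d_head · n_bytes per token per layer (the leading 2 is K + V)
    return n_layers * seq_len * 2 * n_kv_heads * d_head * bytes_per_elem

CTX = 131_072  # 128k tokens
gqa = kv_cache_bytes(80, 8,  128, CTX)   # Llama 3 70B with GQA (8 KV heads)
mha = kv_cache_bytes(80, 64, 128, CTX)   # same model if it used full MHA (64 KV heads)
print(f"GQA: {gqa / 2**30:.0f} GiB   MHA: {mha / 2**30:.0f} GiB per request")
# -> GQA: 40 GiB, MHA: 320 GiB
```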
- MHA: 2 · n_heads · d_head per token per layer.
- MQA: 2 · d_head (single shared KV head).
- GQA(g): 2 · g · d_head (g KV heads, each shared by a group of query heads).
- MLA: 2 · d_latent (e.g., 512 vs 8192 for d_model = 8192). Quality matches MHA at ~7% cache size.
Quality ranking: MHA ≥ GQA ≥ MQA, but GQA matches MHA at much lower memory.
- Llama 3 70B GQA at 128k = ~40 GB per request. MHA equivalent = ~320 GB.
- Memorize the formula: 2 · n_kv_heads · d_head · n_bytes per token per layer.
- KV cache is the dominant memory cost at long context and the dominant bandwidth cost during decode.
Naive KV cache allocates one contiguous block per request, sized for max context. Result: massive internal fragmentation, low concurrency. PagedAttention (Kwon 2023) splits the KV cache into fixed-size blocks and gives each request a "block table" — exactly like OS virtual memory. Eliminates fragmentation, enables prefix sharing, supports beam search cheaply.
The OS-VM analogy
(Kwon 2023, arxiv 2309.06180.) Inspired by OS virtual memory:
- KV cache is split into fixed-size blocks (e.g., 16 tokens).
- Each request has a "block table" mapping logical → physical blocks.
- New tokens allocate new blocks on demand.
- Different requests can share blocks (e.g., common prefix).
Eliminates fragmentation, enables prefix sharing, supports complex sampling (beam search, parallel sampling) cheaply.
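A toy block-table allocator to make the virtual-memory analogy concrete. The names (BLOCK_SIZE, free list, per-request table) are illustrative, not vLLM's actual data structures.

```python
BLOCK_SIZE = 16                        # tokens per KV block (vLLM default)
NUM_BLOCKS = 1024                      # physical blocks that fit in GPU memory

free_blocks = list(range(NUM_BLOCKS))  # free list of physical block ids
block_tables = {}                      # request id -> list of physical block ids
lengths = {}                           # request id -> tokens written so far

def append_token(req_id: str):
    """Reserve KV space for one new token, allocating a block only on demand."""
    table = block_tables.setdefault(req_id, [])
    n = lengths.get(req_id, 0)
    if n % BLOCK_SIZE == 0:            # current block is full (or this is token 0)
        table.append(free_blocks.pop())
    lengths[req_id] = n + 1

def free_request(req_id: str):
    """Return all physical blocks to the pool when a request finishes."""
    free_blocks.extend(block_tables.pop(req_id, []))
    lengths.pop(req_id, None)

for _ in range(40):                    # 40 tokens -> ceil(40 / 16) = 3 blocks
    append_token("req-A")
print(block_tables["req-A"])           # logical -> physical mapping for this request
```

Prefix sharing falls out of the same indirection: two requests whose tables point at the same physical blocks share that prefix's KV without copying.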
- Smaller (16, vLLM default) — less internal fragmentation, more granular sharing across requests, more block-table overhead.
- Larger (128) — less metadata, fewer GPU memory operations per attention, more wasted space at sequence ends.
16 is the vLLM default; experiment for your specific workload.
- PagedAttention = OS virtual memory for KV cache.
- Block table indirection = prefix sharing + low fragmentation + beam search for free.
- This is the single most impactful inference paper of 2023.
Static batching collects N requests, runs all to completion, ships the batch. Short requests sit and wait for long ones — terrible utilization. Continuous batching (Orca, Yu 2022) reschedules per iteration: finished requests leave, new ones join. vLLM, TGI, TensorRT-LLM all do this. It's the second-most-important scheduling idea after PagedAttention.
Static vs continuous
Static batching
- Collect N requests, run all to completion.
- Short requests wait for long ones.
- Throughput limited by tail of batch.
- Good for offline / batch jobs only.
Continuous batching (Orca)
- Scheduler operates per-iteration.
- After each forward pass, finished requests leave; new ones join.
- Maximizes GPU utilization.
- Standard in vLLM, TGI, TRT-LLM.
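A skeletal per-iteration scheduler loop, under simplifying assumptions (no memory accounting, no chunked prefill, prefill treated as instantaneous). The only point it demonstrates is that admission and eviction happen at every step, not once per batch.

```python
from collections import deque
import random

waiting = deque(f"req-{i}" for i in range(8))   # queued requests
running = {}                                    # req_id -> tokens left to decode
MAX_BATCH = 4

def forward_pass(batch):
    """Stand-in for one model step: each running request emits one token."""
    for req in batch:
        running[req] -= 1

step = 0
while waiting or running:
    # Admit new requests at every iteration, not once per batch.
    while waiting and len(running) < MAX_BATCH:
        req = waiting.popleft()
        running[req] = random.randint(1, 6)     # pretend prefill is done; tokens to emit
    forward_pass(list(running))
    # Finished requests leave immediately, freeing their slot for the queue.
    for req in [r for r, left in running.items() if left == 0]:
        del running[req]
    step += 1

print(f"served 8 requests in {step} iterations")
```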
- Continuous = per-iteration scheduling; finished out, new in.
- Static batching is fine for offline; never for serving.
- Long prefills still cause head-of-line blocking — chunk them.
A small, fast draft model proposes K tokens. The big target model verifies all K in parallel with one forward pass. Accept tokens up to the first rejection; sample one fresh token from a residual distribution at the rejection point. The math is constructed so the marginal output distribution exactly equals p_target. ~2-3× speedup when draft acceptance is high and draft cost is low.
The algorithm
(Leviathan 2023, arxiv 2211.17192.) A small draft model generates K tokens; the target model verifies all K in parallel with one forward pass; accept tokens up to first rejection, sample one more from corrected distribution. Always exact (samples from target distribution).
Why the output distribution is exactly p_target
For each drafted token t:
- Accept with probability min(1, p_target(t) / p_draft(t)).
- If rejected, sample one fresh token from the residual distribution p_residual(t') = max(0, p_target(t') − p_draft(t')) / Z (truncated to non-negative, normalized), and stop processing further drafted tokens.
- If all K drafts accepted, sample one bonus token from p_target.
Why exact? Condition on accepted vs rejected and sum the two paths. The marginal distribution of each emitted token is:
p_emit(t) = p_draft(t) · min(1, p_target(t)/p_draft(t)) [accepted path]
+ (Σ_t' max(0, p_target(t') − p_draft(t'))) · p_residual(t) [rejected path]
The first term equals min(p_draft(t), p_target(t)). The total rejection probability is 1 − Σ_t' min(p_draft(t'), p_target(t')) = Σ_t' max(0, p_target(t') − p_draft(t')) = Z, so the second term equals max(0, p_target(t) − p_draft(t)). Summing: min(p_draft(t), p_target(t)) + max(0, p_target(t) − p_draft(t)) = p_target(t).
Frontier-lab probe: "Prove speculative decoding samples from p_target." Be ready to write this on a whiteboard.
Speedup ~2–3× depending on draft acceptance rate. Best when draft is fast (~5% of target cost) and aligned (high acceptance).
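A NumPy sketch of the accept/reject rule over a toy vocabulary, the same exercise the reading path suggests drilling. The distributions are made up; the structure (accept probability, residual sampling) follows the algorithm above.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size

def norm(x):
    return x / x.sum()

p_draft  = norm(rng.random(V))   # q: draft model's distribution at this position
p_target = norm(rng.random(V))   # p: target model's distribution at this position

# Draft proposes a token, target verifies it.
t = rng.choice(V, p=p_draft)
accept_prob = min(1.0, p_target[t] / p_draft[t])

if rng.random() < accept_prob:
    emitted = t                                    # accepted: keep the drafted token
else:
    residual = np.maximum(p_target - p_draft, 0)   # (p - q)_+, then renormalize
    emitted = rng.choice(V, p=norm(residual))      # rejected: resample from the residual

# Marginal over both paths: min(q, p) + Z * residual = p_target, as derived above.
```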
EAGLE / EAGLE-2/3
(Li 2024, arxiv 2401.15077.) Uses target model's hidden states (not just tokens) as input to a small auto-regressive head that drafts. Higher acceptance because the draft sees richer context.
Medusa
(Cai 2024.) Multiple parallel decoding heads on the target model; each head predicts token at position +1, +2, +3, etc. Tree-based verification. No separate draft model. Simpler deployment.
Lookahead decoding
(Fu 2024.) Generates n-grams via Jacobi iteration on the target model itself; verifies through a unified attention pattern. Draft-free.
- Speculative decoding is exact — output distribution = p_target by construction.
- Accept prob = min(1, p_t/p_d); on reject, sample from (p_t − p_d)_+ / Z.
- EAGLE leads on acceptance rate; Medusa is simpler; Lookahead is draft-free.
The 2026 production sweet spot: W4A16 (INT4 weights via GPTQ/AWQ, BF16 activations) for cost-sensitive serving; FP8 W8A8 on H100/Blackwell for max throughput. SmoothQuant migrates outliers from activations to weights so W8A8 actually works. NF4 is for QLoRA. GGUF is llama.cpp's format for CPU/edge.
INT8 / FP8 — the W8A8 family
Per-channel weight quantization; activations either quantized too (W8A8) or kept in FP16 (W8A16). LLM.int8() (Dettmers) handles outliers via mixed precision. FP8 is native in H100 — generally W8A8 with per-tensor or per-channel scales.
SmoothQuant (Xiao 2023, arxiv 2211.10438): activations have outlier channels that ruin quantization; migrate the difficulty from activations to weights via a diagonal scale s:
(X · diag(s)⁻¹) · (diag(s) · W)
Now activations have smaller outliers and weights absorb the scale. This is the standard recipe for production W8A8 to actually work.
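A NumPy illustration of the identity behind SmoothQuant: scaling activations down per channel and weights up by the same factor leaves the matmul output unchanged while taming activation outliers. The α = 0.5 migration strength follows the paper's recipe; everything else here is a toy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
X[:, 3] *= 50.0                        # one outlier activation channel
W = rng.standard_normal((8, 16))

alpha = 0.5                            # migration strength
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                       # X · diag(s)^-1 : activation outliers shrink
W_smooth = W * s[:, None]              # diag(s) · W    : weights absorb the scale

assert np.allclose(X @ W, X_smooth @ W_smooth)        # mathematically identical output
print(np.abs(X).max(), "->", np.abs(X_smooth).max())  # activation range is far tamer
```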
W4A16 — most common production deployment
INT4 weights (GPTQ/AWQ), bf16 activations. Smaller than W8A8 in storage but equal compute (matmul still runs in bf16). Pure W4A4 is much harder due to activation outliers.
- GPTQ (Frantar 2022, arxiv 2210.17323) — post-training, layer-by-layer quantization. Uses second-order info (approximate Hessian via calibration set) to minimize MSE on activations. Per-group quantization (group size 128).
- AWQ (Lin 2023, arxiv 2306.00978) — activation-aware: protect "salient" weight channels (those seeing the largest activation magnitudes) by scaling them up before quantization, so the relative quantization error on the channels that matter most is reduced.
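The sketch below shows only the per-group INT4 storage format both methods share (symmetric, group size 128), not either paper's error-minimizing algorithm; function names are illustrative.

```python
import numpy as np

def quantize_w4_groups(w: np.ndarray, group_size: int = 128):
    """Symmetric INT4 quantization of a weight vector, one scale per group of 128."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0    # int4 range is [-8, 7]
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_w4_groups(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs error: {err:.4f}")   # small for roughly Gaussian weights
```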
NF4 (QLoRA)
(Dettmers, arxiv 2305.14314.) "NormalFloat" — quantization levels chosen to match a normal distribution's quantiles. Good for weights (which are roughly Gaussian).
Microscaling (MX)
Block-wise scales — better outlier handling. MXFP8 / MXFP6 / MXFP4. Hardware support in Blackwell.
GGUF
llama.cpp's K-quants (Q4_K_M, Q5_K_S, etc.) — block-wise non-uniform schemes good for CPU/edge inference. Designed for the GGUF container format.
Typical deployments put weights in INT4 (GPTQ/AWQ) or FP4 (Blackwell) and activations in FP8 or BF16. Tensor cores actually accelerate INT8 / FP8 matmuls, so W8A8 gets both storage and compute wins on H100/Blackwell; W4A16 wins on memory bandwidth (smaller weight read) but still runs its matmuls on BF16 tensor cores.
- W4A16 = most common production. FP8 W8A8 = max throughput on H100+.
- SmoothQuant is the trick that makes W8A8 actually work in production.
- GPTQ uses a Hessian; AWQ protects salient channels.
Standard softmax-attention materializes an L×L matrix in HBM — IO-bound at long sequence length. FlashAttention tiles the computation and uses online softmax to keep partial state in SRAM. v1 (Dao 2022) showed ~3× speedup. v2 improved work partitioning. v3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao 2024) exploits H100 features (TMA, warp specialization, FP8) to hit ~75% of H100 peak.
FA1 — the IO-aware idea
(Dao 2022, arxiv 2205.14135.) Computes attention in tiles. Standard softmax requires the full row to normalize. Online softmax: maintain running max and sum, rescale as new tiles arrive. Avoids materializing the L×L attention matrix in HBM. IO-aware; ~3× faster for typical seq lengths, much more for long.
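A NumPy sketch of the online-softmax accumulation FlashAttention relies on: process K/V in tiles, keep a running max and running sum, and rescale the partial output as each new tile arrives. No L×L matrix is ever formed. Tile size and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, TILE = 1024, 64, 128
q = rng.standard_normal(d)
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

m_run = -np.inf        # running max of scores seen so far
denom = 0.0            # running sum of exp(score - m_run)
acc = np.zeros(d)      # running (unnormalized) weighted sum of V rows

for start in range(0, L, TILE):
    Kt, Vt = K[start:start + TILE], V[start:start + TILE]
    s = Kt @ q / np.sqrt(d)              # scores for this tile only
    m_new = max(m_run, s.max())
    scale = np.exp(m_run - m_new)        # rescale previously accumulated state
    p = np.exp(s - m_new)
    denom = denom * scale + p.sum()
    acc = acc * scale + p @ Vt
    m_run = m_new

# Reference: standard softmax attention with the full score vector materialized.
scores = K @ q / np.sqrt(d)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
assert np.allclose(acc / denom, probs @ V)   # identical result, tile by tile
```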
FA2 — better partitioning
(Dao 2023, arxiv 2307.08691.) Improved work partitioning, reduced non-matmul FLOPs, parallelizes across sequence dim. Better GPU occupancy.
FA3 — H100-specific
FA3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao 2024, arxiv 2407.08608). Note: FA3 lead author is Jay Shah, not Tri Dao (who is on the paper). Frontier-lab interviewers actually probe the attribution.
- TMA (Tensor Memory Accelerator) — async memory copy from HBM → SMEM, freeing the warp scheduler.
- Warp specialization — producer/consumer warps overlap data movement and compute.
- FP8 with two-stage scaling — preserves precision through the matmul.
Result: up to ~75% of H100 peak (~740 TFLOPS BF16).
- FA = tile + online softmax → never materialize the L×L matrix.
- FA3 lead author = Jay Shah; Tri Dao is on the paper. Get the attribution right.
- FA3 features: TMA, warp specialization, FP8 two-stage scaling.
Tensor Parallel shards weight matrices across GPUs (within NVLink, two all-reduces per layer, TP ≤ 8). Pipeline Parallel shards layers (lower BW, pipeline bubbles). Expert Parallel places different MoE experts on different GPUs (all-to-all dispatch). Serving combines them — DeepSeek V3 ran EP across 64-256 GPUs with custom DualPipe overlap kernels.
Tensor Parallel (TP)
Shard each weight matrix across GPUs. For attention: shard heads. For FFN: shard hidden dim. Two all-reduces per layer (after attention output proj, after FFN W₂). Bandwidth-hungry → keep within NVLink domain (TP ≤ 8 typically).
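A NumPy simulation of the sharding pattern on one FFN block: W1 is column-sharded, W2 is row-sharded, and the final sum over shards plays the role of the all-reduce. Shapes are illustrative and there is no real communication here, just the math that makes the sharding correct.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, TP = 16, 64, 4
x = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

W1_shards = np.split(W1, TP, axis=1)   # column-parallel: each rank holds a slice of d_ff
W2_shards = np.split(W2, TP, axis=0)   # row-parallel: matching slice of the second matmul

partials = []
for W1_i, W2_i in zip(W1_shards, W2_shards):
    h_i = np.maximum(x @ W1_i, 0)      # local activation slice (elementwise ReLU is safe)
    partials.append(h_i @ W2_i)        # partial output of shape (d_model,)

y_tp = np.sum(partials, axis=0)        # the all-reduce: sum partial outputs across ranks
y_ref = np.maximum(x @ W1, 0) @ W2
assert np.allclose(y_tp, y_ref)
```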
Pipeline Parallel (PP)
Shard layers across GPUs. Lower bandwidth; introduces pipeline bubbles. Mostly used for very large models when TP within node isn't enough.
Expert Parallel (EP)
For MoE, place different experts on different GPUs. Token dispatch via all-to-all. Combines with TP and DP. DeepSeek V3 used EP across 64-256 GPUs with custom comm/compute overlap kernels (DualPipe).
- TP within NVLink (≤ 8). PP across nodes if needed. EP for MoE.
- Two all-reduces per layer in TP — bandwidth is the bottleneck.
- DualPipe (DeepSeek) overlaps comm and compute across pipeline stages.
Greedy is deterministic and prone to repetition. Top-k clips to a fixed count; top-p (nucleus) clips to a cumulative probability mass; min-p adapts to distribution flatness. Beam search is great for translation, terrible for creative gen. Repetition / frequency / presence penalties are hacky but ubiquitous in production.
The full menu
- Greedy — argmax. Deterministic. Boring; prone to repetition.
- Temperature — divide logits by T. T<1 sharper, T>1 flatter. T=0 = greedy.
- Top-k — keep top k logits, renormalize, sample.
- Top-p (nucleus) (Holtzman 2019) — keep smallest set whose cumulative prob ≥ p, renormalize, sample. p=0.9 or 0.95 typical.
- Min-p — keep tokens with prob ≥ p · max_prob. Adaptive — wider when distribution flat.
- Beam search — maintain k beams; expand all; keep top k by total log-prob. Good for translation, bad for creative gen (mode-collapsed).
- Contrastive search (Su 2022) — pick the candidate maximizing (1−α)·model confidence − α·max similarity to previous hidden states. Reduces repetition.
- Repetition penalty / frequency / presence penalties (OpenAI-style) — hacky but effective.
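The three truncation rules side by side, as NumPy filters over a probability vector. Thresholds are the typical values named above; exact tie-breaking details vary by implementation.

```python
import numpy as np

def top_k(probs, k=50):
    keep = np.argsort(probs)[-k:]                        # k most probable tokens
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

def top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1                # smallest set with mass >= p
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

def min_p(probs, p=0.05):
    mask = np.where(probs >= p * probs.max(), probs, 0)  # threshold adapts to the peak
    return mask / mask.sum()

probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
for f in (top_k, top_p, min_p):
    print(f.__name__, np.round(f(probs), 3))
```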
- Top-p with p ≈ 0.9-0.95 + temperature ≈ 0.7-1.0 = the safe production default.
- Min-p is an under-rated upgrade: adapts to distribution flatness.
- Beam search for translation; never for chat.
TTFT = how long until the first token streams (prefill-bound, FLOP-heavy). ITL = per-token decode time (memory-bandwidth-bound). TPS per stream = 1/ITL. Aggregate throughput = sum across streams (higher batch = higher aggregate but worse per-stream ITL). SLA-driven scheduling balances these.
The four numbers everyone confuses
| Metric | What it is | What it's bound by |
|---|---|---|
| TTFT (Time to First Token) | Latency from request → first token | Prefill (compute the prompt's KV in one pass). FLOP-bound for long prompts. |
| ITL (Inter-Token Latency) | Per-token decode time | Memory-bandwidth-bound (must read weights + KV every step). |
| TPS (Tokens Per Second) | Per-stream throughput = 1/ITL | Same as ITL. |
| Aggregate throughput | Tokens/sec across all concurrent streams | Higher batch → higher aggregate but worse per-stream ITL. |
Tradeoff: low TTFT + low ITL = expensive (small batches, lots of GPUs idle). High aggregate throughput = large batches, slower per-user. SLA-driven scheduling balances these.
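The four metrics as plain arithmetic, using made-up but realistic numbers; the batch loop at the end is a toy degradation model whose only purpose is to show the per-stream vs aggregate tradeoff.

```python
# Illustrative numbers, not measurements.
ttft_s = 0.8            # prefill latency for a long prompt (compute-bound)
itl_s = 0.025           # inter-token latency at batch size 1 (bandwidth-bound)

tps_per_stream = 1 / itl_s
print(f"TTFT {ttft_s*1e3:.0f} ms, ITL {itl_s*1e3:.0f} ms, per-stream TPS {tps_per_stream:.0f}")

for batch in (1, 16, 64):
    itl_b = itl_s * (1 + 0.02 * batch)          # toy model: ITL degrades as batch grows
    print(f"batch {batch:2d}: per-stream {1/itl_b:5.1f} tok/s, "
          f"aggregate {batch/itl_b:7.1f} tok/s")
```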
- TTFT = prefill (compute). ITL = decode (bandwidth). Don't conflate.
- Higher batch → better aggregate throughput, worse per-stream ITL.
- Autoscale on queue depth, provision for p95.
Prefill is compute-bound, decode is memory-bandwidth-bound. Co-locating them causes interference (decode latency spikes when prefill runs). Disaggregate them onto separate GPU pools and transfer KV via NVLink/RDMA. Splitwise (Microsoft 2023), DistServe (Berkeley 2024), Mooncake (Kimi 2024) are the canon. Sarathi-Serve's chunked prefill is the alternative for single-pool deployments.
Why disaggregate
Prefill is compute-bound, decode is memory-bandwidth-bound. Running them on the same GPU causes interference — decode latency spikes during prefill (head-of-line blocking).
Disaggregation: separate prefill and decode pools. Prefill GPUs do bulk compute, then transfer KV cache to decode GPUs over fast interconnect (NVLink, RDMA). Independent scaling.
The canon
- Splitwise (Patel 2024, arxiv 2311.18677) — Microsoft's original disaggregation paper.
- DistServe (Zhong 2024, arxiv 2401.09670) — Berkeley, similar idea, careful cluster sizing analysis.
- Mooncake (Qin 2024, arxiv 2407.00079) — Kimi's open-source disagg + KVCache-store design (separates KV cache as a tier between RAM and HBM).
DeepSeek's open-source serving stack uses this pattern.
Chunked prefill (Sarathi-Serve) — the single-pool alternative
Sarathi-Serve (Agrawal 2024, arxiv 2403.02310): instead of running long prefills as one big forward pass (which freezes decode for everyone), chunk the prefill into smaller pieces and interleave them with decode iterations. Removes TTFT spikes during sustained traffic. Now standard in vLLM and TensorRT-LLM.
Prefix caching
When many requests share a prompt prefix (system prompts, few-shot examples, RAG context), cache the KV for that prefix and reuse. Hash the prefix tokens, look up in a KV-cache pool. vLLM's PagedAttention enables sharing at block granularity.
Massive throughput wins for chatbots with shared system prompts and for agentic workloads with growing transcripts.
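A toy hash-based prefix cache over fixed-size token blocks: each block's key is the hash of the entire token prefix ending at that block, so a lookup walks blocks until the first miss. This mirrors the block-granular sharing described above; it is not any engine's exact scheme.

```python
BLOCK = 16
kv_store = {}        # hash of token prefix -> (placeholder for the cached KV block)

def block_keys(tokens):
    """One key per full block, each covering the whole prefix up to that block."""
    return [hash(tuple(tokens[:i + BLOCK]))
            for i in range(0, len(tokens) - BLOCK + 1, BLOCK)]

def insert(tokens):
    for key in block_keys(tokens):
        kv_store.setdefault(key, "KV-block")

def cached_prefix_len(tokens):
    hits = 0
    for key in block_keys(tokens):
        if key not in kv_store:
            break
        hits += 1
    return hits * BLOCK          # tokens of prefill that can be skipped

system_prompt = list(range(48))                         # shared 48-token system prompt
insert(system_prompt + [100, 101, 102])                 # first request fills the cache
print(cached_prefix_len(system_prompt + [200, 201]))    # -> 48 tokens reused
```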
- Disaggregate prefill from decode → independent scaling, no interference.
- Splitwise / DistServe / Mooncake are the canon. Mooncake adds a KV-cache tier.
- Sarathi-Serve chunks prefill — alternative for single-pool deployments.
- Prefix caching is free throughput when system prompts are shared.
Once you've adopted GQA or MLA architecturally, the next lever is on-the-fly KV compression. KIVI quantizes K per-channel and V per-token to INT2 with little quality loss. H2O / Heavy Hitters evict everything but the top-attention tokens. ChunkAttention dedups shared prefix chunks across requests via prefix tree. CLA shares KV across consecutive layers.
The KV-compression toolkit
- KIVI (Liu 2024) — per-channel quantization of K, per-token quantization of V → INT2 KV cache with minimal quality loss.
- H2O / Heavy Hitters (Zhang 2023, arxiv 2306.14048) — identify the small set of "heavy hitter" tokens that dominate attention; evict the rest. Drastically smaller cache, ~5% quality drop.
- SnapKV, PyramidKV — variants on eviction with per-layer policies.
- ChunkAttention — identify and dedup shared prefix chunks across distinct requests via prefix tree.
- CLA / MLKV — cross-layer KV sharing (use the same KV across multiple consecutive transformer layers — DeepSeek-V2 also explores this).
SGLang's prefix-cache structure: a radix trie over the KV cache. When a new request shares a prefix with cached requests, a tree walk finds the longest matching prefix in time proportional to that prefix's length. Enables aggressive prefix-cache reuse with low overhead.
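A minimal token-level trie for longest-prefix matching, to show the shape of the lookup; SGLang's real structure is a compressed radix tree over KV blocks with an eviction policy, which this sketch omits.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # token id -> TrieNode

root = TrieNode()

def insert(tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def longest_cached_prefix(tokens):
    """Walk the trie token by token; stop at the first miss."""
    node, depth = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, depth = node.children[t], depth + 1
    return depth                # leading tokens whose KV is already cached

insert([1, 2, 3, 4, 5, 6])                   # request A's prompt
print(longest_cached_prefix([1, 2, 3, 9]))   # request B shares a 3-token prefix -> 3
```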
- KIVI: INT2 KV with per-channel K, per-token V — almost free quality.
- H2O: keep heavy hitters, evict the rest. ~5% quality cost.
- ChunkAttention + CLA / MLKV cut cache further. Stack with GQA / MLA.
- SGLang's RadixAttention is the prefix-cache reference impl.
vLLM is the open-source throughput champion (PagedAttention, continuous batching). SGLang is catching up fast with RadixAttention and great structured-output performance. TensorRT-LLM is the NVIDIA-only speed king but model-conversion friction is real. TGI is the easy-deploy option. llama.cpp / MLC-LLM / ExecuTorch own the edge.
The matrix
| Engine | Strengths | Weak/Notes |
|---|---|---|
| vLLM | PagedAttention, continuous batching, broad model support, open | Throughput champion in 2024; SGLang catching up |
| SGLang | RadixAttention (prefix tree caching), fast structured generation, very fast tool-call decoding | Newer; growing rapidly in 2025 |
| TensorRT-LLM | Best-in-class on Nvidia HW (kernel fusion, paged kv, in-flight batching) | Nvidia-only, model conversion friction |
| TGI (Hugging Face) | Easy deployment, broad model coverage | Throughput trails vLLM/SGLang |
| llama.cpp | CPU + Metal + CUDA, GGUF format, edge-friendly | Throughput-limited at scale |
| MLC-LLM, ExecuTorch | On-device (mobile, browser via WebGPU) | Edge / consumer product |
- vLLM = open default. SGLang = catching up, best for structured output.
- TRT-LLM = peak Nvidia perf, friction tax to convert.
- llama.cpp / MLC-LLM / ExecuTorch = edge.
0 → hero reading path for LLM inference
- foundation vLLM blog — start with PagedAttention post, then continuous batching, then prefix caching
- foundation TGI docs
- foundation Sebastian Raschka — Coding the KV Cache from scratch
- build Walk through vLLM source — start with vllm/model_executor/layers/attention/
- build Implement speculative decoding in numpy — drill until you can do it in 30 min
- depth PagedAttention / vLLM paper (Kwon 2023)
- depth FlashAttention v1 (Dao 2022)
- depth FlashAttention v2
- depth FlashAttention v3 (Shah et al.)
- depth Speculative Decoding (Leviathan 2023)
- depth EAGLE
- depth Splitwise — disaggregated serving
- depth Mooncake (Kimi)
- depth Sarathi-Serve — chunked prefill
- depth GPTQ + AWQ + SmoothQuant for quantization
- depth LMSYS blog (SGLang authors) — RadixAttention etc.
LLM inference quiz — readiness check
- Walk through one decode step inside vLLM.
Show answer
Scheduler picks ready requests up to memory limit. For each: compute Q, K, V from previous token's hidden state; write K, V to next free block in the request's block table. PagedAttention kernel: attention over all blocks for that request. FFN. Sample next token. Update block table. New requests joining first do prefill (one big forward over their prompt).
- Why is decode memory-bandwidth-bound?
Show answer
Each decoded token requires reading all weights (~140 GB for 70B fp16) plus the full KV cache. Compute is small (one new Q vector). Batching helps because you reuse one weight read across many requests in the batch.
- How would you reduce TTFT for 100k-token prompts?
Show answer
Chunked prefill (Sarathi-Serve), prefix caching, sequence/context parallel for prefill, more GPUs allocated to that request. The bottleneck is FLOPs.
- Speculative decoding tradeoff?
Show answer
Wins when draft acceptance rate is high (similar distribution to target) and draft is fast (≤ ~5% target cost). Loses if draft too divergent (low acceptance) or too slow (overhead dominates).
- Compare GQA vs MQA vs MLA on KV cache size.
Show answer
MHA: 2·n_heads·d_head per token per layer. MQA: 2·d_head. GQA(g): 2·g·d_head. MLA: 2·d_latent (much smaller, e.g., 512 vs 8192 for d_model=8192). Quality: MHA ≥ GQA ≥ MQA, MLA matches MHA at 7% cache size.
- What's the bottleneck for serving 10k QPS chat?
Show answer
Memory bandwidth (decode), KV cache size (concurrent context), GPU count (cost). Mitigations: GQA/MLA, FP8 weights, prefix caching, continuous batching, speculative decoding, model routing (cheap → reasoning).
- Design a serving stack for a reasoning model with 8k hidden CoT tokens per request.
Show answer
Heavy decode load (8k tokens × users); huge KV cache. Disaggregated prefill (cheap) and decode (expensive); aggressive prefix caching across reasoning segments; queue with priority for premium tier; possibly spec-decoding with a weaker model for early CoT phase.
- How does FA3 use H100 features?
Show answer
TMA (Tensor Memory Accelerator) for async memory copy from HBM → SMEM. Warp-specialized producer/consumer pattern overlapping data move with compute. FP8 matmuls with two-stage scaling. Up to 75% of H100 peak (~740 TFLOPS bf16).
- Prove speculative decoding is exact.
Show answer
For drafted token t: accept with prob min(1, p_t/p_d). On reject: sample from residual (p_t − p_d)_+ / Z. Marginal of the emitted token: min(p_d, p_t) from the accepted path, plus (p_t − p_d)_+ from the rejected path (total rejection mass is Z), which sums to exactly p_t.
- Worked example: KV cache for Llama 3 70B at 128k context?
Show answer
GQA: 8 KV heads, d_head=128, 80 layers, bf16. Per token per layer: 2 · 8 · 128 · 2 = 4096 B. Per token: × 80 = ~320 KB. At 128k tokens: ~40 GB per request. MHA equivalent (64 KV heads): 8× = ~320 GB.
- Why disaggregate prefill and decode?
Show answer
Prefill is compute-bound; decode is memory-bandwidth-bound. Same GPU running both → decode latency spikes during prefill. Disaggregation: separate pools, transfer KV cache via NVLink/RDMA. Independent scaling. Splitwise / DistServe / Mooncake.
- Difference between PagedAttention block size 16 vs 128?
Show answer
Smaller (16): less internal fragmentation, more granular sharing across requests, more block-table overhead. Larger (128): less metadata, fewer GPU memory operations per attention, more wasted space at sequence ends. 16 is the vLLM default; experimentation needed for specific workloads.
- What is RadixAttention (SGLang)?
Show answer
Prefix tree (radix trie) over the KV cache. When a new request shares a prefix with cached requests, a tree walk finds the longest matching prefix (cost proportional to the prefix length). Enables aggressive prefix-cache reuse with low overhead. SGLang's contribution.
- Top-p vs top-k vs min-p?
Show answer
Top-k: keep top k probabilities. Top-p (nucleus): keep smallest set with cumulative prob ≥ p. Min-p: keep tokens with p ≥ p_threshold · max(p) — adaptive (wider in flat distributions).
- EAGLE vs Medusa for speculative decoding?
Show answer
EAGLE: small auto-regressive head conditioned on target's hidden states (richer context → high acceptance). Medusa: multiple parallel heads predicting +1, +2, +3 with tree verification. EAGLE-2/3 push further. Medusa simpler deployment; EAGLE higher acceptance.
- What is SmoothQuant?
Show answer
Migrate quantization difficulty from activations to weights via diagonal scaling: (X · diag(s)⁻¹) · (diag(s) · W). Now activations have smaller outliers, weights absorb the scale. Standard recipe for production W8A8.
- What's the difference between W4A16 and W8A8?
Show answer
W4A16: INT4 weights, bf16 activations. Smaller storage; matmul still in bf16. Most common production. W8A8: INT8 weights AND activations. Faster matmul (uses int8 tensor cores) but harder due to activation outliers. Need SmoothQuant.
- What is chunked prefill (Sarathi-Serve)?
Show answer
Instead of running long prefills as one big forward (freezing decode for all users), chunk the prefill and interleave with decode iterations. Removes TTFT spikes during sustained traffic. Now standard in vLLM and TensorRT-LLM.
- What is KIVI / H2O?
Show answer
KIVI: per-channel quant of K, per-token quant of V → INT2 KV cache with minimal quality loss. H2O (Heavy Hitters): identify the small set of tokens that dominate attention; evict the rest. Drastically smaller cache; ~5% quality drop.
- Continuous batching vs static batching — explain.
Show answer
Static: collect N requests, run all to completion. Short requests wait for long ones. Continuous: per-iteration scheduling. After each forward pass, finished requests leave; new requests join. Maximizes GPU utilization. vLLM, TGI, TensorRT-LLM all do this.