PILLAR · LLM SYSTEMS

LLM inference

vLLM internals, FlashAttention, PagedAttention, speculative decoding, quantization, disaggregated serving. The OpenAI / vLLM / Together / Anthropic loops probe deeply here — know throughput vs TTFT tradeoffs cold and be ready to design a serving stack on the whiteboard.

Read: ~40 min · Asked at: OpenAI, Anthropic, vLLM, Together, Baseten, NVIDIA · Difficulty: Sr/Staff bar
01
FOUNDATIONS · MENTAL MODEL

The decode bottleneck — why inference is memory-bandwidth-bound

TL;DR

Prefill is compute-bound (you process the whole prompt in one big matmul). Decode is memory-bandwidth-bound (every step reads all weights + the KV cache to emit one token). Batching helps decode by amortizing the weight read across requests. Almost every inference optimization follows from this one fact.

Two phases, two bottlenecks

An LLM serves a request in two distinct phases:

  • Prefill — process the entire prompt in one large forward pass; compute-bound.
  • Decode — emit one token per step, re-reading all weights plus the KV cache at every step; memory-bandwidth-bound.

Why decode is bandwidth-bound

Each decoded token requires reading all model weights (e.g., ~140 GB for a 70B model in BF16) plus the full KV cache for that request — every single step. Compute is one Q vector against the cache; the dominant cost is memory traffic. Batching helps because you read the weights once and reuse them across many requests' Q vectors.

KEY TERM — arithmetic intensity: FLOPs performed per byte of memory traffic. Decode has very low intensity (~1 FLOP/byte) → bandwidth-bound. Prefill has high intensity (hundreds of FLOPs/byte) → compute-bound. This single metric explains every inference optimization.
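This roofline logic can be sanity-checked in a few lines (a sketch: counts only weight traffic, ignores KV-cache reads, and assumes the standard ~2 FLOPs per parameter per token):

```python
def decode_arithmetic_intensity(params: float, batch: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one decode step.

    Each step does ~2*params FLOPs per sequence (one multiply-add per weight),
    but the whole batch shares a single read of the weights.
    """
    flops = 2 * params * batch             # matmul FLOPs across the batch
    bytes_read = params * bytes_per_param  # weights are read once per step
    return flops / bytes_read

P = 70e9  # a 70B dense model
print(decode_arithmetic_intensity(P, batch=1))   # 1.0 FLOP/byte -> bandwidth-bound
print(decode_arithmetic_intensity(P, batch=64))  # 64.0 FLOP/byte -> approaching compute-bound
```

For reference, an H100's compute-to-bandwidth ratio is roughly 989 TFLOPS BF16 / 3.35 TB/s ≈ 300 FLOPs/byte, so batch-1 decode at ~1 FLOP/byte leaves the tensor cores almost entirely idle — which is exactly the headroom batching exploits.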
REMEMBER
  • Prefill = compute-bound. Decode = memory-bandwidth-bound.
  • Batching amortizes the weight read; it's the #1 throughput lever.
  • Everything in this chapter follows from arithmetic intensity.
02
FOUNDATIONS · MEMORY

KV cache — size formulas, real-world numbers

TL;DR

Per-token-per-layer KV memory is 2 · n_kv_heads · d_head · n_bytes. For Llama 3 70B at 128k context with GQA: ~40 GB per request. With MHA it would be ~320 GB per request — totally unservable, which is exactly why GQA / MLA exist. KV cache is the dominant memory cost during long-context inference and the dominant bandwidth cost during decode.

The formula — get it right under interview pressure

During autoregressive decode, K and V for previous tokens don't change — cache them. At step t, only compute Q, K, V for the new token; append K,V to the cache; attention is Q (1×d) attending to cache (t×d) → O(t·d) per step.

Per token, per layer: 2 · n_kv_heads · d_head · n_bytes (the leading 2 is K + V).

Total: n_layers · seq_len · 2 · n_kv_heads · d_head · n_bytes.

EXAMPLE — Llama 3 70B at 128k context (GQA, must know cold)

The number you'll be asked to derive on the whiteboard

Llama 3 70B uses GQA with 8 KV heads, d_head = 128, 80 layers, BF16 (2 bytes).

  • per token per layer: 2 · 8 · 128 · 2 = 4,096 bytes
  • per token (all 80 layers): 4,096 · 80 = 327,680 bytes ≈ 320 KB
  • at 128k tokens: 320 KB · 131,072 ≈ ~40 GB per request
  • If built with MHA (64 KV heads): 8× larger → ~320 GB per request — totally unservable
  • MLA (DeepSeek-style: one shared ~512-dim latent + 64-dim decoupled RoPE key per token per layer): ~12 GB at the same context — >3× smaller than GQA, because the cache holds a single latent instead of separate K and V

This is exactly why GQA / MQA / MLA exist. The KV cache, not the weights, is what limits concurrent context capacity in long-context serving.
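The formula as a checkable function — the numbers below reproduce the worked example exactly (function and variable names are illustrative):

```python
def kv_cache_bytes(n_layers: int, seq_len: int, n_kv_heads: int,
                   d_head: int, n_bytes: int = 2) -> int:
    """Total KV-cache size for one request.

    The leading 2 is K + V; n_bytes = 2 for BF16.
    """
    return n_layers * seq_len * 2 * n_kv_heads * d_head * n_bytes

GiB = 1024 ** 3

# Llama 3 70B with GQA: 8 KV heads, d_head = 128, 80 layers, 128k context
gqa = kv_cache_bytes(n_layers=80, seq_len=131_072, n_kv_heads=8, d_head=128)
# Hypothetical MHA variant: 64 KV heads (one per query head)
mha = kv_cache_bytes(n_layers=80, seq_len=131_072, n_kv_heads=64, d_head=128)

print(gqa / GiB)  # 40.0 GiB per request
print(mha / GiB)  # 320.0 GiB per request — unservable
```

The MHA number is exactly 8× the GQA number (64 vs 8 KV heads), which is the whole pitch for grouped-query attention.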

EXAMPLE — GQA vs MQA vs MLA cheatsheet
  • MHA: 2 · n_heads · d_head per token per layer.
  • MQA: 2 · d_head (single shared KV head).
  • GQA(g): 2 · g · d_head (g KV heads, each shared by a group of n_heads/g query heads).
  • MLA: d_latent + d_rope (one compressed latent shared by K and V, plus a small decoupled RoPE key — e.g., 512 + 64 vs 16,384 for MHA at d_model = 8192). Quality matches MHA at a few percent of the cache size.

Quality ranking: MHA ≥ GQA ≥ MQA, but GQA matches MHA at much lower memory.

REMEMBER
  • Llama 3 70B GQA at 128k = ~40 GB per request. MHA equivalent = ~320 GB.
  • Memorize the formula: 2 · n_kv_heads · d_head · n_bytes per token per layer.
  • KV cache is the dominant memory cost at long context and the dominant bandwidth cost during decode.
03
SYSTEMS · MEMORY MGMT

PagedAttention & vLLM — the OS-VM analogy that fixed inference

TL;DR

Naive KV cache allocates one contiguous block per request, sized for max context. Result: massive internal fragmentation, low concurrency. PagedAttention (Kwon 2023) splits the KV cache into fixed-size blocks and gives each request a "block table" — exactly like OS virtual memory. Eliminates fragmentation, enables prefix sharing, supports beam search cheaply.

The OS-VM analogy

(Kwon 2023, arxiv 2309.06180.) Inspired by OS virtual memory:

  • KV block ↔ memory page (fixed-size, physically non-contiguous).
  • Block table ↔ page table (per-request logical → physical mapping).
  • Request ↔ process (sees its own contiguous logical token space).

Eliminates fragmentation, enables prefix sharing, supports complex sampling (beam search, parallel sampling) cheaply.

EXAMPLE — block size 16 vs 128
  • Smaller (16, vLLM default) — less internal fragmentation, more granular sharing across requests, more block-table overhead.
  • Larger (128) — less metadata, fewer GPU memory operations per attention, more wasted space at sequence ends.

16 is the vLLM default; experiment for your specific workload.
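A toy sketch of the block-table indirection (illustrative data structures, not vLLM's actual implementation): logical token positions map to fixed-size physical blocks drawn from a shared pool, and two requests can point at the same physical block for a shared prefix.

```python
BLOCK = 16  # tokens per block (the vLLM default)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block ids
        self.refcount = [0] * num_blocks     # >1 when a prefix is shared

    def alloc(self) -> int:
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def fork(self, block_table: list[int]) -> list[int]:
        """Copy-on-write share: a new request reuses the parent's physical blocks."""
        for b in block_table:
            self.refcount[b] += 1
        return list(block_table)

class Request:
    def __init__(self, allocator: BlockAllocator):
        self.allocator, self.block_table, self.len = allocator, [], 0

    def append_token(self):
        if self.len % BLOCK == 0:  # current block full -> grab one more block only
            self.block_table.append(self.allocator.alloc())
        self.len += 1

pool = BlockAllocator(num_blocks=1024)
req = Request(pool)
for _ in range(40):
    req.append_token()
print(len(req.block_table))  # 3 blocks for 40 tokens — no max-context preallocation

child = pool.fork(req.block_table)        # second request sharing the same prefix
print(pool.refcount[req.block_table[0]])  # 2 -> copy-on-write on divergence
```

Contrast with naive allocation, which would reserve max-context space (e.g., 128k tokens) up front and waste almost all of it.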

REMEMBER
  • PagedAttention = OS virtual memory for KV cache.
  • Block table indirection = prefix sharing + low fragmentation + beam search for free.
  • This is the single most impactful inference paper of 2023.
04
SYSTEMS · SCHEDULING

Continuous batching — the per-iteration scheduler

TL;DR

Static batching collects N requests, runs all to completion, ships the batch. Short requests sit and wait for long ones — terrible utilization. Continuous batching (Orca, Yu 2022) reschedules per iteration: finished requests leave, new ones join. vLLM, TGI, TensorRT-LLM all do this. It's the second-most-important scheduling idea after PagedAttention.

Static vs continuous

Static batching

  • Collect N requests, run all to completion.
  • Short requests wait for long ones.
  • Throughput limited by tail of batch.
  • Good for offline / batch jobs only.

Continuous batching (Orca)

  • Scheduler operates per-iteration.
  • After each forward pass, finished requests leave; new ones join.
  • Maximizes GPU utilization.
  • Standard in vLLM, TGI, TRT-LLM.
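The per-iteration loop can be sketched as follows (a toy scheduler driving a fake model; real schedulers also respect KV-memory limits and interleave prefill with decode):

```python
from collections import deque

def continuous_batching(queue: deque, max_batch: int, step) -> list:
    """Per-iteration scheduling: after every forward pass, finished requests
    leave the running batch and waiting requests immediately take their slots."""
    running, finished = [], []
    while running or queue:
        while queue and len(running) < max_batch:  # admit new work every iteration
            running.append(queue.popleft())
        step(running)                              # one fused forward pass for the batch
        finished += [r for r in running if r["done"]]
        running = [r for r in running if not r["done"]]
    return finished

def fake_step(batch):
    """Stand-in for a model forward: each request needs `remaining` decode steps."""
    for r in batch:
        r["remaining"] -= 1
        r["done"] = r["remaining"] == 0

reqs = deque({"id": i, "remaining": n, "done": False}
             for i, n in enumerate([2, 9, 3, 1]))
done = continuous_batching(reqs, max_batch=2, step=fake_step)
print([r["id"] for r in done])  # [0, 2, 3, 1] — short requests finish and free
                                # their slots without waiting for the 9-step request
```

Under static batching the same workload would run [0, 1] to completion (9 steps) before requests 2 and 3 even start.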
PITFALL — head-of-line blocking from prefill
Even with continuous batching, a long prefill (say, a 100k-token prompt) freezes the decode stream for everyone in the batch — one user's TTFT becomes everyone else's TTFT. Fix: chunked prefill (Sarathi-Serve, Ch. 11) or full prefill/decode disaggregation.
REMEMBER
  • Continuous = per-iteration scheduling; finished out, new in.
  • Static batching is fine for offline; never for serving.
  • Long prefills still cause head-of-line blocking — chunk them.
05
DECODING · ACCELERATION

Speculative decoding — exact, fast, mathematically clean

TL;DR

A small, fast draft model proposes K tokens. The big target model verifies all K in parallel with one forward pass. Accept tokens up to the first rejection; sample one fresh token from a residual distribution at the rejection point. The math is constructed so the marginal output distribution exactly equals p_target. ~2-3× speedup when draft acceptance is high and draft cost is low.

The algorithm

(Leviathan 2023, arxiv 2211.17192.) A small draft model generates K tokens; the target model verifies all K in parallel with one forward pass; accept tokens up to first rejection, sample one more from corrected distribution. Always exact (samples from target distribution).

THE INSIGHT — speculative decoding accept/reject math

Why the output distribution is exactly p_target

For each drafted token t:

  • Accept with probability min(1, p_target(t) / p_draft(t)).
  • If rejected, sample one fresh token from the residual distribution:
    p_residual(t') = max(0, p_target(t') − p_draft(t')) / Z
    (truncated to non-negative, normalized). Stop processing further drafted tokens.
  • If all K drafts accepted, sample one bonus token from p_target.

Why exact? Condition on accept vs reject. The marginal probability of emitting token t is:

p_emit(t) = p_draft(t) · min(1, p_target(t)/p_draft(t))   [accepted path]
          + P(reject) · p_residual(t)                     [rejected path]
          = min(p_draft(t), p_target(t)) + max(0, p_target(t) − p_draft(t))
          = p_target(t)

where P(reject) = 1 − Σ_t' min(p_draft(t'), p_target(t')) = Σ_t' max(0, p_target(t') − p_draft(t')) = Z, which exactly cancels the normalizer Z in p_residual.

Frontier-lab probe: "Prove speculative decoding samples from p_target." Be ready to write this on a whiteboard.

Speedup ~2–3× depending on draft acceptance rate. Best when draft is fast (~5% of target cost) and aligned (high acceptance).
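The accept/reject rule is short enough to verify empirically — a numpy sketch over a toy 3-token vocabulary (not a real model; distributions are made up):

```python
import numpy as np

def spec_sample_step(p_draft: np.ndarray, p_target: np.ndarray, rng) -> int:
    """One verification step: draft proposes a token, target accepts or resamples.
    By construction the emitted token is exactly distributed as p_target."""
    t = rng.choice(len(p_draft), p=p_draft)             # draft proposal
    if rng.random() < min(1.0, p_target[t] / p_draft[t]):
        return t                                        # accepted
    residual = np.maximum(p_target - p_draft, 0.0)      # (p_t - p_d)_+
    residual /= residual.sum()                          # normalize by Z
    return rng.choice(len(p_draft), p=residual)         # corrected resample

rng = np.random.default_rng(0)
p_d = np.array([0.5, 0.3, 0.2])   # draft distribution
p_t = np.array([0.2, 0.2, 0.6])   # target distribution
counts = np.zeros(3)
for _ in range(100_000):
    counts[spec_sample_step(p_d, p_t, rng)] += 1
print(counts / counts.sum())  # empirically ≈ p_target = [0.2, 0.2, 0.6]
```

Note the draft here is deliberately misaligned with the target, so many proposals get rejected — correctness survives, but in practice that acceptance rate would kill the speedup.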

EAGLE / EAGLE-2/3

(Li 2024, arxiv 2401.15077.) Uses target model's hidden states (not just tokens) as input to a small auto-regressive head that drafts. Higher acceptance because the draft sees richer context.

Medusa

(Cai 2024.) Multiple parallel decoding heads on the target model; each head predicts token at position +1, +2, +3, etc. Tree-based verification. No separate draft model. Simpler deployment.

Lookahead decoding

(Fu 2024.) Generates n-grams via Jacobi iteration on the target model itself; verifies through a unified attention pattern. Draft-free.

PITFALL — speculative decoding tradeoff
Wins only when (a) draft acceptance rate is high (similar distribution to target) AND (b) draft is fast (≤ ~5% of target cost). Loses if draft is too divergent (low acceptance, wasted compute) or too slow (overhead dominates). Tune both.
REMEMBER
  • Speculative decoding is exact — output distribution = p_target by construction.
  • Accept prob = min(1, p_t/p_d); on reject, sample from (p_t − p_d)_+ / Z.
  • EAGLE leads on acceptance rate; Medusa is simpler; Lookahead is draft-free.
06
SYSTEMS · LOW-PRECISION

Quantization for inference — INT4 / FP8 / SmoothQuant / GGUF

TL;DR

The 2026 production sweet spot: W4A16 (INT4 weights via GPTQ/AWQ, BF16 activations) for cost-sensitive serving; FP8 W8A8 on H100/Blackwell for max throughput. SmoothQuant migrates outliers from activations to weights so W8A8 actually works. NF4 is for QLoRA. GGUF is llama.cpp's format for CPU/edge.

INT8 / FP8 — the W8A8 family

Per-channel weight quantization; activations either quantized too (W8A8) or kept in FP16 (W8A16). LLM.int8() (Dettmers) handles outliers via mixed precision. FP8 is native in H100 — generally W8A8 with per-tensor or per-channel scales.

SmoothQuant (Xiao 2023, arxiv 2211.10438): activations have outlier channels that ruin quantization; migrate the difficulty from activations to weights via a diagonal scale s:

(X · diag(s)⁻¹) · (diag(s) · W)

Now activations have smaller outliers and weights absorb the scale. This is the standard recipe for production W8A8 to actually work.
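A numerical sketch of the migration identity (toy shapes and random data; the α-based scale rule follows the paper's recipe, but the statistics here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                   # an outlier activation channel that ruins INT8 ranges
W = rng.normal(size=(8, 16))

# Per-channel smoothing scale: shift difficulty from activations into weights.
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                  # X · diag(s)^-1
W_smooth = W * s[:, None]         # diag(s) · W

# Mathematically a no-op — the product is unchanged...
assert np.allclose(X @ W, X_smooth @ W_smooth)
# ...but the activation dynamic range shrinks, so INT8 quantization works.
print(np.abs(X).max(), "->", np.abs(X_smooth).max())
```

The whole trick is that weights quantize easily (roughly Gaussian) while activation outliers don't, so trading outlier magnitude from X into W is nearly free.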

W4A16 — most common production deployment

INT4 weights (GPTQ/AWQ), bf16 activations. Smaller than W8A8 in storage but equal compute (matmul still runs in bf16). Pure W4A4 is much harder due to activation outliers.

NF4 (QLoRA)

(Dettmers, arxiv 2305.14314.) "NormalFloat" — quantization levels chosen to match a normal distribution's quantiles. Good for weights (which are roughly Gaussian).

Microscaling (MX)

Block-wise scales — better outlier handling. MXFP8 / MXFP6 / MXFP4. Hardware support in Blackwell.

GGUF

llama.cpp's K-quants (Q4_K_M, Q5_K_S, etc.) — block-wise non-uniform schemes good for CPU/edge inference. Designed for the GGUF container format.

EXAMPLE — 2026 sweet spot

Weights INT4 (GPTQ/AWQ) or FP4 (Blackwell), activations FP8 or BF16. Tensor cores actually accelerate INT8 / FP8 matmul, so W8A8 has both storage and compute wins on H100/Blackwell. W4A16 wins on memory bandwidth (smaller weight read) but uses BF16 tensor cores.

REMEMBER
  • W4A16 = most common production. FP8 W8A8 = max throughput on H100+.
  • SmoothQuant is the trick that makes W8A8 actually work in production.
  • GPTQ uses a Hessian; AWQ protects salient channels.
07
SYSTEMS · KERNELS

FlashAttention — IO-aware attention, v1 → v2 → v3

TL;DR

Standard softmax-attention materializes an L×L matrix in HBM — IO-bound at long sequence length. FlashAttention tiles the computation and uses online softmax to keep partial state in SRAM. v1 (Dao 2022) showed ~3× speedup. v2 improved work partitioning. v3 (Shah, Bikshandi, Ye, Thakkar, Ramani, Dao 2024) exploits H100 features (TMA, warp specialization, FP8) to hit ~75% of H100 peak.

FA1 — the IO-aware idea

(Dao 2022, arxiv 2205.14135.) Computes attention in tiles. Standard softmax requires the full row to normalize. Online softmax: maintain running max and sum, rescale as new tiles arrive. Avoids materializing the L×L attention matrix in HBM. IO-aware; ~3× faster for typical seq lengths, much more for long.
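The online-softmax recurrence at the heart of the kernel, for a single query row (a pure-Python sketch of the bookkeeping only — the real kernel also tiles Q and runs the whole thing in SRAM):

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=4):
    """One query's attention output computed tile-by-tile: keep a running
    max m, running normalizer l, and unnormalized accumulator acc,
    rescaling previous state whenever a new tile raises the max."""
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, len(K), tile):
        s = K[i:i + tile] @ q          # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale old state to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + tile]
        m = m_new
    return acc / l                     # normalize once at the end

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 4))

# Reference: materialize the full score row (what FlashAttention avoids).
full = np.exp(K @ q - (K @ q).max())
full = full / full.sum() @ V
print(np.allclose(online_softmax_attention(q, K, V), full))  # True
```

Because only (m, l, acc) persist between tiles, the L×L score matrix never touches HBM — that is the entire IO saving.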

FA2 — better partitioning

(Dao 2023, arxiv 2307.08691.) Improved work partitioning, reduced non-matmul FLOPs, parallelizes across sequence dim. Better GPU occupancy.

FA3 — H100-specific

FA3 (Shah, Bikshandi, Ye, Thakkar, Ramani, Dao 2024, arxiv 2407.08608). Note: FA3 lead author is Jay Shah, not Tri Dao (who is on the paper). Frontier-lab interviewers actually probe the attribution.

EXAMPLE — what FA3 actually exploits on H100
  • TMA (Tensor Memory Accelerator) — async memory copy from HBM → SMEM, freeing the warp scheduler.
  • Warp specialization — producer/consumer warps overlap data movement and compute.
  • FP8 with two-stage scaling — preserves precision through the matmul.

Result: up to ~75% of H100 peak (~740 TFLOPS BF16).

REMEMBER
  • FA = tile + online softmax → never materialize the L×L matrix.
  • FA3 lead author = Jay Shah; Tri Dao is on the paper. Get the attribution right.
  • FA3 features: TMA, warp specialization, FP8 two-stage scaling.
08
SYSTEMS · PARALLELISM

Parallelism in inference — TP, PP, EP for serving

TL;DR

Tensor Parallel shards weight matrices across GPUs (within NVLink, two all-reduces per layer, TP ≤ 8). Pipeline Parallel shards layers (lower BW, pipeline bubbles). Expert Parallel places different MoE experts on different GPUs (all-to-all dispatch). Serving combines them — DeepSeek V3 ran EP across 64-256 GPUs with custom DualPipe overlap kernels.

Tensor Parallel (TP)

Shard each weight matrix across GPUs. For attention: shard heads. For FFN: shard hidden dim. Two all-reduces per layer (after attention output proj, after FFN W₂). Bandwidth-hungry → keep within NVLink domain (TP ≤ 8 typically).
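A numpy sketch of why the FFN half needs exactly one all-reduce when W₁ is column-sharded and W₂ is row-sharded (the `sum` over partials stands in for the collective; shapes are toy):

```python
import numpy as np

def tp_ffn(X, W1, W2, n_gpus=4):
    """Tensor-parallel FFN: W1 split by columns, W2 split by rows.
    Each 'GPU' computes locally; a single all-reduce (the sum) finishes the layer."""
    W1_shards = np.split(W1, n_gpus, axis=1)  # column-parallel first matmul
    W2_shards = np.split(W2, n_gpus, axis=0)  # row-parallel second matmul
    partials = [np.maximum(X @ a, 0.0) @ b    # nonlinearity applies elementwise,
                for a, b in zip(W1_shards, W2_shards)]  # so it stays local
    return sum(partials)                      # <- the all-reduce

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 16))
W1 = rng.normal(size=(16, 64))
W2 = rng.normal(size=(64, 16))

reference = np.maximum(X @ W1, 0.0) @ W2      # unsharded computation
print(np.allclose(tp_ffn(X, W1, W2), reference))  # True
```

The column-then-row ordering is the key design choice: it lets the activation function run on each shard independently, deferring all communication to one sum at the end (attention follows the same pattern with heads as the shard unit).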

Pipeline Parallel (PP)

Shard layers across GPUs. Lower bandwidth; introduces pipeline bubbles. Mostly used for very large models when TP within node isn't enough.

Expert Parallel (EP)

For MoE, place different experts on different GPUs. Token dispatch via all-to-all. Combines with TP and DP. DeepSeek V3 used EP across 64-256 GPUs with custom comm/compute overlap kernels (DualPipe).

PITFALL — TP across PCIe
TP requires all-reduce twice per layer. Across PCIe (no NVLink), you'll spend more time in collectives than compute. Cap TP at the NVLink domain — usually 8 GPUs in a single node — and use PP or replicas across nodes.
REMEMBER
  • TP within NVLink (≤ 8). PP across nodes if needed. EP for MoE.
  • Two all-reduces per layer in TP — bandwidth is the bottleneck.
  • DualPipe (DeepSeek) overlaps comm and compute across pipeline stages.
09
DECODING · SAMPLING

Sampling — greedy through nucleus and beyond

TL;DR

Greedy is deterministic and prone to repetition. Top-k clips to a fixed count; top-p (nucleus) clips to a cumulative probability mass; min-p adapts to distribution flatness. Beam search is great for translation, terrible for creative gen. Repetition / frequency / presence penalties are hacky but ubiquitous in production.

The full menu
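The samplers from the TL;DR as filters over a probability vector (toy implementations; production samplers operate on logits and compose with temperature scaling first):

```python
import numpy as np

def top_k(p: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most probable tokens; renormalize."""
    keep = np.argsort(p)[-k:]
    out = np.zeros_like(p)
    out[keep] = p[keep]
    return out / out.sum()

def top_p(p: np.ndarray, thresh: float) -> np.ndarray:
    """Nucleus: smallest set whose cumulative probability reaches thresh."""
    order = np.argsort(p)[::-1]
    cut = np.searchsorted(np.cumsum(p[order]), thresh) + 1  # include crossing token
    out = np.zeros_like(p)
    out[order[:cut]] = p[order[:cut]]
    return out / out.sum()

def min_p(p: np.ndarray, rel: float) -> np.ndarray:
    """Keep tokens with p >= rel * max(p) — widens in flat distributions."""
    out = np.where(p >= rel * p.max(), p, 0.0)
    return out / out.sum()

p = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k(p, 2))     # mass only on the top-2 tokens
print(top_p(p, 0.85))  # keeps tokens until 85% cumulative mass (here: 3 tokens)
print(min_p(p, 0.1))   # drops tokens below 10% of the mode (here: only the 0.03)
```

Note how min-p adapts: on a near-uniform distribution it keeps nearly everything, while on a peaked one it clips aggressively — something neither a fixed k nor a fixed p does.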

REMEMBER
  • Top-p with p ≈ 0.9-0.95 + temperature ≈ 0.7-1.0 = the safe production default.
  • Min-p is an under-rated upgrade: adapts to distribution flatness.
  • Beam search for translation; never for chat.
10
SYSTEMS · METRICS

The latency vocabulary — TTFT, ITL, TPS, throughput

TL;DR

TTFT = how long until the first token streams (prefill-bound, FLOP-heavy). ITL = per-token decode time (memory-bandwidth-bound). TPS per stream = 1/ITL. Aggregate throughput = sum across streams (higher batch = higher aggregate but worse per-stream ITL). SLA-driven scheduling balances these.

The four numbers everyone confuses

  • TTFT (Time to First Token) — latency from request to first token. Bound by prefill (computing the prompt's KV in one pass); FLOP-bound for long prompts.
  • ITL (Inter-Token Latency) — per-token decode time. Memory-bandwidth-bound (must read weights + KV cache every step).
  • TPS (Tokens Per Second) — per-stream throughput = 1/ITL. Bound by the same thing as ITL.
  • Aggregate throughput — tokens/sec across all concurrent streams. Higher batch → higher aggregate but worse per-stream ITL.

Tradeoff: low TTFT + low ITL = expensive (small batches, lots of GPUs idle). High aggregate throughput = large batches, slower per-user. SLA-driven scheduling balances these.
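A hedged back-of-envelope for the ITL floor (assumes purely bandwidth-bound decode on an H100-class GPU and ignores KV-cache traffic and kernel overhead, so real ITLs are higher):

```python
def itl_floor_ms(model_bytes: float, hbm_bw_bytes_per_s: float) -> float:
    """Lower bound on inter-token latency: one full weight read per decode step."""
    return model_bytes / hbm_bw_bytes_per_s * 1e3

weights = 70e9 * 2   # 70B params in BF16 = 140 GB
bw = 3.35e12         # H100 SXM HBM3 ≈ 3.35 TB/s

itl = itl_floor_ms(weights, bw)
print(itl)           # ≈ 42 ms per token, single stream, best case
print(1000 / itl)    # per-stream TPS = 1/ITL ≈ 24 tok/s
```

Hence tensor parallelism across 8 GPUs (8× aggregate bandwidth) is what makes single-stream 70B decode feel interactive — the floor drops toward ~5 ms/token before communication overhead.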

PITFALL — autoscaling lag
LLM serving has long startup time (load 70-405 GB weights). HPA-style autoscaling on QPS is too slow — by the time new pods come up, the SLA is blown. Provision for the p95, not the median, and use queue-depth-based scaling, not QPS.
REMEMBER
  • TTFT = prefill (compute). ITL = decode (bandwidth). Don't conflate.
  • Higher batch → better aggregate throughput, worse per-stream ITL.
  • Autoscale on queue depth, provision for p95.
11
SYSTEMS · DISAGGREGATION

Disaggregated serving — Splitwise, DistServe, Mooncake

TL;DR

Prefill is compute-bound, decode is memory-bandwidth-bound. Co-locating them causes interference (decode latency spikes when prefill runs). Disaggregate them onto separate GPU pools and transfer KV via NVLink/RDMA. Splitwise (Microsoft 2023), DistServe (Berkeley 2024), Mooncake (Kimi 2024) are the canon. Sarathi-Serve's chunked prefill is the alternative for single-pool deployments.

Why disaggregate

Prefill is compute-bound, decode is memory-bandwidth-bound. Running them on the same GPU causes interference — decode latency spikes during prefill (head-of-line blocking).

Disaggregation: separate prefill and decode pools. Prefill GPUs do bulk compute, then transfer KV cache to decode GPUs over fast interconnect (NVLink, RDMA). Independent scaling.

The canon

  • Splitwise (Microsoft 2023) — splits prefill and decode onto separate machine pools, sized and scaled independently; KV transferred over the interconnect.
  • DistServe (Berkeley 2024) — disaggregated placement optimized for goodput under per-phase SLOs (TTFT for prefill, ITL for decode).
  • Mooncake (Kimi 2024) — KV-cache-centric disaggregation with a distributed KV-cache tier pooled across the cluster.

DeepSeek's open-source serving stack uses this pattern.

Chunked prefill (Sarathi-Serve) — the single-pool alternative

Sarathi-Serve (Agrawal 2024, arxiv 2403.02310): instead of running long prefills as one big forward pass (which freezes decode for everyone), chunk the prefill into smaller pieces and interleave them with decode iterations. Removes TTFT spikes during sustained traffic. Now standard in vLLM and TensorRT-LLM.

Prefix caching

When many requests share a prompt prefix (system prompts, few-shot examples, RAG context), cache the KV for that prefix and reuse. Hash the prefix tokens, look up in a KV-cache pool. vLLM's PagedAttention enables sharing at block granularity.

Massive throughput wins for chatbots with shared system prompts and for agentic workloads with growing transcripts.
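A sketch of block-granular prefix caching (illustrative: each block's hash is chained with its ancestry so a cache hit implies the entire preceding context matches, which is conceptually how vLLM's automatic prefix caching keys blocks):

```python
import hashlib

BLOCK = 16  # tokens per KV block

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block chained with its prefix: a hit on block i
    guarantees blocks 0..i-1 matched too."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(str(tokens[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

cache: dict[str, int] = {}  # block hash -> physical KV block id

def lookup_and_fill(tokens: list[int]) -> int:
    """Return how many leading blocks can reuse cached KV (skipping their prefill)."""
    hits = 0
    for hsh in block_hashes(tokens):
        if hsh in cache:
            hits += 1                   # reuse cached KV block
        else:
            cache[hsh] = len(cache)     # pretend we prefilled and stored it
    return hits

system_prompt = list(range(48))                   # 3 full blocks of shared prefix
print(lookup_and_fill(system_prompt + [99]))      # 0 — cold cache
print(lookup_and_fill(system_prompt + [123]))     # 3 — shared system prompt reused
```

The second request skips prefill for the 48 shared tokens entirely; with a multi-thousand-token system prompt, that saving dominates TTFT.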

REMEMBER
  • Disaggregate prefill from decode → independent scaling, no interference.
  • Splitwise / DistServe / Mooncake are the canon. Mooncake adds a KV-cache tier.
  • Sarathi-Serve chunks prefill — alternative for single-pool deployments.
  • Prefix caching is free throughput when system prompts are shared.
12
SYSTEMS · KV COMPRESSION

Prefix & KV-cache compression — beyond MLA

TL;DR

Once you've adopted GQA or MLA architecturally, the next lever is on-the-fly KV compression. KIVI quantizes K per-channel and V per-token to INT2 with little quality loss. H2O / Heavy Hitters evict everything but the top-attention tokens. ChunkAttention dedups shared prefix chunks across requests via prefix tree. CLA shares KV across consecutive layers.

The KV-compression toolkit

  • KIVI — quantize K per-channel and V per-token down to INT2 with little quality loss.
  • H2O (Heavy Hitters) — keep only the small set of tokens that dominate attention mass; evict the rest (~5% quality cost).
  • ChunkAttention — dedup shared prefix chunks across requests via a prefix tree.
  • CLA / MLKV — share KV across consecutive layers.
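An H2O-style eviction step can be sketched like this (a simplification: real H2O accumulates attention scores online, per head, and combines heavy hitters with a recency window as here):

```python
import numpy as np

def h2o_keep(attn_scores: np.ndarray, budget: int, recent: int) -> np.ndarray:
    """Indices to keep: the `budget` heavy-hitter tokens (highest accumulated
    attention mass) plus the `recent` most recent tokens. Evict everything else."""
    T = len(attn_scores)
    recent_idx = np.arange(max(0, T - recent), T)           # always keep the tail
    heavy = np.argsort(attn_scores[: T - recent])[::-1][:budget]
    return np.sort(np.concatenate([heavy, recent_idx]))

# Accumulated attention mass per cached token (toy numbers): a few tokens
# (e.g., BOS, key entities) soak up most of the attention.
scores = np.array([9.0, 0.1, 0.2, 7.5, 0.3, 0.1, 4.0, 0.2])
print(h2o_keep(scores, budget=2, recent=2))  # [0 3 6 7] — half the cache evicted
```

The empirical observation behind H2O is that this attention mass is extremely concentrated, so a small budget retains most of the signal.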

EXAMPLE — RadixAttention (SGLang)

SGLang's prefix-cache structure: a radix trie over KV cache. When a new request shares prefix with cached requests, the tree walk gives O(log n) lookup of the longest matching prefix. Enables aggressive prefix-cache reuse with low overhead.
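A minimal prefix-tree sketch in the spirit of RadixAttention (a plain per-token trie rather than a compressed radix tree, for brevity; `kv_block` is a stand-in for a real KV handle):

```python
class TrieNode:
    __slots__ = ("children", "kv_block")

    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}
        self.kv_block = None  # handle to cached KV for the prefix ending here

def insert(root: TrieNode, tokens: list[int]) -> None:
    """Record a served request's tokens so later requests can reuse its KV."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())
        node.kv_block = object()  # stand-in for a real KV block reference

def longest_cached_prefix(root: TrieNode, tokens: list[int]) -> int:
    """Walk the tree; walk depth = number of tokens whose KV is reusable."""
    node, depth = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, depth = node.children[t], depth + 1
    return depth

root = TrieNode()
insert(root, [1, 2, 3, 4, 5])                        # a cached request's prompt
print(longest_cached_prefix(root, [1, 2, 3, 9, 9]))  # 3 tokens of KV reusable
```

The tree walk makes the longest-match lookup cheap and, unlike a flat hash of the whole prompt, it reuses partial prefixes too.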

REMEMBER
  • KIVI: INT2 KV with per-channel K, per-token V — almost free quality.
  • H2O: keep heavy hitters, evict the rest. ~5% quality cost.
  • ChunkAttention + CLA / MLKV cut cache further. Stack with GQA / MLA.
  • SGLang's RadixAttention is the prefix-cache reference impl.
13
ECOSYSTEM · ENGINES

Inference engines compared — vLLM, SGLang, TRT-LLM, TGI

TL;DR

vLLM is the open-source throughput champion (PagedAttention, continuous batching). SGLang is catching up fast with RadixAttention and great structured-output performance. TensorRT-LLM is the NVIDIA-only speed king but model-conversion friction is real. TGI is the easy-deploy option. llama.cpp / MLC-LLM / ExecuTorch own the edge.

The matrix

  • vLLM — PagedAttention, continuous batching, broad model support, open. Throughput champion in 2024; SGLang catching up.
  • SGLang — RadixAttention (prefix-tree caching), fast structured generation, very fast tool-call decoding. Newer; growing rapidly in 2025.
  • TensorRT-LLM — best-in-class on NVIDIA hardware (kernel fusion, paged KV, in-flight batching). NVIDIA-only; model-conversion friction.
  • TGI (Hugging Face) — easy deployment, broad model coverage. Throughput trails vLLM/SGLang.
  • llama.cpp — CPU + Metal + CUDA, GGUF format, edge-friendly. Throughput-limited at scale.
  • MLC-LLM, ExecuTorch — on-device (mobile, browser via WebGPU). Edge / consumer products.
REMEMBER
  • vLLM = open default. SGLang = catching up, best for structured output.
  • TRT-LLM = peak Nvidia perf, friction tax to convert.
  • llama.cpp / MLC-LLM / ExecuTorch = edge.

0 → hero reading path for LLM inference

  1. foundation vLLM blog — start with PagedAttention post, then continuous batching, then prefix caching
  2. foundation TGI docs
  3. foundation Sebastian Raschka — Coding the KV Cache from scratch
  4. build Walk through vLLM source — start with vllm/model_executor/layers/attention/
  5. build Implement speculative decoding in numpy — drill until you can do it in 30 min
  6. depth PagedAttention / vLLM paper (Kwon 2023)
  7. depth FlashAttention v1 (Dao 2022)
  8. depth FlashAttention v2
  9. depth FlashAttention v3 (Shah et al.)
  10. depth Speculative Decoding (Leviathan 2023)
  11. depth EAGLE
  12. depth Splitwise — disaggregated serving
  13. depth Mooncake (Kimi)
  14. depth Sarathi-Serve — chunked prefill
  15. depth GPTQ + AWQ + SmoothQuant for quantization
  16. depth LMSYS blog (SGLang authors) — RadixAttention etc.

LLM inference quiz — readiness check

  1. Walk through one decode step inside vLLM.
    Show answer

    Scheduler picks ready requests up to memory limit. For each: compute Q, K, V from previous token's hidden state; write K, V to next free block in the request's block table. PagedAttention kernel: attention over all blocks for that request. FFN. Sample next token. Update block table. New requests joining first do prefill (one big forward over their prompt).

  2. Why is decode memory-bandwidth-bound?
    Show answer

    Each decoded token requires reading all weights (~140 GB for 70B fp16) plus the full KV cache. Compute is small (one new Q vector). Batching helps because you reuse one weight read across many requests in the batch.

  3. How would you reduce TTFT for 100k-token prompts?
    Show answer

    Chunked prefill (Sarathi-Serve), prefix caching, sequence/context parallel for prefill, more GPUs allocated to that request. The bottleneck is FLOPs.

  4. Speculative decoding tradeoff?
    Show answer

    Wins when draft acceptance rate is high (similar distribution to target) and draft is fast (≤ ~5% target cost). Loses if draft too divergent (low acceptance) or too slow (overhead dominates).

  5. Compare GQA vs MQA vs MLA on KV cache size.
    Show answer

    MHA: 2·n_heads·d_head per token per layer. MQA: 2·d_head. GQA(g): 2·g·d_head. MLA: a single shared latent + small RoPE key (e.g., 512 + 64 vs 16,384 for MHA at d_model = 8192). Quality: MHA ≥ GQA ≥ MQA; MLA matches MHA at a few percent of the cache size.

  6. What's the bottleneck for serving 10k QPS chat?
    Show answer

    Memory bandwidth (decode), KV cache size (concurrent context), GPU count (cost). Mitigations: GQA/MLA, FP8 weights, prefix caching, continuous batching, speculative decoding, model routing (cheap → reasoning).

  7. Design a serving stack for a reasoning model with 8k hidden CoT tokens per request.
    Show answer

    Heavy decode load (8k tokens × users); huge KV cache. Disaggregated prefill (cheap) and decode (expensive); aggressive prefix caching across reasoning segments; queue with priority for premium tier; possibly spec-decoding with a weaker model for early CoT phase.

  8. How does FA3 use H100 features?
    Show answer

    TMA (Tensor Memory Accelerator) for async memory copy from HBM → SMEM. Warp-specialized producer/consumer pattern overlapping data move with compute. FP8 matmuls with two-stage scaling. Up to 75% of H100 peak (~740 TFLOPS bf16).

  9. Prove speculative decoding is exact.
    Show answer

    For drafted token t: accept with prob min(1, p_t/p_d). On reject: sample from residual (p_t − p_d)_+ / Z. Marginal of the emitted token = p_d · min(1, p_t/p_d) + P(reject) · (p_t − p_d)_+ / Z = min(p_d, p_t) + (p_t − p_d)_+ = p_t, since P(reject) = Σ_t' (p_t' − p_d')_+ = Z. So each emitted token is exactly distributed as p_target.

  10. Worked example: KV cache for Llama 3 70B at 128k context?
    Show answer

    GQA: 8 KV heads, d_head=128, 80 layers, bf16. Per token per layer: 2 · 8 · 128 · 2 = 4096 B. Per token: × 80 = ~320 KB. At 128k tokens: ~40 GB per request. MHA equivalent (64 KV heads): 8× = ~320 GB.

  11. Why disaggregate prefill and decode?
    Show answer

    Prefill is compute-bound; decode is memory-bandwidth-bound. Same GPU running both → decode latency spikes during prefill. Disaggregation: separate pools, transfer KV cache via NVLink/RDMA. Independent scaling. Splitwise / DistServe / Mooncake.

  12. Difference between PagedAttention block size 16 vs 128?
    Show answer

    Smaller (16): less internal fragmentation, more granular sharing across requests, more block-table overhead. Larger (128): less metadata, fewer GPU memory operations per attention, more wasted space at sequence ends. 16 is the vLLM default; experimentation needed for specific workloads.

  13. What is RadixAttention (SGLang)?
    Show answer

    Prefix tree (radix trie) over KV cache. When a new request shares prefix with cached requests, the tree walk gives O(log n) lookup of the longest matching prefix. Enables aggressive prefix-cache reuse with low overhead. SGLang's contribution.

  14. Top-p vs top-k vs min-p?
    Show answer

    Top-k: keep top k probabilities. Top-p (nucleus): keep smallest set with cumulative prob ≥ p. Min-p: keep tokens with p ≥ p_threshold · max(p) — adaptive (wider in flat distributions).

  15. EAGLE vs Medusa for speculative decoding?
    Show answer

    EAGLE: small auto-regressive head conditioned on target's hidden states (richer context → high acceptance). Medusa: multiple parallel heads predicting +1, +2, +3 with tree verification. EAGLE-2/3 push further. Medusa simpler deployment; EAGLE higher acceptance.

  16. What is SmoothQuant?
    Show answer

    Migrate quantization difficulty from activations to weights via diagonal scaling: (X · diag(s)−1) · (diag(s) · W). Now activations have smaller outliers, weights absorb the scale. Standard recipe for production W8A8.

  17. What's the difference between W4A16 and W8A8?
    Show answer

    W4A16: INT4 weights, bf16 activations. Smaller storage; matmul still in bf16. Most common production. W8A8: INT8 weights AND activations. Faster matmul (uses int8 tensor cores) but harder due to activation outliers. Need SmoothQuant.

  18. What is chunked prefill (Sarathi-Serve)?
    Show answer

    Instead of running long prefills as one big forward (freezing decode for all users), chunk the prefill and interleave with decode iterations. Removes TTFT spikes during sustained traffic. Now standard in vLLM and TensorRT-LLM.

  19. What is KIVI / H2O?
    Show answer

    KIVI: per-channel quant of K, per-token quant of V → INT2 KV cache with minimal quality loss. H2O (Heavy Hitters): identify the small set of tokens that dominate attention; evict the rest. Drastically smaller cache; ~5% quality drop.

  20. Continuous batching vs static batching — explain.
    Show answer

    Static: collect N requests, run all to completion. Short requests wait for long ones. Continuous: per-iteration scheduling. After each forward pass, finished requests leave; new requests join. Maximizes GPU utilization. vLLM, TGI, TensorRT-LLM all do this.