ML SYSTEMS · CHALLENGES · ATTEMPT BEFORE YOU PEEK

ML systems challenges — design it, then debug it

The practice arena for the ML systems course. Four real design challenges with hard constraints, four production debugging war rooms with actual telemetry to reason over, eight capacity-estimation drills, and six on-call triage simulations. Every challenge follows the same protocol: timebox 25 minutes, write your answer down, then open the hints and the model solution, then grade yourself against the junior/senior/staff rubric.

4 design challenges 4 debug war rooms 8 capacity drills 6 on-call sims

What's in here

How to train on challenges + the two meta-frameworks
Design: feed ranking for 10M DAU under 100 ms
Design: LLM chat serving for 50k concurrent users
Design: feature store with point-in-time correctness
Design: enterprise RAG with hard permissioning
Debug: p99 latency tripled after yesterday's deploy
Debug: training loss exploded at step 41,200
Debug: offline AUC up, online CTR down
Debug: GPU serving throughput collapsed at long context
Capacity-estimation drills
On-call triage simulations (rapid scenarios)
Grading rubrics + how to practice

PART I · ORIENTATION

How to train on challenges + the two meta-frameworks

🎯You learn nothing from a solution you read before you struggled — attempt first, peek second, always.

This chapter explains how to use this practice arena so that the repetitions actually transfer to an interview room. It then gives you two reusable mental frameworks — one for design questions, one for debug questions — that you can apply to every subsequent challenge regardless of domain. Internalise these two frameworks and you will never stare blankly at a whiteboard again.

Why passive reading fails

Reading a model solution feels like learning. It is not. Cognitive science calls this the fluency illusion: when material is coherent and familiar-looking, the brain rates its own comprehension as high even when the underlying retrieval circuits are untrained. You feel prepared; you are not.

What actually transfers to an interview is retrieval practice under production conditions — sitting with a blank page, a timer, and a real problem, and forcing yourself to produce an answer before you know whether it is right. The struggle is the training signal, not the solution.

⚠ The trap

Skimming the scenario, immediately opening the model solution "just to see the structure", then convincing yourself you understood it. This produces a false sense of readiness. On interview day the blank page appears and nothing loads.

The attempt-first protocol

Every challenge in this book must be done in this order:

Read the scenario only. Close every hint and solution block.
Set a 25-minute timer. Write your answer — on paper, in a doc, aloud to a recording. Do not stop before the timer fires.
Open Hint 1 only. Compare it to what you wrote. Update your answer, then set another 5 minutes.
Open the model solution. Score yourself against the rubric. Write one sentence: what did you miss and why.
Redo the same challenge cold in 48 hours. If you score higher, it is in long-term memory; if not, repeat.

📐 The attempt-first rule

Trigger: any challenge page opens.

Timer on. Blank document open. Write.
Hints only after you have a written skeleton answer.
Full solution only after you have a written complete answer.

Never: open the solution to "orient yourself" before attempting. Once you see the answer, the practice rep is ruined.

Meta-framework 1 — Design questions

Every ML system design question, regardless of domain, has the same skeleton. Interviewers are checking whether you can structure chaos into a sequence of deliberate decisions. Use this six-step framework every time.

The 8-box canonical ML system diagram: requirements → scale → data pipeline → feature store → training → serving → monitoring → feedback loop

Step 1: Requirements

Split into functional (what the system does) and non-functional (latency, throughput, freshness, consistency, cost). Write numbers for every NFR. "Fast" is not a requirement. "p99 < 100 ms at 10k QPS" is.

Step 2: Scale numbers

Derive QPS, storage, bandwidth, and GPU/CPU from the given user count. Do envelope arithmetic live on the whiteboard — this is what separates senior from staff candidates.

Step 3: Skeleton (the 8-box diagram)

Draw: data sources → ingestion → feature store → training pipeline → model registry → serving stack → monitoring → feedback loop. Label arrows with data formats and latency budgets. Every ML system fits this skeleton; the challenge is in the details of each box.

Step 4: Deep-dive 2 components

The interviewer will ask you to zoom into 1–2 boxes. Pick the two that are most constrained by your NFRs and proactively offer to go deep. Never wait to be dragged there.

Step 5: Failure modes

Name at least three ways the system can fail in production: data drift, feature staleness, serving latency tail, cold-start, training-serving skew. For each, describe the detection signal and the mitigation.

Step 6: Evolution

How does the design change if DAU 10×? If the team doubles? If cost must halve? Staff-level answers always address this. It shows you are designing for growth, not just the day-one snapshot.

📐 If you get a design question — the rule

Trigger: "Design a system that …"

Ask two clarifying questions: scale (DAU / QPS) and the hardest constraint (latency vs freshness vs cost). Then stop asking and start designing.
State requirements aloud with numbers before drawing anything.
Sketch the full 8-box skeleton first — resist the urge to zoom in early.
Pick the two most-constrained components, go deep, do arithmetic.
Close with failure modes and one "at 10× scale" observation.

Never: jump straight to model architecture without establishing scale. Interviewers at senior/staff level treat a missing requirements section as a red flag for someone who will build the wrong thing.

Meta-framework 2 — Debug questions

A debug challenge always begins with a metric moving in the wrong direction. Without a framework, candidates thrash: they guess random causes, suggest fixes before diagnosing, and talk in circles. The layered-funnel approach stops that.

The core insight: every production ML system is a pipeline of transformations, and a metric degrades at exactly one layer. Your job is to localise which layer as quickly as possible using the cheapest possible probes. Fix before diagnosis is guessing; diagnosis without cheapest-probe ordering is slow.

Debug funnel: product metric → model scores → feature distributions → data pipeline → infrastructure — each layer is a hypothesis class

Observe — what changed and when?

Exact timestamp of metric movement. What deployed in the preceding 30 minutes? What changed in upstream data? A precise "when" cuts the hypothesis space in half immediately.

Localise — layer by layer

Work top-down: (1) product metric (CTR, revenue, retention) → (2) model output scores (calibration, score distribution shift) → (3) input features (distributions, null rates, cardinality shifts) → (4) data pipeline (lag, drop, schema change) → (5) infrastructure (latency, OOM, version skew). Stop at the first layer that shows an anomaly.

Hypothesise — rank by prior probability

The most common causes of production regressions in rough order: (a) bad data / upstream schema change, (b) deploy introducing training-serving skew, (c) cold caches or infra issue, (d) gradual drift. Rank your hypotheses before starting any probe.

Test — cheapest probe first

Each probe has a cost (time, user impact, engineering effort). Always run the cheapest one first. Classic ordering: dashboard inspection (free) → log sampling (minutes) → canary rollback (reversible, fast) → A/B replay (hours) → retraining (days).

Fix + prevent

Fix addresses the immediate symptom. Prevention addresses the root cause so it cannot recur: schema validation gates, canary metric thresholds, automated rollback triggers, data-quality monitors.

📐 If you get a debug question — the rule

Trigger: "Metric X dropped / spiked after Y." or "Something is wrong with your model in production."

Ask: what changed in the last 30 minutes (deploy, data, config)? What does the p50 look like versus p99? (Tail-only = infra or cache; all-percentiles = model or data.)
State your layer funnel aloud: "I will localise top-down starting at the product metric."
Rank 3 hypotheses by prior probability, then name the cheapest probe for each.
Describe the fix and one prevention mechanism.

Never: jump to "retrain the model" as your first action — retraining takes hours/days and will not fix infra bugs, cache cold-starts, or schema changes. Always exhaust cheap reversible probes first.

⚠ Clears up: "debug" vs "diagnose"

Debugging is the full loop (observe → localise → hypothesise → test → fix → prevent). Diagnosing is just one step: finding the root cause layer. In an interview you must do the full loop — candidates who stop at "the feature pipeline is probably broken" miss the test-cheapest-first and prevention steps that distinguish senior from junior.

The self-grading rubric explained

Each challenge closes with a rubric table. Here is how to read it and how to use it honestly.

Level	What it looks like	What is missing to reach the next level
Junior	Names the right components (two-tower retrieval, ranker, feature store). Gets the general shape right. May have no numbers.	Tradeoff reasoning: "why this component, not that one?" Numbers for every scale claim.
Senior	Quantified tradeoffs. Latency budgets that sum correctly. Knows the failure modes of each choice. Can defend any decision challenged by interviewer.	Failure modes at scale, org implications, cost conversation, evolution path.
Staff	Proactively raises failure modes. Estimates cost and discusses the build/buy/oss tradeoff. Addresses 10× scale without being asked. Simplifies by questioning whether some component is even needed.	Nothing — this is the target.

Score yourself per dimension, not overall: you might be senior on architecture but junior on failure modes. Targeted practice closes specific gaps faster than generic review.

◆ Interview probe

Interviewers often intentionally ask a question that is slightly underspecified — e.g., "Design a feed ranker" with no scale given. The junior candidate picks a scale and starts designing. The senior candidate asks two clarifying questions and writes the numbers down explicitly. The staff candidate also says "let me state my assumptions, tell me if any are wrong" — treating the ambiguity itself as a signal about the problem.

✓ Remember

Attempt-first always: 25 min on a blank page before any hint. The struggle is the training signal.
Design framework: requirements (with numbers) → scale envelope → 8-box skeleton → deep-dive 2 components → failure modes → evolution.
Debug framework: observe (what changed, when?) → localise layer by layer → hypothesise ranked by prior → test cheapest-first → fix + prevent.
Rubric: junior names components; senior adds numbers and tradeoffs; staff adds failure modes, cost, evolution, and sometimes simplifies by cutting scope.

TL;DR

Read the scenario, set a 25-minute timer, and write before you peek. Every design answer flows through six steps from requirements to evolution. Every debug answer flows through five steps from observation to prevention. Grade yourself per dimension against the junior/senior/staff rubric and close specific gaps.

Tricky interview questions — chapter 01

Q1. Why is a 25-minute timebox specifically recommended, rather than working until you feel done?

Interviews are time-bounded and interviewers grade on what you produce under pressure, not unlimited time. A timebox forces prioritisation — the same cognitive demand as the real setting. Working until you "feel done" allows endless perfectionism and never trains the muscle of structured rapid output. Twenty-five minutes roughly mirrors the design phase of a 45-minute interview slot.

Q2. In the debug framework, why localise layer-by-layer top-down rather than checking the most likely root cause first?

Top-down localisation is systematic and complete; jumping to the most likely root cause is heuristic and misses cases. In practice, the "most likely" cause is wrong often enough that a systematic sweep is faster in expectation. More importantly, top-down gives you a verifiable story to tell the interviewer — "I eliminated the product layer, the scoring layer showed the anomaly, so the issue is below the model" — which demonstrates structured thinking even if the root cause takes time to find.

Q3. A candidate draws the 8-box architecture diagram immediately when asked a design question. What is wrong with this?

Drawing the diagram before stating requirements and scale numbers means the diagram is built on unstated assumptions. The interviewer cannot tell if the candidate actually understands the constraints or is pattern-matching to a memorised template. Requirements and scale must come first — they determine which boxes exist, which arrows are the critical path, and what latency budget each component gets. Skipping this step is a senior-level red flag.

Q4. What distinguishes a staff-level failure modes answer from a senior-level one?

A senior answer names the failure (e.g., "feature staleness can cause stale recommendations") and describes the detection signal and mitigation. A staff answer does this and additionally: quantifies the blast radius ("staleness >15 minutes on the recency signal drops CTR ~3% based on our ablations"), discusses the org implications ("this requires an SLA agreement with the upstream data team"), and may simplify by questioning whether the component creating the failure mode is even necessary at current scale.

Q5. The interviewer challenges your choice of two-tower retrieval: "Why not just use BM25?" How do you respond?

Acknowledge BM25 is a valid baseline and describe where it excels — exact keyword match, no training data needed, interpretable. Then explain why it falls short for the given scenario: feed ranking depends on semantic similarity and behavioural signals (what users engaged with, not what they typed), which BM25 cannot capture. Two-tower embeds both query and item in a learned space that reflects engagement patterns. Concede that a hybrid (BM25 + dense) often outperforms either alone. The interviewer is testing whether you understand the tradeoff, not whether you picked the "right" answer.

Q6. You're debugging a latency regression and your first probe (checking the dashboard) shows no anomaly. What do you do next?

Move one layer deeper in the funnel. If the product-metric dashboard is clean but a latency regression was reported, check model scoring latency directly (are scores themselves taking longer?). If that is clean, check the feature fetch layer (are upstream feature services slower?). Each dashboard check is essentially free, so exhaust all dashboard-accessible signals before escalating to more expensive probes like log sampling or canary rollbacks.

Q7. Why does the debug framework say "fix before diagnosis is guessing"?

Without localising the root cause layer, you do not know which fix will work. Retraining the model cannot fix a cold cache. Rolling back a deploy cannot fix a data-pipeline schema change that started two days ago. A wrong fix wastes hours, may introduce a new regression, and leaves the original problem alive. Diagnosis — even a rough "the problem is in the feature layer, not the model layer" — makes every subsequent action targeted and reversible.

Q8. How would you adapt the 8-box design framework for a non-neural system, like a rules-based fraud detector?

The skeleton still applies but some boxes change character: the "training pipeline" becomes a "rule-authoring and validation pipeline"; the "model registry" becomes a "rule version store"; the "feedback loop" becomes an "analyst review queue" where rule precision/recall is tracked and rules are updated. The requirements, scale, failure modes, and evolution steps are identical. The framework is model-agnostic; it is really a checklist of every component a production decision system needs, regardless of whether the decision logic is learned or hand-crafted.

Q9. An interviewer says: "Tell me about a system you would not use the 8-box framework for." How do you answer?

A good answer names a genuine exception and explains why. Example: a pure batch scoring job with no serving component — the feedback loop and serving stack boxes are absent or trivial. Or an A/B test infrastructure tool — it has no ML model, so the training pipeline and feature store are absent. The point is that the framework is a checklist, not a straitjacket. Dropping a box is fine; forgetting a box because you did not think about it is not. Demonstrating that you know when to simplify is itself a staff-level signal.

Q10. Why is "retrain the model" almost never the correct first action in a production debug situation?

Retraining takes hours to days, requires fresh labelled data, and does not fix non-model problems. Most production regressions are caused by data pipeline issues, infra problems, or deploy-induced skew — none of which retraining addresses. Retraining is the most expensive and least reversible action in the debug toolkit. It belongs at the end of the list, after cheaper reversible probes (dashboards, log sampling, rollback, config changes) have been exhausted and a model-quality root cause has been confirmed.

PART II · DESIGN CHALLENGES

Design: feed ranking for 10M DAU under 100 ms

🎯A feed ranker is really three models glued together by a latency budget — if the budget math does not add up on the whiteboard, it will not add up in production.

This challenge asks you to design a production feed-ranking system from scratch under realistic constraints: 10M daily active users, a 2,000-item candidate pool per request, p99 end-to-end latency of 100 ms, a 40-person engineering organisation, a modest GPU budget, and 20% cold-start users. It is the archetypal ML systems design question — virtually every senior/staff ML interview includes a variant of it.

The scenario

Product

Social app with a home feed of posts from followed accounts and subscribed topics.

Scale

10M DAU. Peak QPS roughly 10M × 8 feed loads/day ÷ 86,400 s × 3× peak factor ≈ 2,800 QPS at peak.

Candidates

After follow-graph and topic expansion, each request has a pool of 2,000 candidate posts to rank.

Latency budget

p99 end-to-end (client sees first item) ≤ 100 ms. This is a hard product SLA.

Org

40-person engineering org. No dedicated ML infra team — you share infra with 2 other product areas.

GPU budget

Modest: assume you can justify up to ~30 A100-equivalent GPUs for ranking before finance pushes back.

Cold-start

20% of DAU have fewer than 10 engagement events — insufficient signal for personalised ranking.

Existing infra

Kafka for event streaming, an existing Redis cluster, a Spark cluster for batch jobs, Postgres for user/post metadata.

Your task

Produce a complete feed-ranking system design that includes:

A latency budget table — each stage, its candidate count in and out, and its ms budget — that sums to ≤ 100 ms.
The full candidate funnel: retrieval → pre-rank filter → light ranker → heavy ranker → business-logic re-ranking.
Model choices per stage with justification (architecture, feature types, serving format).
A feature freshness plan: which features are batch (daily/hourly), which are near-real-time, which are request-time, and where each is computed and stored.
A cold-start strategy for the 20% of users with sparse signal.
A fallback story: what the system does if any stage breaches its budget or crashes.
An experiment plan for a safe launch: how you A/B test this without exposing all 10M users to a broken ranker.

Hint 1 — Latency decomposition

100 ms sounds generous until you account for network round trips. A user in the median location is ~20 ms from your nearest data centre. That leaves 80 ms for your backend. Your backend is not one service — it is a chain: candidate fetch → retrieval model → feature lookup → light ranker → heavy ranker → response serialisation. Each hop adds latency. The key constraint is that heavy neural ranking on 2,000 items is impossible in 80 ms on a GPU (the GPU call alone would take 200+ ms at that scale). You must reduce the candidate set aggressively before the expensive stage.

Think about this: if your heavy ranker can score 50 items in ~15 ms on GPU, how many items must the light ranker have already eliminated before it? Work backwards from that number.

Hint 2 — The two-stage model split

The industry-standard solution for this constraint is a two-tower retrieval model followed by a point-wise or list-wise ranker. The retrieval model does approximate nearest-neighbour (ANN) search in an embedding space to get from 2,000 (or more) candidates down to ~200. The ranker then uses richer features (cross-features between user and item) that are too expensive to compute for 2,000 items but fine for 200.

A further split often helps: a light ranker (logistic regression or a small gradient-boosted tree on precomputed features) reduces 200 → 50, and a heavy ranker (small neural net with cross-features) produces the final 50 to display. Think about what features each stage can afford given its latency budget, and where those features come from.

Hint 3 — Cold-start and position bias

For cold-start users (fewer than 10 engagement events), personalised two-tower embeddings are meaningless — the user embedding is just noise. Two standard approaches: (a) popularity-based fallback — rank by global or demographic-cohort popularity for the first N sessions; (b) content-based bootstrap — if the user indicated interests at signup, use item embeddings directly without a user tower. The transition from cold to warm is gradual: after 10–30 interactions, blend cold and warm signals.

For position bias: items shown at the top of the feed are clicked more regardless of quality. If your training labels are clicks without position correction, your model learns "items shown first are good", which is circular. The standard fix is inverse propensity scoring (IPS) or a position feature during training that is zeroed out at inference. Make sure your solution mentions this — it is a common interview probe.

✅ Model solution — full worked answer

Step 1: Latency budget table

Total end-to-end budget: 100 ms. Allocate as follows (all times are p99 targets):

Stage	Candidates in	Candidates out	Budget (ms)	Notes
Network (client → edge)	—	—	15	CDN edge, median user distance
Candidate generation (follow-graph + topics)	all posts <48h	2,000	10	Precomputed in Redis; pure lookup
Two-tower ANN retrieval	2,000	200	12	FAISS / ScaNN index on CPU; user vec from cache
Feature fetch (batch + near-RT features)	200 items	200	8	Redis multi-get; 200 × ~5 features
Light ranker (GBDT)	200	50	5	CPU inference, pre-compiled model
Heavy ranker (neural, GPU)	50	50 (scored)	18	GPU batch; cross-features computed here
Business-logic re-rank + dedup	50	20 (displayed)	4	CPU; diversity, ads injection
Serialisation + network (edge → client)	—	—	8	Protobuf compression
Total			80	20 ms headroom for p99 jitter

The 20 ms headroom is not waste — p99 latency has a fat tail. Without headroom, any single slow downstream service (cold Redis key, GC pause, noisy neighbour) will breach the SLA.

Step 2: Full candidate funnel — architecture

Feed-ranking funnel: candidate pool → ANN retrieval → feature fetch → light GBDT ranker → heavy neural ranker → business logic → displayed feed

Candidate generation runs offline: a Spark job hourly expands each user's follow graph and topic subscriptions into a candidate set of recent posts (≤48 hours old). The candidate list is written to Redis keyed by user ID. At request time this is a single Redis lookup — no model inference, no graph traversal on the hot path.

Two-tower ANN retrieval: offline, train a two-tower model. User tower ingests user ID embedding + aggregated interaction history (mean-pooled post embeddings of last 50 interactions). Item tower ingests post ID embedding + content features (topic, media type, author popularity). Both towers output a 128-dimensional L2-normalised embedding. At request time: look up the user's cached embedding vector; run FAISS (HNSW index) against the 2,000 candidate item vectors to retrieve top 200. HNSW search on 2,000 items at 128 dims takes ~2 ms on CPU — well inside the 12 ms budget including the Redis lookup for the user vec.

Light ranker: a gradient-boosted decision tree (LightGBM, ~200 trees, depth 6) trained on engagement labels. Features: precomputed item statistics (like rate, share rate, 7-day CTR), user-item cross-features (topic affinity score, recency signal, author follow depth). All features come from Redis — no model inference needed. GBDT inference on 200 items takes <2 ms on a single CPU core. Output: top-50 by predicted engagement score.

Heavy ranker: a 2-layer MLP with cross-attention between user embedding and item embeddings (roughly 5M parameters). Inputs: user tower output, item tower output, position embedding (for training; zeroed at inference), explicit cross-features (user–author affinity, freshness decay). Batch of 50 items through the MLP on A100 GPU takes ~10–14 ms including the CUDA kernel launch overhead. Output: calibrated engagement probability per item.

Business-logic re-rank: deterministic CPU step. Applies a maximum-marginal-relevance diversity filter (at most 3 consecutive posts from the same author), injects sponsored posts at fixed positions (slots 3, 8), and removes posts the user has already seen in the last 24 hours (Bloom filter lookup). Final 20 items returned.

Step 3: FLOP arithmetic justifying the light/heavy split

Why not run the heavy ranker on all 200 items instead of 50? Do the arithmetic:

$$\text{FLOPs per item (heavy ranker)} \approx 2 \times 5{,}000{,}000 = 10^7$$

Each of the 5M parameters participates in ~2 multiply-adds per forward pass (weights × activations). This is a rough order-of-magnitude estimate.

At 200 items: 200 × 10⁷ = 2 × 10⁹ FLOPs. An A100 delivers ~312 TFLOP/s (fp16). With 20% utilisation on small batches: effective throughput ≈ 62 TFLOP/s = 6.2 × 10¹³ FLOP/s.

$$t = \frac{2 \times 10^9}{6.2 \times 10^{13}} \approx 32\,\mu\text{s}$$

t = FLOPs needed ÷ effective FLOP/s = time for matrix math alone.

32 µs for the matrix math — sounds fine. But the real cost at batch size 200 is memory bandwidth, not compute. Each item's feature vector (~1 KB fp16) must be loaded from GPU HBM. At 200 items: 200 KB, plus model weights (~10 MB). At A100 HBM bandwidth 2 TB/s: load time ~5 µs. The bottleneck is actually the kernel launch overhead and GPU scheduling latency at these small batch sizes, which adds 8–15 ms of wall-clock time.

Running 200 items instead of 50 means 4× more feature-fetch time (8 ms becomes ~32 ms) plus the GPU kernel time stays roughly the same (still small-batch-dominated). The total saving from light-ranking 200→50 is ~24 ms on feature fetch alone — almost exactly the difference between meeting and missing the latency SLA.

Step 4: Feature freshness plan

Feature	Freshness tier	Computed by	Served from
User embedding (two-tower)	Hourly	Spark + offline model inference	Redis (user_vec:{uid})
Item embedding (two-tower)	On post creation + hourly refresh	Online inference at post time; batch refresh	FAISS index + Redis (item_vec:{pid})
Item 7-day CTR, like rate, share rate	Hourly batch	Spark aggregation over Kafka events	Redis hash (item_stats:{pid})
User–topic affinity score	Hourly batch	Spark	Redis (user_topic:{uid})
Post freshness decay	Request-time computed	Ranking service (formula: e^(−λΔt))	Computed inline from post creation time
User session context (recent clicks in session)	Real-time (<1 s)	Kafka consumer → Redis write	Redis (session:{uid})
Seen-posts Bloom filter	Real-time	Updated on every impression event	Redis Bloom filter (seen:{uid})

The key principle: freshness tier must match the rate of change of the signal. A user's long-term topic affinity changes slowly — hourly batch is fine and costs nothing at request time. A user's in-session behaviour (they just liked three cooking posts) changes in seconds — this must be near-real-time or the ranker will show them more travel content they are ignoring right now.

Step 5: Cold-start strategy

Definition: users with fewer than 10 engagement events in their history (20% of DAU = 2M users). Their user tower embedding has high variance — it is essentially a random vector trained on near-zero signal.

Phase 1 (0–5 interactions): bypass the personalised two-tower entirely. Use a popularity-based retrieval: rank candidates by a demographic-cohort weighted engagement score. Cohort = (age_bucket, geography, signup-declared interests). This is a precomputed table lookup, trivially fast.

Phase 2 (5–30 interactions): content-based bootstrap. Embed the posts the user has interacted with (using item towers, which are high quality). Represent the user as the mean of item embeddings of their interactions — no user tower required. This representation degrades gracefully as interactions increase.

Phase 3 (30+ interactions): full two-tower personalisation. Blend with content-based: final user vector = α × user-tower output + (1−α) × mean-item-embedding. α is a learned function of interaction count, trained with a small neural head.

The transition is invisible to the user: the ranker always reads one vector from Redis, updated according to whichever phase the user is in.

Step 6: Position-bias handling

Without position-bias correction, training clicks create a feedback loop: high-ranked items get more clicks because they are shown first, not because they are better. The model then learns this spurious signal and keeps ranking them first — a self-reinforcing cycle.

$$\hat{y}_{ij} = \sigma\!\left(f_{\theta}(u_i, v_j) + \beta_k\right)$$

Predicted CTR for user i, item j at position k equals sigmoid of the model score plus a learned position bias β_k. β_k is trained jointly but zeroed out at inference time.

At training time: add position as a feature (slot 1 through 20) and let the model learn its effect via β_k. At inference time: set position = 0 (or equivalently, subtract the learned β_k). The model then returns "quality score without position effect." This is equivalent to IPS when position is treated as the propensity and is standard practice in industrial ranking systems.

Alternative: randomisation experiments. Randomly shuffle feed positions for 1% of users. Labels from this slice are unbiased. Train the main model jointly on biased data + position feature, with a small unbiased slice for calibration.

Step 7: Fallback story

Heavy ranker GPU OOM or timeout

Fall back to light ranker output for that request. The user sees a GBDT-ranked feed — slightly lower quality, but no outage. Alert fires; on-call investigates.

Feature service latency spike (>20 ms)

Serve stale features from a local in-process cache (TTL 60 s) rather than blocking. Quality degrades slightly; latency SLA holds.

Redis cache miss on user embedding (cold or evicted)

Fall back to content-based phase-2 embedding computed on the fly from the last 10 interactions stored in Postgres. This takes ~25 ms but is within budget since other stages run normally.

Entire ranking service unhealthy

Return a pre-computed "safe default" feed: globally popular posts from the last 6 hours, refreshed every 10 minutes by a background job. No personalisation, but no blank feed.

The rule: every stage must have a defined graceful degradation path, not just a circuit breaker that returns an error. Users who see a slightly lower-quality feed churn less than users who see an error page.

Step 8: Experiment plan for safe launch

Shadow mode (week 1–2): run the new ranker in parallel with the existing system. Log both sets of scores. Compare offline: does the new ranker's score distribution look sane? Are there any items with degenerate scores (NaN, constant value)? Zero user exposure.
1% canary A/B (week 2–3): route 1% of traffic to the new ranker. Monitor: CTR, time-spent, app crash rate, p99 latency. Set automated rollback trigger: if p99 latency >110 ms or CTR drops >5% relative in any 15-minute window, auto-rollback and page on-call.
10% → 50% ramp (week 3–4): ramp in doubling steps, 24 hours at each step. Guardrail metrics: session length, D7 retention, ad revenue. No ramp if any guardrail breaches.
Full rollout (week 4–5): 100% traffic. Keep old ranker on standby for 72 hours — do not decommission until the new system has survived at least one weekday peak traffic cycle.

Cold-start users should be over-sampled in the early canary — they are the highest-risk segment because the new ranker's cold-start logic is the most novel part of the design.

Step 9: GPU capacity estimate

Heavy ranker: 50 items × ~10⁷ FLOPs per item = 5 × 10⁸ FLOPs per request. At 2,800 QPS peak: 5 × 10⁸ × 2,800 = 1.4 × 10¹² FLOP/s required.

One A100 (fp16, ~20% utilisation on small batches) ≈ 60 TFLOP/s = 6 × 10¹³ FLOP/s effective.

$$\text{GPUs needed} = \frac{1.4 \times 10^{12}}{6 \times 10^{13}} \approx 0.023 \text{ A100s}$$

Required throughput divided by per-GPU effective throughput.

This looks like the heavy ranker needs effectively zero GPUs — because the heavy ranker only processes 50 items per request, not thousands. The real constraint is latency (need a GPU call to complete in <18 ms), not throughput. At 2,800 concurrent requests each needing a GPU in <18 ms, assuming 1 ms kernel launch amortised over a batch of 10 co-scheduled requests:

With continuous micro-batching: one A100 can handle ~55 requests/s at 18 ms latency budget (1/0.018 ≈ 55). To handle 2,800 QPS: 2,800 ÷ 55 ≈ 51 A100s. Given the 30-GPU budget constraint, plan A: reduce heavy ranker to top-30 items (saves 40% compute, frees 20 GPUs). Plan B: use A10G (cheaper, 60% of A100 throughput) for the heavy ranker — brings cost down ~40%. Present both options to finance.

What a Staff+ answer adds

Org implications: "The feature freshness plan requires an SLA with the data platform team. Without a written SLA on Redis uptime and p99 get latency, our fallback story is unreliable. I would negotiate that before launch."
Cost conversation: "At 51 A100s and current cloud pricing (~\$3.50/GPU-hr), the heavy ranker costs \$51 × 24 × 365 × \$3.50 ≈ \$1.56M/year. Given 10M DAU and modest monetisation, this is likely borderline. I would prioritise the A10G option or model distillation to get under \$1M/year."
Evolution: "At 100M DAU, candidate generation from Redis becomes a bottleneck — we would move to an approximate graph-walk retrieval (like Pinterest's candidate generation). The two-tower ANN index would need to shard across 10+ machines."
Challenge the prompt: "Do we actually need 2,000 candidates per request? If the follow graph is dense, better candidate pruning offline (time-decay, quality gate) might let us start with 500 and skip the light ranker entirely — simpler system, lower latency."

Common wrong answers and why they fail

❌ "Run BERT on every candidate"

BERT on 2,000 items at 100 ms is physically impossible. BERT base forward pass ≈ 60 ms on a V100 for a single 512-token input. This answer shows the candidate has not internalised the latency constraint.

❌ "Use a single XGBoost model to rank all 2,000"

XGBoost on 2,000 items with cross-features is feasible in time (~5 ms) but misses the point: cross-features between user and item for 2,000 items require 2,000 feature-fetch calls, which at even 0.5 ms each = 1 second of feature I/O. Without an embedding-based retrieval step, feature cost dominates.

❌ No mention of position bias

Training on click data without position correction guarantees a feedback loop. Interviewers at senior+ level explicitly probe for this — its absence signals inexperience with production ranking systems.

❌ No cold-start strategy

20% cold-start is stated in the scenario. Ignoring it means 2M users get a broken experience. This is a junior mistake — failing to address explicitly stated constraints.

❌ "We'll add monitoring later"

Monitoring and fallbacks must be designed before launch, not bolted on after. This answer suggests the candidate has not shipped a production system.

Self-grading rubric

Dimension	Junior	Senior	Staff
Latency budget	Mentions stages exist but no numbers	Table with per-stage ms that sums to ≤100	Table + explains headroom rationale + 10× scale impact
Retrieval model	Says "use embeddings" or "ANN"	Two-tower architecture, HNSW/FAISS, explains why ANN beats brute-force	Discusses index freshness, sharding at 10× scale, ANN recall vs latency tradeoff
Light/heavy split	Mentions two-stage ranking	GBDT then MLP, justifies FLOP savings	Full arithmetic, names A100 vs A10G tradeoff, discusses distillation path
Feature freshness	Says "real-time features"	Tiered table: batch/near-RT/request-time with storage layer named	SLA negotiation with data platform, late-arrival policy, offline/online parity testing
Cold-start	Not addressed or vague ("show popular posts")	Phased strategy with interaction threshold and content-based fallback	Phased strategy + blend coefficient + how transition is invisible to user + experiment design for cold cohort
Position bias	Not mentioned	Mentions IPS or position feature	Explains the feedback loop, the formula, training vs inference treatment, randomisation alternative
Fallbacks	Not mentioned	Circuit breaker or rollback mentioned	Per-stage graceful degradation + "no blank feed" principle
Experiment plan	"Run an A/B test"	Shadow → canary → ramp with specific percentages and guardrail metrics	Above + automated rollback trigger thresholds + cold-start over-sampling rationale

◆ Interview probes

"Your p99 just breached 110 ms in canary. Walk me through your immediate actions." — tests fallback and debug framework integration.
"Why not just increase the heavy ranker's candidate count to 200 and cut the light ranker?" — tests FLOP arithmetic and feature-fetch cost understanding.
"How does the system behave if the Redis cluster goes down?" — tests fallback depth: do you have a local cache? A Postgres fallback? A static default?
"A post goes viral unexpectedly — 10× normal engagement in 5 minutes. Which part of your system breaks first?" — tests freshness tier design: hourly batch features are now severely stale for this item.
"We want to optimise for 7-day retention, not CTR. What changes?" — tests understanding of label choice: CTR labels create a short-term engagement loop; retention requires long-horizon labels, delayed feedback handling, and potentially a different objective.

📐 If you get the feed-ranker design question — the rule

Trigger: "Design a feed ranking system" or any variant with a DAU + latency constraint.

State the scale math first: DAU → peak QPS. Do this live on the whiteboard. It takes 30 seconds and immediately shows quantitative fluency.
Draw the latency budget table before drawing anything else. Every stage needs a ms budget that sums to the SLA. This is the skeleton everything else hangs on.
Work the funnel left-to-right: candidate set → retrieval → light ranker → heavy ranker → business logic. At each stage: what model, what features, what latency budget, what fallback.
Volunteer cold-start and position bias — they are almost always in scope and interviewers reward candidates who raise them unprompted.
Close with experiment plan and one cost estimate. Mention the GPU count and annual cost even if the interviewer did not ask.

Never: design the model architecture before establishing the latency budget. A beautiful neural architecture that cannot meet the p99 SLA is worthless. Budget first, model second.

✓ Remember

Feed ranking is a multi-stage funnel: 2,000 → 200 (ANN retrieval) → 50 (light GBDT) → 20 (heavy neural) → displayed. Each stage has a ms budget; they must sum to the SLA with headroom.
Heavy ranker on 200 items fails the latency budget not from compute, but from feature-fetch I/O. The light ranker exists to eliminate that cost.
Position bias must be handled at training time (position feature + zero at inference) or you train a circular model that learns "top items are clicked because they are on top."
Cold-start strategy: popularity fallback (0–5 events) → content-based mean-item-embedding (5–30) → full personalised two-tower (30+). Transition is smooth and invisible.
Every stage needs a graceful degradation path. "Circuit breaker returns error" is not a fallback — users must see a feed, even a non-personalised one.

TL;DR

A 100 ms feed-ranking SLA forces a multi-stage funnel: cheap ANN retrieval collapses 2,000 candidates to 200, a GBDT light ranker collapses to 50, and a GPU neural heavy ranker scores the final 50. Feature freshness must be tiered (batch/near-RT/request-time) matching each signal's rate of change. Cold-start users get a phased content-based fallback. Position bias is killed with a training-time position feature zeroed at inference. Every stage has a graceful degradation path so users never see a blank feed.

Tricky interview questions — chapter 02

Q1. Why is the latency budget table the first thing to draw, before any architecture diagram?

The latency budget is the central design constraint — every component choice, model size, and feature strategy is only valid if it fits within its allocated ms budget. Drawing architecture without the budget produces a design that looks correct but may be physically impossible to serve within the SLA. The budget table forces you to allocate concretely and exposes conflicts early (e.g., "if feature fetch takes 30 ms, the heavy ranker has no room"). It is the skeleton; the architecture is the flesh.

Q2. A colleague proposes replacing the two-tower retrieval with a simple BM25 search over post text. Under what conditions would this be a reasonable choice?

BM25 makes sense when the feed is primarily text-search-driven (users looking for specific information rather than browsing), when there is insufficient engagement data to train a meaningful two-tower (very early stage product), or when content quality and text relevance dominate over behavioural personalisation. BM25 has no training cost, is fully interpretable, and handles long-tail queries well. The tradeoff is that it cannot capture semantic similarity ("I like cooking but searched for 'recipe ideas'") or behavioural signals. A hybrid BM25 + dense retrieval often outperforms either alone and is worth proposing as an intermediate option.

Q3. Your FAISS index needs to be updated as new posts arrive. How do you handle this without taking the index offline?

Use a two-index strategy: a primary HNSW index for posts older than N minutes (batch-rebuilt hourly) and a small secondary flat index for posts newer than N minutes (updated incrementally, scanned brute-force since it is small). At query time, merge results from both indexes and deduplicate. Alternatively, use a FAISS IVF index that supports incremental add() calls. The key constraint is that HNSW does not support efficient incremental updates — elements can be added but not deleted, and large-scale additions degrade index quality. The batch-rebuild-with-secondary-index approach avoids this by keeping the hot index small.

Q4. Explain the feedback loop that position bias creates and why it is self-reinforcing.

Without correction, training labels (clicks) are confounded with position: items shown at position 1 receive ~5-10× more clicks than equivalent items at position 5, purely due to visibility. The model trains on this signal and learns that items it scores highly (which are shown at position 1) receive more clicks. At serving time, high-scored items are shown at position 1, which generates more clicks, which reinforce the high score. The loop tightens over time: the model becomes increasingly confident in a small set of items that were lucky enough to be shown early. This manifests as feed homogenisation — users see a narrowing set of content types — and eventually retention decline as novelty drops.

Q5. Why is the heavy ranker budget 18 ms rather than, say, 40 ms? What would you sacrifice to give it more time?

18 ms comes from working backwards: 100 ms total − 15 ms network − 10 ms candidate gen − 12 ms retrieval − 8 ms feature fetch − 5 ms light ranker − 4 ms business logic − 8 ms serialisation = 38 ms remaining, with 20 ms held as headroom = 18 ms for the heavy ranker. To give the heavy ranker more time you would need to cut from other stages. Options: pre-cache feature fetch results at a longer TTL (saves 2–3 ms), simplify business logic (saves 2 ms), accept a tighter p99 headroom (risky). Alternatively, move the heavy ranker to the edge (closer to the user) to save network time — but this requires running a GPU at the edge, which raises cost substantially.

Q6. A post goes viral — 10× normal engagement in 2 minutes. Which part of your system responds first and which lags?

The session-context feature (real-time, <1 s) responds immediately — users who click the viral post within their session will see related content boosted. The Bloom filter and seen-posts deduplication respond immediately. The item's CTR and engagement stats (hourly batch) lag by up to 59 minutes — during this window, the item looks like a normal post to the light and heavy rankers. The two-tower item embedding (hourly refresh) also lags. The viral post will be under-ranked by the personalised system for up to an hour. Mitigation: a trending signal computed every 5 minutes from Kafka event counts, injected as a freshness override into the candidate generation step.

Q7. The product team wants to optimise the ranker for 7-day retention instead of CTR. What changes?

Three major changes: (1) Labels shift from binary click (immediate) to 7-day retention (delayed). This requires a delayed labelling pipeline that joins impressions with D7 retention events — a 7-day pipeline lag before a training example is complete. (2) The training objective changes: instead of binary cross-entropy on clicks, use a survival model or a regression on retention probability. (3) The feature set should include longer-term user signals (30-day engagement patterns, account age, historical churn risk) since CTR-predictive short-term signals may not correlate with retention. The GBDT and neural ranker architectures can stay the same; only the labels and features change. The experiment must also change: a 5% canary needs at least 7 days to observe the retention metric, compared to hours for CTR.

Q8. How would you test whether your position-bias correction is actually working?

Three tests: (1) Randomisation experiment: for 1% of requests, shuffle item positions randomly. Collect clicks. Compare click-rate-by-position on the shuffled traffic against the model's predicted scores — if correction works, model scores should be uncorrelated with position on shuffled traffic. (2) Score distribution by position: in the live system, check that the model's predicted engagement scores are not monotonically decreasing with position (which would indicate residual position effect in the scores). (3) Counterfactual evaluation: use the logged propensity (position assignment probability) to compute inverse-propensity-weighted metrics on held-out data. If IPS-corrected offline AUC matches online CTR better than naive AUC, the correction is working.

Q9. The GPU budget is 30 A100s but the arithmetic shows you need 51. How do you resolve this?

Three options, in order of preference: (1) Reduce heavy-ranker candidates from 50 to 30 — this cuts GPU demand proportionally and may have small quality impact (measure in A/B). (2) Switch to A10G GPUs at 60% of A100 throughput but ~40% of cost — 51 A10Gs ≈ 31 A100-equivalents of compute, fitting within budget. (3) Distil the heavy ranker into a smaller model (3M parameters instead of 5M) — reduces FLOPs per item ~40%, tested first in shadow mode. Present all three to the interviewer as a menu: the right choice depends on whether cost or quality is the binding constraint right now, which is a product/business decision, not an engineering one.

Q10. An adversarial creator is flooding the system with 1,000 low-quality posts per hour trying to game the ranking. How does your system defend against this?

Several layers: (1) Candidate generation quality gate: posts below a minimum quality score (spam classifier score, author trust score) are excluded before entering the 2,000-candidate pool — they never reach the ranker. (2) Author-level engagement rate: the feature store tracks per-author historical CTR and dwell-time. A new author with no history gets a cold-start conservative prior, not a generous one. (3) Diversity rule in business logic: at most 3 consecutive posts from the same author in the displayed feed — flooding the system cannot dominate the feed even if individual posts score well. (4) Abuse signal pipeline: a separate real-time abuse detection system (running independently of ranking) flags accounts and tombstones them from the candidate pool within minutes. Ranking should not be the first line of defence against adversarial content — that is the abuse team's job, and ranking should consume their signals, not replicate them.

Q11. Your shadow-mode logs show the new ranker scores items very differently from the old ranker — some items 10× higher, some 10× lower. Is this a bug?

Not necessarily — but it requires investigation before canary launch. A large score shift could indicate: (a) a calibration difference (old ranker outputs 0–1 probabilities; new ranker outputs raw logits — check output normalisation); (b) a genuine quality improvement (the old ranker was poorly calibrated and the new one is correctly discriminating); (c) a training data bug (the new ranker trained on mislabelled or leaked data — check offline AUC against a held-out test set sampled from production logs); (d) a feature distribution shift (a feature that exists in training data does not exist in serving — check feature null rates in shadow logs vs training). Score distribution shift alone is not a bug; the question is whether it predicts user outcomes correctly. Run the full shadow evaluation: compare new vs old ranker scores on a randomised traffic slice where position is controlled.

Q12. If you could only monitor three metrics in production for this system, which would you choose and why?

The three highest-signal metrics: (1) p99 end-to-end latency — this is the SLA and any breach is immediately user-visible; it detects infra issues, cache cold-starts, and model regressions simultaneously. (2) Score distribution mean and variance per stage — a sudden shift in the light or heavy ranker's output distribution indicates a model issue (feature skew, serving bug) before it shows up in user metrics. (3) D1 retention of the cold-start cohort — this is the leading indicator for whether the cold-start strategy is working; CTR is a lagging metric that can be gamed by over-optimising for short-term clicks. D1 retention for new users tells you whether the ranker is delivering value to the most vulnerable segment.

PART II · DESIGN CHALLENGES

Design: LLM chat serving for 50k concurrent users

🎯The only reason LLM serving is hard: every token you generate must read — and write — the entire conversation history from GPU memory, and 50k people want a turn simultaneously.

This challenge forces you to do the arithmetic that separates a hand-wavy "just use vLLM" answer from a Staff-grade design. You will estimate GPU counts from first principles, size the KV-cache per device, pick a batching strategy, and compute cost per million tokens — the metric every product team actually cares about.

The Scenario

You are the ML-infrastructure tech lead at a company launching a customer-facing chat assistant. The stack must handle:

Model: 13B-parameter dense transformer, fp16 weights, 40 layers, 40 heads, head dim 128, GQA with 8 KV heads per layer (so KV head dim = 128).
Traffic: 50,000 concurrent sessions; mean 4 requests / session / hour → ~56 req/s sustained, with 3× daily peak (≈ 167 req/s).
Token profile: mean prompt 1,500 tokens; mean response 400 tokens; so ~1,900 tokens round-trip per request.
Latency SLO: time-to-first-token (TTFT) < 800 ms at p95; time-per-output-token (TPOT) < 60 ms at p95.
Cost target: \$2.00 / 1M tokens (input + output blended).
Infra: H100 80GB SXM GPUs available on-demand; also an autoscale budget.

Your Task

Estimate how many H100s you need — show every arithmetic step.
Design the batching strategy (continuous batching, chunked prefill, etc.) and explain why naive static batching fails here.
Compute the KV-cache memory budget per GPU and decide how many concurrent sequences fit per device.
Describe routing / session-affinity decisions and why they matter.
Design the autoscale + degradation ladder for the 3× peak.
Address multi-region: when does it help, when does it not?

Hint 1 — Start with throughput, not latency

The most common first mistake is to think about latency targets first and GPU count second. Flip it: figure out how many tokens per second the fleet must generate, then divide by what one H100 can do. Throughput determines fleet size; latency determines per-device batch size and scheduling policy.

Useful number: a single H100 (80GB, fp16, no quantization) running a 13B model with a realistically-sized batch can sustain roughly 3,000–5,000 output tokens/second on the decode phase. Where does that number come from? Memory-bandwidth bound: at 3.35 TB/s HBM3 bandwidth and ~26 GB of model weights (13B × 2 bytes), each decode step reads the entire model in ~8 ms, implying ≈ 125 decode steps/sec with batch=1. With batch=32 the same memory read is amortized over 32 tokens → 4,000 tokens/sec. Confirm you can reconstruct this from first principles before peeking at the solution.

Hint 2 — KV-cache memory is the binding constraint, not weights

On an H100 with 80 GB VRAM: the 13B fp16 model occupies ~26 GB. That leaves ~50 GB for KV cache + activations. Before you decide on batch size, work out how many bytes one token of KV state costs for this model. With GQA (8 KV heads, head dim 128) the calculation is per-layer-per-token: 2 (K and V) × 8 heads × 128 dim × 2 bytes (fp16) = 4,096 bytes per layer. Multiply by 40 layers = 160 KB per token. Now ask: how many 1,900-token sessions does 50 GB support? That tells you the max concurrent sequences a single GPU can hold in-cache.

Hint 3 — Continuous batching changes the scheduling math

With static batching you commit a batch at request arrival and wait for all sequences to finish before accepting new ones. With continuous batching (aka iteration-level scheduling), each forward pass picks the "ready" sequences from a waiting pool — sequences can enter and leave mid-batch. This keeps GPU utilization high even when response lengths vary wildly. Think about how this interacts with prefill vs. decode: prefill of a long prompt (1,500 tokens) for ONE new request takes many more FLOPs than a decode step. If you mix prefill and decode in the same forward pass, long prefills starve decodes → TTFT of other users spikes. What is the fix?

✅ Model solution — GPU count, KV-cache arithmetic, cost per 1M tokens

Step 1 — Fleet throughput demand

Peak traffic: 50,000 sessions × 4 req/hr / 3,600 s × 3× burst = 167 req/s.

Tokens per request: 1,500 prompt + 400 response = 1,900 total, but the serving cost splits differently:

Prefill processes 1,500 tokens in one parallel forward pass (FLOP-heavy, fast).
Decode generates 400 tokens one at a time (memory-BW-heavy, slow).

Output token demand at peak: 167 req/s × 400 tokens = 66,800 output tokens/sec fleet-wide.

Step 2 — KV-cache memory per token (the arithmetic)

Model config: 40 layers, GQA with 8 KV heads, head dim 128, fp16.

$$\text{KV bytes per token} = 2 \times L \times H_{kv} \times D_{head} \times 2$$

2 = K and V tensors; L = 40 layers; H_kv = 8 KV heads; D_head = 128; final ×2 = fp16 bytes

= 2 × 40 × 8 × 128 × 2 = 163,840 bytes ≈ 160 KB per token.

VRAM budget per H100 (80 GB):

Model weights (fp16): 13B × 2 = 26 GB
Activations + scratch: ~4 GB (conservative)
Available for KV cache: 80 − 26 − 4 = 50 GB

Max tokens in KV cache per GPU: 50 GB / 160 KB = 312,500 tokens.

At mean context length 1,900 tokens/session: 312,500 / 1,900 ≈ 164 concurrent sequences per GPU.

This is the primary capacity constraint — not raw FLOP throughput.

Step 3 — Decode throughput per GPU

H100 HBM3 bandwidth: 3.35 TB/s. Each decode step must read all 26 GB of weights once per token generated.

$$\text{decode steps/sec} = \frac{3{,}350 \text{ GB/s}}{26 \text{ GB}} \approx 129 \text{ steps/sec at batch=1}$$

bandwidth ÷ model size = how many times we can scan weights per second

With batch B, we generate B tokens per step → throughput = 129 × B tokens/sec. At batch=164 (our KV-cache ceiling): ≈ 21,000 output tokens/sec per GPU. Apply a 70% utilization haircut (routing overhead, prefill interruptions, head-of-line blocking): ~14,700 usable output tok/sec per GPU.

Step 4 — GPU count

Fleet output token demand at peak: 66,800 tok/s.

GPUs needed: 66,800 / 14,700 ≈ 5 H100s for decode at peak.

Add headroom for prefill burst (prefill is FLOP-bound; a 1,500-token prefill takes ~0.5s on one GPU at batch=1, much less in parallel): plan 2× overhead → 10–12 H100s for the compute fleet. Add 2 for redundancy → 12–14 H100s total at peak.

Sustained (non-peak) load is ~3× lower → 4–5 GPUs; autoscale between 5 and 14.

Step 5 — Batching strategy: continuous batching + chunked prefill

Why static batching fails: variable response lengths mean some sequences finish at token 20, others at token 800. Static batching holds the GPU until the slowest finishes → 90%+ idle time on average.

Continuous batching (iteration-level scheduling): every forward pass, the scheduler swaps completed sequences out and new ones in. GPU is never idle waiting for stragglers. This is the default in vLLM, TGI, and SGLang.

The prefill/decode interference problem: a single prefill of 1,500 tokens in a mixed batch adds ~12 ms to that forward pass, spiking TTFT for all other sequences currently decoding. Chunked prefill splits the prefill into fixed-size chunks (e.g., 512 tokens/chunk) and interleaves them with decode steps. This bounds prefill latency injection at the cost of slightly higher TTFT for the prefilling sequence itself.

Disaggregated prefill/decode (advanced): route all prefill work to a dedicated "prefill pool" of GPUs optimized for high FLOP throughput; route decode work to a "decode pool" optimized for memory bandwidth and high batch size. The KV cache is transferred between pools after prefill completes. Eliminates interference entirely; adds network transfer cost (~1,900 tokens × 160 KB = ~300 MB per request — tolerable on InfiniBand, painful on Ethernet).

Step 6 — Paged KV cache

Without paging: you must pre-allocate the maximum possible context length for every new sequence at arrival — most of it wasted if the sequence is short. With paged KV cache (vLLM's key contribution): KV cache is divided into fixed-size "pages" (e.g., 16 tokens each); pages are allocated on demand as sequences grow. Fragmentation falls from O(max_len) to O(page_size). The practical effect: you can run 2–3× more concurrent sequences for the same VRAM.

Step 7 — Prefix caching for system prompts

If 80% of your requests share a 500-token system prompt, and you cache the KV state for that prefix, each new request skips 500 tokens of prefill → ~33% reduction in prefill work and a jump in effective capacity. Implementation: hash the prefix, store its KV pages, reuse on cache hit. This is "prefix KV caching" or "prompt caching." Works best when prefix is long and hit rate is high. A wrong answer: "just cache the full response" — that only helps exact-repeat queries, not the common case of shared system prompts with novel user turns.

Step 8 — Routing and session affinity

When to use affinity: if your service supports multi-turn chat where the KV state from turn N is reused in turn N+1, routing turn N+1 to the same GPU as turn N avoids re-computing or re-transferring the KV state. This is a significant saving: re-prefilling 1,500 tokens from scratch costs ~500ms; reusing cached KV costs ~0ms. Implementation: a consistent-hash router keyed on session ID.

When affinity hurts: if one GPU gets a few very long-context sessions, its KV memory fills up faster than peers → hot-spot imbalance. Solution: track per-GPU KV utilization and re-route new sessions away from GPUs above a threshold (e.g., 85% KV full).

Step 9 — Autoscale + degradation ladder

Load level	Action	User impact
< 80% capacity	Normal serving, full context window	None
80–100%	Scale out (add GPUs from autoscale group)	~2–3 min ramp-up lag
100–120%	Reduce max response length (400→200 tokens), shed lowest-priority sessions	Shorter answers
> 120%	Queue overflow requests with retry-after header; serve waitlisted users in FIFO order	Visible wait indicator
Catastrophic	Redirect to smaller (7B) fallback model on separate pool	Lower quality, available

Key autoscale metric: KV-cache utilization per GPU, NOT CPU/GPU compute utilization. A GPU can be at 30% compute but 95% KV-full — it cannot accept new sessions. Use KV utilization as the primary scaling signal.

Step 10 — Cost per 1M tokens

H100 on-demand: ~\$3.00/hr (cloud spot: ~\$1.80/hr). Sustained fleet: 5 GPUs × \$3.00 = \$15/hr.

Output tokens per hour at sustained load: 22,267 req/hr × 400 tokens = 8.9M tokens/hr.

Input tokens per hour: 22,267 req/hr × 1,500 tokens = 33.4M tokens/hr.

Total tokens/hr: 42.3M.

$$\text{cost per 1M tokens} = \frac{\$15/\text{hr}}{42.3\text{M tok/hr}} \times 10^6 \approx \$0.35 / \text{1M tokens}$$

fleet cost per hour ÷ total tokens per hour, scaled to 1M

That's well under the \$2.00 target — meaning 5 GPUs on-demand could serve sustained load profitably. The 3× peak fleet (14 GPUs) at \$42/hr over 42.3M × 3 = 127M tok/hr gives \$0.33/1M — still fine. The \$2.00 budget has enormous margin, suggesting the real business risk is not cost but availability during peaks.

Speculative decoding consideration: a small draft model (e.g., 1B) proposes 4–8 tokens per step; the target (13B) verifies in one forward pass. Speedup: 2–3× on output tokens when the draft model's acceptance rate is high (common for constrained domains like customer service). Tradeoff: must serve TWO models; draft model adds ~2 GB VRAM; acceptance rate degrades on novel/creative queries. Implement behind a feature flag, measure acceptance rate in prod before committing to the 1B model in your fleet budget.

Multi-region

When it helps: latency to users (TTFT is sensitive to network RTT — a user 150ms away gets 150ms knocked off their 800ms budget before the GPU even starts); disaster recovery; data-residency compliance.

When it doesn't help throughput: splitting 50k sessions across 2 regions halves per-region capacity but doesn't reduce total GPU count. The model weights must be loaded in every region independently. Shared KV state across regions is impractical (network bandwidth is 10-100× too slow for real-time KV transfer). So multi-region = multi-copy of the fleet, not a cost saving.

Common wrong answer: "multi-region reduces latency AND saves money by sharing load." It reduces latency; it does NOT save money — it costs more because you cannot fractionally pool KV cache across regions.

Common wrong answers and why they fail

"Just use a load balancer round-robin"

Ignores session affinity for KV reuse. Every turn re-prefills 1,500 tokens = 3× TTFT for multi-turn users.

"Scale on GPU utilization"

GPU compute can be low while KV memory is full. The GPU refuses new sequences even though its utilization metric says 30%. Wrong signal → under-scaling.

"Static batch size of 32"

Leaving 132 KV slots idle. 4× throughput left on the table.

"Quantize to int4 to reduce cost"

Halves weight memory (13 GB) — great for fitting more model variants per GPU, but decode throughput is still memory-BW bound. Int4 on H100 gives ~2× throughput, not 4×. A valid optimization, but don't lead with it before showing you understand the primary constraints.

✓ Remember

KV memory is the binding constraint, not FLOPs. For a 13B GQA model on H100, one token costs 160 KB of KV cache; 50 GB free → 164 concurrent sequences per GPU.
Continuous batching + chunked prefill solves the variable-length + prefill/decode interference problems. Disaggregated prefill/decode is the Staff+ answer.
Scale on KV-cache utilization, not GPU compute utilization — they diverge badly.
Cost math: 5 H100s at \$3/hr serve ~42M tokens/hr = \$0.35/1M tokens. The \$2.00 target has 5× headroom — availability risk matters more than cost risk here.
Session affinity saves one full prefill per multi-turn exchange; route to a different GPU only when KV utilization > threshold.

Self-grading rubric
Level	What a passing answer includes
Junior	Names vLLM / TGI, mentions "we need multiple GPUs," knows KV cache exists. No numbers.
Senior	Derives GPU count from token throughput, sizes KV cache, picks continuous batching, mentions paged KV. Ballpark cost estimate. Knows autoscale is needed.
Staff	Full arithmetic in the solution (KV bytes per token formula, utilization haircut, cost per 1M tokens computation). Identifies KV utilization as the correct scaling signal. Discusses chunked prefill vs. disaggregated prefill with tradeoffs. Prefix caching for system prompts. Speculative decoding evaluation (not just "use it"). Multi-region cost reality check. Degradation ladder with concrete thresholds.

📐 If you get this question — the rule

Trigger: "Design an LLM serving system for N concurrent users."

Derive peak output token demand (sessions × req/hr × 3× burst × output tokens/req).
Compute KV bytes per token for the given model (2 × layers × KV-heads × head-dim × 2 bytes).
Divide free VRAM by KV bytes per token to get max concurrent sequences per GPU.
Compute decode throughput: bandwidth / model-size × batch-size × utilization-haircut.
Divide fleet demand by per-GPU throughput → GPU count.
Choose continuous batching + chunked prefill; justify with latency SLO.
Scale on KV utilization, not compute. Build degradation ladder.
Compute cost per 1M tokens at the end to validate against target.

Never: say "just add more GPUs" without showing the arithmetic, or recommend scaling on CPU/GPU compute utilization.

Tricky interview questions — chapter 03

Q1. A 13B model in fp16 has 13 billion parameters. How many GB of VRAM does that require, and what's left for KV cache on an H100?

13B × 2 bytes (fp16) = 26 GB for weights. H100 has 80 GB. Subtract 26 GB weights + ~4 GB activations/scratch = ~50 GB available for KV cache. This is not just a memory formula — it's the starting point for every capacity decision in LLM serving.

Q2. Why is "GPU utilization" a bad autoscale signal for LLM serving?

GPU compute utilization measures how busy the CUDA cores are. But LLM decode is memory-bandwidth-bound, not compute-bound — utilization can read 30–40% while the GPU is fully saturated with memory traffic. More importantly, a GPU can refuse new sessions because its KV-cache pages are full even if compute utilization is low. KV-cache utilization (fraction of pages occupied) is the correct primary signal; compute utilization is misleading.

Q3. What is continuous batching and how does it differ from static batching?

Static batching fixes a batch at request arrival and waits for all sequences in the batch to finish before accepting new ones — GPU is idle whenever short sequences finish. Continuous batching (iteration-level scheduling) makes a fresh batching decision at every forward pass: completed sequences are swapped out, new requests swapped in. GPU stays near full utilization even with highly variable response lengths. vLLM, TGI, and SGLang all implement continuous batching.

Q4. What is the prefill/decode interference problem, and what are the two main solutions?

Prefill (processing the prompt) is FLOP-bound and takes many milliseconds for long prompts; decode (generating tokens) is memory-BW-bound and fast per step. When a large prefill is scheduled in the same forward pass as many decode sequences, the prefill dominates that pass's duration, spiking TTFT for all concurrently decoding users. Solution 1 (chunked prefill): split the prefill into fixed chunks (e.g., 512 tokens) interleaved with decode steps — bounds the latency spike per step. Solution 2 (disaggregated prefill/decode): dedicate separate GPU pools to prefill and decode; transfer KV cache between them after prefill completes — eliminates interference entirely at the cost of KV transfer bandwidth.

Q5. Walk me through the KV-cache bytes-per-token formula for a 13B model with GQA.

KV bytes per token = 2 (K and V) × number_of_layers × KV_heads × head_dim × bytes_per_element. For our model: 2 × 40 × 8 × 128 × 2 (fp16) = 163,840 bytes ≈ 160 KB per token. With 50 GB free: 50 GB / 160 KB ≈ 312,500 tokens total, or about 164 concurrent sequences of 1,900 tokens each. Without GQA (full 40-head MHA): 2 × 40 × 40 × 128 × 2 = 819,200 bytes per token — 5× more. GQA dramatically reduces KV memory.

Q6. When does speculative decoding help, and when does it not?

Speculative decoding uses a small draft model to propose K tokens per step, verified in one forward pass of the large model. Speedup is real (2–3×) when the draft model's acceptance rate is high — which happens with constrained, repetitive outputs (code completion, templated responses, customer-service scripts). Acceptance degrades for creative, open-ended, or multilingual generation. Operationally it requires serving TWO model sizes simultaneously, complicating deployment. Evaluate acceptance rate on your actual traffic distribution before committing to the complexity.

Q7. How does paged KV cache (vLLM) improve memory utilization over pre-allocated KV cache?

Without paging, you must allocate max_context_length × KV_bytes at sequence start even if the sequence ends after 50 tokens — most memory is wasted (internal fragmentation). With paged KV, the cache is divided into fixed-size pages (e.g., 16 tokens each, ~2.5 KB per page for our model). Pages are allocated on demand as the sequence grows; freed immediately when the sequence ends. Fragmentation is bounded by one page per sequence instead of max_context_length per sequence. In practice this allows 2–3× more concurrent sequences for the same VRAM.

Q8. The product team wants to reduce cost by 50%. What levers do you pull, in order of risk?

In ascending risk order: (1) Switch to spot/preemptible instances (H100 spot at ~\$1.80/hr vs \$3.00 on-demand) — ~40% savings with good checkpointing and graceful drain on preemption signals. (2) Enable prefix caching for system prompts — reduces prefill work and allows higher concurrency. (3) Deploy speculative decoding if acceptance rate is high on your traffic — up to 2× throughput improvement. (4) Quantize KV cache to int8 or fp8 — halves KV memory, doubles concurrent sequences per GPU, roughly 2× throughput. (5) Quantize weights to int4 — ~2× throughput on memory-BW-bound decode, small quality hit. Each step has a cost-quality tradeoff; measure regression before shipping.

Q9. Why doesn't multi-region deployment reduce your GPU cost?

Multi-region splits traffic geographically, reducing latency for distant users and improving availability. But the KV cache is local to each GPU — you cannot pool KV memory across data centers (the bandwidth is 10–100× too slow to transfer KV state at inference speed). Each region must be independently sized for its peak load. The total GPU count is the same or higher than single-region because you lose economy of scale in KV pool sizing. Multi-region is a latency and reliability investment, not a cost saving.

Q10. What is session affinity in LLM serving, and what is the risk of getting it wrong in both directions?

Session affinity routes all turns of a multi-turn conversation to the same GPU, so the KV state from earlier turns is already in cache — no re-prefill cost. Getting it wrong by ignoring affinity: every turn re-prefills the full history, multiplying TTFT by 3–4× for long conversations. Getting it wrong by strict affinity: a few users with very long contexts fill a GPU's KV pages, creating hot spots. The right implementation tracks per-GPU KV utilization and re-routes new sessions away from GPUs above a threshold, evicting KV for idle sessions under memory pressure.

Q11. What is the "memory wall" in LLM serving and how does GQA address it?

The memory wall: decode throughput is bounded by HBM bandwidth, not FLOP throughput. For every output token, the GPU must read all model weights (26 GB for 13B fp16) once. GQA (grouped-query attention) reduces the number of KV heads from Q-heads (e.g., 40) to a smaller number (e.g., 8), shrinking KV-cache memory by 5× and the per-step KV-read volume by 5×. This directly loosens the memory-bandwidth constraint, allowing larger batch sizes and higher throughput. MQA (multi-query attention, 1 KV head) goes further but with more quality degradation. GQA is the dominant choice in production models (Llama 3, Mistral, Gemma).

PART II · DESIGN CHALLENGES

Design: feature store with point-in-time correctness

🎯Offline AUC 0.92, online AUC 0.71 — the gap is not your model, it is your join.

A feature store is the connective tissue between raw data and your models. When it is built wrong, training silently sees the future and your offline metrics become fiction. This challenge walks you through diagnosing a point-in-time leak from first principles, then designing the offline/online infrastructure that prevents it permanently. Mastering this pattern is a prerequisite for any Staff-level ML systems role.

The scenario

Your company runs two high-stakes ML consumers: a fraud detection model (inference within 200 ms of each transaction) and a feed ranking model (batch-scored nightly, re-ranked live). Both draw from a shared feature platform serving 200 features across 8 data-engineering teams.

The platform has grown organically. Features come from two source types:

Streaming — Kafka events (transaction counts, session activity, engagement signals) written to Redis. Sub-second freshness.
Batch — daily Spark jobs writing to Hive/Parquet (30-day aggregates, user segments, graph-derived scores). Freshness: 18–36 hours.

Training data is assembled monthly by a script that joins the label table (fraud outcomes, clicks) against the feature tables on user_id. The script has no special time handling — it just does a plain SQL join.

The numbers that triggered this conversation:

Metric	Offline (held-out test set)	Online (A/B shadow mode)
AUC-ROC	0.92	0.71
Precision @ top decile	0.84	0.51
Feature coverage	99.8 %	88.3 %

The gap is 0.21 AUC — catastrophic. A naive attribution says the model is bad, but the model never changed between offline and online evaluation. Something upstream is poisoned.

Your task

Diagnose — identify the leak class (training-serving skew, label leakage, target leakage, point-in-time leak). Show with a concrete 4-row table what the wrong join returns vs. what the correct as-of join should return.
Design the offline store — schema for a timestamped feature log, the point-in-time join algorithm, backfill strategy.
Design the online store — Redis/KV layout for streaming features and a low-latency serving API. Justify why it stays separate from the offline store.
Define freshness tiers — map the 200 features to at least 3 tiers; state the SLA and mechanism for each.
Describe parity testing — how you continuously verify that offline training distributions match online serving distributions.
State the late-data policy — what happens when a batch job is delayed 6 hours? Who owns what feature?

Hint 1 — what does the 0.92 vs 0.71 gap pattern tell you?

Offline metrics that are too GOOD are as diagnostic as online metrics that are too bad. What class of bug systematically inflates offline evaluation? What information could the training set contain that serving never will?

Hint 2 — look at the monthly training-set build

Training sets are built monthly from the warehouse's CURRENT feature tables. A label from June 3rd is joined against… which version of the user's 7-day aggregate? What did serving see on June 3rd?

✅ Model solution

Diagnosis. The monthly build joins labels against end-of-month feature snapshots — every training row sees features computed AFTER its label event, including the label's own consequences. That's point-in-time leakage; the 0.92 is fiction. The wrong-vs-right join:

Label event	WRONG join (month-end snapshot)	RIGHT join (as-of)
fraud_flag @ Jun 3, 12:04	txn_count_7d as of Jun 30 (includes Jun 3-10 panic activity!)	txn_count_7d as of Jun 3, 12:04 (data through Jun 3)
click @ Jun 12, 09:15	user_ctr_30d as of Jun 30 (includes this click)	user_ctr_30d as of Jun 12, 09:15

Architecture. (1) Timestamped feature log: every feature value is appended with its computation timestamp — the offline store is feature HISTORY, not current state. (2) As-of join engine for training-set builds: for each label, fetch the latest value with ts ≤ label_ts (and optionally ts ≥ label_ts − staleness_bound to mimic serving staleness). (3) Online store holds latest values, written by the same pipelines that append history — one definition, two materializations. (4) Registry: per-feature owner, freshness tier, lineage. (5) API: get_online(keys, features) for serving; build_training_set(labels_df with timestamps, features) for offline — teams never hand-write joins.

Freshness tiers. Classify the 200 features: batch daily (demographics, long aggregates), near-real-time micro-batch (hourly counters), streaming (fraud velocity — the fraud team's true requirement). Each tier ~10× the ops cost of the previous; default down.

Parity & late data. Continuous online/offline parity check: sample serving reads, replay as-of offline, alert on divergence rate. Late-arriving events: features recompute with event-time watermarks; the as-of join must use the value as it WAS at serving time — which logged-at-scoring features give you for free (log features at inference; train on the logs; the leak class disappears by construction — say this as the strongest variant).

Common wrong answers. "Retrain more often" (doesn't touch the join), "add regularization" (the 0.92 isn't overfitting in the classical sense), "feature selection to remove the leaky ones" (every aggregate leaks when joined wrong — the JOIN is broken, not particular features).

✓ Remember

Offline ≫ online with a monthly snapshot build = point-in-time leakage until proven otherwise.
The fix is architectural: timestamped history + as-of joins, or logged-at-scoring features.
Freshness is tiered and priced; parity checks are the skew alarm; lineage answers "who breaks if this table is wrong."

Level	Answer signature
Junior	Names "data leakage" vaguely; proposes retraining or model changes.
Senior	Diagnoses the as-of join precisely with the table above; designs offline/online stores + the training-set API; tiers freshness by need.
Staff	Adds logged-at-scoring as the structural cure, parity monitoring as the regression guard, late-data/backfill policy, lineage for blast radius, and a migration plan for 8 teams' existing pipelines with the fraud team's 5ms reads handled separately.

PART II · DESIGN CHALLENGES

Design: enterprise RAG with hard permissioning

🎯ACL filtering belongs at query time against a permission index — never delegated to the LLM after retrieval.

This challenge puts you in front of the hardest constraint in enterprise search: retrieval quality and authorization must both be correct, and the failure modes are asymmetric — a missed answer frustrates a user, but a leaked document can end careers. You will design a RAG system for 50,000 employees over 5 million documents, reason through exactly where permission filtering belongs in the pipeline, and build an evaluation plan that stress-tests both relevance and information leakage.

The scenario

Your company runs three internal knowledge sources that must be unified into a single Q&A interface:

Confluence wiki: 3.2M pages. Access rules: space-level (public, team-private, exec-only) plus page-level overrides.
Google Drive: 1.5M docs/sheets/slides. Access rules: per-file ACL with ~6 principals on average; changes propagate from folder inheritance.
Jira: 300k tickets. Access rules: project-level, with some tickets marked "security" and restricted to 12 people company-wide.

Scale numbers:

Corpus

5M documents, avg 1,800 tokens → ~9B tokens total

Users

50k employees; median employee can see ~35% of corpus

ACL mutation rate

~2,000 permission changes/hour (offboarding, project rotations, folder moves)

Freshness SLA

A permission change or document edit must be reflected in retrieval within 1 hour

Query load

~1,200 queries/minute peak (0.4 queries/employee/hour)

Latency target

p95 answer latency ≤ 4 seconds end-to-end

Zero-tolerance

NO unauthorized content must appear in any answer or citation

Your task — exact deliverables

Ingestion architecture: how do 5M documents get chunked, embedded, and stored? What metadata must travel with every chunk?
Incremental indexing pipeline: when a doc is edited or deleted, what happens within the 1-hour SLA? Define the tombstone strategy.
ACL filtering decision: where exactly does permission filtering happen — before retrieval, during retrieval, or after retrieval? Argue for your choice. Explain why the alternative that "feels" natural is wrong.
Retrieval stack: justify your choice of sparse, dense, or hybrid retrieval. If hybrid, specify the fusion mechanism.
Evaluation plan: define metrics and test sets for both answer quality and leakage. What does a leakage red-team look like?
Permission-aware cache: a senior interviewer will ask whether you cache query results. Walk through the subtlety.

Time-box yourself to 25 minutes before opening any hint. Write your answer on paper or in a doc first.

Hint 1 — Where does permission filtering live?

Draw the retrieval pipeline as a sequence of stages: query → candidate retrieval → filtering → re-ranking → generation. Now ask: at which stage do you know which documents the user is allowed to see? At which stage is it too late? The key insight is about what the LLM does with context it receives — does it reliably ignore unauthorized passages?

Also think about the difference between filtering 1M candidates down to 50 authorized ones, versus retrieving the top-50 and then checking authorization. The numbers are the same only if your retrieval is perfect — and it never is.

Hint 2 — Incremental indexing and the tombstone problem

When a document is deleted or its ACL is restricted (e.g., a page is moved to exec-only), what happens to chunks that were already indexed? Vector stores are append-friendly but deletion is expensive. A tombstone is a marker that tells the retrieval layer "this chunk exists in the index but is logically deleted — skip it." Think about how tombstones interact with your filtering logic.

Freshness within 1 hour means your ingestion pipeline cannot batch daily. What does an event-driven pipeline look like? What triggers it?

Hint 3 — The permission-aware cache trap

Caching is obvious for performance: if 100 employees ask "what is our parental leave policy?" you should not run 100 RAG pipelines. But the cache key cannot be just the query string. Think about what happens if Alice can see HR-confidential docs and Bob cannot — they ask the same question and you cache the result. Whose answer gets served to the other?

The fix requires the cache key to encode the user's permission set, but permission sets are large (thousands of doc IDs) and change hourly. What is a practical approximation?

✅ Model solution

The crux: filter at query time, against an index, never post-hoc. Post-hoc filtering (retrieve 50, ask an LLM or rule layer to drop unauthorized docs) fails three ways: the unauthorized content already left the vector store toward your application layer (leak surface), recall collapses when most of the top-k gets filtered (the user with narrow permissions gets 2 results), and an LLM judging permissions will eventually be wrong once — which is once too many. Correct: the retrieval query carries the user's permission predicate, evaluated INSIDE the search engine against indexed ACL metadata, so unauthorized vectors are never candidates.

Permission model. Per-doc ACLs (users, groups, classification labels) flattened into indexable terms (group IDs, tenant IDs, sensitivity tier) stored alongside each chunk's vector. Query side: resolve the user → their groups/entitlements (cached minutes, not hours — hourly ACL churn is the requirement) → filter expression ANDed into both the BM25 and the dense ANN search (filtered HNSW / pre-filtered IVF; modern engines support metadata-filtered ANN natively).

Ingestion & freshness. Source connectors emit (doc, ACL, content) change events → Kafka → incremental pipeline: re-chunk/re-embed only changed docs; ACL-only changes update metadata WITHOUT re-embedding (cheap, hits the <1h SLA); deletes write tombstones that suppress results immediately and compact later. Full re-index stays as a weekly repair job, not the freshness path.

Retrieval stack. Hybrid BM25 + dense with RRF fusion (enterprise queries are full of exact tokens — error codes, project names — where keywords beat embeddings), then a permission-safe cross-encoder re-rank over the top ~50. Generation cites only retrieved chunks; the answer layer never sees unauthorized text by construction.

The cache subtlety. A shared answer/retrieval cache keyed only on the query leaks across users (user A's cached answer contains doc X; user B without access asks the same question). Options: key the cache by (query, permission-set hash) — works when permission sets cluster into groups; or cache only retrieval candidates pre-filter and re-apply the user's filter on every hit (cache the expensive part, never the authorization). Never cache post-authorization content across principals.

Eval. Two suites: quality (recall@k/MRR against a labeled query→doc set, per department) and LEAKAGE red-team: synthetic users with known entitlements issue queries engineered to surface forbidden docs (exact-title queries, quote fragments); the gate is zero unauthorized chunks retrieved across the suite, run on every index/code change. Audit log of (user, query, docs retrieved) for compliance.

Common wrong answers. Post-retrieval LLM filtering (leak surface + recall collapse); per-user indexes (5M docs × 50k users doesn't exist); ignoring ACL-only updates (forcing re-embeds blows the freshness SLA and the GPU bill).

✓ Remember

Authorization is a RETRIEVAL predicate, not a generation instruction.
ACL changes ≠ content changes: metadata updates must be cheap and fast; embeddings only recompute on content change.
Caches keyed on query alone leak across users — key on permissions or cache pre-authorization artifacts only.
Ship with a leakage red-team suite as a blocking gate, not just quality metrics.

Level	Answer signature
Junior	RAG pipeline recited; permissions handled by "filtering the results" after retrieval.
Senior	ACL-filtered ANN at query time, incremental indexing with tombstones, hybrid retrieval with RRF, separates ACL updates from re-embeds.
Staff	Adds the cache-leak subtlety, the leakage red-team gate, group-flattening of ACLs with churn-rate math, citation-constrained generation, audit logging, and the cost story (re-embed budget vs metadata updates).

PART III · DEBUG CHALLENGES

Debug: p99 latency tripled after yesterday's deploy

🎯A flat p50 next to a spiked p99 is the fingerprint of a cold cache, not a slow algorithm.

Your ranking service p99 just tripled overnight. p50 is perfectly calm. GPU utilization hasn't budged. Something subtle changed at deploy time — and the telemetry, if you read it right, tells you exactly what. This challenge trains you to read the latency percentile signature, trace the fan-out path, and prescribe the cheapest fix before you ever touch a rollback button.

The Scenario

You run a ranking service for a content platform. At 13:45 your team deployed a model update that added 12 new features — all fetched from the same feature store, all served through the existing feature service. No infra changes, no schema migrations. Deployment completed cleanly with zero errors in the deploy log.

At 14:00 the on-call engineer gets paged. Here is what the dashboards show:

Metric	13:30 (pre-deploy)	14:15 (post-deploy)	Delta
Ranking service p99 (ms)	80	260	+225%
Ranking service p50 (ms)	42	44	+5% (flat)
Ranking service p95 (ms)	61	181	+197%
Feature service p99 (ms)	18	74	+311%
Feature service p50 (ms)	9	10	+11% (flat)
GPU utilization (%)	61	62	flat
Feature cache hit rate (%)	92	61	−34 pp
Feature fetch fan-out (calls/req)	14	26	+86%
Ranking service error rate (%)	0.02	0.03	flat

Additional context: the feature cache is a shared Redis cluster with an LRU eviction policy and a 4-hour TTL. The 12 new features have never been requested before this deploy. Total cache capacity is 80 GB. Cache occupancy before the deploy was 71 GB.

Your Task

You have 15 minutes before the incident commander asks for a status update. Do the following:

Rank your hypotheses. List at least 4 plausible causes, in descending order of likelihood given the telemetry above. For each, say which data points support or contradict it.
Name the 3 cheapest probes you would run right now to confirm or eliminate the top hypothesis — ordered cheapest-first (read-only before write).
Prescribe the immediate fix assuming your top hypothesis is confirmed.
Describe the prevention — what process change ensures this never pages you at 14:00 again?

Hint 1 — Read the percentile signature

p50 is flat; p99 is 3× worse. What does that tell you about the distribution of affected requests? If the scoring model were slow, every request would be slower and both percentiles would shift. If a downstream call were slow, every request would also be slower. So what class of event affects only the slowest requests?

Think about caches. When a cache returns a hit, the request is fast. When it misses, the request falls through to the origin and is slow. If your cache hit rate drops from 92% to 61%, what fraction of requests now experience the slow path? Does that fraction match the percentile that spiked?

Hint 2 — Trace the fan-out path

The feature fetch fan-out went from 14 calls/request to 26 calls/request. Those 12 extra calls correspond exactly to the 12 new features. Now: each of those calls is a cache miss (the keys have never existed). A cache miss on a shared Redis cluster triggers a synchronous read from the feature store backend. The ranking service's p99 latency is bounded by the slowest of its parallel fan-out calls. What happens to the maximum of 26 independent random variables compared to the maximum of 14?

Extreme-value statistics: the expected maximum of n i.i.d. random variables grows with n. More calls → higher chance one of them hits a slow path (GC pause, hot key, network jitter). That's why p99 blows up while p50 stays flat.

Hint 3 — Cache eviction and the secondary effect

Before the deploy, cache occupancy was 71 GB out of 80 GB. The 12 new features are being written into the cache on first fetch. Under LRU eviction, the new keys displace old ones. Which old keys get evicted first? The least-recently-used ones — which are often low-frequency features that were already borderline. This creates a secondary wave of cache misses on previously-warm features, further degrading hit rate beyond just the 12 new features. This explains why hit rate fell by 34 pp rather than a smaller amount proportional to just the new features.

✅ Model solution — full causal chain, probes, fix, prevention

Causal chain (root cause)

Primary cause: The deploy introduced 12 new feature keys that had zero cache entries. Every request immediately triggered 12 cache misses, each fanning out to the feature store backend synchronously. The cache hit rate dropped from 92% to 61% because: (a) 12/26 = 46% of calls per request are guaranteed misses, and (b) the cache is near capacity (71/80 GB), so new key writes evict warm old keys under LRU, causing secondary misses on previously-cached features.

Why p50 is flat but p99 exploded: Cache hits are fast (~5ms). Misses fall through to the feature store backend (~60-80ms). A request with all hits completes in ~42ms (the p50). A request with even one slow miss completes in max(all_calls). With 26 calls/request and 39% miss rate, the expected number of misses per request is ~10 — almost certain at least one miss occurs per request. The expected maximum latency of 26 calls grows faster than the mean: extreme-value theory predicts the max of n exponentials scales as O(log n) above the mean. This is why p99 (the tail of the tail) exploded while p50 (median) barely moved.

$$P(\text{at least one miss per request}) = 1 - 0.61^{26} \approx 1 - 0.000002 \approx 100\%$$

hit-rate 0.61 raised to 26 calls: essentially every single request hits the slow path at least once

The 3 cheapest probes (ordered cheapest-first)

Probe 1 — Cache hit rate by feature key (read-only, 2 min)

Query Redis INFO keyspace and filter hit/miss counters by the 12 new feature key prefixes. If hit rate on new keys is ~0% while old keys remain at ~92%, the cold-cache hypothesis is confirmed. Cost: one read command. No risk.

Probe 2 — Per-feature-service dependency latency heatmap (read-only, 5 min)

Pull distributed trace samples from the last 15 minutes, group by which feature keys were fetched, and plot the latency distribution. Requests whose slow span corresponds to new feature key fetches confirm the root cause. Requests slow for other reasons (model inference, downstream calls) would point elsewhere.

Probe 3 — Rollback canary test (low-risk write, 10 min)

Route 1% of traffic to the previous model version (which does not request the 12 new features). If p99 immediately returns to ~80ms on that 1% slice while the other 99% stays at 260ms, the deploy is confirmed as the cause. This is slightly more invasive but gives definitive evidence before a full rollback decision.

Immediate fix

Step 1 (now, 0 min): Pre-warm the cache for the 12 new feature keys before routing production traffic. Run a background job that fetches all 12 features for the top-N user IDs (e.g., top 1M by DAU) and writes them into Redis. At 100k users/minute this takes ~10 minutes and fills the cache without user-visible latency impact.

Step 2 (now, parallel): If p99 is breaching SLA, immediately enable hedge requests — send a second parallel feature fetch for any call that hasn't returned within 50ms, take whichever returns first. This adds overhead but caps tail latency.

Step 3 (short-term): Batch the 26 feature calls into a single multi-get request to the feature service instead of 26 serial or parallel individual calls. This reduces connection overhead and allows the feature service to pipeline responses, cutting p99 on cache-miss paths significantly.

Prevention — deploy-time cache warming gate

Pre-deploy warming job (required for feature additions): Any PR that adds new feature keys must include a companion warming script. The deploy pipeline runs this script against production cache before traffic cut-over. Warming must reach >80% hit rate on new keys before the gate passes.
p99 canary gate: Deployments are gated on a 5-minute canary window where 5% of traffic runs on the new version. If p99 exceeds 1.5× baseline during the canary window, the deploy is automatically held and the team paged — before full rollout.
Cache capacity headroom policy: Cache occupancy must not exceed 60% at deploy time (current: 89%). Enforce this as a pre-deploy check. Headroom absorbs new keys without evicting warm entries.
Fan-out alerting: Alert if per-request feature call count increases by >20% across a deploy. Automatically flags new feature additions for warming review.

What a Staff+ answer adds

Challenge the fan-out design: 26 serial/parallel feature calls per request is a design smell. A Staff candidate asks: "Why aren't these batched into a single request to the feature service?" and proposes a feature vector API that returns all features for a user in one round-trip.
Cache sizing math: 12 new features × 1M active users × ~200 bytes/feature = 2.4 GB new cache data. This fits in the 9 GB headroom if we had maintained the 60% policy. A simple back-of-envelope before the deploy would have flagged the eviction risk.
Org process: "This incident reveals that we have no staging environment that exercises the feature cache at production scale. A shadow-traffic replay with production cache state would have caught this in CI."

Common wrong answers and why they fail

"GPU is the bottleneck — scale up compute"

GPU utilization is flat at 62%. Scaling GPU adds cost but fixes nothing. The bottleneck is I/O latency (cache misses), not compute. Red flag answer.

"Roll back immediately without diagnosing"

Rollback discards the new model features — a correctness regression — and doesn't fix the underlying cache policy. The right move is to confirm the hypothesis first, then warm the cache (10 min) rather than rolling back (and re-deploying later with the same problem).

"p99 latency is just noise / outliers"

p99 at 260ms is 10% worse for every 100th user. At 10M DAU this is 100k users/day having a bad experience. Not noise.

Level	Answer signature
Junior	Identifies cache hit rate as relevant; suggests rollback; doesn't explain the p50-flat / p99-spike signature
Senior	Explains cold-cache + fan-out amplification; orders probes cheapest-first; prescribes warming before rollback; mentions hedge requests
Staff+	All of the above + challenges fan-out design, sizes the cache impact with arithmetic, proposes the canary gate + staging environment gap, discusses the org process change needed

📐 If you get this question — the rule

Trigger: "p99 spiked but p50 is flat" or "latency got worse right after a deploy."

Say the phrase: "flat p50 + spiked p99 is a cold-cache or fan-out signature, not a throughput problem."
Ask: "Did cache hit rate change? Did the number of downstream calls per request change?" Pull those two metrics first.
Trace the causal chain: new keys → misses → slow origin fetches → tail amplified by fan-out → p99 blows up.
Prescribe probes in order: read-only confirmation (cache stats) → trace sample → canary slice.
Fix is warming, not rollback (unless SLA is critically breached and warming takes >15 min).

Never: Jump to "scale up compute" when GPU/CPU utilization is flat. Never roll back without first confirming the hypothesis — you may be trading a cache problem for a model quality regression.

✓ Remember

p50 flat + p99 high = cold cache or fan-out amplification. Every request hits at least one slow path; median is unaffected because most paths are still fast.
Fan-out magnifies tail latency. The expected maximum of N i.i.d. calls grows as O(log N). Adding 12 calls to 14 nearly doubled the worst-case latency.
LRU eviction cascades. A near-full cache plus new keys evicts warm old keys, creating secondary misses beyond the new features alone. Always maintain headroom.
Cheap probes first. Read-only → low-risk write → rollback. Don't rollback before you've confirmed the cause; you may regress quality for no gain.

Tricky interview questions — chapter 06

Q1. What does a flat p50 with a tripled p99 tell you about the nature of the problem?

It tells you the problem affects only a fraction of requests — specifically the slowest fraction. If the issue were systemic (slow model, slow CPU), every request would slow down and p50 would move. A flat p50 means the median request is unchanged: it hits the fast path (cache hit, no contention). Only the unlucky tail hits the slow path (cache miss, origin fetch). This is the textbook signature of a cold cache, a thundering-herd event, or a fan-out with a long tail of slow calls.

Q2. Cache hit rate dropped from 92% to 61%. Why did eviction of old keys happen, and how does it compound the problem?

The cache was at 71/80 GB (89% full). The 12 new features are written into cache on first fetch. Under LRU policy, each write evicts the least-recently-used existing key. The evicted keys are typically low-frequency features — already marginal. Those features were previously served from cache; now they miss too. This secondary eviction wave explains why hit rate fell by 34 pp rather than just the ~12/26 = 46% one might expect from the new keys alone. The lesson: always maintain significant cache headroom before deploying new feature additions.

Q3. Why is pre-warming the cache better than an immediate rollback in this scenario?

Rollback removes the new model and the 12 new features — a quality regression. Pre-warming the cache for the new keys takes ~10 minutes and restores p99 to baseline without discarding the model improvement. Rollback is correct when: (a) warming would take longer than the SLA breach is tolerable, (b) the root cause is not confirmed as cache-related (e.g., it could be a model bug), or (c) the new model is causing correctness errors. In this case, warming is strictly better: it fixes latency while preserving model quality.

Q4. Explain "hedge requests" and when they help vs. hurt.

A hedge request sends a duplicate of a slow sub-call after a threshold timeout (e.g., 50ms), taking whichever response arrives first. They help when: tail latency is caused by occasional slow nodes or GC pauses — the duplicate is likely served by a different replica that isn't paused. They hurt when: the slow path is systemic (all replicas are slow, e.g., due to cache misses at origin), in which case you double the load on an already-stressed backend. In this scenario, hedge requests provide partial relief but don't fix the root cause — they add origin load. Use as a temporary mitigation while warming runs, not as a permanent solution.

Q5. A colleague suggests adding a Redis replica for higher availability. Does this help with p99 latency?

No. A read replica improves read throughput and availability (failover), but does not fix cold-cache latency. The problem is that the keys don't exist in cache, not that cache is overloaded or unavailable. Adding replicas copies the empty cache. The fix is warming the keys. A replica would help if the problem were cache server CPU saturation or a single point of failure — neither of which is present here.

Q6. How would you design a deploy-time cache warming system for a feature store?

The warming system has three parts: (1) Detection — the deploy pipeline diffs the old and new feature key sets and identifies new keys. (2) Warming job — a background worker iterates over the top-N active entity IDs (users, items) and pre-fetches all new feature values, writing results to cache. Parallelized across workers with rate limiting to avoid overwhelming the feature store backend. (3) Gate — the deploy pipeline polls cache hit rate for new keys and only cuts over production traffic when hit rate exceeds a threshold (e.g., 85%). If warming takes too long, the release is held and the team is notified. This pattern is sometimes called "blue-green cache warming."

Q7. The feature fan-out went from 14 to 26 calls per request. Why is batching the 26 calls into a single multi-get better than parallelizing them?

Parallelizing 26 calls: the total latency is the maximum of 26 independent latency samples. As N grows, the expected maximum grows (extreme-value statistics). Also, 26 simultaneous connections add connection overhead, head-of-line blocking risk, and TCP setup cost. A single multi-get: one round-trip to the feature service, which internally pipelines the 26 key lookups in a single Redis pipeline command. Total latency ≈ one network round-trip + one Redis pipeline (which is O(N) but pipelined, so practically the same as a single get for N ≤ 100). Batching reduces the expected tail from O(max of 26 samples) to O(one pipelined batch), cutting p99 significantly.

Q8. The incident commander asks: "Should we add a p99 latency SLO for the feature service?" What do you say?

Yes, and it should be set below the ranking service's p99 latency budget minus the model inference budget. Currently ranking p99 is 80ms pre-incident; model inference likely takes ~30-40ms; so feature service p99 should be ≤ 20-25ms. Today's incident showed feature service p99 can reach 74ms — 4× above that budget. An SLO with an alerting threshold at 30ms would have paged on-call during the canary phase rather than after full rollout. Critically: the feature service SLO should be tested during deploy canaries, not just in steady state, because cold-start effects only appear on new traffic.

Q9. Same scenario but GPU utilization had jumped from 61% to 95% instead of staying flat. How does your diagnosis change?

High GPU utilization changes the story completely. Now the bottleneck is compute, not I/O. The 12 new features likely include expensive computed features (e.g., embeddings, heavy aggregations) being computed on-GPU at serving time rather than pre-computed offline. Or the new model is significantly larger/more expensive than the old one. The p50-flat signature would likely not hold — p50 would also degrade. The probes change: profile the model forward pass (which layer is slowest?), check if new features require GPU computation, check model parameter count. The fix is either model optimization (quantization, pruning, distillation) or offloading heavy feature computation to offline batch jobs.

Q10. You've confirmed the cold-cache hypothesis. The engineering manager asks whether to roll back, fix forward (warm cache), or just cap the new feature set to 6 features instead of 12. How do you advise?

Fix forward (warm cache) is the right call if warming completes within the SLA tolerance window (~10-20 min) and the new model demonstrably improves quality. Rolling back is appropriate if: warming is too slow, or the new model has a correctness bug, or the business impact of the 3× latency is severe enough that 10 minutes of degradation is unacceptable. Capping to 6 features is an engineering compromise that avoids the immediate crisis but wastes half the model improvement and doesn't address the systemic problem (no warming gate). Recommend: warm cache now, restore service in 10 min, then implement the warming gate and canary policy so future feature additions never repeat this. The cap-to-6 option should be rejected — it's a one-time patch, not a systemic fix.

PART III · DEBUG CHALLENGES

Debug: training loss exploded at step 41,200

🎯A grad-norm spike that precedes loss is a crime scene with a clear fingerprint — data contamination, not numerical instability.

A 7B-parameter pretraining run hits step 41,200 and within 300 steps loss goes from 2.1 to 9.8 to NaN. This challenge walks you through reading the telemetry, forming ordered hypotheses, running the cheapest-first probes, recovering from checkpoint, and installing gates so it never silences another run. The skills here transfer directly to any large-scale training fire.

The scenario

You are on-call for a 7B dense transformer pretraining run. Configuration:

Model

7B parameters, 32 layers, bf16 weights + activations, no fp32 master weights

Hardware

64 × H100 SXM5, tensor-parallel 4, pipeline-parallel 4, data-parallel 4 (64 total)

Optimizer

AdamW, β₁ 0.9, β₂ 0.95, ε 1×10⁻⁸, weight decay 0.1, grad clip 1.0

LR schedule

Warmup 0→6×10⁻⁴ over steps 0–2k, then cosine decay to 6×10⁻⁵ by step 300k — currently mid-decay (~step 41k)

Batch

4M tokens per step (global batch), sequence length 4096

Data pipeline

40 shards of ≈25B tokens each; shards rotate sequentially; shard 0–40 completed by step ~41,000; shard 41 loaded at step ≈41,050

Checkpoints

Every 1,000 steps; last clean checkpoint is step 40,000

Telemetry snapshot (extracted from the run's W&B dashboard):

Step	Train loss	Grad norm	Loss-scale (GradScaler)	Notes
40,800	2.09	0.82	65,536	All normal
41,000	2.11	0.79	65,536	Shard 40 still active
41,050	2.13	0.81	65,536	Shard 41 first batch
41,100	2.18	1.47	65,536	Grad norm first spike
41,150	2.61	3.82	65,536	Loss rising
41,200	4.40	11.9	32,768	Loss-scale halved (overflow detected)
41,300	9.84	38.4	4,096	Loss-scale cascading down
41,380	NaN	NaN	—	Run crashed

Additional facts your colleague surfaced before paging you:

LR at step 41,100 is ~3.8×10⁻⁴ — expected for this schedule, not anomalously high.
No code changes, no config changes, no hardware swaps between step 40,000 and 41,380.
GPU memory utilization is flat at 91% throughout — no OOM pressure before the crash.
The loss-scale counter starts halving at step 41,200, indicating bf16 overflow in the backward pass.
Shard 41 was produced by a different preprocessing worker pool than shards 0–40.

Your task

Write an ordered hypothesis list (most likely first) BEFORE opening hints.
Name the three cheapest discriminating probes, in order.
Give the recovery plan (checkpoints exist every 1,000 steps) and the prevention list.

Hint 1 — what does "grad-norm spikes BEFORE loss" order tell you?

Causality reads off the telemetry order: gradients went pathological first, loss followed. What produces a sudden gradient pathology mid-run when LR hasn't changed — and what coincided at ~41k?

Hint 2 — the shard boundary

Shard 41 started at ~41,200 and came from a different preprocessing pool. What does a corrupted/undeduplicated/mis-tokenized shard do to gradients, and how would you check WITHOUT reading 40GB by eye?

✅ Model solution

Ordered hypotheses. (1) Bad data shard — boundary coincidence + grad-first signature + different preprocessing pool: prior ~70%. (2) bf16 numeric instability amplified by the shard's distribution (the overflow counters halving loss-scale support this as the MECHANISM riding on cause 1). (3) LR-schedule kink at 41k (check, it's cheap — but schedules rarely have step-function changes mid-decay). (4) Hardware (a flaky GPU producing NaNs — would usually show as rank-localized grad anomalies, not synchronized).

Cheapest probes, in order. (1) Resume from the 40k checkpoint with shard 41 SKIPPED — one config change; if training sails past 41,200-equivalent tokens, the shard is convicted (deterministic data order makes this experiment possible at all). (2) While that runs: stats on shard 41 vs shard 40 — token-length histogram, repeated-sequence rate, tokenizer round-trip failures, % non-language bytes; a dedup failure or binary contamination shows up in minutes. (3) Per-rank grad-norm breakdown at the spike: synchronized across ranks = data; localized = hardware.

Recovery. Resume at 41k with shard skipped (lose 200 steps, ~minutes), tighter grad-clip for the next few thousand steps as a belt, and requeue shard 41 for re-preprocessing through the GOOD worker pool, re-inserting later in the schedule. Don't fast-forward the optimizer past corrupted moments: the 41k checkpoint predates contamination (verify by grad-norm history).

Prevention. Data validation gates per shard before admission (token stats, dedup rate, tokenizer round-trip, contamination heuristics — the same checks probe 2 ran, run ALWAYS); grad-norm alerting (page on z-score, don't wait for NaN); spike-skip logic (auto-skip batch + log when grad-norm > k×rolling-median); preprocessing-pool version pinning so "different worker pool" can't silently mean "different code"; loss-scale/overflow-counter dashboards for bf16/fp8 runs.

Common wrong answers. "Lower the LR and restart from scratch" (burns the run, no diagnosis); "switch to fp32" (10× cost to mask a data bug); "it recovered after NaN so continue" (Adam moments are corrupted; the run is silently damaged).

✓ Remember

Telemetry ORDER is causal evidence: grad-norm before loss = data or numerics, not the objective.
Deterministic data order is what makes the skip-shard experiment possible — determinism is an ops feature.
Checkpoint-resume experiments are the cheapest, highest-information probes in training debugging.

Level	Answer signature
Junior	"Lower the learning rate / add gradient clipping" without diagnosis.
Senior	Reads the grad-first signature, suspects the shard, designs the skip-shard resume probe and shard stats, recovers from 41k.
Staff	Adds the per-rank hardware discrimination, the loss-scale-counter read for the bf16 mechanism, the prevention pipeline (gates, alerts, version pinning), and articulates why post-NaN continuation is unsafe (poisoned optimizer state).

PART III · DEBUG CHALLENGES

Debug: offline AUC up, online CTR down

🎯A model that wins offline and loses online is not a better model — it is a better liar; learn the five-cause taxonomy and you will never be fooled again.

You ship a new ranker that beats the champion by +1.5 percentage points of AUC in every offline slice. Five percent of traffic goes to it. A day later the war room opens: CTR is down 2 %, time-spent is flat. This chapter makes that scenario a solvable puzzle, not a fire drill. It introduces the canonical five-cause taxonomy for offline/online divergence, drills the evidence-mapping habit, and designs the experiment that tells the causes apart.

The scenario — war room briefing

The following facts are on the table when you enter the room. Read them carefully before you form any hypothesis.

Signal	Control (old ranker)	Treatment (new ranker)	Delta
Offline AUC (held-out set)	0.782	0.797	+1.5 pp
Online CTR (A/B, 24 h)	4.10 %	4.02 %	−2.0 %
Time-spent per session	8.2 min	8.2 min	0 %
Calibration at score > 0.8	predicted ≈ actual	predicted 1.4× actual	overconfident
p99 serving latency	38 ms	39 ms	+1 ms (noise)
Feature logging version	v12	v13 (same release)	changed

What changed in this release (single deploy, 14:00 yesterday):

New model weights (retrained on 6 weeks of logged data).
Feature logging schema bumped from v12 → v13: three categorical features renamed, one numeric feature rescaled ÷10.
No A/B infrastructure change; split is deterministic by user-id hash.

Calibration detail. The calibration plot below summarises the gap. Predicted scores cluster in the 0.80–0.95 bucket at 1.4× the rate of actual clicks. Below 0.5, calibration is nearly perfect.

Calibration curves: old ranker (solid) vs new ranker (dashed) across predicted score deciles. Gap widens sharply above 0.8.

Your task

Enumerate the five classic causes of offline/online divergence from memory.
Fit each against THIS evidence (calibration plot + simultaneous logging change) — which two survive?
Design the single cheapest experiment that discriminates between the survivors.

Hint 1 — why does the same-release logging change matter so much?

The new ranker was trained on features logged by the OLD pipeline, but serves on features computed by the NEW one. What's that called, and what would it do to score distributions?

Hint 2 — what does overconfidence at HIGH scores break downstream?

Scores don't just rank — they feed a value formula and thresholds. If the top decile's probabilities run hot, what happens to the blend with other objectives and to any score-gated behavior?

✅ Model solution

The taxonomy (recite it, always). (1) Training-serving skew — features differ between train and serve paths. (2) Offline-eval leakage/exposure bias — labels exist only for what the OLD policy showed, so offline replay rewards imitating it. (3) Calibration shift breaking downstream consumers — combination formulas and thresholds need probabilities, AUC doesn't. (4) Position-bias mismatch between replay and live. (5) Metric mismatch — AUC is ranking-only; CTR is absolute behavior under an ecosystem.

Fit to evidence. The logging change in the same release makes (1) the prime suspect — the model trains on old-pipeline features and serves on new-pipeline ones; even small definition drift (null handling, window boundaries) shifts the score distribution… which is also exactly what an overconfident-at-the-top calibration plot shows, implicating (3) as the damage mechanism: hot top-decile probabilities over-weight this model's head in the value formula and push wrong items past thresholds. (2)/(4) can't be excluded by this evidence but don't explain the calibration signature; (5) is always true but doesn't explain a REGRESSION.

The discriminating experiment. Dual-scoring on live traffic: score the SAME requests with the new model fed by (a) old-pipeline features (reconstructed/logged) and (b) new-pipeline features. If score distributions diverge between (a) and (b), skew is convicted, and the diff localizes WHICH features moved (compare per-feature distributions where the two pipelines disagree). One day of shadow compute, no user impact. In parallel: recalibrate (isotonic on recent live data) and re-read the value-formula output distribution — if recalibration alone restores sane blending in offline simulation, (3) was carrying the regression.

Fix & prevention. Fix: align the feature definitions (or retrain on new-pipeline logged features), recalibrate per head, re-canary. Prevention: logged-at-scoring features as the contract (the change couldn't have diverged train from serve), feature parity tests in CI between pipeline versions, calibration as a launch gate (predicted/observed ratio per decile must sit in bounds), and never shipping a model and a feature-pipeline change in the same release — separate the variables.

Common wrong answers. "The model overfit" (doesn't explain the release coupling), "roll back the model only" (if the pipeline changed, the OLD model now skews too — roll back as a PAIR or fix forward consciously), "AUC went up so the model is better, the metric is wrong" (the value formula is the product; calibration is part of the model's job).

✓ Remember

The five-cause taxonomy is the expected recitation — lead with it.
Same-release confounds are convicted by dual-scoring the same traffic both ways.
Calibration is a launch gate because scores are added and thresholded, not just sorted.
Never bundle model and feature-pipeline changes in one release.

Level	Answer signature
Junior	"Offline metrics don't always transfer" + retrain suggestion.
Senior	Recites the taxonomy, fits evidence to skew+calibration, designs the dual-scoring discrimination, names recalibration.
Staff	Adds the rollback-as-a-pair insight, logged-at-scoring as the structural prevention, calibration gates with per-decile bounds, and the release-hygiene rule (one variable per launch).

PART III · DEBUG CHALLENGES

Debug: GPU serving throughput collapsed at long context

🎯Context length is a silent multiplier on the KV cache — quadruple the context, quarter the batch, crater the throughput.

Your 13B chat service (GQA, 40 layers, 8 KV heads, head_dim 128, fp16 KV) ran beautifully at a 2k-token context cap. Product launched 32k-context document upload yesterday. Since then: throughput per GPU down ~70%, OOM kills roughly hourly, TTFT fine for short prompts but p99 TPOT is awful for everyone. Nothing else changed.

The scenario telemetry

Metric	Before (2k cap)	After (32k cap)
Output tokens/sec/GPU	2,400	710
Mean concurrent sequences/GPU	38	9
KV-cache occupancy	62%	97% (sawtooth, OOM kills hourly)
p99 TPOT (short-prompt users)	48ms	210ms
Long-context requests share	—	7% of traffic

Your task

Explain WHY 7% long-context traffic destroyed 70% of throughput — with the memory arithmetic.
Explain the p99 TPOT damage to SHORT-prompt users specifically.
Give same-day mitigations, then the two-quarter redesign, then the conversation to have with product.

Hint 1 — compute per-token KV bytes for this model

KV per token = 2 (K and V) × layers × kv_heads × head_dim × bytes. Then multiply by 2k vs 32k and compare against the batch the GPU used to hold.

Hint 2 — what does a 32k prefill do to everyone else's decode?

A 32k-token prefill is a multi-second compute burst. If it shares an engine iteration with 30 decoding sequences, what happens to their inter-token interval?

✅ Model solution

The memory math. Per token: 2 × 40 × 8 × 128 × 2B = 164KB. At 2k context: ≤0.33GB per sequence — 38 concurrent sequences ≈ 12.5GB of KV, comfortable. At 32k: up to 5.2GB per sequence — 16×. A handful of long sessions (7% of traffic, but they LINGER — long documents mean long conversations) eat the KV pool: occupancy pins at 97%, admission stalls, concurrency collapses 38→9, and since decode throughput scales with batch, tokens/sec falls proportionally (2,400→710 ≈ the concurrency ratio). The hourly OOMs are reservation overflow on max-length sequences colliding at peak.

The p99 story. Two mechanisms: (1) prefill interference — a 32k prefill is seconds of compute-bound work; every decode sequence co-scheduled with it stalls, so short-prompt users' TPOT spikes whenever any long request arrives (tail-shaped harm, mean barely moves); (2) KV pressure evictions/admission queuing add jitter. The signature "TTFT fine, TPOT awful" says decode is the victim, not prompt processing capacity.

Same-day mitigations. Cap effective context server-side (truncate middle, keep head+tail) below the product cap while you fix; admission-control long requests into a concurrency-limited lane (max 1-2 concurrent long sequences per GPU); enable chunked prefill (512-token chunks interleaved with decode — kills the TPOT spikes); quantize KV to fp8 (halves the 164KB/token, doubling effective pool); turn on/verify paged KV (no max-length reservations → OOMs stop, fragmentation waste <5%).

The redesign. Separate pools: a long-context pool (few, KV-heavy, possibly bigger-HBM GPUs) and the existing short-context fleet — route by prompt length; or full prefill/decode disaggregation if long-context traffic grows. Prefix caching for re-queried documents (the same uploaded doc across turns re-uses its KV — huge for doc-chat). Session TTLs and idle-KV eviction with recompute-on-return. Capacity model updated: price a request by context-tokens-held×time, not request count.

The product conversation. 32k context costs ~16× the KV residency of 2k: either it's priced/tiered (premium feature, stricter rate limits), or capacity grows accordingly — show the \$/request delta. Also surface that 90% of uploaded docs fit retrieval: RAG-style retrieve-then-answer over the doc costs a fraction of full-context stuffing for most queries — context length is a product decision wearing an infra costume.

Common wrong answers. "Add GPUs" (linear cost for a problem with 16× structure — fix utilization first); "lower max_tokens" (output length isn't the driver, context residency is); "the model is too slow, distill it" (weights didn't change; the cache did).

✓ Remember

KV per token = 2·layers·kv_heads·head_dim·bytes — memorize the formula, derive the incident.
Long-context harm is twofold: KV residency (batch collapse) and prefill interference (tail latency).
Mitigation ladder: caps → chunked prefill → fp8 KV → paged attention → pool separation → disaggregation.
Context length is a pricing/product decision; bring the per-request cost delta to that meeting.

Level	Answer signature
Junior	"Long prompts are slower; add GPUs or shrink the model."
Senior	Does the 164KB/token → 16× math, separates the two mechanisms (residency vs interference), proposes chunked prefill + paged KV + caps.
Staff	Adds the lane/pool architecture with routing by length, prefix caching for doc-chat, the pricing conversation with per-request economics, and the RAG-instead-of-stuffing product reframe.

PART IV · DRILLS

Capacity-estimation drills

🎯Every back-of-envelope answer lives or dies on one formula: demand ÷ (unit capacity × utilization).

Capacity estimation is the first thing an interviewer asks when sizing any new system — and the first thing on-call engineers reach for at 3 am. This chapter gives you eight drills covering GPUs, memory, network, storage, and cost. Each prompt stands alone; attempt it before opening the solution. The universal recipe and numbers-to-memorize table at the end make every future estimate faster.

📐 How to use these drills

Protocol: read the prompt, open a blank doc, write your answer (even rough numbers), then open the solution.

Set a 10-minute timer per drill.
Write every intermediate number down — interviewers watch the reasoning, not just the answer.
After reading the solution, compare where your estimate diverged, not just the final number.

Never: open the solution first and convince yourself you would have got there. You would not.

Drill 1 — GPUs for a ranking service at 5,000 QPS

Prompt: Your ranking model costs 2 GFLOP per request (measured). Your serving hardware is A100 80GB GPUs. Peak load is 5,000 requests/second. Target GPU utilization is 60% (leave headroom for spikes). How many GPUs do you need?

Attempt it now. Then open the solution.

Hint — what do you need to look up?

An A100 delivers roughly 312 TFLOPS of FP16 dense throughput (or ~77 TFLOPS FP32). Ranking models are typically FP16. Also note: "GFLOP per request" is a one-shot cost — multiply by QPS to get FLOP/s demand.

✅ Model solution

$$\text{GPUs} = \frac{\text{demand (FLOP/s)}}{\text{GPU throughput (FLOP/s)} \times \text{utilization}}$$

demand = FLOPs per request × requests per second; divide by effective GPU throughput after haircut

Step 1 — compute demand.
5,000 req/s × 2 × 10⁹ FLOP/req = 10 × 10¹² FLOP/s = 10 TFLOP/s

Step 2 — effective GPU capacity.
312 TFLOP/s × 0.60 utilization = 187 TFLOP/s usable per GPU

Step 3 — divide.
10 TFLOP/s ÷ 187 TFLOP/s = 0.053 GPUs

Step 4 — sanity check. Less than one GPU? Yes — 2 GFLOP/req is a tiny model (think a two-tower dot-product ranker or a shallow MLP). A single A100 could serve ~93,000 req/s at 60% util. Round up to 1 GPU minimum; in practice deploy at least 2 for redundancy.

Staff+ add: This ignores batching overhead and memory bandwidth limits. Real serving is often memory-bandwidth-bound, not compute-bound, for small-batch inference. Measure actual throughput with nsight or vLLM benchmarks before committing to a fleet size.

✓ Remember

Demand (FLOP/s) = FLOP/req × QPS. Then divide by effective GPU FLOP/s.
A100 FP16 peak ≈ 312 TFLOP/s; MFU for real workloads is 40–60%.
A "2 GFLOP" model at 5k QPS is trivially small — sense-check every answer.

Drill 2 — KV-cache memory for 64 concurrent sessions on a 13B model

Prompt: You are serving a 13B-parameter model with 40 transformer layers, 40 attention heads, head dimension 128, stored in FP16. You want to hold 64 concurrent sessions, each with up to 8,192 tokens of context. How much GPU memory is consumed by the KV cache alone?

Hint — the KV-cache formula

Each token stores a K vector and a V vector per layer. Size per token = 2 (K+V) × num_layers × num_heads × head_dim × bytes_per_element.

✅ Model solution

$$\text{KV bytes/token} = 2 \times L \times H \times d_h \times \text{bytes}$$

2 for K and V; L = layers; H = heads; d_h = head dimension; bytes = 2 for FP16

Step 1 — bytes per token per session.
2 × 40 layers × 40 heads × 128 dim × 2 bytes = 2 × 40 × 40 × 128 × 2
= 2 × 409,600 = 819,200 bytes ≈ 0.8 MB per token

Step 2 — per session (8,192 tokens).
0.8 MB × 8,192 = 6,553 MB ≈ 6.4 GB per session

Step 3 — total for 64 sessions.
6.4 GB × 64 = 409 GB

Step 4 — sanity check. Model weights alone ≈ 13B × 2 bytes = 26 GB. The KV cache for 64 sessions at 8k tokens dwarfs the weights by 15×. On a single A100 (80 GB), this is impossible — you can hold weights + maybe 6–7 sessions. For 64 sessions you need ≥6 A100s dedicated to KV, or you use paged KV eviction (vLLM-style) to multiplex.

Staff+ add: With Grouped Query Attention (GQA, 8 KV heads instead of 40), the per-token KV drops 5× to ~0.16 MB — 64 sessions costs ~82 GB, feasible on 2 A100s. GQA is why Llama-2 70B is deployable.

✓ Remember

KV per token = 2 × L × H × d_h × 2 bytes (FP16). For 13B this is ~0.8 MB.
Long context + large batch = KV memory dominates, not weights.
GQA/MQA reduces KV by 4–8× — know this lever.

Drill 3 — Embedding table for 500M users, 128-d FP16, with sharding

Prompt: You have 500 million user IDs, each represented by a 128-dimensional FP16 embedding. (a) How much total RAM does the table require? (b) You plan to shard it across servers with 64 GB RAM each (leaving 16 GB for the OS/other). How many shards? (c) What happens to lookup latency as you add shards?

Hint

FP16 = 2 bytes. Total size = num_rows × embedding_dim × bytes_per_element. Available RAM per shard = 64 − 16 = 48 GB.

✅ Model solution

(a) Table size.
500 × 10⁶ rows × 128 dims × 2 bytes = 128 × 10⁹ bytes = 128 GB

(b) Number of shards.
Available RAM/shard = 64 − 16 = 48 GB
Shards needed = ⌈128 ÷ 48⌉ = ⌈2.67⌉ = 3 shards

(c) Latency implication. Every request that needs user embeddings must fan out to all shards that hold its user IDs. With hash-based sharding: a single-user request hits exactly 1 shard (O(1) fan-out). A batch of B users hits up to min(B, S) shards — at B=100, all 3. The critical cost is the tail latency across shards: you wait for the slowest one. With 3 shards the overhead is modest; at 50+ shards it dominates.

Staff+ add: In practice embedding tables use FP32 at training time but FP16 or even INT8 at serving. INT8 halves the table to 64 GB → 2 shards. Also consider replication (read replicas per shard) vs. sharding for read-heavy workloads.

✓ Remember

Embedding table = rows × dim × bytes. 500M × 128 × 2 = 128 GB.
Sharding introduces fan-out; latency = max(shard latencies), not sum.
INT8 quantization halves serving memory with <1% quality loss on ID embeddings.

Drill 4 — Kafka partitions for 2M events/second

Prompt: Your event bus must ingest 2,000,000 events per second, each event 1 KB. A single Kafka partition on your hardware can sustain 50 MB/s write throughput reliably. You want a 20% safety margin. How many partitions do you need?

Hint

Convert events/s to MB/s first, then account for the safety margin before dividing.

✅ Model solution

Step 1 — throughput in MB/s.
2,000,000 events/s × 1 KB/event = 2,000,000 KB/s = 1,953 MB/s ≈ 2,000 MB/s

Step 2 — apply safety margin.
Effective partition capacity = 50 MB/s × (1 − 0.20) = 40 MB/s

Step 3 — partition count.
⌈2,000 ÷ 40⌉ = ⌈50⌉ = 50 partitions

Step 4 — sanity checks.
• Consumer side: if you have 50 partitions and each consumer thread reads one partition, you need ≥50 consumer threads total.
• Replication: with replication factor 3, each broker handles 50 × 3 / num_brokers writes. Size your broker cluster accordingly.
• Partition count is essentially permanent in Kafka — over-provision slightly; 64 or 128 (powers of 2) are conventional round numbers.

Staff+ add: Partition count also determines maximum parallelism for consumers. If downstream processing is the bottleneck (not ingest), you may want more partitions even if 50 suffices for throughput. Kafka can reassign partitions but it's disruptive — set this right at cluster creation.

✓ Remember

Partitions = ⌈total_MB/s ÷ (partition_MB/s × (1 − margin))⌉.
Partition count caps consumer parallelism — size for the bottleneck.
Pick powers of 2 for easy key-hashing and future expansion.

Drill 5 — Training time for 7B model on 1T tokens with 256 H100s

Prompt: You want to train a 7-billion-parameter transformer on 1 trillion tokens using 256 × H100 GPUs. H100 SXM delivers roughly 1,000 TFLOP/s FP8 (or ~500 TFLOP/s BF16). Assume BF16 training and a Model FLOP Utilization (MFU) of 45%. Use the 6ND approximation (6 × num_params × num_tokens FLOPs for a forward+backward pass). How long does training take?

Hint — the 6ND rule

Total FLOPs ≈ 6 × N × D where N = parameters and D = tokens. The factor 6 accounts for forward (2ND) plus backward (4ND — two passes for gradients).

✅ Model solution

$$\text{Training time} = \frac{6 \times N \times D}{\text{num\_GPUs} \times \text{GPU\_FLOP/s} \times \text{MFU}}$$

N = parameters; D = tokens; MFU = model FLOP utilization (fraction of peak actually used)

Step 1 — total FLOPs.
6 × 7 × 10⁹ × 10¹² = 6 × 7 × 10²¹ = 4.2 × 10²² FLOPs

Step 2 — effective cluster throughput.
256 GPUs × 500 × 10¹² FLOP/s × 0.45 MFU = 256 × 225 × 10¹² = 57,600 TFLOP/s = 5.76 × 10¹⁶ FLOP/s

Step 3 — training time in seconds.
4.2 × 10²² ÷ 5.76 × 10¹⁶ = 729,167 seconds ≈ 8.4 days

Step 4 — sanity check. Llama-2 7B was trained on 2T tokens with more GPUs; 8 days for 1T tokens on 256 H100s is plausible. Real runs add ~10–15% for checkpoint overhead, data loading stalls, and hardware failures — budget ~10 days.

Staff+ add: MFU of 45% is optimistic for 256 GPUs across nodes; inter-node all-reduce on 400 Gb/s InfiniBand costs real time. At 1024 GPUs MFU often drops to 35–38%. Always measure MFU on a short pilot run before committing the full budget.

✓ Remember

Total FLOPs = 6ND. For 7B on 1T tokens: 4.2 × 10²² FLOPs.
H100 BF16 ≈ 500 TFLOP/s peak; realistic MFU 35–50%.
Time = FLOPs ÷ (GPUs × peak × MFU). Always state your MFU assumption.

Drill 6 — ANN index RAM for 100M × 768-d vectors with PQ compression

Prompt: You have 100 million vectors, each 768 dimensions, originally in FP32. You want to build an ANN (Approximate Nearest Neighbor) index using Product Quantization (PQ) with 96 sub-quantizers and 256 centroids per sub-quantizer (PQ96×8 — "8" means 8 bits per code). (a) Raw FP32 size of the corpus? (b) PQ-compressed size? (c) Compression ratio?

Hint — how PQ works

PQ splits each 768-d vector into 96 sub-vectors of 8 dimensions each, then replaces each sub-vector with the index of the nearest centroid (0–255). Each centroid index fits in 1 byte. So the compressed code for one vector = 96 bytes.

✅ Model solution

(a) Raw FP32 size.
100 × 10⁶ vectors × 768 dims × 4 bytes = 307 GB

(b) PQ-compressed size.
Each vector → 96 sub-quantizer codes × 1 byte = 96 bytes/vector
100 × 10⁶ × 96 = 9,600 MB = 9.4 GB
Plus codebook storage: 96 sub-quantizers × 256 centroids × 8 dims × 4 bytes = 96 × 256 × 32 bytes = 786,432 bytes ≈ 0.8 MB (negligible)

(c) Compression ratio.
307 GB ÷ 9.4 GB ≈ 32.7×

Sanity check and tradeoffs. 9.4 GB fits comfortably in one server's RAM (even a commodity box). The cost is recall loss — PQ introduces quantization error, so ANN recall@10 drops from ~99% (flat exact search) to ~90–95% depending on data distribution. You can recover recall by re-ranking the top-K PQ candidates with exact FP32 distances for the top few results.

Staff+ add: FAISS IVF-PQ adds an inverted file index (centroid-based coarse partitioning) to prune search to ~1% of the corpus. Combined with PQ, index RAM stays 10 GB while search time scales sub-linearly. HNSW is an alternative: higher RAM (~50 GB for this corpus) but better recall and no training step.

✓ Remember

Raw FP32 corpus = N × D × 4. For 100M × 768: 307 GB.
PQ code = num_sub_quantizers bytes/vector. PQ96: 96 bytes → 32× compression.
Compression trades recall for RAM. Always benchmark recall@K before shipping.

Drill 7 — Daily cost of a 100-replica A100 serving fleet

Prompt: You run a serving fleet of 100 replicas, each with 8 × A100 80GB GPUs. Cloud on-demand price for an 8-GPU A100 instance is \$32/hour (typical AWS p4d.24xlarge). You run 24 hours/day. (a) Daily cost? (b) Monthly cost? (c) What is the cost per 1M tokens if the fleet serves 50M tokens/hour?

Hint

Daily cost = replicas × hourly_rate × 24. Cost per token = daily_cost ÷ daily_tokens.

✅ Model solution

(a) Daily cost.
100 replicas × \$32/hr × 24 hr = \$76,800/day

(b) Monthly cost.
\$76,800 × 30 = \$2,304,000/month ≈ \$2.3M/month

(c) Cost per 1M tokens.
Total tokens/day = 50M tokens/hr × 24 hr = 1,200M = 1.2B tokens/day
Cost per token = \$76,800 ÷ 1,200,000,000 = \$0.000064/token
Cost per 1M tokens = \$0.000064 × 10⁶ = \$64 per 1M tokens

Sanity check. GPT-4 API was priced at ~\$30/1M tokens (output) in 2024; \$64 is high but plausible for a self-hosted large model without optimization. With reserved instances (1-year commitment) the hourly rate drops ~40% → ~\$38/1M tokens. Speculative decoding or distillation to a smaller model can halve serving cost.

Staff+ add: On-demand pricing is worst-case. Production fleets typically use: (1) reserved instances for baseline load (40% discount), (2) spot instances for burst (70% discount, but must handle interruptions), (3) batching improvements to raise token throughput per GPU. A mature serving team targets \$5–15/1M tokens for a 70B-class model.

✓ Remember

8×A100 instance ≈ \$32/hr on-demand; 100 replicas → \$76.8k/day.
Cost/token = total_spend ÷ total_tokens. Always compute this for LLM products.
Reserved + spot + batching is the standard cost-reduction trifecta.

Drill 8 — Allreduce time for 14 GB gradients on 400 Gb/s InfiniBand

Prompt: During distributed training, you need to allreduce gradient tensors totalling 14 GB across 256 GPUs connected via 400 Gb/s InfiniBand (per-link). A ring-allreduce transfers 2 × (N−1)/N × data bytes per GPU (≈ 2× data for large N). How long does one allreduce take, ignoring latency? How does this compare to a 10-ms compute step?

Hint — ring-allreduce math

In ring-allreduce, each GPU sends and receives (N−1)/N × data in reduce-scatter, then (N−1)/N × data in allgather. Total traffic per link ≈ 2 × data (for large N). Time = (2 × data) ÷ bandwidth.

✅ Model solution

Step 1 — convert bandwidth.
400 Gb/s = 400 ÷ 8 = 50 GB/s per link

Step 2 — traffic per GPU (ring-allreduce, large N).
≈ 2 × 14 GB = 28 GB per GPU

Step 3 — time.
28 GB ÷ 50 GB/s = 0.56 seconds

Step 4 — compare to compute step.
0.56 s ÷ 0.010 s = 56× longer than one compute step

Why this matters. This is why naive data parallelism at large scale is cripplingly slow. With gradient compression (1-bit SGD, TopK sparsification) or mixed precision with smaller gradients (16-bit vs 32-bit), you can cut this 2–4×. More importantly, pipeline parallelism and gradient accumulation are used to overlap allreduce with the next micro-batch's forward pass, hiding most of the communication cost.

Staff+ add: 400 Gb/s InfiniBand per link doesn't mean per GPU — check whether it's the switch uplink or the NIC bandwidth. NVLink within a node (600 GB/s bidirectional) is far faster than inter-node IB. Topology matters: fat-tree vs. dragonfly affects allreduce behavior at 256+ nodes. Real MFU degradation from communication is the main reason large-scale training MFU is 35–45%, not 80%.

✓ Remember

Ring-allreduce traffic per GPU ≈ 2 × gradient_size. Time = traffic ÷ link_BW.
400 Gb/s IB = 50 GB/s. 14 GB grads → 0.56 s — huge vs. a compute step.
Overlap communication with compute (pipeline parallelism) is essential at scale.

PART IV · DRILLS

On-call triage simulations (rapid scenarios)

🎯In the first 15 minutes of an ML incident you are not a scientist — you are a firefighter: pull the cheap reversible lever first, diagnose second.

Six pager-style scenarios, each with a 15-minute clock. Read the page, set a timer, say your actions out loud in order, then open the model answer. The grading axis is not whether you find root cause — it is whether your first moves are cheap, reversible, and blast-radius-shrinking. Interviewers run these to separate people who have actually carried a pager from people who have only read about it.

📐 If you get an on-call scenario — the rule

Trigger: "You're on call, X just fired, what do you do?"

Stop the bleeding — name the cheapest reversible action (rollback, disable flag, block the promote, freeze the deploy) and state whether you'd pull it now or after one confirming check.
Confirm it's real and scope the blast radius — one query, two minutes: is a user-facing metric moving, or only an internal alert? Which traffic slice?
Check "what changed?" — deploys, model promotes, config flips, upstream releases in the last few hours. 80% of incidents are a change, not a drift.
Communicate — one line in the incident channel: impact, action taken, next check.
Only then diagnose — root cause is a daytime activity once users are safe.

Never: open a notebook and start exploratory analysis while the system is still degrading. "I'd look at the data" with no mitigation is the canonical junior failure.

Simulation 1 — Drift alert fired, business metrics flat

03:40, Tuesday. PagerDuty: feature_drift_psi_high on the fraud model — PSI for device_age_days jumped from 0.04 to 0.31 over the last 6h window. You pull up the dashboards: fraud approval rate, chargeback proxy, and score distribution percentiles are all within normal daily bands. The alert has fired 3 times this month; the last two were marked "no action" by other on-calls.

Your task

What do you do in the next 15 minutes? Be explicit about what you would and would not page anyone else for, and what you leave for the morning.

✅ Model answer — Simulation 1

Cheap reversible action first: none needed yet — and saying so is the correct call. Nothing user-facing is moving, so the reversible action is "don't touch production at 3am." But you do NOT silence and go back to sleep either; a flat top-line with a drifted input can mean the model has simply stopped using that feature's signal, and fraud losses show up with a multi-week lag (chargebacks). Flat-right-now is not flat.

Minutes 0–5: confirm the alert is computed correctly — is the PSI spike in the live feature values or in the reference window? Check whether device_age_days moved for all traffic or one slice (one country, one app version). A single-slice jump after an app release is usually a logging change, not fraud.

Minutes 5–10: check "what changed": app releases, upstream schema deploys, and whether the feature's null/default rate jumped (a drift alert is often a missing-data alert in disguise — values collapsing to a default of 0 shifts the distribution violently). If null rate spiked, this becomes Simulation 2 and you escalate to the owning team's on-call.

Minutes 10–15: if it's genuinely a population shift with healthy logging: annotate the alert with findings, file a morning ticket to (a) check model performance on the shifted slice once labels arrive, and (b) re-tune the PSI threshold or reference window, because an alert with a 100% "no action" history is training the team to ignore the one that matters.

Why this is correct: the staff-level move is treating alert quality as part of the incident. You neither dismissed it (lagged-label risk) nor took a 3am production action with zero user impact (unjustified risk). Common wrong answers: "retrain immediately" (retraining on a possibly-corrupt feature bakes the problem in) and "mute the alert" (destroys the safety net).

Simulation 2 — Upstream team broke a feature contract, silently

14:10, Thursday. No alert fired. A PM messages you: "Recommendations look generic since lunch?" CTR on the home feed is down 8% over 3 hours. Your service deployed nothing today. Digging in, you notice the user_engagement_7d feature is now null for ~60% of requests; the ranker imputes nulls to 0. The upstream events team shipped a pipeline migration at 11:30 that renamed a field from engagement_score to engagementScore. Their pipeline is "green."

Your task

Next 15 minutes. The upstream team says a revert of their migration needs ~2 hours of backfill to be safe. What do you do meanwhile, and in what order?

✅ Model answer — Simulation 2

Cheap reversible action first: you cannot roll back upstream quickly, so mitigate on YOUR side: change the null-handling for this feature from "impute 0" to "impute the population median / last known per-user value," or — often better — flip the ranker to a fallback model or config that excludes the broken feature. Both are config-level, reversible in minutes, and turn a corrupted signal into a merely missing one. A model fed plausible defaults degrades gracefully; a model fed 60% zeros on a heavy-weight feature is actively wrong, because "0 engagement" is a strong negative signal the model learned from real users.

Minutes 0–5: pull the mitigation lever (fallback config), verify CTR on a canary slice recovers direction-wise. Open an incident, severity matching 8% CTR: this is revenue-visible.

Minutes 5–10: coordinate with upstream: get a committed ETA for revert-plus-backfill, and confirm whether the rename hit other consumers (fraud? ads?) — you may be the first to notice a multi-team incident. Blast-radius discovery is on-call work, not politeness.

Minutes 10–15: guard the future: if any training/retraining job consumes this feature, pause scheduled retrains now — three hours of 60%-null data flowing into a training set is how today's incident becomes next week's "model mysteriously worse." Write the timeline in the channel.

Why this is correct: "their pipeline is green" and "your model is fine" can both be true while the contract between them is broken — schema is part of the interface, and nulls imputed to 0 convert a data bug into a model-behavior bug. Staff+ answers add the prevention: schema contracts with CI checks on the producer side, and a null-rate alert per critical feature on the consumer side (you found this via a PM, 3 hours late — that is itself a finding). Common wrong answer: waiting 2 hours for the upstream revert with no consumer-side mitigation.

Simulation 3 — Tonight's retrained model is 40% smaller than usual

06:55, Monday. The nightly retrain of the ranking model finished green. A weak sanity alert (warning, not page) notes the model artifact is 410 MB; for the last 30 days it has been 680 MB ± 3%. Offline AUC on the held-out set: 0.847 vs last week's 0.851 — within normal noise. The auto-promote job pushes the new model to production at 07:30. It is 06:55.

Your task

You have 35 minutes of wall-clock but answer for the next 15. What is your very first action, and why is the "AUC looks fine" argument a trap?

✅ Model answer — Simulation 3

Cheap reversible action first — and it is the whole answer: BLOCK THE PROMOTE. Pause the 07:30 auto-promotion before doing anything else. This costs nothing: production keeps serving yesterday's model, which was fine yesterday and is fine today. A one-day-stale ranker is a non-event; a corrupted ranker at peak traffic is an incident. The asymmetry is total, so you do not need to be sure anything is wrong — the size anomaly alone justifies the block.

Why "AUC is fine" is a trap: a 40% smaller artifact usually means 40% of something is missing — typically a chunk of the embedding vocabulary or feature hash space, because an upstream extract silently dropped a partition (one day of data, one region, or a join that brings in long-tail IDs returned empty). If the eval set is built from the same truncated data, it cannot see the hole: the missing users and items are absent from eval too. Aggregate AUC on a self-consistent-but-incomplete dataset is exactly the metric that stays green while the model quietly loses its tail. Silent data loss is the one failure class where "metrics look fine" is expected, not reassuring.

Minutes 2–15, after blocking: diff the artifact against last week's — embedding table row counts, feature vocabulary sizes, per-feature coverage. Then diff the training data: row counts per day/partition vs the 30-day norm; you will usually find the dropped partition in minutes. Determine whether the loss is upstream (a re-run fixes it) or in your extract code. Re-run once the input is whole; only promote a model whose size and slice-level evals are back in band.

Why this is correct: this is the purest test of the chapter's theme. There is a free, instantly reversible action with unbounded downside protection, available before any diagnosis. Anyone who starts with "first I'd investigate the data pipeline" has already failed — at 07:30 the bad model ships while they investigate. Staff+ adds prevention: make artifact size, embedding cardinality, and training-row counts blocking promotion gates rather than warnings.

Simulation 4 — A/B guardrail breached at 3am

03:12, Saturday. Page: the new ranker variant (at 10% traffic since Friday noon) has breached the latency guardrail — p99 went from 180 ms to 290 ms over the last hour — and the crash-rate guardrail is yellow. The experiment owner is asleep. Your org's experiment platform supports one-click ramp-down. Topline engagement in the variant still looks slightly positive.

Your task

Next 15 minutes. Do you wait for the experiment owner? Does "engagement is still positive" change your decision?

✅ Model answer — Simulation 4

Cheap reversible action first: ramp the variant to 0% now. Do not wait for the owner. A guardrail is a pre-commitment — the team agreed in advance, while calm that this threshold means stop; 3am with partial information is precisely the situation pre-commitments exist for. Re-litigating the threshold mid-breach defeats its purpose. And the action is the definition of reversible: the experiment can re-ramp Monday at the click of a button, with zero lost work beyond a weekend of data.

Why "engagement is positive" does not change the call: the comparison is asymmetric. Upside of leaving it on: a weekend of slightly better experiment data. Downside: latency degradation compounding under Saturday peak, crash-rate yellow going red, and users churning — harms that are real and partially irreversible, versus an experiment delay that is fully recoverable. Also, short-horizon engagement under elevated latency is untrustworthy: latency damage shows up in return-rate days later, not in same-hour clicks.

Minutes 0–5: ramp to 0%, confirm p99 and crash rate recover in the control population (if they do NOT recover, the variant was not the cause — un-pause your assumption and treat it as an infra incident: check deploys, hosts, traffic anomalies).

Minutes 5–15: snapshot dashboards and a few traces from the breach window so the owner can do root cause Monday without the evidence aging out of retention. Leave a clear handoff note: what breached, when you ramped down, recovery confirmation, links. No 3am debugging of the variant's latency — that is the owner's daytime job.

Why this is correct: the recovery check is the subtle senior move — ramping down is both mitigation and a diagnostic test, and if metrics do not recover you have falsified your hypothesis in five minutes. Common wrong answers: paging the experiment owner and waiting (the guardrail already encodes their decision), and killing the experiment permanently/deleting it (over-rotation: 0% ramp preserves the experiment setup and the learning).

Simulation 5 — Embedding version skew between retrieval and index

11:20, Wednesday. Gradual alarm: recall@100 of the retrieval stage (measured online against a click-based proxy) has slid 30% since ~09:00. Search "feels off" per support tickets. At 09:00, the embedding-service team deployed query-encoder v12 ("minor fine-tune, backwards compatible"). The ANN index was built last night with item embeddings from encoder v11. Nothing crashed; latencies are normal; the ANN service reports healthy.

Your task

Next 15 minutes. Why did nothing crash, and what does that tell you about where the fix must go?

✅ Model answer — Simulation 5

Cheap reversible action first: roll the query encoder back to v11. That is the change that correlates with the slide, the rollback takes minutes, and it restores the invariant that matters: queries and index must be embedded in the same vector space. Do not try to "fix forward" by rebuilding the index against v12 — a full index rebuild takes hours, and rollback gets users healthy now. Rebuild-then-coordinated-cutover is the daytime plan.

Why nothing crashed — and why that is the lesson: v11 and v12 produce vectors of the same dimensionality, so every API contract is satisfied: the encoder returns 768 floats, the ANN index happily computes nearest neighbors, everything is type-correct and "healthy." But two encoders trained separately — even a fine-tune of the same base — produce geometrically incompatible spaces: distances between a v12 query vector and v11 item vectors are meaningless. The system fails semantically while passing every syntactic check. This is the classic ML-systems failure: the contract that broke (same embedding space) was never written down as a contract, so no monitor owned it. Only a quality metric (recall proxy) could catch it — which is exactly why you have one.

Minutes 0–5: confirm the timeline (slide starts at the v12 deploy), then roll back the query encoder. Verify recall@100 recovers over the next minutes of traffic.

Minutes 5–15: message the embedding team — "backwards compatible" needs a precise definition for encoders, and other consumers (recs? ads retrieval?) may be skewed too. Sketch the proper rollout for v12: rebuild the index with v12 item embeddings, then cut query traffic and index over atomically (or run dual-encoding during transition). File for prevention: version-pin the encoder ID into the index metadata and have the retrieval service refuse — or alarm — on mismatch at request time.

Why this is correct: rollback dominates because it is minutes-cheap and restores a known-good invariant, while every fix-forward path is hours-long. Staff+ answers name the general principle: any system with a trained artifact on both sides of an interface (two-tower retrieval, reranker + candidate gen, encoder + index) needs compatibility versioning, not just API versioning. Common wrong answer: restarting/rebuilding the ANN service — it is behaving perfectly; the bug is in the space, not the search.

Simulation 6 — the cost dashboard tripled overnight

03:40 page: cloud spend rate 3× baseline since 02:10. Traffic is normal. What do you do in the next 15 minutes?

Model answer

Cheap and reversible first: open the autoscaler dashboard — is replica count 3× normal? If yes: cap max replicas at a sane ceiling NOW (reversible, stops the bleed) and look for what's driving scale-out with flat traffic: the classic is a retry storm — a dependency started failing/slow at 02:10, clients retried, synthetic load tripled, autoscaler obliged. Check dependency error rates and retry counts at 02:10; if found, enable/lower the circuit breaker and fix or fail-fast the dependency. Second classic: a deploy at ~02:00 with a slow code path (3× CPU per request) — same traffic, 3× compute; check deploy timeline, roll back. Third: a batch/training job scheduled onto the serving account (check per-namespace breakdown). What you do NOT do at 3am: optimize instance types, renegotiate pricing, or kill replicas below safe capacity — the ceiling cap plus cause isolation is the whole 15-minute job, and the cost of one wasted night of compute is far below the cost of a capacity outage at morning peak.

📐 The on-call theme — the rule

Reversible, bounded actions first: rollback, cap, disable, block-the-promote. Diagnose AFTER the bleeding stops.
Never promote/deploy anything new during triage — including "quick fixes."
The timeline question ("what changed at T?") solves most incidents; keep deploys, config pushes, and data-pipeline events on ONE timeline.
If the action is irreversible (deleting data, failing over a region), it needs a second person — at any hour.

✓ Remember

Alert → stop the bleed → localize by timeline → fix → prevent: in that order, every sim above.
Blocking a bad model promote is free; un-shipping a bad model from prod logs is not.
Retry storms and autoscale runaways turn small failures into big bills — caps and budgets are pre-decided, not improvised.

PART IV · DRILLS

Grading rubrics + how to practice

🎯You cannot grade yourself unless you know what grade looks like — this chapter makes the scoring transparent so every solo rep gets honest feedback.

Design and debug challenges only sharpen you if you evaluate your answers honestly and at the right altitude. This chapter exposes the full grading rubric — what separates a junior answer from a principal answer on the same question — then gives you a concrete solo-practice protocol to use for every chapter in this course, and finally maps each challenge type to the course chapters you should revisit when you miss.

The four altitude levels — what each one means

Every ML-systems interview question has a "ceiling" that rises with seniority. The ceiling is not about knowing more jargon; it is about how many layers of consequence you reason through. The rubric below applies to any design or debug question. Read the anchor phrases, then internalise the pattern.

Junior (L3–L4)

Names the right components. Knows what a two-tower model is, knows what a feature store does. Stops there. Answer is a component list or a system diagram with no numbers and no tradeoffs.

Senior (L5)

Quantifies. Picks between two real options and explains why for this specific scale. Uses actual numbers: latency budgets, memory arithmetic, FLOP counts. Surfaces the dominant tradeoff.

Staff (L6)

Reasons about failure modes, evolution, and org cost. "What breaks under 10x load?" "How does this design change when the team doubles?" "What is the on-call burden of this approach vs the alternative?"

Principal (L7+)

Challenges the prompt. Notices when the stated requirement is under-specified or wrong. Proposes a simpler solution that achieves 90 % of the value with 20 % of the complexity. Reframes the problem so the hard parts dissolve.

Full rubric table — same question, four altitudes

The example question used throughout is: "Design a feed-ranking system for a social app with 10 M DAU and a 100 ms end-to-end budget." This is ch2. Read the table column by column first, then row by row.

Dimension	Junior answer	Senior answer	Staff answer	Principal answer
Component naming	"We need a retrieval stage, a ranking model, and a serving layer."	"Two-tower retrieval with ANN at 200 ms budget, a LightGBM ranker at 20 ms for top-500, and a deep ranker at 40 ms for top-100."	"The two-tower produces 500 candidates in 15 ms; if we ever switch to a dense re-ranker we'll need to revisit the candidate count because inference cost grows quadratically."	"Do we actually need a two-tower? At 10 M DAU and a 2 k candidate pool, a pure BM25 + heuristic pre-filter covers cold-start better and the model can be a single small ranker — less infra, easier debugging."
Quantitative justification	(absent — no numbers given)	"10 M DAU × 4 refreshes/day = 1 667 QPS sustained; p99 is 5× = 8 k QPS; one H100 handles ~2 k QPS at this FLOP budget, so 4 GPUs with 2× redundancy = 8 GPUs minimum."	Same arithmetic plus: "At 8 k QPS peak, the feature store sees 8 k × 500 candidates × 40 features = 160 M feature lookups/s. That is the real bottleneck — not the model."	"The 100 ms SLA is p99 end-to-end. My experience is that 60 % of p99 is network + feature fetch, not model inference. Before sizing GPUs I want to see the latency breakdown from a traffic replay."
Tradeoff articulation	"We can use a neural ranker or a tree model."	"LightGBM has 5–10 ms inference on CPU, is interpretable, and is safe to deploy; a neural ranker is 30–50 ms on GPU but captures feature interactions. For launch, GBDT; for phase 2, neural."	"GBDT vs neural is also a team-capability tradeoff: ML platform team already has GBDT serving; neural adds an inference serving dependency that takes 3 sprints to productionise and needs on-call rotation."	"The right question is not GBDT vs neural but 'what is the cheapest model that beats the baseline metric by ≥ 1 % CTR?' — run a 1-week experiment with GBDT first; if it's not enough, justify the neural infra."
Failure modes	(not mentioned)	"If the ranker is slow, the fallback is to serve the last cached ranking."	"Three failure modes to plan for: (1) feature store timeout → serve with stale features and alert; (2) ANN index refresh lag → candidates are stale up to 5 min, acceptable; (3) cold-start user (20 % of DAU) → pure-popularity fallback, re-enters ranked pool after 10 positive interactions."	"The failure mode I'd add: correlated failures. Feature store latency spike and model serving spike tend to co-occur under traffic bursts because both share the same GPU cluster. Separate the serving path so a model spike doesn't cascade to feature fetch."
Evolution / org cost	(not mentioned)	"We can add more GPUs as DAU grows."	"At 3× DAU the ANN index needs re-sharding; design the retrieval layer to shard-aware route now so the migration is a config change, not a rewrite. Org cost: the feature store design requires feature ownership contracts — budget 2 sprints of platform work."	"The 40-person org is the constraint. A two-stage retrieval + ranking pipeline needs 3–4 teams to own it reliably. If the org is under-resourced for that, a simpler monolithic ranker with heuristic retrieval is the right call until the team grows."
Prompt challenge / simplification	(not attempted)	(not attempted)	(rarely attempted)	"100 ms end-to-end — is that measured at the load balancer or at the client? Client-measured includes mobile network RTT which is 50–100 ms alone. If the real latency target is 'feels fast on mobile', we should spec server-side latency at 40 ms and invest in client-side prefetch, which is cheaper than shaving GPU time."

⚠ Clears up

"Staff means knowing more things." No. Staff means holding more layers of consequence in your head simultaneously. A junior might know what a KV-cache is. A staff engineer reasons about how KV-cache memory per session caps batch size, which caps throughput, which determines whether you can serve the product's long-context feature at all — and says so unprompted.

📐 How to grade your own answer — the rule

Trigger: you have just written or spoken an answer to a design or debug challenge.

Find your answer in the rubric table: which column does it most closely match?
Ask: did I give at least one number derived from the scenario's constraints? If not, you are at junior regardless of vocabulary.
Ask: did I name a failure mode and a mitigation? If not, you are at most senior.
Ask: did I mention the org or team cost of my design decision? If not, you are at most senior.
Ask: did I push back on any part of the prompt, or notice an ambiguity? If yes (and correctly), add one level.

Never: grade yourself on vocabulary. Saying "we use paged attention with disaggregated prefill/decode" while unable to do the memory arithmetic is junior cosplay. The rubric rewards consequence-reasoning, not jargon.

Solo-practice protocol

Passive reading gives you recognition memory — you feel like you know the answer because you've seen it. Active retrieval under time pressure is what builds the actual skill. Follow this protocol for every challenge in chapters ch2–ch11.

Step 1 — Timebox

Set a 25-minute timer. Do not open hints or solutions. The timer creates the pressure that approximates an interview. If 25 min feels too short for design challenges, use 35 min; keep debug challenges at 20 min.

Step 2 — Write before you speak

Write a structured response: constraints → sketch → numbers → tradeoffs → failure modes. Writing forces precision. "I'd probably use a feature store" becomes "I need a feature store with p50 < 2 ms reads; I'll use Redis for online serving and S3 Parquet for offline training." The gap between your spoken intuition and your written precision is where growth lives.

Step 3 — Talk aloud

After writing, narrate your answer aloud as if speaking to an interviewer. Notice where you stumble, hedge, or go silent. Those pauses are the topics to drill. Record yourself if you can; playback is uncomfortable and therefore useful.

Step 4 — Open hints in order

Only open Hint 1 after your written answer is complete. If Hint 1 surfaces something you missed, annotate your answer with what you should have said — do not rewrite it clean. The gap annotation is the learning artifact.

Step 5 — Grade against the rubric

Open the model solution. Apply the five-question grading checklist above. Write your grade (junior / senior / staff / principal) and one sentence explaining why. Be honest: inflated self-grades waste your finite practice reps.

Step 6 — Spaced re-attempt

Re-attempt every challenge you graded below your target level after a 3-day gap. On the re-attempt, re-read only the scenario — not your previous answer. Track the delta. If you do not improve in two re-attempts, that is a signal to re-read the referenced course chapters, not to attempt the challenge a third time without new input.

The numbers-first rule

For design challenges: always write down the four fundamental numbers before drawing a single box — (1) QPS or request rate, (2) per-request compute or memory, (3) total resource requirement, (4) cost or latency implication. Until those four are on paper, your architecture sketch is decoration.

For debug challenges: always write down the three timeline questions before hypothesising — (1) when did the symptom start?, (2) what changed at or just before that time?, (3) which metrics moved together vs independently?

✓ Remember

Write first, speak second, check third — never reverse the order.
A grade is only valid if it is based on the rubric's five questions, not your subjective confidence.
Spaced re-attempts after 3 days are more valuable than back-to-back reruns.
Annotate gaps instead of rewriting clean answers — the gap is the learning.

Pairing with the mini-project notebooks

Each challenge in this course pairs with a runnable notebook in /notebooks/. The notebook gives you real telemetry to reason over, not hypotheticals. Use the notebook in two ways:

Before the challenge: run the notebook, observe the numbers, then close it and attempt the challenge from memory of those numbers. This trains the skill of reasoning from real data rather than round numbers.
After the challenge: use the notebook to verify your arithmetic. If your estimate for GPU count was 8 and the notebook simulation suggests 12, trace the discrepancy — it almost always reveals a utilisation assumption you got wrong.