ML systems challenges — design it, then debug it
The practice arena for the ML systems course. Four real design challenges with hard constraints, four production debugging war rooms with actual telemetry to reason over, eight capacity-estimation drills, and six on-call triage simulations. Every challenge follows the same protocol: timebox 25 minutes, write your answer down, then open the hints and the model solution, then grade yourself against the junior/senior/staff rubric.
What's in here
- How to train on challenges + the two meta-frameworks
- Design: feed ranking for 10M DAU under 100 ms
- Design: LLM chat serving for 50k concurrent users
- Design: feature store with point-in-time correctness
- Design: enterprise RAG with hard permissioning
- Debug: p99 latency tripled after yesterday's deploy
- Debug: training loss exploded at step 41,200
- Debug: offline AUC up, online CTR down
- Debug: GPU serving throughput collapsed at long context
- Capacity-estimation drills
- On-call triage simulations (rapid scenarios)
- Grading rubrics + how to practice
This chapter explains how to use this practice arena so that the repetitions actually transfer to an interview room. It then gives you two reusable mental frameworks — one for design questions, one for debug questions — that you can apply to every subsequent challenge regardless of domain. Internalise these two frameworks and you will never stare blankly at a whiteboard again.
Reading a model solution feels like learning. It is not. Cognitive science calls this the fluency illusion: when material is coherent and familiar-looking, the brain rates its own comprehension as high even when the underlying retrieval circuits are untrained. You feel prepared; you are not.
What actually transfers to an interview is retrieval practice under production conditions — sitting with a blank page, a timer, and a real problem, and forcing yourself to produce an answer before you know whether it is right. The struggle is the training signal, not the solution.
The classic failure mode is skimming the scenario, immediately opening the model solution "just to see the structure", then convincing yourself you understood it. This produces a false sense of readiness. On interview day the blank page appears and nothing loads.
Every challenge in this book must be done in this order:
- Read the scenario only. Close every hint and solution block.
- Set a 25-minute timer. Write your answer — on paper, in a doc, aloud to a recording. Do not stop before the timer fires.
- Open Hint 1 only. Compare it to what you wrote, then take another 5 minutes to update your answer.
- Open the model solution. Score yourself against the rubric. Write one sentence: what did you miss and why.
- Redo the same challenge cold in 48 hours. If you score higher, it is in long-term memory; if not, repeat.
Trigger: any challenge page opens.
- Timer on. Blank document open. Write.
- Hints only after you have a written skeleton answer.
- Full solution only after you have a written complete answer.
Never: open the solution to "orient yourself" before attempting. Once you see the answer, the practice rep is ruined.
Every ML system design question, regardless of domain, has the same skeleton. Interviewers are checking whether you can structure chaos into a sequence of deliberate decisions. Use this six-step framework every time.
Trigger: "Design a system that …"
- Ask two clarifying questions: scale (DAU / QPS) and the hardest constraint (latency vs freshness vs cost). Then stop asking and start designing.
- State requirements aloud with numbers before drawing anything.
- Sketch the full 8-box skeleton first — resist the urge to zoom in early.
- Pick the two most-constrained components, go deep, do arithmetic.
- Close with failure modes and one "at 10× scale" observation.
Never: jump straight to model architecture without establishing scale. Interviewers at senior/staff level treat a missing requirements section as a red flag for someone who will build the wrong thing.
A debug challenge always begins with a metric moving in the wrong direction. Without a framework, candidates thrash: they guess random causes, suggest fixes before diagnosing, and talk in circles. The layered-funnel approach stops that.
The core insight: every production ML system is a pipeline of transformations, and a regression almost always enters at one layer. Your job is to localise that layer as quickly as possible using the cheapest possible probes. Fix before diagnosis is guessing; diagnosis without cheapest-probe ordering is slow.
Trigger: "Metric X dropped / spiked after Y." or "Something is wrong with your model in production."
- Ask: what changed in the last 30 minutes (deploy, data, config)? What does the p50 look like versus p99? (Tail-only = infra or cache; all-percentiles = model or data.)
- State your layer funnel aloud: "I will localise top-down starting at the product metric."
- Rank 3 hypotheses by prior probability, then name the cheapest probe for each.
- Describe the fix and one prevention mechanism.
Never: jump to "retrain the model" as your first action — retraining takes hours/days and will not fix infra bugs, cache cold-starts, or schema changes. Always exhaust cheap reversible probes first.
Debugging is the full loop (observe → localise → hypothesise → test → fix → prevent). Diagnosing is just one step: finding the root cause layer. In an interview you must do the full loop — candidates who stop at "the feature pipeline is probably broken" miss the test-cheapest-first and prevention steps that distinguish senior from junior.
Each challenge closes with a rubric table. Here is how to read it and how to use it honestly.
| Level | What it looks like | What is missing to reach the next level |
|---|---|---|
| Junior | Names the right components (two-tower retrieval, ranker, feature store). Gets the general shape right. May have no numbers. | Tradeoff reasoning: "why this component, not that one?" Numbers for every scale claim. |
| Senior | Quantified tradeoffs. Latency budgets that sum correctly. Knows the failure modes of each choice. Can defend any decision challenged by interviewer. | Failure modes at scale, org implications, cost conversation, evolution path. |
| Staff | Proactively raises failure modes. Estimates cost and discusses the build/buy/oss tradeoff. Addresses 10× scale without being asked. Simplifies by questioning whether some component is even needed. | Nothing — this is the target. |
Score yourself per dimension, not overall: you might be senior on architecture but junior on failure modes. Targeted practice closes specific gaps faster than generic review.
Interviewers often intentionally ask a question that is slightly underspecified — e.g., "Design a feed ranker" with no scale given. The junior candidate picks a scale and starts designing. The senior candidate asks two clarifying questions and writes the numbers down explicitly. The staff candidate also says "let me state my assumptions, tell me if any are wrong" — treating the ambiguity itself as a signal about the problem.
- Attempt-first always: 25 min on a blank page before any hint. The struggle is the training signal.
- Design framework: requirements (with numbers) → scale envelope → 8-box skeleton → deep-dive 2 components → failure modes → evolution.
- Debug framework: observe (what changed, when?) → localise layer by layer → hypothesise ranked by prior → test cheapest-first → fix + prevent.
- Rubric: junior names components; senior adds numbers and tradeoffs; staff adds failure modes, cost, evolution, and sometimes simplifies by cutting scope.
Read the scenario, set a 25-minute timer, and write before you peek. Every design answer flows through six steps from requirements to evolution. Every debug answer flows through five steps from observation to prevention. Grade yourself per dimension against the junior/senior/staff rubric and close specific gaps.
Q1. Why is a 25-minute timebox specifically recommended, rather than working until you feel done?
Q2. In the debug framework, why localise layer-by-layer top-down rather than checking the most likely root cause first?
Q3. A candidate draws the 8-box architecture diagram immediately when asked a design question. What is wrong with this?
Q4. What distinguishes a staff-level failure modes answer from a senior-level one?
Q5. The interviewer challenges your choice of two-tower retrieval: "Why not just use BM25?" How do you respond?
Q6. You're debugging a latency regression and your first probe (checking the dashboard) shows no anomaly. What do you do next?
Q7. Why does the debug framework say "fix before diagnosis is guessing"?
Q8. How would you adapt the 8-box design framework for a non-neural system, like a rules-based fraud detector?
Q9. An interviewer says: "Tell me about a system you would not use the 8-box framework for." How do you answer?
Q10. Why is "retrain the model" almost never the correct first action in a production debug situation?
This challenge asks you to design a production feed-ranking system from scratch under realistic constraints: 10M daily active users, a 2,000-item candidate pool per request, p99 end-to-end latency of 100 ms, a 40-person engineering organisation, a modest GPU budget, and 20% cold-start users. It is the archetypal ML systems design question — virtually every senior/staff ML interview includes a variant of it.
Produce a complete feed-ranking system design that includes:
- A latency budget table — each stage, its candidate count in and out, and its ms budget — that sums to ≤ 100 ms.
- The full candidate funnel: retrieval → pre-rank filter → light ranker → heavy ranker → business-logic re-ranking.
- Model choices per stage with justification (architecture, feature types, serving format).
- A feature freshness plan: which features are batch (daily/hourly), which are near-real-time, which are request-time, and where each is computed and stored.
- A cold-start strategy for the 20% of users with sparse signal.
- A fallback story: what the system does if any stage breaches its budget or crashes.
- An experiment plan for a safe launch: how you A/B test this without exposing all 10M users to a broken ranker.
Hint 1 — Latency decomposition
100 ms sounds generous until you account for network round trips. A user in the median location is ~20 ms from your nearest data centre. That leaves 80 ms for your backend. Your backend is not one service — it is a chain: candidate fetch → retrieval model → feature lookup → light ranker → heavy ranker → response serialisation. Each hop adds latency. The key constraint is that heavy neural ranking on 2,000 items is impossible in 80 ms on a GPU (the GPU call alone would take 200+ ms at that scale). You must reduce the candidate set aggressively before the expensive stage.
Think about this: if your heavy ranker can score 50 items in ~15 ms on GPU, how many items must the light ranker have already eliminated before it? Work backwards from that number.
Hint 2 — The two-stage model split
The industry-standard solution for this constraint is a two-tower retrieval model followed by a point-wise or list-wise ranker. The retrieval model does approximate nearest-neighbour (ANN) search in an embedding space to get from 2,000 (or more) candidates down to ~200. The ranker then uses richer features (cross-features between user and item) that are too expensive to compute for 2,000 items but fine for 200.
A further split often helps: a light ranker (logistic regression or a small gradient-boosted tree on precomputed features) reduces 200 → 50, and a heavy ranker (small neural net with cross-features) scores those final 50, from which the business-logic step picks what to display. Think about what features each stage can afford given its latency budget, and where those features come from.
Hint 3 — Cold-start and position bias
For cold-start users (fewer than 10 engagement events), personalised two-tower embeddings are meaningless — the user embedding is just noise. Two standard approaches: (a) popularity-based fallback — rank by global or demographic-cohort popularity for the first N sessions; (b) content-based bootstrap — if the user indicated interests at signup, use item embeddings directly without a user tower. The transition from cold to warm is gradual: after 10–30 interactions, blend cold and warm signals.
For position bias: items shown at the top of the feed are clicked more regardless of quality. If your training labels are clicks without position correction, your model learns "items shown first are good", which is circular. The standard fix is inverse propensity scoring (IPS) or a position feature during training that is zeroed out at inference. Make sure your solution mentions this — it is a common interview probe.
✅ Model solution — full worked answer
Total end-to-end budget: 100 ms. Allocate as follows (all times are p99 targets):
| Stage | Candidates in | Candidates out | Budget (ms) | Notes |
|---|---|---|---|---|
| Network (client → edge) | — | — | 15 | CDN edge, median user distance |
| Candidate generation (follow-graph + topics) | all posts <48h | 2,000 | 10 | Precomputed in Redis; pure lookup |
| Two-tower ANN retrieval | 2,000 | 200 | 12 | FAISS / ScaNN index on CPU; user vec from cache |
| Feature fetch (batch + near-RT features) | 200 items | 200 | 8 | Redis multi-get; 200 × ~5 features |
| Light ranker (GBDT) | 200 | 50 | 5 | CPU inference, pre-compiled model |
| Heavy ranker (neural, GPU) | 50 | 50 (scored) | 18 | GPU batch; cross-features computed here |
| Business-logic re-rank + dedup | 50 | 20 (displayed) | 4 | CPU; diversity, ads injection |
| Serialisation + network (edge → client) | — | — | 8 | Protobuf compression |
| Total | | | 80 | 20 ms headroom for p99 jitter |
The 20 ms headroom is not waste — p99 latency has a fat tail. Without headroom, any single slow downstream service (cold Redis key, GC pause, noisy neighbour) will breach the SLA.
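The budget table is easy to keep honest if it lives as data rather than as a diagram. A minimal sketch, assuming the stage names below (they are illustrative labels, not identifiers from any real service) and the per-stage numbers from the table above:

```python
# Per-stage p99 budgets in milliseconds, copied from the table above.
# The dictionary keys are illustrative labels, not real service names.
BUDGET_MS = {
    "network_client_to_edge": 15,
    "candidate_generation": 10,
    "two_tower_ann_retrieval": 12,
    "feature_fetch": 8,
    "light_ranker": 5,
    "heavy_ranker": 18,
    "business_logic_rerank": 4,
    "serialisation_and_network": 8,
}
SLA_MS = 100


def headroom_ms(budget, sla_ms):
    """Return the milliseconds left unallocated under the SLA."""
    return sla_ms - sum(budget.values())


def over_budget(measured_ms, budget):
    """Return the stages whose measured p99 exceeds their allocation."""
    return [stage for stage, ms in measured_ms.items() if ms > budget[stage]]


print(sum(BUDGET_MS.values()), headroom_ms(BUDGET_MS, SLA_MS))  # 80 20
```

A check like `over_budget` could run against production p99 measurements so that a single stage eating into the headroom is visible before the end-to-end SLA is breached.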
Candidate generation runs offline: a Spark job hourly expands each user's follow graph and topic subscriptions into a candidate set of recent posts (≤48 hours old). The candidate list is written to Redis keyed by user ID. At request time this is a single Redis lookup — no model inference, no graph traversal on the hot path.
Two-tower ANN retrieval: offline, train a two-tower model. User tower ingests user ID embedding + aggregated interaction history (mean-pooled post embeddings of last 50 interactions). Item tower ingests post ID embedding + content features (topic, media type, author popularity). Both towers output a 128-dimensional L2-normalised embedding. At request time: look up the user's cached embedding vector; run FAISS (HNSW index) against the 2,000 candidate item vectors to retrieve top 200. HNSW search on 2,000 items at 128 dims takes ~2 ms on CPU — well inside the 12 ms budget including the Redis lookup for the user vec.
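The request-time half of the retrieval step reduces to "score every candidate vector against the user vector and keep the best k". The sketch below shows that step with an exact dot-product scan in plain Python rather than the FAISS HNSW index the solution names; it is a stand-in for illustration, and the function names are hypothetical.

```python
import heapq
import math


def l2_normalise(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_by_dot(user_vec, item_vecs, k):
    """Return the k post ids whose embeddings score highest against user_vec.

    item_vecs maps post id -> embedding. This is an exact scan, not ANN:
    it stands in for the index lookup described in the text.
    """
    scored = (
        (sum(u * v for u, v in zip(user_vec, vec)), post_id)
        for post_id, vec in item_vecs.items()
    )
    return [post_id for _, post_id in heapq.nlargest(k, scored)]


user = l2_normalise([1.0, 0.0])
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
print(top_k_by_dot(user, items, 2))  # ['a', 'c']
```

With a pool of only 2,000 candidates an exact scan is a reasonable baseline to measure any ANN index against, both for latency and for recall.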
Light ranker: a gradient-boosted decision tree (LightGBM, ~200 trees, depth 6) trained on engagement labels. Features: precomputed item statistics (like rate, share rate, 7-day CTR), user-item cross-features (topic affinity score, recency signal, author follow depth). All features come from Redis — no model inference needed. GBDT inference on 200 items takes <2 ms on a single CPU core. Output: top-50 by predicted engagement score.
Heavy ranker: a 2-layer MLP with cross-attention between user embedding and item embeddings (roughly 5M parameters). Inputs: user tower output, item tower output, position embedding (for training; zeroed at inference), explicit cross-features (user–author affinity, freshness decay). Batch of 50 items through the MLP on A100 GPU takes ~10–14 ms including the CUDA kernel launch overhead. Output: calibrated engagement probability per item.
Business-logic re-rank: deterministic CPU step. Applies a maximum-marginal-relevance diversity filter (at most 3 consecutive posts from the same author), injects sponsored posts at fixed positions (slots 3, 8), and removes posts the user has already seen in the last 24 hours (Bloom filter lookup). Final 20 items returned.
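As an illustration of the deterministic re-rank step, here is a minimal sketch of just the consecutive-author cap. It is a greedy pass, not a full maximum-marginal-relevance implementation, and it leaves out sponsored-slot injection and the seen-posts filter. The function name and the `(post_id, author)` tuple shape are assumptions made for the example.

```python
def cap_consecutive_authors(ranked, max_run=3):
    """Reorder (post_id, author) pairs so no author appears more than
    max_run times in a row, preferring the original score order."""
    remaining = list(ranked)
    out = []
    while remaining:
        # Measure the trailing same-author run in what has been emitted so far.
        run_author, run_len = None, 0
        if out:
            run_author = out[-1][1]
            for _, author in reversed(out):
                if author != run_author:
                    break
                run_len += 1
        pick = 0
        if run_len >= max_run:
            # Take the best remaining post from any other author instead.
            pick = next(
                (i for i, (_, a) in enumerate(remaining) if a != run_author),
                None,
            )
            if pick is None:
                break  # only the capped author is left; drop the overflow
        out.append(remaining.pop(pick))
    return out


feed = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (5, "b"), (6, "a")]
print([post_id for post_id, _ in cap_consecutive_authors(feed)])  # [1, 2, 3, 5, 4, 6]
```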
Why not run the heavy ranker on all 200 items instead of 50? Do the arithmetic:
At 200 items and ~10⁷ FLOPs per item: 200 × 10⁷ = 2 × 10⁹ FLOPs. An A100 delivers ~312 TFLOP/s (fp16). With 20% utilisation on small batches: effective throughput ≈ 62 TFLOP/s = 6.2 × 10¹³ FLOP/s.
Compute time ≈ 2 × 10⁹ ÷ 6.2 × 10¹³ ≈ 32 µs for the matrix math, which sounds fine. The next suspect is memory bandwidth, but that is negligible too. Each item's feature vector (~1 KB fp16) must be loaded from GPU HBM. At 200 items: 200 KB, plus model weights (~10 MB). At A100 HBM bandwidth 2 TB/s: load time ~5 µs. The bottleneck is actually the kernel launch overhead and GPU scheduling latency at these small batch sizes, which adds 8–15 ms of wall-clock time.
Running the heavy ranker on 200 items instead of 50 means 4× more feature-fetch time (by the estimate used here, 8 ms becomes ~32 ms), while the GPU kernel time stays roughly the same (still dominated by small-batch overhead). The saving from light-ranking 200 → 50 is therefore ~24 ms on feature fetch alone, which is more than the 20 ms of headroom in the budget table.
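The arithmetic in this passage can be reproduced in a few lines. A sketch using only the figures quoted above (10⁷ FLOPs per item, 312 TFLOP/s peak at 20% utilisation, 2 TB/s HBM, ~1 KB of features per item, ~10 MB of weights); the constant and function names are illustrative:

```python
FLOPS_PER_ITEM = 1e7           # ~10^7 FLOPs per item, as quoted in the text
A100_PEAK_FLOPS = 312e12       # fp16 peak
UTILISATION = 0.20             # small-batch utilisation assumed in the text
HBM_BYTES_PER_S = 2e12         # ~2 TB/s
FEATURE_BYTES_PER_ITEM = 1e3   # ~1 KB per item
WEIGHT_BYTES = 10e6            # ~10 MB of model weights


def compute_time_s(n_items):
    """Time for the matrix math alone at the assumed effective throughput."""
    return n_items * FLOPS_PER_ITEM / (A100_PEAK_FLOPS * UTILISATION)


def load_time_s(n_items):
    """Time to stream item features plus weights from HBM once."""
    return (n_items * FEATURE_BYTES_PER_ITEM + WEIGHT_BYTES) / HBM_BYTES_PER_S


for n in (50, 200):
    print(n, round(compute_time_s(n) * 1e6, 1), round(load_time_s(n) * 1e6, 1))
```

Both terms come out in microseconds for either batch size, which is why the text attributes the milliseconds of wall-clock time to launch and scheduling overhead rather than to FLOPs or bandwidth.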
| Feature | Freshness tier | Computed by | Served from |
|---|---|---|---|
| User embedding (two-tower) | Hourly | Spark + offline model inference | Redis (user_vec:{uid}) |
| Item embedding (two-tower) | On post creation + hourly refresh | Online inference at post time; batch refresh | FAISS index + Redis (item_vec:{pid}) |
| Item 7-day CTR, like rate, share rate | Hourly batch | Spark aggregation over Kafka events | Redis hash (item_stats:{pid}) |
| User–topic affinity score | Hourly batch | Spark | Redis (user_topic:{uid}) |
| Post freshness decay | Request-time computed | Ranking service (formula: e^(−λΔt)) | Computed inline from post creation time |
| User session context (recent clicks in session) | Real-time (<1 s) | Kafka consumer → Redis write | Redis (session:{uid}) |
| Seen-posts Bloom filter | Real-time | Updated on every impression event | Redis Bloom filter (seen:{uid}) |
The key principle: freshness tier must match the rate of change of the signal. A user's long-term topic affinity changes slowly — hourly batch is fine and costs nothing at request time. A user's in-session behaviour (they just liked three cooking posts) changes in seconds — this must be near-real-time or the ranker will show them more travel content they are ignoring right now.
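The one request-time feature in the table, freshness decay e^(−λΔt), is cheap enough to compute inline. A minimal sketch, assuming λ is chosen from a half-life; the 6-hour value below is a made-up example, since the text does not give one:

```python
import math


def decay_rate_from_half_life(half_life_hours):
    """Return lambda such that the freshness score halves every half_life_hours."""
    return math.log(2) / half_life_hours


def freshness(age_hours, lam):
    """Exponential freshness decay: e^(-lambda * delta_t)."""
    return math.exp(-lam * age_hours)


lam = decay_rate_from_half_life(6.0)  # hypothetical 6-hour half-life
print(freshness(0.0, lam), freshness(6.0, lam), freshness(48.0, lam))
```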
Definition: users with fewer than 10 engagement events in their history (20% of DAU = 2M users). Their user tower embedding has high variance — it is essentially a random vector trained on near-zero signal.
Phase 1 (0–5 interactions): bypass the personalised two-tower entirely. Use a popularity-based retrieval: rank candidates by a demographic-cohort weighted engagement score. Cohort = (age_bucket, geography, signup-declared interests). This is a precomputed table lookup, trivially fast.
Phase 2 (5–30 interactions): content-based bootstrap. Embed the posts the user has interacted with (using item towers, which are high quality). Represent the user as the mean of item embeddings of their interactions (no user tower required). This representation stays usable with only a handful of interactions and sharpens as more accumulate.
Phase 3 (30+ interactions): full two-tower personalisation. Blend with content-based: final user vector = α × user-tower output + (1−α) × mean-item-embedding. α is a learned function of interaction count, trained with a small neural head.
The transition is invisible to the user: the ranker always reads one vector from Redis, updated according to whichever phase the user is in.
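The three phases can be sketched as one function that decides which vector gets written for a user. Several things here are assumptions for illustration: Phase 1 is represented as a precomputed cohort vector (the text describes it as a cohort popularity lookup), the phase boundaries are taken as "fewer than 5" and "fewer than 30", and `blend_weight` is a hand-written sigmoid standing in for the learned α, with made-up `midpoint` and `scale` values.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def blend_weight(n_interactions, midpoint=60.0, scale=15.0):
    """Hypothetical stand-in for the learned alpha: rises with interaction count."""
    return 1.0 / (1.0 + math.exp(-(n_interactions - midpoint) / scale))


def serving_user_vector(n_interactions, cohort_vec, interacted_item_vecs, user_tower_vec):
    """Pick the user representation for the phase the user is currently in."""
    if n_interactions < 5:
        return cohort_vec                      # Phase 1: cohort fallback
    content_vec = mean_vector(interacted_item_vecs)
    if n_interactions < 30:
        return content_vec                     # Phase 2: mean of item embeddings
    alpha = blend_weight(n_interactions)       # Phase 3: blend tower and content
    return [alpha * u + (1 - alpha) * c for u, c in zip(user_tower_vec, content_vec)]
```

Because every phase returns a vector of the same shape, the ranker's read path does not need to know which phase produced it.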
Without position-bias correction, training clicks create a feedback loop: high-ranked items get more clicks because they are shown first, not because they are better. The model then learns this spurious signal and keeps ranking them first — a self-reinforcing cycle.
At training time: add position as a feature (slot 1 through 20) and let the model learn its effect as a per-slot term β_k added to the logit. At inference time: set position = 0 (or equivalently, subtract the learned β_k). The model then returns a "quality score without position effect". This plays the same role as IPS, with the learned position term standing in for the propensity, and is standard practice in industrial ranking systems.
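One common way to read the training-versus-inference treatment is as an additive per-slot bias on the logit. The sketch below assumes that additive form; the β values are invented for the example, and `quality_logit` stands for whatever the rest of the model outputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def p_click_training(quality_logit, slot, beta):
    """Training-time prediction: item quality plus the bias of the slot shown."""
    return sigmoid(quality_logit + beta[slot])


def p_click_inference(quality_logit):
    """Serving-time score: the position term is dropped, so rank reflects quality."""
    return sigmoid(quality_logit)


beta = {1: 1.2, 10: -0.8}  # made-up per-slot biases for illustration
same_item = 0.3
print(p_click_training(same_item, 1, beta) > p_click_training(same_item, 10, beta))  # True
```

The same item is predicted to be clicked more often in slot 1 than in slot 10 during training, while the serving score ignores the slot entirely.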
Alternative: randomisation experiments. Randomly shuffle feed positions for 1% of users. Labels from this slice are unbiased. Train the main model jointly on biased data + position feature, with a small unbiased slice for calibration.
The rule: every stage must have a defined graceful degradation path, not just a circuit breaker that returns an error. Users who see a slightly lower-quality feed churn less than users who see an error page.
- Shadow mode (week 1–2): run the new ranker in parallel with the existing system. Log both sets of scores. Compare offline: does the new ranker's score distribution look sane? Are there any items with degenerate scores (NaN, constant value)? Zero user exposure.
- 1% canary A/B (week 2–3): route 1% of traffic to the new ranker. Monitor: CTR, time-spent, app crash rate, p99 latency. Set automated rollback trigger: if p99 latency >110 ms or CTR drops >5% relative in any 15-minute window, auto-rollback and page on-call.
- 10% → 50% ramp (week 3–4): ramp in doubling steps, 24 hours at each step. Guardrail metrics: session length, D7 retention, ad revenue. No ramp if any guardrail breaches.
- Full rollout (week 4–5): 100% traffic. Keep old ranker on standby for 72 hours — do not decommission until the new system has survived at least one weekday peak traffic cycle.
Cold-start users should be over-sampled in the early canary — they are the highest-risk segment because the new ranker's cold-start logic is the most novel part of the design.
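The canary's automated rollback rule is simple enough to write down exactly. A minimal sketch using the thresholds from the plan above (p99 above 110 ms, or CTR down more than 5% relative); the function name is hypothetical, and it assumes the inputs are already aggregated over the 15-minute window and that control CTR is non-zero.

```python
def should_rollback(p99_ms, ctr_control, ctr_canary,
                    p99_limit_ms=110.0, max_relative_ctr_drop=0.05):
    """Return True if the canary breaches either guardrail for this window."""
    relative_drop = (ctr_control - ctr_canary) / ctr_control
    return p99_ms > p99_limit_ms or relative_drop > max_relative_ctr_drop


print(should_rollback(120.0, 0.10, 0.10))   # True: latency breach
print(should_rollback(100.0, 0.10, 0.094))  # True: 6% relative CTR drop
print(should_rollback(100.0, 0.10, 0.098))  # False: 2% drop, latency fine
```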
Heavy ranker: 50 items × ~10⁷ FLOPs per item = 5 × 10⁸ FLOPs per request. At 2,800 QPS peak: 5 × 10⁸ × 2,800 = 1.4 × 10¹² FLOP/s required.
One A100 (fp16, ~20% utilisation on small batches) ≈ 60 TFLOP/s = 6 × 10¹³ FLOP/s effective.
On throughput alone, 1.4 × 10¹² ÷ 6 × 10¹³ ≈ 0.02 of one A100, so the heavy ranker looks like it needs effectively zero GPUs, because it only processes 50 items per request, not thousands. The real constraint is latency (each GPU call must complete in <18 ms), not throughput. At 2,800 requests per second, each needing a GPU call in <18 ms, assuming 1 ms kernel launch amortised over a batch of 10 co-scheduled requests:
With continuous micro-batching: one A100 can handle ~55 requests/s at 18 ms latency budget (1/0.018 ≈ 55). To handle 2,800 QPS: 2,800 ÷ 55 ≈ 51 A100s. Given the 30-GPU budget constraint, plan A: reduce heavy ranker to top-30 items (saves 40% compute, frees 20 GPUs). Plan B: use A10G (cheaper, 60% of A100 throughput) for the heavy ranker — brings cost down ~40%. Present both options to finance.
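The two sizing views, throughput-bound and latency-bound, can be compared side by side. A sketch using the figures quoted above; note that the latency-bound line reproduces the text's 1/0.018 estimate, which corresponds to one request per 18 ms GPU call, so co-scheduling several requests per call would lower the count.

```python
import math

ITEMS_PER_REQUEST = 50
FLOPS_PER_ITEM = 1e7
PEAK_QPS = 2_800
EFFECTIVE_FLOPS_PER_GPU = 6e13   # A100 fp16 at ~20% utilisation
GPU_CALL_BUDGET_S = 0.018        # heavy-ranker latency budget

# Throughput view: total FLOP/s demanded versus what one GPU delivers.
required_flops_per_s = ITEMS_PER_REQUEST * FLOPS_PER_ITEM * PEAK_QPS
gpus_by_throughput = required_flops_per_s / EFFECTIVE_FLOPS_PER_GPU

# Latency view: how many 18 ms calls one GPU can serve per second.
requests_per_gpu_per_s = 1 / GPU_CALL_BUDGET_S
gpus_by_latency = math.ceil(PEAK_QPS / requests_per_gpu_per_s)

print(round(gpus_by_throughput, 3), round(requests_per_gpu_per_s, 1), gpus_by_latency)
```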
- Org implications: "The feature freshness plan requires an SLA with the data platform team. Without a written SLA on Redis uptime and p99 get latency, our fallback story is unreliable. I would negotiate that before launch."
- Cost conversation: "At 51 A100s and current cloud pricing (~\$3.50/GPU-hr), the heavy ranker costs 51 × 24 × 365 × \$3.50 ≈ \$1.56M/year. Given 10M DAU and modest monetisation, this is likely borderline. I would prioritise the A10G option or model distillation to get under \$1M/year."
- Evolution: "At 100M DAU, candidate generation from Redis becomes a bottleneck — we would move to an approximate graph-walk retrieval (like Pinterest's candidate generation). The two-tower ANN index would need to shard across 10+ machines."
- Challenge the prompt: "Do we actually need 2,000 candidates per request? If the follow graph is dense, better candidate pruning offline (time-decay, quality gate) might let us start with 500 and skip the light ranker entirely — simpler system, lower latency."
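The cost line in the list above is a one-line calculation worth being able to redo live. A sketch with the quoted price of 3.50 dollars per GPU-hour; the second call assumes Plan A's "frees 20 GPUs" leaves 31 A100s, which is an inference from the text rather than a figure it states.

```python
def annual_gpu_cost(n_gpus, price_per_gpu_hour):
    """Cost of running n_gpus continuously for one year."""
    return n_gpus * 24 * 365 * price_per_gpu_hour


print(annual_gpu_cost(51, 3.50))  # 1563660.0
print(annual_gpu_cost(31, 3.50))  # 950460.0
```

Under these assumptions the reduced fleet lands just under the one-million-dollar target mentioned in the cost conversation.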
| Dimension | Junior | Senior | Staff |
|---|---|---|---|
| Latency budget | Mentions stages exist but no numbers | Table with per-stage ms that sums to ≤100 | Table + explains headroom rationale + 10× scale impact |
| Retrieval model | Says "use embeddings" or "ANN" | Two-tower architecture, HNSW/FAISS, explains why ANN beats brute-force | Discusses index freshness, sharding at 10× scale, ANN recall vs latency tradeoff |
| Light/heavy split | Mentions two-stage ranking | GBDT then MLP, justifies FLOP savings | Full arithmetic, names A100 vs A10G tradeoff, discusses distillation path |
| Feature freshness | Says "real-time features" | Tiered table: batch/near-RT/request-time with storage layer named | SLA negotiation with data platform, late-arrival policy, offline/online parity testing |
| Cold-start | Not addressed or vague ("show popular posts") | Phased strategy with interaction threshold and content-based fallback | Phased strategy + blend coefficient + how transition is invisible to user + experiment design for cold cohort |
| Position bias | Not mentioned | Mentions IPS or position feature | Explains the feedback loop, the formula, training vs inference treatment, randomisation alternative |
| Fallbacks | Not mentioned | Circuit breaker or rollback mentioned | Per-stage graceful degradation + "no blank feed" principle |
| Experiment plan | "Run an A/B test" | Shadow → canary → ramp with specific percentages and guardrail metrics | Above + automated rollback trigger thresholds + cold-start over-sampling rationale |
- "Your p99 just breached 110 ms in canary. Walk me through your immediate actions." — tests fallback and debug framework integration.
- "Why not just increase the heavy ranker's candidate count to 200 and cut the light ranker?" — tests FLOP arithmetic and feature-fetch cost understanding.
- "How does the system behave if the Redis cluster goes down?" — tests fallback depth: do you have a local cache? A Postgres fallback? A static default?
- "A post goes viral unexpectedly — 10× normal engagement in 5 minutes. Which part of your system breaks first?" — tests freshness tier design: hourly batch features are now severely stale for this item.
- "We want to optimise for 7-day retention, not CTR. What changes?" — tests understanding of label choice: CTR labels create a short-term engagement loop; retention requires long-horizon labels, delayed feedback handling, and potentially a different objective.
Trigger: "Design a feed ranking system" or any variant with a DAU + latency constraint.
- State the scale math first: DAU → peak QPS. Do this live on the whiteboard. It takes 30 seconds and immediately shows quantitative fluency.
- Draw the latency budget table before drawing anything else. Every stage needs a ms budget that sums to the SLA. This is the skeleton everything else hangs on.
- Work the funnel left-to-right: candidate set → retrieval → light ranker → heavy ranker → business logic. At each stage: what model, what features, what latency budget, what fallback.
- Volunteer cold-start and position bias — they are almost always in scope and interviewers reward candidates who raise them unprompted.
- Close with experiment plan and one cost estimate. Mention the GPU count and annual cost even if the interviewer did not ask.
Never: design the model architecture before establishing the latency budget. A beautiful neural architecture that cannot meet the p99 SLA is worthless. Budget first, model second.
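The "DAU → peak QPS" step from the action plan can be sketched as below. The per-user request rate and the peak factor are assumptions chosen for illustration (the text does not state them); with 8 feed requests per user per day and a 3× peak, the result happens to land near the 2,800 QPS used in the model solution.

```python
SECONDS_PER_DAY = 86_400


def peak_qps(dau, requests_per_user_per_day, peak_factor):
    """Average QPS over a day, scaled by a peak-to-average factor."""
    average_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    return average_qps * peak_factor


# Assumed inputs: 8 requests per user per day, 3x peak-to-average ratio.
print(round(peak_qps(10_000_000, 8, 3)))
```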
- Feed ranking is a multi-stage funnel: 2,000 → 200 (ANN retrieval) → 50 (light GBDT) → 20 (heavy neural) → displayed. Each stage has a ms budget; they must sum to the SLA with headroom.
- Heavy ranker on 200 items fails the latency budget not from compute, but from feature-fetch I/O. The light ranker exists to eliminate that cost.
- Position bias must be handled at training time (position feature + zero at inference) or you train a circular model that learns "top items are clicked because they are on top."
- Cold-start strategy: popularity fallback (0–5 events) → content-based mean-item-embedding (5–30) → full personalised two-tower (30+). Transition is smooth and invisible.
- Every stage needs a graceful degradation path. "Circuit breaker returns error" is not a fallback — users must see a feed, even a non-personalised one.
A 100 ms feed-ranking SLA forces a multi-stage funnel: cheap ANN retrieval collapses 2,000 candidates to 200, a GBDT light ranker collapses to 50, and a GPU neural heavy ranker scores the final 50. Feature freshness must be tiered (batch/near-RT/request-time) matching each signal's rate of change. Cold-start users get a phased content-based fallback. Position bias is killed with a training-time position feature zeroed at inference. Every stage has a graceful degradation path so users never see a blank feed.
Q1. Why is the latency budget table the first thing to draw, before any architecture diagram?
Q2. A colleague proposes replacing the two-tower retrieval with a simple BM25 search over post text. Under what conditions would this be a reasonable choice?
Q3. Your FAISS index needs to be updated as new posts arrive. How do you handle this without taking the index offline?
Q4. Explain the feedback loop that position bias creates and why it is self-reinforcing.
Q5. Why is the heavy ranker budget 18 ms rather than, say, 40 ms? What would you sacrifice to give it more time?
Q6. A post goes viral — 10× normal engagement in 2 minutes. Which part of your system responds first and which lags?
Q7. The product team wants to optimise the ranker for 7-day retention instead of CTR. What changes?
Q8. How would you test whether your position-bias correction is actually working?
Q9. The GPU budget is 30 A100s but the arithmetic shows you need 51. How do you resolve this?
Q10. An adversarial creator is flooding the system with 1,000 low-quality posts per hour trying to game the ranking. How does your system defend against this?
Q11. Your shadow-mode logs show the new ranker scores items very differently from the old ranker — some items 10× higher, some 10× lower. Is this a bug?
Q12. If you could only monitor three metrics in production for this system, which would you choose and why?
This challenge forces you to do the arithmetic that separates a hand-wavy "just use vLLM" answer from a Staff-grade design. You will estimate GPU counts from first principles, size the KV-cache per device, pick a batching strategy, and compute cost per million tokens — the metric every product team actually cares about.
You are the ML-infrastructure tech lead at a company launching a customer-facing chat assistant. The stack must handle:
- Model: 13B-parameter dense transformer, fp16 weights, 40 layers, 40 heads, head dim 128, GQA with 8 KV heads per layer (so KV head dim = 128).
- Traffic: 50,000 concurrent sessions; mean 4 requests / session / hour → ~56 req/s sustained, with 3× daily peak (≈ 167 req/s).
- Token profile: mean prompt 1,500 tokens; mean response 400 tokens; so ~1,900 tokens round-trip per request.
- Latency SLO: time-to-first-token (TTFT) < 800 ms at p95; time-per-output-token (TPOT) < 60 ms at p95.
- Cost target: \$2.00 / 1M tokens (input + output blended).
- Infra: H100 80GB SXM GPUs available on-demand; also an autoscale budget.
- Estimate how many H100s you need — show every arithmetic step.
- Design the batching strategy (continuous batching, chunked prefill, etc.) and explain why naive static batching fails here.
- Compute the KV-cache memory budget per GPU and decide how many concurrent sequences fit per device.
- Describe routing / session-affinity decisions and why they matter.
- Design the autoscale + degradation ladder for the 3× peak.
- Address multi-region: when does it help, when does it not?
Hint 1 — Start with throughput, not latency
The most common first mistake is to think about latency targets first and GPU count second. Flip it: figure out how many tokens per second the fleet must generate, then divide by what one H100 can do. Throughput determines fleet size; latency determines per-device batch size and scheduling policy.
Useful number: a single H100 (80GB, fp16, no quantization) running a 13B model with a realistically-sized batch can sustain roughly 3,000–5,000 output tokens/second on the decode phase. Where does that number come from? Memory-bandwidth bound: at 3.35 TB/s HBM3 bandwidth and ~26 GB of model weights (13B × 2 bytes), each decode step reads the entire model in ~8 ms, implying ≈ 125 decode steps/sec with batch=1. With batch=32 the same memory read is amortized over 32 tokens → 4,000 tokens/sec. Confirm you can reconstruct this from first principles before peeking at the solution.
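The bandwidth-bound estimate in this hint can be reconstructed as follows. It is an upper-bound sketch: it counts only the time to read the weights once per decode step and ignores KV-cache reads, compute, and scheduling overhead, so real throughput will be lower. The exact division gives slightly different figures from the hint because the hint rounds the step time to 8 ms.

```python
PARAMS = 13e9
BYTES_PER_PARAM = 2            # fp16
HBM_BYTES_PER_S = 3.35e12      # H100 SXM HBM3 bandwidth quoted in the hint

weight_bytes = PARAMS * BYTES_PER_PARAM            # ~26 GB
step_time_s = weight_bytes / HBM_BYTES_PER_S       # one full read of the weights


def decode_tokens_per_s(batch_size):
    """Each decode step reads the weights once and emits one token per sequence."""
    return batch_size / step_time_s


print(round(step_time_s * 1e3, 2), round(decode_tokens_per_s(1)), round(decode_tokens_per_s(32)))
```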
Hint 2 — KV-cache memory is the binding constraint, not weights
On an H100 with 80 GB VRAM, the 13B fp16 model occupies ~26 GB. That leaves ~54 GB, or ~50 GB for KV cache once you reserve ~4 GB for activations and scratch space. Before you decide on batch size, work out how many bytes one token of KV state costs for this model. With GQA (8 KV heads, head dim 128) the calculation is per-layer-per-token: 2 (K and V) × 8 heads × 128 dim × 2 bytes (fp16) = 4,096 bytes per layer. Multiply by 40 layers = 163,840 bytes ≈ 160 KB per token. Now ask: how many 1,900-token sessions does 50 GB support? That tells you the max concurrent sequences a single GPU can hold in-cache.
Hint 3 — Continuous batching changes the scheduling math
With static batching you commit a batch at request arrival and wait for all sequences to finish before accepting new ones. With continuous batching (aka iteration-level scheduling), each forward pass picks the "ready" sequences from a waiting pool, so sequences can enter and leave mid-batch. This keeps GPU utilization high even when response lengths vary wildly. Think about how this interacts with prefill vs. decode: prefill of a long prompt (1,500 tokens) for ONE new request takes many more FLOPs than a decode step. If you mix prefill and decode in the same forward pass, long prefills stall the decodes sharing that pass → inter-token latency (TPOT) spikes for users who are already streaming, and queued requests wait longer for their first token. What is the fix?
✅ Model solution — GPU count, KV-cache arithmetic, cost per 1M tokens
Peak traffic: 50,000 sessions × 4 req/hr / 3,600 s × 3× burst = 167 req/s.
Tokens per request: 1,500 prompt + 400 response = 1,900 total, but the serving cost splits differently:
- Prefill processes 1,500 tokens in one parallel forward pass (FLOP-heavy, fast).
- Decode generates 400 tokens one at a time (memory-BW-heavy, slow).
Output token demand at peak: 167 req/s × 400 tokens = 66,800 output tokens/sec fleet-wide.
Model config: 40 layers, GQA with 8 KV heads, head dim 128, fp16.
KV bytes per token = 2 (K and V) × layers × KV heads × head dim × bytes per element
= 2 × 40 × 8 × 128 × 2 = 163,840 bytes ≈ 160 KB per token.
VRAM budget per H100 (80 GB):
- Model weights (fp16): 13B × 2 = 26 GB
- Activations + scratch: ~4 GB (conservative)
- Available for KV cache: 80 − 26 − 4 = 50 GB
Max tokens in KV cache per GPU: 50 GB / 160 KB = 312,500 tokens.
At mean context length 1,900 tokens/session: 312,500 / 1,900 ≈ 164 concurrent sequences per GPU.
This is the primary capacity constraint — not raw FLOP throughput.
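The KV-cache arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative, and the second result shows what happens if you use the exact 163,840 bytes instead of the rounded 160 KB the text divides by.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache bytes for one token: K and V, for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(free_vram_bytes, bytes_per_token, mean_context_tokens):
    """How many mean-length sequences fit in the VRAM left for KV cache."""
    cache_tokens = free_vram_bytes // bytes_per_token
    return cache_tokens // mean_context_tokens


per_token = kv_bytes_per_token(layers=40, kv_heads=8, head_dim=128)

# Rounded figure used in the text: 50 GB / 160 KB per token.
rounded = max_concurrent_sequences(50 * 10**9, 160_000, 1_900)

# Exact byte count: slightly fewer sequences fit.
exact = max_concurrent_sequences(50 * 10**9, per_token, 1_900)

print(per_token, rounded, exact)
```

The gap between the two results is only a rounding effect; either way the per-GPU ceiling is on the order of 160 sequences.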
H100 HBM3 bandwidth: 3.35 TB/s. Each decode step must read all 26 GB of weights once, regardless of batch size, so one step takes 26 GB / 3.35 TB/s ≈ 7.8 ms → ≈ 129 decode steps/sec.
With batch B, we generate B tokens per step → throughput = 129 × B tokens/sec. At batch=164 (our KV-cache ceiling): ≈ 21,000 output tokens/sec per GPU. Apply a 70% utilization haircut (routing overhead, prefill interruptions, head-of-line blocking): ~14,700 usable output tok/sec per GPU.
Fleet output token demand at peak: 66,800 tok/s.
GPUs needed: 66,800 / 14,700 ≈ 5 H100s for decode at peak.
Add headroom for prefill burst (prefill is FLOP-bound; a 1,500-token prefill takes ~0.5s on one GPU at batch=1, much less in parallel): plan 2× overhead → 10–12 H100s for the compute fleet. Add 2 for redundancy → 12–14 H100s total at peak.
Sustained (non-peak) load is ~3× lower → 4–5 GPUs; autoscale between 5 and 14.
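The decode-fleet estimate can be checked with a short script. This sketch follows the text's simplification that each decode step is bounded only by reading the model weights (it does not model KV-cache reads, which the 70% haircut is assumed to absorb); the function names are illustrative.

```python
import math


def decode_steps_per_sec(hbm_bytes_per_sec, weight_bytes):
    """Memory-bandwidth bound: one full read of the weights per decode step."""
    return hbm_bytes_per_sec / weight_bytes


def gpus_for_decode(peak_req_per_sec, output_tokens_per_req,
                    batch_size, utilization, steps_per_sec):
    """GPUs needed so usable decode throughput covers peak output-token demand."""
    demand_tok_per_sec = peak_req_per_sec * output_tokens_per_req
    per_gpu_tok_per_sec = steps_per_sec * batch_size * utilization
    return math.ceil(demand_tok_per_sec / per_gpu_tok_per_sec)


steps = decode_steps_per_sec(3.35e12, 26e9)       # H100 HBM3, 13B fp16 weights
peak_rps = 50_000 * 4 / 3600 * 3                  # sessions x req/hr, 3x burst
decode_gpus = gpus_for_decode(peak_rps, 400, batch_size=164,
                              utilization=0.70, steps_per_sec=steps)

print(round(steps), round(peak_rps), decode_gpus)
```

The 2× prefill overhead and the two redundancy GPUs from the text are then applied on top of `decode_gpus`.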
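Why static batching fails: variable response lengths mean some sequences finish at token 20, others at token 800. Static batching holds every slot in the batch until the slowest sequence finishes, so a slot whose sequence ended at token 20 sits idle for the remaining 780 of 800 steps (~97% of the batch's lifetime).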
Continuous batching (iteration-level scheduling): every forward pass, the scheduler swaps completed sequences out and new ones in. GPU is never idle waiting for stragglers. This is the default in vLLM, TGI, and SGLang.
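To make the difference concrete, here is a toy simulation of slot utilization under both policies. It is a sketch, not how vLLM, TGI, or SGLang schedule internally: it assumes one token per sequence per step, a fixed number of batch slots, and a hypothetical workload of one 800-token response among many 20-token ones.

```python
from collections import deque


def static_utilization(lengths, batch_size):
    """Each batch runs until its longest sequence ends; finished slots idle."""
    used = total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        used += sum(batch)
        total += max(batch) * batch_size
    return used / total


def continuous_utilization(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    remaining = [0] * batch_size          # tokens left to generate per slot
    used = total = 0
    while queue or any(remaining):
        for slot in range(batch_size):
            if remaining[slot] == 0 and queue:
                remaining[slot] = queue.popleft()
        used += sum(1 for r in remaining if r > 0)
        total += batch_size
        remaining = [max(r - 1, 0) for r in remaining]
    return used / total


lengths = [800] + [20] * 199              # hypothetical response lengths
static_u = static_utilization(lengths, batch_size=4)
continuous_u = continuous_utilization(lengths, batch_size=4)
print(f"static={static_u:.2f} continuous={continuous_u:.2f}")
```

The single long response drags down only its own static batch here; with more length variance per batch the static figure falls further.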
The prefill/decode interference problem: a single 1,500-token prefill in a mixed batch is compute-bound and stretches that forward pass far beyond a normal ~8 ms decode step, stalling token output (TPOT) for every sequence currently decoding. Chunked prefill splits the prefill into fixed-size chunks (e.g., 512 tokens/chunk) and interleaves them with decode steps. This bounds prefill latency injection at the cost of slightly higher TTFT for the prefilling sequence itself.
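A minimal sketch of the chunking idea, assuming a 512-token chunk size and one prefill chunk per forward pass alongside the sequences already decoding. The function names and the per-step plan format are illustrative, not a real scheduler's API.

```python
def chunk_prefill(prompt_tokens, chunk_size=512):
    """Split one prompt's prefill into chunks of at most chunk_size tokens."""
    return [min(chunk_size, prompt_tokens - start)
            for start in range(0, prompt_tokens, chunk_size)]


def plan_steps(prompt_tokens, decoding_seqs, chunk_size=512):
    """One forward pass per chunk: a bounded prefill slice plus one decode
    token for each sequence that is already generating."""
    return [{"prefill_tokens": chunk, "decode_tokens": decoding_seqs}
            for chunk in chunk_prefill(prompt_tokens, chunk_size)]


chunks = chunk_prefill(1_500)
plan = plan_steps(1_500, decoding_seqs=100)
print(chunks)
print(plan[0])
```

Each pass now carries at most 512 prefill tokens, so the extra latency seen by decoding sequences is bounded by the chunk size rather than the prompt length.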
Disaggregated prefill/decode (advanced): route all prefill work to a dedicated "prefill pool" of GPUs optimized for high FLOP throughput; route decode work to a "decode pool" optimized for memory bandwidth and high batch size. The KV cache is transferred between pools after prefill completes. Eliminates interference entirely; adds network transfer cost (1,500 prompt tokens × 160 KB ≈ 240 MB of KV state per request, which is tolerable on InfiniBand, painful on Ethernet).
Without paging: you must pre-allocate the maximum possible context length for every new sequence at arrival — most of it wasted if the sequence is short. With paged KV cache (vLLM's key contribution): KV cache is divided into fixed-size "pages" (e.g., 16 tokens each); pages are allocated on demand as sequences grow. Fragmentation falls from O(max_len) to O(page_size). The practical effect: you can run 2–3× more concurrent sequences for the same VRAM.
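The memory effect of paging can be sketched by comparing reserved tokens under each scheme. The 8,192-token maximum context below is a hypothetical value chosen for illustration (the challenge does not state one); the 16-token page size is the example from the text.

```python
import math


def preallocated_tokens(max_len):
    """Without paging, every sequence reserves the full maximum context."""
    return max_len


def paged_tokens(seq_len, page_size=16):
    """With paging, a sequence holds only the pages it has touched."""
    return math.ceil(seq_len / page_size) * page_size


seq_len = 1_900
max_len = 8_192                     # hypothetical max context length

prealloc_waste = preallocated_tokens(max_len) - seq_len
paged_waste = paged_tokens(seq_len) - seq_len
print(prealloc_waste, paged_waste)
```

Waste per sequence under paging is always less than one page, which is the O(page_size) bound mentioned above.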
If 80% of your requests share a 500-token system prompt, and you cache the KV state for that prefix, each cache hit skips 500 of its 1,500 prompt tokens → ~33% less prefill work per hit (~27% fleet-wide at an 80% hit rate) and a jump in effective capacity. Implementation: hash the prefix, store its KV pages, reuse on cache hit. This is "prefix KV caching" or "prompt caching." Works best when prefix is long and hit rate is high. A wrong answer: "just cache the full response"; that only helps exact-repeat queries, not the common case of shared system prompts with novel user turns.
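The hash-store-reuse loop can be sketched with a dictionary standing in for the KV page store. `PrefixKVCache` and its placeholder page handles are hypothetical; a real engine stores references to GPU-resident KV pages rather than strings.

```python
import hashlib


class PrefixKVCache:
    """Toy prefix cache: maps a hash of the prefix tokens to a KV handle."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prefix_token_ids):
        joined = ",".join(map(str, prefix_token_ids))
        return hashlib.sha256(joined.encode()).hexdigest()

    def tokens_to_prefill(self, token_ids, prefix_len):
        """Return how many tokens still need prefill for this request."""
        key = self._key(token_ids[:prefix_len])
        if key in self._store:                      # hit: skip the prefix
            return len(token_ids) - prefix_len
        self._store[key] = f"kv-pages-{key[:8]}"    # placeholder handle
        return len(token_ids)


cache = PrefixKVCache()
system_prompt = list(range(500))                    # shared 500-token prefix
first = cache.tokens_to_prefill(system_prompt + [9001] * 1000, prefix_len=500)
second = cache.tokens_to_prefill(system_prompt + [9002] * 1000, prefix_len=500)
print(first, second)
```

The second request has a different user turn but the same system prompt, so only its 1,000 novel tokens are prefilled.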
When to use affinity: if your service supports multi-turn chat where the KV state from turn N is reused in turn N+1, routing turn N+1 to the same GPU as turn N avoids re-computing or re-transferring the KV state. This is a significant saving: re-prefilling 1,500 tokens from scratch costs ~500ms; reusing cached KV costs ~0ms. Implementation: a consistent-hash router keyed on session ID.
When affinity hurts: if one GPU gets a few very long-context sessions, its KV memory fills up faster than peers → hot-spot imbalance. Solution: track per-GPU KV utilization and re-route new sessions away from GPUs above a threshold (e.g., 85% KV full).
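One way to combine both rules is a hash ring that pins sessions but skips GPUs above the KV threshold for new sessions. This is a sketch under assumptions: `AffinityRouter`, the virtual-node count, and the explicit pin table are illustrative choices, and KV utilization is assumed to be reported by each GPU.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class AffinityRouter:
    def __init__(self, gpus, vnodes=64, kv_threshold=0.85):
        self._ring = sorted((_hash(f"{g}#{v}"), g)
                            for g in gpus for v in range(vnodes))
        self._points = [point for point, _ in self._ring]
        self._kv_util = {g: 0.0 for g in gpus}
        self._kv_threshold = kv_threshold
        self._pinned = {}                     # session_id -> gpu

    def report_kv_utilization(self, gpu, utilization):
        self._kv_util[gpu] = utilization

    def route(self, session_id):
        if session_id in self._pinned:        # keep multi-turn KV reuse
            return self._pinned[session_id]
        start = bisect.bisect(self._points, _hash(session_id))
        for step in range(len(self._ring)):   # walk ring past hot GPUs
            gpu = self._ring[(start + step) % len(self._ring)][1]
            if self._kv_util[gpu] < self._kv_threshold:
                break
        else:                                 # all hot: pick the least full
            gpu = min(self._kv_util, key=self._kv_util.get)
        self._pinned[session_id] = gpu
        return gpu


router = AffinityRouter(["gpu-0", "gpu-1", "gpu-2"])
home = router.route("session-abc")
router.report_kv_utilization(home, 0.90)
still_home = router.route("session-abc")
new_targets = {router.route(f"new-{i}") for i in range(200)}
print(home, still_home, sorted(new_targets))
```

Existing sessions stay where their KV state lives; only new sessions are steered away from the hot GPU.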
| Load level | Action | User impact |
|---|---|---|
| < 80% capacity | Normal serving, full context window | None |
| 80–100% | Scale out (add GPUs from autoscale group) | ~2–3 min ramp-up lag |
| 100–120% | Reduce max response length (400→200 tokens), shed lowest-priority sessions | Shorter answers |
| > 120% | Queue overflow requests with retry-after header; serve waitlisted users in FIFO order | Visible wait indicator |
| Catastrophic | Redirect to smaller (7B) fallback model on separate pool | Lower quality, available |
Key autoscale metric: KV-cache utilization per GPU, NOT CPU/GPU compute utilization. A GPU can be at 30% compute but 95% KV-full — it cannot accept new sessions. Use KV utilization as the primary scaling signal.
H100 on-demand: ~\$3.00/hr (cloud spot: ~\$1.80/hr). Sustained fleet: 5 GPUs × \$3.00 = \$15/hr.
Requests per hour at sustained load: 50,000 sessions × 4 req/hr = 200,000 req/hr (≈ 56 req/s).
Output tokens per hour: 200,000 req/hr × 400 tokens = 80M tokens/hr.
Input tokens per hour: 200,000 req/hr × 1,500 tokens = 300M tokens/hr.
Total tokens/hr: 380M.
Cost per 1M tokens: \$15/hr / 380M tokens/hr ≈ \$0.04 / 1M tokens.
That's well under the \$2.00 target, meaning 5 GPUs on-demand could serve sustained load profitably. The 3× peak fleet (14 GPUs) at \$42/hr over 380M × 3 = 1,140M tok/hr also gives ≈ \$0.04/1M, still fine. The \$2.00 budget has enormous margin, suggesting the real business risk is not cost but availability during peaks.
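The cost check can be recomputed directly from the traffic premise (50,000 sessions at 4 requests per session per hour, 1,900 tokens per request). A minimal sketch with an illustrative function name; prices are the on-demand figure quoted above.

```python
def cost_per_million_tokens(gpus, usd_per_gpu_hour, req_per_hour, tokens_per_req):
    """Blended (input + output) serving cost in USD per 1M tokens."""
    tokens_per_hour = req_per_hour * tokens_per_req
    return gpus * usd_per_gpu_hour / tokens_per_hour * 1e6


sustained_req_per_hour = 50_000 * 4                      # 200,000 req/hr
sustained = cost_per_million_tokens(5, 3.00, sustained_req_per_hour, 1_900)
peak = cost_per_million_tokens(14, 3.00, 3 * sustained_req_per_hour, 1_900)

print(f"sustained: {sustained:.3f} USD/1M, peak: {peak:.3f} USD/1M")
```

Both figures land far below the 2.00 USD target, which supports the conclusion that availability, not cost, is the binding risk.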
Speculative decoding consideration: a small draft model (e.g., 1B) proposes 4–8 tokens per step; the target (13B) verifies in one forward pass. Speedup: 2–3× on output tokens when the draft model's acceptance rate is high (common for constrained domains like customer service). Tradeoff: must serve TWO models; draft model adds ~2 GB VRAM; acceptance rate degrades on novel/creative queries. Implement behind a feature flag, measure acceptance rate in prod before committing to the 1B model in your fleet budget.
When it helps: latency to users (TTFT is sensitive to network RTT — a user 150ms away gets 150ms knocked off their 800ms budget before the GPU even starts); disaster recovery; data-residency compliance.
When it doesn't help throughput: splitting 50k sessions across 2 regions halves the demand each region must serve but doesn't reduce total GPU count. The model weights must be loaded in every region independently. Shared KV state across regions is impractical (network bandwidth is 10–100× too slow for real-time KV transfer). So multi-region = multi-copy of the fleet, not a cost saving.
Common wrong answer: "multi-region reduces latency AND saves money by sharing load." It reduces latency; it does NOT save money — it costs more because you cannot fractionally pool KV cache across regions.
- KV memory is the binding constraint, not FLOPs. For a 13B GQA model on H100, one token costs 160 KB of KV cache; 50 GB free → 164 concurrent sequences per GPU.
- Continuous batching + chunked prefill solves the variable-length + prefill/decode interference problems. Disaggregated prefill/decode is the Staff+ answer.
- Scale on KV-cache utilization, not GPU compute utilization — they diverge badly.
- Cost math: 5 H100s at \$3/hr serve ~380M tokens/hr (200,000 req/hr × 1,900 tokens) ≈ \$0.04/1M tokens. The \$2.00 target has ~50× headroom, so availability risk matters more than cost risk here.
- Session affinity saves one full prefill per multi-turn exchange; route to a different GPU only when KV utilization > threshold.
| Level | What a passing answer includes |
|---|---|
| Junior | Names vLLM / TGI, mentions "we need multiple GPUs," knows KV cache exists. No numbers. |
| Senior | Derives GPU count from token throughput, sizes KV cache, picks continuous batching, mentions paged KV. Ballpark cost estimate. Knows autoscale is needed. |
| Staff | Full arithmetic in the solution (KV bytes per token formula, utilization haircut, cost per 1M tokens computation). Identifies KV utilization as the correct scaling signal. Discusses chunked prefill vs. disaggregated prefill with tradeoffs. Prefix caching for system prompts. Speculative decoding evaluation (not just "use it"). Multi-region cost reality check. Degradation ladder with concrete thresholds. |
Trigger: "Design an LLM serving system for N concurrent users."
- Derive peak output token demand (sessions × req/hr × 3× burst × output tokens/req).
- Compute KV bytes per token for the given model (2 × layers × KV-heads × head-dim × 2 bytes).
- Divide free VRAM by KV bytes per token to get max concurrent sequences per GPU.
- Compute decode throughput: bandwidth / model-size × batch-size × utilization-haircut.
- Divide fleet demand by per-GPU throughput → GPU count.
- Choose continuous batching + chunked prefill; justify with latency SLO.
- Scale on KV utilization, not compute. Build degradation ladder.
- Compute cost per 1M tokens at the end to validate against target.
Never: say "just add more GPUs" without showing the arithmetic, or recommend scaling on CPU/GPU compute utilization.
Q1. A 13B model in fp16 has 13 billion parameters. How many GB of VRAM does that require, and what's left for KV cache on an H100?
Q2. Why is "GPU utilization" a bad autoscale signal for LLM serving?
Q3. What is continuous batching and how does it differ from static batching?
Q4. What is the prefill/decode interference problem, and what are the two main solutions?
Q5. Walk me through the KV-cache bytes-per-token formula for a 13B model with GQA.
Q6. When does speculative decoding help, and when does it not?
Q7. How does paged KV cache (vLLM) improve memory utilization over pre-allocated KV cache?
Q8. The product team wants to reduce cost by 50%. What levers do you pull, in order of risk?
Q9. Why doesn't multi-region deployment reduce your GPU cost?
Q10. What is session affinity in LLM serving, and what is the risk of getting it wrong in both directions?
Q11. What is the "memory wall" in LLM serving and how does GQA address it?
A feature store is the connective tissue between raw data and your models. When it is built wrong, training silently sees the future and your offline metrics become fiction. This challenge walks you through diagnosing a point-in-time leak from first principles, then designing the offline/online infrastructure that prevents it permanently. Mastering this pattern is a prerequisite for any Staff-level ML systems role.
Your company runs two high-stakes ML consumers: a fraud detection model (inference within 200 ms of each transaction) and a feed ranking model (batch-scored nightly, re-ranked live). Both draw from a shared feature platform serving 200 features across 8 data-engineering teams.
The platform has grown organically. Features come from two source types:
- Streaming — Kafka events (transaction counts, session activity, engagement signals) written to Redis. Sub-second freshness.
- Batch — daily Spark jobs writing to Hive/Parquet (30-day aggregates, user segments, graph-derived scores). Freshness: 18–36 hours.
Training data is assembled monthly by a script that joins the label table (fraud outcomes, clicks) against the feature tables on user_id. The script has no special time handling — it just does a plain SQL join.
The numbers that triggered this conversation:
| Metric | Offline (held-out test set) | Online (A/B shadow mode) |
|---|---|---|
| AUC-ROC | 0.92 | 0.71 |
| Precision @ top decile | 0.84 | 0.51 |
| Feature coverage | 99.8 % | 88.3 % |
The gap is 0.21 AUC — catastrophic. A naive attribution says the model is bad, but the model never changed between offline and online evaluation. Something upstream is poisoned.
- Diagnose — identify the leak class (training-serving skew, label leakage, target leakage, point-in-time leak). Show with a concrete 4-row table what the wrong join returns vs. what the correct as-of join should return.
- Design the offline store — schema for a timestamped feature log, the point-in-time join algorithm, backfill strategy.
- Design the online store — Redis/KV layout for streaming features and a low-latency serving API. Justify why it stays separate from the offline store.
- Define freshness tiers — map the 200 features to at least 3 tiers; state the SLA and mechanism for each.
- Describe parity testing — how you continuously verify that offline training distributions match online serving distributions.
- State the late-data policy — what happens when a batch job is delayed 6 hours? Who owns what feature?
Hint 1 — what does the 0.92 vs 0.71 gap pattern tell you?
Offline metrics that are too GOOD are as diagnostic as online metrics that are too bad. What class of bug systematically inflates offline evaluation? What information could the training set contain that serving never will?
Hint 2 — look at the monthly training-set build
Training sets are built monthly from the warehouse's CURRENT feature tables. A label from June 3rd is joined against… which version of the user's 7-day aggregate? What did serving see on June 3rd?
✅ Model solution
Diagnosis. The monthly build joins labels against end-of-month feature snapshots — every training row sees features computed AFTER its label event, including the label's own consequences. That's point-in-time leakage; the 0.92 is fiction. The wrong-vs-right join:
| Label event | WRONG join (month-end snapshot) | RIGHT join (as-of) |
|---|---|---|
| fraud_flag @ Jun 3, 12:04 | txn_count_7d as of Jun 30 (window Jun 24–30: entirely after the label, so it reflects the fraud's aftermath) | txn_count_7d as of Jun 3, 12:04 (data through Jun 3) |
| click @ Jun 12, 09:15 | user_ctr_30d as of Jun 30 (includes this click) | user_ctr_30d as of Jun 12, 09:15 |
Architecture. (1) Timestamped feature log: every feature value is appended with its computation timestamp — the offline store is feature HISTORY, not current state. (2) As-of join engine for training-set builds: for each label, fetch the latest value with ts ≤ label_ts (and optionally ts ≥ label_ts − staleness_bound to mimic serving staleness). (3) Online store holds latest values, written by the same pipelines that append history — one definition, two materializations. (4) Registry: per-feature owner, freshness tier, lineage. (5) API: get_online(keys, features) for serving; build_training_set(labels_df with timestamps, features) for offline — teams never hand-write joins.
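The as-of lookup at the heart of item (2) can be sketched with the standard library. The feature log layout `(entity_id, ts, value)`, the function names, and the use of fractional day numbers as timestamps are all assumptions made for illustration; a production build would run this as a distributed join.

```python
import bisect
from collections import defaultdict


def build_history(feature_log):
    """Group (entity_id, ts, value) rows into per-entity, time-sorted lists."""
    history = defaultdict(lambda: ([], []))
    for entity, ts, value in sorted(feature_log, key=lambda r: (r[0], r[1])):
        times, values = history[entity]
        times.append(ts)
        values.append(value)
    return history


def as_of(history, entity, label_ts, staleness_bound=None):
    """Latest value with ts <= label_ts, or None if missing or too stale."""
    if entity not in history:
        return None
    times, values = history[entity]
    i = bisect.bisect_right(times, label_ts)
    if i == 0:
        return None
    if staleness_bound is not None and label_ts - times[i - 1] > staleness_bound:
        return None
    return values[i - 1]


# Hypothetical txn_count_7d history for one user; ts is the day of June.
log = [("u1", 1.0, 4), ("u1", 3.0, 6), ("u1", 10.0, 55), ("u1", 30.0, 9)]
history = build_history(log)

right = as_of(history, "u1", label_ts=3.5)           # value serving saw
wrong = history["u1"][1][-1]                         # month-end snapshot
too_stale = as_of(history, "u1", 3.5, staleness_bound=0.25)
print(right, wrong, too_stale)
```

The month-end value differs from what serving saw at label time, which is exactly the leak the plain join introduces.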
Freshness tiers. Classify the 200 features: batch daily (demographics, long aggregates), near-real-time micro-batch (hourly counters), streaming (fraud velocity — the fraud team's true requirement). Each tier ~10× the ops cost of the previous; default down.
Parity & late data. Continuous online/offline parity check: sample serving reads, replay as-of offline, alert on divergence rate. Late-arriving events: features recompute with event-time watermarks; the as-of join must use the value as it WAS at serving time — which logged-at-scoring features give you for free (log features at inference; train on the logs; the leak class disappears by construction — say this as the strongest variant).
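The parity check reduces to comparing sampled serving reads against their offline as-of replay. A minimal sketch, assuming both sides are available as dictionaries keyed by `(entity, feature, ts)`; the key layout, tolerance, and treating a missing replay value as a divergence are illustrative choices.

```python
import math


def divergence_rate(served, replayed, rel_tol=1e-6):
    """Fraction of sampled serving reads whose offline replay disagrees."""
    if not served:
        return 0.0
    mismatches = 0
    for key, online_value in served.items():
        offline_value = replayed.get(key)
        if offline_value is None or not math.isclose(
                online_value, offline_value, rel_tol=rel_tol):
            mismatches += 1
    return mismatches / len(served)


served = {
    ("u1", "txn_count_7d", 3.5): 6.0,
    ("u2", "txn_count_7d", 3.5): 3.0,
    ("u3", "user_ctr_30d", 12.4): 0.12,
    ("u4", "user_ctr_30d", 12.4): 0.30,
}
replayed = {
    ("u1", "txn_count_7d", 3.5): 6.0,
    ("u2", "txn_count_7d", 3.5): 4.0,     # offline pipeline disagrees
    ("u3", "user_ctr_30d", 12.4): 0.12,
    # u4 is missing offline: counted as a divergence
}
rate = divergence_rate(served, replayed)
print(rate)
```

Alerting on this rate per feature points at the pipeline that drifted rather than at the model.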
Common wrong answers. "Retrain more often" (doesn't touch the join), "add regularization" (the 0.92 isn't overfitting in the classical sense), "feature selection to remove the leaky ones" (every aggregate leaks when joined wrong — the JOIN is broken, not particular features).
- Offline ≫ online with a monthly snapshot build = point-in-time leakage until proven otherwise.
- The fix is architectural: timestamped history + as-of joins, or logged-at-scoring features.
- Freshness is tiered and priced; parity checks are the skew alarm; lineage answers "who breaks if this table is wrong."
| Level | Answer signature |
|---|---|
| Junior | Names "data leakage" vaguely; proposes retraining or model changes. |
| Senior | Diagnoses the as-of join precisely with the table above; designs offline/online stores + the training-set API; tiers freshness by need. |
| Staff | Adds logged-at-scoring as the structural cure, parity monitoring as the regression guard, late-data/backfill policy, lineage for blast radius, and a migration plan for 8 teams' existing pipelines with the fraud team's 5ms reads handled separately. |
This challenge puts you in front of the hardest constraint in enterprise search: retrieval quality and authorization must both be correct, and the failure modes are asymmetric — a missed answer frustrates a user, but a leaked document can end careers. You will design a RAG system for 50,000 employees over 5 million documents, reason through exactly where permission filtering belongs in the pipeline, and build an evaluation plan that stress-tests both relevance and information leakage.
Your company runs three internal knowledge sources that must be unified into a single Q&A interface:
- Confluence wiki: 3.2M pages. Access rules: space-level (public, team-private, exec-only) plus page-level overrides.
- Google Drive: 1.5M docs/sheets/slides. Access rules: per-file ACL with ~6 principals on average; changes propagate from folder inheritance.
- Jira: 300k tickets. Access rules: project-level, with some tickets marked "security" and restricted to 12 people company-wide.
Scale numbers:
- Ingestion architecture: how do 5M documents get chunked, embedded, and stored? What metadata must travel with every chunk?
- Incremental indexing pipeline: when a doc is edited or deleted, what happens within the 1-hour SLA? Define the tombstone strategy.
- ACL filtering decision: where exactly does permission filtering happen — before retrieval, during retrieval, or after retrieval? Argue for your choice. Explain why the alternative that "feels" natural is wrong.
- Retrieval stack: justify your choice of sparse, dense, or hybrid retrieval. If hybrid, specify the fusion mechanism.
- Evaluation plan: define metrics and test sets for both answer quality and leakage. What does a leakage red-team look like?
- Permission-aware cache: a senior interviewer will ask whether you cache query results. Walk through the subtlety.
Time-box yourself to 25 minutes before opening any hint. Write your answer on paper or in a doc first.
Hint 1 — Where does permission filtering live?
Draw the retrieval pipeline as a sequence of stages: query → candidate retrieval → filtering → re-ranking → generation. Now ask: at which stage do you know which documents the user is allowed to see? At which stage is it too late? The key insight is about what the LLM does with context it receives — does it reliably ignore unauthorized passages?
Also think about the difference between filtering 1M candidates down to 50 authorized ones, versus retrieving the top-50 and then checking authorization. The numbers are the same only if your retrieval is perfect — and it never is.
Hint 2 — Incremental indexing and the tombstone problem
When a document is deleted or its ACL is restricted (e.g., a page is moved to exec-only), what happens to chunks that were already indexed? Vector stores are append-friendly but deletion is expensive. A tombstone is a marker that tells the retrieval layer "this chunk exists in the index but is logically deleted — skip it." Think about how tombstones interact with your filtering logic.
Freshness within 1 hour means your ingestion pipeline cannot batch daily. What does an event-driven pipeline look like? What triggers it?
Hint 3 — The permission-aware cache trap
Caching is obvious for performance: if 100 employees ask "what is our parental leave policy?" you should not run 100 RAG pipelines. But the cache key cannot be just the query string. Think about what happens if Alice can see HR-confidential docs and Bob cannot — they ask the same question and you cache the result. Whose answer gets served to the other?
The fix requires the cache key to encode the user's permission set, but permission sets are large (thousands of doc IDs) and change hourly. What is a practical approximation?
✅ Model solution
The crux: filter at query time, against an index, never post-hoc. Post-hoc filtering (retrieve 50, ask an LLM or rule layer to drop unauthorized docs) fails three ways: the unauthorized content already left the vector store toward your application layer (leak surface), recall collapses when most of the top-k gets filtered (the user with narrow permissions gets 2 results), and an LLM judging permissions will eventually be wrong once — which is once too many. Correct: the retrieval query carries the user's permission predicate, evaluated INSIDE the search engine against indexed ACL metadata, so unauthorized vectors are never candidates.
Permission model. Per-doc ACLs (users, groups, classification labels) flattened into indexable terms (group IDs, tenant IDs, sensitivity tier) stored alongside each chunk's vector. Query side: resolve the user → their groups/entitlements (cached minutes, not hours — hourly ACL churn is the requirement) → filter expression ANDed into both the BM25 and the dense ANN search (filtered HNSW / pre-filtered IVF; modern engines support metadata-filtered ANN natively).
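The difference between filtering inside the search and filtering afterwards shows up even in a toy example. This sketch uses brute-force dot-product search as a stand-in for filtered ANN; the chunk records, group names, and function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_prefiltered(query_vec, chunks, user_groups, k):
    """Permission predicate is applied before ranking: only authorized
    chunks can ever become candidates."""
    allowed = [c for c in chunks if c["acl"] & user_groups]
    return sorted(allowed, key=lambda c: -dot(query_vec, c["vec"]))[:k]


def search_postfiltered(query_vec, chunks, user_groups, k):
    """Anti-pattern: rank everything, then drop unauthorized results."""
    top = sorted(chunks, key=lambda c: -dot(query_vec, c["vec"]))[:k]
    return [c for c in top if c["acl"] & user_groups]


chunks = [
    {"id": "exec-plan", "vec": (1.0, 0.0), "acl": {"exec"}},
    {"id": "exec-memo", "vec": (0.9, 0.1), "acl": {"exec"}},
    {"id": "wiki-faq", "vec": (0.7, 0.3), "acl": {"all-staff"}},
    {"id": "wiki-howto", "vec": (0.6, 0.4), "acl": {"all-staff"}},
]
query = (1.0, 0.0)
groups = {"all-staff"}

pre = [c["id"] for c in search_prefiltered(query, chunks, groups, k=2)]
post = [c["id"] for c in search_postfiltered(query, chunks, groups, k=2)]
print(pre, post)
```

The post-filtered path both pulled restricted chunks into the application layer and returned nothing useful, which are two of the three failure modes named above.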
Ingestion & freshness. Source connectors emit (doc, ACL, content) change events → Kafka → incremental pipeline: re-chunk/re-embed only changed docs; ACL-only changes update metadata WITHOUT re-embedding (cheap, hits the <1h SLA); deletes write tombstones that suppress results immediately and compact later. Full re-index stays as a weekly repair job, not the freshness path.
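The three update paths (content change, ACL-only change, delete) can be sketched as a small in-memory index. `ChunkIndex` and its method names are hypothetical, and `embed` is any caller-supplied embedding function; the point is which paths call it.

```python
class ChunkIndex:
    def __init__(self, embed):
        self._embed = embed
        self._chunks = {}            # chunk_id -> {"vec": ..., "acl": set}
        self._tombstones = set()

    def upsert_content(self, chunk_id, text, acl_groups):
        """Content changed: re-embed (the expensive path)."""
        self._chunks[chunk_id] = {"vec": self._embed(text),
                                  "acl": set(acl_groups)}
        self._tombstones.discard(chunk_id)

    def update_acl(self, chunk_id, acl_groups):
        """ACL-only change: metadata update, no re-embedding."""
        self._chunks[chunk_id]["acl"] = set(acl_groups)

    def delete(self, chunk_id):
        """Tombstone now; physical removal waits for compaction."""
        self._tombstones.add(chunk_id)

    def visible_ids(self, user_groups):
        groups = set(user_groups)
        return sorted(cid for cid, c in self._chunks.items()
                      if cid not in self._tombstones and c["acl"] & groups)

    def compact(self):
        for cid in self._tombstones:
            self._chunks.pop(cid, None)
        self._tombstones.clear()


embed_calls = []
index = ChunkIndex(embed=lambda text: embed_calls.append(text) or (len(text),))
index.upsert_content("page-1#0", "roadmap draft", {"all-staff"})
index.upsert_content("page-2#0", "onboarding guide", {"all-staff"})

index.update_acl("page-1#0", {"exec"})      # page moved to exec-only
index.delete("page-2#0")                    # page deleted at the source
staff_view = index.visible_ids({"all-staff"})
exec_view = index.visible_ids({"exec"})
print(len(embed_calls), staff_view, exec_view)
```

Neither the ACL restriction nor the delete triggered an embedding call, and both took effect on the very next query.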
Retrieval stack. Hybrid BM25 + dense with RRF fusion (enterprise queries are full of exact tokens — error codes, project names — where keywords beat embeddings), then a permission-safe cross-encoder re-rank over the top ~50. Generation cites only retrieved chunks; the answer layer never sees unauthorized text by construction.
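Reciprocal rank fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The sketch below uses k = 60, a commonly used default, and breaks ties by document ID; both input rankings are assumed to be already permission-filtered.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["a", "b", "c"]       # hypothetical keyword results
dense_ranking = ["b", "c", "d"]      # hypothetical embedding results
fused = rrf([bm25_ranking, dense_ranking])
print(fused)
```

Documents found by both retrievers rise to the top even when neither ranked them first, which is why RRF needs no score calibration between BM25 and cosine similarity.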
The cache subtlety. A shared answer/retrieval cache keyed only on the query leaks across users (user A's cached answer contains doc X; user B without access asks the same question). Options: key the cache by (query, permission-set hash) — works when permission sets cluster into groups; or cache only retrieval candidates pre-filter and re-apply the user's filter on every hit (cache the expensive part, never the authorization). Never cache post-authorization content across principals.
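The first option, keying on (query, permission-set hash), might look like the following sketch. It assumes the user's permissions have already been resolved to a list of group IDs; the helper names and the 16-character hash truncation are illustrative.

```python
import hashlib


def permission_hash(group_ids):
    """Order-independent digest of the user's resolved groups."""
    canonical = ",".join(sorted(group_ids))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def cache_key(query, group_ids):
    """Cache key that can only be shared by users with identical groups."""
    normalized = " ".join(query.lower().split())
    query_hash = hashlib.sha256(normalized.encode()).hexdigest()[:16]
    return f"{permission_hash(group_ids)}:{query_hash}"


alice = cache_key("Parental leave policy?", ["hr-confidential", "all-staff"])
carol = cache_key("parental  leave policy?", ["all-staff", "hr-confidential"])
bob = cache_key("Parental leave policy?", ["all-staff"])
print(alice == carol, alice == bob)
```

Hit rate depends on how many users share an identical group set, so this works best when entitlements cluster into a small number of roles.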
Eval. Two suites: quality (recall@k/MRR against a labeled query→doc set, per department) and LEAKAGE red-team: synthetic users with known entitlements issue queries engineered to surface forbidden docs (exact-title queries, quote fragments); the gate is zero unauthorized chunks retrieved across the suite, run on every index/code change. Audit log of (user, query, docs retrieved) for compliance.
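The leakage gate can be expressed as a function that must return an empty list before a change ships. This is a sketch: `retrieve(user, query)` is a hypothetical interface returning retrieved doc IDs, and the two retrievers below are fakes that exist only to exercise the gate.

```python
def leakage_violations(retrieve, cases):
    """Run red-team cases; report any forbidden doc that was retrieved.

    cases: iterable of (user, query, forbidden_doc_ids).
    """
    violations = []
    for user, query, forbidden_ids in cases:
        leaked = set(retrieve(user, query)) & set(forbidden_ids)
        if leaked:
            violations.append((user, query, sorted(leaked)))
    return violations


cases = [
    ("intern-1", "Q3 layoffs plan exact title", {"exec-plan"}),
    ("intern-1", "quote fragment from security ticket", {"SEC-12"}),
]


def safe_retrieve(user, query):
    return ["wiki-faq", "wiki-howto"]


def leaky_retrieve(user, query):
    return ["wiki-faq", "exec-plan"]


safe_result = leakage_violations(safe_retrieve, cases)
leaky_result = leakage_violations(leaky_retrieve, cases)
print(safe_result, leaky_result)
```

In CI the gate would fail the build whenever the returned list is non-empty.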
Common wrong answers. Post-retrieval LLM filtering (leak surface + recall collapse); per-user indexes (materializing 5M docs × 50k users is not feasible to build or keep fresh); ignoring ACL-only updates (forcing re-embeds blows the freshness SLA and the GPU bill).
- Authorization is a RETRIEVAL predicate, not a generation instruction.
- ACL changes ≠ content changes: metadata updates must be cheap and fast; embeddings only recompute on content change.
- Caches keyed on query alone leak across users — key on permissions or cache pre-authorization artifacts only.
- Ship with a leakage red-team suite as a blocking gate, not just quality metrics.
| Level | Answer signature |
|---|---|
| Junior | RAG pipeline recited; permissions handled by "filtering the results" after retrieval. |
| Senior | ACL-filtered ANN at query time, incremental indexing with tombstones, hybrid retrieval with RRF, separates ACL updates from re-embeds. |
| Staff | Adds the cache-leak subtlety, the leakage red-team gate, group-flattening of ACLs with churn-rate math, citation-constrained generation, audit logging, and the cost story (re-embed budget vs metadata updates). |
Your ranking service p99 just tripled overnight. p50 is perfectly calm. GPU utilization hasn't budged. Something subtle changed at deploy time — and the telemetry, if you read it right, tells you exactly what. This challenge trains you to read the latency percentile signature, trace the fan-out path, and prescribe the cheapest fix before you ever touch a rollback button.
You run a ranking service for a content platform. At 13:45 your team deployed a model update that added 12 new features — all fetched from the same feature store, all served through the existing feature service. No infra changes, no schema migrations. Deployment completed cleanly with zero errors in the deploy log.
At 14:00 the on-call engineer gets paged. Here is what the dashboards show:
| Metric | 13:30 (pre-deploy) | 14:15 (post-deploy) | Delta |
|---|---|---|---|
| Ranking service p99 (ms) | 80 | 260 | +225% |
| Ranking service p50 (ms) | 42 | 44 | +5% (flat) |
| Ranking service p95 (ms) | 61 | 181 | +197% |
| Feature service p99 (ms) | 18 | 74 | +311% |
| Feature service p50 (ms) | 9 | 10 | +11% (flat) |
| GPU utilization (%) | 61 | 62 | flat |
| Feature cache hit rate (%) | 92 | 61 | −31 pp |
| Feature fetch fan-out (calls/req) | 14 | 26 | +86% |
| Ranking service error rate (%) | 0.02 | 0.03 | flat |
Additional context: the feature cache is a shared Redis cluster with an LRU eviction policy and a 4-hour TTL. The 12 new features have never been requested before this deploy. Total cache capacity is 80 GB. Cache occupancy before the deploy was 71 GB.
You have 15 minutes before the incident commander asks for a status update. Do the following:
- Rank your hypotheses. List at least 4 plausible causes, in descending order of likelihood given the telemetry above. For each, say which data points support or contradict it.
- Name the 3 cheapest probes you would run right now to confirm or eliminate the top hypothesis — ordered cheapest-first (read-only before write).
- Prescribe the immediate fix assuming your top hypothesis is confirmed.
- Describe the prevention — what process change ensures this never pages you at 14:00 again?
Hint 1 — Read the percentile signature
p50 is flat; p99 is 3× worse. What does that tell you about the distribution of affected requests? If the scoring model were slow, every request would be slower and both percentiles would shift. If a downstream call were uniformly slow, every request would also be slower. So what class of event affects only the slowest requests?
Think about caches. When a cache returns a hit, the request is fast. When it misses, the request falls through to the origin and is slow. If your cache hit rate drops from 92% to 61%, what fraction of requests now experience the slow path? Does that fraction match the percentile that spiked?
Hint 2 — Trace the fan-out path
The feature fetch fan-out went from 14 calls/request to 26 calls/request. Those 12 extra calls correspond exactly to the 12 new features. Now: each of those calls is a cache miss (the keys have never existed). A cache miss on a shared Redis cluster triggers a synchronous read from the feature store backend. The ranking service's p99 latency is bounded by the slowest of its parallel fan-out calls. What happens to the maximum of 26 independent random variables compared to the maximum of 14?
Extreme-value statistics: the expected maximum of n i.i.d. random variables grows with n. More calls → higher chance one of them hits a slow path (GC pause, hot key, network jitter). That's why p99 blows up while p50 stays flat.
Hint 3 — Cache eviction and the secondary effect
Before the deploy, cache occupancy was 71 GB out of 80 GB. The 12 new features are being written into the cache on first fetch. Under LRU eviction, the new keys displace old ones. Which old keys get evicted first? The least-recently-used ones, which are often low-frequency features that were already borderline. This creates a secondary wave of cache misses on previously-warm features, further degrading hit rate beyond just the 12 new features. This explains why hit rate fell by 31 pp (92% → 61%) rather than a smaller amount proportional to just the new features.
✅ Model solution — full causal chain, probes, fix, prevention
Primary cause: The deploy introduced 12 new feature keys that had zero cache entries. Every request immediately triggered 12 cache misses, each fanning out to the feature store backend synchronously. The cache hit rate dropped from 92% to 61% because: (a) until a user's new keys are cached, 12 of the 26 calls per request (46%) are guaranteed misses, and (b) the cache is near capacity (71/80 GB), so new key writes evict warm old keys under LRU, causing secondary misses on previously-cached features. The observed 39% miss rate at 14:15 sits below 46%, which is consistent with the new keys already being partly warm for frequently seen users.
Why p50 is flat but p99 exploded: Cache hits are fast (~5ms). Misses fall through to the feature store backend (~60-80ms). A request with all hits completes in ~42ms (the p50). A request with even one slow miss completes in max(all_calls). With 26 calls/request and 39% miss rate, the expected number of misses per request is 26 × 0.39 ≈ 10, so at least one miss per request is almost certain. The expected maximum latency of 26 calls also sits further above the mean than the maximum of 14: for n i.i.d. exponential latencies the expected maximum is the mean times the harmonic number H_n ≈ ln n + 0.58, so it grows as O(log n). This is why p99 (the tail of the tail) exploded while p50 (median) barely moved.
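Both tail arguments are easy to put numbers on. The sketch below assumes calls miss independently at the fleet-wide rate and that call latencies are i.i.d. exponential, which are idealizations rather than properties of the real system.

```python
def prob_at_least_one_miss(calls, hit_rate):
    """P(a request sees one or more cache misses), assuming independence."""
    return 1 - hit_rate ** calls


def harmonic(n):
    """H_n: expected max of n i.i.d. exponentials, in units of their mean."""
    return sum(1 / i for i in range(1, n + 1))


p_miss_after = prob_at_least_one_miss(calls=26, hit_rate=0.61)
expected_misses = 26 * (1 - 0.61)
max_ratio = harmonic(26) / harmonic(14)

print(f"P(>=1 miss) after deploy: {p_miss_after:.6f}")
print(f"expected misses per request: {expected_misses:.1f}")
print(f"expected-max growth from 14 to 26 calls: {max_ratio:.2f}x")
```

Under these assumptions the wider fan-out alone raises the expected maximum by only about 19%, so most of the tail blow-up has to come from the misses themselves.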
Step 1 (now, 0 min): Pre-warm the cache for the 12 new feature keys before routing production traffic. Run a background job that fetches all 12 features for the top-N user IDs (e.g., top 1M by DAU) and writes them into Redis. At 100k users/minute this takes ~10 minutes and fills the cache without user-visible latency impact.
Step 2 (now, parallel): If p99 is breaching SLA, immediately enable hedge requests — send a second parallel feature fetch for any call that hasn't returned within 50ms, take whichever returns first. This adds overhead but caps tail latency.
Step 3 (short-term): Batch the 26 feature calls into a single multi-get request to the feature service instead of 26 serial or parallel individual calls. This reduces connection overhead and allows the feature service to pipeline responses, cutting p99 on cache-miss paths significantly.
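The hedge request from Step 2 can be sketched with `asyncio`. `hedged_fetch` is an illustrative helper, `fetch` is any caller-supplied coroutine function, and the demo fetcher simulates a slow first attempt followed by a fast second one.

```python
import asyncio


async def hedged_fetch(fetch, key, hedge_after_s=0.050):
    """Start one fetch; if it is still pending after hedge_after_s, start a
    second and return whichever finishes first."""
    primary = asyncio.create_task(fetch(key))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()
    backup = asyncio.create_task(fetch(key))
    done, pending = await asyncio.wait({primary, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                  # drop the slower attempt
    return next(iter(done)).result()


attempts = []


async def slow_then_fast(key):
    """Demo fetcher: the first attempt is slow, later attempts are fast."""
    attempts.append(key)
    attempt = len(attempts)
    await asyncio.sleep(0.5 if attempt == 1 else 0.01)
    return (key, attempt)


result = asyncio.run(hedged_fetch(slow_then_fast, "user:42"))
print(result, len(attempts))
```

Hedging doubles load on the slow path, so it is usually capped to a small fraction of calls; that cap is omitted here for brevity.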
- Pre-deploy warming job (required for feature additions): Any PR that adds new feature keys must include a companion warming script. The deploy pipeline runs this script against production cache before traffic cut-over. Warming must reach >80% hit rate on new keys before the gate passes.
- p99 canary gate: Deployments are gated on a 5-minute canary window where 5% of traffic runs on the new version. If p99 exceeds 1.5× baseline during the canary window, the deploy is automatically held and the team paged — before full rollout.
- Cache capacity headroom policy: Cache occupancy must not exceed 60% at deploy time (current: 89%). Enforce this as a pre-deploy check. Headroom absorbs new keys without evicting warm entries.
- Fan-out alerting: Alert if per-request feature call count increases by >20% across a deploy. Automatically flags new feature additions for warming review.
- Challenge the fan-out design: 26 serial/parallel feature calls per request is a design smell. A Staff candidate asks: "Why aren't these batched into a single request to the feature service?" and proposes a feature vector API that returns all features for a user in one round-trip.
- Cache sizing math: 12 new features × 1M active users × ~200 bytes/feature = 2.4 GB new cache data. This fits in the 9 GB headroom if we had maintained the 60% policy. A simple back-of-envelope before the deploy would have flagged the eviction risk.
- Org process: "This incident reveals that we have no staging environment that exercises the feature cache at production scale. A shadow-traffic replay with production cache state would have caught this in CI."
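The cache sizing and warming arithmetic above is small enough to script as a pre-deploy check. A minimal sketch, assuming the figures quoted in this challenge (12 features, 1M users, ~200 bytes per feature, 9 GB of headroom, 100k users/minute warming rate); the function names are illustrative.

```python
def new_cache_bytes(num_features: int, num_users: int, bytes_per_feature: int) -> int:
    """Bytes of new cache data created by adding feature keys for a user set."""
    return num_features * num_users * bytes_per_feature


def warming_minutes(num_users: int, users_per_minute: int) -> float:
    """Wall-clock minutes for a background job to warm the given user set."""
    return num_users / users_per_minute


added = new_cache_bytes(num_features=12, num_users=1_000_000, bytes_per_feature=200)
headroom_bytes = 9_000_000_000  # headroom under the 60% occupancy policy (from the text)

print(f"new data: {added / 1e9:.1f} GB, fits in headroom: {added <= headroom_bytes}")
print(f"warming time: {warming_minutes(1_000_000, 100_000):.0f} min")
```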
| Level | Answer signature |
|---|---|
| Junior | Identifies cache hit rate as relevant; suggests rollback; doesn't explain the p50-flat / p99-spike signature |
| Senior | Explains cold-cache + fan-out amplification; orders probes cheapest-first; prescribes warming before rollback; mentions hedge requests |
| Staff+ | All of the above + challenges fan-out design, sizes the cache impact with arithmetic, proposes the canary gate + staging environment gap, discusses the org process change needed |
Trigger: "p99 spiked but p50 is flat" or "latency got worse right after a deploy."
- Say the phrase: "flat p50 + spiked p99 is a cold-cache or fan-out signature, not a throughput problem."
- Ask: "Did cache hit rate change? Did the number of downstream calls per request change?" Pull those two metrics first.
- Trace the causal chain: new keys → misses → slow origin fetches → tail amplified by fan-out → p99 blows up.
- Prescribe probes in order: read-only confirmation (cache stats) → trace sample → canary slice.
- Fix is warming, not rollback (unless SLA is critically breached and warming takes >15 min).
Never: Jump to "scale up compute" when GPU/CPU utilization is flat. Never roll back without first confirming the hypothesis — you may be trading a cache problem for a model quality regression.
- p50 flat + p99 high = cold cache or fan-out amplification. Every request hits at least one slow path; median is unaffected because most paths are still fast.
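- Fan-out magnifies tail latency. The expected maximum of N i.i.d. calls grows as O(log N), so going from 14 to 26 calls raises the expected slowest call by only about 20% on its own (for exponential latencies). The bigger effect is that more calls per request means more chances to land on a cache miss.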
- LRU eviction cascades. A near-full cache plus new keys evicts warm old keys, creating secondary misses beyond the new features alone. Always maintain headroom.
- Cheap probes first. Read-only → low-risk write → rollback. Don't rollback before you've confirmed the cause; you may regress quality for no gain.
Q1. What does a flat p50 with a tripled p99 tell you about the nature of the problem?
Q2. Cache hit rate dropped from 92% to 61%. Why did eviction of old keys happen, and how does it compound the problem?
Q3. Why is pre-warming the cache better than an immediate rollback in this scenario?
Q4. Explain "hedge requests" and when they help vs. hurt.
Q5. A colleague suggests adding a Redis replica for higher availability. Does this help with p99 latency?
Q6. How would you design a deploy-time cache warming system for a feature store?
Q7. The feature fan-out went from 14 to 26 calls per request. Why is batching the 26 calls into a single multi-get better than parallelizing them?
Q8. The incident commander asks: "Should we add a p99 latency SLO for the feature service?" What do you say?
Q9. Same scenario but GPU utilization had jumped from 61% to 95% instead of staying flat. How does your diagnosis change?
Q10. You've confirmed the cold-cache hypothesis. The engineering manager asks whether to roll back, fix forward (warm cache), or just cap the new feature set to 6 features instead of 12. How do you advise?
A 7B-parameter pretraining run hits step 41,200 and within 300 steps loss goes from 2.1 to 9.8 to NaN. This challenge walks you through reading the telemetry, forming ordered hypotheses, running the cheapest-first probes, recovering from checkpoint, and installing gates so the same failure never kills another run. The skills here transfer directly to any large-scale training fire.
You are on-call for a 7B dense transformer pretraining run. Configuration:
Telemetry snapshot (extracted from the run's W&B dashboard):
| Step | Train loss | Grad norm | Loss-scale (GradScaler) | Notes |
|---|---|---|---|---|
| 40,800 | 2.09 | 0.82 | 65,536 | All normal |
| 41,000 | 2.11 | 0.79 | 65,536 | Shard 40 still active |
| 41,050 | 2.13 | 0.81 | 65,536 | Shard 41 first batch |
| 41,100 | 2.18 | 1.47 | 65,536 | Grad norm first spike |
| 41,150 | 2.61 | 3.82 | 65,536 | Loss rising |
| 41,200 | 4.40 | 11.9 | 32,768 | Loss-scale halved (overflow detected) |
| 41,300 | 9.84 | 38.4 | 4,096 | Loss-scale cascading down |
| 41,380 | NaN | NaN | — | Run crashed |
Additional facts your colleague surfaced before paging you:
- LR at step 41,100 is ~3.8×10⁻⁴ — expected for this schedule, not anomalously high.
- No code changes, no config changes, no hardware swaps between step 40,000 and 41,380.
- GPU memory utilization is flat at 91% throughout — no OOM pressure before the crash.
- The loss-scale counter starts halving at step 41,200, indicating bf16 overflow in the backward pass.
- Shard 41 was produced by a different preprocessing worker pool than shards 0–40.
- Write an ordered hypothesis list (most likely first) BEFORE opening hints.
- Name the three cheapest discriminating probes, in order.
- Give the recovery plan (checkpoints exist every 1,000 steps) and the prevention list.
Hint 1 — what does "grad-norm spikes BEFORE loss" order tell you?
Causality reads off the telemetry order: gradients went pathological first, loss followed. What produces a sudden gradient pathology mid-run when LR hasn't changed — and what coincided at ~41k?
Hint 2 — the shard boundary
Shard 41's first batch landed at ~41,050 (see the telemetry table) and came from a different preprocessing pool. What does a corrupted/undeduplicated/mis-tokenized shard do to gradients, and how would you check WITHOUT reading 40GB by eye?
✅ Model solution
Ordered hypotheses. (1) Bad data shard — boundary coincidence + grad-first signature + different preprocessing pool: prior ~70%. (2) bf16 numeric instability amplified by the shard's distribution (the overflow counters halving loss-scale support this as the MECHANISM riding on cause 1). (3) LR-schedule kink at 41k (check, it's cheap — but schedules rarely have step-function changes mid-decay). (4) Hardware (a flaky GPU producing NaNs — would usually show as rank-localized grad anomalies, not synchronized).
Cheapest probes, in order. (1) Resume from the 41k checkpoint (the last one before shard 41's first batch at ~41,050) with shard 41 SKIPPED: one config change; if training sails past 41,200-equivalent tokens, the shard is convicted (deterministic data order makes this experiment possible at all). (2) While that runs: stats on shard 41 vs shard 40 (token-length histogram, repeated-sequence rate, tokenizer round-trip failures, % non-language bytes); a dedup failure or binary contamination shows up in minutes. (3) Per-rank grad-norm breakdown at the spike: synchronized across ranks = data; localized = hardware.
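Probe 2 can be sketched as a small stats function run on a sample from each shard. This is a minimal illustration, assuming the sample fits in memory as a list of token-id lists; `shard_stats` and its output keys are hypothetical, and a real check would add tokenizer round-trip and non-language-byte counts.

```python
from collections import Counter
from statistics import mean


def shard_stats(sequences):
    """Summarise a sample of token-id sequences from one shard."""
    lengths = [len(seq) for seq in sequences]
    counts = Counter(tuple(seq) for seq in sequences)
    # Every copy beyond the first of an identical sequence counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return {
        "n": len(sequences),
        "mean_len": mean(lengths),
        "max_len": max(lengths),
        "dup_rate": duplicates / len(sequences),
    }


# Toy samples standing in for shard 40 (healthy) and shard 41 (suspect).
shard_40 = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
shard_41 = [[1, 2], [1, 2], [1, 2], [3]]
print(shard_stats(shard_40))
print(shard_stats(shard_41))
```

A large jump in `dup_rate` or a shifted length distribution between the two shards is the kind of signal that convicts the preprocessing pool without reading the data by eye.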
Recovery. Resume at 41k with shard skipped (lose ~380 steps relative to the crash at 41,380, a matter of minutes), tighter grad-clip for the next few thousand steps as a belt, and requeue shard 41 for re-preprocessing through the GOOD worker pool, re-inserting later in the schedule. Don't fast-forward the optimizer past corrupted moments: the 41k checkpoint predates contamination (verify by grad-norm history).
Prevention. Data validation gates per shard before admission (token stats, dedup rate, tokenizer round-trip, contamination heuristics — the same checks probe 2 ran, run ALWAYS); grad-norm alerting (page on z-score, don't wait for NaN); spike-skip logic (auto-skip batch + log when grad-norm > k×rolling-median); preprocessing-pool version pinning so "different worker pool" can't silently mean "different code"; loss-scale/overflow-counter dashboards for bf16/fp8 runs.
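The spike-skip rule (skip the batch when grad-norm > k × rolling median) can be sketched as follows. The class name, the default `k=3.0`, the window size, and the warm-up length are all assumptions for illustration; none of them come from the run's configuration.

```python
import math
from collections import deque
from statistics import median


class SpikeGuard:
    """Flag batches whose grad norm exceeds k times the rolling median."""

    def __init__(self, k=3.0, window=100, min_history=3):
        self.k = k
        self.min_history = min_history
        self.history = deque(maxlen=window)

    def should_skip(self, grad_norm: float) -> bool:
        if not math.isfinite(grad_norm):
            return True  # NaN/inf gradients are never applied
        if len(self.history) >= self.min_history:
            if grad_norm > self.k * median(self.history):
                return True  # spike: skip and keep it out of the baseline
        self.history.append(grad_norm)
        return False


# Grad norms from the telemetry table, steps 40,800 through 41,200.
guard = SpikeGuard()
decisions = [guard.should_skip(g) for g in [0.82, 0.79, 0.81, 1.47, 3.82, 11.9]]
print(decisions)
```

With these assumed settings the guard first fires on the 3.82 reading at step 41,150, one row before the loss-scale halving in the table. Skipped norms are deliberately kept out of the history so a run of bad batches cannot drag the median up and disarm the guard.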
Common wrong answers. "Lower the LR and restart from scratch" (burns the run, no diagnosis); "switch to fp32" (10× cost to mask a data bug); "it recovered after NaN so continue" (Adam moments are corrupted; the run is silently damaged).
- Telemetry ORDER is causal evidence: grad-norm before loss = data or numerics, not the objective.
- Deterministic data order is what makes the skip-shard experiment possible — determinism is an ops feature.
- Checkpoint-resume experiments are the cheapest, highest-information probes in training debugging.
| Level | Answer signature |
|---|---|
| Junior | "Lower the learning rate / add gradient clipping" without diagnosis. |
| Senior | Reads the grad-first signature, suspects the shard, designs the skip-shard resume probe and shard stats, recovers from 41k. |
| Staff | Adds the per-rank hardware discrimination, the loss-scale-counter read for the bf16 mechanism, the prevention pipeline (gates, alerts, version pinning), and articulates why post-NaN continuation is unsafe (poisoned optimizer state). |
You ship a new ranker that beats the champion by +1.5 percentage points of AUC in every offline slice. Five percent of traffic goes to it. A day later the war room opens: CTR is down 2 %, time-spent is flat. This chapter makes that scenario a solvable puzzle, not a fire drill. It introduces the canonical five-cause taxonomy for offline/online divergence, drills the evidence-mapping habit, and designs the experiment that tells the causes apart.
The following facts are on the table when you enter the room. Read them carefully before you form any hypothesis.
| Signal | Control (old ranker) | Treatment (new ranker) | Delta |
|---|---|---|---|
| Offline AUC (held-out set) | 0.782 | 0.797 | +1.5 pp |
| Online CTR (A/B, 24 h) | 4.10 % | 4.02 % | −2.0 % |
| Time-spent per session | 8.2 min | 8.2 min | 0 % |
| Calibration at score > 0.8 | predicted ≈ actual | predicted 1.4× actual | overconfident |
| p99 serving latency | 38 ms | 39 ms | +1 ms (noise) |
| Feature logging version | v12 | v13 (same release) | changed |
What changed in this release (single deploy, 14:00 yesterday):
- New model weights (retrained on 6 weeks of logged data).
- Feature logging schema bumped from v12 → v13: three categorical features renamed, one numeric feature rescaled ÷10.
- No A/B infrastructure change; split is deterministic by user-id hash.
Calibration detail. The calibration plot below summarises the gap. Predicted scores cluster in the 0.80–0.95 bucket at 1.4× the rate of actual clicks. Below 0.5, calibration is nearly perfect.
- Enumerate the five classic causes of offline/online divergence from memory.
- Fit each against THIS evidence (calibration plot + simultaneous logging change) — which two survive?
- Design the single cheapest experiment that discriminates between the survivors.
Hint 1 — why does the same-release logging change matter so much?
The new ranker was trained on features logged by the OLD pipeline, but serves on features computed by the NEW one. What's that called, and what would it do to score distributions?
Hint 2 — what does overconfidence at HIGH scores break downstream?
Scores don't just rank — they feed a value formula and thresholds. If the top decile's probabilities run hot, what happens to the blend with other objectives and to any score-gated behavior?
✅ Model solution
The taxonomy (recite it, always). (1) Training-serving skew — features differ between train and serve paths. (2) Offline-eval leakage/exposure bias — labels exist only for what the OLD policy showed, so offline replay rewards imitating it. (3) Calibration shift breaking downstream consumers — combination formulas and thresholds need probabilities, AUC doesn't. (4) Position-bias mismatch between replay and live. (5) Metric mismatch — AUC is ranking-only; CTR is absolute behavior under an ecosystem.
Fit to evidence. The logging change in the same release makes (1) the prime suspect — the model trains on old-pipeline features and serves on new-pipeline ones; even small definition drift (null handling, window boundaries) shifts the score distribution… which is also exactly what an overconfident-at-the-top calibration plot shows, implicating (3) as the damage mechanism: hot top-decile probabilities over-weight this model's head in the value formula and push wrong items past thresholds. (2)/(4) can't be excluded by this evidence but don't explain the calibration signature; (5) is always true but doesn't explain a REGRESSION.
The discriminating experiment. Dual-scoring on live traffic: score the SAME requests with the new model fed by (a) old-pipeline features (reconstructed/logged) and (b) new-pipeline features. If score distributions diverge between (a) and (b), skew is convicted, and the diff localizes WHICH features moved (compare per-feature distributions where the two pipelines disagree). One day of shadow compute, no user impact. In parallel: recalibrate (isotonic on recent live data) and re-read the value-formula output distribution — if recalibration alone restores sane blending in offline simulation, (3) was carrying the regression.
Fix & prevention. Fix: align the feature definitions (or retrain on new-pipeline logged features), recalibrate per head, re-canary. Prevention: logged-at-scoring features as the contract (the change couldn't have diverged train from serve), feature parity tests in CI between pipeline versions, calibration as a launch gate (predicted/observed ratio per decile must sit in bounds), and never shipping a model and a feature-pipeline change in the same release — separate the variables.
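The calibration launch gate can be sketched as a predicted/observed ratio per score bucket. This is a simplified illustration: it uses equal-width score buckets rather than true quantile deciles, and the gate bounds (0.8 to 1.25) are placeholder values, not thresholds from the source.

```python
def calibration_ratios(scores, labels, n_buckets=10):
    """Predicted/observed click rate per equal-width score bucket."""
    sums = [[0.0, 0, 0] for _ in range(n_buckets)]  # [score_sum, clicks, count]
    for s, y in zip(scores, labels):
        b = min(int(s * n_buckets), n_buckets - 1)
        sums[b][0] += s
        sums[b][1] += y
        sums[b][2] += 1
    ratios = {}
    for b, (score_sum, clicks, count) in enumerate(sums):
        if count == 0 or clicks == 0:
            continue  # no ratio can be formed; a real gate should flag this too
        ratios[b] = (score_sum / count) / (clicks / count)
    return ratios


def passes_gate(ratios, low=0.8, high=1.25):
    """True when every populated bucket sits inside the allowed ratio band."""
    return all(low <= r <= high for r in ratios.values())


# Toy data reproducing the incident's signature: scores of 0.9 that click 9 times in 14.
hot_scores = [0.9] * 14
hot_labels = [1] * 9 + [0] * 5
hot_ratios = calibration_ratios(hot_scores, hot_labels)
print(hot_ratios, passes_gate(hot_ratios))
```

The toy top bucket comes out at roughly 1.4× predicted over observed, the same overconfidence as in the evidence table, and the gate rejects it.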
Common wrong answers. "The model overfit" (doesn't explain the release coupling), "roll back the model only" (if the pipeline changed, the OLD model now skews too — roll back as a PAIR or fix forward consciously), "AUC went up so the model is better, the metric is wrong" (the value formula is the product; calibration is part of the model's job).
- The five-cause taxonomy is the expected recitation — lead with it.
- Same-release confounds are convicted by dual-scoring the same traffic both ways.
- Calibration is a launch gate because scores are added and thresholded, not just sorted.
- Never bundle model and feature-pipeline changes in one release.
| Level | Answer signature |
|---|---|
| Junior | "Offline metrics don't always transfer" + retrain suggestion. |
| Senior | Recites the taxonomy, fits evidence to skew+calibration, designs the dual-scoring discrimination, names recalibration. |
| Staff | Adds the rollback-as-a-pair insight, logged-at-scoring as the structural prevention, calibration gates with per-decile bounds, and the release-hygiene rule (one variable per launch). |
Your 13B chat service (GQA, 40 layers, 8 KV heads, head_dim 128, fp16 KV) ran beautifully at a 2k-token context cap. Product launched 32k-context document upload yesterday. Since then: throughput per GPU down ~70%, OOM kills roughly hourly, TTFT fine for short prompts but p99 TPOT is awful for everyone. Nothing else changed.
| Metric | Before (2k cap) | After (32k cap) |
|---|---|---|
| Output tokens/sec/GPU | 2,400 | 710 |
| Mean concurrent sequences/GPU | 38 | 9 |
| KV-cache occupancy | 62% | 97% (sawtooth, OOM kills hourly) |
| p99 TPOT (short-prompt users) | 48ms | 210ms |
| Long-context requests share | — | 7% of traffic |
- Explain WHY 7% long-context traffic destroyed 70% of throughput — with the memory arithmetic.
- Explain the p99 TPOT damage to SHORT-prompt users specifically.
- Give same-day mitigations, then the two-quarter redesign, then the conversation to have with product.
Hint 1 — compute per-token KV bytes for this model
KV per token = 2 (K and V) × layers × kv_heads × head_dim × bytes. Then multiply by 2k vs 32k and compare against the batch the GPU used to hold.
Hint 2 — what does a 32k prefill do to everyone else's decode?
A 32k-token prefill is a multi-second compute burst. If it shares an engine iteration with 30 decoding sequences, what happens to their inter-token interval?
✅ Model solution
The memory math. Per token: 2 × 40 × 8 × 128 × 2B = 164KB. At 2k context: ≤0.33GB per sequence — 38 concurrent sequences ≈ 12.5GB of KV, comfortable. At 32k: up to 5.2GB per sequence — 16×. A handful of long sessions (7% of traffic, but they LINGER — long documents mean long conversations) eat the KV pool: occupancy pins at 97%, admission stalls, concurrency collapses 38→9, and since decode throughput scales with batch, tokens/sec falls proportionally (2,400→710 ≈ the concurrency ratio). The hourly OOMs are reservation overflow on max-length sequences colliding at peak.
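The memory math above can be reproduced in a few lines. A minimal sketch using the model shape from the prompt (40 layers, 8 KV heads, head_dim 128, fp16) and treating "2k" and "32k" as 2,000 and 32,000 tokens, which is what the rounded figures in the text correspond to; the function name is illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """K and V vectors for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


per_token = kv_bytes_per_token(layers=40, kv_heads=8, head_dim=128, bytes_per_elem=2)
short_seq = per_token * 2_000    # one sequence at the old cap
long_seq = per_token * 32_000    # one sequence at the new cap
old_batch = 38 * short_seq       # the concurrency the GPU used to hold

print(f"per token: {per_token / 1e3:.0f} KB")
print(f"2k sequence: {short_seq / 1e9:.2f} GB, 32k sequence: {long_seq / 1e9:.1f} GB")
print(f"38 short sequences: {old_batch / 1e9:.1f} GB")
print(f"one long sequence displaces {long_seq // short_seq} short ones")
```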
The p99 story. Two mechanisms: (1) prefill interference — a 32k prefill is seconds of compute-bound work; every decode sequence co-scheduled with it stalls, so short-prompt users' TPOT spikes whenever any long request arrives (tail-shaped harm, mean barely moves); (2) KV pressure evictions/admission queuing add jitter. The signature "TTFT fine, TPOT awful" says decode is the victim, not prompt processing capacity.
Same-day mitigations. Cap effective context server-side (truncate middle, keep head+tail) below the product cap while you fix; admission-control long requests into a concurrency-limited lane (max 1-2 concurrent long sequences per GPU); enable chunked prefill (512-token chunks interleaved with decode — kills the TPOT spikes); quantize KV to fp8 (halves the 164KB/token, doubling effective pool); turn on/verify paged KV (no max-length reservations → OOMs stop, fragmentation waste <5%).
The redesign. Separate pools: a long-context pool (few, KV-heavy, possibly bigger-HBM GPUs) and the existing short-context fleet — route by prompt length; or full prefill/decode disaggregation if long-context traffic grows. Prefix caching for re-queried documents (the same uploaded doc across turns re-uses its KV — huge for doc-chat). Session TTLs and idle-KV eviction with recompute-on-return. Capacity model updated: price a request by context-tokens-held×time, not request count.
The product conversation. 32k context costs ~16× the KV residency of 2k: either it's priced/tiered (premium feature, stricter rate limits), or capacity grows accordingly — show the \$/request delta. Also surface that 90% of uploaded docs fit retrieval: RAG-style retrieve-then-answer over the doc costs a fraction of full-context stuffing for most queries — context length is a product decision wearing an infra costume.
Common wrong answers. "Add GPUs" (linear cost for a problem with 16× structure — fix utilization first); "lower max_tokens" (output length isn't the driver, context residency is); "the model is too slow, distill it" (weights didn't change; the cache did).
- KV per token = 2·layers·kv_heads·head_dim·bytes — memorize the formula, derive the incident.
- Long-context harm is twofold: KV residency (batch collapse) and prefill interference (tail latency).
- Mitigation ladder: caps → chunked prefill → fp8 KV → paged attention → pool separation → disaggregation.
- Context length is a pricing/product decision; bring the per-request cost delta to that meeting.
| Level | Answer signature |
|---|---|
| Junior | "Long prompts are slower; add GPUs or shrink the model." |
| Senior | Does the 164KB/token → 16× math, separates the two mechanisms (residency vs interference), proposes chunked prefill + paged KV + caps. |
| Staff | Adds the lane/pool architecture with routing by length, prefix caching for doc-chat, the pricing conversation with per-request economics, and the RAG-instead-of-stuffing product reframe. |
Capacity estimation is the first thing an interviewer asks when sizing any new system — and the first thing on-call engineers reach for at 3 am. This chapter gives you eight drills covering GPUs, memory, network, storage, and cost. Each prompt stands alone; attempt it before opening the solution. The universal recipe and numbers-to-memorize table at the end make every future estimate faster.
Protocol: read the prompt, open a blank doc, write your answer (even rough numbers), then open the solution.
- Set a 10-minute timer per drill.
- Write every intermediate number down — interviewers watch the reasoning, not just the answer.
- After reading the solution, compare where your estimate diverged, not just the final number.
Never: open the solution first and convince yourself you would have got there. You would not.
Prompt: Your ranking model costs 2 GFLOP per request (measured). Your serving hardware is A100 80GB GPUs. Peak load is 5,000 requests/second. Target GPU utilization is 60% (leave headroom for spikes). How many GPUs do you need?
Attempt it now. Then open the solution.
Hint — what do you need to look up?
An A100 delivers roughly 312 TFLOPS of dense FP16 tensor-core throughput (FP32 without tensor cores is only ~19.5 TFLOPS; TF32 is ~156 TFLOPS). Ranking models are typically FP16. Also note: "GFLOP per request" is a one-shot cost, so multiply by QPS to get FLOP/s demand.
✅ Model solution
Step 1 — compute demand.
5,000 req/s × 2 × 10⁹ FLOP/req = 10 × 10¹² FLOP/s = 10 TFLOP/s
Step 2 — effective GPU capacity.
312 TFLOP/s × 0.60 utilization = 187 TFLOP/s usable per GPU
Step 3 — divide.
10 TFLOP/s ÷ 187 TFLOP/s = 0.053 GPUs
Step 4 — sanity check. Less than one GPU? Yes — 2 GFLOP/req is a tiny model (think a two-tower dot-product ranker or a shallow MLP). A single A100 could serve ~93,000 req/s at 60% util. Round up to 1 GPU minimum; in practice deploy at least 2 for redundancy.
Staff+ add: This ignores batching overhead and memory bandwidth limits. Real serving is often memory-bandwidth-bound, not compute-bound, for small-batch inference. Measure actual throughput with nsight or vLLM benchmarks before committing to a fleet size.
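The four steps reduce to one function. A minimal sketch with the drill's numbers; `gpus_needed` is an illustrative name, and the result is a compute-only bound that ignores the memory-bandwidth caveat above.

```python
import math


def gpus_needed(flop_per_req: float, qps: float, peak_flops: float, target_util: float):
    """Return (fractional GPUs, whole GPUs) for a compute-bound estimate."""
    demand = flop_per_req * qps          # FLOP/s the traffic requires
    usable = peak_flops * target_util    # FLOP/s one GPU provides at the target utilization
    fraction = demand / usable
    return fraction, max(1, math.ceil(fraction))


fraction, whole = gpus_needed(flop_per_req=2e9, qps=5_000, peak_flops=312e12, target_util=0.60)
print(f"{fraction:.3f} GPUs of compute, deploy at least {whole}")
```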
- Demand (FLOP/s) = FLOP/req × QPS. Then divide by effective GPU FLOP/s.
- A100 FP16 peak ≈ 312 TFLOP/s; MFU for real workloads is 40–60%.
- A "2 GFLOP" model at 5k QPS is trivially small — sense-check every answer.
Prompt: You are serving a 13B-parameter model with 40 transformer layers, 40 attention heads, head dimension 128, stored in FP16. You want to hold 64 concurrent sessions, each with up to 8,192 tokens of context. How much GPU memory is consumed by the KV cache alone?
Hint — the KV-cache formula
Each token stores a K vector and a V vector per layer. Size per token = 2 (K+V) × num_layers × num_heads × head_dim × bytes_per_element.
✅ Model solution
Step 1 — bytes per token per session.
2 × 40 layers × 40 heads × 128 dim × 2 bytes = 2 × 40 × 40 × 128 × 2
= 2 × 409,600 = 819,200 bytes ≈ 0.8 MB per token
Step 2 — per session (8,192 tokens).
0.8 MB × 8,192 = 6,553 MB ≈ 6.4 GB per session
Step 3 — total for 64 sessions.
6.4 GB × 64 = 409 GB
Step 4 — sanity check. Model weights alone ≈ 13B × 2 bytes = 26 GB. The KV cache for 64 sessions at 8k tokens dwarfs the weights by 15×. On a single A100 (80 GB), this is impossible — you can hold weights + maybe 6–7 sessions. For 64 sessions you need ≥6 A100s dedicated to KV, or you use paged KV eviction (vLLM-style) to multiplex.
Staff+ add: With Grouped Query Attention (GQA, 8 KV heads instead of 40), the per-token KV drops 5× to ~0.16 MB — 64 sessions costs ~82 GB, feasible on 2 A100s. GQA is why Llama-2 70B is deployable.
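The MHA-versus-GQA comparison can be checked exactly. A minimal sketch with the drill's shape parameters; the function name is illustrative. The exact total is 400 GiB (about 429 GB in decimal units); the 409 GB in the worked solution comes from rounding the per-token size up to 0.8 MB and then dividing by 1,024, and the conclusions do not change.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_elem, tokens, sessions):
    """Total KV-cache bytes for `sessions` sequences of `tokens` tokens each."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens * sessions


GIB = 2**30
mha = kv_cache_bytes(40, 40, 128, 2, tokens=8_192, sessions=64)  # 40 KV heads
gqa = kv_cache_bytes(40, 8, 128, 2, tokens=8_192, sessions=64)   # 8 KV heads

print(f"MHA: {mha / GIB:.0f} GiB, GQA: {gqa / GIB:.0f} GiB, reduction: {mha // gqa}x")
```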
- KV per token = 2 × L × H × d_h × 2 bytes (FP16). For 13B this is ~0.8 MB.
- Long context + large batch = KV memory dominates, not weights.
- GQA/MQA reduces KV by 4–8× — know this lever.
Prompt: You have 500 million user IDs, each represented by a 128-dimensional FP16 embedding. (a) How much total RAM does the table require? (b) You plan to shard it across servers with 64 GB RAM each (leaving 16 GB for the OS/other). How many shards? (c) What happens to lookup latency as you add shards?
Hint
FP16 = 2 bytes. Total size = num_rows × embedding_dim × bytes_per_element. Available RAM per shard = 64 − 16 = 48 GB.
✅ Model solution
(a) Table size.
500 × 10⁶ rows × 128 dims × 2 bytes = 128 × 10⁹ bytes = 128 GB
(b) Number of shards.
Available RAM/shard = 64 − 16 = 48 GB
Shards needed = ⌈128 ÷ 48⌉ = ⌈2.67⌉ = 3 shards
(c) Latency implication. Every request that needs user embeddings must fan out to all shards that hold its user IDs. With hash-based sharding: a single-user request hits exactly 1 shard (O(1) fan-out). A batch of B users hits up to min(B, S) shards — at B=100, all 3. The critical cost is the tail latency across shards: you wait for the slowest one. With 3 shards the overhead is modest; at 50+ shards it dominates.
Staff+ add: In practice embedding tables use FP32 at training time but FP16 or even INT8 at serving. INT8 halves the table to 64 GB → 2 shards. Also consider replication (read replicas per shard) vs. sharding for read-heavy workloads.
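Parts (a) and (b), plus the INT8 variant, as a short sketch. The function name is illustrative and sizes are in decimal GB to match the worked solution.

```python
import math


def embedding_shards(rows: int, dim: int, bytes_per_elem: int, usable_gb_per_shard: float):
    """Return (table size in GB, number of shards needed)."""
    table_gb = rows * dim * bytes_per_elem / 1e9
    return table_gb, math.ceil(table_gb / usable_gb_per_shard)


fp16_gb, fp16_shards = embedding_shards(500_000_000, 128, 2, usable_gb_per_shard=48)
int8_gb, int8_shards = embedding_shards(500_000_000, 128, 1, usable_gb_per_shard=48)
print(f"FP16: {fp16_gb:.0f} GB in {fp16_shards} shards")
print(f"INT8: {int8_gb:.0f} GB in {int8_shards} shards")
```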
- Embedding table = rows × dim × bytes. 500M × 128 × 2 = 128 GB.
- Sharding introduces fan-out; latency = max(shard latencies), not sum.
- INT8 quantization halves serving memory relative to FP16; the quality loss on ID embeddings is usually small, but measure it before shipping.
Prompt: Your event bus must ingest 2,000,000 events per second, each event 1 KB. A single Kafka partition on your hardware can sustain 50 MB/s write throughput reliably. You want a 20% safety margin. How many partitions do you need?
Hint
Convert events/s to MB/s first, then account for the safety margin before dividing.
✅ Model solution
Step 1 — throughput in MB/s.
2,000,000 events/s × 1 KB/event = 2,000,000 KB/s = 1,953 MB/s ≈ 2,000 MB/s
Step 2 — apply safety margin.
Effective partition capacity = 50 MB/s × (1 − 0.20) = 40 MB/s
Step 3 — partition count.
⌈2,000 ÷ 40⌉ = ⌈50⌉ = 50 partitions
Step 4 — sanity checks.
• Consumer side: if you have 50 partitions and each consumer thread reads one partition, you need ≥50 consumer threads total.
• Replication: with replication factor 3, each broker handles 50 × 3 / num_brokers writes. Size your broker cluster accordingly.
• Partition count is hard to change later: Kafka lets you add partitions to a topic but never remove them, and adding partitions changes the key-to-partition mapping. Over-provision slightly; 64 or 128 (powers of 2) are conventional round numbers.
Staff+ add: Partition count also determines maximum parallelism for consumers. If downstream processing is the bottleneck (not ingest), you may want more partitions even if 50 suffices for throughput. Adding partitions later is possible but disruptive for keyed data, so set this right at topic creation.
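The partition formula as a sketch, treating 1 KB as 1,000 bytes so that the ingest rate comes out at the rounded 2,000 MB/s used in the solution. The function name is illustrative.

```python
import math


def partitions_needed(events_per_s: int, event_bytes: int, partition_mb_s: float, margin: float) -> int:
    """Partitions required for ingest throughput with a safety margin."""
    total_mb_s = events_per_s * event_bytes / 1e6
    effective_mb_s = partition_mb_s * (1 - margin)  # derated per-partition capacity
    return math.ceil(total_mb_s / effective_mb_s)


needed = partitions_needed(events_per_s=2_000_000, event_bytes=1_000, partition_mb_s=50, margin=0.20)
print(needed)
```

This is the throughput floor only; consumer parallelism may justify a higher number.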
- Partitions = ⌈total_MB/s ÷ (partition_MB/s × (1 − margin))⌉.
- Partition count caps consumer parallelism — size for the bottleneck.
- Pick powers of 2 for easy key-hashing and future expansion.
Prompt: You want to train a 7-billion-parameter transformer on 1 trillion tokens using 256 × H100 GPUs. H100 SXM delivers roughly 1,000 TFLOP/s FP8 (or ~500 TFLOP/s BF16). Assume BF16 training and a Model FLOP Utilization (MFU) of 45%. Use the 6ND approximation (6 × num_params × num_tokens FLOPs for a forward+backward pass). How long does training take?
Hint — the 6ND rule
Total FLOPs ≈ 6 × N × D where N = parameters and D = tokens. The factor 6 accounts for forward (2ND) plus backward (4ND — two passes for gradients).
✅ Model solution
Step 1 — total FLOPs.
6 × 7 × 10⁹ × 10¹² = 6 × 7 × 10²¹ = 4.2 × 10²² FLOPs
Step 2 — effective cluster throughput.
256 GPUs × 500 × 10¹² FLOP/s × 0.45 MFU = 256 × 225 × 10¹² = 57,600 TFLOP/s = 5.76 × 10¹⁶ FLOP/s
Step 3 — training time in seconds.
4.2 × 10²² ÷ 5.76 × 10¹⁶ = 729,167 seconds ≈ 8.4 days
Step 4 — sanity check. Llama-2 7B was trained on 2T tokens with more GPUs; 8 days for 1T tokens on 256 H100s is plausible. Real runs add ~10–15% for checkpoint overhead, data loading stalls, and hardware failures — budget ~10 days.
Staff+ add: MFU of 45% is optimistic for 256 GPUs across nodes; inter-node all-reduce on 400 Gb/s InfiniBand costs real time. At 1024 GPUs MFU often drops to 35–38%. Always measure MFU on a short pilot run before committing the full budget.
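The 6ND estimate as a function, so the MFU and peak-throughput assumptions are explicit inputs. A minimal sketch using the drill's stated figures; the function name is illustrative.

```python
def training_days(params: float, tokens: float, gpus: int, peak_flops: float, mfu: float) -> float:
    """Wall-clock days from the 6ND approximation, ignoring restarts and stalls."""
    total_flops = 6 * params * tokens
    cluster_flops_per_s = gpus * peak_flops * mfu
    return total_flops / cluster_flops_per_s / 86_400


days = training_days(params=7e9, tokens=1e12, gpus=256, peak_flops=500e12, mfu=0.45)
print(f"{days:.1f} days")

# Sensitivity to the MFU assumption, which dominates the uncertainty.
for mfu in (0.35, 0.45, 0.50):
    print(mfu, round(training_days(7e9, 1e12, 256, 500e12, mfu), 1))
```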
- Total FLOPs = 6ND. For 7B on 1T tokens: 4.2 × 10²² FLOPs.
- H100 BF16 ≈ 500 TFLOP/s peak; realistic MFU 35–50%.
- Time = FLOPs ÷ (GPUs × peak × MFU). Always state your MFU assumption.
Prompt: You have 100 million vectors, each 768 dimensions, originally in FP32. You want to build an ANN (Approximate Nearest Neighbor) index using Product Quantization (PQ) with 96 sub-quantizers and 256 centroids per sub-quantizer (PQ96×8 — "8" means 8 bits per code). (a) Raw FP32 size of the corpus? (b) PQ-compressed size? (c) Compression ratio?
Hint — how PQ works
PQ splits each 768-d vector into 96 sub-vectors of 8 dimensions each, then replaces each sub-vector with the index of the nearest centroid (0–255). Each centroid index fits in 1 byte. So the compressed code for one vector = 96 bytes.
✅ Model solution
(a) Raw FP32 size.
100 × 10⁶ vectors × 768 dims × 4 bytes = 307 GB
(b) PQ-compressed size.
Each vector → 96 sub-quantizer codes × 1 byte = 96 bytes/vector
100 × 10⁶ × 96 bytes = 9.6 × 10⁹ bytes = 9.6 GB
Plus codebook storage: 96 sub-quantizers × 256 centroids × 8 dims × 4 bytes = 96 × 256 × 32 bytes = 786,432 bytes ≈ 0.8 MB (negligible)
(c) Compression ratio.
307 GB ÷ 9.6 GB = 32× (equivalently, 3,072 bytes ÷ 96 bytes per vector)
Sanity check and tradeoffs. 9.6 GB fits comfortably in one server's RAM (even a commodity box). The cost is recall loss: PQ introduces quantization error, so ANN recall@10 drops from 100% (flat exact search) to ~90–95% depending on data distribution. You can recover recall by re-ranking the top-K PQ candidates with exact FP32 distances.
Staff+ add: FAISS IVF-PQ adds an inverted file index (centroid-based coarse partitioning) to prune search to ~1% of the corpus. Combined with PQ, index RAM stays around 10 GB while search time scales sub-linearly. HNSW is an alternative: much higher RAM, since a plain HNSW index keeps the full vectors (307 GB here in FP32) plus graph links unless you also compress them, but better recall and no training step.
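The PQ sizing generalises to any sub-quantizer count. A minimal sketch in decimal units; the function name is illustrative. In bytes, the ratio for this configuration is exactly 3,072 ÷ 96 = 32.

```python
def pq_sizes(n: int, dim: int, m: int, bits: int = 8, raw_bytes: int = 4):
    """Return (raw bytes, PQ code bytes, codebook bytes, compression ratio)."""
    raw = n * dim * raw_bytes
    code_bytes_per_vector = m * bits // 8               # one code per sub-quantizer
    codes = n * code_bytes_per_vector
    codebook = m * (2**bits) * (dim // m) * raw_bytes   # m tables of 2^bits centroids
    return raw, codes, codebook, raw / codes


raw, codes, codebook, ratio = pq_sizes(n=100_000_000, dim=768, m=96)
print(f"raw {raw / 1e9:.0f} GB, codes {codes / 1e9:.1f} GB, codebook {codebook} bytes, {ratio:.0f}x")
```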
- Raw FP32 corpus = N × D × 4. For 100M × 768: 307 GB.
- PQ code = num_sub_quantizers bytes/vector. PQ96: 96 bytes → 32× compression.
- Compression trades recall for RAM. Always benchmark recall@K before shipping.
Prompt: You run a serving fleet of 100 replicas, each with 8 × A100 80GB GPUs. Cloud on-demand price for an 8-GPU A100 instance is \$32/hour (roughly the AWS p4d.24xlarge rate; note that p4d carries 40 GB A100s and the 80 GB p4de variant is priced higher). You run 24 hours/day. (a) Daily cost? (b) Monthly cost? (c) What is the cost per 1M tokens if the fleet serves 50M tokens/hour?
Hint
Daily cost = replicas × hourly_rate × 24. Cost per token = daily_cost ÷ daily_tokens.
✅ Model solution
(a) Daily cost.
100 replicas × \$32/hr × 24 hr = \$76,800/day
(b) Monthly cost.
\$76,800 × 30 = \$2,304,000/month ≈ \$2.3M/month
(c) Cost per 1M tokens.
Total tokens/day = 50M tokens/hr × 24 hr = 1,200M = 1.2B tokens/day
Cost per token = \$76,800 ÷ 1,200,000,000 = \$0.000064/token
Cost per 1M tokens = \$0.000064 × 10⁶ = \$64 per 1M tokens
Sanity check. GPT-4 API was priced at ~\$30/1M tokens (output) in 2024; \$64 is high but plausible for a self-hosted large model without optimization. With reserved instances (1-year commitment) the hourly rate drops ~40% → ~\$38/1M tokens. Speculative decoding or distillation to a smaller model can halve serving cost.
Staff+ add: On-demand pricing is worst-case. Production fleets typically use: (1) reserved instances for baseline load (40% discount), (2) spot instances for burst (70% discount, but must handle interruptions), (3) batching improvements to raise token throughput per GPU. A mature serving team targets \$5–15/1M tokens for a 70B-class model.
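The unit economics as a sketch, with the discount applied as a parameter so the reserved-instance figure can be reproduced. Function and parameter names are illustrative; the 40% discount is the figure quoted above, not a published price.

```python
def daily_cost(replicas: int, hourly_rate: float, hours: int = 24) -> float:
    """Fleet cost per day in dollars."""
    return replicas * hourly_rate * hours


def cost_per_million_tokens(replicas: int, hourly_rate: float, tokens_per_hour: float,
                            discount: float = 0.0) -> float:
    """Dollars per 1M tokens at a given fleet size, rate, and throughput."""
    effective_rate = hourly_rate * (1 - discount)
    return replicas * effective_rate / tokens_per_hour * 1e6


print(daily_cost(100, 32))
print(round(cost_per_million_tokens(100, 32, 50e6), 2))
print(round(cost_per_million_tokens(100, 32, 50e6, discount=0.40), 2))
```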
- 8×A100 instance ≈ \$32/hr on-demand; 100 replicas → \$76.8k/day.
- Cost/token = total_spend ÷ total_tokens. Always compute this for LLM products.
- Reserved + spot + batching is the standard cost-reduction trifecta.
Prompt: During distributed training, you need to allreduce gradient tensors totalling 14 GB across 256 GPUs connected via 400 Gb/s InfiniBand (per-link). A ring-allreduce transfers 2 × (N−1)/N × data bytes per GPU (≈ 2× data for large N). How long does one allreduce take, ignoring latency? How does this compare to a 10-ms compute step?
Hint — ring-allreduce math
In ring-allreduce, each GPU sends and receives (N−1)/N × data in reduce-scatter, then (N−1)/N × data in allgather. Total traffic per link ≈ 2 × data (for large N). Time = (2 × data) ÷ bandwidth.
✅ Model solution
Step 1 — convert bandwidth.
400 Gb/s = 400 ÷ 8 = 50 GB/s per link
Step 2 — traffic per GPU (ring-allreduce, large N).
≈ 2 × 14 GB = 28 GB per GPU
Step 3 — time.
28 GB ÷ 50 GB/s = 0.56 seconds
Step 4 — compare to compute step.
0.56 s ÷ 0.010 s = 56× longer than one compute step
Why this matters. This is why naive data parallelism at large scale is cripplingly slow. Gradient compression (1-bit SGD, TopK sparsification) or 16-bit instead of 32-bit gradients can cut the volume 2–4×. More importantly, production frameworks split gradients into buckets and start each bucket's allreduce while the backward pass is still computing the remaining layers, and gradient accumulation amortises one allreduce over several micro-batches. Pipeline or tensor parallelism also shrinks the gradient each data-parallel group has to reduce. Together these hide most of the communication cost.
Staff+ add: 400 Gb/s InfiniBand per link doesn't mean per GPU — check whether it's the switch uplink or the NIC bandwidth. NVLink within a node (600 GB/s bidirectional) is far faster than inter-node IB. Topology matters: fat-tree vs. dragonfly affects allreduce behavior at 256+ nodes. Communication overhead is a main reason large-scale training MFU typically lands at 35–45%, not 80%.
- Ring-allreduce traffic per GPU ≈ 2 × gradient_size. Time = traffic ÷ link_BW.
- 400 Gb/s IB = 50 GB/s. 14 GB grads → 0.56 s — huge vs. a compute step.
- Overlapping communication with compute (bucketed allreduce during the backward pass, plus gradient accumulation) is essential at scale.
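The allreduce estimate above can be reproduced with a few lines. This is a bandwidth-only sketch under the drill's stated assumptions (latency ignored, one 400 Gb/s link per GPU); the function name `ring_allreduce_seconds` is illustrative.

```python
def ring_allreduce_seconds(data_gb, num_gpus, link_gbit_per_s):
    """Bandwidth-only time for one ring-allreduce, ignoring latency.

    Each GPU moves 2 * (N - 1) / N * data: reduce-scatter plus allgather.
    """
    link_gb_per_s = link_gbit_per_s / 8  # bits to bytes
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * data_gb
    return traffic_gb / link_gb_per_s


t = ring_allreduce_seconds(data_gb=14, num_gpus=256, link_gbit_per_s=400)
print(round(t, 3))          # about 0.558 s, matching the rounded 0.56 s above
print(round(t / 0.010, 1))  # ratio to a 10-ms compute step

# Halving gradient precision halves the payload and therefore the time.
t_half = ring_allreduce_seconds(data_gb=7, num_gpus=256, link_gbit_per_s=400)
print(round(t_half, 3))
```

Using the exact (N−1)/N factor instead of the large-N approximation changes the answer by well under 1% at N = 256, which is why the solution rounds it to 2×.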
Six pager-style scenarios, each with a 15-minute clock. Read the page, set a timer, say your actions out loud in order, then open the model answer. The grading axis is not whether you find root cause — it is whether your first moves are cheap, reversible, and blast-radius-shrinking. Interviewers run these to separate people who have actually carried a pager from people who have only read about it.
Trigger: "You're on call, X just fired, what do you do?"
- Stop the bleeding — name the cheapest reversible action (rollback, disable flag, block the promote, freeze the deploy) and state whether you'd pull it now or after one confirming check.
- Confirm it's real and scope the blast radius — one query, two minutes: is a user-facing metric moving, or only an internal alert? Which traffic slice?
- Check "what changed?" — deploys, model promotes, config flips, upstream releases in the last few hours. 80% of incidents are a change, not a drift.
- Communicate — one line in the incident channel: impact, action taken, next check.
- Only then diagnose — root cause is a daytime activity once users are safe.
Never: open a notebook and start exploratory analysis while the system is still degrading. "I'd look at the data" with no mitigation is the canonical junior failure.
03:40, Tuesday. PagerDuty: feature_drift_psi_high on the fraud model — PSI for device_age_days jumped from 0.04 to 0.31 over the last 6h window. You pull up the dashboards: fraud approval rate, chargeback proxy, and score distribution percentiles are all within normal daily bands. The alert has fired 3 times this month; the last two were marked "no action" by other on-calls.
What do you do in the next 15 minutes? Be explicit about what you would and would not page anyone else for, and what you leave for the morning.
✅ Model answer — Simulation 1
Cheap reversible action first: none needed yet — and saying so is the correct call. Nothing user-facing is moving, so the reversible action is "don't touch production at 3am." But you do NOT silence and go back to sleep either; a flat top-line with a drifted input can mean the model has simply stopped using that feature's signal, and fraud losses show up with a multi-week lag (chargebacks). Flat-right-now is not flat.
Minutes 0–5: confirm the alert is computed correctly — is the PSI spike in the live feature values or in the reference window? Check whether device_age_days moved for all traffic or one slice (one country, one app version). A single-slice jump after an app release is usually a logging change, not fraud.
Minutes 5–10: check "what changed": app releases, upstream schema deploys, and whether the feature's null/default rate jumped (a drift alert is often a missing-data alert in disguise — values collapsing to a default of 0 shifts the distribution violently). If null rate spiked, this becomes Simulation 2 and you escalate to the owning team's on-call.
Minutes 10–15: if it's genuinely a population shift with healthy logging: annotate the alert with findings, file a morning ticket to (a) check model performance on the shifted slice once labels arrive, and (b) re-tune the PSI threshold or reference window, because an alert with a 100% "no action" history is training the team to ignore the one that matters.
Why this is correct: the staff-level move is treating alert quality as part of the incident. You neither dismissed it (lagged-label risk) nor took a 3am production action with zero user impact (unjustified risk). Common wrong answers: "retrain immediately" (retraining on a possibly-corrupt feature bakes the problem in) and "mute the alert" (destroys the safety net).
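For readers who have not computed PSI by hand, here is a minimal standard-library sketch of the population stability index for one numeric feature. The quantile-based binning, the bin count of 10, and the `eps` floor for empty bins are assumptions of this sketch; the alert in the scenario may be configured differently.

```python
import math
from bisect import bisect_right


def psi(reference, live, num_bins=10, eps=1e-6):
    """Population stability index between a reference sample and a live sample."""
    ref_sorted = sorted(reference)
    # Interior bin edges come from reference quantiles (assumed binning choice).
    edges = [ref_sorted[int(len(ref_sorted) * i / num_bins)] for i in range(1, num_bins)]

    def bin_fractions(values):
        counts = [0] * num_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))


reference_window = list(range(1000))               # stand-in for historical device_age_days
shifted_window = [x + 300 for x in reference_window]
print(psi(reference_window, reference_window))     # identical samples give 0.0
print(psi(reference_window, shifted_window) > 0.25)
```

Running the same function per traffic slice (country, app version) is one way to answer the "all traffic or one slice?" question from minutes 0–5, and swapping which window is passed as `reference` shows whether the reference itself moved.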
14:10, Thursday. No alert fired. A PM messages you: "Recommendations look generic since lunch?" CTR on the home feed is down 8% over 3 hours. Your service deployed nothing today. Digging in, you notice the user_engagement_7d feature is now null for ~60% of requests; the ranker imputes nulls to 0. The upstream events team shipped a pipeline migration at 11:30 that renamed a field from engagement_score to engagementScore. Their pipeline is "green."
Next 15 minutes. The upstream team says a revert of their migration needs ~2 hours of backfill to be safe. What do you do meanwhile, and in what order?
✅ Model answer — Simulation 2
Cheap reversible action first: you cannot roll back upstream quickly, so mitigate on YOUR side: change the null-handling for this feature from "impute 0" to "impute the population median / last known per-user value," or — often better — flip the ranker to a fallback model or config that excludes the broken feature. Both are config-level, reversible in minutes, and turn a corrupted signal into a merely missing one. A model fed plausible defaults degrades gracefully; a model fed 60% zeros on a heavy-weight feature is actively wrong, because "0 engagement" is a strong negative signal the model learned from real users.
Minutes 0–5: pull the mitigation lever (fallback config), verify CTR on a canary slice recovers direction-wise. Open an incident, severity matching 8% CTR: this is revenue-visible.
Minutes 5–10: coordinate with upstream: get a committed ETA for revert-plus-backfill, and confirm whether the rename hit other consumers (fraud? ads?) — you may be the first to notice a multi-team incident. Blast-radius discovery is on-call work, not politeness.
Minutes 10–15: guard the future: if any training/retraining job consumes this feature, pause scheduled retrains now — three hours of 60%-null data flowing into a training set is how today's incident becomes next week's "model mysteriously worse." Write the timeline in the channel.
Why this is correct: "their pipeline is green" and "your model is fine" can both be true while the contract between them is broken — schema is part of the interface, and nulls imputed to 0 convert a data bug into a model-behavior bug. Staff+ answers add the prevention: schema contracts with CI checks on the producer side, and a null-rate alert per critical feature on the consumer side (you found this via a PM, 3 hours late — that is itself a finding). Common wrong answer: waiting 2 hours for the upstream revert with no consumer-side mitigation.
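The consumer-side null-rate alert and the "impute a plausible default instead of 0" mitigation can be sketched as follows. All names, the 5-point absolute threshold, and the sample values are hypothetical; a real ranker would apply this inside its feature-fetch layer rather than on a Python list.

```python
from statistics import median


def null_rate(values):
    """Fraction of requests where the feature arrived as None."""
    return sum(v is None for v in values) / len(values)


def null_rate_alert(values, baseline_rate, max_increase=0.05):
    """True when the null rate exceeds its baseline by more than max_increase (absolute)."""
    return null_rate(values) - baseline_rate > max_increase


def impute(values, fallback):
    """Replace missing values with a fallback instead of a misleading 0."""
    return [fallback if v is None else v for v in values]


# Hypothetical sample of user_engagement_7d after the upstream rename: 60% null.
recent = [0.8, None, 0.3, None, None, 0.5, None, None, 0.9, None]
healthy_history = [0.8, 0.3, 0.5, 0.9, 0.4]  # hypothetical pre-incident values

if null_rate_alert(recent, baseline_rate=0.02):
    fallback = median(healthy_history)
    repaired = impute(recent, fallback)
    print(null_rate(recent), fallback, repaired)
```

The design point is that the alert fires on missingness itself, so it does not depend on the upstream pipeline reporting a failure.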
06:55, Monday. The nightly retrain of the ranking model finished green. A weak sanity alert (warning, not page) notes the model artifact is 410 MB; for the last 30 days it has been 680 MB ± 3%. Offline AUC on the held-out set: 0.847 vs last week's 0.851 — within normal noise. The auto-promote job pushes the new model to production at 07:30. It is 06:55.
You have 35 minutes of wall-clock but answer for the next 15. What is your very first action, and why is the "AUC looks fine" argument a trap?
✅ Model answer — Simulation 3
Cheap reversible action first — and it is the whole answer: BLOCK THE PROMOTE. Pause the 07:30 auto-promotion before doing anything else. This costs nothing: production keeps serving yesterday's model, which was fine yesterday and is fine today. A one-day-stale ranker is a non-event; a corrupted ranker at peak traffic is an incident. The asymmetry is total, so you do not need to be sure anything is wrong — the size anomaly alone justifies the block.
Why "AUC is fine" is a trap: a 40% smaller artifact usually means 40% of something is missing — typically a chunk of the embedding vocabulary or feature hash space, because an upstream extract silently dropped a partition (one day of data, one region, or a join that brings in long-tail IDs returned empty). If the eval set is built from the same truncated data, it cannot see the hole: the missing users and items are absent from eval too. Aggregate AUC on a self-consistent-but-incomplete dataset is exactly the metric that stays green while the model quietly loses its tail. Silent data loss is the one failure class where "metrics look fine" is expected, not reassuring.
Minutes 2–15, after blocking: diff the artifact against last week's — embedding table row counts, feature vocabulary sizes, per-feature coverage. Then diff the training data: row counts per day/partition vs the 30-day norm; you will usually find the dropped partition in minutes. Determine whether the loss is upstream (a re-run fixes it) or in your extract code. Re-run once the input is whole; only promote a model whose size and slice-level evals are back in band.
Why this is correct: this is the purest test of the chapter's theme. There is a free, instantly reversible action with unbounded downside protection, available before any diagnosis. Anyone who starts with "first I'd investigate the data pipeline" has already failed — at 07:30 the bad model ships while they investigate. Staff+ adds prevention: make artifact size, embedding cardinality, and training-row counts blocking promotion gates rather than warnings.
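A blocking promotion gate of the kind the prevention step describes might look like the sketch below. The function name, the dictionary layout, and the `training_rows` figures are hypothetical; only the 410 MB artifact, the 680 MB baseline, and the ±3% band come from the scenario.

```python
def promotion_allowed(metrics, baselines, tolerances):
    """Return (allowed, failures). Every gated metric must sit inside its relative band."""
    failures = []
    for name, baseline in baselines.items():
        if abs(metrics[name] - baseline) > tolerances[name] * baseline:
            failures.append(name)
    return (not failures, failures)


baselines = {"artifact_mb": 680, "training_rows": 1_000_000}   # training_rows is hypothetical
tolerances = {"artifact_mb": 0.03, "training_rows": 0.05}      # relative bands (assumed 5% for rows)

tonight = {"artifact_mb": 410, "training_rows": 610_000}       # hypothetical row count
allowed, failures = promotion_allowed(tonight, baselines, tolerances)
print(allowed, failures)  # the promote is blocked and the failing gates are named
```

Aggregate AUC is deliberately absent from the gate inputs here: as the answer explains, it can stay in band while the data underneath it is incomplete.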
03:12, Saturday. Page: the new ranker variant (at 10% traffic since Friday noon) has breached the latency guardrail — p99 went from 180 ms to 290 ms over the last hour — and the crash-rate guardrail is yellow. The experiment owner is asleep. Your org's experiment platform supports one-click ramp-down. Topline engagement in the variant still looks slightly positive.
Next 15 minutes. Do you wait for the experiment owner? Does "engagement is still positive" change your decision?
✅ Model answer — Simulation 4
Cheap reversible action first: ramp the variant to 0% now. Do not wait for the owner. A guardrail is a pre-commitment — the team agreed in advance, while calm, that this threshold means stop; 3am with partial information is precisely the situation pre-commitments exist for. Re-litigating the threshold mid-breach defeats its purpose. And the action is the definition of reversible: the experiment can re-ramp Monday at the click of a button, with zero lost work beyond a weekend of data.
Why "engagement is positive" does not change the call: the comparison is asymmetric. Upside of leaving it on: a weekend of slightly better experiment data. Downside: latency degradation compounding under Saturday peak, crash-rate yellow going red, and users churning — harms that are real and partially irreversible, versus an experiment delay that is fully recoverable. Also, short-horizon engagement under elevated latency is untrustworthy: latency damage shows up in return-rate days later, not in same-hour clicks.
Minutes 0–5: ramp to 0%, then confirm p99 and crash rate recover on the traffic that had been in the variant (if they do NOT recover, the variant was not the cause — drop that assumption and treat it as an infra incident: check deploys, hosts, traffic anomalies).
Minutes 5–15: snapshot dashboards and a few traces from the breach window so the owner can do root cause Monday without the evidence aging out of retention. Leave a clear handoff note: what breached, when you ramped down, recovery confirmation, links. No 3am debugging of the variant's latency — that is the owner's daytime job.
Why this is correct: the recovery check is the subtle senior move — ramping down is both mitigation and a diagnostic test, and if metrics do not recover you have falsified your hypothesis in five minutes. Common wrong answers: paging the experiment owner and waiting (the guardrail already encodes their decision), and killing the experiment permanently/deleting it (over-rotation: 0% ramp preserves the experiment setup and the learning).
11:20, Wednesday. Gradual alarm: recall@100 of the retrieval stage (measured online against a click-based proxy) has slid 30% since ~09:00. Search "feels off" per support tickets. At 09:00, the embedding-service team deployed query-encoder v12 ("minor fine-tune, backwards compatible"). The ANN index was built last night with item embeddings from encoder v11. Nothing crashed; latencies are normal; the ANN service reports healthy.
Next 15 minutes. Why did nothing crash, and what does that tell you about where the fix must go?
✅ Model answer — Simulation 5
Cheap reversible action first: roll the query encoder back to v11. That is the change that correlates with the slide, the rollback takes minutes, and it restores the invariant that matters: queries and index must be embedded in the same vector space. Do not try to "fix forward" by rebuilding the index against v12 — a full index rebuild takes hours, and rollback gets users healthy now. Rebuild-then-coordinated-cutover is the daytime plan.
Why nothing crashed — and why that is the lesson: v11 and v12 produce vectors of the same dimensionality, so every API contract is satisfied: the encoder returns 768 floats, the ANN index happily computes nearest neighbors, everything is type-correct and "healthy." But two encoders trained separately — even a fine-tune of the same base — do not share a geometry: distances between a v12 query vector and v11 item vectors are no longer trustworthy, and a fine-tune degrades retrieval in proportion to how far it moved the space (here a 30% recall slide, not a total collapse). The system fails semantically while passing every syntactic check. This is the classic ML-systems failure: the contract that broke (same embedding space) was never written down as a contract, so no monitor owned it. Only a quality metric (recall proxy) could catch it — which is exactly why you have one.
Minutes 0–5: confirm the timeline (slide starts at the v12 deploy), then roll back the query encoder. Verify recall@100 recovers over the next minutes of traffic.
Minutes 5–15: message the embedding team — "backwards compatible" needs a precise definition for encoders, and other consumers (recs? ads retrieval?) may be skewed too. Sketch the proper rollout for v12: rebuild the index with v12 item embeddings, then cut query traffic and index over atomically (or run dual-encoding during transition). File for prevention: version-pin the encoder ID into the index metadata and have the retrieval service refuse — or alarm — on mismatch at request time.
Why this is correct: rollback dominates because it is minutes-cheap and restores a known-good invariant, while every fix-forward path is hours-long. Staff+ answers name the general principle: any system with a trained artifact on both sides of an interface (two-tower retrieval, reranker + candidate gen, encoder + index) needs compatibility versioning, not just API versioning. Common wrong answer: restarting/rebuilding the ANN service — it is behaving perfectly; the bug is in the space, not the search.
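The "version-pin the encoder ID into the index metadata" prevention could be sketched like this. The class and function names are hypothetical, and a real retrieval service might alarm instead of raising.

```python
from dataclasses import dataclass


class EncoderMismatchError(RuntimeError):
    """Raised when a query encoder does not match the encoder that built the index."""


@dataclass(frozen=True)
class IndexMetadata:
    encoder_id: str  # encoder version used to embed the indexed items
    dim: int         # embedding dimensionality


def check_compatibility(index_meta, query_encoder_id, query_dim):
    """Refuse to search when query and index were embedded by different encoders."""
    if query_dim != index_meta.dim:
        raise EncoderMismatchError(
            f"dimension mismatch: query={query_dim}, index={index_meta.dim}"
        )
    if query_encoder_id != index_meta.encoder_id:
        # Same shape, different space: the case that passes every type check.
        raise EncoderMismatchError(
            f"encoder mismatch: query={query_encoder_id}, index={index_meta.encoder_id}"
        )


index_meta = IndexMetadata(encoder_id="v11", dim=768)
check_compatibility(index_meta, "v11", 768)  # passes silently

try:
    check_compatibility(index_meta, "v12", 768)
except EncoderMismatchError as err:
    print(err)
```

The dimension check alone would have passed in this incident, which is why the encoder ID has to be part of the contract.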
03:40 page: cloud spend rate 3× baseline since 02:10. Traffic is normal. What do you do in the next 15 minutes?
✅ Model answer — Simulation 6
Cheap and reversible first: open the autoscaler dashboard — is replica count 3× normal? If yes: cap max replicas at a sane ceiling NOW (reversible, stops the bleed) and look for what's driving scale-out with flat traffic: the classic is a retry storm — a dependency started failing/slow at 02:10, clients retried, synthetic load tripled, autoscaler obliged. Check dependency error rates and retry counts at 02:10; if found, enable the circuit breaker (or lower its trip threshold) and fix or fail-fast the dependency. Second classic: a deploy at ~02:00 with a slow code path (3× CPU per request) — same traffic, 3× compute; check deploy timeline, roll back. Third: a batch/training job scheduled onto the serving account (check per-namespace breakdown). What you do NOT do at 3am: optimize instance types, renegotiate pricing, or kill replicas below safe capacity — the ceiling cap plus cause isolation is the whole 15-minute job, and the cost of one wasted night of compute is far below the cost of a capacity outage at morning peak.
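To see how a retry storm triples load with flat user traffic, consider a simple model: each attempt fails independently with some probability, and the client retries immediately up to a fixed limit. The retry limit of 2 is an assumption chosen to illustrate the 3× figure; real clients with backoff or multiple retrying layers behave differently.

```python
def retry_amplification(failure_prob, max_retries):
    """Expected attempts per user request: sum of failure_prob**k for k = 0..max_retries."""
    return sum(failure_prob ** k for k in range(max_retries + 1))


print(retry_amplification(0.0, 2))  # healthy dependency: 1.0 attempt per request
print(retry_amplification(0.5, 2))  # half of attempts fail: 1.75 attempts per request
print(retry_amplification(1.0, 2))  # dependency fully down: 3.0 attempts per request
```

With the dependency fully failing, every request costs three attempts, so the autoscaler sees 3× load even though user traffic has not changed. A circuit breaker cuts this by refusing attempts once the failure rate is high.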
- Reversible, bounded actions first: rollback, cap, disable, block-the-promote. Diagnose AFTER the bleeding stops.
- Never promote/deploy anything new during triage — including "quick fixes."
- The timeline question ("what changed at T?") solves most incidents; keep deploys, config pushes, and data-pipeline events on ONE timeline.
- If the action is irreversible (deleting data, failing over a region), it needs a second person — at any hour.
- Alert → stop the bleed → localize by timeline → fix → prevent: in that order, every sim above.
- Blocking a bad model promote is free; un-shipping one after its outputs have landed in prod logs (and future training data) is not.
- Retry storms and autoscale runaways turn small failures into big bills — caps and budgets are pre-decided, not improvised.
Design and debug challenges only sharpen you if you evaluate your answers honestly and at the right altitude. This chapter exposes the full grading rubric — what separates a junior answer from a principal answer on the same question — then gives you a concrete solo-practice protocol to use for every chapter in this course, and finally maps each challenge type to the course chapters you should revisit when you miss.
Every ML-systems interview question has a "ceiling" that rises with seniority. The ceiling is not about knowing more jargon; it is about how many layers of consequence you reason through. The rubric below applies to any design or debug question. Read the anchor phrases, then internalise the pattern.
The example question used throughout is: "Design a feed-ranking system for a social app with 10 M DAU and a 100 ms end-to-end budget." This is ch2. Read the table column by column first, then row by row.
| Dimension | Junior answer | Senior answer | Staff answer | Principal answer |
|---|---|---|---|---|
| Component naming | "We need a retrieval stage, a ranking model, and a serving layer." | "Two-tower retrieval with ANN at a 20 ms budget, a LightGBM ranker at 20 ms for top-500, and a deep ranker at 40 ms for top-100." | "The two-tower produces 500 candidates in 15 ms; if we ever switch to a dense re-ranker we'll need to revisit the candidate count because inference cost grows quadratically." | "Do we actually need a two-tower? At 10 M DAU and a 2 k candidate pool, a pure BM25 + heuristic pre-filter covers cold-start better and the model can be a single small ranker — less infra, easier debugging." |
| Quantitative justification | (absent — no numbers given) | "10 M DAU × 4 refreshes/day = 40 M requests/day; assuming they land in a ~6.7 h active window (≈ 24 000 s), that is ≈ 1 667 QPS sustained; peak is 5× ≈ 8 k QPS; one H100 handles ~2 k QPS at this FLOP budget, so 4 GPUs with 2× redundancy = 8 GPUs minimum." | Same arithmetic plus: "At 8 k QPS peak, the feature store sees 8 k × 500 candidates × 40 features = 160 M feature lookups/s. That is the real bottleneck — not the model." | "The 100 ms SLA is p99 end-to-end. My experience is that 60 % of p99 is network + feature fetch, not model inference. Before sizing GPUs I want to see the latency breakdown from a traffic replay." |
| Tradeoff articulation | "We can use a neural ranker or a tree model." | "LightGBM has 5–10 ms inference on CPU, is interpretable, and is safe to deploy; a neural ranker is 30–50 ms on GPU but captures feature interactions. For launch, GBDT; for phase 2, neural." | "GBDT vs neural is also a team-capability tradeoff: ML platform team already has GBDT serving; neural adds an inference serving dependency that takes 3 sprints to productionise and needs on-call rotation." | "The right question is not GBDT vs neural but 'what is the cheapest model that beats the baseline metric by ≥ 1 % CTR?' — run a 1-week experiment with GBDT first; if it's not enough, justify the neural infra." |
| Failure modes | (not mentioned) | "If the ranker is slow, the fallback is to serve the last cached ranking." | "Three failure modes to plan for: (1) feature store timeout → serve with stale features and alert; (2) ANN index refresh lag → candidates are stale up to 5 min, acceptable; (3) cold-start user (20 % of DAU) → pure-popularity fallback, re-enters ranked pool after 10 positive interactions." | "The failure mode I'd add: correlated failures. Feature store latency spike and model serving spike tend to co-occur under traffic bursts because both share the same GPU cluster. Separate the serving path so a model spike doesn't cascade to feature fetch." |
| Evolution / org cost | (not mentioned) | "We can add more GPUs as DAU grows." | "At 3× DAU the ANN index needs re-sharding; design the retrieval layer with shard-aware routing now so the migration is a config change, not a rewrite. Org cost: the feature store design requires feature ownership contracts — budget 2 sprints of platform work." | "The 40-person org is the constraint. A two-stage retrieval + ranking pipeline needs 3–4 teams to own it reliably. If the org is under-resourced for that, a simpler monolithic ranker with heuristic retrieval is the right call until the team grows." |
| Prompt challenge / simplification | (not attempted) | (not attempted) | (rarely attempted) | "100 ms end-to-end — is that measured at the load balancer or at the client? Client-measured includes mobile network RTT which is 50–100 ms alone. If the real latency target is 'feels fast on mobile', we should spec server-side latency at 40 ms and invest in client-side prefetch, which is cheaper than shaving GPU time." |
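The senior and staff arithmetic in the "Quantitative justification" row can be checked with a short sketch. It takes the table's rounded 8 k QPS peak as given; the function names are illustrative, and the per-GPU throughput, redundancy factor, candidate count, and feature count are the table's own figures.

```python
import math


def size_ranker_fleet(peak_qps, qps_per_gpu, redundancy):
    """GPUs needed to serve peak_qps, multiplied by a redundancy factor."""
    return math.ceil(peak_qps / qps_per_gpu) * redundancy


def feature_lookups_per_second(peak_qps, candidates_per_request, features_per_candidate):
    """Feature-store read rate implied by ranking every candidate on every request."""
    return peak_qps * candidates_per_request * features_per_candidate


PEAK_QPS = 8_000  # rounded peak from the table

gpus = size_ranker_fleet(PEAK_QPS, qps_per_gpu=2_000, redundancy=2)
lookups = feature_lookups_per_second(PEAK_QPS, 500, 40)
print(gpus)                # 8
print(f"{lookups:,}")      # 160,000,000
```

The two outputs differ by seven orders of magnitude, which is the staff column's point: the feature store, not the model, is where the load concentrates.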
"Staff means knowing more things." No. Staff means holding more layers of consequence in your head simultaneously. A junior might know what a KV-cache is. A staff engineer reasons about how KV-cache memory per session caps batch size, which caps throughput, which determines whether you can serve the product's long-context feature at all — and says so unprompted.
Trigger: you have just written or spoken an answer to a design or debug challenge.
- Find your answer in the rubric table: which column does it most closely match?
- Ask: did I give at least one number derived from the scenario's constraints? If not, you are at junior regardless of vocabulary.
- Ask: did I name a failure mode and a mitigation? If not, you are at most senior.
- Ask: did I mention the org or team cost of my design decision? If not, you are at most senior.
- Ask: did I push back on any part of the prompt, or notice an ambiguity? If yes (and correctly), add one level.
Never: grade yourself on vocabulary. Saying "we use paged attention with disaggregated prefill/decode" while unable to do the memory arithmetic is junior cosplay. The rubric rewards consequence-reasoning, not jargon.
Passive reading gives you recognition memory — you feel like you know the answer because you've seen it. Active retrieval under time pressure is what builds the actual skill. Follow this protocol for every challenge in chapters ch2–ch11.
For design challenges: always write down the four fundamental numbers before drawing a single box — (1) QPS or request rate, (2) per-request compute or memory, (3) total resource requirement, (4) cost or latency implication. Until those four are on paper, your architecture sketch is decoration.
For debug challenges: always write down the three timeline questions before hypothesising — (1) when did the symptom start?, (2) what changed at or just before that time?, (3) which metrics moved together vs independently?
- Write first, speak second, check third — never reverse the order.
- A grade is only valid if it is based on the rubric's five questions, not your subjective confidence.
- Spaced re-attempts after 3 days are more valuable than back-to-back reruns.
- Annotate gaps instead of rewriting clean answers — the gap is the learning.
Each challenge in this course pairs with a runnable notebook in /notebooks/. The notebook gives you real telemetry to reason over, not hypotheticals. Use the notebook in two ways:
- Before the challenge: run the notebook, observe the numbers, then close it and attempt the challenge from memory of those numbers. This trains the skill of reasoning from real data rather than round numbers.
- After the challenge: use the notebook to verify your arithmetic. If your estimate for GPU count was 8 and the notebook simulation suggests 12, trace the discrepancy — it almost always reveals a utilisation assumption you got wrong.