DESIGN · 45-MIN INTERVIEW LOOP

ML system design — 12 worked problems

Sr Staff designs are not "draw a DLRM". They are: clarify scope, defend trade-offs, name second-order effects, and survive the deep-dive. This page is twelve problems worked end-to-end in the same disciplined structure — plus the meta-framework that ties them together.

Read time: ~60 min · Asked at: Pinterest, Meta, Google, OpenAI, Anthropic · Difficulty: Sr Staff bar
00
META · THE FRAMEWORK

The 45-min interview framework — clarify, capacity, API, architecture, deep-dive

TL;DR

The Sr Staff bar is not "drawing the right diagram"; it's making sensible cost/quality/latency trade-offs and naming second-order effects. Every loop has the same shape: clarify the problem, estimate capacity, sketch the architecture, deep-dive a component, name the gotchas. The candidates who skip clarification and jump to "two-tower + MMoE + DLRM" lose the loop in the first 5 minutes.

THE FRAMEWORK

Five phases, every time

  1. Clarify (0–5 min) — scale, latency budget, what to optimize, what's allowed, who the user is, what systems already exist. Write the constraints down. Confirm them with the interviewer.
  2. Capacity (5–10 min) — back-of-envelope: QPS, storage, training data volume, memory. Pick the constraint that will dominate.
  3. API + architecture (10–20 min) — boxes and arrows. State the funnel/stack. Two-stage retrieval+ranker, prefill+decode, batch+streaming feature store, etc.
  4. Deep-dive (20–35 min) — interviewer signals what they want pulled apart. Discuss alternatives. Defend choices with first principles. This is the bulk of the signal.
  5. Eval + monitoring + gotchas (35–45 min) — offline metrics, online metrics, guardrails, what would you do differently with more time.

The signals interviewers grade on

PITFALL — the four classic failure modes
(1) Jumping to architecture before clarifying. (2) Ignoring evaluation entirely (offline AUC + online A/B + guardrails). (3) Skipping monitoring (latency p99, calibration drift, feature freshness). (4) Claiming a single technology solves everything ("we'd use Spanner for everything"). Any of these alone is a downlevel signal.
REMEMBER
  • Clarify → Capacity → Architecture → Deep-dive → Eval/monitoring/gotchas. Always.
  • Name trade-offs explicitly. "I'd pick X because Y, even though Z."
  • Cover eval, monitoring, and gotchas — they're 30% of the score.
  • Second-order effects (bias, drift, feedback loops) are the Sr Staff differentiator.
01
RECSYS · LARGE-SCALE FUNNEL

Design YouTube recommendations

Recommend videos to 2B users from a corpus of billions, with sub-100ms latency, optimizing for watchtime AND satisfaction (likes, subscribes, no-dislike) — without collapsing into a clickbait filter bubble. The hard part of this problem is the multi-objective optimization: pCTR alone produces clickbait; pWatchtime alone produces engagement-bait; you need a multi-task ranker with the right weights, and you need diversity / exploration baked in.

TL;DR

Two-stage funnel: retrieval (1B → 1k) merges several sources (two-tower + collab + fresh + trending); ranking (1k → 100) is a multi-task cross-encoder (DLRM-style or transformer) with MMoE shared experts; final scoring is a tuned weighted sum across pCTR/pWatchtime/pLike/pDislike with diversity (MMR/DPP) and exploration slots. Streaming training for freshness, shared embedding tables sharded model-parallel.
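
A minimal sketch of the final-scoring step described above — blending multi-task head outputs with a tuned weighted sum, then MMR for diversity. The head names, weights, and similarity function are illustrative assumptions, not production values.

```python
import numpy as np

# Illustrative blend weights per head (in practice tuned via A/B tests).
WEIGHTS = {"p_watchtime": 1.0, "p_like": 0.3, "p_subscribe": 0.2, "p_dislike": -0.5}

def blend(head_scores: dict[str, np.ndarray]) -> np.ndarray:
    """Weighted sum of calibrated per-head probabilities -> one utility score per candidate."""
    return sum(w * head_scores[name] for name, w in WEIGHTS.items())

def mmr_rerank(utility: np.ndarray, item_embs: np.ndarray, k: int = 10, lam: float = 0.7) -> list[int]:
    """Maximal Marginal Relevance: trade utility against similarity to already-picked items."""
    picked, candidates = [], list(range(len(utility)))
    while candidates and len(picked) < k:
        def mmr_score(i):
            sim = max((item_embs[i] @ item_embs[j] for j in picked), default=0.0)
            return lam * utility[i] - (1 - lam) * sim
        best = max(candidates, key=mmr_score)
        picked.append(best)
        candidates.remove(best)
    return picked  # exploration slots (5-10%) are then injected into the final ordering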

Requirements

Data

Architecture — classic two-stage funnel

Training infra

Massive embedding tables (hundreds of GB) — sharded model parallel (TorchRec). Streaming training (Kafka → Flink → trainer) for freshness. Daily full retrain + hourly incremental.

Serving

Retrieval served from an in-memory ANN index (replicated). Ranker on a GPU/CPU farm with feature-store fan-in. The feature store is split into online (Redis-like, sub-ms) + offline (warehouse for training).

Eval

Offline AUC, NDCG, recall@k, calibration. Online: A/B tests with primary metric watchtime, guardrails on dislikes/abandonment/diversity.

Monitoring

Per-head calibration drift, retrieval recall slipping, feature freshness, p99 latency, fraction of recommendations that come from each retrieval source (don't let one collapse), exploration slot fraction.

EXAMPLE — concrete capacity numbers

2B users × 10 sessions/day × 50 candidates ranked = ~1 trillion ranker evaluations/day. At 1ms per ranking on GPU, that's ~12k GPU-equivalents continuously. In practice you batch and run on heterogeneous CPU+GPU pools. Embedding tables: 100M videos × 256-dim BF16 = ~50 GB just for items; user side similar order. Total ~hundreds of GB → must be model-parallel sharded (TorchRec).

PITFALL — clickbait, feedback loops, position bias
Clickbait: training on pCTR alone teaches the model to surface thumbnails that get clicked, not videos that get watched. Always train pWatchtime + pSatisfaction, not pCTR alone.
Feedback loops: model only sees feedback on items it showed, so it can't unlearn its own biases. Mandatory exploration slots (5–10%) are non-negotiable.
Position bias: position 1 always gets clicked more. Estimate the position effect with a shallow position-only tower and subtract; or use a position-aware loss (PAL) — a minimal sketch follows this list.
Train/serve skew: same feature definitions across pipelines, point-in-time correctness during training joins.
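
A minimal PyTorch sketch of the PAL idea from the position-bias item above: a shallow position-only tower is added to the relevance logit during training and dropped at serving time. Tower sizes, feature shapes, and the number of positions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PALRanker(nn.Module):
    """Relevance tower + shallow position tower; the position effect is only used in training."""
    def __init__(self, feat_dim: int, num_positions: int = 50):
        super().__init__()
        self.relevance = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        self.position_bias = nn.Embedding(num_positions, 1)  # shallow, position-only

    def forward(self, feats: torch.Tensor, position: torch.Tensor | None = None) -> torch.Tensor:
        rel = self.relevance(feats).squeeze(-1)
        if self.training and position is not None:
            return rel + self.position_bias(position).squeeze(-1)  # train: relevance + position effect
        return rel                                                  # serve: relevance only

# Training: loss = nn.BCEWithLogitsLoss()(model(feats, position), clicks)
# Serving:  model.eval(); scores = model(feats)
```
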
REMEMBER — what they're probing
  • Two-stage funnel discipline: retrieval is recall-first, ranking is precision-first.
  • Multi-task heads + MMoE — never single-objective on a recommender.
  • Calibration matters when scores feed downstream (auctions, weighted blending).
  • Feedback loops are the killer: exploration is required, not optional.
02
RECSYS · REAL-TIME FEED

Design Twitter / X feed ranking

Like YouTube, but with severe recency, real-time engagement signals, and heterogeneous content (text/image/video/links). The hard part of this problem is real-time freshness: a tweet from 2 minutes ago with 500 likes/min is more valuable than a tweet from 2 days ago with 10k cumulative likes — your ranker needs to consume real-time engagement counters as features without breaking train/serve consistency.

TL;DR

Source mixer (in-network + out-of-network + promoted + lists) feeds a heavy multi-task ranker (MaskNet / DCN-v2) with real-time engagement counts and recency decay; heuristic re-rank handles diversity and sensitive-content filtering. Negative-feedback signals (mute/block) are weighted heavily; SimClusters provide interest representations.

Requirements

Architecture

Training infra

Hours-fresh data; daily retrains. Embeddings for user/author/topic. SimClusters (Twitter's clustering) provides interest representations.

Serving

Real-time engagement counters via Kafka + sliding window aggregator. Per-tweet feature lookup at request time. Hot tweet cache.
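
A minimal sketch of a sliding-window engagement counter of the kind described above (stream of like events → likes-per-minute feature). In production this lives in the Kafka/Flink aggregator with the identical definition logged for training; the window length is an illustrative assumption.

```python
from collections import deque

class SlidingWindowCounter:
    """Count events in the trailing `window_s` seconds, e.g. likes/min on a tweet."""
    def __init__(self, window_s: int = 60):
        self.window_s = window_s
        self.events = deque()  # event timestamps (seconds)

    def add(self, ts: float) -> None:
        self.events.append(ts)
        self._evict(ts)

    def value(self, now: float) -> int:
        self._evict(now)
        return len(self.events)

    def _evict(self, now: float) -> None:
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()

# counter.add(t) on each like event; counter.value(now) is then served as a ranker feature.
```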

Eval

Offline: AUC per head, calibration. Online: dwell time, follow rate, negative feedback rate, retention.

PITFALL — negative feedback, bots, trending shifts
Negative feedback signals (mute, block, "see less") are strong → weight heavily, often as a separate head with negative coefficient in the final blend.
Bots / spam need an upstream filter; they pollute training data and can game engagement counts.
Trending events create positive-class shifts mid-day; a model trained yesterday underweights them. Streaming features and frequent retrains both matter.
REMEMBER — what they're probing
  • Real-time engagement features as ranker inputs — no train/serve skew.
  • Negative feedback is a signal, not noise — model it explicitly.
  • Source mixer pattern: candidates from many funnels, blended in the ranker.
  • Recency decay must be a feature, not just a heuristic post-rank.
03
RECSYS · CALIBRATED PROBABILITY

Design ad CTR prediction

Predict click probability for ad-impression pairs at millions of QPS, with the predicted probability used directly in second-price auctions and pacing systems. The hard part of this problem is calibration: an AUC-optimal ranker that's miscalibrated will mis-bid on every impression. The feature space is also extremely high-cardinality (billions of users × millions of ads × thousands of slots) and the labels are heavily imbalanced (~0.1–1% positive rate).

TL;DR

DLRM-style stack with massive embedding tables for sparse IDs, DCN-v2 cross terms, DIN/DIEN sequence attention. Streaming online learning + daily retrains. Calibration via isotonic / Platt on a held-out window — recalibrate frequently, especially across regime shifts. Watch for delayed conversions and selection bias.

Requirements

Data

Impressions and clicks (positive class very rare, ~0.1–1%), context (page, slot, time), ad creative, advertiser, user features.

Model — classic stack

Training infra

Streaming online + daily retrains. Massive embedding tables (TBs) → row-wise parallel.

Serving

Embedding-table lookup service (sharded by hash), feature fetch, ranker eval, calibration, score back to auction.

Eval

AUC, log-loss, normalized cross-entropy (NCE), calibration plots. Online: revenue, eCPM, advertiser ROI.

Calibration — critical for auctions

Apply isotonic regression or Platt scaling on a held-out window; recalibrate frequently. AUC measures ranking, not whether predicted 0.05 actually means 5% click rate. In a second-price auction, miscalibration directly leaks revenue.
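
A minimal sketch of the recalibration step, assuming scikit-learn and a recent held-out window of (raw score, click label) pairs. Refit regularly; use per-head calibrators if the ranker is multi-task.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_scores: np.ndarray, clicks: np.ndarray) -> IsotonicRegression:
    """Fit a monotone map raw score -> calibrated click probability on a held-out window."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(raw_scores, clicks)
    return iso

# At serving time, before the auction:
# p_ctr = calibrator.predict(raw_scores)   # calibrated probability, fed into eCPM = p_ctr * bid
```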

EXAMPLE — why calibration breaks the bid

Ad A is predicted at pCTR 0.10 and Ad B at 0.05. AUC says "rank A higher" — correct. But the auction multiplies predicted CTR by bid: if the true CTRs are 0.02 and 0.01, both predictions are inflated by the same 5× factor, relative eCPMs are preserved, and the winner doesn't change. If the miscalibration is uneven — A's true CTR is 0.02 but B's is 0.005 — the predicted 2:1 ratio understates the true 4:1 ratio, so an advertiser bidding, say, 3× more on B wins the auction even though A would have earned more. AUC is invariant to monotone rescaling; auctions are not.
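
The same example in numbers, with an assumed 3× bid ratio: the miscalibrated model hands the auction to B even though A has the higher true eCPM.

```python
bid_a, bid_b = 1.00, 3.00          # assumed bids ($ per click)
pred_a, pred_b = 0.10, 0.05        # predicted CTRs
true_a, true_b = 0.02, 0.005       # true CTRs (uneven miscalibration: 5x vs 10x inflated)

print("predicted eCPM:", pred_a * bid_a, pred_b * bid_b)  # 0.10 vs 0.15 -> B wins the auction
print("true eCPM:     ", true_a * bid_a, true_b * bid_b)  # 0.02 vs 0.015 -> A actually earns more
```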

PITFALL — delayed conversions, selection bias, distribution shift
Delayed conversions: a click leading to a conversion takes hours to days; naive labeling treats unobserved conversions as negative. Use delayed-feedback modeling (Chapelle 2014: importance-weighted loss with conversion-delay distribution).
Selection bias: training log only contains impressions the model showed → IPS or doubly-robust correction needed.
Cold start: new ads have no history → content-based fallback (creative-text embedding).
Distribution shift at sales/holidays: model trained pre-Black-Friday over-bids on Black Friday. Frequent retrains and regime detection.
REMEMBER — what they're probing
  • Calibration ≠ AUC — name this distinction unprompted.
  • Embedding-table sharding is the systems story (TBs of state).
  • Delayed feedback + selection bias + cold start — the three classic CTR gotchas.
  • Streaming online + daily retrains: calibration drifts, you re-fit.
04
SEARCH · HYBRID RETRIEVAL

Design personalized search ranking

Search ranking with personalization, but careful — over-personalizing on ambiguous queries ("Apple") gives the wrong intent. The hard part of this problem is unbiased eval: position bias is enormous in search (top result gets 30%+ of all clicks regardless of relevance), so you need click models or randomized exploration to even measure quality.

TL;DR

Hybrid retrieval (BM25 + dense two-tower, possibly ColBERT late-interaction) → cheap L1 ranker → cross-encoder L2 ranker with listwise loss. Personalization via user embedding (long-term + session). Don't over-personalize ambiguous queries. Eval with NDCG@k offline, click models for unbiased online eval, plus head/tail query slices.
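
A minimal sketch of blending the BM25 and dense retrieval lists. Reciprocal Rank Fusion (RRF) is one standard fusion choice (and the one named in the RAG section later); the constant k=60 is the commonly used default rather than something tuned here.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_in_list)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([bm25_top_1000, dense_top_1000])[:200]  # feed the fused list to the L1/L2 rankers
```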

Pipeline

Eval

NDCG@k, MRR, click models for unbiased eval (Cascade, PBM, DBN). Online: search success, session abandonment.

Monitoring

Per-segment NDCG (head queries vs tail queries vs navigational vs informational), latency p99, freshness for news queries, click-through patterns by position.

PITFALL — position bias, head/tail, freshness
Position bias (huge in search) → click models (Cascade, PBM, DBN) for unbiased eval. Otherwise you're just measuring whether the new model agrees with the old one.
Head/tail queries — separate eval slices. A model that wins on head can lose 5% on tail and overall metrics will hide it.
Freshness for news vs stable for evergreen — query-class-conditional ranking weights.
Over-personalization on ambiguous queries — "Apple" should sometimes show fruit; over-fitting to a tech user's history breaks intent diversity.
REMEMBER — what they're probing
  • Hybrid (BM25 + dense) is universal — pure dense underperforms on lexical queries.
  • Cross-encoder L2 ranker; listwise loss (LambdaMART). Pointwise loses signal.
  • Click models for unbiased eval — name them (PBM, Cascade, DBN).
  • Per-segment eval — head/tail, navigational/informational, fresh/stable.
05
LLM SERVING · MULTI-TENANT

Design ChatGPT serving at scale

Serve 200M+ DAU at sub-2s TTFT and sub-50ms inter-token latency, across short and long prompts, with multi-tier SLAs and tight GPU economics. The hard part of this problem is head-of-line blocking: a single 100k-token prefill request can block dozens of short requests in a naive scheduler. The solution is disaggregated prefill+decode pools, prefix caching, continuous batching, and SLO-aware scheduling.

TL;DR

Disaggregated prefill + decode pools (KV transferred over RDMA). PagedAttention prefix cache keyed on prompt prefix. Continuous batching with per-iteration scheduling. TP=8 within node for 70B+; speculative decoding (EAGLE); FP8 weights+activations. Model router selects mini/full/reasoning model by query complexity. Priority queues separate long-context from short.

Requirements

Architecture

Eval

Offline (held-out conversations, MMLU/HumanEval/MATH/AgentBench), online (thumbs feedback, conversation length, retry rate, A/B on model versions).

Monitoring

TTFT/ITL distributions, KV cache hit rate, GPU utilization, queue depths, OOM rate, refusal rates, output toxicity.

EXAMPLE — why disaggregated prefill+decode

Prefill is compute-bound (big matmuls over the whole prompt at every layer). Decode is memory-bound (small per-token matmuls, dominated by reading weights and KV cache). Mixing them in the same pool means decode steps wait behind prefill steps and ITL spikes. Disaggregating: the prefill pool runs hot on compute (FP8 saturates tensor cores); the decode pool runs hot on memory bandwidth. The KV cache moves once over RDMA. Result: stable ITL even when prefill load fluctuates.
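
A back-of-envelope sketch of why the two phases stress different resources, using the rough 2·params FLOPs-per-token approximation and assuming weights are read once per forward pass (KV-cache traffic, which further hurts decode, is ignored). Model size, prompt length, and batch size are illustrative assumptions.

```python
params = 70e9            # assumed 70B-parameter dense model
bytes_per_param = 1      # FP8 weights
prompt_tokens = 4000     # prefill processes these in one pass
decode_batch = 64        # decode processes one token per sequence per step

flops_per_token = 2 * params                 # rough matmul FLOPs per token
weight_bytes = params * bytes_per_param      # weights read once per forward pass

# Arithmetic intensity = FLOPs per byte of weight traffic; compare against the GPU's ridge point.
prefill_intensity = prompt_tokens * flops_per_token / weight_bytes   # ~8000 FLOPs/byte -> compute-bound
decode_intensity = decode_batch * flops_per_token / weight_bytes     # ~128 FLOPs/byte  -> memory-bound
print(prefill_intensity, decode_intensity)
```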

PITFALL — head-of-line blocking, autoscaling lag, safety latency
Head-of-line blocking: long-context requests block short ones in naive batching. Mitigations: priority queues, separate pools for long context, SLO-aware scheduling.
Noisy neighbors on shared pools.
Autoscaling lag: GPU spin-up takes minutes — you can't scale reactively. Predictive scaling on traffic forecasts.
Safety filtering pipeline adds latency: pre-filter on input, post-filter on output. Both add hundreds of ms if naive; needs streaming + parallel checks.
Streaming TTFT and full-completion latency are different SLOs: optimize them separately.
REMEMBER — what they're probing
  • Disaggregated prefill+decode is the 2025 default.
  • Prefix caching + continuous batching + speculative decoding stack is non-negotiable.
  • Model router (mini/full/reasoning) is how you make economics work.
  • Head-of-line blocking on long context is THE LLM serving gotcha.
06
LLM TRAINING · ALIGNMENT LOOP

Design fine-tuning + Constitutional AI loop

Build the production pipeline that takes raw user prompts (and red-team prompts) and produces a fine-tuned, aligned, eval-gated, canary-rolled model. The hard part of this problem is the safety-helpfulness trade-off and the regression-gate engineering: a tiny safety regression on one slice should block the rollout, even if helpfulness improves overall.

TL;DR

Pipeline: data ingest (deduped, PII-filtered) → generation pool (K candidates per prompt, self-critique with constitutional principles, revisions) → preference labeling (AI judge + human gold subset) → SFT on revisions → DPO/KTO/PPO on preferences → eval (capabilities + safety + IF + winrate) → regression gate (block if any safety/capability metric regresses > ε) → canary rollout (1% → 10% → 100%) with online metric monitoring.

Pipeline

  1. Data ingest: raw prompts (real user, red team, synthetic). Deduped, PII-filtered.
  2. Generation pool: base model generates K candidates per prompt; self-critique with constitutional principles; revisions.
  3. Preference labeling: AI judge (stronger model) rates pairs against constitution. Sample subset for human review (gold set).
  4. Training: SFT on revisions; then DPO/KTO/PPO on preferences. Multiple iterations.
  5. Evaluation: capabilities (MMLU, MATH, HumanEval), safety (HarmBench, XSTest, jailbreak suite), instruction following (IFEval), preference winrate vs prior model.
  6. Regression gate: rollout is blocked if any of the tracked safety / capability metrics regresses by more than ε (a minimal gate sketch follows this list).
  7. Canary rollout: 1% → 10% → 100% with online metrics monitored.
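
A minimal sketch of the regression gate from step 6, comparing candidate and baseline eval results and blocking rollout if any metric regresses beyond its tolerance. Metric names and tolerances are illustrative assumptions.

```python
# "Higher is better" for every metric listed here; invert sign upstream for rates like attack-success.
TOLERANCES = {                      # illustrative per-metric epsilons
    "mmlu": 0.005,
    "humaneval": 0.01,
    "ifeval": 0.01,
    "harmbench_robustness": 0.0,    # zero tolerance on safety
    "xstest_accept_rate": 0.01,
}

def regression_gate(candidate: dict[str, float], baseline: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (allowed, regressed_metrics). Any regression beyond tolerance blocks the rollout."""
    regressed = [
        name for name, eps in TOLERANCES.items()
        if candidate[name] < baseline[name] - eps
    ]
    return (len(regressed) == 0, regressed)

# ok, failures = regression_gate(candidate_evals, baseline_evals)
# if not ok: block the canary rollout and page the owning team with `failures`.
```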

Eval

Capabilities (MMLU, MATH, HumanEval, MMLU-Pro, GPQA), safety (HarmBench attack success, XSTest over-refusal, JailbreakBench), instruction following (IFEval), preference winrate vs prior model (Arena-Hard, AlpacaEval 2).

Monitoring

Online thumbs feedback, refusal rate, output toxicity rate, conversation length, retry rate, jailbreak attempt rate. Per-cohort breakdowns.

PITFALL — reward hacking, mode collapse, eval contamination
Reward hacking: PPO finds prompts where the RM is wrong and exploits them. Mitigations: KL penalty, RM ensemble, rejection sampling.
Mode collapse: DPO can over-decrease the likelihood of both chosen and rejected responses — only the gap between them enters the loss (the DPO objective below this list makes that explicit). Mitigations: larger β, mix in SFT loss, IPO variant.
Distribution shift between SFT data and RL prompts → off-policy issues during PPO.
Safety-helpfulness trade-off: making the model refuse more reduces some risks but increases over-refusal on benign queries (XSTest).
Eval contamination: pretraining data leaked into eval → use contamination-resistant evals (GPQA-Diamond, LiveCodeBench, FrontierMath).
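
For the mode-collapse point above: the DPO objective depends only on the difference of policy-vs-reference log-ratios between the chosen response y_w and the rejected y_l, so both absolute likelihoods can fall while the loss still improves. Standard DPO notation (Rafailov et al. 2023); σ is the logistic sigmoid, β the implicit-reward temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[
      \log \sigma\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)
    \Big]
```
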
REMEMBER — what they're probing
  • Regression gate is the engineering discipline that makes alignment safe to ship.
  • AI judge + human gold subset is the cheap, high-quality preference label recipe.
  • Safety vs helpfulness is a Pareto curve — name both axes.
  • Canary rollout (1% → 10% → 100%) with online metrics — no big-bang deploys.
07
RAG · GROUNDED GENERATION

Design RAG at billion-doc scale

Build a retrieval-augmented generation system over billions of documents that produces grounded, cited answers. The hard part of this problem is chunk-boundary loss and "lost in the middle": naive chunking destroys context across boundaries; long stuffed contexts cause LLMs to underuse mid-context information.

TL;DR

Ingest with semantic-aware chunking + overlap → embed (Matryoshka-style for cheap recall + full-dim precision) → vector DB (sharded by tenant, IVF-PQ within shard) → hybrid retrieval (BM25 + dense, RRF fusion) → cross-encoder reranker (top-100 → top-10) → augmentation with citation markers → LLM generation with grounded-answer requirements. Optionally HyDE / multi-query / multi-hop.

Pipeline

Eval

Monitoring

Retrieval recall@k drift, citation accuracy, hallucination rate, latency per stage (embed, retrieve, rerank, generate), index staleness per tenant, embedding-model version pinning.

PITFALL — chunk boundaries, embedding drift, lost in the middle
Chunk boundaries lose context → overlap, hierarchical chunking, semantic chunking.
Embedding drift when re-embedding with new model — need full reindex; never mix versions.
Lost in the middle (Liu 2023) — LLMs underuse middle of long context. Mitigations: re-rank top-k aggressively, put best snippet first or last, truncate if context is too long.
Stale data — TTL + incremental reindexing.
Hallucinated citations — even with citation requirements, models invent source IDs. Validate citations against retrieved set programmatically.
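
A minimal sketch of the programmatic citation check from the last item: parse the source IDs the model emitted and reject or repair any that aren't in the retrieved set. The `[doc:ID]` marker format is an assumption; use whatever scheme the prompt enforces.

```python
import re

CITATION_RE = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")  # assumed citation marker format

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, set[str]]:
    """Return (all_valid, hallucinated_ids) for citations that don't map to a retrieved chunk."""
    cited = set(CITATION_RE.findall(answer))
    hallucinated = cited - retrieved_ids
    return (not hallucinated, hallucinated)

# ok, bad = validate_citations(llm_answer, {chunk.doc_id for chunk in retrieved_chunks})
# if not ok: strip or regenerate the offending citations, or fail closed for high-stakes answers.
```
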
REMEMBER — what they're probing
  • Hybrid (BM25 + dense, RRF fusion) — never pure dense at billion-doc scale.
  • Cross-encoder reranker is what makes top-10 actually correct.
  • Chunking strategy matters more than people think — semantic + overlap.
  • Faithfulness eval (RAGAS) + citation validation are the production safety nets.
08
INFRASTRUCTURE · ANN

Design vector search infrastructure

Build a billion-vector ANN system that handles inserts, deletes, attribute filters, and consistent QPS. The hard part of this problem is the filter problem: vectors are easy to find, but combining ANN with attribute filters (tenant, time range, category) breaks the obvious algorithms in non-obvious ways.

TL;DR

HNSW for <100M (in-memory, best recall/latency); IVF-PQ for billion+ (compressed); ScaNN/DiskANN intermediate. Sharding by tenant then by clustering; replicas for QPS. Filters need inline traversal (attribute-aware HNSW) — pre/post-filter both have failure modes. Insert is easy, delete is hard (tombstones + compaction). Re-indexing is mandatory on embedding-model changes.

Algorithms

Memory: billion 768-d float32 = 3 TB; PQ to ~32 bytes/vec → 32 GB.

System

Eval

Recall@k against brute-force ground truth on a sample; query latency p50/p99; throughput at target recall.

Monitoring

Per-shard latency, recall sampling against brute-force baseline, index size, deletion-tombstone fraction, compaction lag, filter-selectivity histograms.

EXAMPLE — billion-vector memory accounting

1B vectors × 768 dims × 4 bytes (FP32) = 3 TB. Won't fit on a single node. Three options: (a) shard across ~50 nodes of 64 GB each (in-memory HNSW per shard); (b) compress with IVF-PQ to ~32 bytes/vec → 32 GB total, fits on a few nodes; (c) DiskANN on SSD — same shard count as (a) but 10× cheaper hardware. Pick based on recall SLA: HNSW for highest recall, IVF-PQ for tightest budget.
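
The same accounting as a sketch, so the shard-count and compression numbers above are easy to re-derive for other dimensions or precisions.

```python
n_vectors = 1_000_000_000
dim, bytes_per_dim = 768, 4                       # FP32
raw_bytes = n_vectors * dim * bytes_per_dim       # ~3.07 TB

node_ram_gb = 64
hnsw_shards = raw_bytes / (node_ram_gb * 1e9)     # ~48 shards for in-memory HNSW (plus graph overhead)

pq_bytes_per_vec = 32
ivfpq_total_gb = n_vectors * pq_bytes_per_vec / 1e9   # ~32 GB after PQ compression

print(f"raw: {raw_bytes/1e12:.2f} TB, HNSW shards: {hnsw_shards:.0f}, IVF-PQ: {ivfpq_total_gb:.0f} GB")
```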

PITFALL — filters, deletes, embedding drift
Filters: pre-filter is bad when the filter matches a large fraction of the corpus (you effectively rebuild or scan the filtered set on every query); post-filter is bad when the filter matches only a few vectors (you discard most ANN results and come up short). Inline attribute-aware traversal (filtered HNSW) is the modern answer.
Deletes: graph indices don't truly delete — tombstones + periodic full compaction.
Embedding-model drift: changing the embedding model means re-indexing the entire corpus. Plan for this; never mix vectors from different model versions in the same index.
REMEMBER — what they're probing
  • HNSW vs IVF-PQ trade-off — recall vs memory vs cost.
  • Filters are the gotcha — name pre/post/inline distinction.
  • Deletes need compaction; re-indexing is mandatory on model changes.
  • Sharding by tenant first, then by clustering — multi-tenant isolation matters.
09
DATA · FEATURE INFRA

Design ML feature store with point-in-time correctness

Build the feature store every ML team in a company shares: declarative feature definitions, online and offline serving, point-in-time joins for training, governance. The hard part of this problem is train/serve skew prevention and point-in-time correctness: training joins must look up feature values as of the label timestamp, never the current value.

TL;DR

Declarative feature definitions (Feast/Tecton) → batch (Spark/Flink) writes to offline warehouse + reverse-ETL to online KV → streaming (Kafka → Flink/Beam) for real-time aggregations → on-demand transforms at request time. Training uses point-in-time joins (feature value as-of label timestamp) to prevent leakage. Same code path online and offline = no train/serve skew.

Architecture

Governance

Feature ownership, lineage, freshness SLAs, deprecation workflow, cost attribution.

Eval

Per-feature freshness SLA tracking, online/offline parity sampling (compute online, log, recompute offline, diff), feature-vector latency p99.

EXAMPLE — point-in-time join

Label: user U clicked ad A at timestamp T. The training feature "U's 7-day click count" must be the value at T, not at training time. Naive join: select the most recent feature row → leaks future clicks into the training feature → the model overfits a value it could never have at serving time. Correct join: pick the feature row with timestamp ≤ T (and ≥ T − staleness budget). Feast/Tecton implement this as a built-in primitive.
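
A minimal sketch of the as-of join using pandas `merge_asof`; Feast/Tecton do the same thing at warehouse scale. Column names and the staleness budget are illustrative assumptions (timestamps are datetime columns).

```python
import pandas as pd

def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame,
                       staleness: pd.Timedelta = pd.Timedelta(days=2)) -> pd.DataFrame:
    """For each (user_id, label_ts), attach the latest feature row with feature_ts <= label_ts."""
    labels = labels.sort_values("label_ts")
    features = features.sort_values("feature_ts")
    return pd.merge_asof(
        labels, features,
        left_on="label_ts", right_on="feature_ts",
        by="user_id",
        direction="backward",      # never look into the future
        tolerance=staleness,       # drop features older than the staleness budget
    )
```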

PITFALL — train/serve skew, freshness drift, duplicate features
Train/serve skew — same code path online and offline. If "user 7-day click count" is computed in Python at serving and Spark at training, you have skew. Single-definition feature stores (Feast/Tecton) exist for this.
Feature freshness drift — monitor per-feature staleness; alert when a streaming feature falls behind.
Duplicate features across teams — centralize discoverability; mandatory feature registration.
REMEMBER — what they're probing
  • Point-in-time correctness — name this unprompted.
  • Online + offline stores backed by single feature definition (no skew).
  • Streaming (Flink/Beam) for real-time aggregations is mandatory at scale.
  • Governance + lineage + freshness SLAs are the org-scale problem.
10
EXPERIMENTATION

Design A/B testing framework for ML

Build the company's experimentation platform: assignment, logging, metrics, statistical engine, holdouts, heterogeneous treatment effects. The hard part of this problem is SUTVA violations in marketplaces and social networks: two-sided platforms break the "treatment of one user doesn't affect others" assumption, requiring cluster randomization or switchback designs.

TL;DR

Deterministic hash assignment in mutually exclusive layers (Google's Overlapping Experiments Infrastructure). Hierarchical metrics (North Star + proxy + guardrails). Stat engine with Welch t-test + CUPED variance reduction + sequential testing for early-stopping safety + multi-comparison correction. HTE via causal forests / meta-learners. Switchback or cluster randomization for marketplace effects. Long-term holdouts for novelty decay.
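
A minimal CUPED sketch, assuming the pre-experiment covariate is the same metric measured before assignment (e.g., last-28-day watchtime). θ and the covariate mean are estimated on pooled data and applied identically to both arms.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray, theta: float, x_pre_mean: float) -> np.ndarray:
    """CUPED (Deng et al. 2013): y' = y - theta * (x_pre - E[x_pre])."""
    return y - theta * (x_pre - x_pre_mean)

# Estimate theta and E[x_pre] on pooled data from both arms, then adjust each arm with the same values:
# theta = np.cov(x_all, y_all)[0, 1] / np.var(x_all)
# y_t_adj = cuped_adjust(y_t, x_t, theta, x_all.mean())
# y_c_adj = cuped_adjust(y_c, x_c, theta, x_all.mean())
# Welch's t-test on the adjusted outcomes: same expected lift, materially lower variance.
```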

Components

Eval / monitoring

Sample ratio mismatch (SRM) detection; bucket-balance checks; metric stability (rolling baseline drift); experiment-coverage dashboard.
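
A minimal SRM check using a chi-square goodness-of-fit test against the configured split; the alert threshold (a very small p-value) is a common convention, assumed here rather than prescribed.

```python
from scipy.stats import chisquare

def srm_check(n_control: int, n_treatment: int, expected_split=(0.5, 0.5), alpha: float = 1e-3) -> bool:
    """Return True if observed assignment counts are consistent with the configured split."""
    total = n_control + n_treatment
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = chisquare([n_control, n_treatment], f_exp=expected)
    return p_value >= alpha   # False -> sample ratio mismatch: investigate before trusting the metrics

# srm_check(1_004_000, 996_000) -> False; a 0.4% imbalance at 2M users is a near-certain SRM.
```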

PITFALL — SUTVA, peeking, Simpson's, SRM
SUTVA violations (network effects in social products, supply-side competition in marketplaces) → cluster randomization or switchback.
Simpson's paradox: aggregate effect can flip sign vs subgroup effects.
Peeking without sequential correction: looking at p-values daily and stopping when significant inflates false-positive rate massively.
Metric definition drift: definition changes mid-experiment.
Logging bugs that look like wins: the most common false-positive source. Always check exposure parity (SRM).
REMEMBER — what they're probing
  • CUPED is variance reduction with a pre-experiment covariate — name it.
  • Mutually exclusive layers (OEI) — multiple experiments on same user.
  • SUTVA violations are the marketplace gotcha — switchback / cluster randomization.
  • Sequential testing or no early-stopping — peeking inflates FPR.
11
RECSYS · MULTI-MODAL

Design multi-modal recommendation system

Recommend items where the items are inherently multi-modal: short videos with audio + text + visual + author info. The hard part of this problem is modality fusion + missingness: not every item has every modality, and a model that assumes "always have video" breaks on items without one. Plus, video encoding is expensive — precompute and cache aggressively.

TL;DR

Per-modality pretrained encoders (CLIP/SigLIP for image-text, video transformer for clips) project into shared embedding space. Gated attention fuses modalities (gates handle missingness). User encoder over historical item embeddings. Two-tower retrieval with multimodal item tower. Cross-encoder ranker with multi-task heads. Precompute video encodings — they're expensive.
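
A minimal PyTorch sketch of the gated fusion described above, with missing-modality masks and per-modality dropout during training. Dimensions and the set of modalities are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse per-modality embeddings with learned gates; missing modalities are masked out."""
    def __init__(self, modalities=("video", "audio", "text"), dim: int = 256, p_drop: float = 0.2):
        super().__init__()
        self.modalities = modalities
        self.p_drop = p_drop
        self.gates = nn.ModuleDict({m: nn.Linear(dim, 1) for m in modalities})

    def forward(self, embs: dict[str, torch.Tensor], present: dict[str, torch.Tensor]) -> torch.Tensor:
        # embs[m]: [B, dim] per-modality embedding; present[m]: [B] float mask (1 if the item has modality m)
        gate_logits, stacked = [], []
        for m in self.modalities:
            mask = present[m]
            if self.training and self.p_drop > 0:                     # per-modality dropout: fights modality bias
                mask = mask * (torch.rand_like(mask) > self.p_drop).float()
            gate_logits.append(self.gates[m](embs[m]).squeeze(-1).masked_fill(mask == 0, -1e9))
            stacked.append(embs[m] * mask.unsqueeze(-1))
        weights = torch.softmax(torch.stack(gate_logits, dim=-1), dim=-1)        # [B, M]
        return (torch.stack(stacked, dim=1) * weights.unsqueeze(-1)).sum(dim=1)  # [B, dim] fused item embedding
```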

Architecture

Training infra

Pretrain modality encoders separately; align via contrastive on (user_history, clicked_item); fine-tune end-to-end on engagement.

Serving

Precompute item embeddings on ingest; cache aggressively. Video encoding latency is hostile — never compute on the request path.

Eval

Per-modality ablation: how much does each modality contribute? Cold-start eval on items with subset of modalities. Standard recsys metrics (recall, NDCG, online engagement).

PITFALL — modality dropout, expense, modality bias
Modality dropout / missingness: model gates must handle missing modalities gracefully. Train with random modality masking.
Expensive video encoding → precompute, cache; never on request path.
Modality bias: one modality (e.g., title text) dominates because it's always present and high-signal — model ignores video. Counter via per-modality dropout during training.
Cold-start — content-based via modality embeddings is the strength of this design (vs ID-only models).
REMEMBER — what they're probing
  • Pretrained per-modality encoders + projection heads — don't train end-to-end from scratch.
  • Gated fusion handles missing modalities; per-modality dropout in training prevents bias.
  • Precompute item embeddings; never encode video on request path.
  • Cold-start is the strength — multi-modal beats ID-only on new items.
12
PERCEPTION · SAFETY-CRITICAL

Design self-driving perception pipeline

360° real-time perception at 10–30 Hz, sub-50ms latency, multi-sensor (camera + lidar + radar), high recall on safety-critical objects, must generalize to long tail (mattress on highway, construction worker holding a sign). The hard part of this problem is long-tail rare objects + auto-labeling at PB scale — you can't human-label everything, you need an offline-large-model auto-labeler with human verification on edge cases.

TL;DR

Sensor fusion via BEV (Bird's-Eye View) — modern preference is mid-fusion (BEVFormer / Lift-Splat-Shoot lifts camera features into 3D, splats into BEV grid, transformer over BEV). Multi-task heads: 3D detection, segmentation, lane, traffic light, pose, intent. Tracking via Kalman + re-ID. Trajectory prediction with multimodal output. Auto-labeling pipeline + simulation + shadow mode. End-to-end (Wayve / Tesla FSD V12) is the alternative.

Requirements

Architecture

Training infra

PB-scale data; auto-labeling pipeline (large offline model labels, humans verify edge cases); simulation for rare scenarios; mining hard negatives from fleet.

Eval

Offline (mAP, IoU, ADE/FDE); closed-loop sim; shadow mode (parallel run, log disagreements); on-road (disengagement rate).
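
A minimal sketch of the ADE/FDE metrics named above, in the min-over-modes form typically used for multimodal trajectory prediction (minADE/minFDE). Shapes are assumptions.

```python
import numpy as np

def min_ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """pred: [K, T, 2] candidate trajectories, gt: [T, 2] ground truth.
    ADE = mean displacement over the horizon, FDE = displacement at the final step,
    each taken over the best of the K predicted modes, so diverse modes can cover multimodal futures."""
    dists = np.linalg.norm(pred - gt[None], axis=-1)   # [K, T] per-step Euclidean error
    ade_per_mode = dists.mean(axis=1)                  # [K]
    fde_per_mode = dists[:, -1]                        # [K]
    return float(ade_per_mode.min()), float(fde_per_mode.min())
```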

Monitoring

Per-class recall (especially safety-critical: pedestrian, cyclist), calibration drift between sensors, latency p99, disengagement rate per region/condition, sim-to-real gap.

PITFALL — long tail, calibration drift, prediction collapse
Long tail of rare objects (mattress on highway, fallen ladder): auto-labeling + active mining + sim augmentation.
Calibration drift between sensors: lidar-camera extrinsics shift with vibration; periodic recalibration is mandatory.
Adversarial conditions (rain, sun glare, dust): per-condition eval slices.
Prediction multimodality — collapse to mean is dangerous (predicting average of "turn left" and "turn right" = drive straight into oncoming traffic). Use multi-modal trajectory output (MTR-style).
OTA model rollout with canary fleets — never big-bang deploy.
REMEMBER — what they're probing
  • BEV fusion is the modern default — name BEVFormer / LSS.
  • Auto-labeling pipeline + sim + shadow mode is the data flywheel.
  • Trajectory prediction must be multimodal — never collapse to mean.
  • Long tail is the hard problem — active mining is the answer.
99
CLOSING TACTICS

How to actually run a 45-min ML design loop

TL;DR

Phase the time explicitly. Talk less in clarification, more in deep-dive. Always cover eval AND monitoring AND gotchas. End with "what would I do with more time" — it shows scope-discipline. Never claim a single technology solves everything.

The phase-by-phase tactic sheet

  1. 0–5 min: clarify requirements, scale, what to optimize, what's allowed. Confirm with interviewer. Write the constraints on the board.
  2. 5–10 min: high-level architecture (boxes + arrows). State the funnel/stack.
  3. 10–25 min: deep-dive 1–2 components the interviewer signals interest in. Discuss alternatives. This is where 50% of your signal lives.
  4. 25–35 min: training infra + serving infra + eval (online + offline).
  5. 35–45 min: gotchas, monitoring, what would you do differently with more time.

The "if I had more time" close

End every loop with 2–3 things you'd explore further: alternative architectures, deeper eval, harder gotchas. This signals scope-discipline (you knew what you skipped) and intellectual honesty (you don't claim to have solved everything in 45 min).

PITFALL — the four canonical loop-killers
Don't immediately draw "DLRM + MMoE" — you don't yet know what problem you're actually solving. Don't ignore evaluation. Don't skip monitoring. Don't claim you'd use Spanner for everything.
REMEMBER
  • Phase the time explicitly — say "I'll spend 5 min clarifying, then 5 sketching, then 15 deep-diving the ranker."
  • Eval + monitoring + gotchas are 30% of the score; don't run out of time.
  • "If I had more time" is the scope-discipline signal.
  • Trade-offs over absolutes: "I'd pick X because Y, even though Z."

ML system design quiz — readiness check

  1. Walk through YouTube's 2-stage funnel.

    Retrieval (~1B → 1k): two-tower + ANN, plus other sources (collab, fresh, trending). Ranking (~1k → 100): cross-encoder MMoE with multi-task heads (pCTR, pWatchtime, pLike, pDislike). Final scoring + diversity (MMR/DPP) + business rules → top 10.

  2. How would you handle cold start for new users / new items?

    New user: content-based recs from session signals (initial interactions, demographics, source); explore via bandits. New item: content-based features (text/image embeddings) drive retrieval; explore via mandatory exploration slots; meta-learning if you have many items per category.

  3. Your CTR model has high offline AUC but loses in A/B test. Diagnose.

    Suspect (1) selection bias / exposure bias — the training log only contains items the old model showed; (2) train/serve skew — different feature pipeline; (3) miscalibration — AUC measures ranking, not calibrated probability; (4) shift in business mix; (5) nonstationarity — the model fits historical patterns that no longer hold.

  4. Why does an MMoE outperform a shared-bottom multi-task model?

    Shared-bottom forces all tasks through one trunk → negative transfer when tasks conflict. MMoE has multiple experts shared across tasks; per-task gates softly mix experts. Each task can pick its preferred expert combination → less interference.

  5. How do you debias position effects in a learned ranker?

    (1) PAL / position-aware learning: feed position as a feature during training; set to "no position" at inference. (2) Two-tower: shallow tower predicts position effect; main tower predicts relevance. (3) Click models (PBM, Cascade, DBN). (4) Result randomization for unbiased data.

  6. In-batch negatives vs explicit hard negatives — tradeoff?

    In-batch: cheap, popularity-biased (high-frequency items appear more as negatives). Hard negatives (top non-positive from ANN): faster learning per example, risk of false negatives (true positives that look similar). Mixed Negative Sampling (MNS): combine in-batch + uniform-random; logQ correction for popularity debiasing.

  7. HNSW vs IVF-PQ — when each?

    HNSW: best recall/latency in memory; high memory (no compression). Use for < 100M vectors with quality SLA. IVF-PQ: compressed (32×–128×); scales to billions; recall lower. Use for billion+ vectors at modest QPS. ScaNN/DiskANN intermediate.

  8. Your retrieval layer returns the same 1000 items for everyone. What's wrong?

    Likely: (1) user tower collapsed (output ≈ constant — check embedding norm and variance); (2) lack of personalization features in user tower; (3) popularity-biased sampled-softmax without logQ correction → all items pushed toward popular ones; (4) ANN index returning popular items only (debiasing lost in serving).

  9. How do you calibrate a multi-task ranker where each task has different positive rates?

    Per-task isotonic regression on a held-out window. For each task head's logit, fit monotonic mapping logit → calibrated prob using actual positive rate per quantile. Recalibrate frequently (weekly) under distribution shift. Joint calibration if tasks are correlated.

  10. Migrate DLRM-style ranker to generative recommender (TIGER/HSTU) — risks?

    (1) Loss of explicit business signal control (multi-task heads gone). (2) Auto-regressive serving latency. (3) Cold-start changes (semantic IDs from RQ-VAE replace ID embeddings). (4) Training-time interactions with online behavior (counterfactual eval harder). (5) Capacity for personalization in generative model still being scaled out. Run as A/B with strict guardrails.

  11. Design a RAG system at billion-doc scale.

    Chunking (semantic, ~500-1000 tok with overlap) → embedding (SigLIP/E5/Cohere) → vector DB (sharded by tenant, IVF-PQ within shard) → hybrid retrieval (BM25 + dense, RRF) → cross-encoder reranker → LLM generation with citations. Eval: recall@k, RAGAS faithfulness, citation accuracy.

  12. Design a feature store with point-in-time correctness.

    Online (sub-ms KV) + offline (warehouse with full history). Streaming compute (Flink/Beam) → online + offline. For training, point-in-time joins: for each (entity, label timestamp), look up feature value as-of the label time. Prevents leakage of future info into training features.

  13. What's CUPED?

    Variance reduction for A/B tests (Deng 2013). Adjust the outcome by a pre-experiment covariate: Y' = Y − θ (X − E[X]) where X is correlated with Y. Reduces variance → smaller sample size for same power. Used at every major web company.

  14. How do you avoid SUTVA violations in marketplace experiments?

    SUTVA = stable unit treatment value assumption — treatment of one unit doesn't affect others. Violated by network effects (social) or supply-side competition (DoorDash, Uber). Solutions: cluster randomization (region-level), switchback (alternate treatment over time), counterfactual interleaving for ranking.

  15. Design ChatGPT serving for 200M DAU.

    Disaggregated prefill + decode pools. Prefix caching (system prompts shared). Continuous batching. TP=8 within node for 70B+. Speculative decoding. FP8 weights. Model routing (cheap vs reasoning). Multi-tier autoscale. Safety filtering pipeline pre/post inference. SLO-aware scheduling.