ML system design — 12 worked problems
Sr Staff designs are not "draw a DLRM". They are: clarify scope, defend trade-offs, name second-order effects, and survive the deep-dive. This page is twelve problems worked end-to-end in the same disciplined structure — plus the meta-framework that ties them together.
What you'll learn
- The 45-min interview framework
- Design YouTube recommendations
- Design Twitter / X feed ranking
- Design ad CTR prediction
- Design personalized search ranking
- Design ChatGPT serving at scale
- Design fine-tuning + Constitutional AI loop
- Design RAG at billion-doc scale
- Design vector search infrastructure
- Design ML feature store with point-in-time correctness
- Design A/B testing framework for ML
- Design multi-modal recommendation system
- Design self-driving perception pipeline
- How to actually run a 45-min ML design loop
The Sr Staff bar is not "drawing the right diagram"; it's making sensible cost/quality/latency trade-offs and naming second-order effects. Every loop has the same shape: clarify the problem, estimate capacity, sketch the architecture, deep-dive a component, name the gotchas. The candidates who skip clarification and jump to "two-tower + MMoE + DLRM" lose the loop in the first 5 minutes.
Five phases, every time
- Clarify (0–5 min) — scale, latency budget, what to optimize, what's allowed, who the user is, what existing systems exist. Write the constraints down. Confirm with interviewer.
- Capacity (5–10 min) — back-of-envelope: QPS, storage, training data volume, memory. Pick the constraint that will dominate.
- API + architecture (10–20 min) — boxes and arrows. State the funnel/stack. Two-stage retrieval+ranker, prefill+decode, batch+streaming feature store, etc.
- Deep-dive (20–35 min) — interviewer signals what they want pulled apart. Discuss alternatives. Defend choices with first principles. This is the bulk of the signal.
- Eval + monitoring + gotchas (35–45 min) — offline metrics, online metrics, guardrails, what would you do differently with more time.
The signals interviewers grade on
- Did you clarify before architecting? Skipping this is the #1 way to lose Sr Staff loops.
- Did you name trade-offs explicitly? "I'd use HNSW because we're under 100M vectors and want highest recall at the cost of memory."
- Did you cover eval AND monitoring AND gotchas? Junior candidates stop at architecture.
- Did you name second-order effects? Position bias, feedback loops, train/serve skew, calibration drift, network effects in marketplace experiments.
- Did you stay scope-disciplined? When the interviewer asks "deep-dive on the ranker," you stop talking about retrieval.
- Clarify → Capacity → Architecture → Deep-dive → Eval/monitoring/gotchas. Always.
- Name trade-offs explicitly. "I'd pick X because Y, even though Z."
- Cover eval, monitoring, and gotchas — they're 30% of the score.
- Second-order effects (bias, drift, feedback loops) are the Sr Staff differentiator.
Recommend videos to 2B users from a corpus of billions, with sub-100ms latency, optimizing for watchtime AND satisfaction (likes, subscribes, no-dislike) — without collapsing into a clickbait filter bubble. The hard part of this problem is the multi-objective optimization: pCTR alone produces clickbait; pWatchtime alone produces engagement-bait; you need a multi-task ranker with the right weights, and you need diversity / exploration baked in.
Two-stage funnel: retrieval (1B → 1k) merges several sources (two-tower + collab + fresh + trending); ranking (1k → 100) is a multi-task cross-encoder (DLRM-style or transformer) with MMoE shared experts; final scoring is a tuned weighted sum across pCTR/pWatchtime/pLike/pDislike with diversity (MMR/DPP) and exploration slots. Streaming training for freshness, shared embedding tables sharded model-parallel.
Requirements
- 2B users, 500h/min uploaded
- Latency < 100ms p99
- Optimize watchtime + satisfaction (likes, subscribes, low dislike rate)
- Avoid filter-bubble + clickbait dynamics
Data
- Implicit: watch, watch_fraction, skip
- Explicit: like, dislike, subscribe, share
- Context: time, device, locale
- Content: channel, topic, language, age
- User: history, social graph
Architecture — classic two-stage funnel
- Candidate generation (retrieval): multiple sources merged.
- Two-tower: user tower (ID embed + recent history sequence + demographics) and item tower (video features), trained with sampled-softmax / in-batch negatives (sketch after this list). ANN (HNSW or ScaNN) over O(100M) videos → top ~1k.
- Co-watch / collab: classical retrievers.
- Subscriptions / fresh: from followed channels.
- Trending / locale popular.
- Ranking: cross-encoder (DLRM-style or transformer). Inputs: user features, candidate features, cross features, sequence of recent interactions (DIN/SASRec attention over user history vs candidate). Multi-task heads: pCTR, pWatchtime, pLike, pComment, pDislike. MMoE for shared+task-specific experts.
- Final scoring: weighted sum across heads (weights tuned via online experiments / Pareto), with diversity / recency / freshness boost. MMR or DPP for diversity.
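A minimal PyTorch sketch of the two-tower retrieval objective with in-batch negatives. Tower widths, the temperature, and the plain-MLP towers are illustrative assumptions — production towers consume ID embeddings and history sequences:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Minimal two-tower retriever: user and item MLPs into a shared space."""
    def __init__(self, user_dim, item_dim, embed_dim=256):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(user_dim, 512), nn.ReLU(),
                                        nn.Linear(512, embed_dim))
        self.item_tower = nn.Sequential(nn.Linear(item_dim, 512), nn.ReLU(),
                                        nn.Linear(512, embed_dim))

    def forward(self, user_feats, item_feats):
        u = F.normalize(self.user_tower(user_feats), dim=-1)
        v = F.normalize(self.item_tower(item_feats), dim=-1)
        return u, v

def in_batch_softmax_loss(u, v, log_q=None, temperature=0.05):
    """In-batch sampled softmax: each row's positive is the diagonal;
    every other item in the batch serves as a negative. log_q (item
    sampling log-probabilities) applies the logQ correction that
    debiases popular items appearing often as negatives."""
    logits = u @ v.T / temperature            # [B, B] similarity matrix
    if log_q is not None:
        logits = logits - log_q.unsqueeze(0)  # subtract log-popularity
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)
```

At serving time the item tower's outputs are precomputed into the ANN index; only the user tower runs per request.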
Training infra
Massive embedding tables (hundreds of GB) — sharded model parallel (TorchRec). Streaming training (Kafka → Flink → trainer) for freshness. Daily full retrain + hourly incremental.
Serving
Retrieval served from in-memory ANN index (replicated). Ranker on GPU/CPU farm with feature store fan-in. Online (Redis-like, sub-ms) + offline (warehouse for training).
Eval
Offline AUC, NDCG, recall@k, calibration. Online: A/B tests with primary metric watchtime, guardrails on dislikes/abandonment/diversity.
Monitoring
Per-head calibration drift, retrieval recall slipping, feature freshness, p99 latency, fraction of recommendations that come from each retrieval source (don't let one collapse), exploration slot fraction.
2B users × 10 sessions/day × 50 candidates ranked = ~1 trillion ranker evaluations/day. At 1ms per ranking on GPU, that's ~12k GPU-equivalents continuously. In practice you batch and run on heterogeneous CPU+GPU pools. Embedding tables: 100M videos × 256-dim BF16 = ~50 GB just for items; user side similar order. Total ~hundreds of GB → must be model-parallel sharded (TorchRec).
Feedback loops: model only sees feedback on items it showed, so it can't unlearn its own biases. Mandatory exploration slots (5–10%) are non-negotiable.
Position bias: position 1 always gets clicked more. Estimate the position effect with a shallow position-only tower and subtract; or use position-aware loss (PAL).
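A sketch of the position-as-feature approach (the original PAL factorizes p(click) = p(seen | position) × p(click | seen); this additive-logit version is the simpler cousin — layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class PALRanker(nn.Module):
    """Train on click = relevance + position effect; serve with the
    position branch switched off so ranking reflects relevance only."""
    def __init__(self, feat_dim, n_positions=50):
        super().__init__()
        self.relevance = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                       nn.Linear(256, 1))
        self.position_bias = nn.Embedding(n_positions, 1)  # shallow position-only tower

    def forward(self, feats, position=None):
        logit = self.relevance(feats).squeeze(-1)
        if position is not None:                 # training: add position effect
            logit = logit + self.position_bias(position).squeeze(-1)
        return logit                             # inference: relevance-only logit
```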
Train/serve skew: same feature definitions across pipelines, point-in-time correctness during training joins.
- Two-stage funnel discipline: retrieval is recall-first, ranking is precision-first.
- Multi-task heads + MMoE — never single-objective on a recommender.
- Calibration matters when scores feed downstream (auctions, weighted blending).
- Feedback loops are the killer: exploration is required, not optional.
Like YouTube, but with severe recency, real-time engagement signals, and heterogeneous content (text/image/video/links). The hard part of this problem is real-time freshness: a tweet from 2 minutes ago with 500 likes/min is more valuable than a tweet from 2 days ago with 10k cumulative likes — your ranker needs to consume real-time engagement counters as features without breaking train/serve consistency.
Source mixer (in-network + out-of-network + promoted + lists) feeds a heavy multi-task ranker (MaskNet / DCN-v2) with real-time engagement counts and recency decay; heuristic re-rank handles diversity and sensitive-content filtering. Negative-feedback signals (mute/block) are weighted heavily; SimClusters provide interest representations.
Requirements
- Real-time, sub-200ms p99
- Billions of tweets, heterogeneous signals (text, image, video, links)
- Strong recency requirement — minutes-old content matters
Architecture
- Source mixer: in-network (people you follow), out-of-network (recommended), promoted, communities/lists. Each contributes candidates.
- Heavy ranker: MaskNet / DCN-v2 with multi-task heads (p_like, p_RT, p_reply, p_dwell, p_negative_feedback). Inputs: user-author affinity, real-time engagement counts, embedding similarity, recency decay.
- Heuristic re-rank: diversity (cap per author), recency boost, sensitive content filter.
Training infra
Hours-fresh data; daily retrains. Embeddings for user/author/topic. SimClusters (Twitter's clustering) provides interest representations.
Serving
Real-time engagement counters via Kafka + sliding window aggregator. Per-tweet feature lookup at request time. Hot tweet cache.
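A single-process sketch of the sliding-window counter semantics that the Kafka + Flink aggregation provides at scale (window length and class names are assumptions):

```python
import time
from collections import defaultdict, deque

class SlidingCounter:
    """Per-key event count over a trailing window, e.g. likes in the
    last 5 minutes, consumed as a ranker feature."""
    def __init__(self, window_sec=300):
        self.window = window_sec
        self.events = defaultdict(deque)   # key -> timestamps within window

    def record(self, key, ts=None):
        self.events[key].append(ts if ts is not None else time.time())

    def count(self, key, now=None):
        now = now if now is not None else time.time()
        q = self.events[key]
        while q and q[0] < now - self.window:   # evict expired timestamps
            q.popleft()
        return len(q)
```

The train/serve consistency requirement means the training pipeline must reconstruct exactly this windowed value as of each label timestamp.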
Eval
Offline: AUC per head, calibration. Online: dwell time, follow rate, negative feedback rate, retention.
Bots / spam need an upstream filter; they pollute training data and can game engagement counts.
Trending events create positive-class shifts mid-day; a model trained yesterday underweights them. Streaming features and frequent retrains both matter.
- Real-time engagement features as ranker inputs — no train/serve skew.
- Negative feedback is a signal, not noise — model it explicitly.
- Source mixer pattern: candidates from many funnels, blended in the ranker.
- Recency decay must be a feature, not just a heuristic post-rank.
Predict click probability for ad-impression pairs at millions of QPS, with predicted probability used directly in second-price auctions and pacing systems. The hard part of this problem is calibration: an AUC-optimal ranker that's miscalibrated will mis-bid on every impression. CTR is also extremely high cardinality (billions of users × millions of ads × thousands of slots) and very imbalanced (~0.1–1% positive rate).
DLRM-style stack with massive embedding tables for sparse IDs, DCN-v2 cross terms, DIN/DIEN sequence attention. Streaming online learning + daily retrains. Calibration via isotonic / Platt on a held-out window — recalibrate frequently, especially across regime shifts. Watch for delayed conversions and selection bias.
Requirements
- O(ms) latency at QPS scale of millions
- Very high cardinality features
- Good calibration (predicted CTR feeds auctions and pacing)
Data
Impressions and clicks (positive class very rare, ~0.1–1%), context (page, slot, time), ad creative, advertiser, user features.
Model — classic stack
- Feature embedding tables for IDs (user, ad, publisher). Hash-bucket for OOV.
- DLRM (Naumov 2019, arxiv 1906.00091): bottom MLP on dense, embedding lookup for sparse, pairwise dot product across all sparse pairs → top MLP → sigmoid.
- DCN-v2 (Wang 2020, arxiv 2008.13535): explicit cross terms via low-rank cross network.
- Sequence: DIN/DIEN attention over user's recent ad interactions.
Training infra
Streaming online + daily retrains. Massive embedding tables (TBs) → row-wise parallel.
Serving
Embedding-table lookup service (sharded by hash), feature fetch, ranker eval, calibration, score back to auction.
Eval
AUC, log-loss, normalized cross-entropy (NCE), calibration plots. Online: revenue, eCPM, advertiser ROI.
Calibration — critical for auctions
Apply isotonic regression or Platt scaling on a held-out window; recalibrate frequently. AUC measures ranking, not whether predicted 0.05 actually means 5% click rate. In a second-price auction, miscalibration directly leaks revenue.
Ads A and B are predicted at pCTR 0.10 and 0.05. AUC says "rank A higher" — correct. But the auction ranks by bid × pCTR. If the true CTRs are 0.02 and 0.01, both predictions are inflated by the same 5× factor, relative eCPMs are preserved, and the auction still picks the right winner. If instead A's true CTR is 0.02 and B's is 0.005, A is inflated 5× while B is inflated 10×: the predicted eCPM ordering can now disagree with the true one, the winner changes, and revenue degrades. AUC is invariant to monotone rescaling; auctions are not.
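A minimal calibration sketch with scikit-learn's IsotonicRegression — fit on a recent held-out window, apply before the auction. The synthetic data just simulates a uniformly 2×-miscalibrated model:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
preds = rng.uniform(0.0, 0.2, 100_000)    # model pCTR on held-out impressions
clicks = rng.binomial(1, preds * 0.5)     # true CTR is half the prediction

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(preds, clicks)                    # monotone map: raw score -> calibrated prob

def calibrated_ctr(raw_scores):
    return iso.predict(raw_scores)        # this, not the raw score, feeds the auction

print(preds.mean(), calibrated_ctr(preds).mean(), clicks.mean())  # ~0.10, ~0.05, ~0.05
```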
Selection bias: training log only contains impressions the model showed → IPS or doubly-robust correction needed.
Cold start: new ads have no history → content-based fallback (creative-text embedding).
Distribution shift at sales/holidays: model trained pre-Black-Friday over-bids on Black Friday. Frequent retrains and regime detection.
- Calibration ≠ AUC — name this distinction unprompted.
- Embedding-table sharding is the systems story (TBs of state).
- Delayed feedback + selection bias + cold start — the three classic CTR gotchas.
- Streaming online + daily retrains: calibration drifts, you re-fit.
Search ranking with personalization, but careful — over-personalizing on ambiguous queries ("Apple") gives the wrong intent. The hard part of this problem is unbiased eval: position bias is enormous in search (top result gets 30%+ of all clicks regardless of relevance), so you need click models or randomized exploration to even measure quality.
Hybrid retrieval (BM25 + dense two-tower, possibly ColBERT late-interaction) → cheap L1 ranker → cross-encoder L2 ranker with listwise loss. Personalization via user embedding (long-term + session). Don't over-personalize ambiguous queries. Eval with NDCG@k offline, click models for unbiased online eval, plus head/tail query slices.
Pipeline
- Query understanding: spelling correction, entity linking, intent classification, query embedding (dense retrieval), expansion.
- Retrieval: hybrid — BM25 (lexical, inverted index) + dense (two-tower with query/doc encoders, ANN over corpus). Late-interaction ColBERT for higher recall.
- L1 ranker (cheap, ~1k candidates): linear or shallow GBDT.
- L2 ranker (cross-encoder, ~100): BERT/transformer cross-encoder scoring (query, doc) pair. Listwise loss (LambdaMART / ListNet).
- Personalization: user embedding (long-term + session). Don't over-personalize for ambiguous queries.
Eval
NDCG@k, MRR, click models for unbiased eval (Cascade, PBM, DBN). Online: search success, session abandonment.
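For reference, a minimal NDCG@k in the standard exponential-gain form (graded relevance labels, log2 position discount):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """relevances: graded labels of the results, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    dcg = np.sum((2.0 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))  # graded labels for one query
```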
Monitoring
Per-segment NDCG (head queries vs tail queries vs navigational vs informational), latency p99, freshness for news queries, click-through patterns by position.
Head/tail queries — separate eval slices. A model that wins on head can lose 5% on tail and overall metrics will hide it.
Freshness for news vs stable for evergreen — query-class-conditional ranking weights.
Over-personalization on ambiguous queries — "Apple" should sometimes show fruit; over-fitting to a tech user's history breaks intent diversity.
- Hybrid (BM25 + dense) is universal — pure dense underperforms on lexical queries.
- Cross-encoder L2 ranker; listwise loss (LambdaMART). Pointwise loses signal.
- Click models for unbiased eval — name them (PBM, Cascade, DBN).
- Per-segment eval — head/tail, navigational/informational, fresh/stable.
Serve 200M+ DAU at sub-2s TTFT and sub-50ms inter-token latency, across short and long prompts, with multi-tier SLAs and tight GPU economics. The hard part of this problem is head-of-line blocking: a single 100k-token prefill request can block dozens of short requests in a naive scheduler. The solution is disaggregated prefill+decode pools, prefix caching, continuous batching, and SLO-aware scheduling.
Disaggregated prefill + decode pools (KV transferred over RDMA). PagedAttention prefix cache keyed on prompt prefix. Continuous batching with per-iteration scheduling. TP=8 within node for 70B+; speculative decoding (EAGLE); FP8 weights+activations. Model router selects mini/full/reasoning model by query complexity. Priority queues separate long-context from short.
Requirements
- 200M+ DAU
- p99 TTFT < 2s, ITL < 50ms
- Mix of short and long prompts
- Multi-tenant SLA tiers (free / plus / pro / API)
- GPU efficiency (cost is the constraint)
Architecture
- Global LB → region → cluster.
- Disaggregated serving: prefill pool + decode pool. KV transferred over RDMA.
- Prefix cache: hash-table keyed on prompt prefix tokens. Block-aligned (PagedAttention); sketch after this list.
- Continuous batching: per-iteration scheduling.
- Tensor parallel within node (TP=8 for 70B+); pipeline parallel only for very large models.
- Speculative decoding with EAGLE-style head.
- Quantization: FP8 weights+activations; INT4 weights for cost-sensitive variants.
- Routing: model router selects mini/full/reasoning model based on query complexity.
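A sketch of block-aligned prefix-cache keying in the vLLM style — each block's hash chains over the whole prefix, so a hit guarantees every earlier token matches (block size and hashing details are assumptions):

```python
import hashlib

BLOCK = 16  # tokens per KV block

def block_hashes(token_ids):
    """Chained hashes over full blocks: block i's key depends on all
    tokens up to its end."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        prev = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK]).encode()).digest()
        hashes.append(prev)
    return hashes

def longest_cached_prefix(token_ids, cache):
    """Number of prompt tokens whose KV can be reused from the cache."""
    hit = 0
    for n, h in enumerate(block_hashes(token_ids), start=1):
        if h not in cache:
            break
        hit = n * BLOCK
    return hit

cache = set(block_hashes(list(range(64))))                        # a prior prompt
print(longest_cached_prefix(list(range(64)) + [99, 100], cache))  # 64
```

Shared system prompts make hit rates high: the first blocks of most requests are identical, so their prefill is skipped entirely.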
Eval
Offline (held-out conversations, MMLU/HumanEval/MATH/AgentBench), online (thumbs feedback, conversation length, retry rate, A/B on model versions).
Monitoring
TTFT/ITL distributions, KV cache hit rate, GPU utilization, queue depths, OOM rate, refusal rates, output toxicity.
Prefill is compute-bound (one big matmul per layer over the prompt). Decode is memory-bound (one matmul per token, dominated by KV cache reads). Mixing them in the same pool means decode steps wait behind prefill steps, ITL spikes. Disaggregating: prefill pool runs hot on compute (FP8 saturates tensor cores); decode pool runs hot on memory bandwidth. KV cache moves once over RDMA. Result: stable ITL even when prefill load fluctuates.
Noisy neighbors on shared pools.
Autoscaling lag: GPU spin-up takes minutes — you can't scale reactively. Predictive scaling on traffic forecasts.
Safety filtering pipeline adds latency: pre-filter on input, post-filter on output. Both add hundreds of ms if naive; needs streaming + parallel checks.
Streaming TTFT vs full-completion latency are different SLOs: optimize them separately.
- Disaggregated prefill+decode is the 2025 default.
- Prefix caching + continuous batching + speculative decoding stack is non-negotiable.
- Model router (mini/full/reasoning) is how you make economics work.
- Head-of-line blocking on long context is THE LLM serving gotcha.
Build the production pipeline that takes raw user prompts (and red-team prompts) and produces a fine-tuned, aligned, eval-gated, canary-rolled model. The hard part of this problem is the safety-helpfulness trade-off and the regression-gate engineering: a tiny safety regression on one slice should block the rollout, even if helpfulness improves overall.
Pipeline: data ingest (deduped, PII-filtered) → generation pool (K candidates per prompt, self-critique with constitutional principles, revisions) → preference labeling (AI judge + human gold subset) → SFT on revisions → DPO/KTO/PPO on preferences → eval (capabilities + safety + IF + winrate) → regression gate (block if any safety/capability metric regresses > ε) → canary rollout (1% → 10% → 100%) with online metric monitoring.
Pipeline
- Data ingest: raw prompts (real user, red team, synthetic). Deduped, PII-filtered.
- Generation pool: base model generates K candidates per prompt; self-critique with constitutional principles; revisions.
- Preference labeling: AI judge (stronger model) rates pairs against constitution. Sample subset for human review (gold set).
- Training: SFT on revisions; then DPO/KTO/PPO on preferences. Multiple iterations.
- Evaluation: capabilities (MMLU, MATH, HumanEval), safety (HarmBench, XSTest, jailbreak suite), instruction following (IFEval), preference winrate vs prior model.
- Regression gate: blocked from rollout if regression on N safety / capability metrics > ε (sketch after this list).
- Canary rollout: 1% → 10% → 100% with online metrics monitored.
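A sketch of the gate logic (metric names and epsilons are assumptions; the point is the asymmetry — any single safety regression beyond ε blocks, no matter how much helpfulness improved):

```python
# Lower is better for attack-success and over-refusal rates; higher for the rest.
LOWER_IS_BETTER = {"harmbench_asr", "xstest_over_refusal"}
EPS = {"harmbench_asr": 0.005, "xstest_over_refusal": 0.01,
       "mmlu": 0.003, "humaneval": 0.01, "ifeval": 0.005}

def gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    failures = []
    for metric, eps in EPS.items():
        delta = candidate[metric] - baseline[metric]
        if metric in LOWER_IS_BETTER:
            delta = -delta                        # flip so positive = improvement
        if delta < -eps:
            failures.append(f"{metric}: regressed {-delta:.4f} > eps {eps}")
    return len(failures) == 0, failures

baseline  = {"harmbench_asr": 0.04, "xstest_over_refusal": 0.08,
             "mmlu": 0.780, "humaneval": 0.71, "ifeval": 0.82}
candidate = {"harmbench_asr": 0.03, "xstest_over_refusal": 0.10,
             "mmlu": 0.785, "humaneval": 0.72, "ifeval": 0.82}
print(gate(candidate, baseline))  # blocked: over-refusal regressed 0.02 > 0.01
```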
Eval
Capabilities (MMLU, MATH, HumanEval, MMLU-Pro, GPQA), safety (HarmBench attack success, XSTest over-refusal, JailbreakBench), instruction following (IFEval), preference winrate vs prior model (Arena-Hard, AlpacaEval 2).
Monitoring
Online thumbs feedback, refusal rate, output toxicity rate, conversation length, retry rate, jailbreak attempt rate. Per-cohort breakdowns.
Mode collapse: DPO can over-decrease likelihood of both chosen and rejected — only the gap matters. Mitigations: stronger β, mix in SFT loss, IPO variant.
Distribution shift between SFT data and RL prompts → off-policy issues during PPO.
Safety-helpfulness trade-off: making the model refuse more reduces some risks but increases over-refusal on benign queries (XSTest).
Eval contamination: pretraining data leaked into eval → use contamination-resistant evals (GPQA-Diamond, LiveCodeBench, FrontierMath).
- Regression gate is the engineering discipline that makes alignment safe to ship.
- AI judge + human gold subset is the cheap, high-quality preference label recipe.
- Safety vs helpfulness is a Pareto curve — name both axes.
- Canary rollout (1% → 10% → 100%) with online metrics — no big-bang deploys.
Build a retrieval-augmented generation system over billions of documents that produces grounded, cited answers. The hard part of this problem is chunk-boundary loss and "lost in the middle": naive chunking destroys context across boundaries; long stuffed contexts cause LLMs to underuse mid-context information.
Ingest with semantic-aware chunking + overlap → embed (Matryoshka-style for cheap recall + full-dim precision) → vector DB (sharded by tenant, IVF-PQ within shard) → hybrid retrieval (BM25 + dense, RRF fusion) → cross-encoder reranker (top-100 → top-10) → augmentation with citation markers → LLM generation with grounded-answer requirements. Optionally HyDE / multi-query / multi-hop.
Pipeline
- Ingest: chunking (recursive, sentence-aware, ~500–1000 tokens with overlap; or semantic chunking via embedding similarity).
- Embedding: dense (E5, BGE, OpenAI text-embedding-3, Cohere embed v4) with possible Matryoshka representation (truncate dim for cheap recall, full dim for precision).
- Storage: vector DB (Pinecone, Weaviate, Qdrant) or built-in (FAISS sharded, ScaNN).
- Sharding: by tenant, then by embedding cluster (IVF). Replicas for QPS.
- Retrieval: hybrid (BM25 + dense with RRF or learned weighting — RRF sketch after this list), large k (~100). Optional ColBERT-style late interaction.
- Reranking: cross-encoder (bge-reranker, cohere-rerank) on top-100 → top-10.
- Augmentation: stitch retrieved chunks into prompt with source markers. Truncate to context budget.
- Generation: LLM with citation requirements (grounded).
- Optional: query rewriting (multiple sub-queries via LLM), chain-of-thought retrieval (multi-hop), HyDE (generate hypothetical answer, embed, retrieve).
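Reciprocal Rank Fusion is small enough to show whole — a minimal sketch, with toy doc IDs:

```python
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists (one per retriever) by reciprocal rank.
    k=60 is the conventional constant; it damps the head of each ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["d3", "d1", "d7", "d2"]      # lexical top-k
dense = ["d1", "d9", "d3", "d5"]      # ANN top-k
print(rrf([bm25, dense])[:3])         # fused list -> cross-encoder reranks
```

Because RRF only uses ranks, it sidesteps the problem of BM25 and cosine scores living on incomparable scales.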
Eval
- Retrieval: recall@k, MRR vs labeled relevance.
- End-to-end: faithfulness (RAGAS, TruLens), answer correctness, citation accuracy.
- Hallucination rate: NLI-based check.
Monitoring
Retrieval recall@k drift, citation accuracy, hallucination rate, latency per stage (embed, retrieve, rerank, generate), index staleness per tenant, embedding-model version pinning.
Embedding drift when re-embedding with new model — need full reindex; never mix versions.
Lost in the middle (Liu 2023) — LLMs underuse middle of long context. Mitigations: re-rank top-k aggressively, put best snippet first or last, truncate if context is too long.
Stale data — TTL + incremental reindexing.
Hallucinated citations — even with citation requirements, models invent source IDs. Validate citations against retrieved set programmatically.
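The programmatic check is cheap — a sketch, assuming a [doc_N] citation-marker format (the format itself is an assumption):

```python
import re

def invalid_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return citation markers that don't match any retrieved chunk."""
    cited = re.findall(r"\[(doc_\d+)\]", answer)
    return [c for c in cited if c not in retrieved_ids]

bad = invalid_citations("...as shown in [doc_12] and [doc_99].",
                        {"doc_12", "doc_40"})
print(bad)  # ['doc_99'] -> reject or regenerate the answer
```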
- Hybrid (BM25 + dense, RRF fusion) — never pure dense at billion-doc scale.
- Cross-encoder reranker is what makes top-10 actually correct.
- Chunking strategy matters more than people think — semantic + overlap.
- Faithfulness eval (RAGAS) + citation validation are the production safety nets.
Build a billion-vector ANN system that handles inserts, deletes, attribute filters, and consistent QPS. The hard part of this problem is the filter problem: vectors are easy to find, but combining ANN with attribute filters (tenant, time range, category) breaks the obvious algorithms in non-obvious ways.
HNSW for <100M (in-memory, best recall/latency); IVF-PQ for billion+ (compressed); ScaNN/DiskANN intermediate. Sharding by tenant then by clustering; replicas for QPS. Filters need inline traversal (attribute-aware HNSW) — pre/post-filter both have failure modes. Insert is easy, delete is hard (tombstones + compaction). Re-indexing is mandatory on embedding-model changes.
Algorithms
- Flat (brute force): exact, O(N·d). Good for <1M.
- IVF: cluster with k-means into C centroids. At query, search nearest probe centroids, scan their lists.
- PQ: split d-dim vector into M subvectors, quantize each with k=256 (1 byte each). Compressed asymmetric distance via lookup tables. IVF-PQ combines both.
- HNSW: hierarchical small-world graph. Multi-layer; greedy descent with ef-search controlling recall. Excellent for in-memory.
- ScaNN (Google): IVF + anisotropic quantization. Strong recall/latency.
- DiskANN (Microsoft): SSD-resident graph; ~2× slower than in-memory HNSW but 10× cheaper.
Memory: billion 768-d float32 = 3 TB; PQ to ~32 bytes/vec → 32 GB.
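A FAISS sketch of the IVF-PQ configuration at 32 bytes/vector (nlist, nprobe, and the training-set size here are toy values — a billion-scale index trains on a large sample and shards across nodes):

```python
import faiss
import numpy as np

d, nlist, m = 768, 1024, 32               # m=32 subquantizers x 8 bits = 32 B/vec
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer over IVF centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

xb = np.random.rand(200_000, d).astype("float32")
index.train(xb)                           # k-means centroids + PQ codebooks
index.add(xb)

index.nprobe = 64                         # centroids probed per query:
D, I = index.search(xb[:5], 10)           # the recall-vs-latency knob
```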
System
- Sharding: by tenant, then by ID range or by clustering.
- Replicas: for QPS and HA.
- Updates: insertion in HNSW is O(log N); deletion is hard (tombstones + periodic compaction). Re-indexing for embedding model changes.
- Filters: pre-filter (filter then search — bad for low-selectivity), post-filter (search then filter — bad for high-selectivity), inline filter (HNSW with attribute-aware traversal).
Eval
Recall@k against brute-force ground truth on a sample; query latency p50/p99; throughput at target recall.
Monitoring
Per-shard latency, recall sampling against brute-force baseline, index size, deletion-tombstone fraction, compaction lag, filter-selectivity histograms.
1B vectors × 768 dims × 4 bytes (FP32) = 3 TB. Won't fit on a single node. Three options: (a) shard across ~50 nodes of 64 GB each (in-memory HNSW per shard); (b) compress with IVF-PQ to ~32 bytes/vec → 32 GB total, fits on a few nodes; (c) DiskANN on SSD — same shard count as (a) but 10× cheaper hardware. Pick based on recall SLA: HNSW for highest recall, IVF-PQ for tightest budget.
Deletes: graph indices don't truly delete — tombstones + periodic full compaction.
Embedding-model drift: changing the embedding model means re-indexing the entire corpus. Plan for this; never mix vectors from different model versions in the same index.
- HNSW vs IVF-PQ trade-off — recall vs memory vs cost.
- Filters are the gotcha — name pre/post/inline distinction.
- Deletes need compaction; re-indexing is mandatory on model changes.
- Sharding by tenant first, then by clustering — multi-tenant isolation matters.
Build the feature store every ML team in a company shares: declarative feature definitions, online and offline serving, point-in-time joins for training, governance. The hard part of this problem is train/serve skew prevention and point-in-time correctness: training joins must look up feature values as of the label timestamp, never the current value.
Declarative feature definitions (Feast/Tecton) → batch (Spark/Flink) writes to offline warehouse + reverse-ETL to online KV → streaming (Kafka → Flink/Beam) for real-time aggregations → on-demand transforms at request time. Training uses point-in-time joins (feature value as-of label timestamp) to prevent leakage. Same code path online and offline = no train/serve skew.
Architecture
- Feature definition: declarative (Feast, Tecton). One source of truth.
- Storage: online low-latency KV (Redis, DynamoDB, Cassandra); offline data warehouse / lake (BigQuery, Snowflake, Iceberg) with full history.
- Compute: batch (Spark/Flink) → write to offline store; reverse-ETL → online. Streaming (Kafka → Flink/Beam) → real-time aggregations. On-demand transforms (compute at request time).
- Point-in-time joins: for training, join labels with feature values as of label timestamp — prevents leakage.
- Serving: feature fetch service fans out lookups, applies on-demand transforms, returns vector to model server.
Governance
Feature ownership, lineage, freshness SLAs, deprecation workflow, cost attribution.
Eval
Per-feature freshness SLA tracking, online/offline parity sampling (compute online, log, recompute offline, diff), feature-vector latency p99.
Label: user U clicked ad A at timestamp T. Training feature "U's 7-day click count" must be the value at T, not at training time. Naive join: select most recent feature row → leaks future clicks into training feature → model overfits a value it could never have at serving time. Correct join: pick the feature row with timestamp ≤ T (and ≥ T − staleness budget). Feast/Tecton implement this as a primitive.
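The same join in miniature with pandas merge_asof (illustrative only; Feast/Tecton run the equivalent over the warehouse):

```python
import pandas as pd

labels = pd.DataFrame({
    "user": ["u1", "u1"],
    "ts": pd.to_datetime(["2024-05-01 12:00", "2024-05-03 09:00"]),
    "clicked": [1, 0],
}).sort_values("ts")

features = pd.DataFrame({
    "user": ["u1", "u1", "u1"],
    "ts": pd.to_datetime(["2024-04-30", "2024-05-02", "2024-05-04"]),
    "clicks_7d": [3, 9, 40],
}).sort_values("ts")

# As-of join: for each label, the latest feature row with ts <= label ts,
# within a staleness budget. Row 1 gets clicks_7d=3, row 2 gets 9 —
# never the future value 40.
train = pd.merge_asof(labels, features, on="ts", by="user",
                      direction="backward", tolerance=pd.Timedelta("3d"))
print(train)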
Feature freshness drift — monitor per-feature staleness; alert when a streaming feature falls behind.
Duplicate features across teams — centralize discoverability; mandatory feature registration.
- Point-in-time correctness — name this unprompted.
- Online + offline stores backed by single feature definition (no skew).
- Streaming (Flink/Beam) for real-time aggregations is mandatory at scale.
- Governance + lineage + freshness SLAs are the org-scale problem.
Build the company's experimentation platform: assignment, logging, metrics, statistical engine, holdouts, heterogeneous treatment effects. The hard part of this problem is SUTVA violations in marketplaces and social networks: two-sided platforms break the "treatment of one user doesn't affect others" assumption, requiring cluster randomization or switchback designs.
Deterministic hash assignment in mutually exclusive layers (Google's Overlapping Experiments Infrastructure). Hierarchical metrics (North Star + proxy + guardrails). Stat engine with Welch t-test + CUPED variance reduction + sequential testing for early-stopping safety + multi-comparison correction. HTE via causal forests / meta-learners. Switchback or cluster randomization for marketplace effects. Long-term holdouts for novelty decay.
Components
- Assignment: deterministic hashing on (user_id, experiment_id) → bucket. Mutually exclusive layers (Overlapping Experiments Infrastructure, Tang 2010 — Google paper). Holdouts.
- Logging: exposure events (when user actually entered the experiment surface), interaction events.
- Metrics: hierarchical — North Star (DAU, revenue), proxy (CTR, sessions), guardrails (latency, error rate).
- Stat engine: Welch's t-test for means; CUPED variance reduction (Deng 2013; sketch after this list); sequential testing (mSPRT) for early-stopping safety; Bonferroni / BH for multiple comparisons.
- Heterogeneous treatment effects: causal forests, meta-learners (S/T/X-learners) to find subgroup wins.
- Switchback / interleaving for marketplace effects (network interference).
- Long-term effects: holdouts (a population that never sees treatment for months) to measure novelty decay.
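CUPED fits in a few lines — a sketch on synthetic data (the covariate here is the same metric from a pre-experiment period, the usual choice):

```python
import numpy as np

def cuped_adjust(y, x):
    """Y' = Y - theta * (X - mean(X)), theta = cov(X, Y) / var(X).
    x is a pre-experiment covariate correlated with the outcome y."""
    theta = np.cov(x, y)[0, 1] / np.var(x)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(1)
x = rng.normal(10, 3, 100_000)                   # pre-period metric per user
y = 0.8 * x + rng.normal(0, 1, 100_000) + 0.05   # in-experiment metric
print(np.var(y), np.var(cuped_adjust(y, x)))     # variance drops ~85% here
```

Same means, much tighter confidence intervals — which is why it ships at every major experimentation platform.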
Eval / monitoring
Sample ratio mismatch (SRM) detection; bucket-balance checks; metric stability (rolling baseline drift); experiment-coverage dashboard.
Simpson's paradox: aggregate effect can flip sign vs subgroup effects.
Peeking without sequential correction: looking at p-values daily and stopping when significant inflates false-positive rate massively.
Metric definition drift: definition changes mid-experiment.
Logging bugs that look like wins: the most common false-positive source. Always check exposure parity (SRM).
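The SRM check itself is one chi-square test (the alpha and the 50/50 split are assumptions; platforms typically alert well below p = 0.001):

```python
from scipy.stats import chisquare

def srm_detected(control_n, treatment_n, split=(0.5, 0.5), alpha=0.001):
    """Exposure counts should match the assignment split. A tiny p-value
    means assignment or logging is broken — don't trust the metrics."""
    total = control_n + treatment_n
    stat, p = chisquare([control_n, treatment_n],
                        f_exp=[total * s for s in split])
    return p < alpha

print(srm_detected(500_000, 503_500))  # True: ~0.35% imbalance at this scale
```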
- CUPED is variance reduction with a pre-experiment covariate — name it.
- Mutually exclusive layers (OEI) — multiple experiments on same user.
- SUTVA violations are the marketplace gotcha — switchback / cluster randomization.
- Sequential testing or no early-stopping — peeking inflates FPR.
Recommend items where the items are inherently multi-modal: short videos with audio + text + visual + author info. The hard part of this problem is modality fusion + missingness: not every item has every modality, and a model that assumes "always have video" breaks on items without one. Plus, video encoding is expensive — precompute and cache aggressively.
Per-modality pretrained encoders (CLIP/SigLIP for image-text, video transformer for clips) project into shared embedding space. Gated attention fuses modalities (gates handle missingness). User encoder over historical item embeddings. Two-tower retrieval with multimodal item tower. Cross-encoder ranker with multi-task heads. Precompute video encodings — they're expensive.
Architecture
- Item encoders: pretrained encoders per modality (CLIP/SigLIP for image-text, video transformer for clips). Project into shared embedding via projection head.
- Multi-modal fusion: gated attention combining modalities; quality of each modality gated (sketch after this list).
- User encoder: sequence model over user's historical item embeddings + side info.
- Two-tower retrieval with multimodal item tower; ANN.
- Cross-encoder ranker: full attention between user history and candidate, multi-task heads.
- LLM-augmented (optional): LLM generates natural language query/intent from history; semantic retrieval over multimodal embeddings.
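A minimal PyTorch sketch of gated fusion with a presence mask (dimensions and the single-linear gate are assumptions; missing modalities are zero-filled at input and masked out of the softmax):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse per-modality embeddings with learned gates; the mask ensures
    the model never assumes 'always has video'."""
    def __init__(self, n_modalities=3, dim=256):
        super().__init__()
        self.gate = nn.Linear(n_modalities * dim, n_modalities)

    def forward(self, mod_embs, present):
        # mod_embs: [B, M, D] projected per-modality embeddings (0 if missing)
        # present:  [B, M]    1.0 where the modality exists for this item
        B, M, D = mod_embs.shape
        logits = self.gate(mod_embs.reshape(B, M * D))          # [B, M]
        logits = logits.masked_fill(present == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)                 # over present only
        return (weights.unsqueeze(-1) * mod_embs).sum(dim=1)    # [B, D] fused

fused = GatedFusion()(torch.randn(4, 3, 256),
                      torch.tensor([[1., 1., 0.]] * 4))         # items with no video
```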
Training infra
Pretrain modality encoders separately; align via contrastive on (user_history, clicked_item); fine-tune end-to-end on engagement.
Serving
Precompute item embeddings on ingest; cache aggressively. Video encoding latency is hostile — never compute on the request path.
Eval
Per-modality ablation: how much does each modality contribute? Cold-start eval on items with subset of modalities. Standard recsys metrics (recall, NDCG, online engagement).
Expensive video encoding → precompute, cache; never on request path.
Modality bias: one modality (e.g., title text) dominates because it's always present and high-signal — model ignores video. Counter via per-modality dropout during training.
Cold-start — content-based via modality embeddings is the strength of this design (vs ID-only models).
- Pretrained per-modality encoders + projection heads — don't train end-to-end from scratch.
- Gated fusion handles missing modalities; per-modality dropout in training prevents bias.
- Precompute item embeddings; never encode video on request path.
- Cold-start is the strength — multi-modal beats ID-only on new items.
360° real-time perception at 10–30 Hz, sub-50ms latency, multi-sensor (camera + lidar + radar), high recall on safety-critical objects, must generalize to long tail (mattress on highway, construction worker holding a sign). The hard part of this problem is long-tail rare objects + auto-labeling at PB scale — you can't human-label everything, you need an offline-large-model auto-labeler with human verification on edge cases.
Sensor fusion via BEV (Bird's-Eye View) — modern preference is mid-fusion (BEVFormer / Lift-Splat-Shoot lifts camera features into 3D, splats into BEV grid, transformer over BEV). Multi-task heads: 3D detection, segmentation, lane, traffic light, pose, intent. Tracking via Kalman + re-ID. Trajectory prediction with multimodal output. Auto-labeling pipeline + simulation + shadow mode. End-to-end (Wayve / Tesla FSD V12) is the alternative.
Requirements
- 360° perception at 10–30 Hz
- Latency < 50ms
- Multiple sensors (camera, lidar, radar)
- High recall on safety-critical objects
- Generalize to long tail
Architecture
- Sensor fusion:
- Early: project lidar onto camera, jointly process.
- Late: per-sensor detections, fuse in BEV.
- Mid (modern): BEVFormer / Lift-Splat-Shoot — lift camera features into 3D via depth distribution, splat into BEV grid, transformer over BEV.
- Backbone: per-camera ConvNet/ViT; temporal fusion (multiple frames).
- Heads: 3D object detection (bbox, class, velocity), semantic/instance segmentation, lane detection, drivable area, traffic light state, pose, intent prediction.
- Tracking: Kalman / data association; re-ID embeddings.
- Prediction: trajectory prediction (Trajectron, MultiPath, MTR) — multimodal trajectory distributions.
- End-to-end alternatives: Wayve / Tesla FSD V12 — camera → planning, trained on driving demonstration + RL from interventions.
Training infra
PB-scale data; auto-labeling pipeline (large offline model labels, humans verify edge cases); simulation for rare scenarios; mining hard negatives from fleet.
Eval
Offline (mAP, IoU, ADE/FDE); closed-loop sim; shadow mode (parallel run, log disagreements); on-road (disengagement rate).
Monitoring
Per-class recall (especially safety-critical: pedestrian, cyclist), calibration drift between sensors, latency p99, disengagement rate per region/condition, sim-to-real gap.
Calibration drift between sensors: lidar-camera extrinsics shift with vibration; periodic recalibration is mandatory.
Adversarial conditions (rain, sun glare, dust): per-condition eval slices.
Prediction multimodality — collapse to mean is dangerous (predicting average of "turn left" and "turn right" = drive straight into oncoming traffic). Use multi-modal trajectory output (MTR-style).
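A minimal winner-takes-all regression term illustrating why modes don't collapse — only the closest hypothesis gets gradient, so hypotheses specialize (MultiPath/MTR additionally train a mode-probability head; shapes here are assumptions):

```python
import torch

def best_of_k_loss(pred_trajs, gt_traj):
    """pred_trajs: [B, K, T, 2] K trajectory hypotheses; gt_traj: [B, T, 2].
    Penalize only the best hypothesis per example."""
    dists = ((pred_trajs - gt_traj.unsqueeze(1)) ** 2).sum(-1).mean(-1)  # [B, K]
    return dists.min(dim=1).values.mean()

loss = best_of_k_loss(torch.randn(8, 6, 30, 2), torch.randn(8, 30, 2))
```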
OTA model rollout with canary fleets — never big-bang deploy.
- BEV fusion is the modern default — name BEVFormer / LSS.
- Auto-labeling pipeline + sim + shadow mode is the data flywheel.
- Trajectory prediction must be multimodal — never collapse to mean.
- Long tail is the hard problem — active mining is the answer.
Phase the time explicitly. Talk less in clarification, more in deep-dive. Always cover eval AND monitoring AND gotchas. End with "what would I do with more time" — it shows scope-discipline. Never claim a single technology solves everything.
The phase-by-phase tactic sheet
- 0–5 min: clarify requirements, scale, what to optimize, what's allowed. Confirm with interviewer. Write the constraints on the board.
- 5–10 min: high-level architecture (boxes + arrows). State the funnel/stack.
- 10–25 min: deep-dive 1–2 components the interviewer signals interest in. Discuss alternatives. This is where 50% of your signal lives.
- 25–35 min: training infra + serving infra + eval (online + offline).
- 35–45 min: gotchas, monitoring, what would you do differently with more time.
The "if I had more time" close
End every loop with 2–3 things you'd explore further: alternative architectures, deeper eval, harder gotchas. This signals scope-discipline (you knew what you skipped) and intellectual honesty (you don't claim to have solved everything in 45 min).
- Phase the time explicitly — say "I'll spend 5 min clarifying, then 5 sketching, then 15 deep-diving the ranker."
- Eval + monitoring + gotchas are 30% of the score; don't run out of time.
- "If I had more time" is the scope-discipline signal.
- Trade-offs over absolutes: "I'd pick X because Y, even though Z."
ML system design quiz — readiness check
- Walk through YouTube's 2-stage funnel.
Show answer
Retrieval (~1B → 1k): two-tower + ANN, plus other sources (collab, fresh, trending). Ranking (~1k → 100): cross-encoder MMoE with multi-task heads (pCTR, pWatchtime, pLike, pDislike). Final scoring + diversity (MMR/DPP) + business rules → top 10.
- How would you handle cold start for new users / new items?
Show answer
New user: content-based recs from session signals (initial interactions, demographics, source); explore via bandits. New item: content-based features (text/image embeddings) drive retrieval; explore via mandatory exploration slots; meta-learning if you have many items per category.
- Your CTR model has high offline AUC but loses in A/B test. Diagnose.
Show answer
Suspect (1) selection bias / exposure bias — train log only contains items the old model showed; (2) train/serve skew — different feature pipeline; (3) miscalibration — AUC measures ranking, not calibrated prob; (4) shift in business mix; (5) novelty: model fits historical patterns that no longer hold.
- Why does an MMoE outperform a shared-bottom multi-task model?
Show answer
Shared-bottom forces all tasks through one trunk → negative transfer when tasks conflict. MMoE has multiple experts shared across tasks; per-task gates softly mix experts. Each task can pick its preferred expert combination → less interference.
- How do you debias position effects in a learned ranker?
Show answer
(1) PAL / position-aware learning: feed position as a feature during training; set to "no position" at inference. (2) Two-tower: shallow tower predicts position effect; main tower predicts relevance. (3) Click models (PBM, Cascade, DBN). (4) Result randomization for unbiased data.
- In-batch negatives vs explicit hard negatives — tradeoff?
Show answer
In-batch: cheap, popularity-biased (high-frequency items appear more as negatives). Hard negatives (top non-positive from ANN): faster learning per example, risk of false negatives (true positives that look similar). Mixed Negative Sampling (MNS): combine in-batch + uniform-random; logQ correction for popularity debiasing.
- HNSW vs IVF-PQ — when each?
Show answer
HNSW: best recall/latency in memory; high memory (no compression). Use for < 100M vectors with quality SLA. IVF-PQ: compressed (32×–128×); scales to billions; recall lower. Use for billion+ vectors at modest QPS. ScaNN/DiskANN intermediate.
- Your retrieval layer returns the same 1000 items for everyone. What's wrong?
Show answer
Likely: (1) user tower collapsed (output ≈ constant — check embedding norm and variance); (2) lack of personalization features in user tower; (3) popularity-biased sampled-softmax without logQ correction → all items pushed toward popular ones; (4) ANN index returning popular items only (debiasing lost in serving).
- How do you calibrate a multi-task ranker where each task has different positive rates?
Show answer
Per-task isotonic regression on a held-out window. For each task head's logit, fit monotonic mapping logit → calibrated prob using actual positive rate per quantile. Recalibrate frequently (weekly) under distribution shift. Joint calibration if tasks are correlated.
- Migrate DLRM-style ranker to generative recommender (TIGER/HSTU) — risks?
Show answer
(1) Loss of explicit business signal control (multi-task heads gone). (2) Auto-regressive serving latency. (3) Cold-start changes (semantic IDs from RQ-VAE replace ID embeddings). (4) Training-time interactions with online behavior (counterfactual eval harder). (5) Capacity for personalization in generative model still being scaled out. Run as A/B with strict guardrails.
- Design a RAG system at billion-doc scale.
Show answer
Chunking (semantic, ~500–1000 tok with overlap) → embedding (E5/BGE/Cohere) → vector DB (sharded by tenant, IVF-PQ within shard) → hybrid retrieval (BM25 + dense, RRF) → cross-encoder reranker → LLM generation with citations. Eval: recall@k, RAGAS faithfulness, citation accuracy.
- Design a feature store with point-in-time correctness.
Show answer
Online (sub-ms KV) + offline (warehouse with full history). Streaming compute (Flink/Beam) → online + offline. For training, point-in-time joins: for each (entity, label timestamp), look up feature value as-of the label time. Prevents leakage of future info into training features.
- What's CUPED?
Show answer
Variance reduction for A/B tests (Deng 2013). Adjust the outcome by a pre-experiment covariate:
Y′ = Y − θ(X − E[X]), where θ = Cov(X, Y) / Var(X) and X is a pre-experiment covariate correlated with Y. Reduces variance → smaller sample size for the same power. Used at every major web company.
- How do you avoid SUTVA violations in marketplace experiments?
Show answer
SUTVA = stable unit treatment value assumption — treatment of one unit doesn't affect others. Violated by network effects (social) or supply-side competition (DoorDash, Uber). Solutions: cluster randomization (region-level), switchback (alternate treatment over time), counterfactual interleaving for ranking.
- Design ChatGPT serving for 200M DAU.
Show answer
Disaggregated prefill + decode pools. Prefix caching (system prompts shared). Continuous batching. TP=8 within node for 70B+. Speculative decoding. FP8 weights. Model routing (cheap vs reasoning). Multi-tier autoscale. Safety filtering pipeline pre/post inference. SLO-aware scheduling.