ML system design — 12 worked problems
Sr Staff designs are not "draw a DLRM". They are: clarify scope, defend trade-offs, name second-order effects, and survive the deep-dive. This page is twelve problems worked end-to-end in the same disciplined structure — plus the meta-framework that ties them together.
What you'll learn
- The 45-min interview framework
- Design YouTube recommendations
- Design Twitter / X feed ranking
- Design ad CTR prediction
- Design personalized search ranking
- Design ChatGPT serving at scale
- Design fine-tuning + Constitutional AI loop
- Design RAG at billion-doc scale
- Design vector search infrastructure
- Design ML feature store with point-in-time correctness
- Design A/B testing framework for ML
- Design multi-modal recommendation system
- Design self-driving perception pipeline
- How to actually run a 45-min ML design loop
The Sr Staff bar is not "drawing the right diagram"; it's making sensible cost/quality/latency trade-offs and naming second-order effects. Every loop has the same shape: clarify the problem, estimate capacity, sketch the architecture, deep-dive a component, name the gotchas. The candidates who skip clarification and jump to "two-tower + MMoE + DLRM" lose the loop in the first 5 minutes.
Five phases, every time
- Clarify (0–5 min) — scale, latency budget, what to optimize, what's allowed, who the user is, what existing systems exist. Write the constraints down. Confirm with interviewer.
- Capacity (5–10 min) — back-of-envelope: QPS, storage, training data volume, memory. Pick the constraint that will dominate.
- API + architecture (10–20 min) — boxes and arrows. State the funnel/stack. Two-stage retrieval+ranker, prefill+decode, batch+streaming feature store, etc.
- Deep-dive (20–35 min) — interviewer signals what they want pulled apart. Discuss alternatives. Defend choices with first principles. This is the bulk of the signal.
- Eval + monitoring + gotchas (35–45 min) — offline metrics, online metrics, guardrails, what would you do differently with more time.
The signals interviewers grade on
- Did you clarify before architecting? Skipping this is the #1 way to lose Sr Staff loops.
- Did you name trade-offs explicitly? "I'd use HNSW because we're under 100M vectors and want highest recall at the cost of memory."
- Did you cover eval AND monitoring AND gotchas? Junior candidates stop at architecture.
- Did you name second-order effects? Position bias, feedback loops, train/serve skew, calibration drift, network effects in marketplace experiments.
- Did you stay scope-disciplined? When the interviewer asks "deep-dive on the ranker," you stop talking about retrieval.
- Clarify → Capacity → Architecture → Deep-dive → Eval/monitoring/gotchas. Always.
- Name trade-offs explicitly. "I'd pick X because Y, even though Z."
- Cover eval, monitoring, and gotchas — they're 30% of the score.
- Second-order effects (bias, drift, feedback loops) are the Sr Staff differentiator.
Recommend videos to 2B users from a corpus of billions, with sub-100ms latency, optimizing for watchtime AND satisfaction (likes, subscribes, no-dislike) — without collapsing into a clickbait filter bubble. The hard part of this problem is the multi-objective optimization: pCTR alone produces clickbait; pWatchtime alone produces engagement-bait; you need a multi-task ranker with the right weights, and you need diversity / exploration baked in.
Two-stage funnel: retrieval (1B → 1k) merges several sources (two-tower + collab + fresh + trending); ranking (1k → 100) is a multi-task cross-encoder (DLRM-style or transformer) with MMoE shared experts; final scoring is a tuned weighted sum across pCTR/pWatchtime/pLike/pDislike with diversity (MMR/DPP) and exploration slots. Streaming training for freshness, shared embedding tables sharded model-parallel.
Requirements
- 2B users, 500h/min uploaded
- Latency < 100ms p99
- Optimize watchtime + satisfaction (likes, subscribes, low dislike rate)
- Avoid filter-bubble + clickbait dynamics
Data
- Implicit: watch, watch_fraction, skip
- Explicit: like, dislike, subscribe, share
- Context: time, device, locale
- Content: channel, topic, language, age
- User: history, social graph
Architecture — classic two-stage funnel
- Candidate generation (retrieval): multiple sources merged.
- Two-tower: user tower (ID embed + recent history sequence + demographics) and item tower (video features), trained with sampled-softmax / in-batch negatives (sketch after this list). ANN (HNSW or ScaNN) over O(100M) videos → top ~1k.
- Co-watch / collab: classical retrievers.
- Subscriptions / fresh: from followed channels.
- Trending / locale popular.
- Ranking: cross-encoder (DLRM-style or transformer). Inputs: user features, candidate features, cross features, sequence of recent interactions (DIN/SASRec attention over user history vs candidate). Multi-task heads: pCTR, pWatchtime, pLike, pComment, pDislike. MMoE for shared+task-specific experts.
- Final scoring: weighted sum across heads (weights tuned via online experiments / Pareto), with diversity / recency / freshness boost. MMR or DPP for diversity.
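A minimal PyTorch sketch of the two-tower retrieval objective with in-batch negatives. Tower widths, the temperature, and the plain-MLP towers are illustrative assumptions — production towers consume ID embeddings and history sequences:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Minimal two-tower retriever: user and item MLPs into a shared space."""
    def __init__(self, user_dim, item_dim, embed_dim=256):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(user_dim, 512), nn.ReLU(),
                                        nn.Linear(512, embed_dim))
        self.item_tower = nn.Sequential(nn.Linear(item_dim, 512), nn.ReLU(),
                                        nn.Linear(512, embed_dim))

    def forward(self, user_feats, item_feats):
        u = F.normalize(self.user_tower(user_feats), dim=-1)
        v = F.normalize(self.item_tower(item_feats), dim=-1)
        return u, v

def in_batch_softmax_loss(u, v, log_q=None, temperature=0.05):
    """In-batch sampled softmax: each row's positive is the diagonal;
    every other item in the batch serves as a negative. log_q (item
    sampling log-probabilities) applies the logQ correction that
    debiases popular items appearing often as negatives."""
    logits = u @ v.T / temperature            # [B, B] similarity matrix
    if log_q is not None:
        logits = logits - log_q.unsqueeze(0)  # subtract log-popularity
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)
```

At serving time the item tower's outputs are precomputed into the ANN index; only the user tower runs per request.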
Training infra
Massive embedding tables (hundreds of GB) — sharded model parallel (TorchRec). Streaming training (Kafka → Flink → trainer) for freshness. Daily full retrain + hourly incremental.
Serving
Retrieval served from in-memory ANN index (replicated). Ranker on GPU/CPU farm with feature store fan-in. Online (Redis-like, sub-ms) + offline (warehouse for training).
Eval
Offline AUC, NDCG, recall@k, calibration. Online: A/B tests with primary metric watchtime, guardrails on dislikes/abandonment/diversity.
Monitoring
Per-head calibration drift, retrieval recall slipping, feature freshness, p99 latency, fraction of recommendations that come from each retrieval source (don't let one collapse), exploration slot fraction.
2B users × 10 sessions/day × 50 candidates ranked = ~1 trillion ranker evaluations/day. At 1ms per ranking on GPU, that's ~12k GPU-equivalents continuously. In practice you batch and run on heterogeneous CPU+GPU pools. Embedding tables: 100M videos × 256-dim BF16 = ~50 GB just for items; user side similar order. Total ~hundreds of GB → must be model-parallel sharded (TorchRec).
Feedback loops: model only sees feedback on items it showed, so it can't unlearn its own biases. Mandatory exploration slots (5–10%) are non-negotiable.
Position bias: position 1 always gets clicked more. Estimate the position effect with a shallow position-only tower and subtract; or use position-aware loss (PAL).
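A sketch of the position-as-feature approach (the original PAL factorizes p(click) = p(seen | position) × p(click | seen); this additive-logit version is the simpler cousin — layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class PALRanker(nn.Module):
    """Train on click = relevance + position effect; serve with the
    position branch switched off so ranking reflects relevance only."""
    def __init__(self, feat_dim, n_positions=50):
        super().__init__()
        self.relevance = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                       nn.Linear(256, 1))
        self.position_bias = nn.Embedding(n_positions, 1)  # shallow position-only tower

    def forward(self, feats, position=None):
        logit = self.relevance(feats).squeeze(-1)
        if position is not None:                 # training: add position effect
            logit = logit + self.position_bias(position).squeeze(-1)
        return logit                             # inference: relevance-only logit
```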
Train/serve skew: same feature definitions across pipelines, point-in-time correctness during training joins.
- Two-stage funnel discipline: retrieval is recall-first, ranking is precision-first.
- Multi-task heads + MMoE — never single-objective on a recommender.
- Calibration matters when scores feed downstream (auctions, weighted blending).
- Feedback loops are the killer: exploration is required, not optional.
Like YouTube, but with severe recency, real-time engagement signals, and heterogeneous content (text/image/video/links). The hard part of this problem is real-time freshness: a tweet from 2 minutes ago with 500 likes/min is more valuable than a tweet from 2 days ago with 10k cumulative likes — your ranker needs to consume real-time engagement counters as features without breaking train/serve consistency.
Source mixer (in-network + out-of-network + promoted + lists) feeds a heavy multi-task ranker (MaskNet / DCN-v2) with real-time engagement counts and recency decay; heuristic re-rank handles diversity and sensitive-content filtering. Negative-feedback signals (mute/block) are weighted heavily; SimClusters provide interest representations.
Requirements
- Real-time, sub-200ms p99
- Billions of tweets, heterogeneous signals (text, image, video, links)
- Strong recency requirement — minutes-old content matters
Architecture
- Source mixer: in-network (people you follow), out-of-network (recommended), promoted, communities/lists. Each contributes candidates.
- Heavy ranker: MaskNet / DCN-v2 with multi-task heads (p_like, p_RT, p_reply, p_dwell, p_negative_feedback). Inputs: user-author affinity, real-time engagement counts, embedding similarity, recency decay.
- Heuristic re-rank: diversity (cap per author), recency boost, sensitive content filter.
Training infra
Hours-fresh data; daily retrains. Embeddings for user/author/topic. SimClusters (Twitter's clustering) provides interest representations.
Serving
Real-time engagement counters via Kafka + sliding window aggregator. Per-tweet feature lookup at request time. Hot tweet cache.
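A single-process sketch of the sliding-window counter semantics that the Kafka + Flink aggregation provides at scale (window length and class names are assumptions):

```python
import time
from collections import defaultdict, deque

class SlidingCounter:
    """Per-key event count over a trailing window, e.g. likes in the
    last 5 minutes, consumed as a ranker feature."""
    def __init__(self, window_sec=300):
        self.window = window_sec
        self.events = defaultdict(deque)   # key -> timestamps within window

    def record(self, key, ts=None):
        self.events[key].append(ts if ts is not None else time.time())

    def count(self, key, now=None):
        now = now if now is not None else time.time()
        q = self.events[key]
        while q and q[0] < now - self.window:   # evict expired timestamps
            q.popleft()
        return len(q)
```

The train/serve consistency requirement means the training pipeline must reconstruct exactly this windowed value as of each label timestamp.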
Eval
Offline: AUC per head, calibration. Online: dwell time, follow rate, negative feedback rate, retention.
Bots / spam need an upstream filter; they pollute training data and can game engagement counts.
Trending events create positive-class shifts mid-day; a model trained yesterday underweights them. Streaming features and frequent retrains both matter.
- Real-time engagement features as ranker inputs — no train/serve skew.
- Negative feedback is a signal, not noise — model it explicitly.
- Source mixer pattern: candidates from many funnels, blended in the ranker.
- Recency decay must be a feature, not just a heuristic post-rank.
Predict click probability for ad-impression pairs at millions of QPS, with predicted probability used directly in second-price auctions and pacing systems. The hard part of this problem is calibration: an AUC-optimal ranker that's miscalibrated will mis-bid on every impression. CTR is also extremely high cardinality (billions of users × millions of ads × thousands of slots) and very imbalanced (~0.1–1% positive rate).
DLRM-style stack with massive embedding tables for sparse IDs, DCN-v2 cross terms, DIN/DIEN sequence attention. Streaming online learning + daily retrains. Calibration via isotonic / Platt on a held-out window — recalibrate frequently, especially across regime shifts. Watch for delayed conversions and selection bias.
Requirements
- O(ms) latency at QPS scale of millions
- Very high cardinality features
- Good calibration (predicted CTR feeds auctions and pacing)
Data
Impressions and clicks (positive class very rare, ~0.1–1%), context (page, slot, time), ad creative, advertiser, user features.
Model — classic stack
- Feature embedding tables for IDs (user, ad, publisher). Hash-bucket for OOV.
- DLRM (Naumov 2019, arxiv 1906.00091): bottom MLP on dense, embedding lookup for sparse, pairwise dot product across all sparse pairs → top MLP → sigmoid.
- DCN-v2 (Wang 2020, arxiv 2008.13535): explicit cross terms via low-rank cross network.
- Sequence: DIN/DIEN attention over user's recent ad interactions.
Training infra
Streaming online + daily retrains. Massive embedding tables (TBs) → row-wise parallel.
Serving
Embedding-table lookup service (sharded by hash), feature fetch, ranker eval, calibration, score back to auction.
Eval
AUC, log-loss, normalized cross-entropy (NCE), calibration plots. Online: revenue, eCPM, advertiser ROI.
Calibration — critical for auctions
Apply isotonic regression or Platt scaling on a held-out window; recalibrate frequently. AUC measures ranking, not whether predicted 0.05 actually means 5% click rate. In a second-price auction, miscalibration directly leaks revenue.
Ads A and B are predicted at pCTR 0.10 and 0.05. AUC says "rank A higher" — correct. But the auction ranks by bid × pCTR. If the true CTRs are 0.02 and 0.01, both predictions are inflated by the same 5× factor, relative eCPMs are preserved, and the auction still picks the right winner. If instead A's true CTR is 0.02 and B's is 0.005, A is inflated 5× while B is inflated 10×: the predicted eCPM ordering can now disagree with the true one, the winner changes, and revenue degrades. AUC is invariant to monotone rescaling; auctions are not.
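A minimal calibration sketch with scikit-learn's IsotonicRegression — fit on a recent held-out window, apply before the auction. The synthetic data just simulates a uniformly 2×-miscalibrated model:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
preds = rng.uniform(0.0, 0.2, 100_000)    # model pCTR on held-out impressions
clicks = rng.binomial(1, preds * 0.5)     # true CTR is half the prediction

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(preds, clicks)                    # monotone map: raw score -> calibrated prob

def calibrated_ctr(raw_scores):
    return iso.predict(raw_scores)        # this, not the raw score, feeds the auction

print(preds.mean(), calibrated_ctr(preds).mean(), clicks.mean())  # ~0.10, ~0.05, ~0.05
```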
Selection bias: training log only contains impressions the model showed → IPS or doubly-robust correction needed.
Cold start: new ads have no history → content-based fallback (creative-text embedding).
Distribution shift at sales/holidays: model trained pre-Black-Friday over-bids on Black Friday. Frequent retrains and regime detection.
- Calibration ≠ AUC — name this distinction unprompted.
- Embedding-table sharding is the systems story (TBs of state).
- Delayed feedback + selection bias + cold start — the three classic CTR gotchas.
- Streaming online + daily retrains: calibration drifts, you re-fit.
Search ranking with personalization, but careful — over-personalizing on ambiguous queries ("Apple") gives the wrong intent. The hard part of this problem is unbiased eval: position bias is enormous in search (top result gets 30%+ of all clicks regardless of relevance), so you need click models or randomized exploration to even measure quality.
Hybrid retrieval (BM25 + dense two-tower, possibly ColBERT late-interaction) → cheap L1 ranker → cross-encoder L2 ranker with listwise loss. Personalization via user embedding (long-term + session). Don't over-personalize ambiguous queries. Eval with NDCG@k offline, click models for unbiased online eval, plus head/tail query slices.
Pipeline
- Query understanding: spelling correction, entity linking, intent classification, query embedding (dense retrieval), expansion.
- Retrieval: hybrid — BM25 (lexical, inverted index) + dense (two-tower with query/doc encoders, ANN over corpus). Late-interaction ColBERT for higher recall.
- L1 ranker (cheap, ~1k candidates): linear or shallow GBDT.
- L2 ranker (cross-encoder, ~100): BERT/transformer cross-encoder scoring (query, doc) pair. Listwise loss (LambdaMART / ListNet).
- Personalization: user embedding (long-term + session). Don't over-personalize for ambiguous queries.
Eval
NDCG@k, MRR, click models for unbiased eval (Cascade, PBM, DBN). Online: search success, session abandonment.
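For reference, a minimal NDCG@k in the standard exponential-gain form (graded relevance labels, log2 position discount):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """relevances: graded labels of the results, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    dcg = np.sum((2.0 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))  # graded labels for one query
```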
Monitoring
Per-segment NDCG (head queries vs tail queries vs navigational vs informational), latency p99, freshness for news queries, click-through patterns by position.
Head/tail queries — separate eval slices. A model that wins on head can lose 5% on tail and overall metrics will hide it.
Freshness for news vs stable for evergreen — query-class-conditional ranking weights.
Over-personalization on ambiguous queries — "Apple" should sometimes show fruit; over-fitting to a tech user's history breaks intent diversity.
- Hybrid (BM25 + dense) is universal — pure dense underperforms on lexical queries.
- Cross-encoder L2 ranker; listwise loss (LambdaMART). Pointwise loses signal.
- Click models for unbiased eval — name them (PBM, Cascade, DBN).
- Per-segment eval — head/tail, navigational/informational, fresh/stable.
Serve 200M+ DAU at sub-2s TTFT and sub-50ms inter-token latency, across short and long prompts, with multi-tier SLAs and tight GPU economics. The hard part of this problem is head-of-line blocking: a single 100k-token prefill request can block dozens of short requests in a naive scheduler. The solution is disaggregated prefill+decode pools, prefix caching, continuous batching, and SLO-aware scheduling.
Disaggregated prefill + decode pools (KV transferred over RDMA). PagedAttention prefix cache keyed on prompt prefix. Continuous batching with per-iteration scheduling. TP=8 within node for 70B+; speculative decoding (EAGLE); FP8 weights+activations. Model router selects mini/full/reasoning model by query complexity. Priority queues separate long-context from short.
Requirements
- 200M+ DAU
- p99 TTFT < 2s, ITL < 50ms
- Mix of short and long prompts
- Multi-tenant SLA tiers (free / plus / pro / API)
- GPU efficiency (cost is the constraint)
Architecture
- Global LB → region → cluster.
- Disaggregated serving: prefill pool + decode pool. KV transferred over RDMA.
- Prefix cache: hash-table keyed on prompt prefix tokens. Block-aligned (PagedAttention); sketch after this list.
- Continuous batching: per-iteration scheduling.
- Tensor parallel within node (TP=8 for 70B+); pipeline parallel only for very large models.
- Speculative decoding with EAGLE-style head.
- Quantization: FP8 weights+activations; INT4 weights for cost-sensitive variants.
- Routing: model router selects mini/full/reasoning model based on query complexity.
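A sketch of block-aligned prefix-cache keying in the vLLM style — each block's hash chains over the whole prefix, so a hit guarantees every earlier token matches (block size and hashing details are assumptions):

```python
import hashlib

BLOCK = 16  # tokens per KV block

def block_hashes(token_ids):
    """Chained hashes over full blocks: block i's key depends on all
    tokens up to its end."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        prev = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK]).encode()).digest()
        hashes.append(prev)
    return hashes

def longest_cached_prefix(token_ids, cache):
    """Number of prompt tokens whose KV can be reused from the cache."""
    hit = 0
    for n, h in enumerate(block_hashes(token_ids), start=1):
        if h not in cache:
            break
        hit = n * BLOCK
    return hit

cache = set(block_hashes(list(range(64))))                        # a prior prompt
print(longest_cached_prefix(list(range(64)) + [99, 100], cache))  # 64
```

Shared system prompts make hit rates high: the first blocks of most requests are identical, so their prefill is skipped entirely.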
Eval
Offline (held-out conversations, MMLU/HumanEval/MATH/AgentBench), online (thumbs feedback, conversation length, retry rate, A/B on model versions).
Monitoring
TTFT/ITL distributions, KV cache hit rate, GPU utilization, queue depths, OOM rate, refusal rates, output toxicity.
Prefill is compute-bound (one big matmul per layer over the prompt). Decode is memory-bound (one matmul per token, dominated by KV cache reads). Mixing them in the same pool means decode steps wait behind prefill steps, ITL spikes. Disaggregating: prefill pool runs hot on compute (FP8 saturates tensor cores); decode pool runs hot on memory bandwidth. KV cache moves once over RDMA. Result: stable ITL even when prefill load fluctuates.
Noisy neighbors on shared pools.
Autoscaling lag: GPU spin-up takes minutes — you can't scale reactively. Predictive scaling on traffic forecasts.
Safety filtering pipeline adds latency: pre-filter on input, post-filter on output. Both add hundreds of ms if naive; needs streaming + parallel checks.
Streaming TTFT vs full-completion latency are different SLOs: optimize them separately.
- Disaggregated prefill+decode is the 2025 default.
- Prefix caching + continuous batching + speculative decoding stack is non-negotiable.
- Model router (mini/full/reasoning) is how you make economics work.
- Head-of-line blocking on long context is THE LLM serving gotcha.
Build the production pipeline that takes raw user prompts (and red-team prompts) and produces a fine-tuned, aligned, eval-gated, canary-rolled model. The hard part of this problem is the safety-helpfulness trade-off and the regression-gate engineering: a tiny safety regression on one slice should block the rollout, even if helpfulness improves overall.
Pipeline: data ingest (deduped, PII-filtered) → generation pool (K candidates per prompt, self-critique with constitutional principles, revisions) → preference labeling (AI judge + human gold subset) → SFT on revisions → DPO/KTO/PPO on preferences → eval (capabilities + safety + IF + winrate) → regression gate (block if any safety/capability metric regresses > ε) → canary rollout (1% → 10% → 100%) with online metric monitoring.
Pipeline
- Data ingest: raw prompts (real user, red team, synthetic). Deduped, PII-filtered.
- Generation pool: base model generates K candidates per prompt; self-critique with constitutional principles; revisions.
- Preference labeling: AI judge (stronger model) rates pairs against constitution. Sample subset for human review (gold set).
- Training: SFT on revisions; then DPO/KTO/PPO on preferences. Multiple iterations.
- Evaluation: capabilities (MMLU, MATH, HumanEval), safety (HarmBench, XSTest, jailbreak suite), instruction following (IFEval), preference winrate vs prior model.
- Regression gate: blocked from rollout if regression on N safety / capability metrics > ε (sketch after this list).
- Canary rollout: 1% → 10% → 100% with online metrics monitored.
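A sketch of the gate logic (metric names and epsilons are assumptions; the point is the asymmetry — any single safety regression beyond ε blocks, no matter how much helpfulness improved):

```python
# Lower is better for attack-success and over-refusal rates; higher for the rest.
LOWER_IS_BETTER = {"harmbench_asr", "xstest_over_refusal"}
EPS = {"harmbench_asr": 0.005, "xstest_over_refusal": 0.01,
       "mmlu": 0.003, "humaneval": 0.01, "ifeval": 0.005}

def gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    failures = []
    for metric, eps in EPS.items():
        delta = candidate[metric] - baseline[metric]
        if metric in LOWER_IS_BETTER:
            delta = -delta                        # flip so positive = improvement
        if delta < -eps:
            failures.append(f"{metric}: regressed {-delta:.4f} > eps {eps}")
    return len(failures) == 0, failures

baseline  = {"harmbench_asr": 0.04, "xstest_over_refusal": 0.08,
             "mmlu": 0.780, "humaneval": 0.71, "ifeval": 0.82}
candidate = {"harmbench_asr": 0.03, "xstest_over_refusal": 0.10,
             "mmlu": 0.785, "humaneval": 0.72, "ifeval": 0.82}
print(gate(candidate, baseline))  # blocked: over-refusal regressed 0.02 > 0.01
```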
Eval
Capabilities (MMLU, MATH, HumanEval, MMLU-Pro, GPQA), safety (HarmBench attack success, XSTest over-refusal, JailbreakBench), instruction following (IFEval), preference winrate vs prior model (Arena-Hard, AlpacaEval 2).
Monitoring
Online thumbs feedback, refusal rate, output toxicity rate, conversation length, retry rate, jailbreak attempt rate. Per-cohort breakdowns.
Mode collapse: DPO can over-decrease likelihood of both chosen and rejected — only the gap matters. Mitigations: stronger β, mix in SFT loss, IPO variant.
Distribution shift between SFT data and RL prompts → off-policy issues during PPO.
Safety-helpfulness trade-off: making the model refuse more reduces some risks but increases over-refusal on benign queries (XSTest).
Eval contamination: pretraining data leaked into eval → use contamination-resistant evals (GPQA-Diamond, LiveCodeBench, FrontierMath).
- Regression gate is the engineering discipline that makes alignment safe to ship.
- AI judge + human gold subset is the cheap, high-quality preference label recipe.
- Safety vs helpfulness is a Pareto curve — name both axes.
- Canary rollout (1% → 10% → 100%) with online metrics — no big-bang deploys.
Build a retrieval-augmented generation system over billions of documents that produces grounded, cited answers. The hard part of this problem is chunk-boundary loss and "lost in the middle": naive chunking destroys context across boundaries; long stuffed contexts cause LLMs to underuse mid-context information.
Ingest with semantic-aware chunking + overlap → embed (Matryoshka-style for cheap recall + full-dim precision) → vector DB (sharded by tenant, IVF-PQ within shard) → hybrid retrieval (BM25 + dense, RRF fusion) → cross-encoder reranker (top-100 → top-10) → augmentation with citation markers → LLM generation with grounded-answer requirements. Optionally HyDE / multi-query / multi-hop.
Pipeline
- Ingest: chunking (recursive, sentence-aware, ~500–1000 tokens with overlap; or semantic chunking via embedding similarity).
- Embedding: dense (E5, BGE, OpenAI text-embedding-3, Cohere embed v4) with possible Matryoshka representation (truncate dim for cheap recall, full dim for precision).
- Storage: vector DB (Pinecone, Weaviate, Qdrant) or built-in (FAISS sharded, ScaNN).
- Sharding: by tenant, then by embedding cluster (IVF). Replicas for QPS.
- Retrieval: hybrid (BM25 + dense with RRF or learned weighting — RRF sketch after this list), large k (~100). Optional ColBERT-style late interaction.
- Reranking: cross-encoder (bge-reranker, cohere-rerank) on top-100 → top-10.
- Augmentation: stitch retrieved chunks into prompt with source markers. Truncate to context budget.
- Generation: LLM with citation requirements (grounded).
- Optional: query rewriting (multiple sub-queries via LLM), chain-of-thought retrieval (multi-hop), HyDE (generate hypothetical answer, embed, retrieve).
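Reciprocal Rank Fusion is small enough to show whole — a minimal sketch, with toy doc IDs:

```python
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists (one per retriever) by reciprocal rank.
    k=60 is the conventional constant; it damps the head of each ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["d3", "d1", "d7", "d2"]      # lexical top-k
dense = ["d1", "d9", "d3", "d5"]      # ANN top-k
print(rrf([bm25, dense])[:3])         # fused list -> cross-encoder reranks
```

Because RRF only uses ranks, it sidesteps the problem of BM25 and cosine scores living on incomparable scales.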
Eval
- Retrieval: recall@k, MRR vs labeled relevance.
- End-to-end: faithfulness (RAGAS, TruLens), answer correctness, citation accuracy.
- Hallucination rate: NLI-based check.
Monitoring
Retrieval recall@k drift, citation accuracy, hallucination rate, latency per stage (embed, retrieve, rerank, generate), index staleness per tenant, embedding-model version pinning.
Embedding drift when re-embedding with new model — need full reindex; never mix versions.
Lost in the middle (Liu 2023) — LLMs underuse middle of long context. Mitigations: re-rank top-k aggressively, put best snippet first or last, truncate if context is too long.
Stale data — TTL + incremental reindexing.
Hallucinated citations — even with citation requirements, models invent source IDs. Validate citations against retrieved set programmatically.
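The programmatic check is cheap — a sketch, assuming a [doc_N] citation-marker format (the format itself is an assumption):

```python
import re

def invalid_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return citation markers that don't match any retrieved chunk."""
    cited = re.findall(r"\[(doc_\d+)\]", answer)
    return [c for c in cited if c not in retrieved_ids]

bad = invalid_citations("...as shown in [doc_12] and [doc_99].",
                        {"doc_12", "doc_40"})
print(bad)  # ['doc_99'] -> reject or regenerate the answer
```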
- Hybrid (BM25 + dense, RRF fusion) — never pure dense at billion-doc scale.
- Cross-encoder reranker is what makes top-10 actually correct.
- Chunking strategy matters more than people think — semantic + overlap.
- Faithfulness eval (RAGAS) + citation validation are the production safety nets.
Build a billion-vector ANN system that handles inserts, deletes, attribute filters, and consistent QPS. The hard part of this problem is the filter problem: vectors are easy to find, but combining ANN with attribute filters (tenant, time range, category) breaks the obvious algorithms in non-obvious ways.
HNSW for <100M (in-memory, best recall/latency); IVF-PQ for billion+ (compressed); ScaNN/DiskANN intermediate. Sharding by tenant then by clustering; replicas for QPS. Filters need inline traversal (attribute-aware HNSW) — pre/post-filter both have failure modes. Insert is easy, delete is hard (tombstones + compaction). Re-indexing is mandatory on embedding-model changes.
Algorithms
- Flat (brute force): exact, O(N·d). Good for <1M.
- IVF: cluster with k-means into C centroids. At query, search nearest probe centroids, scan their lists.
- PQ: split d-dim vector into M subvectors, quantize each with k=256 (1 byte each). Compressed asymmetric distance via lookup tables. IVF-PQ combines both.
- HNSW: hierarchical small-world graph. Multi-layer; greedy descent with ef-search controlling recall. Excellent for in-memory.
- ScaNN (Google): IVF + anisotropic quantization. Strong recall/latency.
- DiskANN (Microsoft): SSD-resident graph; ~2× slower than in-memory HNSW but 10× cheaper.
Memory: billion 768-d float32 = 3 TB; PQ to ~32 bytes/vec → 32 GB.
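A FAISS sketch of the IVF-PQ configuration at 32 bytes/vector (nlist, nprobe, and the training-set size here are toy values — a billion-scale index trains on a large sample and shards across nodes):

```python
import faiss
import numpy as np

d, nlist, m = 768, 1024, 32               # m=32 subquantizers x 8 bits = 32 B/vec
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer over IVF centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

xb = np.random.rand(200_000, d).astype("float32")
index.train(xb)                           # k-means centroids + PQ codebooks
index.add(xb)

index.nprobe = 64                         # centroids probed per query:
D, I = index.search(xb[:5], 10)           # the recall-vs-latency knob
```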
System
- Sharding: by tenant, then by ID range or by clustering.
- Replicas: for QPS and HA.
- Updates: insertion in HNSW is O(log N); deletion is hard (tombstones + periodic compaction). Re-indexing for embedding model changes.
- Filters: pre-filter (filter then search — bad for low-selectivity), post-filter (search then filter — bad for high-selectivity), inline filter (HNSW with attribute-aware traversal).
Eval
Recall@k against brute-force ground truth on a sample; query latency p50/p99; throughput at target recall.
Monitoring
Per-shard latency, recall sampling against brute-force baseline, index size, deletion-tombstone fraction, compaction lag, filter-selectivity histograms.
1B vectors × 768 dims × 4 bytes (FP32) = 3 TB. Won't fit on a single node. Three options: (a) shard across ~50 nodes of 64 GB each (in-memory HNSW per shard); (b) compress with IVF-PQ to ~32 bytes/vec → 32 GB total, fits on a few nodes; (c) DiskANN on SSD — same shard count as (a) but 10× cheaper hardware. Pick based on recall SLA: HNSW for highest recall, IVF-PQ for tightest budget.
Deletes: graph indices don't truly delete — tombstones + periodic full compaction.
Embedding-model drift: changing the embedding model means re-indexing the entire corpus. Plan for this; never mix vectors from different model versions in the same index.
- HNSW vs IVF-PQ trade-off — recall vs memory vs cost.
- Filters are the gotcha — name pre/post/inline distinction.
- Deletes need compaction; re-indexing is mandatory on model changes.
- Sharding by tenant first, then by clustering — multi-tenant isolation matters.
Build the feature store every ML team in a company shares: declarative feature definitions, online and offline serving, point-in-time joins for training, governance. The hard part of this problem is train/serve skew prevention and point-in-time correctness: training joins must look up feature values as of the label timestamp, never the current value.
Declarative feature definitions (Feast/Tecton) → batch (Spark/Flink) writes to offline warehouse + reverse-ETL to online KV → streaming (Kafka → Flink/Beam) for real-time aggregations → on-demand transforms at request time. Training uses point-in-time joins (feature value as-of label timestamp) to prevent leakage. Same code path online and offline = no train/serve skew.
Architecture
- Feature definition: declarative (Feast, Tecton). One source of truth.
- Storage: online low-latency KV (Redis, DynamoDB, Cassandra); offline data warehouse / lake (BigQuery, Snowflake, Iceberg) with full history.
- Compute: batch (Spark/Flink) → write to offline store; reverse-ETL → online. Streaming (Kafka → Flink/Beam) → real-time aggregations. On-demand transforms (compute at request time).
- Point-in-time joins: for training, join labels with feature values as of label timestamp — prevents leakage.
- Serving: feature fetch service fans out lookups, applies on-demand transforms, returns vector to model server.
Governance
Feature ownership, lineage, freshness SLAs, deprecation workflow, cost attribution.
Eval
Per-feature freshness SLA tracking, online/offline parity sampling (compute online, log, recompute offline, diff), feature-vector latency p99.
Label: user U clicked ad A at timestamp T. Training feature "U's 7-day click count" must be the value at T, not at training time. Naive join: select most recent feature row → leaks future clicks into training feature → model overfits a value it could never have at serving time. Correct join: pick the feature row with timestamp ≤ T (and ≥ T − staleness budget). Feast/Tecton implement this as a primitive.
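The same join in miniature with pandas merge_asof (illustrative only; Feast/Tecton run the equivalent over the warehouse):

```python
import pandas as pd

labels = pd.DataFrame({
    "user": ["u1", "u1"],
    "ts": pd.to_datetime(["2024-05-01 12:00", "2024-05-03 09:00"]),
    "clicked": [1, 0],
}).sort_values("ts")

features = pd.DataFrame({
    "user": ["u1", "u1", "u1"],
    "ts": pd.to_datetime(["2024-04-30", "2024-05-02", "2024-05-04"]),
    "clicks_7d": [3, 9, 40],
}).sort_values("ts")

# As-of join: for each label, the latest feature row with ts <= label ts,
# within a staleness budget. Row 1 gets clicks_7d=3, row 2 gets 9 —
# never the future value 40.
train = pd.merge_asof(labels, features, on="ts", by="user",
                      direction="backward", tolerance=pd.Timedelta("3d"))
print(train)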
Feature freshness drift — monitor per-feature staleness; alert when a streaming feature falls behind.
Duplicate features across teams — centralize discoverability; mandatory feature registration.
- Point-in-time correctness — name this unprompted.
- Online + offline stores backed by single feature definition (no skew).
- Streaming (Flink/Beam) for real-time aggregations is mandatory at scale.
- Governance + lineage + freshness SLAs are the org-scale problem.
Build the company's experimentation platform: assignment, logging, metrics, statistical engine, holdouts, heterogeneous treatment effects. The hard part of this problem is SUTVA violations in marketplaces and social networks: two-sided platforms break the "treatment of one user doesn't affect others" assumption, requiring cluster randomization or switchback designs.
Deterministic hash assignment in mutually exclusive layers (Google's Overlapping Experiments Infrastructure). Hierarchical metrics (North Star + proxy + guardrails). Stat engine with Welch t-test + CUPED variance reduction + sequential testing for early-stopping safety + multi-comparison correction. HTE via causal forests / meta-learners. Switchback or cluster randomization for marketplace effects. Long-term holdouts for novelty decay.
Components
- Assignment: deterministic hashing on (user_id, experiment_id) → bucket. Mutually exclusive layers (Overlapping Experiments Infrastructure, Tang 2010 — Google paper). Holdouts.
- Logging: exposure events (when user actually entered the experiment surface), interaction events.
- Metrics: hierarchical — North Star (DAU, revenue), proxy (CTR, sessions), guardrails (latency, error rate).
- Stat engine: Welch's t-test for means; CUPED variance reduction (Deng 2013; sketch after this list); sequential testing (mSPRT) for early-stopping safety; Bonferroni / BH for multiple comparisons.
- Heterogeneous treatment effects: causal forests, meta-learners (S/T/X-learners) to find subgroup wins.
- Switchback / interleaving for marketplace effects (network interference).
- Long-term effects: holdouts (a population that never sees treatment for months) to measure novelty decay.
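CUPED fits in a few lines — a sketch on synthetic data (the covariate here is the same metric from a pre-experiment period, the usual choice):

```python
import numpy as np

def cuped_adjust(y, x):
    """Y' = Y - theta * (X - mean(X)), theta = cov(X, Y) / var(X).
    x is a pre-experiment covariate correlated with the outcome y."""
    theta = np.cov(x, y)[0, 1] / np.var(x)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(1)
x = rng.normal(10, 3, 100_000)                   # pre-period metric per user
y = 0.8 * x + rng.normal(0, 1, 100_000) + 0.05   # in-experiment metric
print(np.var(y), np.var(cuped_adjust(y, x)))     # variance drops ~85% here
```

Same means, much tighter confidence intervals — which is why it ships at every major experimentation platform.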
Eval / monitoring
Sample ratio mismatch (SRM) detection; bucket-balance checks; metric stability (rolling baseline drift); experiment-coverage dashboard.
Simpson's paradox: aggregate effect can flip sign vs subgroup effects.
Peeking without sequential correction: looking at p-values daily and stopping when significant inflates false-positive rate massively.
Metric definition drift: definition changes mid-experiment.
Logging bugs that look like wins: the most common false-positive source. Always check exposure parity (SRM).
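The SRM check itself is one chi-square test (the alpha and the 50/50 split are assumptions; platforms typically alert well below p = 0.001):

```python
from scipy.stats import chisquare

def srm_detected(control_n, treatment_n, split=(0.5, 0.5), alpha=0.001):
    """Exposure counts should match the assignment split. A tiny p-value
    means assignment or logging is broken — don't trust the metrics."""
    total = control_n + treatment_n
    stat, p = chisquare([control_n, treatment_n],
                        f_exp=[total * s for s in split])
    return p < alpha

print(srm_detected(500_000, 503_500))  # True: ~0.35% imbalance at this scale
```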
- CUPED is variance reduction with a pre-experiment covariate — name it.
- Mutually exclusive layers (OEI) — multiple experiments on same user.
- SUTVA violations are the marketplace gotcha — switchback / cluster randomization.
- Sequential testing or no early-stopping — peeking inflates FPR.
Recommend items where the items are inherently multi-modal: short videos with audio + text + visual + author info. The hard part of this problem is modality fusion + missingness: not every item has every modality, and a model that assumes "always have video" breaks on items without one. Plus, video encoding is expensive — precompute and cache aggressively.
Per-modality pretrained encoders (CLIP/SigLIP for image-text, video transformer for clips) project into shared embedding space. Gated attention fuses modalities (gates handle missingness). User encoder over historical item embeddings. Two-tower retrieval with multimodal item tower. Cross-encoder ranker with multi-task heads. Precompute video encodings — they're expensive.
Architecture
- Item encoders: pretrained encoders per modality (CLIP/SigLIP for image-text, video transformer for clips). Project into shared embedding via projection head.
- Multi-modal fusion: gated attention combining modalities; quality of each modality gated (sketch after this list).
- User encoder: sequence model over user's historical item embeddings + side info.
- Two-tower retrieval with multimodal item tower; ANN.
- Cross-encoder ranker: full attention between user history and candidate, multi-task heads.
- LLM-augmented (optional): LLM generates natural language query/intent from history; semantic retrieval over multimodal embeddings.
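A minimal PyTorch sketch of gated fusion with a presence mask (dimensions and the single-linear gate are assumptions; missing modalities are zero-filled at input and masked out of the softmax):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse per-modality embeddings with learned gates; the mask ensures
    the model never assumes 'always has video'."""
    def __init__(self, n_modalities=3, dim=256):
        super().__init__()
        self.gate = nn.Linear(n_modalities * dim, n_modalities)

    def forward(self, mod_embs, present):
        # mod_embs: [B, M, D] projected per-modality embeddings (0 if missing)
        # present:  [B, M]    1.0 where the modality exists for this item
        B, M, D = mod_embs.shape
        logits = self.gate(mod_embs.reshape(B, M * D))          # [B, M]
        logits = logits.masked_fill(present == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)                 # over present only
        return (weights.unsqueeze(-1) * mod_embs).sum(dim=1)    # [B, D] fused

fused = GatedFusion()(torch.randn(4, 3, 256),
                      torch.tensor([[1., 1., 0.]] * 4))         # items with no video
```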
Training infra
Pretrain modality encoders separately; align via contrastive on (user_history, clicked_item); fine-tune end-to-end on engagement.
Serving
Precompute item embeddings on ingest; cache aggressively. Video encoding latency is hostile — never compute on the request path.
Eval
Per-modality ablation: how much does each modality contribute? Cold-start eval on items with subset of modalities. Standard recsys metrics (recall, NDCG, online engagement).
Expensive video encoding → precompute, cache; never on request path.
Modality bias: one modality (e.g., title text) dominates because it's always present and high-signal — model ignores video. Counter via per-modality dropout during training.
Cold-start — content-based via modality embeddings is the strength of this design (vs ID-only models).
- Pretrained per-modality encoders + projection heads — don't train end-to-end from scratch.
- Gated fusion handles missing modalities; per-modality dropout in training prevents bias.
- Precompute item embeddings; never encode video on request path.
- Cold-start is the strength — multi-modal beats ID-only on new items.
360° real-time perception at 10–30 Hz, sub-50ms latency, multi-sensor (camera + lidar + radar), high recall on safety-critical objects, must generalize to long tail (mattress on highway, construction worker holding a sign). The hard part of this problem is long-tail rare objects + auto-labeling at PB scale — you can't human-label everything, you need an offline-large-model auto-labeler with human verification on edge cases.
Sensor fusion via BEV (Bird's-Eye View) — modern preference is mid-fusion (BEVFormer / Lift-Splat-Shoot lifts camera features into 3D, splats into BEV grid, transformer over BEV). Multi-task heads: 3D detection, segmentation, lane, traffic light, pose, intent. Tracking via Kalman + re-ID. Trajectory prediction with multimodal output. Auto-labeling pipeline + simulation + shadow mode. End-to-end (Wayve / Tesla FSD V12) is the alternative.
Requirements
- 360° perception at 10–30 Hz
- Latency < 50ms
- Multiple sensors (camera, lidar, radar)
- High recall on safety-critical objects
- Generalize to long tail
Architecture
- Sensor fusion:
- Early: project lidar onto camera, jointly process.
- Late: per-sensor detections, fuse in BEV.
- Mid (modern): BEVFormer / Lift-Splat-Shoot — lift camera features into 3D via depth distribution, splat into BEV grid, transformer over BEV.
- Backbone: per-camera ConvNet/ViT; temporal fusion (multiple frames).
- Heads: 3D object detection (bbox, class, velocity), semantic/instance segmentation, lane detection, drivable area, traffic light state, pose, intent prediction.
- Tracking: Kalman / data association; re-ID embeddings.
- Prediction: trajectory prediction (Trajectron, MultiPath, MTR) — multimodal trajectory distributions.
- End-to-end alternatives: Wayve / Tesla FSD V12 — camera → planning, trained on driving demonstration + RL from interventions.
Training infra
PB-scale data; auto-labeling pipeline (large offline model labels, humans verify edge cases); simulation for rare scenarios; mining hard negatives from fleet.
Eval
Offline (mAP, IoU, ADE/FDE); closed-loop sim; shadow mode (parallel run, log disagreements); on-road (disengagement rate).
Monitoring
Per-class recall (especially safety-critical: pedestrian, cyclist), calibration drift between sensors, latency p99, disengagement rate per region/condition, sim-to-real gap.
Calibration drift between sensors: lidar-camera extrinsics shift with vibration; periodic recalibration is mandatory.
Adversarial conditions (rain, sun glare, dust): per-condition eval slices.
Prediction multimodality — collapse to mean is dangerous (predicting average of "turn left" and "turn right" = drive straight into oncoming traffic). Use multi-modal trajectory output (MTR-style).
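A minimal winner-takes-all regression term illustrating why modes don't collapse — only the closest hypothesis gets gradient, so hypotheses specialize (MultiPath/MTR additionally train a mode-probability head; shapes here are assumptions):

```python
import torch

def best_of_k_loss(pred_trajs, gt_traj):
    """pred_trajs: [B, K, T, 2] K trajectory hypotheses; gt_traj: [B, T, 2].
    Penalize only the best hypothesis per example."""
    dists = ((pred_trajs - gt_traj.unsqueeze(1)) ** 2).sum(-1).mean(-1)  # [B, K]
    return dists.min(dim=1).values.mean()

loss = best_of_k_loss(torch.randn(8, 6, 30, 2), torch.randn(8, 30, 2))
```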
OTA model rollout with canary fleets — never big-bang deploy.
- BEV fusion is the modern default — name BEVFormer / LSS.
- Auto-labeling pipeline + sim + shadow mode is the data flywheel.
- Trajectory prediction must be multimodal — never collapse to mean.
- Long tail is the hard problem — active mining is the answer.
Phase the time explicitly. Talk less in clarification, more in deep-dive. Always cover eval AND monitoring AND gotchas. End with "what would I do with more time" — it shows scope-discipline. Never claim a single technology solves everything.
The phase-by-phase tactic sheet
- 0–5 min: clarify requirements, scale, what to optimize, what's allowed. Confirm with interviewer. Write the constraints on the board.
- 5–10 min: high-level architecture (boxes + arrows). State the funnel/stack.
- 10–25 min: deep-dive 1–2 components the interviewer signals interest in. Discuss alternatives. This is where 50% of your signal lives.
- 25–35 min: training infra + serving infra + eval (online + offline).
- 35–45 min: gotchas, monitoring, what would you do differently with more time.
The "if I had more time" close
End every loop with 2–3 things you'd explore further: alternative architectures, deeper eval, harder gotchas. This signals scope-discipline (you knew what you skipped) and intellectual honesty (you don't claim to have solved everything in 45 min).
- Phase the time explicitly — say "I'll spend 5 min clarifying, then 5 sketching, then 15 deep-diving the ranker."
- Eval + monitoring + gotchas are 30% of the score; don't run out of time.
- "If I had more time" is the scope-discipline signal.
- Trade-offs over absolutes: "I'd pick X because Y, even though Z."
ML system design quiz — readiness check
- Walk through YouTube's 2-stage funnel.
Show answer
Retrieval (~1B → 1k): two-tower + ANN, plus other sources (collab, fresh, trending). Ranking (~1k → 100): cross-encoder MMoE with multi-task heads (pCTR, pWatchtime, pLike, pDislike). Final scoring + diversity (MMR/DPP) + business rules → top 10.
- How would you handle cold start for new users / new items?
Show answer
New user: content-based recs from session signals (initial interactions, demographics, source); explore via bandits. New item: content-based features (text/image embeddings) drive retrieval; explore via mandatory exploration slots; meta-learning if you have many items per category.
- Your CTR model has high offline AUC but loses in A/B test. Diagnose.
Show answer
Suspect (1) selection bias / exposure bias — train log only contains items the old model showed; (2) train/serve skew — different feature pipeline; (3) miscalibration — AUC measures ranking, not calibrated prob; (4) shift in business mix; (5) novelty: model fits historical patterns that no longer hold.
- Why does an MMoE outperform a shared-bottom multi-task model?
Show answer
Shared-bottom forces all tasks through one trunk → negative transfer when tasks conflict. MMoE has multiple experts shared across tasks; per-task gates softly mix experts. Each task can pick its preferred expert combination → less interference.
- How do you debias position effects in a learned ranker?
Show answer
(1) PAL / position-aware learning: feed position as a feature during training; set to "no position" at inference. (2) Two-tower: shallow tower predicts position effect; main tower predicts relevance. (3) Click models (PBM, Cascade, DBN). (4) Result randomization for unbiased data.
- In-batch negatives vs explicit hard negatives — tradeoff?
Show answer
In-batch: cheap, popularity-biased (high-frequency items appear more as negatives). Hard negatives (top non-positive from ANN): faster learning per example, risk of false negatives (true positives that look similar). Mixed Negative Sampling (MNS): combine in-batch + uniform-random; logQ correction for popularity debiasing.
- HNSW vs IVF-PQ — when each?
Show answer
HNSW: best recall/latency in memory; high memory (no compression). Use for < 100M vectors with quality SLA. IVF-PQ: compressed (32×–128×); scales to billions; recall lower. Use for billion+ vectors at modest QPS. ScaNN/DiskANN intermediate.
- Your retrieval layer returns the same 1000 items for everyone. What's wrong?
Show answer
Likely: (1) user tower collapsed (output ≈ constant — check embedding norm and variance); (2) lack of personalization features in user tower; (3) popularity-biased sampled-softmax without logQ correction → all items pushed toward popular ones; (4) ANN index returning popular items only (debiasing lost in serving).
- How do you calibrate a multi-task ranker where each task has different positive rates?
Show answer
Per-task isotonic regression on a held-out window. For each task head's logit, fit monotonic mapping logit → calibrated prob using actual positive rate per quantile. Recalibrate frequently (weekly) under distribution shift. Joint calibration if tasks are correlated.
- Migrate DLRM-style ranker to generative recommender (TIGER/HSTU) — risks?
Show answer
(1) Loss of explicit business signal control (multi-task heads gone). (2) Auto-regressive serving latency. (3) Cold-start changes (semantic IDs from RQ-VAE replace ID embeddings). (4) Training-time interactions with online behavior (counterfactual eval harder). (5) Capacity for personalization in generative model still being scaled out. Run as A/B with strict guardrails.
- Design a RAG system at billion-doc scale.
Show answer
Chunking (semantic, ~500–1000 tok with overlap) → embedding (E5/BGE/Cohere) → vector DB (sharded by tenant, IVF-PQ within shard) → hybrid retrieval (BM25 + dense, RRF) → cross-encoder reranker → LLM generation with citations. Eval: recall@k, RAGAS faithfulness, citation accuracy.
- Design a feature store with point-in-time correctness.
Show answer
Online (sub-ms KV) + offline (warehouse with full history). Streaming compute (Flink/Beam) → online + offline. For training, point-in-time joins: for each (entity, label timestamp), look up feature value as-of the label time. Prevents leakage of future info into training features.
- What's CUPED?
Show answer
Variance reduction for A/B tests (Deng 2013). Adjust the outcome by a pre-experiment covariate:
Y′ = Y − θ(X − E[X]), where θ = Cov(X, Y) / Var(X) and X is a pre-experiment covariate correlated with Y. Reduces variance → smaller sample size for the same power. Used at every major web company.
- How do you avoid SUTVA violations in marketplace experiments?
Show answer
SUTVA = stable unit treatment value assumption — treatment of one unit doesn't affect others. Violated by network effects (social) or supply-side competition (DoorDash, Uber). Solutions: cluster randomization (region-level), switchback (alternate treatment over time), counterfactual interleaving for ranking.
- Design ChatGPT serving for 200M DAU.
Show answer
Disaggregated prefill + decode pools. Prefix caching (system prompts shared). Continuous batching. TP=8 within node for 70B+. Speculative decoding. FP8 weights. Model routing (cheap vs reasoning). Multi-tier autoscale. Safety filtering pipeline pre/post inference. SLO-aware scheduling.