Recommender Systems — Sr Staff edition
A course-level reference for the 2026 interview loop. Built for an experienced RecSys engineer who needs to articulate the field crisply enough to clear Pinterest / DoorDash / Reddit / TikTok / Snap / Anthropic / Databricks Sr Staff bars. The depth here is what you should be able to say out loud at a whiteboard, not what you read off a slide.
Contents
Part 1 — RecSys architecture
1.1 The canonical funnel
Every large-scale recommender — YouTube, Instagram, TikTok, Pinterest, DoorDash, Spotify, Reddit — has the same shape. You start with hundreds of millions to billions of candidate items and you have ~50–200ms end-to-end to return a sorted list of 10–50 to the user. You cannot run a deep model over 1B items per request, so you build a cascade of progressively more expensive models over progressively smaller candidate sets.
Latency budget — typical
| Stage | P99 budget | Compute per request | Why this budget |
|---|---|---|---|
| Request decode + auth | 5 ms | — | Edge layer, GeoDNS routing |
| User-context fetch | 10 ms | KV reads | User embedding, recent history, last 100 events |
| Retrieval (parallel) | 10–20 ms | ANN over millions | Multiple sources hit in parallel; slowest dominates |
| Feature hydration | 10–15 ms | Bulk KV / online store | Item embeddings + counters for ~10k candidates |
| Ranking inference | 20–40 ms | 1 GPU forward / batch of ~500 | Deep model, multi-task heads |
| Re-rank + policy | 5 ms | — | MMR, frequency caps, business rules |
| Total P99 | ~150 ms | — | Beyond this users feel jank on a feed scroll |
1.2 Retrieval / candidate generation
Retrieval is responsible for recall. It does not need to score perfectly — it just needs to ensure the truly relevant items are among the ~10k passed to ranking. Modern systems run multiple retrieval sources in parallel and union them: collaborative, content, social, geo, freshness, exploration. Each source is allowed to be biased; the union is what matters.
Two-tower (DSSM)
The dominant architecture. A user encoder consumes user features (id embedding, recent history sequence, demographics, context) and produces a vector u ∈ ℝd. An item encoder consumes item features (id embedding, content embeddings, taxonomy) and produces v ∈ ℝd. Score = ⟨u, v⟩ (or cosine). Trained with sampled-softmax or in-batch negatives so item encoder can be precomputed offline and indexed in an ANN. Only the user tower runs at request time.
Why two-tower wins for retrieval but not for ranking: a dot product cannot model feature crosses across user × item (e.g. "user from SF × restaurant in SF" gets the same score as "user from SF × restaurant in NYC" if those interaction features aren't in the encoders). For retrieval, this is acceptable because ranking will fix it. For ranking, it's a dealbreaker.
Negative sampling — the central question of two-tower training
- In-batch negatives. For each positive (u, v⁺), use other items in the batch as negatives. Cheap, but biased toward popular items (because popular items appear more often as positives, so they appear more often as negatives, and the gradient pushes them down). Google's sampling-bias-corrected two-tower paper (Yi et al., 2019, built for YouTube retrieval) fixes this with logQ subtraction: subtract log(sampling probability) from each logit.
- Mixed negatives. Add uniform random negatives (from full corpus) to in-batch — fixes the popularity bias.
- Hard negatives. Mine items the model currently scores high but are not positives. Greatly improves discrimination but can collapse if hard negatives are actually mislabeled positives ("false negatives"). Common recipe: top-100 from a previous-version model, then sample 1–5 per positive.
- Cross-batch negatives. Maintain a queue of recent batch embeddings (MoCo-style) for larger negative pools without OOM.
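To make the logQ mechanics concrete, here is a minimal PyTorch sketch of in-batch sampled softmax with the correction. The function name and the assumption that per-item sampling frequencies arrive as item_log_q (e.g. from a streaming frequency counter) are illustrative, not from any particular paper's code:

```python
import torch
import torch.nn.functional as F

def in_batch_sampled_softmax(user_emb, item_emb, item_log_q, temp=0.05):
    # user_emb, item_emb: [B, D] tower outputs for aligned (user, pos-item) pairs.
    # item_log_q: [B] estimated log sampling probability of each batch item.
    user_emb = F.normalize(user_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)
    logits = user_emb @ item_emb.T / temp        # [B, B]; diagonal = positives
    logits = logits - item_log_q.unsqueeze(0)    # logQ correction, per column
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, labels)       # each row: 1 pos vs B-1 negs
```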
Item-to-item collaborative filtering
Classical: build a co-occurrence matrix from sessions, normalize (cosine, Jaccard, or PMI), and serve "items frequently watched after X." Old but still extremely strong for fresh sessions and cold users where you have one anchor item. Amazon's "customers who bought this also bought" was item-item CF. Today usually combined with neural retrieval, not replaced.
Matrix factorization (ALS)
Decompose the user–item interaction matrix R ≈ U Vᵀ. Implicit-feedback ALS treats observed interactions as confidence-weighted positives, all others as low-confidence zeros. Closed-form per-user/per-item updates make it embarrassingly parallelizable on Spark. Limitation: shallow, no side features (without extensions like LightFM).
Graph-based retrieval — PinSAGE, GraphSAGE, LightGCN
Treat the user–item bipartite graph (and optionally item–item co-engagement graph) as the substrate. Aggregate features from a node's neighborhood with learned weights. PinSAGE scaled this to ~3B node graphs at Pinterest by sampling neighborhoods with random walks. LightGCN strips out non-linearities and feature transforms — just iterative neighborhood averaging — and matches/exceeds heavier GCNs on retrieval benchmarks because in collab-filtering the embedding propagation is what matters, not the MLP.
Sequential models for retrieval — SASRec, BERT4Rec
Treat a user's interaction history as a sequence and use a transformer to predict the next item. SASRec uses causal attention (predict next from prefix); BERT4Rec uses bidirectional masked-item modeling. The "user representation" is the encoder output at the last position. The item embeddings are the input/output embedding table. Retrieval = top-K of W_itemᵀ · h_last.
Generative retrieval — TIGER, RecGPT
The newest paradigm. Items are tokenized into semantic IDs (a short tuple of discrete codes from an RQ-VAE trained on item content embeddings), then a transformer is trained to autoregressively generate the semantic ID of the next item. Retrieval becomes beam search. Strengths: generalizes to cold items (because the semantic ID is content-derived), shares parameters across items, can scale with model size. Weaknesses: beam search is slower than ANN; semantic IDs need re-training as the corpus drifts.
Mixed retrieval — the production pattern
No top-tier system uses one retrieval source. A typical Sr Staff answer:
- Two-tower neural retrieval (~3000 candidates) — captures personalization
- Item-item / "more like this" from current session anchor (~1000) — captures intent
- Trending / freshness pool (~500) — captures discovery
- Social: items engaged by friends in last 24h (~500) — captures network
- Geo / context: nearby restaurants, local creators (~500)
- Exploration source: random sample from underexposed long-tail (~200) — combats popularity feedback loops
Total ~5–10k candidates after dedup, all hydrated and passed to ranking. The diversity of sources is itself a debias mechanism — if your only retrieval is two-tower, popularity bias compounds.
1.3 ANN serving
Once you have item embeddings indexed, retrieval is "given query vector q, return top-K nearest neighbors by inner product or cosine." Exact NN is O(N·d) per query — for N=100M, d=128, that's 12.8B FLOPs per request. Approximate methods get you 100–1000x speedup with <1% recall loss.
HNSW (Hierarchical Navigable Small World)
A multi-layer proximity graph. Top layers are sparse (long-range links); bottom layer contains all points. Search starts at the top entry point, greedily descends by following edges to the closest neighbor of the query, then drops to the next layer. At the bottom layer, performs a beam search of width efSearch.
Key parameters and how they trade off:
- M — max edges per node. Higher M ⇒ better recall but more memory (each node stores ~M neighbor IDs per layer it appears in). Typical 16–48.
- efConstruction — beam width during index build. Higher ⇒ better graph quality, slower build. Typical 100–400.
- efSearch — beam width at query time. Higher ⇒ better recall, slower query. Tunable per query, often the production knob.
- Memory: ~(d × 4 bytes + M × 8 bytes × levels) per item. For N=100M, d=128, M=32: ~75 GB. Doesn't fit on one box → shard or use IVF-PQ.
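A minimal Faiss sketch of these knobs (random data for illustration):

```python
import faiss
import numpy as np

d = 128
index = faiss.IndexHNSWFlat(d, 32)        # M = 32 edges per node
index.hnsw.efConstruction = 200           # build-time beam width
xb = np.random.rand(100_000, d).astype("float32")
index.add(xb)
index.hnsw.efSearch = 64                  # query-time beam width: the prod knob
D, I = index.search(xb[:5], 10)           # top-10 neighbors for 5 queries
```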
IVF-PQ (Inverted File + Product Quantization)
Two-stage lossy compression:
- IVF (coarse). Cluster all vectors into nlist centroids (k-means). At query time, find the nprobe nearest centroids and only search items in those Voronoi cells.
- PQ (fine). Split each d-dim vector into m sub-vectors, run k-means with K=256 on each sub-space. Each item is now stored as m bytes (one cluster ID per sub-space). Distances are computed via precomputed lookup tables — extremely cache-friendly.
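The IVF-PQ counterpart in Faiss — nlist, nprobe, and m map directly onto the constructor and search-time parameters:

```python
import faiss
import numpy as np

d, nlist, m = 128, 4096, 16               # 16 sub-vectors -> 16 bytes per item
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer over centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-code
xb = np.random.rand(1_000_000, d).astype("float32")
index.train(xb)                           # k-means for centroids + PQ codebooks
index.add(xb)
index.nprobe = 32                         # Voronoi cells visited per query
D, I = index.search(xb[:5], 10)
```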
HNSW vs IVF-PQ — when to pick which
| | HNSW | IVF-PQ |
|---|---|---|
| Recall@10 (typical) | 0.95–0.99 | 0.85–0.95 (with refinement: 0.97) |
| Memory | Full vectors + graph (large) | Few bytes per item (small) |
| Latency | ~1 ms / query, very flat | ~3–5 ms, depends on nprobe |
| Build time | Hours for 100M | Minutes (k-means dominates) |
| Updates | Insert/delete supported but degrades | Easy to add to IVF cells |
| Sweet spot | ≤ 100M items, low-latency, plenty RAM | ≥ 1B items, GPU/disk-bound, OK with slight recall loss |
Other notable systems
- ScaNN (Google). Anisotropic vector quantization — quantizes to minimize loss in inner product direction specifically (instead of L2). Often best recall/latency tradeoff at moderate scale. Used in Vertex Matching Engine.
- Faiss (Meta). Library, not a service. Provides HNSW, IVF, IVF-PQ, OPQ, GPU implementations. Standard backend.
- Vespa (Yahoo / Verizon). Production search+ranking platform. HNSW + tensor evaluation in the same query — supports retrieval, filtering, and small ranking models in one engine. Used at Spotify, Tumblr, Yahoo.
- Vald, Milvus, Weaviate, Qdrant. Open-source vector DBs. Mostly HNSW under the hood, differentiated on operability, filtering, and hybrid search.
- DiskANN (Microsoft). SSD-resident graph index. Trick: store full vectors on SSD, compressed (PQ) vectors in RAM for routing. Lets you serve 1B+ vectors from a single machine with ~5–10 ms latency.
Filtering / hybrid search — the painful problem
Real recsys queries always have filters: "in stock," "shippable to my zip," "not blocked by user," "language=en." How to combine filters with ANN is non-obvious:
- Pre-filter: restrict the corpus to matching items, then search. Awkward at both extremes: if the filter is unselective (e.g. "in stock" matches 99% of items), you've removed almost nothing yet can no longer reuse the prebuilt ANN index over the subset; if it's highly selective, so few candidates remain that exact NN over them is the right tool anyway.
- Post-filter: ANN over everything, then drop disqualified items from the top-K. Fast but you may return fewer than K, or even 0.
- In-graph filtering (HNSW with attribute predicates): during graph traversal, skip neighbors that don't satisfy the predicate. Most modern systems (Vespa, Milvus, Qdrant) implement this. Caveat: if filter is very selective, the graph becomes disconnected and recall collapses → fall back to exact for those queries.
- Multi-index: partition the corpus by attribute (e.g. one index per country) and query the relevant index. Best when filter cardinality is small.
1.4 Ranking
The ranking stage scores each candidate from retrieval. Unlike retrieval, ranking can use rich user×item interaction features and deep architectures because the candidate set is small (~hundreds to a few thousand). Sr Staff interviews go deep here — be ready to defend choices about loss, architecture, multi-task structure, and calibration.
Loss formulations
- Pointwise (binary cross-entropy). For each (u, i) example with label y ∈ {0,1}, predict p(click). Easy to train, easy to calibrate (probability is meaningful), but doesn't directly optimize ranking — two items both with p=0.3 are "tied" even if one is better.
- Pairwise (BPR, RankNet). Sample pairs (i⁺, i⁻) and train so score(u, i⁺) > score(u, i⁻). Optimizes order, but loses absolute calibration. BPR-MF is the classic implicit-feedback baseline.
- Listwise (LambdaMART, ListNet, NeuralNDCG). Loss is a function of the entire ranked list. LambdaMART (gradient-boosted trees with lambda gradients from NDCG swap deltas) is still the workhorse for search/ads ranking — top-of-leaderboard on classical learning-to-rank tasks.
In practice, a Sr Staff answer is: pointwise BCE for the underlying probability heads (with calibration) plus a pairwise auxiliary loss when the metric you care about is ordering and your data is heavily implicit. For multi-task (CTR, like, share, completion), pointwise BCE per task with shared bottom is the baseline.
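A minimal sketch of the two loss families in PyTorch (function names are mine):

```python
import torch
import torch.nn.functional as F

def pointwise_bce(logits, labels):
    # Calibratable: sigmoid(logit) is a meaningful probability after fitting.
    return F.binary_cross_entropy_with_logits(logits, labels)

def bpr_loss(pos_scores, neg_scores):
    # Pairwise BPR: maximize log sigmoid(score(i+) - score(i-)).
    # Optimizes order between the pair; absolute calibration is lost.
    return -F.logsigmoid(pos_scores - neg_scores).mean()
```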
Deep ranking architectures
Wide & Deep (Google, 2016). Linear/cross-feature "wide" arm + DNN "deep" arm. Wide handles memorization (specific user×item co-occurrence indicators); deep handles generalization (embedding lookups + MLP). Still a useful mental model for "what features go where."
DeepFM, DCN, DCN-v2. Add explicit cross-feature layers. DCN-v2's cross network multiplies embeddings layer-by-layer to learn polynomial interactions of arbitrary degree, with cost roughly linear in feature count. Outperforms vanilla MLP because MLPs are surprisingly bad at learning multiplicative feature interactions.
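A sketch of one DCN-v2 cross layer, assuming the standard formulation x_{l+1} = x₀ ⊙ (W x_l + b) + x_l; stacking L layers yields polynomial interactions up to degree L+1:

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    # One DCN-v2 cross layer. x0 is the original input embedding vector,
    # xl the output of the previous cross layer.
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=True)

    def forward(self, x0, xl):
        return x0 * self.w(xl) + xl   # elementwise product = explicit crossing
```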
AutoInt. Multi-head self-attention over feature embeddings to learn arbitrary interactions automatically. Heavier than DCN; sometimes wins, sometimes doesn't.
DLRM (Meta, 2019). The reference Meta architecture for years. Sparse features (id-types) → embedding tables; dense features → bottom MLP; pairwise dot products of all (sparse_emb, dense_proj) pairs → top MLP → logits. The dot-product interaction is the inductive bias that captures "this user × this item × this context."
Multi-task learning — Shared Bottom, MMoE, CGC/PLE
Real systems optimize multiple objectives jointly: click, like, share, watch-time, complete, follow, hide. Three architectures dominate:
- Shared Bottom. One MLP "trunk," then per-task heads. Simple, parameter-efficient. Fails when tasks conflict — gradients average out and no task wins.
- MMoE (Multi-gate Mixture of Experts). N expert MLPs share the input. Each task has its own gating network producing weights over experts. Tasks can specialize on different experts; shared experts capture common structure. Originated at Google for YouTube ranking.
- PLE / CGC (Tencent). Adds task-specific experts that only their own task can use, plus shared experts. Cleaner separation, better when tasks really do conflict (e.g. "share" and "long-watch" pull in different directions).
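A minimal MMoE sketch in PyTorch (dimensions and names illustrative) — note how each task gets its own softmax gate over a shared expert pool:

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, in_dim, expert_dim, n_experts, n_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)])
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        self.heads = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(n_tasks)])

    def forward(self, x):
        e = torch.stack([ex(x) for ex in self.experts], dim=1)  # [B, E, D]
        logits = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)    # [B, E, 1]
            task_repr = (w * e).sum(dim=1)                      # task-specific mix
            logits.append(head(task_repr).squeeze(-1))
        return logits   # one logit per task
```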
Why MMoE outperforms shared-bottom (the interview answer)
In shared-bottom, all tasks must compromise on a single trunk representation. If two tasks have negatively correlated labels (e.g. "fast click" vs "long watch"), their gradients destructively interfere and the trunk learns a weak average. In MMoE, each task routes to a different mixture of experts; the negative correlation can be expressed as the two tasks attending to different experts, and gradients no longer fight at the parameter level. Empirically this shows up most strongly when tasks have different positive rates (e.g. click 5% vs share 0.1%) and different feature dependencies — exactly the regime feed ranking lives in.
Multi-objective combination
You have p(click), p(like), p(share), p(complete), p(follow). What do you optimize? Three approaches:
- Linear scalarization. Score = Σ wk · f(pk). Weights are tuned via A/B test guidance from product (e.g. "increase shares 5% without losing CTR"). f might be log or identity. The interview-correct phrasing: "weights are not learned end-to-end — they're a product artifact, set by leadership in service of the ecosystem KPI."
- Pareto frontier. Treat as multi-objective optimization, find Pareto-optimal weight vectors. Useful for offline analysis but practical systems still ship a single weight vector.
- Constrained optimization. Maximize one (e.g. watch time) subject to others ≥ baseline. Dual / Lagrangian methods. More principled but harder to operate.
Calibration
Critical when scores feed into auctions (ads), bid shading, or business rules expecting probabilities. Multi-task models are systematically miscalibrated because:
- Logging policy biases — items shown more often have more training data and shrink toward base rate; rare items inflate.
- Negative downsampling during training breaks the link between predicted probability and true probability (need to back out: p_true = p_pred / (p_pred + (1 − p_pred) / r) where r is sample rate).
- Multi-task gradients pull each head toward an "average" calibration.
Common fixes: Platt scaling (logistic regression on top of logits, fit on holdout), isotonic regression (non-parametric monotonic mapping, more flexible but needs more data), temperature scaling (one scalar T fit to minimize NLL — under-fits but stable). For multi-task, calibrate each head separately, then combine. Track ECE (Expected Calibration Error) and reliability diagrams in dashboards.
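A sketch of two of these fixes — the downsampling back-out formula from above plus isotonic calibration via scikit-learn (dummy holdout data for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def undo_negative_downsampling(p_pred, r):
    # r = fraction of negatives kept at training time (0 < r <= 1).
    return p_pred / (p_pred + (1.0 - p_pred) / r)

# Isotonic calibration, fit per task head on a held-out slice.
rng = np.random.default_rng(0)
holdout_scores = rng.uniform(size=10_000)                       # raw head outputs
holdout_labels = (rng.uniform(size=10_000) < 0.5 * holdout_scores).astype(float)
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(holdout_scores, holdout_labels)
calibrated = iso.predict(np.array([0.1, 0.5, 0.9]))             # serving-time map
```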
1.5 Sequence-aware ranking
The single biggest accuracy lift in modern recsys, beyond going deep, has been treating user history as a sequence with attention rather than a bag of average-pooled embeddings. Below are the canonical models.
DIN — Deep Interest Network (Alibaba, 2018)
Idea: not all of a user's history is relevant to the current candidate. For a candidate i, compute attention weights over the user's past items based on similarity to i, then weighted-sum the past item embeddings into a "user-interest-given-i" vector. This is essentially target attention. Solved a real production problem at Alibaba where users with broad interests had their representation diluted.
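A minimal target-attention sketch. For brevity this uses dot-product similarity with a softmax; the DIN paper itself scores each (history item, candidate) pair with a small MLP and does not normalize the weights:

```python
import torch

def target_attention(candidate, history, mask):
    # candidate: [B, D]; history: [B, T, D]; mask: [B, T], 1 for real events.
    scores = (history * candidate.unsqueeze(1)).sum(-1)          # [B, T]
    scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    weights = torch.softmax(scores, dim=-1)                      # attend to history
    return (weights.unsqueeze(-1) * history).sum(1)              # [B, D] interest-given-candidate
```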
DIEN — Deep Interest Evolution Network
Adds a GRU over the history to model temporal evolution of interests, then attention-weighted by relevance to candidate. Handles "user moved from electronics to home goods last week" gracefully.
BST — Behavior Sequence Transformer (Alibaba, 2019)
Replaces the GRU with a transformer encoder over the recent history. By 2019 this was the dominant architecture for sequence-aware ranking.
SASRec, BERT4Rec
These started as retrieval models but their hidden states make great sequence features for the ranker too. The pattern: pretrain SASRec on next-item prediction, then use the encoder output at last position as a frozen or fine-tuned input to the deep ranker.
HSTU — Hierarchical Sequential Transduction Unit (Meta, 2024)
Meta's new flagship architecture, replacing DLRM at Reels/IG ranking. Reframes recommendation as generative sequence modeling: tokens are user actions (impression, click, like, etc.) plus item IDs, the model learns to predict next-token. A single HSTU does both retrieval and ranking in one pass. Key tricks: (a) custom attention with relative positional encoding tuned for action sequences; (b) handles million-scale token vocabularies via sharded embeddings; (c) exhibits scaling laws — bigger HSTU keeps winning. The 2024 paper showed it strictly outperformed DLRM at the same FLOP budget at Meta scale.
LLM-augmented (ReLLA, P5, M6-Rec)
Use a pretrained LLM to encode item text/metadata and inject those embeddings into the ranker, or use the LLM directly as a reranker over the top-K. P5 reformulated all recommendation tasks as text-to-text; M6-Rec scaled this at Alibaba. Status in 2026: helpful for cold-start and content-rich domains (news, e-commerce), still too slow for top-of-funnel ranking at feed scale.
1.6 Generative recommendation
The frontier topic any 2026 Sr Staff loop will probe.
TIGER (Rajput et al., 2023)
Pipeline:
- Train a content encoder (text + image) to produce item embeddings.
- Train an RQ-VAE (Residual Quantized VAE) to compress each item embedding into a tuple of L codes (typically L=4, each code from a vocab of K=256). This is the item's semantic ID — semantically similar items share prefix codes.
- Train a vanilla transformer to autoregressively predict the next item's semantic ID given the user's history of semantic IDs.
- At serving, beam-search the top-K most likely semantic IDs and look them up.
Why it matters: (a) generalizes to new items (their semantic ID is computable from content with no training data), (b) the model size scales independently of corpus size (no embedding table per item), (c) beam-search controls retrieval and ranking jointly.
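A sketch of the residual quantization that produces a semantic ID. TIGER learns the codebooks jointly inside the RQ-VAE; this shows only the greedy encoding step at inference time:

```python
import numpy as np

def rq_encode(x, codebooks):
    # x: [d] content embedding; codebooks: list of L arrays, each [K, d].
    # Greedily quantize the running residual; the index tuple is the semantic ID.
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(((residual[None, :] - cb) ** 2).sum(-1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return tuple(codes)   # e.g. (17, 203, 5, 88) with L=4, K=256
```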
HSTU (Meta, 2024)
Different flavor of generative — keeps id-based item tokens but uses a custom transformer to do retrieval + ranking jointly. Trained on user action sequences with next-action prediction. Production-deployed at Meta at large scale; the 2024 paper showed clear scaling-law behavior in industrial recsys for the first time.
RecGPT framing
The umbrella name for "treat the user history as a token sequence and use an LLM-style decoder to generate next item." TIGER, HSTU, and various academic systems all fall under this. The Sr Staff framing: "we are converging recommendation, search, and conversational interfaces into one autoregressive system because semantic IDs let us share parameters across tasks and items, and because scaling laws are starting to apply."
Pros and cons vs traditional retrieval+ranking
| | Generative | Traditional cascade |
|---|---|---|
| Cold-start items | Strong — semantic IDs from content | Weak — needs interaction data for embedding |
| Scaling with model size | Yes (HSTU shows scaling laws) | Limited — sparse embeddings dominate |
| Latency | Beam search is slow vs ANN | Sub-10ms retrieval |
| Explainability | Opaque | Per-source attribution |
| Filtering / business rules | Hard to inject mid-generation | Easy — filter at ANN or post-rerank |
| Production maturity (2026) | HSTU at Meta; rest mostly research | Standard everywhere |
Part 2 — Eval, calibration, debiasing
2.1 Offline evaluation
The classical metrics
- HitRate@K — fraction of test users for whom the held-out item is in the top-K. Coarse but interpretable.
- Recall@K — for tests with multiple held-out items, the fraction of them that appear in top-K. The retrieval-stage metric.
- Precision@K — fraction of top-K that are positives. Less common in implicit feedback.
- MRR (Mean Reciprocal Rank) — 1 / rank of first relevant item, averaged over users.
- NDCG@K — discounts gain by log of position. The standard ranking metric. Sensitive to relevance labels at top positions.
- AUC — pointwise probability of ranking a positive above a random negative. Useful for binary classifier sanity, but a 0.01 AUC lift can correspond to massive online lift or none at all because AUC is an average over all (pos, neg) pairs while you only show top-K.
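A reference NDCG@K implementation, since off-by-one errors in the discount are a classic interview trap:

```python
import numpy as np

def ndcg_at_k(ranked_rels, k):
    # ranked_rels: relevance labels in model-ranked order for one user/query.
    rels = np.asarray(ranked_rels, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rels.size + 2))   # positions 1..k
    dcg = (rels * discounts).sum()
    ideal = np.sort(np.asarray(ranked_rels, dtype=float))[::-1][:k]
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0
```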
AUC vs NDCG — when each matters
AUC is position-insensitive: it asks "given any random pos and any random neg, is pos ranked higher?" NDCG is position-sensitive: lifts at rank 1 matter much more than lifts at rank 50. For a two-tower retrieval model, AUC is fine — you only care that positives are roughly above negatives. For a feed ranker where users see 10 items, NDCG@10 (or even NDCG@3) is what predicts online wins. Sr Staff phrasing: "AUC for retrieval, top-heavy metrics for ranking, and never trust AUC alone for the final ranker."
The offline–online correlation problem
The biggest hard-earned lesson in industrial recsys: offline metrics often do not correlate with online A/B test outcomes. Reasons:
- Selection bias. Logged data came from an old policy. New model is evaluated on items the old policy showed — it never sees the items it would now surface.
- Feedback loops. Offline you predict on historical context; online your predictions change future logs.
- Position bias. Logged labels are confounded with where they were shown.
- Multi-task interactions. A model that improves p(click) AUC may push down p(complete), and the linear combination loses.
- Distribution shift. Model trained on last month, served this month.
Counterfactual / IPS / DR estimators
For more honest offline eval, weight each logged event by 1/plogging(action), where plogging is the (logged or estimated) probability that the old policy showed this item to this user. This is the Inverse Propensity Score estimator. It's unbiased but has high variance when propensities are small. Doubly Robust (DR) combines IPS with a reward model: if either the propensity model or the reward model is correct, the estimator is unbiased. Production teams run replay eval (using IPS/DR) alongside traditional offline metrics, and trust the agreement.
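A sketch of a clipped, self-normalized IPS (SNIPS) estimate over logged events; the matches argument marks events where the new policy would have taken the logged action:

```python
import numpy as np

def snips_estimate(rewards, logged_propensities, matches, clip=10.0):
    # Clipping tames the variance blow-up from tiny propensities, at the cost
    # of some bias; self-normalization further stabilizes the estimate.
    w = np.minimum(1.0 / np.asarray(logged_propensities), clip)
    w = w * np.asarray(matches, dtype=float)
    return float((w * np.asarray(rewards)).sum() / max(w.sum(), 1e-9))
```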
2.2 Online evaluation
A/B test fundamentals
Sample size: n ≈ 16 σ² / Δ², for detecting an absolute effect Δ at 80% power, 5% significance, two-sided. For typical CTR (5%) and a 1% relative MDE (so absolute Δ = 0.0005), n is ~3 M per arm — and larger once per-user variance exceeds the Bernoulli bound (see the sketch after this list). For low-traffic surfaces, this means month-long tests or no test at all. Sr Staff candidates should know:
- Randomization unit. User-level (default), session-level (for short-lived effects), cluster-level (for network effects).
- Triggered analysis. Only analyze users who actually hit the treatment surface, not all users in the bucket. Reduces variance.
- Multiple comparisons. If you're tracking 50 metrics, expect 2–3 false-positive movements. Pre-register the primary metric.
- Novelty effects. First week of a new model often inflated by curiosity. Run for ≥ 14 days.
- Network effects. If your treatment changes what creators see (e.g. ranking changes affect creator behavior), randomization unit must be a connected component, not a user.
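The sample-size arithmetic referenced above, as a sketch:

```python
import math

def users_per_arm(base_rate, relative_mde, k=16.0):
    # n ~= 16 sigma^2 / delta^2 for ~80% power at 5% two-sided significance.
    sigma2 = base_rate * (1.0 - base_rate)     # Bernoulli variance bound
    delta = base_rate * relative_mde           # absolute effect size
    return math.ceil(k * sigma2 / delta ** 2)

print(users_per_arm(0.05, 0.01))   # ~3.0M per arm for CTR=5%, 1% relative MDE
```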
CUPED variance reduction
Subtract a regression-predicted "expected metric value" (from pre-experiment user features) from the observed metric. The residual has lower variance, so you detect smaller effects with the same n. Conceptually: control for "this user was always going to click a lot." Reduces required sample size by 30–60% in most consumer experiments. Microsoft / Booking.com / Netflix all use it.
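The whole method is a few lines (sketch; y and x are aligned per-user metric arrays):

```python
import numpy as np

def cuped_adjust(y, x):
    # y: metric during the experiment; x: the same users' pre-period metric.
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())    # unchanged mean, lower variance
```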
Interleaving
Instead of A/B (different users see A or B), each user sees a result list interleaved from both rankers (Team Draft Interleaving, Probabilistic Interleaving). Counts wins per ranker. Detects ranking improvements with ~10x less traffic than A/B because variance from "different users" is removed. Microsoft Bing's standard pre-A/B gating tool.
Multi-armed bandits
For exploration / cold-start / many-arm comparison:
- ε-greedy. Pick best arm with prob 1−ε, random arm with prob ε. Trivial; surprisingly hard to beat in practice.
- Thompson sampling. Maintain Bayesian posterior over each arm's reward; sample one parameter per arm per draw; pick the arm with the highest sampled value. Naturally allocates traffic proportional to posterior probability of being best.
- UCB (Upper Confidence Bound). Pick arm with highest mean + √(2 log t / n_arm). Provably good regret bound, but tuning the constant matters.
- Contextual bandits (LinUCB, neural bandits). Reward depends on context (user features). Foundation for personalized exploration.
Bandits in production usually serve either (a) cold-start exposure for new content, or (b) hyperparameter tuning across model variants. Pure bandit-driven feed ranking is rare — bandits don't naturally compose with the cascade.
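A Beta-Bernoulli Thompson sampling step, as a sketch:

```python
import numpy as np

def thompson_pick(successes, failures, rng=None):
    # Draw one plausible CTR per arm from its Beta posterior, play the argmax;
    # traffic self-allocates toward arms likely to be best.
    rng = rng or np.random.default_rng()
    samples = rng.beta(np.asarray(successes) + 1, np.asarray(failures) + 1)
    return int(np.argmax(samples))
```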
Long-term metrics
The hardest, most Sr-Staff topic. Short-term CTR may go up while 30-day retention goes down (the "engagement bait" failure mode). Real production teams maintain:
- 30/60/90-day retention
- L7/L28 active days
- Diversity / novelty (e.g. categorical entropy of items shown)
- Creator-side health (creator earnings distribution, cold creator survival)
- Reported satisfaction (star surveys)
- Hide / mute / report rates
The operationalized answer: any launch must move the primary metric without significantly hurting any guardrail. Long-term metrics are tracked in holdback experiments (5% of users on baseline indefinitely) so you can attribute multi-month effects.
2.3 Bias and fairness
Position bias
Items at the top of a list get clicked more regardless of relevance. If you train naively on click logs, your model learns "be like the previous model" rather than relevance. Two correction families:
- Examination hypothesis. Click = examine × relevant. Estimate p(examine | position) from a randomized "swap" experiment, then weight click loss by 1/p(examine).
- PAL (Position-Aware Learning). At training, add position as a feature into a separate "bias" head that is summed with the relevance head. At serving, fix position to a constant (e.g. position 1 or "no position"). Trick: the model learns to put position-dependent variance into the bias head, leaving a clean relevance head for serving. Used at Huawei / Pinterest.
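A sketch of the additive-logit ("shallow tower") variant of this idea — the bias head sees only position at training time and is dropped at serving (names and dimensions are mine):

```python
import torch
import torch.nn as nn

class BiasTowerRanker(nn.Module):
    def __init__(self, feat_dim, n_positions, hidden=128):
        super().__init__()
        self.relevance = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.position_bias = nn.Embedding(n_positions, 1)

    def forward(self, feats, position=None):
        logit = self.relevance(feats).squeeze(-1)
        if position is not None:                          # training path
            logit = logit + self.position_bias(position).squeeze(-1)
        return logit                                      # serving: relevance only
```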
Selection / exposure bias
Items never shown have no labels. Naive training converges to "show items the previous model would've shown." Mitigations: random exploration traffic (e.g. 1% of impressions are random), counterfactual reasoning (IPS), and explicit exploration sources in retrieval.
Popularity bias and item cold start
Popular items appear in training data far more often, both as positives and (via in-batch sampling) as negatives. The negative effect typically dominates and you under-rank popular items unless you correct. YouTube's logQ subtraction (subtract log of sampling probability from logits during training) corrects in-batch negatives. For cold items, hybrid retrieval (content tower) and explicit cold-item exploration boost are standard.
Calibrated logging
The single highest-leverage practice for honest training: log the propensity that the action was taken. If your serving system did Thompson sampling or stochastic rerank, log the probability. This lets you train with IPS or DR estimators, run honest counterfactual eval, and avoid the offline-online correlation collapse. Most teams ship deterministic ranking by default, then quietly add ε-randomization just to enable better training data.
Fairness — not just bias
Beyond position bias, "fairness" in recsys typically means: equal exposure across creator groups, demographic parity, or specific KPIs for protected categories. Common interventions: post-rank reordering to satisfy exposure constraints (Lagrangian / constrained-optimization methods), in-training fairness regularizers, or pipeline-level changes (e.g. "ensure ≥ 30% of impressions go to creators with < 10k followers"). Be ready to discuss tradeoffs with primary metric.
Part 3 — Embedding tables at scale
The number-one infra problem in industrial recsys. A modern ranker has tens of billions of parameters, ~95% of which are in sparse embedding tables (user IDs, item IDs, creator IDs, all the cross IDs). A single user-id embedding table at Meta or YouTube can be ~100B rows × 128 dim × 4 bytes = ~50 TB. No single GPU has that memory.
Sharding strategies
- Table-wise. Each whole table on one GPU. Simple, but balance is poor (some tables are huge, some small).
- Row-wise. Hash row IDs across GPUs; each GPU holds a slice of rows from each table. Best for one giant table.
- Column-wise. Each GPU holds all rows for a contiguous slice of columns. Useful when row count is moderate but dim is huge.
- Grid (table×row). Combination of the above; most flexible. TorchRec uses a planner (heuristic or ILP) that chooses the best sharding strategy per table given hardware.
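A table-declaration sketch in the TorchRec style (import paths and exact config fields vary by version — treat this as illustrative, not canonical):

```python
# Assumes TorchRec's top-level re-exports; table sizes are made-up examples.
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig

ebc = EmbeddingBagCollection(tables=[
    EmbeddingBagConfig(name="user_id", num_embeddings=500_000_000,
                       embedding_dim=128, feature_names=["user_id"]),
    EmbeddingBagConfig(name="item_id", num_embeddings=100_000_000,
                       embedding_dim=128, feature_names=["item_id"]),
])
# TorchRec's planner then picks a sharding (table-/row-/column-wise/grid) per
# table based on table size, pooling factor, and the hardware topology.
```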
Communication: all-to-all
For each minibatch, each GPU has some subset of needed embedding rows; the rest are on other GPUs. The pattern is:
- Forward all-to-all: each GPU sends "I need rows X, Y, Z from you" and receives back the embedding rows.
- Compute.
- Backward all-to-all: gradients flow the other way to update embeddings.
This is the hot loop of training. NVLink/IB bandwidth here is what limits throughput in production.
Compression and capacity tricks
- Hash trick. Map id → hash(id) % B for some bucket count B < cardinality. Loses some uniqueness; reduces memory by orders of magnitude. Used heavily for low-importance categorical features.
- Quantization. Store embeddings in int8 or fp8 instead of fp32. ~4x memory reduction with negligible accuracy loss for ranking. FBGEMM provides fused int8 lookup kernels.
- Mixed precision. Frequent items in fp16/bf16, infrequent in int8. Even fancier: differential dim — popular items get bigger embeddings.
- Composite embeddings. emb(id) = f(coarse_emb(hash1(id)), fine_emb(hash2(id))). Multi-resolution; reduces collision impact.
- UVM (Unified Virtual Memory). Embedding table lives in CPU host memory; GPU pages it in on demand via NVLink/PCIe. The hot rows naturally end up cached on GPU.
- Hot embedding cache. A small explicit GPU cache for top-N most accessed rows (popular items, popular users), backed by a slower CPU-resident store.
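A sketch of the composite/double-hash idea (hash constants arbitrary):

```python
import torch
import torch.nn as nn

class DoubleHashEmbedding(nn.Module):
    # Two independent hashes into two small tables, summed. Both lookups
    # colliding at once is far rarer than a single-hash collision.
    def __init__(self, n_buckets, dim):
        super().__init__()
        self.t1 = nn.Embedding(n_buckets, dim)
        self.t2 = nn.Embedding(n_buckets, dim)
        self.n = n_buckets

    def forward(self, ids):
        h1 = ids % self.n
        h2 = (ids * 2654435761 + 97) % self.n   # second cheap multiplicative hash
        return self.t1(h1) + self.t2(h2)
```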
Training challenges specific to embeddings
- Skewed access (Zipf). 1% of items get 50% of impressions. The hot rows get massive gradient updates, the cold rows almost none. Need rowwise-adaptive optimizers (RowwiseAdagrad) that maintain per-row state.
- Optimizer state size. Adam doubles or triples table memory (m and v per row). Often forced to use Adagrad / RowwiseAdagrad to keep state small.
- Gradient staleness. In async training, gradients computed against an old version of an embedding row are applied to a newer version. Hot rows are updated by many workers per step.
- Rebalancing. Hot tables become bottlenecks; periodic re-sharding (TorchRec's planner) is run during long training.
Stack landscape
- TorchRec (Meta). PyTorch-native, supports table-wise / row-wise / column-wise sharding, FBGEMM kernels for fast embedding ops, integrates with FSDP for the dense parts. Production at Meta.
- HugeCTR (NVIDIA). CUDA-native, optimized for NVLink/IB. Marketed for ads/recsys at NVIDIA-stack shops.
- Merlin (NVIDIA). The umbrella ecosystem (HugeCTR + NVTabular + Triton-based serving).
- Persia (Kuaishou/Kwai). Heterogeneous training — embeddings on CPU parameter servers, dense on GPU workers. Predecessor pattern.
Part 4 — Six worked ML system designs
Each at the depth a Sr Staff loop expects. The goal isn't to memorize architectures — it's to know which knobs you would turn and why. For every problem, walk the same template: clarifying Qs → scale → data → features → arch → training infra → serving infra → eval → monitoring → gotchas → "if I had more time."
Design 1 — YouTube recommendations (homepage feed + watch-next)
Clarifying questions
- Which surface? Homepage (browse intent) vs watch-next (continuation intent) vs search results (intent-driven). I'll cover homepage primarily and contrast.
- Logged-in only or include non-logged? Affects use of personal history.
- What are we optimizing? Watch time historically, but also satisfaction surveys, long-term retention, creator diversity.
- Are Shorts, Live, regular videos in one feed or separate? Assume mixed.
- Latency budget? ~150 ms end-to-end.
Scale assumptions
- ~2 B logged-in MAU, ~500 M DAU.
- Corpus: ~5 B videos historically; ~hundreds of millions "active enough to recommend."
- Peak QPS for homepage: ~200k.
- ~Hundreds of billions of user actions logged per day across all surfaces.
Data sources
- User watch history (with watch-time fraction, completion, like, dislike, save, share, subscribe).
- Search queries.
- Video metadata: title, description, channel, tags, transcript.
- Visual features: thumbnail and frame embeddings (a CNN/CLIP-style encoder offline).
- Audio features for music/Shorts.
- Creator features and creator-watcher graph.
- Surveys (1–5 star satisfaction) — used as a label for re-ranker calibration.
Architecture
Retrieval (multi-source).
- Two-tower neural retrieval. User tower over recent watch sequence (transformer encoder, 100 most recent items, attention pool) + demographic + context features. Item tower over content embeddings (text + thumbnail) + channel embedding + age/popularity. Trained on next-watch prediction with sampled-softmax + logQ correction. Output ~3000 candidates per request via HNSW shards (~100M item index, sharded by item ID hash, partitioned by language/region).
- Co-watch / item-item. For users with a clear current session anchor, retrieve items most often co-watched in the next 1–3 watches, normalized by popularity (PMI). ~1000 candidates.
- Subscription pull. Recent uploads from channels the user subscribes to or has watched recently. ~500.
- Trending / freshness pool. Hot videos in the user's region/language. ~500.
- Topic-personalized retrieval. User's top-K topics → top videos in those topics over last week. ~500.
- Exploration source. Random sample of underexposed creators / topics. ~200, with a flag so the ranker can downweight if confidence is too low.
Total ~5–6k candidates after dedup.
Ranking. A multi-task DLRM-style or HSTU-style model. Inputs: user features, item features, ~100-item watch history (sequence), context (time of day, device, country), candidate-specific cross features (past interaction with channel, topic affinity). Outputs: heads for p(click), p(watch_time_fraction), p(like), p(share), p(skip<5s), p(satisfaction_high|surveyed). Final score is a calibrated linear combination tuned via product A/B tests. Multi-task with MMoE or PLE because completion and click-bait pull in opposite directions.
Re-ranking / policy. Diversity (MMR over channels, topics — never two from the same channel back-to-back). Frequency caps (don't show creator X more than once per session). Business / quality rules (downrank borderline content, upweight kid-safe in restricted mode). Freshness boost for very recent uploads from subscribed channels.
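Since MMR recurs across these designs (here over channels/topics, in Pinterest's feed over visual embeddings), a minimal greedy implementation — scores are ranker outputs, emb whatever embedding you want diversity over; both names are mine:

```python
import numpy as np

def mmr_rerank(scores, emb, k, lam=0.7):
    # Greedy MMR: relevance minus max cosine similarity to already-picked items.
    scores = np.asarray(scores, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    picked = [int(np.argmax(scores))]
    while len(picked) < k:
        sim_to_picked = (emb @ emb[picked].T).max(axis=1)    # [N]
        mmr = lam * scores - (1.0 - lam) * sim_to_picked
        mmr[picked] = -np.inf                                # never re-pick
        picked.append(int(np.argmax(mmr)))
    return picked
```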
Training infra
- Daily incremental training on the previous day's logs; weekly full retrain. Streaming online updates for the embedding table for hot items.
- Several hundred GPU-days per training run; sharded embeddings (TPU embeddings or TorchRec).
- Two-tower retraining nightly (item tower frozen between regen of item index every 4–6 h).
Serving infra
- Edge → routing → recsys gateway → parallel retrieval fanout → feature hydration (online feature store, e.g. Bigtable/Spanner per-user; remote KV per-item) → ranker (GPU inference cluster, batched per request) → re-rank → response.
- User embedding cached for ~minutes (sticky session).
- HNSW shards run on CPU with SIMD, ~3 ms per query per shard.
- Ranker on GPU (batch ~500 candidates per user per request), fp16, ~25 ms.
Evaluation
- Offline: NDCG@10 on next-watch holdout, Recall@1000 for retrieval per source, calibration curves per task head, IPS-corrected replay for the ranker.
- Online: watch-time per user, sessions per day, satisfaction surveys (5-star), 30-day retention, creator-side metrics (uploads engaged, creator earnings if monetized).
- Holdback for long-term metrics (1% on baseline indefinitely).
Monitoring
- Online prediction quality: continuous calibration tracking (predicted vs actual CTR per slice).
- Coverage / diversity: distribution of impressions over creators/topics.
- Latency P50/P95/P99 per stage; auto-rollback if P99 > 200ms for > 5 min.
- Feature freshness: alert if any feature lag > 2 h.
Gotchas
- Click-bait. Optimizing CTR alone ruins the product. Watch-time-weighted training plus the satisfaction head are the bandages; ultimately you need a survey-informed re-rank.
- Filter bubbles. Collab-only retrieval narrows recommendations. Mandatory exploration source.
- Creator side. If small creators never get exposure, ecosystem dies. Track creator coverage as a guardrail.
- Cold videos. Two-tower with content embeddings handles this; explicitly boost first-24h impressions for new uploads from subscribed channels.
If I had more time
- Move to HSTU-style generative ranker, replacing DLRM trunk.
- Jointly optimize retrieval and ranking (gradient flow from ranking loss back through retrieval source weights).
- Train an LLM-based re-ranker over the top-50 to inject content understanding for niche topics.
- Causal-aware ranker (counterfactual reasoning about what user would have watched in absence of recommendation).
Design 2 — Pinterest home feed
Clarifying questions
- Single image-heavy feed; what's the unit? "Pins" (image + link). Mostly photo, some video.
- Are we optimizing repins, clickthroughs, hides, or session length? Pinterest has historically used a multi-objective combo with strong weight on long-term repin behavior (intent signal).
- Cold start is critical because users come for inspiration and many sessions are exploratory — retrieval needs to surface fresh, diverse content.
Scale
- ~500 M MAU, ~100 M DAU.
- Corpus: ~10 B pins.
- Each pin has rich visual + text; visual similarity is the dominant signal.
Architecture — distinctively visual
Retrieval.
- PinSAGE-style graph embeddings. Bipartite user×pin graph + pin–pin board co-occurrence. Aggregate via random-walk-sampled neighborhoods. Embeddings precomputed offline for all 10B pins.
- Visual content embedding. A vision transformer (or CLIP-aligned) embedding for every pin. Used for "more like this" (item-item retrieval), and as input feature.
- Two-tower neural retrieval. User tower over recent pins/boards/searches; item tower over content + graph embeddings.
- Board-personalized retrieval. For each of user's recent boards, retrieve pins similar to that board's centroid. Captures user's stated intent.
- Trending visual cluster source. Use a clustering of pin embeddings to surface what's new and trending in clusters the user shows interest in.
Ranking. Multi-task DLRM/transformer over user × pin × context. Heads: p(click), p(repin), p(hide), p(close-up view), p(long-engagement). Sequence model over the user's recent pin interactions (BST-style transformer). Input includes the pin's visual embedding directly — the visual is a first-class feature, not just an ID.
Re-ranking. Spatial layout matters: home feed is a Masonry grid. Re-ranker has to balance top-K relevance with visual diversity in adjacent positions (avoid 5 lookalike images in a row → MMR over visual embedding). Frequency caps on advertisers and creators. Topic balance.
Features
- Visual: CLIP/ViT embedding, dominant color, image dimensions, aesthetic score.
- Text: title, description, board names where pinned.
- Graph: PinSAGE embedding, board centroid embeddings.
- Engagement counters per pin (CTR, repin rate by segment).
- User: recent searches, recent boards created, demographic.
- Context: time of day, season (Halloween → spooky).
Training
- Two-tower: trained on positive (impression+repin or impression+long-view) with in-batch + hard negatives mined from the previous version.
- Ranker: incremental daily training on logged events with calibration head per task.
- PinSAGE: batch retrained weekly with MapReduce-style neighborhood aggregation.
Serving
- Visual embedding pre-baked at upload time (vision model on a dedicated GPU pool).
- HNSW for visual + neural retrieval; sharded by category + region.
- Ranker on GPU; batch per user per request.
Evaluation
- Offline: NDCG@10 with repin labels weighted higher than clicks.
- Online: long-term repin rate, sessions/week, hide rate (guardrail), creator coverage.
- Visual diversity metric: average pairwise visual distance among shown items.
Gotchas
- Visual repetition. The product looks broken if 4 of the top 8 pins are the same outfit photo. Diversity in re-rank is non-negotiable.
- Seasonal shift. Christmas pins explode in November and tank in February. Recency and seasonal adjusters.
- Cold pins. New pins have no graph neighbors. Vision embedding is the bridge; cold-start exposure pool boosts first-24h pins.
- NSFW / unsafe. Visual + text moderation classifier as a hard filter pre-ranking.
If more time
- Multimodal LLM to score "is this pin a creative match for what user is exploring on board X?" as a re-rank signal.
- Replace MMR with learned diversity head.
- Generative pin retrieval via TIGER over visual semantic IDs.
Design 3 — TikTok For You Page
Clarifying questions
- Single-video feed, very rapid feedback (every swipe is data). Watch-time fraction is the dominant label; like/share/follow are secondary.
- Latency: critical because users swipe fast and the next video must be ready. Need to pre-fetch next 2–3 videos.
- Cold start for both users (first session) and creators (first upload).
Scale
- ~1.5 B MAU, ~1 B DAU.
- Corpus: ~hundreds of millions of videos active.
- Daily uploads: tens of millions; new content's first hour is critical.
- Average session length: tens of swipes ⇒ tens of inference calls per session.
Architecture — sequential and exploration-heavy
Retrieval. Multi-source, but with strong emphasis on:
- Two-tower over recent watch sequence (heavy weighting on last 5–10 videos).
- Co-engagement (item-item) anchored on the just-watched video.
- Trending pool (a few thousand currently surging videos).
- New-creator exposure pool. A meaningful fraction (~5–10%) of impressions go to new content with little engagement data, because the FYP's life depends on the creator funnel staying healthy. This is the "TikTok secret sauce" — aggressive controlled exploration of new content.
- Audio-based retrieval (matching trending sounds).
Ranking. Sequence-aware deep ranker. Input is the user's last 100 watch events (with rich per-event features: video ID, watch fraction, like, share, follow, skip-time). HSTU-style transformer is the modern choice. Outputs: p(complete), p(watch_>60%), p(like), p(share), p(follow_creator), p(skip<3s). The combined score is heavily weighted toward complete/watch_fraction because TikTok lives or dies on dwell time.
Re-ranking. Diversity over creators, topics, visual styles. Frequency caps. Reinforcement-learning style policy for "should I show this exploration video here?" — TikTok has been public about using RL/bandits on top of the ranker to manage exploration vs exploitation across the session.
Why RL fits TikTok specifically
The single-video format gives clean, fast feedback (watch time of the next video is observed within seconds). The session is a sequence of decisions where the state evolves. This is closer to a true sequential decision problem than most recsys. Use cases: deciding when to inject exploration content, when to push a follow recommendation, when to insert ads. Implementations are typically batch-RL with policy gradient or off-policy correction (DQN-style with importance sampling), not pure online RL.
Serving — pre-fetching is the trick
- While user watches video N, system already retrieves and ranks for slot N+1, N+2, sometimes N+3.
- If user does an action mid-video (like, skip), a re-rank can happen for N+2 onward.
- Edge cache video files for the next 2 candidates so playback starts in <100 ms.
Evaluation
- Offline: NDCG-watch (positions weighted by watch-time), AUC on per-task heads.
- Online: average watch time per session, session count, retention D7/D30, creator coverage (Gini coefficient of impressions over creators).
- Long-term: holdback to detect "shorts addiction collapse" — the cohort of users who spend a lot then churn from satiation.
Gotchas
- Optimization for short-term watch time can be net-negative. Aggressive watch-time chasing produces "doomscroll" content that users regret. Survey labels and "satisfaction" model heads counterbalance.
- Cold start for new users. The first 10 videos for a new user are effectively a multi-armed bandit, rapidly probing topics.
- Adversarial creators. SEO-style content gaming. Need a separate "low-quality / engagement-bait" classifier.
- Latency. Pre-fetch failures are visible — empty feed.
If more time
- Move ranker fully to HSTU; collapse retrieval+ranking into one generative model for the long-tail user (those with rich history).
- End-to-end trainable session policy via offline RL with confidence intervals.
- World-model simulation to evaluate ranker policy changes without A/B traffic.
Design 4 — Twitter / X home timeline
Clarifying questions
- Two timelines historically: "Following" (chronological) and "For You" (ranked). Focus on For You.
- Tweets are short, real-time, network-driven. Freshness is critical — minute-level decay.
- Network effects are first-class: who you follow, who they retweet, who replies — all signals.
Scale
- ~500 M MAU.
- ~500 M tweets/day; spikes during live events (sports, breaking news).
- Most tweets have a useful lifetime of hours, not days.
Architecture — fanout vs query at read
The classic Twitter design choice. Two strategies, both used:
- Fanout-on-write. When user A tweets, push the tweet ID into a per-follower timeline materialized list. Read is fast (just read your list). Write is expensive — celebrities with 100M followers can't fanout per tweet, would create write storms.
- Query-on-read. At read time, fetch recent tweets from people you follow. Slow but no write amplification.
- Hybrid. Fanout for normal accounts; query-on-read for celebrities ("Bieber tweets" exception). Twitter implemented this for years.
Retrieval — sources for For You.
- In-network candidates. Recent tweets from people you follow + their retweets + their replies. ~hundreds to a few thousand.
- Out-of-network candidates. Tweets from people you don't follow but might engage with. Two-tower retrieval (user × tweet author embedding) over recent tweets. ~hundreds.
- Topic-based. Tweets in topics you follow.
- Co-engagement / collab-filter. Tweets liked by users similar to you.
- SimClusters (Twitter's own). Sparse community memberships for users and tweets; recommend tweets in your communities.
Ranking. Twitter's ranker (open-sourced in 2023 in part) uses a mix of heavy features: tweet text embedding, author embedding, rich engagement signals, recency, tweet metadata (has-image, is-reply, reply-count). Multi-task heads for p(like), p(reply), p(retweet), p(profile-click), p(longer-dwell). Aggregated via a learned weight vector.
Re-ranking. Author diversity (avoid one user's tweets clustered), conversation completeness (if showing a reply, ensure parent context), recency boost (decay older tweets), filter blocked / muted, downrank low-quality / spam.
Real-time pipeline
- Tweet ingestion → Kafka → real-time feature extraction (tweet embedding, classifier scores) → fanout or candidate index update within seconds.
- User-level engagement events also stream into the feature store; user embedding can be refreshed every few minutes.
Eval
- Offline: NDCG@K with multi-objective labels.
- Online: time spent, sessions per day, reply / retweet rate (engagement loops), reported satisfaction, controversial-content guardrails, hide rate.
- Track creator reach: are mid-tier creators getting enough impressions to keep tweeting?
Gotchas
- Outrage/engagement bait. Optimization for engagement amplifies controversy. Need explicit reward shaping or post-rank dampening.
- Real-time freshness. 30-min-old tweets are usually stale. Strong recency decay in score.
- Live events. Sports / breaking news cause query spikes; auto-scaled retrieval and dynamic candidate budget.
- Network effects in eval. Treatment of one user changes what their followers see — A/B randomization should be at the cluster (community) level for ranking changes that affect retweet/reply patterns.
- Spam / coordinated inauthentic behavior. Adversarial; need bot detection and suppression as a separate pipeline.
If more time
- Generative retrieval over semantic IDs of tweets (cluster tweets first, semantic-tokenize them).
- LLM scoring for "is this tweet of interest to user given their last 10 reads?"
- Conversation-level scoring rather than tweet-level (rank thread, not individual replies).
Design 5 — DoorDash store ranking
Clarifying questions
- Surface: home feed of restaurants/stores. Goal: maximize conversions while keeping operations sane (don't recommend a store that is overloaded or out of delivery range).
- Constraints: geospatial (must be in delivery radius), temporal (must be open + have prep capacity), supply-aware (Dasher availability).
- Optimize for first-order conversion, but also long-term retention and marketplace health (don't starve smaller stores).
Scale
- ~30 M MAU, US + INTL.
- Stores: ~hundreds of thousands; each city-level corpus is much smaller (~hundreds to low thousands per user).
- Strong daypart variation: lunch/dinner peaks shape both demand and supply.
Architecture — geospatial + supply-aware
Critical observation: candidate set is inherently small per user (in-radius stores). Retrieval is more about filtering than ANN.
Retrieval.
- Spatial filter: stores within delivery radius (geohash / S2 cell intersection), open right now, with ETA < threshold.
- Within that, optionally further pre-filter to top ~500 by engagement/historical conversion to bound ranking compute.
Ranking. Deep multi-task model. Inputs:
- User: order history (restaurants / cuisines / spend per category), demographic, device, recency.
- Store: cuisine, price tier, popularity, ratings, menu embedding, photo embedding.
- Cross: user × store CTR / conversion / repeat-rate.
- Context: time of day, day of week, weather (rain shifts demand).
- Operations: current ETA prediction, prep-time estimate, Dasher availability, store kitchen-load score.
Heads: p(click_into_store), p(add_to_cart), p(order), p(reorder_within_30d). Final score is a calibrated combination, with explicit downweight when operations score is unhealthy (long ETA, low capacity).
Operations-aware rerank. The reranker is allowed to adjust based on supply: if 5 high-ranked stores would all overload the same kitchen / same Dasher pool, spread out. This is closer to two-sided marketplace optimization than pure recsys. Often modeled as a bid-shading layer or LP solver across the home page.
Features unique to delivery
- Live ETA model output (separate ML model, fed as a feature).
- Live store load (how many open orders).
- Delivery fee, surge / peak pricing (if applicable).
- Promotional state (is this store running a $5-off promo? changes conversion).
Training
- Logs are smaller than feed-style products but cleaner (each impression has rich operational context). Daily retraining is enough.
- Holdouts per metro because behavior is regional.
Serving
- Spatial index (geohash buckets in Redis or S2 in a custom service) returns candidates in < 5ms.
- Feature hydration includes near-real-time ops features (push-updated by an ops service).
- Ranker on CPU is often fine because candidate count is small (~500 ranked).
Eval
- Offline: NDCG@K on conversion labels.
- Online: conversion rate, GMV per session, order completion (canceled orders are bad), repeat order rate, marketplace health (impression share for small/new merchants).
- Operations metric: ETA accuracy of recommended stores (penalize recommending stores that under-deliver on ETA).
Gotchas
- Supply collapse. Recommending only the 10 most popular stores creates a load spike that hurts ETAs across the board. Operations-aware rerank is non-negotiable.
- Daypart shift. 11am vs 7pm vs midnight have completely different distributions. Time-of-day must be a feature; consider per-daypart sub-models or a temporal embedding.
- Cold metros. A new market has no engagement data. Use cuisine-popularity priors and content (menu text, image) features.
- Promo gaming. A store running a discount looks artificially good. Disentangle promo lift from intrinsic preference (often a separate "promo-adjusted" score).
If more time
- Jointly optimize ranking + ETA prediction (currently two separate models — gradients should flow).
- Per-user reorder model that handles "I order pizza every Friday at 7pm" patterns explicitly.
- LLM-generated personalized store summaries on cards to lift click-through.
Design 6 — Spotify Discover Weekly
Clarifying questions
- It's a 30-track playlist, refreshed weekly per user, intended to be a personal mix of new (to user) discoveries.
- Goals: high listen-through, high "save to library" rate, novelty (mostly tracks the user has never heard).
- Different from feed: batch-generated, fixed-length, must hold together as a coherent listening experience.
Scale
- ~600 M MAU.
- Catalog: ~100 M tracks.
- Generation runs Sunday night for everyone; pipeline must finish in ~hours.
Architecture — playlist generation, not single-shot ranking
Stage 1 — Candidate generation.
- Collaborative filter: ALS or two-tower over user × track listen history. Returns top ~5000 tracks.
- Audio content model: CNN/transformer over audio (mel-spectrogram or learned audio embeddings). Useful for long-tail tracks with little engagement.
- NLP model over track metadata, lyrics, artist tags.
- Playlist-co-occurrence: tracks frequently appearing together in user-created playlists.
Stage 2 — Filter and qualify.
- Remove tracks the user has already heard (the novelty constraint).
- Remove tracks by artists the user has actively avoided.
- Remove tracks not licensed in the user's market.
- Apply quality filter (skip tracks with very low listen-through globally).
Stage 3 — Rank and select 30.
- Per-track score from a model trained on historical "did the user listen to this track to the end / save it to library" labels.
- Diversity / coherence constraint: select 30 tracks subject to diversity over genres, tempo, mood, and a soft constraint that adjacent tracks flow well (transition-friendly).
- This is a constrained selection problem — often formulated as DPP (Determinantal Point Process) or a greedy MMR variant. Modern Spotify likely uses a learned re-rank with diversity reward.
Stage 4 — Order the 30.
- Order matters. The first few tracks are an audition: if the user bails, the playlist failed. Place high-confidence relevant tracks early; sequence the rest for an arc (energy curve).
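A greedy-MMR sketch of the Stage 3 selection, as referenced above; relevance scores and unit-normalized track embeddings are assumed inputs, and a production version would add per-artist caps and transition constraints on top:

```python
import numpy as np

def mmr_select(scores: np.ndarray, emb: np.ndarray, k: int = 30,
               lam: float = 0.7) -> list[int]:
    """Greedy MMR: trade per-track relevance off against similarity to
    tracks already picked. scores: (N,) relevance from the Stage 3 model;
    emb: (N, d) unit-normalized track embeddings. lam=1.0 is pure relevance."""
    sims = emb @ emb.T  # cosine similarities; fine for a few thousand tracks
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        if selected:
            # Each candidate's similarity to its closest already-picked track.
            max_sim = sims[np.ix_(candidates, selected)].max(axis=1)
        else:
            max_sim = np.zeros(len(candidates))
        mmr = lam * scores[candidates] - (1 - lam) * max_sim
        best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected
```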
Features
- User: listen history, saved tracks, playlists created, top genres/artists, time-of-day listening patterns, device.
- Track: audio embedding, metadata (genre, tempo, key, energy), popularity, artist embedding.
- Cross: user × artist plays, user × genre affinity.
- Sequence: recent listening sessions to capture mood drift.
Training
- Predictive model: trained on past Discover Weekly logs — "given the playlist we generated, predict listen-through and save."
- Counterfactual: for tracks not in past playlists, use general listen-through priors.
- Refresh weekly with the previous week's outcomes.
Serving
- Batch pipeline (Spark / Beam) runs over the weekend; writes 600 M playlists to per-user blobs.
- Read at request time is just a key-value lookup.
Eval
- Offline: predicted listen-through, save rate.
- Online: average tracks listened per playlist, save rate, share rate, "play next week" rate (does the user come back for the new one?).
- Long-term: D90 retention of users who consume Discover Weekly vs control.
Gotchas
- Stale-feeling playlists. If users keep getting similar music week to week, the product loses its magic. Track week-over-week diversity per user.
- Cold-start tracks. Indie / niche tracks with no listens. Audio model is critical here.
- Artist obsessions. A user who listens to 90% of one artist would get a playlist of just that artist's deep cuts. Cap per-artist in the playlist.
- Repeated dislikes. If a user skips a track three weeks running, never surface it again.
- Genre drift. Users go through phases; weight recent history differently from all-time history.
If more time
- Move from track-by-track scoring to playlist-level scoring: train a model on "rate this playlist 0–1" labels, bootstrapping training playlists via language-model paraphrase of human DJ playlists.
- Add LLM rationale ("this week we mixed X and Y because…") rendered in the UI to lift trust and click-through.
- Conversational refinement ("more upbeat") with a small interactive playlist agent.
Part 5 — RecSys + LLM convergence (the 2025–2026 frontier)
The most asked-about topic in 2026 Sr Staff loops. Be opinionated. Below: how LLMs are actually used in production recsys, and where they aren't yet.
LLMs as feature extractors
The most boring and most useful pattern. For any item with text (titles, descriptions, reviews, transcripts, captions), pre-compute LLM embeddings (OpenAI text-embedding-3, Cohere, in-house Llama-style encoder). These plug into ranking models as a continuous feature. Wins:
- Dramatic cold-item improvement — content embeddings exist before any interaction.
- Cross-lingual transfer (multilingual encoders).
- Subtle topical similarity beyond what tag systems capture.
This is in production at virtually every modern recsys team. Cost is dominated by one-time embedding compute, refreshed only for changed items.
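A sketch of the one-time embedding pass, with the OpenAI v1 SDK as a stand-in for whichever encoder a team actually runs; the model name and storage scheme are assumptions:

```python
from openai import OpenAI  # assumes the openai v1 SDK and OPENAI_API_KEY set

client = OpenAI()

def embed_items(texts: list[str], model: str = "text-embedding-3-small"):
    """One-time batch embedding of item text. Store the vectors in the
    feature store keyed by item id; refresh only when content changes."""
    resp = client.embeddings.create(model=model, input=texts)
    return [d.embedding for d in resp.data]
```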
LLMs as rerankers
Take the top-50 from your traditional ranker, send to an LLM with a prompt like "User has been browsing {history}. Rank these items by relevance: {items}." Used in:
- Search result reranking (Bing, You.com).
- E-commerce product reranking for high-intent queries.
- News reranking with personalized briefs.
Tradeoff: latency. A GPT-4-class call adds 500 ms+. Production deployments use distilled smaller models, batch-rerank offline for "warm" content, or invoke LLM rerank only for high-value queries (high commercial intent, higher latency tolerance).
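The prompt pattern as a sketch; the model name and output format are placeholders, and production versions distill this into a smaller model and parse structured output:

```python
from openai import OpenAI

client = OpenAI()

def llm_rerank(history: str, items: list[str],
               model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to reorder a short candidate list (top-50 from the
    traditional ranker). Placeholder prompt, not a production recipe."""
    prompt = (
        f"User has been browsing: {history}\n"
        "Rank these items by relevance, most relevant first, "
        f"as a comma-separated list of item ids:\n{items}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```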
Generative recommendation (autoregressive item generation)
TIGER, HSTU, and the broader RecGPT framing (covered in Part 1.6) are the likely future direction. As of 2026, HSTU is in production at Meta scale; most other "generative recsys" efforts remain research-stage or small-to-medium scale.
Conversational recommendation
"Recommend me a sci-fi movie like Arrival but happier." A user expresses preferences in language; an agent translates into retrieval, ranks, presents, and refines based on follow-up. Hot research area, increasingly shipping in product (Amazon Rufus, Klarna, Shopify, Google's product search). Architecture pattern:
- LLM extracts query into structured filters + free-text intent.
- Hybrid retrieval (BM25 + semantic + filtered).
- LLM-style rerank with chain-of-thought rationale.
- Multi-turn refinement: each user reply updates the latent intent state.
Personalized agents (multi-step research / shopping)
"I want to switch from gas to induction stove — what should I buy?" The agent reads product specs, reads reviews, does spec comparisons, answers tradeoffs, narrows to recommendations, possibly checks out. This is recsys + tool use + reasoning. Companies betting here: Anthropic (computer-use agents), Amazon, Klarna, Perplexity. Recsys engineer's role: design how recommendations are surfaced inside an agent loop, how to ground in user history without violating privacy, and how to attribute conversion.
World models for simulation-based eval
The dream: rather than running an A/B test for 14 days, simulate the new ranker against a learned model of user behavior to predict the outcome offline. Active research area (Meta, Google, academia). Risk: the world model itself is just a learned predictor and can suffer from distribution shift. Promising for narrowing the candidate set of A/B tests, not for replacing them.
Part 6 — Common interview questions, with Sr Staff answers
Q: Walk me through a YouTube-scale video recommender end-to-end.
The system is a cascade. Retrieval narrows ~5B videos to ~5–10k candidates via parallel sources: a two-tower neural retrieval over the user's watch sequence and item content embeddings, item-item co-watch from the current session anchor, subscription pulls, trending in language/region, topic-personalized retrieval, and an exploration source that injects underexposed long-tail content. Each source is independently sharded with HNSW indexes. The union goes to a multi-task deep ranker — historically DLRM-style with sparse embeddings + dense MLP + dot-product interactions, in 2024+ moving toward HSTU-style sequential transformers — that produces calibrated heads for click, watch-time, like, share, and survey-satisfaction. A linear combination of those heads, with weights tuned by product, scores each candidate. A re-ranking pass enforces diversity (MMR over creators and topics), frequency caps, and policy / safety rules. Latency budget end-to-end is ~150ms; ranking inference is GPU-batched per user request, retrieval shards run on CPU with SIMD. The interesting parts are: (a) the multi-task structure resolves the conflict between click-bait short watches and long-watch satisfaction; (b) the calibration pipeline fixes the systematic miscalibration introduced by negative downsampling and multi-task gradient interference; (c) the exploration source plus survey-trained satisfaction head guard against feedback loops. The biggest lever in the last few years has been moving from average-pooled history to attention-over-history, and ultimately to a transformer over the action sequence.
Q: How do you handle cold start (new users, new items)?
New user. The user has no embedding signal. Strategies, in increasing sophistication: (1) Onboarding: ask the user 3–5 quick taste questions; map to topic embeddings. (2) Demographic / device priors: country, language, device type — decent baseline. (3) Bandit over a small set of broad-appeal content, with Thompson sampling, for the first ~10 impressions. (4) Once you have ≥ 5 interactions, the two-tower starts to produce a meaningful user embedding. Architecturally, the user encoder must work with empty history — the typical fix is a learned "no-history" token (sketch below).
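A minimal sketch of that no-history token; mean-pooling stands in for whatever the production user tower actually does over the history sequence:

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """User-history encoder that stays well-defined for brand-new users
    by substituting a learned no-history embedding (a sketch)."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.no_history = nn.Parameter(torch.randn(d) * 0.02)

    def forward(self, history_embs: torch.Tensor) -> torch.Tensor:
        # history_embs: (T, d); T == 0 for a user with no interactions yet.
        if history_embs.size(0) == 0:
            return self.no_history
        return history_embs.mean(dim=0)
```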
New item. The item id has no usage. Solutions: (1) Content-based retrieval — text/image/audio embeddings give the item a starting point; the item tower reads those features so the embedding is non-degenerate from day one. (2) Creator priors — if a known creator uploads a new item, inherit creator's average engagement. (3) Cold-item exposure boost in retrieval (small fraction of impressions reserved for first-24h items). (4) Generative retrieval (TIGER) is particularly strong here because the semantic ID is content-derived. The Sr-Staff caveat: cold-start solutions trade off against short-term engagement metrics; you must guardrail with creator coverage so the team doesn't quietly turn off the boost in pursuit of CTR.
Q: Offline AUC improved but the A/B test was flat or negative. Why?
Common causes, in order of frequency:
- Selection / position bias. Offline you train and evaluate on logged impressions ranked by the old model, so AUC measures "agreement with old model." A new model that genuinely surfaces different items gets penalized offline by the missing labels for the items it would surface.
- Calibration shift. AUC is calibration-invariant. If the new model is better-ranked but worse-calibrated, downstream consumers (auction, multi-task combination, business rules) make worse decisions.
- Multi-task interference. AUC was on click only. The new model raises click but lowers the other heads (like, share, complete) used in the combined score. Look at per-head deltas in A/B and per-head AUC offline.
- Distribution shift. Train data is from time T-7; you serve at time T. Trends drift.
- Engagement-bait failure mode. Higher predicted CTR comes from clickbait that hurts dwell, satisfaction, or downstream retention.
- Cold-item starvation. The old model had implicit cold-item priors (calibration, exploration); the new model doesn't, and the corpus gradually concentrates on a narrower set.
- Bug. Always check: feature parity offline-vs-online, no leakage, no stale features, prediction logging matches scoring.
The diagnostic protocol: check feature-parity diffs first; then slice the A/B by traffic source, item age, and user tenure; check calibration; then go back to offline replay with IPS to estimate counterfactual lift.
Q: Why does MMoE beat a shared-bottom multi-task model?
In shared-bottom, all tasks compute their head from one trunk representation. The trunk's gradient at each step is a weighted sum of per-task gradients; if tasks have negatively correlated labels (e.g. fast-click vs long-watch), those gradients destructively interfere and the trunk learns a weak average. In MMoE, each task has its own gating network selecting a soft mixture over N expert MLPs. Tasks can specialize on different experts — the negative correlation gets expressed as the two tasks routing to disjoint experts, and gradient interference at the parameter level disappears. Empirically this matters most when the tasks have (a) different positive rates and thus different gradient scales, and (b) different feature dependencies. CGC/PLE pushes this further by giving each task its own private experts plus shared experts, which helps when conflict is severe. The honest caveat: MMoE adds parameters, and on very tightly-correlated tasks (e.g. p_click and p_dwell on the same item) shared-bottom can actually win because the inductive bias of "share representation" is correct.
Q: What is position bias and how do you correct it?
Position bias is the confound that items at the top get clicked more regardless of relevance, so naively trained rankers learn to mimic the previous ranker rather than learn relevance. Two correction families, plus a logging prerequisite:
(1) Examination hypothesis / IPS. Decompose click = examine × relevant. Estimate p(examine | position), typically from a small "swap" experiment where you randomly permute a few positions for ε% of traffic, then fit a position propensity curve. At training, weight each click loss by 1/p(examine | position). Unbiased in expectation, but high variance at deep positions where p(examine) is small.
(2) PAL (Position-Aware Learning). Add position as a feature into a separate "bias" head: total prediction = relevance_head(features) + bias_head(position). At serving, fix position to a constant (say "no position" or position 1). The model offloads position-dependent variance into the bias head, leaving a clean relevance head. Used in production at Huawei, Pinterest, and elsewhere because it's lower-variance than IPS and trivial to deploy; see the sketch after (3).
(3) The logging prerequisite. Always log the position each impression was served at, plus, optionally, a propensity if you randomized. Without this, no debiasing is possible.
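A minimal PAL-style sketch of option (2) in PyTorch; the dimensions, position vocabulary, and training loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PALRanker(nn.Module):
    """Relevance head over content features plus an additive position-bias
    head. Train on logged clicks with the real position; serve without it."""
    def __init__(self, feat_dim: int, n_positions: int = 50, hidden: int = 128):
        super().__init__()
        self.relevance = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.pos_bias = nn.Embedding(n_positions, 1)  # one learned logit per slot

    def forward(self, feats: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # Training-time logit: relevance + position bias; BCE loss on clicks.
        return (self.relevance(feats) + self.pos_bias(positions)).squeeze(-1)

    def serve(self, feats: torch.Tensor) -> torch.Tensor:
        # Serving: drop the bias term (equivalent to fixing position constant).
        return self.relevance(feats).squeeze(-1)
```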
Sr-Staff nuance: position is just one bias. Selection bias (only items the old policy showed are in your data), trust bias (users trust the system's first item), and presentation bias (different surface treatments) all interact. Debiasing one without the others can shift the system to a new biased equilibrium, not an unbiased one.
Q: Compare in-batch negatives and hard negatives for two-tower training.
In-batch negatives are free — you reuse the items already loaded in the batch as negatives for each positive. Cheap, scales with batch size. Two problems: (a) popularity bias — popular items are over-represented as positives so they're over-represented as negatives, and the loss systematically pushes them down; YouTube's logQ subtraction (subtract log(sample frequency) from logits) corrects this. (b) Easy negatives — random items from the batch are usually trivially distinguishable from the positive, so the model gradient signal saturates and the model under-discriminates among similar items.
Hard negatives are items the current model scores high but are not positives. They produce strong gradients and force the model to refine fine-grained distinctions. Two risks: (a) false negatives — items mined as "hard negatives" may actually be positives the user just hasn't seen; this poisons the loss. (b) Cost — mining hard negatives requires either a previous-version model to generate them or expensive on-the-fly retrieval.
Production recipe is almost always a mix: large in-batch negative pool (with logQ correction) for coverage and easy distinguishability, plus 1–5 mined hard negatives per positive (usually top-100 from a previous model, sampled with some randomness, filtered for known positives in the user's history). MoCo-style cross-batch negative queues are a popular intermediate — bigger negative pool than the batch without per-step extra work.
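A sketch of the in-batch half of that recipe with logQ correction, in PyTorch; how log_q is estimated (e.g. from a streaming frequency counter) is out of scope here:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(u: torch.Tensor, v: torch.Tensor,
                          log_q: torch.Tensor) -> torch.Tensor:
    """Sampled softmax over in-batch negatives with logQ correction.
    u: (B, d) user embeddings; v: (B, d) positive-item embeddings;
    log_q: (B,) log sampling frequency of each batch item in the corpus."""
    logits = u @ v.T                      # (B, B): row i against every batch item
    logits = logits - log_q.unsqueeze(0)  # demote over-sampled popular items
    labels = torch.arange(u.size(0), device=u.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```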
Q: How do HNSW and IVF-PQ work, and when do you pick each?
HNSW is a multi-layer proximity graph. Each point sits in layer 0; with exponentially decreasing probability, it also sits in higher layers. Within each layer, every node has up to M edges to its closest neighbors. Search starts at the top entry point, greedily walks to the neighbor closest to the query, descends one layer when it can no longer improve, and at layer 0 performs a beam search of width efSearch to collect top-K. The intuition: the upper layers are like long-distance highways, the bottom layer is local roads. Build is O(N log N), query is O(log N) on average. Knobs: M (graph degree, controls memory and recall), efConstruction (build-time beam width, controls graph quality), efSearch (query-time beam width, controls recall vs latency).
IVF-PQ first clusters all vectors into nlist coarse Voronoi cells (k-means). At query time, it visits the nprobe nearest cells. Within a cell, vectors are stored in product-quantized form: each d-dim vector is split into m sub-vectors, each sub-vector is replaced by the cluster ID (out of K=256) of its nearest k-means centroid in that subspace. So each item is m bytes. Distances are computed via per-query precomputed lookup tables, very cache-friendly.
Pick HNSW when: corpus fits in RAM (≤ few hundred million for typical d=128), latency must be very low and predictable (~1ms), recall must be very high (≥ 95%), and memory budget is generous. It's the default for everything from ad ranking to in-doc embedding lookup.
Pick IVF-PQ when: corpus is huge (≥ 1B), memory is the bottleneck, you can tolerate a slight recall hit (or refine top candidates with exact distance after IVF-PQ shortlist). DiskANN is the disk-resident variant for "1B vectors on a single box." Many production systems do both: IVF-PQ for the very long tail with a refinement pass against full vectors for the top-K.
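Both families in a few lines of faiss; every parameter value below is illustrative, not tuned:

```python
import numpy as np
import faiss  # assumes faiss-cpu is installed

d = 128
xb = np.random.rand(1_000_000, d).astype("float32")  # toy corpus

# HNSW: everything in RAM, ~1 ms queries, very high recall.
hnsw_index = faiss.IndexHNSWFlat(d, 32)        # M = 32 edges per node
hnsw_index.hnsw.efConstruction = 200           # build-time beam width
hnsw_index.add(xb)
hnsw_index.hnsw.efSearch = 64                  # query-time beam width

# IVF-PQ: memory-bound regime; m=16 sub-vectors x 8 bits = 16 bytes/item.
quantizer = faiss.IndexFlatL2(d)
ivfpq_index = faiss.IndexIVFPQ(quantizer, d, 4096, 16, 8)  # nlist, m, nbits
ivfpq_index.train(xb)                          # k-means cells + PQ codebooks
ivfpq_index.add(xb)
ivfpq_index.nprobe = 32                        # coarse cells visited per query

q = np.random.rand(1, d).astype("float32")
D_hnsw, I_hnsw = hnsw_index.search(q, 10)
D_ivf, I_ivf = ivfpq_index.search(q, 10)
```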
Q: Retrieval returns nearly the same candidates for every user. Debug it.
Classic symptom of collapsed personalization. Likely causes, ranked:
- User tower under-trained or under-fed. The user tower output may be near-constant — check the variance of user embeddings across users. If it's low, the model isn't conditioning on user features. Common bug: user features have wrong dtype, are zeroed, or the user-id embedding wasn't restored from checkpoint.
- Severe popularity bias. Without logQ correction or random negatives, the model converges to "rank popular items high" and the user signal is washed out.
- Hot ANN cache. If the serving layer caches top-K results without keying on the user vector, every request returns the same cached list.
- Wrong feature serving. User features at serving time aren't the ones the model trained on — the user tower is effectively running on zeros.
- Index issue. The ANN index was rebuilt with stale item embeddings while user embeddings drifted; everyone's nearest neighbors degenerate to the most common items in embedding-space.
- Filter collapse. A safety filter or business rule cuts out 99.9% of candidates, and the same surviving 1000 are returned to everyone.
Diagnostic plan: log user embedding variance, log per-user candidate-set Jaccard similarity, sample 100 users and inspect their top-10 candidates and embeddings. The Sr-Staff move is to have these as standing dashboards before the bug ever happens.
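The first two dashboard checks as a sketch; it assumes sampled user embeddings and retrieved candidate sets are already being logged:

```python
import numpy as np

def personalization_checks(user_embs: np.ndarray, candidate_sets: list[set]):
    """Standing-dashboard checks for collapsed personalization (a sketch).
    user_embs: (N, d) sampled user-tower outputs;
    candidate_sets: list of N sets of retrieved item ids."""
    # Near-zero variance => the user tower isn't conditioning on the user.
    emb_variance = float(user_embs.var(axis=0).mean())
    # High mean pairwise Jaccard => everyone gets the same candidates.
    n = len(candidate_sets)
    jaccards = [
        len(candidate_sets[i] & candidate_sets[j]) /
        max(1, len(candidate_sets[i] | candidate_sets[j]))
        for i in range(n) for j in range(i + 1, n)
    ]
    return emb_variance, float(np.mean(jaccards))
```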
Q: How do you calibrate a multi-task ranker, and why per head?
Each task head needs its own calibration because:
- Positive rates differ by orders of magnitude (click 5%, share 0.1%, follow 0.02%).
- Negative downsampling, often per-task, distorts predicted probabilities differently per head.
- Multi-task gradient interference pulls each head toward an "average" calibration that fits no single task.
The pipeline:
- For each head, compute predicted probabilities and labels on a held-out, unbiased (or IPS-corrected) calibration set — distinct from the training set.
- Fit a per-head monotonic calibrator: Platt scaling (logistic regression on the logit) for low-data heads; isotonic regression for high-data heads where you want a flexible monotonic mapping. Temperature scaling (one scalar per head) is a stable fallback when data is very limited.
- If you used negative downsampling at rate r, recover the true probability before calibration: p_true = p_pred / (p_pred + (1 - p_pred) / r). Apply per task because sample rates may differ (sketch at the end of this answer).
- Track ECE (Expected Calibration Error) and reliability diagrams per head, sliced by traffic segment (item age, user tenure, region) — calibration often drifts on slices.
- Recalibrate periodically (daily or per training run) because logging policy changes and data distribution shifts both invalidate old calibrators.
Sr-Staff nuance: in multi-task systems where the heads are summed into a serving score, miscalibration of one head propagates into the score combination. If p(share) is 5x too high, the share-weighted term dominates and the entire ranker drifts. Calibration is not optional — it's the glue that lets multi-task and multi-objective optimization actually work.
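A per-head sketch of the downsampling correction plus isotonic fit from the pipeline above, using sklearn; the held-out arrays are placeholders:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def undo_downsampling(p_pred: np.ndarray, r: float) -> np.ndarray:
    """Invert negative downsampling at keep-rate r before calibrating."""
    return p_pred / (p_pred + (1.0 - p_pred) / r)

# Placeholder held-out calibration set for one head.
p_raw = np.random.beta(2, 20, size=100_000)            # raw head outputs
labels = (np.random.rand(100_000) < p_raw).astype(float)

p_corr = undo_downsampling(p_raw, r=0.1)               # negatives kept at 10%
iso = IsotonicRegression(out_of_bounds="clip").fit(p_corr, labels)
p_cal = iso.predict(p_corr)                            # calibrated probabilities
```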
Q: You run a DLRM-style ranker at scale. How would you migrate to HSTU / generative recommendation?
This is the migration-of-the-decade question. The plan I'd defend:
Phase 1 — De-risk through shadow deployment. Train an HSTU-style model in parallel; serve it in shadow (predict alongside DLRM, log scores, do not affect served results). Validate offline-online correlation: does HSTU's predicted ranking agree with what DLRM produces in production? Where it disagrees, do the disagreements correspond to plausibly better items?
Phase 2 — Replace the ranker only. Keep the existing retrieval cascade. Swap DLRM with HSTU as the ranking model. A/B test with strict guardrails on per-task heads, calibration, and long-term metrics. This isolates whether HSTU itself adds value, separate from any retrieval changes.
Phase 3 — Collapse retrieval into HSTU (if it pencils). Once the ranker is HSTU and stable, evaluate generative retrieval (TIGER-style semantic IDs or HSTU's joint retrieval head) to replace some retrieval sources. Likely keep some specialty sources (e.g. spatial / freshness / safety-filtered) outside the model — generative retrieval doesn't naturally compose with hard constraints.
Phase 4 — Scale up. Run scaling-law experiments to find the sweet spot of model size vs serving cost. HSTU's appeal is precisely that it scales like an LLM, so spend the compute.
Risks:
- Latency. Beam search is slower than ANN retrieval. May force you to keep the generative model ranker-only.
- Filtering / business rules. Hard to inject mid-generation. May need a post-generation filter that occasionally returns < K items if too many are filtered out — design for graceful degradation.
- Cold items. Semantic IDs help, but if the RQ-VAE wasn't trained on the cold item's content type, you're back to no signal.
- Eval comparability. Different model = different gradients = different miscalibrations. Don't compare AUC across architectures naively; rebuild calibration and re-set serving weights.
- Cost. HSTU at scale is more compute than DLRM at the same accuracy tier; the bet is that you spend it once and get scaling-law lifts that DLRM can't match.
- Org risk. The ranker is the highest-value model in the company. Migration is a multi-quarter effort with executive visibility. The plan above gives clear "kill switch" points (after each phase) so leadership stays comfortable.
- Drift in semantic IDs. If RQ-VAE codes shift between retrains, item identity in the generative model becomes unstable. Either freeze the codebook or version it carefully.
The Sr-Staff framing: this is a 6–12 month bet, not a refactor. The win condition is not "HSTU beats DLRM in offline metrics" — it's "we have a system whose accuracy improves with compute, so we can keep winning by buying GPUs." The DLRM ceiling is set by sparse embedding tables; HSTU lifts that ceiling.