Mock interviews
Curated mock prompts across coding, ML coding, ML system design, behavioral, and theory probes — with answers attached. Hit the random button, set the timer, and run a 45-min loop on yourself. Sourced from real Anthropic / OpenAI / DeepMind / Pinterest / Stripe loops.
- Pick a category (or hit Random mock).
- Start the timer at the top.
- Solve out loud (record yourself with Zoom or Loom — you'll learn 10× faster).
- Stop the timer. Reveal the answer. Mark what you missed.
- Write a 3-line postmortem in your application tracker / a notes file.
The mock library
Multithreaded web crawler
Crawl all pages reachable from a seed URL on the same domain, stripping #fragment identifiers before dedup. Start sync; then make it concurrent with ThreadPoolExecutor. Follow-ups: politeness, robots.txt, distributed crawling.
Approach
- Sync version: queue.Queue for unvisited URLs, set for visited. Dequeue, fetch HTML, parse links, filter same-domain + strip fragment, push unseen.
- Concurrent: ThreadPoolExecutor(max_workers=N). Submit fetch tasks; on completion, parse + enqueue new URLs. Visited set guarded by lock.
- Termination: maintain in-flight counter; when queue empty AND no in-flight tasks, done.
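A minimal concurrent sketch, assuming requests and BeautifulSoup for fetch/parse (error handling and politeness elided; the crawl function and its parameters are illustrative):

import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_workers=8):
    domain = urlparse(seed).netloc
    visited, lock = {seed}, threading.Lock()

    def fetch(url):
        html = requests.get(url, timeout=5).text
        links = []
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute, _frag = urldefrag(urljoin(url, a["href"]))  # strip #fragment
            if urlparse(absolute).netloc == domain:               # same-domain filter
                links.append(absolute)
        return links

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        in_flight = {pool.submit(fetch, seed)}
        while in_flight:                                          # nothing queued, nothing in flight => done
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in done:
                for link in fut.result():
                    with lock:                                    # visited set guarded by a lock
                        if link in visited:
                            continue
                        visited.add(link)
                    in_flight.add(pool.submit(fetch, link))
    return visited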
Follow-up answers
- Politeness: per-host token bucket; cap concurrent requests per host.
- robots.txt: fetch and cache per host; respect Disallow.
- Distributed: shard URLs by hash(URL) % N workers; central dedup via Redis or Bloom filter; coordinator detects "done."
- Threads vs processes: GIL is released on socket reads → threads are fine. Processes only if HTML parsing dominates.
Time-based key-value store with extensions
set(key, value, timestamp) and get(key, timestamp) returning the latest value with timestamp ≤ given. Then extend: per-key locks vs global lock, disk persistence (replay log + snapshots), TTL on values. (LC 981 base, then real-system follow-ups.)
Base structure
Hashmap key → list of (timestamp, value) sorted by timestamp. set: append (assume monotonic) or insort. get: binary search for largest ts ≤ target.
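A minimal single-threaded sketch of the base structure using bisect (the TimeMap class name is illustrative; locks, persistence, and TTL layer on top per the extensions below):

import bisect
from collections import defaultdict

class TimeMap:
    def __init__(self):
        self.ts = defaultdict(list)      # key -> sorted timestamps
        self.vals = defaultdict(list)    # key -> values aligned with ts

    def set(self, key, value, timestamp):
        # insort-style insert keeps things correct even for out-of-order writes
        i = bisect.bisect_right(self.ts[key], timestamp)
        self.ts[key].insert(i, timestamp)
        self.vals[key].insert(i, value)

    def get(self, key, timestamp):
        # rightmost entry with ts <= timestamp
        i = bisect.bisect_right(self.ts[key], timestamp)
        return self.vals[key][i - 1] if i else None

tm = TimeMap()
tm.set("k", "v1", 1); tm.set("k", "v2", 4)
print(tm.get("k", 3), tm.get("k", 5))    # v1 v2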
Extensions
- Per-key locks: stripe-lock on hash(key) % N — much higher concurrency than global mutex on hot keys.
- Persistence: append-only log (write-ahead) + periodic snapshot. Recover by replay since last snapshot.
- TTL: store (timestamp, value, expiry); lazy eviction on read; periodic background sweep.
Edge cases interviewer expects you to mention
- Out-of-order timestamps. Concurrent updates with same timestamp. Memory growth without eviction. Crash mid-write (use fsync semantics).
Spreadsheet API with cycle detection
setCell(name, value | formula) and getCell(name). Formulas can reference other cells. Detect cycles. Then optimize getCell to O(1).
Approach
- Parse formula to find dependencies. Build a dependency graph.
- Cycle detection: DFS with 3-color marking (white/gray/black). Gray-on-gray = cycle.
- Eager evaluation: on setCell, topological-sort affected cells; recompute in order. getCell = O(1) lookup.
- Lazy alternative: cache evaluations; invalidate on dependency change.
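A sketch of just the cycle check, with the dependency graph passed in as a plain dict (formula parsing and evaluation elided; cell names are illustrative):

def has_cycle(deps):
    # deps: cell -> list of cells its formula references
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY
        for d in deps.get(c, []):
            if color.get(d, WHITE) == GRAY:              # gray-on-gray = back edge = cycle
                return True
            if color.get(d, WHITE) == WHITE and dfs(d):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

print(has_cycle({"A1": ["B1"], "B1": ["A1"]}))   # True
print(has_cycle({"A1": ["B1"], "B1": []}))       # False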
Common follow-ups
- Concurrent reads/writes (RWLock). Memory cost of dependency graph. Streaming partial updates.
LRU cache (LC 146)
get(key) and put(key, value), both O(1). Capacity-limited; evict least-recently-used on overflow.
Structure
Hashmap (key → node) + doubly-linked list (head = most recently used, tail = least recently used).
- get: hashmap lookup; move node to head; return value.
- put: if key exists, update value + move to head. Else create node, push to head, hashmap insert. If over capacity, remove tail node + remove from hashmap.
Pitfalls
- Forgetting to remove from hashmap when evicting tail. Sentinel head/tail nodes simplify edge cases. Single mutex around the whole struct for thread safety.
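A compact sketch using sentinel head/tail nodes as suggested above; thread safety would be one mutex around get/put:

class Node:
    __slots__ = ("key", "val", "prev", "next")
    def __init__(self, key=None, val=None):
        self.key, self.val, self.prev, self.next = key, val, None, None

class LRUCache:
    def __init__(self, capacity):
        self.cap, self.map = capacity, {}
        self.head, self.tail = Node(), Node()      # sentinels: head.next = MRU, tail.prev = LRU
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, n):
        n.prev.next, n.next.prev = n.next, n.prev

    def _push_front(self, n):
        n.prev, n.next = self.head, self.head.next
        self.head.next.prev = n
        self.head.next = n

    def get(self, key):
        if key not in self.map:
            return -1
        n = self.map[key]
        self._unlink(n); self._push_front(n)       # mark most recently used
        return n.val

    def put(self, key, value):
        if key in self.map:
            n = self.map[key]
            n.val = value
            self._unlink(n); self._push_front(n)
            return
        if len(self.map) >= self.cap:
            lru = self.tail.prev
            self._unlink(lru)
            del self.map[lru.key]                  # don't forget the hashmap removal
        n = Node(key, value)
        self.map[key] = n
        self._push_front(n)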
Top K frequent elements (LC 347)
Approach
- Hashmap counts: Counter(nums).
- Min-heap of size K: push (count, num); if size > K, pop. End: heap has K largest.
- Alt: bucket sort by count → O(N) but uses more memory.
- Alt: quickselect on counts → O(N) average.
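The heap approach in a few lines; heapq.nlargest maintains the bounded size-K heap for you:

from collections import Counter
import heapq

def top_k_frequent(nums, k):
    counts = Counter(nums)
    # nlargest keeps a size-k heap internally: O(N log k)
    return [num for num, _ in heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])]

print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))   # [1, 2]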
Multi-head attention from scratch in PyTorch
Implement multi-head self-attention with causal and padding masks, without using nn.MultiheadAttention.
Code skeleton
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHA(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads; self.d_h = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, causal=True, pad_mask=None):
        B, T, _ = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.h, self.d_h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                           # each (B, H, T, d_h)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_h)   # (B, H, T, T)
        if causal:
            mask = torch.triu(torch.ones(T, T, device=x.device), 1).bool()
            scores = scores.masked_fill(mask, float('-inf'))
        if pad_mask is not None:                                   # pad_mask: (B, T), True = padding
            scores = scores.masked_fill(pad_mask[:, None, None, :], float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)         # back to (B, T, d_model)
        return self.out(out)
What interviewers check
- Did you scale by √d_head? (Why: prevents softmax saturation.)
- Did you mask before softmax? (Not after.)
- Shape gymnastics correct? Comment shapes.
- Bonus: extend to GQA. Extend to KV cache for autoregressive decode.
BPE tokenizer from scratch
Algorithm
- Initialize vocabulary as bytes (256 entries).
- Pre-tokenize corpus (split on whitespace + simple regex).
- Count adjacent pair frequencies in current word splits.
- Find most frequent pair. Add merged token to vocab. Apply merge to all words.
- Repeat until target vocab size.
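A character-level training sketch to show the counting/merging loop (real byte-level BPE starts from 256 byte tokens and pre-tokenizes with a regex; the helper names here are illustrative):

from collections import Counter

def pair_counts(splits):
    # splits: word (tuple of symbols) -> corpus frequency
    counts = Counter()
    for syms, freq in splits.items():
        for pair in zip(syms, syms[1:]):
            counts[pair] += freq
    return counts

def apply_merge(splits, pair):
    a, b = pair
    out = Counter()
    for syms, freq in splits.items():
        merged, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                merged.append(a + b); i += 2
            else:
                merged.append(syms[i]); i += 1
        out[tuple(merged)] += freq
    return out

def train_bpe(words, num_merges):
    splits = Counter(tuple(w) for w in words)    # start from single characters
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(splits)
        if not counts:
            break
        best = max(counts, key=counts.get)       # most frequent adjacent pair
        merges.append(best)
        splits = apply_merge(splits, best)
    return merges

print(train_bpe(["low", "lower", "lowest"], 3))  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]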
Encode
Apply learned merges in order (priority by merge index). Greedy left-to-right.
Common pitfalls interviewer probes
- Whitespace handling: GPT-2 BPE includes leading space as part of token.
- Byte fallback: unknown characters → emit raw bytes; vocab guarantees byte coverage.
- Merge order matters at encode time.
Top-K + top-P sampling
Code
import numpy as np

def sample(logits, temp=1.0, top_k=None, top_p=None):
    logits = logits / max(temp, 1e-6)
    if top_k:
        # keep only the top_k largest logits; push everything else to effectively -inf
        idx = np.argpartition(logits, -top_k)[-top_k:]
        mask = np.full_like(logits, -1e9); mask[idx] = logits[idx]
        logits = mask
    # numerically stable softmax
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    if top_p:
        order = np.argsort(-probs)                  # stable descending
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1    # always keeps at least one token
        keep = order[:cutoff]
        new = np.zeros_like(probs); new[keep] = probs[keep]
        probs = new / new.sum()
    return np.random.choice(len(probs), p=probs)
What gets you graded down
- Forgetting logits -= logits.max() for stability.
- Off-by-one in nucleus cutoff — must keep at least one token whose cum ≥ p.
- Not handling temperature = 0 (greedy degenerate case).
K-means from scratch with k-means++ init
Algorithm
- k-means++ init: pick first centroid uniformly random; each subsequent centroid sampled with probability ∝ D(x)² where D(x) is distance to nearest existing centroid.
- Assign: each point to nearest centroid (vectorized with numpy broadcast).
- Update: centroid = mean of assigned points. Empty cluster → reinitialize via k-means++ or to a random point.
- Stop: when assignments stop changing OR centroid movement < ε OR max iterations.
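A vectorized numpy sketch of init plus the assign/update loop (empty clusters reseeded randomly, as noted above; function names are illustrative):

import numpy as np

def kmeans_pp_init(X, k, rng):
    # first centroid uniform at random; later ones sampled with prob ∝ squared distance to nearest chosen centroid
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centroids)[None]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    C = kmeans_pp_init(X, k, rng)
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - C[None]) ** 2).sum(-1)       # (n, k) squared distances
        labels = d2.argmin(axis=1)                           # assign step
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else X[rng.integers(len(X))]       # empty cluster: reseed to a random point
                          for j in range(k)])
        if np.linalg.norm(new_C - C) < tol:                  # centroids stopped moving
            C = new_C
            break
        C = new_C
    return C, labels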
Follow-ups
- Choose K: elbow method (within-cluster SSE vs k), silhouette score, gap statistic.
- Scale to 1B points: mini-batch k-means; or k-means on a subsample then assign rest.
- High-d data: PCA first; or use spherical k-means for unit-norm vectors.
Speculative decoding (toy implementation)
Algorithm sketch
- Draft generates K tokens, recording draft probabilities at each step.
- Target runs ONE forward pass over (prefix + drafted tokens) → next-token distributions at every drafted position.
- For each drafted token t: accept with probability min(1, p_target(t) / p_draft(t)).
- On reject: sample one corrected token from (p_target − p_draft)_+ / Z; stop.
- If all K accepted: sample one bonus token from target's last distribution.
Why exact
Marginal probability of emitting token t = p_draft(t) · min(1, p_target(t)/p_draft(t)) + P(reject) · residual(t) = min(p_draft(t), p_target(t)) + (p_target(t) − p_draft(t))_+ = p_target(t). Each emitted token is exactly distributed as the target.
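A toy sketch of just the accept/reject step, assuming the per-position draft and target distributions (arrays of shape (K, vocab)) are already computed:

import numpy as np

def accept_or_correct(draft_tokens, p_draft, p_target, rng):
    # draft_tokens: K drafted token ids; p_draft / p_target: (K, vocab) distributions per drafted position
    emitted = []
    for i, t in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_target[i, t] / p_draft[i, t]):
            emitted.append(t)                                   # accept the draft token
        else:
            resid = np.maximum(p_target[i] - p_draft[i], 0.0)   # (p_target - p_draft)_+
            resid /= resid.sum()                                # normalize by Z
            emitted.append(int(rng.choice(len(resid), p=resid)))
            return emitted                                      # stop at the first rejection
    # all K accepted: the caller samples one bonus token from the target's final distribution
    return emitted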
Design YouTube recommendations
Funnel
- Candidate generation (1B → 1k): two-tower (user/item embeddings) + ANN over item index. Multiple sources merged: collab, fresh, subscriptions, trending.
- Ranking (1k → 100): cross-encoder transformer or DLRM with user history (DIN-style attention) + multi-task heads (pCTR, pWatchtime, pLike, pDislike) via MMoE.
- Re-rank (100 → 10): final scoring (weighted heads), diversity (MMR / DPP), freshness boost, business rules.
Training infra
- Embedding tables (TBs) sharded via TorchRec. Streaming Kafka → Flink → trainer for freshness. Daily full retrain + hourly incremental.
Eval
- Offline: AUC, NDCG, recall@k, calibration. Online: A/B with primary watchtime, guardrails on dislikes, diversity, abandonment.
Gotchas to mention
- Position bias → estimate via shallow tower or PAL.
- Feedback loop → mandatory exploration slots.
- Clickbait → train pWatchtime + pSatisfaction, not pCTR.
- Train/serve skew → same feature pipeline, point-in-time correctness.
Full deep dive: ML system design.
Design ChatGPT serving for 200M DAU
Architecture
- Disaggregated serving: prefill pool (compute-bound) + decode pool (bandwidth-bound). KV transferred over RDMA.
- Prefix cache: hash on prompt prefix tokens; share KV across requests with common system prompt / few-shot.
- Continuous batching: per-iteration scheduling. New requests join, finished requests leave.
- Tensor parallel within node (TP=8 for 70B+); pipeline parallel across nodes only for very large.
- Speculative decoding with EAGLE-style draft head.
- Quantization: FP8 weights+activations; INT4 weights for cost-tier variants.
- Routing: model router (cheap classifier on prompt) selects mini / full / reasoning model based on query complexity.
Eval / monitoring
- Offline evals (MMLU, HumanEval, MATH, internal). Online: thumbs feedback, conversation length, retry rate, A/B on model versions.
- Monitor TTFT / ITL distributions, KV cache hit rate, GPU util, queue depth, refusal rate, output toxicity.
Gotchas
- Head-of-line blocking from long-context requests → priority queues, separate pools.
- Autoscaling lag (GPU spin-up takes minutes) → predictive scaling.
- Safety filtering pre/post adds latency.
Full deep dive: LLM inference.
Design a Constitutional AI loop / RLHF training pipeline
Pipeline
- Data ingest: prompts (real, red-team, synthetic). Dedupe + PII filter.
- Generation pool: base model produces K candidates per prompt; self-critique via constitutional principles; revisions.
- Preference labeling: AI judge (stronger model) ranks pairs against constitution. Subset labeled by humans (gold).
- Training: SFT on revisions; then DPO/KTO/PPO on preferences. Iterate.
- Evaluation: capabilities (MMLU, MATH, HumanEval), safety (HarmBench, XSTest), instruction following (IFEval), preference winrate vs prior model.
- Regression gate: blocked if N safety/capability metrics regress > ε.
- Canary rollout: 1% → 10% → 100% with online monitoring.
Pitfalls to mention
- Reward hacking, mode collapse in DPO, distribution shift between SFT data and RL prompts, eval contamination, safety-helpfulness trade-off.
Design a distributed rate limiter
Approach
- Token bucket per key, stored in Redis with Lua atomic update (refill based on elapsed time × rate, capped at capacity).
- Higher scale: shard Redis by hash(key) % N.
- Even higher scale: local approximate counters per service, periodic sync to Redis. Trades precision for throughput.
- Leaky bucket alternative: smoother but no burst tolerance.
- Sliding window log: most precise, highest memory.
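An in-process token-bucket sketch; in production the same refill-and-check logic runs atomically inside a Redis Lua script keyed by user or API key:

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate                 # tokens refilled per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # refill based on elapsed time × rate, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

limiter = TokenBucket(rate=5, capacity=10)
print(limiter.allow())                   # True while the bucket still has tokens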
Gotchas
- Hot keys (single user / DDoS) — shard within key (key + random salt buckets).
- Clock skew across services → use Redis time as authority.
- Failure mode: if Redis is down, fail open (allow) or fail closed (deny)? Usually fail open with degraded service banner.
"Tell me about a time you had to slow down a launch for safety/quality."
STAR template
- Situation: name the launch, the team, the deadline, the pressure.
- Task: what specifically YOU owned (don't say "we").
- Action: the data you saw or the test that flagged the concern. The decision to delay. Who you had to convince. The communication path (engineering manager, PM, ultimately the partner team / leadership).
- Result: the time impact + the bug or risk you avoided. Quantify both.
What lands
- You held a position despite shipping pressure. You had data. You took responsibility for the call (not "I escalated and let leadership decide").
- You communicated transparently with stakeholders rather than blocking silently.
- The fix didn't come at the cost of demoralizing the team; you made the path forward concrete.
"Why are you leaving Meta?"
Three honest framings — pick one
- Scope ceiling. "I've been Staff on [my area] for X years. To go Sr Staff at my current company requires more ladder time. I'd rather earn that level somewhere I'm pushed harder on a different problem."
- Mission shift. "What I've been doing at scale is well-trodden ground. I want to be on the LLM/AGI frontier where the hard problems are unsolved."
- Founder energy. "Want a smaller-org bet. Pre-IPO equity at a frontier lab is a bet I want to take while I can."
Don't say
- "Meta is slowing down" / "TBD Lab took all the fun roles" / "My manager is bad" / "Burned out." All read as flight-risk.
Full playbook: Behavioral page.
"Tell me about a real failure with cost."
Structure
- Pick a real one. Quantified cost. "Missed launch by 2 weeks costing X% of quarterly OKR." Or "regression that affected Y% of users for Z hours."
- Take responsibility cleanly. "I underestimated X" — not "the team didn't have time."
- Specific lesson. Name the concrete thing you changed in your process / system afterwards.
- Show the next instance went better. "When the same kind of decision came up six months later, I [did the new thing]."
Anti-patterns
- "My biggest weakness is I work too hard." Cliché. Penalty.
- Failure that was actually someone else's fault.
- No quantified impact — "it was a learning opportunity."
"Why Anthropic specifically (not OpenAI, not DeepMind)?"
Reference points (use 2)
- Constitutional AI as a path to scalable alignment vs RLHF's labeler bottleneck.
- RSP / ASL framework — explicit capability thresholds that gate safety practices.
- Interpretability investment (Templeton 2024 Scaling Monosemanticity, Olsson induction heads).
- Mech interp as a plausible certification path for safety claims (vs purely behavioral evals which Sleeper Agents showed can be fooled).
Tie to your work
"I've spent X years building [your area — e.g. ranking systems where calibrated probabilities matter]. The interp work feels like the same kind of building-up-trust-via-measurement problem at a different scale, and I want to be on that side of it."
"Derive Adam's bias correction."
The EMA m_t = β m_{t−1} + (1−β) g_t with m_0 = 0 unrolls to:
m_t = (1−β) Σ_{i=1}^{t} β^(t−i) · g_i
Under stationary g: E[m_t] = (1−β^t) · E[g]. Biased toward 0 by factor (1−β^t). Divide by it: m̂_t = m_t / (1−β^t) is unbiased.
Why it matters: at step 1 with β=0.9, raw m has 10% the magnitude it should. First update is 10× too small without correction.
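A quick numeric check with a constant toy gradient: the raw EMA at step t is (1 − β^t)·g, and dividing by (1 − β^t) recovers g exactly.

# Toy check: with constant gradient g = 1.0 and beta = 0.9, raw m is only 0.1 at step 1;
# dividing by (1 - beta**t) gives exactly 1.0 at every step.
beta, g, m = 0.9, 1.0, 0.0
for t in range(1, 6):
    m = beta * m + (1 - beta) * g
    print(t, round(m, 4), round(m / (1 - beta ** t), 4))   # corrected estimate is 1.0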
"Why is L1 sparse but L2 not?"
Geometric: L2 ball is a smooth circle/sphere. Loss contour first touches at a generic point — no coordinate exactly 0. L1 ball is a diamond/hyperoctahedron with corners on the axes. Loss contour most often touches at a corner — corners have all-but-one coordinate = 0 → sparsity.
Gradient: L2 gradient is 2λθ — proportional to weight, pulls toward 0 multiplicatively (never reaches it). L1 subgradient is λ · sign(θ) — constant pull regardless of magnitude. Soft-threshold operator sign(θ) · max(|θ| − λ, 0) drives small weights to exactly 0.
Bayes: L2 = Gaussian prior MAP. L1 = Laplace prior MAP (Laplace has heavier mass at 0).
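The soft-threshold operator from the gradient argument, as a short sketch with toy numbers:

import numpy as np

def soft_threshold(theta, lam):
    # proximal step for the L1 penalty: anything with |theta| <= lam lands exactly on zero
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

print(soft_threshold(np.array([-0.30, 0.05, 1.20]), 0.1))   # [-0.2  0.   1.1]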
"Why use cross-entropy, not MSE, for classification?"
- Probabilistic correctness: CE is the negative log-likelihood under Bernoulli/Categorical noise; MSE assumes Gaussian noise (wrong for discrete labels).
- Gradient signal: CE gradient w.r.t. pre-softmax logits is p̂ − y — clean, bounded, doesn't vanish. MSE through softmax includes σ'(z), which vanishes when softmax saturates → slow learning.
- Calibration: CE produces calibrated probabilities. MSE-trained classifiers are systematically miscalibrated.
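A finite-difference sanity check of the p̂ − y gradient on toy logits (true class index 1 assumed):

import numpy as np

z = np.array([2.0, -1.0, 0.5])                         # toy logits
y = np.array([0.0, 1.0, 0.0])                          # one-hot label, true class = 1
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()
loss = lambda v: -np.log(softmax(v)[1])                # cross-entropy for the true class
analytic = softmax(z) - y                              # the p_hat - y formula
eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(3)[i]) - loss(z)) / eps for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-4))       # True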
"Explain Chinchilla and why models in 2026 violate it."
Chinchilla (Hoffmann 2022): for fixed training compute C ≈ 6ND, the loss-minimizing allocation is D ≈ 20·N (roughly 20 training tokens per parameter). Pre-Chinchilla models (GPT-3, Gopher) were undertrained.
Why 2026 violates this: training compute is one-shot; inference compute is amortized over years of serving. Llama 3 8B trained on 15T tokens (1875 tokens/param) is way past Chinchilla optimal — but the smaller model is much cheaper to serve, so total system cost is lower over the model's lifetime.
The compute-optimal frontier shifts when you include inference cost in the objective.
"Why does MLA need decoupled RoPE?"
RoPE is a non-linear transformation depending on token position — it cannot be absorbed into Q's projection because the rotation depends on which token you're looking at, not the projection weights.
So MLA splits Q and K into two parts:
- Q^C, K^C: content path. Compressed via low-rank latent c; no RoPE; up-projection absorbed into Q.
- Q^R, K^R: rotary path. Small dim (e.g., 64); RoPE applied; K^R shared across all heads (MQA-style).
Final attention score = Q^C K^C + Q^R K^R. Cache stores only c + small K^R. ~7% of MHA cache size with MHA-equivalent quality.
"How big is Llama 3 70B's KV cache at 128k context?"
Llama 3 70B uses GQA: 8 KV heads, d_head = 128, 80 layers, BF16.
- Per token per layer: 2 · 8 · 128 · 2 = 4,096 bytes (leading 2 = K + V, trailing 2 = bytes per BF16 value)
- Per token (all layers): 4,096 · 80 = 327,680 bytes ≈ 320 KB
- At 128k tokens: 320 KB · 131,072 ≈ 40 GB per request
- MLA (latent dim ~512): ~3 GB at the same context — >10× smaller.
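The same arithmetic as a back-of-envelope snippet:

# Reproduces the numbers above (GQA config for Llama 3 70B)
kv_heads, d_head, layers, bf16_bytes, ctx = 8, 128, 80, 2, 131_072
per_token = 2 * kv_heads * d_head * bf16_bytes * layers   # leading 2 = K and V
print(per_token)                       # 327,680 bytes ≈ 320 KB per token
print(per_token * ctx / 2**30)         # ≈ 40 GiB per 128k-token request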
"Test AUC is high but A/B test loses. What's wrong?"
- Selection bias / exposure bias — train log only contains items the old model showed; new model's recommendations are out-of-distribution.
- Train/serve skew — different feature pipeline at training vs serving.
- Miscalibration — AUC measures ranking, not calibrated probability; downstream uses (auctions, thresholds) suffer.
- Distribution shift — historical traffic mix doesn't match production.
- Novelty effect — short-term spike from "new content" without sustained engagement.
- Position bias — model fits old position-weighted clicks; new ranking changes positions, breaking the assumption.
"Explain log-sum-exp trick."
log Σ exp(x_i) overflows if any x_i is large (e.g., 1000 → exp explodes). Subtract the max:
log Σ exp(x_i) = m + log Σ exp(x_i − m)
where m = max(x_i)
Now exp arguments are ≤ 0, can't overflow. Used everywhere: softmax, partition functions, anything that's "log of a sum of exponentials." Forgetting this is the #1 source of NaN losses in custom kernels.
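A short implementation plus a case where the naive version overflows:

import numpy as np

def logsumexp(x):
    # subtract the max so every exp argument is <= 0 and cannot overflow
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

print(logsumexp(np.array([1000.0, 1000.0])))   # ≈ 1000.693; naive np.log(np.exp(x).sum()) overflows to inf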
"In-batch negatives vs explicit hard negatives — tradeoff?"
- In-batch: cheap (no extra forward); popularity-biased (high-frequency items appear more often as negatives, get pushed away too aggressively); fix with logQ correction (Yi 2019): subtract log(sample_freq) from logits.
- Hard negatives (top non-positive from ANN over corpus): faster per-example learning; risk of false negatives (true positives that look similar); risk of optimization instability.
- Mixed Negative Sampling (MNS): combine in-batch (popularity-biased, cheap) + uniform-random (cheap, less biased) + hard (expensive, fast learning). Best of all worlds.
- Cross-batch: queue past embeddings as additional negatives (MoCo-style). Larger negative pool with constant memory.
Mock interview circuits
Build a full mock loop by combining mocks across categories. Below are pre-built loops that mimic real onsites.
Run this back-to-back to simulate the loop
- Coding (90 min) — Multithreaded web crawler.
- Break (15 min) — water + walk.
- ML coding (60 min) — Multi-head attention from scratch.
- Behavioral (45 min) — "Why slow for safety" + "Why Anthropic" + 1 failure story.
- ML system design (45 min) — Constitutional AI loop.
Run this back-to-back
- Coding 1 (75 min) — Time-based KV store.
- Coding 2 (75 min) — Spreadsheet API.
- Break (15 min).
- ML coding (60 min) — Top-k/top-p sampling + extension to speculative decoding.
- Behavioral / culture-fit (45 min) — "Why OpenAI" + 2 STAR stories.
Run this back-to-back
- Tech screen (60 min) — 3 ML fundamentals + 2 LC hard.
- ML coding (45 min) — K-means OR multi-head attention.
- ML system design (60 min) — Design Pinterest home feed.
- ML theory probes (30 min) — In-batch negatives, calibration, position bias.
- Behavioral (45 min) — past projects + scope.
External mock-interview resources
- Pramp (free) — peer-to-peer mocks, daily slots
- interviewing.io (paid) — anonymous mocks with ex-FAANG interviewers
- Exponent (paid) — coaching + mocks for tech roles
- Hello Interview (free) — practice problems with detailed walkthroughs
- AlgoExpert (paid) — video walkthroughs of LeetCode-style problems
- LeetCode mock interview mode (free)
- Self-record with Zoom (free) — watch back at 1.5×; brutal but fastest feedback loop