Mock interviews
Curated mock prompts across coding, ML coding, ML system design, behavioral, and theory probes — with answers attached. Hit the random button, set the timer, and run a 45-min loop on yourself. Sourced from real Anthropic / OpenAI / DeepMind / Pinterest / Stripe loops.
- Pick a category (or hit Random mock).
- Start the timer at the top.
- Solve out loud (record yourself with Zoom or Loom — you'll learn 10× faster).
- Stop the timer. Reveal the answer. Mark what you missed.
- Write a 3-line postmortem in your application tracker / a notes file.
The mock library
Multithreaded web crawler
Crawl all pages reachable from a seed URL on the same domain, stripping #fragment identifiers before dedup. Start sync; then make it concurrent with ThreadPoolExecutor. Follow-ups: politeness, robots.txt, distributed crawling.
Approach
- Sync version: queue.Queue for unvisited URLs, set for visited. Dequeue, fetch HTML, parse links, filter same-domain + strip fragment, push unseen.
- Concurrent: ThreadPoolExecutor(max_workers=N). Submit fetch tasks; on completion, parse + enqueue new URLs. Visited set guarded by lock.
- Termination: maintain in-flight counter; when queue empty AND no in-flight tasks, done.
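A minimal concurrent sketch, assuming requests and BeautifulSoup for fetch/parse (error handling and politeness elided; the crawl function and its parameters are illustrative):

import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_workers=8):
    domain = urlparse(seed).netloc
    visited, lock = {seed}, threading.Lock()

    def fetch(url):
        html = requests.get(url, timeout=5).text
        links = []
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute, _frag = urldefrag(urljoin(url, a["href"]))  # strip #fragment
            if urlparse(absolute).netloc == domain:               # same-domain filter
                links.append(absolute)
        return links

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        in_flight = {pool.submit(fetch, seed)}
        while in_flight:                                          # nothing queued, nothing in flight => done
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in done:
                for link in fut.result():
                    with lock:                                    # visited set guarded by a lock
                        if link in visited:
                            continue
                        visited.add(link)
                    in_flight.add(pool.submit(fetch, link))
    return visited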
Follow-up answers
- Politeness: per-host token bucket; cap concurrent requests per host.
- robots.txt: fetch and cache per host; respect Disallow.
- Distributed: shard URLs by hash(URL) % N workers; central dedup via Redis or Bloom filter; coordinator detects "done."
- Threads vs processes: GIL is released on socket reads → threads are fine. Processes only if HTML parsing dominates.
Time-based key-value store with extensions
set(key, value, timestamp) and get(key, timestamp) returning the latest value with timestamp ≤ given. Then extend: per-key locks vs global lock, disk persistence (replay log + snapshots), TTL on values. (LC 981 base, then real-system follow-ups.)
Base structure
Hashmap key → list of (timestamp, value) sorted by timestamp. set: append (assume monotonic) or insort. get: binary search for largest ts ≤ target.
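A minimal single-threaded sketch of the base structure using bisect (the TimeMap class name is illustrative; locks, persistence, and TTL layer on top per the extensions below):

import bisect
from collections import defaultdict

class TimeMap:
    def __init__(self):
        self.ts = defaultdict(list)      # key -> sorted timestamps
        self.vals = defaultdict(list)    # key -> values aligned with ts

    def set(self, key, value, timestamp):
        # insort-style insert keeps things correct even for out-of-order writes
        i = bisect.bisect_right(self.ts[key], timestamp)
        self.ts[key].insert(i, timestamp)
        self.vals[key].insert(i, value)

    def get(self, key, timestamp):
        # rightmost entry with ts <= timestamp
        i = bisect.bisect_right(self.ts[key], timestamp)
        return self.vals[key][i - 1] if i else None

tm = TimeMap()
tm.set("k", "v1", 1); tm.set("k", "v2", 4)
print(tm.get("k", 3), tm.get("k", 5))    # v1 v2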
Extensions
- Per-key locks: stripe-lock on hash(key) % N — much higher concurrency than global mutex on hot keys.
- Persistence: append-only log (write-ahead) + periodic snapshot. Recover by replay since last snapshot.
- TTL: store (timestamp, value, expiry); lazy eviction on read; periodic background sweep.
Edge cases interviewer expects you to mention
- Out-of-order timestamps. Concurrent updates with same timestamp. Memory growth without eviction. Crash mid-write (use fsync semantics).
Spreadsheet API with cycle detection
setCell(name, value | formula) and getCell(name). Formulas can reference other cells. Detect cycles. Then optimize getCell to O(1).
Approach
- Parse formula to find dependencies. Build a dependency graph.
- Cycle detection: DFS with 3-color marking (white/gray/black). Gray-on-gray = cycle.
- Eager evaluation: on setCell, topological-sort affected cells; recompute in order. getCell = O(1) lookup.
- Lazy alternative: cache evaluations; invalidate on dependency change.
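A sketch of just the cycle check, with the dependency graph passed in as a plain dict (formula parsing and evaluation elided; cell names are illustrative):

def has_cycle(deps):
    # deps: cell -> list of cells its formula references
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY
        for d in deps.get(c, []):
            if color.get(d, WHITE) == GRAY:              # gray-on-gray = back edge = cycle
                return True
            if color.get(d, WHITE) == WHITE and dfs(d):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

print(has_cycle({"A1": ["B1"], "B1": ["A1"]}))   # True
print(has_cycle({"A1": ["B1"], "B1": []}))       # False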
Common follow-ups
- Concurrent reads/writes (RWLock). Memory cost of dependency graph. Streaming partial updates.
LRU cache (LC 146)
get(key) and put(key, value), both O(1). Capacity-limited; evict least-recently-used on overflow.
Structure
Hashmap (key → node) + doubly-linked list (head = most recently used, tail = least recently used).
- get: hashmap lookup; move node to head; return value.
- put: if key exists, update value + move to head. Else create node, push to head, hashmap insert. If over capacity, remove tail node + remove from hashmap.
Pitfalls
- Forgetting to remove from hashmap when evicting tail. Sentinel head/tail nodes simplify edge cases. Single mutex around the whole struct for thread safety.
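A compact sketch using sentinel head/tail nodes as suggested above; thread safety would be one mutex around get/put:

class Node:
    __slots__ = ("key", "val", "prev", "next")
    def __init__(self, key=None, val=None):
        self.key, self.val, self.prev, self.next = key, val, None, None

class LRUCache:
    def __init__(self, capacity):
        self.cap, self.map = capacity, {}
        self.head, self.tail = Node(), Node()      # sentinels: head.next = MRU, tail.prev = LRU
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, n):
        n.prev.next, n.next.prev = n.next, n.prev

    def _push_front(self, n):
        n.prev, n.next = self.head, self.head.next
        self.head.next.prev = n
        self.head.next = n

    def get(self, key):
        if key not in self.map:
            return -1
        n = self.map[key]
        self._unlink(n); self._push_front(n)       # mark most recently used
        return n.val

    def put(self, key, value):
        if key in self.map:
            n = self.map[key]
            n.val = value
            self._unlink(n); self._push_front(n)
            return
        if len(self.map) >= self.cap:
            lru = self.tail.prev
            self._unlink(lru)
            del self.map[lru.key]                  # don't forget the hashmap removal
        n = Node(key, value)
        self.map[key] = n
        self._push_front(n)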
Top K frequent elements (LC 347)
Approach
- Hashmap counts: Counter(nums).
- Min-heap of size K: push (count, num); if size > K, pop. End: heap has K largest.
- Alt: bucket sort by count → O(N) but uses more memory.
- Alt: quickselect on counts → O(N) average.
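The heap approach in a few lines; heapq.nlargest maintains the bounded size-K heap for you:

from collections import Counter
import heapq

def top_k_frequent(nums, k):
    counts = Counter(nums)
    # nlargest keeps a size-k heap internally: O(N log k)
    return [num for num, _ in heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])]

print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))   # [1, 2]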
Multi-head attention from scratch in PyTorch
Implement multi-head self-attention with causal and padding masks, without using nn.MultiheadAttention.
Code skeleton
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHA(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads; self.d_h = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, causal=True, pad_mask=None):
        B, T, _ = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.h, self.d_h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                           # each (B, H, T, d_h)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_h)   # (B, H, T, T)
        if causal:
            mask = torch.triu(torch.ones(T, T, device=x.device), 1).bool()
            scores = scores.masked_fill(mask, float('-inf'))
        if pad_mask is not None:                                   # pad_mask: (B, T), True = padding
            scores = scores.masked_fill(pad_mask[:, None, None, :], float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)         # back to (B, T, d_model)
        return self.out(out)
What interviewers check
- Did you scale by √d_head? (Why: prevents softmax saturation.)
- Did you mask before softmax? (Not after.)
- Shape gymnastics correct? Comment shapes.
- Bonus: extend to GQA. Extend to KV cache for autoregressive decode.
BPE tokenizer from scratch
Algorithm
- Initialize vocabulary as bytes (256 entries).
- Pre-tokenize corpus (split on whitespace + simple regex).
- Count adjacent pair frequencies in current word splits.
- Find most frequent pair. Add merged token to vocab. Apply merge to all words.
- Repeat until target vocab size.
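A character-level training sketch to show the counting/merging loop (real byte-level BPE starts from 256 byte tokens and pre-tokenizes with a regex; the helper names here are illustrative):

from collections import Counter

def pair_counts(splits):
    # splits: word (tuple of symbols) -> corpus frequency
    counts = Counter()
    for syms, freq in splits.items():
        for pair in zip(syms, syms[1:]):
            counts[pair] += freq
    return counts

def apply_merge(splits, pair):
    a, b = pair
    out = Counter()
    for syms, freq in splits.items():
        merged, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
                merged.append(a + b); i += 2
            else:
                merged.append(syms[i]); i += 1
        out[tuple(merged)] += freq
    return out

def train_bpe(words, num_merges):
    splits = Counter(tuple(w) for w in words)    # start from single characters
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(splits)
        if not counts:
            break
        best = max(counts, key=counts.get)       # most frequent adjacent pair
        merges.append(best)
        splits = apply_merge(splits, best)
    return merges

print(train_bpe(["low", "lower", "lowest"], 3))  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]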
Encode
Apply learned merges in order (priority by merge index). Greedy left-to-right.
Common pitfalls interviewer probes
- Whitespace handling: GPT-2 BPE includes leading space as part of token.
- Byte fallback: unknown characters → emit raw bytes; vocab guarantees byte coverage.
- Merge order matters at encode time.
Top-K + top-P sampling
Code
import numpy as np

def sample(logits, temp=1.0, top_k=None, top_p=None):
    logits = logits / max(temp, 1e-6)
    if top_k:
        # keep only the top_k largest logits; push everything else to effectively -inf
        idx = np.argpartition(logits, -top_k)[-top_k:]
        mask = np.full_like(logits, -1e9); mask[idx] = logits[idx]
        logits = mask
    # numerically stable softmax
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    if top_p:
        order = np.argsort(-probs)                  # stable descending
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1    # always keeps at least one token
        keep = order[:cutoff]
        new = np.zeros_like(probs); new[keep] = probs[keep]
        probs = new / new.sum()
    return np.random.choice(len(probs), p=probs)
What gets you graded down
- Forgetting logits -= logits.max() for stability.
- Off-by-one in nucleus cutoff — must keep at least one token whose cum ≥ p.
- Not handling temperature = 0 (greedy degenerate case).
K-means from scratch with k-means++ init
Algorithm
- k-means++ init: pick first centroid uniformly random; each subsequent centroid sampled with probability ∝ D(x)² where D(x) is distance to nearest existing centroid.
- Assign: each point to nearest centroid (vectorized with numpy broadcast).
- Update: centroid = mean of assigned points. Empty cluster → reinitialize via k-means++ or to a random point.
- Stop: when assignments stop changing OR centroid movement < ε OR max iterations.
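A vectorized numpy sketch of init plus the assign/update loop (empty clusters reseeded randomly, as noted above; function names are illustrative):

import numpy as np

def kmeans_pp_init(X, k, rng):
    # first centroid uniform at random; later ones sampled with prob ∝ squared distance to nearest chosen centroid
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centroids)[None]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    C = kmeans_pp_init(X, k, rng)
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - C[None]) ** 2).sum(-1)       # (n, k) squared distances
        labels = d2.argmin(axis=1)                           # assign step
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else X[rng.integers(len(X))]       # empty cluster: reseed to a random point
                          for j in range(k)])
        if np.linalg.norm(new_C - C) < tol:                  # centroids stopped moving
            C = new_C
            break
        C = new_C
    return C, labels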
Follow-ups
- Choose K: elbow method (within-cluster SSE vs k), silhouette score, gap statistic.
- Scale to 1B points: mini-batch k-means; or k-means on a subsample then assign rest.
- High-d data: PCA first; or use spherical k-means for unit-norm vectors.
Speculative decoding (toy implementation)
Algorithm sketch
- Draft generates K tokens, recording draft probabilities at each step.
- Target runs ONE forward pass over (prefix + drafted tokens) → next-token distributions at every drafted position.
- For each drafted token t: accept with probability min(1, p_target(t) / p_draft(t)).
- On reject: sample one corrected token from (p_target − p_draft)_+ / Z; stop.
- If all K accepted: sample one bonus token from target's last distribution.
Why exact
Marginal probability of emitting token t = p_draft(t) · min(1, p_target(t)/p_draft(t)) + P(reject) · residual(t) = min(p_draft(t), p_target(t)) + (p_target(t) − p_draft(t))_+ = p_target(t). Each emitted token is exactly distributed as the target.
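A toy sketch of just the accept/reject step, assuming the per-position draft and target distributions (arrays of shape (K, vocab)) are already computed:

import numpy as np

def accept_or_correct(draft_tokens, p_draft, p_target, rng):
    # draft_tokens: K drafted token ids; p_draft / p_target: (K, vocab) distributions per drafted position
    emitted = []
    for i, t in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_target[i, t] / p_draft[i, t]):
            emitted.append(t)                                   # accept the draft token
        else:
            resid = np.maximum(p_target[i] - p_draft[i], 0.0)   # (p_target - p_draft)_+
            resid /= resid.sum()                                # normalize by Z
            emitted.append(int(rng.choice(len(resid), p=resid)))
            return emitted                                      # stop at the first rejection
    # all K accepted: the caller samples one bonus token from the target's final distribution
    return emitted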
Design YouTube recommendations
Funnel
- Candidate generation (1B → 1k): two-tower (user/item embeddings) + ANN over item index. Multiple sources merged: collab, fresh, subscriptions, trending.
- Ranking (1k → 100): cross-encoder transformer or DLRM with user history (DIN-style attention) + multi-task heads (pCTR, pWatchtime, pLike, pDislike) via MMoE.
- Re-rank (100 → 10): final scoring (weighted heads), diversity (MMR / DPP), freshness boost, business rules.
Training infra
- Embedding tables (TBs) sharded via TorchRec. Streaming Kafka → Flink → trainer for freshness. Daily full retrain + hourly incremental.
Eval
- Offline: AUC, NDCG, recall@k, calibration. Online: A/B with primary watchtime, guardrails on dislikes, diversity, abandonment.
Gotchas to mention
- Position bias → estimate via shallow tower or PAL.
- Feedback loop → mandatory exploration slots.
- Clickbait → train pWatchtime + pSatisfaction, not pCTR.
- Train/serve skew → same feature pipeline, point-in-time correctness.
Full deep dive: ML system design.
Design ChatGPT serving for 200M DAU
Architecture
- Disaggregated serving: prefill pool (compute-bound) + decode pool (bandwidth-bound). KV transferred over RDMA.
- Prefix cache: hash on prompt prefix tokens; share KV across requests with common system prompt / few-shot.
- Continuous batching: per-iteration scheduling. New requests join, finished requests leave.
- Tensor parallel within node (TP=8 for 70B+); pipeline parallel across nodes only for very large.
- Speculative decoding with EAGLE-style draft head.
- Quantization: FP8 weights+activations; INT4 weights for cost-tier variants.
- Routing: model router (cheap classifier on prompt) selects mini / full / reasoning model based on query complexity.
Eval / monitoring
- Offline evals (MMLU, HumanEval, MATH, internal). Online: thumbs feedback, conversation length, retry rate, A/B on model versions.
- Monitor TTFT / ITL distributions, KV cache hit rate, GPU util, queue depth, refusal rate, output toxicity.
Gotchas
- Head-of-line blocking from long-context requests → priority queues, separate pools.
- Autoscaling lag (GPU spin-up takes minutes) → predictive scaling.
- Safety filtering pre/post adds latency.
Full deep dive: LLM inference.
Design a Constitutional AI loop / RLHF training pipeline
Pipeline
- Data ingest: prompts (real, red-team, synthetic). Dedupe + PII filter.
- Generation pool: base model produces K candidates per prompt; self-critique via constitutional principles; revisions.
- Preference labeling: AI judge (stronger model) ranks pairs against constitution. Subset labeled by humans (gold).
- Training: SFT on revisions; then DPO/KTO/PPO on preferences. Iterate.
- Evaluation: capabilities (MMLU, MATH, HumanEval), safety (HarmBench, XSTest), instruction following (IFEval), preference winrate vs prior model.
- Regression gate: blocked if N safety/capability metrics regress > ε.
- Canary rollout: 1% → 10% → 100% with online monitoring.
Pitfalls to mention
- Reward hacking, mode collapse in DPO, distribution shift between SFT data and RL prompts, eval contamination, safety-helpfulness trade-off.
Design a distributed rate limiter
Approach
- Token bucket per key, stored in Redis with Lua atomic update (refill based on elapsed time × rate, capped at capacity).
- Higher scale: shard Redis by hash(key) % N.
- Even higher scale: local approximate counters per service, periodic sync to Redis. Trades precision for throughput.
- Leaky bucket alternative: smoother but no burst tolerance.
- Sliding window log: most precise, highest memory.
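An in-process token-bucket sketch; in production the same refill-and-check logic runs atomically inside a Redis Lua script keyed by user or API key:

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate                 # tokens refilled per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # refill based on elapsed time × rate, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

limiter = TokenBucket(rate=5, capacity=10)
print(limiter.allow())                   # True while the bucket still has tokens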
Gotchas
- Hot keys (single user / DDoS) — shard within key (key + random salt buckets).
- Clock skew across services → use Redis time as authority.
- Failure mode: if Redis is down, fail open (allow) or fail closed (deny)? Usually fail open with degraded service banner.
"Tell me about a time you had to slow down a launch for safety/quality."
STAR template
- Situation: name the launch, the team, the deadline, the pressure.
- Task: what specifically YOU owned (don't say "we").
- Action: the data you saw or the test that flagged the concern. The decision to delay. Who you had to convince. The communication path (engineering manager, PM, ultimately the partner team / leadership).
- Result: the time impact + the bug or risk you avoided. Quantify both.
What lands
- You held a position despite shipping pressure. You had data. You took responsibility for the call (not "I escalated and let leadership decide").
- You communicated transparently with stakeholders rather than blocking silently.
- The fix didn't come at the cost of demoralizing the team; you made the path forward concrete.
"Why are you leaving Meta?"
Three honest framings — pick one
- Scope ceiling. "I've been Staff on [my area] for X years. To go Sr Staff at my current company requires more ladder time. I'd rather earn that level somewhere I'm pushed harder on a different problem."
- Mission shift. "What I've been doing at scale is well-trodden ground. I want to be on the LLM/AGI frontier where the hard problems are unsolved."
- Founder energy. "Want a smaller-org bet. Pre-IPO equity at a frontier lab is a bet I want to take while I can."
Don't say
- "Meta is slowing down" / "TBD Lab took all the fun roles" / "My manager is bad" / "Burned out." All read as flight-risk.
Full playbook: Behavioral page.
"Tell me about a real failure with cost."
Structure
- Pick a real one. Quantified cost. "Missed launch by 2 weeks costing X% of quarterly OKR." Or "regression that affected Y% of users for Z hours."
- Take responsibility cleanly. "I underestimated X" — not "the team didn't have time."
- Specific lesson. Name the concrete thing you changed in your process / system afterwards.
- Show the next instance went better. "When the same kind of decision came up six months later, I [did the new thing]."
Anti-patterns
- "My biggest weakness is I work too hard." Cliché. Penalty.
- Failure that was actually someone else's fault.
- No quantified impact — "it was a learning opportunity."
"Why Anthropic specifically (not OpenAI, not DeepMind)?"
Reference points (use 2)
- Constitutional AI as a path to scalable alignment vs RLHF's labeler bottleneck.
- RSP / ASL framework — explicit capability thresholds that gate safety practices.
- Interpretability investment (Templeton 2024 Scaling Monosemanticity, Olsson induction heads).
- Mech interp as a plausible certification path for safety claims (vs purely behavioral evals which Sleeper Agents showed can be fooled).
Tie to your work
"I've spent X years building [your area — e.g. ranking systems where calibrated probabilities matter]. The interp work feels like the same kind of building-up-trust-via-measurement problem at a different scale, and I want to be on that side of it."
"Derive Adam's bias correction."
The EMA m_t = β m_{t−1} + (1−β) g_t with m_0 = 0 unrolls to:
m_t = (1−β) Σ_{i=1}^{t} β^(t−i) · g_i
Under stationary g: E[m_t] = (1−β^t) · E[g]. Biased toward 0 by factor (1−β^t). Divide by it: m̂_t = m_t / (1−β^t) is unbiased.
Why it matters: at step 1 with β=0.9, raw m has 10% the magnitude it should. First update is 10× too small without correction.
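A quick numeric check with a constant toy gradient: the raw EMA at step t is (1 − β^t)·g, and dividing by (1 − β^t) recovers g exactly.

# Toy check: with constant gradient g = 1.0 and beta = 0.9, raw m is only 0.1 at step 1;
# dividing by (1 - beta**t) gives exactly 1.0 at every step.
beta, g, m = 0.9, 1.0, 0.0
for t in range(1, 6):
    m = beta * m + (1 - beta) * g
    print(t, round(m, 4), round(m / (1 - beta ** t), 4))   # corrected estimate is 1.0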
"Why is L1 sparse but L2 not?"
Geometric: L2 ball is a smooth circle/sphere. Loss contour first touches at a generic point — no coordinate exactly 0. L1 ball is a diamond/hyperoctahedron with corners on the axes. Loss contour most often touches at a corner — corners have all-but-one coordinate = 0 → sparsity.
Gradient: L2 gradient is 2λθ — proportional to weight, pulls toward 0 multiplicatively (never reaches it). L1 subgradient is λ · sign(θ) — constant pull regardless of magnitude. Soft-threshold operator sign(θ) · max(|θ| − λ, 0) drives small weights to exactly 0.
Bayes: L2 = Gaussian prior MAP. L1 = Laplace prior MAP (Laplace has heavier mass at 0).
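The soft-threshold operator from the gradient argument, as a short sketch with toy numbers:

import numpy as np

def soft_threshold(theta, lam):
    # proximal step for the L1 penalty: anything with |theta| <= lam lands exactly on zero
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

print(soft_threshold(np.array([-0.30, 0.05, 1.20]), 0.1))   # [-0.2  0.   1.1]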
"Why use cross-entropy, not MSE, for classification?"
- Probabilistic correctness: CE is the negative log-likelihood under Bernoulli/Categorical noise; MSE assumes Gaussian noise (wrong for discrete labels).
- Gradient signal: CE gradient w.r.t. pre-softmax logits is p̂ − y — clean, bounded, doesn't vanish. MSE through softmax includes σ'(z), which vanishes when softmax saturates → slow learning.
- Calibration: CE produces calibrated probabilities. MSE-trained classifiers are systematically miscalibrated.
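A finite-difference sanity check of the p̂ − y gradient on toy logits (true class index 1 assumed):

import numpy as np

z = np.array([2.0, -1.0, 0.5])                         # toy logits
y = np.array([0.0, 1.0, 0.0])                          # one-hot label, true class = 1
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()
loss = lambda v: -np.log(softmax(v)[1])                # cross-entropy for the true class
analytic = softmax(z) - y                              # the p_hat - y formula
eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(3)[i]) - loss(z)) / eps for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-4))       # True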
"Explain Chinchilla and why models in 2026 violate it."
Chinchilla (Hoffmann 2022): for fixed training compute C ≈ 6ND, the loss-minimizing allocation is D ≈ 20·N (roughly 20 training tokens per parameter). Pre-Chinchilla models (GPT-3, Gopher) were undertrained.
Why 2026 violates this: training compute is one-shot; inference compute is amortized over years of serving. Llama 3 8B trained on 15T tokens (1875 tokens/param) is way past Chinchilla optimal — but the smaller model is much cheaper to serve, so total system cost is lower over the model's lifetime.
The compute-optimal frontier shifts when you include inference cost in the objective.
"Why does MLA need decoupled RoPE?"
RoPE is a non-linear transformation depending on token position — it cannot be absorbed into Q's projection because the rotation depends on which token you're looking at, not the projection weights.
So MLA splits Q and K into two parts:
- Q^C, K^C: content path. Compressed via low-rank latent c; no RoPE; up-projection absorbed into Q.
- Q^R, K^R: rotary path. Small dim (e.g., 64); RoPE applied; K^R shared across all heads (MQA-style).
Final attention score = Q^C K^C + Q^R K^R. Cache stores only c + small K^R. ~7% of MHA cache size with MHA-equivalent quality.
"How big is Llama 3 70B's KV cache at 128k context?"
Llama 3 70B uses GQA: 8 KV heads, d_head = 128, 80 layers, BF16.
- Per token per layer: 2 · 8 · 128 · 2 = 4,096 bytes (leading 2 = K + V, trailing 2 = bytes per BF16 value)
- Per token (all layers): 4,096 · 80 = 327,680 bytes ≈ 320 KB
- At 128k tokens: 320 KB · 131,072 ≈ 40 GB per request
- MLA (latent dim ~512): ~3 GB at the same context — >10× smaller.
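The same arithmetic as a back-of-envelope snippet:

# Reproduces the numbers above (GQA config for Llama 3 70B)
kv_heads, d_head, layers, bf16_bytes, ctx = 8, 128, 80, 2, 131_072
per_token = 2 * kv_heads * d_head * bf16_bytes * layers   # leading 2 = K and V
print(per_token)                       # 327,680 bytes ≈ 320 KB per token
print(per_token * ctx / 2**30)         # ≈ 40 GiB per 128k-token request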
"Test AUC is high but A/B test loses. What's wrong?"
- Selection bias / exposure bias — train log only contains items the old model showed; new model's recommendations are out-of-distribution.
- Train/serve skew — different feature pipeline at training vs serving.
- Miscalibration — AUC measures ranking, not calibrated probability; downstream uses (auctions, thresholds) suffer.
- Distribution shift — historical traffic mix doesn't match production.
- Novelty effect — short-term spike from "new content" without sustained engagement.
- Position bias — model fits old position-weighted clicks; new ranking changes positions, breaking the assumption.
"Explain log-sum-exp trick."
log Σ exp(x_i) overflows if any x_i is large (e.g., 1000 → exp explodes). Subtract the max:
log Σ exp(x_i) = m + log Σ exp(x_i − m)
where m = max(x_i)
Now exp arguments are ≤ 0, can't overflow. Used everywhere: softmax, partition functions, anything that's "log of a sum of exponentials." Forgetting this is the #1 source of NaN losses in custom kernels.
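A short implementation plus a case where the naive version overflows:

import numpy as np

def logsumexp(x):
    # subtract the max so every exp argument is <= 0 and cannot overflow
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

print(logsumexp(np.array([1000.0, 1000.0])))   # ≈ 1000.693; naive np.log(np.exp(x).sum()) overflows to inf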
"In-batch negatives vs explicit hard negatives — tradeoff?"
- In-batch: cheap (no extra forward); popularity-biased (high-frequency items appear more often as negatives, get pushed away too aggressively); fix with logQ correction (Yi 2019): subtract log(sample_freq) from logits.
- Hard negatives (top non-positive from ANN over corpus): faster per-example learning; risk of false negatives (true positives that look similar); risk of optimization instability.
- Mixed Negative Sampling (MNS): combine in-batch (popularity-biased, cheap) + uniform-random (cheap, less biased) + hard (expensive, fast learning). Best of all worlds.
- Cross-batch: queue past embeddings as additional negatives (MoCo-style). Larger negative pool with constant memory.
Mock interview circuits
Build a full mock loop by combining mocks across categories. Below are pre-built loops that mimic real onsites.
Run this back-to-back to simulate the loop
- Coding (90 min) — Multithreaded web crawler.
- Break (15 min) — water + walk.
- ML coding (60 min) — Multi-head attention from scratch.
- Behavioral (45 min) — "Why slow for safety" + "Why Anthropic" + 1 failure story.
- ML system design (45 min) — Constitutional AI loop.
Run this back-to-back
- Coding 1 (75 min) — Time-based KV store.
- Coding 2 (75 min) — Spreadsheet API.
- Break (15 min).
- ML coding (60 min) — Top-k/top-p sampling + extension to speculative decoding.
- Behavioral / culture-fit (45 min) — "Why OpenAI" + 2 STAR stories.
Run this back-to-back
- Tech screen (60 min) — 3 ML fundamentals + 2 LC hard.
- ML coding (45 min) — K-means OR multi-head attention.
- ML system design (60 min) — Design Pinterest home feed.
- ML theory probes (30 min) — In-batch negatives, calibration, position bias.
- Behavioral (45 min) — past projects + scope.
External mock-interview resources
- Pramp (free) — peer-to-peer mocks, daily slots
- interviewing.io (paid) — anonymous mocks with ex-FAANG interviewers
- Exponent (paid) — coaching + mocks for tech roles
- Hello Interview (free) — practice problems with detailed walkthroughs
- AlgoExpert (paid) — video walkthroughs of LeetCode-style problems
- LeetCode mock interview mode (free)
- Self-record with Zoom (free) — watch back at 1.5×; brutal but fastest feedback loop