The ML-algorithm coding round
Top ML teams increasingly replace pure LeetCode with an implement-an-ML-primitive round: a two-tower forward pass, in-batch negatives, a sparse dot product, KNN, softmax + cross-entropy, an attention block, one SGD step. This page gives each one a full NumPy and PyTorch reference, the math, the complexity, the follow-ups interviewers push on, and the gotchas that separate "works" from "hire." Every problem ends with LeetCode look-alikes to drill the underlying data-structure skill.
Interviewers who say this are testing whether you can turn ML math into correct, efficient code under time pressure — the day-job skill an MLE uses constantly. It splits into three buckets; prep all three because the hint is deliberately fuzzy:
| Bucket | What they hand you | The problems below |
|---|---|---|
| Retrieval / embeddings (the "two-tower" core) | "Implement the two-tower forward + in-batch negative loss" or "score a query against N items" | P1 two-tower · P2 in-batch sampled softmax+logQ · P3 sparse dot product · P5 cosine top-k |
| Classic-ML from scratch | "Implement KNN / logistic-regression training / k-means" | P4 KNN · P8 logistic-regression step · P11 k-means |
| NN primitives | "Implement softmax+cross-entropy / an embedding bag / scaled-dot-product attention / one SGD step" | P6 softmax+CE · P7 embedding bag · P9 attention · P10 mini-batch SGD |
The bar is not "do you know the model" — it's "is your code correct, clear, and numerically safe." They will ask for the brute-force first, then "make it faster / handle a batch / avoid the overflow." Narrate shapes out loud; shape bugs are the #1 failure. A clarifying question that always scores: "NumPy from scratch, or may I use PyTorch autograd?"
Every solution here is written to be reproduced from memory at a whiteboard, not to be short. That means: named helper functions (euclidean, softmax_rows, l2_normalize), plain loops over clever broadcasting, and one example before the batch. A simple for-loop that you can write correctly under pressure beats a one-line X[:,None,:]-C[None,:,:] tensor trick you'll fumble. Where a fast vectorized version matters, it's in a clearly-labelled "⚡ Speed-up (optional)" box — know it exists, mention it to the interviewer, but lead with the version you can actually reproduce. Clarity is the grade; speed is the bonus.
If the hint named two-tower specifically, these four are the most likely. P1 and P2 are the headline.
Problem. Implement a two-tower (dual-encoder) model. A user/query tower and an item tower each map their features to a $d$-dim embedding through an MLP; the relevance score is the dot product (or cosine) of the two embeddings. Given a batch of $B$ users and $B$ items, return the $B\times B$ score matrix.
The math
$u = f_\theta(x_{\text{user}})\in\mathbb{R}^{d}$, $\;v = g_\phi(x_{\text{item}})\in\mathbb{R}^{d}$. Score $s(u,v)=u^\top v$ (dot) or $\frac{u^\top v}{\lVert u\rVert\,\lVert v\rVert}$ (cosine). For a batch, the score matrix is $S = U V^\top$ where $U\in\mathbb{R}^{B\times d}$, $V\in\mathbb{R}^{B\times d}$, so $S\in\mathbb{R}^{B\times B}$ and $S_{ij}$ = score of user $i$ vs item $j$. The diagonal $S_{ii}$ is the positive pair.
PyTorch reference
import torch, torch.nn as nn, torch.nn.functional as F
class Tower(nn.Module):
def __init__(self, in_dim, hidden, out_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_dim, hidden), nn.ReLU(),
nn.Linear(hidden, out_dim),
)
def forward(self, x):
return self.net(x) # (B, out_dim)
class TwoTower(nn.Module):
def __init__(self, user_dim, item_dim, hidden=128, emb=64, normalize=True):
super().__init__()
self.user_tower = Tower(user_dim, hidden, emb)
self.item_tower = Tower(item_dim, hidden, emb)
self.normalize = normalize # cosine vs raw dot
def forward(self, user_x, item_x):
u = self.user_tower(user_x) # (B, emb)
v = self.item_tower(item_x) # (B, emb)
if self.normalize: # cosine = L2-normalize then dot
u = F.normalize(u, dim=-1)
v = F.normalize(v, dim=-1)
scores = u @ v.T # (B, B) S[i,j]=user_i . item_j
return scores, u, v
NumPy reference (if they say "no framework") — explicit, reproducible
import numpy as np
# ---- small, named helpers you can write without thinking ----
def relu(x):
return np.maximum(0, x)
def l2_normalize(M): # normalize each ROW to length 1
norms = np.linalg.norm(M, axis=1, keepdims=True) # (B, 1)
return M / (norms + 1e-8) # +eps so a zero row doesn't divide by 0
def tower(x, W1, b1, W2, b2): # one MLP: (B, in) -> (B, emb)
hidden = relu(x @ W1 + b1) # (B, hidden)
return hidden @ W2 + b2 # (B, emb)
# ---- the two-tower forward + all-pairs scores ----
def two_tower_scores(user_x, item_x, user_params, item_params, cosine=True):
u = tower(user_x, *user_params) # (B, emb) user embeddings
v = tower(item_x, *item_params) # (B, emb) item embeddings
if cosine: # cosine = normalize, then dot
u = l2_normalize(u)
v = l2_normalize(v)
scores = u @ v.T # (B, B): scores[i, j] = user_i . item_j
return scores # diagonal scores[i, i] = the positive pairs
Read it top-down: each helper does one obvious thing, and scores = u @ v.T is the only "matrix" line — say "B-by-emb times emb-by-B gives B-by-B" as you write it.
O(B·(in·hidden + hidden·emb)); all-pairs scoring O(B²·emb). At serving time you don't form B²: you precompute item embeddings into an ANN index (FAISS/HNSW) and do an approximate top-k lookup per user in ~O(log N).u @ v.T, not u @ v — say "B-by-emb times emb-by-B gives B-by-B" out loud. (2) Cosine: normalize before the dot, and add an epsilon to the norm or a zero vector divides by zero. (3) Two separate towers — a classic mistake is sharing weights; user and item features live in different spaces, so the towers are independent (weights can be shared only if the inputs are homogeneous, e.g. item-item). (4) The towers must be independent at inference (no cross-features between user and item) — that independence is exactly what lets you precompute item vectors and use ANN. If the interviewer adds cross-features, you've left retrieval and entered ranking.Follow-ups they will ask
- "How do you train it?" → in-batch negatives + sampled softmax — that's P2 below.
- "How do you serve it to 100M items?" → precompute item embeddings → ANN index (FAISS IVF/HNSW); the query tower runs online, the item tower offline/batch.
- "Dot vs cosine?" → cosine removes magnitude (popularity) so it's safer for retrieval; raw dot lets the model encode popularity in the norm. Many systems use dot with a temperature.
- "Cold-start item?" → the item tower runs on features, so a brand-new item still gets an embedding (vs. pure ID-based MF which can't).
Problem. Train the two-tower with in-batch negatives: in a batch of $B$ (user, positive-item) pairs, treat the other $B-1$ items in the batch as negatives for each user. Implement the loss. Then add the logQ correction that debiases popular items appearing too often as negatives.
The idea
From P1 you have $S\in\mathbb{R}^{B\times B}$. For user $i$, the positive is item $i$ (the diagonal); items $j\ne i$ are "free" negatives. So row $i$ is a $B$-way softmax classification with target $i$ — i.e. cross-entropy with labels $[0,1,\dots,B-1]$. logQ correction: popular items show up as in-batch negatives more often than their true rate, so the model over-penalizes them. Subtract $\log Q(j)$ (the log sampling probability of item $j$) from each logit to correct: $s'_{ij} = s_{ij} - \log Q(j)$.
PyTorch reference
import torch, torch.nn.functional as F
def in_batch_loss(u, v, log_q=None, temperature=0.05):
# u, v: (B, emb) positives are aligned: u[i] <-> v[i]
scores = (u @ v.T) / temperature # (B, B) logits
if log_q is not None: # logQ correction (sampled softmax)
scores = scores - log_q.unsqueeze(0) # subtract per-item log sampling prob
labels = torch.arange(u.size(0), device=u.device) # diagonal = positives
return F.cross_entropy(scores, labels) # B-way softmax, target = i
NumPy reference (spells out softmax + CE; reuses softmax_rows from P6)
import numpy as np
def in_batch_loss_np(u, v, log_q=None, temperature=0.05):
B = u.shape[0]
scores = (u @ v.T) / temperature # (B, B) logits, scores[i,j]=user_i.item_j
if log_q is not None: # logQ correction: lower popular items' logits
scores = scores - log_q # log_q is shape (B,), subtracts per column
probs = softmax_rows(scores) # softmax over each row (the B-way classes)
# the positive for user i is item i -> the diagonal is the correct class
correct_probs = probs[np.arange(B), np.arange(B)]
return -np.mean(np.log(correct_probs + 1e-12))
O(B²·emb) for the scores, O(B²) for the softmax — cheap because the negatives are free (already in the batch). Bigger batch = more negatives = better retrieval, which is why two-tower training pushes huge batches.labels = arange(B), because positive of user $i$ is item $i$. Getting this wrong silently trains garbage. (2) Temperature (~0.05) sharpens the softmax; without it the loss barely moves. (3) logQ sign: you subtract $\log Q$ — popular items have higher $Q$, so subtracting a larger number lowers their logit, undoing the over-sampling. (4) False negatives: if the same item appears as someone else's positive in the batch, it's wrongly a negative — at large batch this is rare and usually ignored, but mention it. (5) Numerical stability: subtract the row-max before exp (P6 covers why).Follow-ups
- "Why in-batch negatives at all?" → sampling explicit negatives is expensive; the batch gives you $B-1$ for free, and they're "hard" because they're real items.
- "What does logQ actually estimate?" → $Q(j)$ = probability item $j$ is sampled as a negative ≈ its frequency in the batch ≈ its popularity. You can estimate it online with a count-min sketch / streaming frequency.
- "Add hard negatives?" → mix in mined hard negatives (high score, not clicked) alongside the in-batch ones for a harder, better-separated space.
Problem. User and item feature vectors are high-dimensional and sparse (millions of dims, a few hundred non-zero). Represent each as a list of (index, value) pairs sorted by index, and compute the dot product efficiently. (This is literally LeetCode 1570, and it's a favorite at RecSys shops because sparse features are everywhere.)
Two approaches
# Approach A — hashmap: O(n + m), works even if unsorted
def dot_hash(a, b): # a, b: list of (idx, val)
da = {i: v for i, v in a}
return sum(v * da.get(i, 0) for i, v in b)
# Approach B — two-pointer merge: O(n + m), O(1) extra, needs sorted indices
def dot_two_pointer(a, b):
i = 0 # pointer into a
j = 0 # pointer into b
result = 0
while i < len(a) and j < len(b):
idx_a, val_a = a[i]
idx_b, val_b = b[j]
if idx_a == idx_b: # same feature -> multiply, advance both
result += val_a * val_b
i += 1
j += 1
elif idx_a < idx_b: # a is behind -> advance a
i += 1
else: # b is behind -> advance b
j += 1
return result
O(n+m) time. Two-pointer is O(1) extra space (vs the hashmap's O(n)) and is the answer they want when one vector is huge and the other tiny — or when both are sorted. If one list is much shorter, binary-search its few indices into the long one: O(short·log long).Problem. Given a training set $X\in\mathbb{R}^{n\times d}$ with labels $y$ and one query point, predict its label by majority vote of the $k$ nearest neighbors (Euclidean).
The clean version — write this one (loop you can reproduce)
import numpy as np
from collections import Counter
def euclidean(a, b):
# straight-line distance between two points
return np.sqrt(np.sum((a - b) ** 2))
def knn_predict_one(X_train, y_train, query, k=5):
# 1. distance from the query to EVERY training point
distances = []
for i in range(len(X_train)):
d = euclidean(query, X_train[i])
distances.append((d, y_train[i])) # (distance, label)
# 2. sort by distance, take the k closest
distances.sort(key=lambda pair: pair[0])
k_labels = [label for (_, label) in distances[:k]]
# 3. majority vote
return Counter(k_labels).most_common(1)[0][0]
Three steps, each a plain line you can say out loud: "distance to all, sort, take k, vote." This is the version to reproduce under pressure — it's obviously correct.
⚡ Speed-up (optional) — vectorize for a whole batch of queries
Only reach for this if the interviewer says "make it faster" or "score many queries at once." Mention the identity, then write it; don't lead with it.
import numpy as np
from collections import Counter
def knn_predict_batch(X_train, y_train, X_query, k=5):
# distance to all train points, all queries at once.
# trick: ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b (one matmul, no Python loop)
q_sq = (X_query ** 2).sum(axis=1, keepdims=True) # (Q, 1)
t_sq = (X_train ** 2).sum(axis=1) # (n,)
d2 = q_sq + t_sq - 2 * (X_query @ X_train.T) # (Q, n) squared dists
nearest = np.argpartition(d2, k, axis=1)[:, :k] # k smallest per row, O(n)
return np.array([Counter(y_train[row]).most_common(1)[0][0]
for row in nearest])
argpartition (O(n)) beats a full argsort (O(n log n)) — you only need the k smallest, not them ordered.
O(n·d) distances + O(n log n) sort per query. Batch speed-up: same distances via one matmul + O(n) selection. At scale swap brute force for a KD-tree/ball-tree (low d) or ANN (high d).Problem. Given a query embedding $q\in\mathbb{R}^{d}$ and an item matrix $V\in\mathbb{R}^{N\times d}$, return the indices of the top-$k$ items by cosine similarity. This is the serving half of two-tower retrieval.
import numpy as np
def l2_normalize_rows(M):
norms = np.linalg.norm(M, axis=1, keepdims=True) # length of each row
return M / (norms + 1e-8)
def topk_cosine(query, items, k=10):
# 1. normalize so that a dot product = cosine similarity
q = query / (np.linalg.norm(query) + 1e-8) # (d,)
V = l2_normalize_rows(items) # (N, d)
# 2. cosine score of the query against every item (one matrix-vector mult)
scores = V @ q # (N,)
# 3. take the top-k, then sort just those k by score (descending)
top_k = np.argsort(scores)[::-1][:k] # simple & clear
return top_k
"Normalize, dot against all, take the top k." The full argsort is O(N log N) — perfectly fine here; if the interviewer wants O(N), say "swap np.argsort for np.argpartition(-scores, k)."
O(N·d) for the scores + O(N) for the partial selection + O(k log k) to order the k. Brute force is fine for thousands of items; beyond that, an ANN index. Pre-normalize $V$ once and cosine becomes a plain dot product (the standard serving trick).argpartition(-sims, k) for the largest (negate). (3) If $V$ is pre-normalized at index-build time, you skip per-query normalization of items — only the query needs it. (4) Ties / duplicates near the cutoff: decide stable ordering."Implement X without a framework." The interviewer wants correct math and numerical safety.
Problem. Implement softmax and cross-entropy from scratch, numerically stable, for a batch of logits. Bonus they always ask: return the gradient of the loss w.r.t. the logits.
The math & the trick
$\text{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$. Naively, $e^{z}$ overflows for large $z$. Softmax is shift-invariant: subtracting any constant $c$ from every logit leaves it unchanged. Pick $c=\max_j z_j$ so the largest exponent is $e^0=1$ — no overflow. Cross-entropy $L=-\sum_i y_i\log p_i$; for a one-hot target, $L=-\log p_{\text{true}}$. The gradient is famously clean: $\partial L/\partial z = p - y$.
import numpy as np
def softmax_rows(logits): # logits: (B, C)
# subtract each row's max so the biggest exponent is e^0 = 1 (no overflow)
shifted = logits - logits.max(axis=1, keepdims=True)
exp = np.exp(shifted)
return exp / exp.sum(axis=1, keepdims=True) # each row sums to 1
def cross_entropy(logits, labels): # labels: (B,) ints
probs = softmax_rows(logits) # (B, C)
B = logits.shape[0]
# loss = average of -log(probability of the correct class)
correct_probs = probs[np.arange(B), labels] # pick prob of the true class per row
loss = -np.mean(np.log(correct_probs + 1e-12))
# gradient dL/dlogits = (probs - onehot(labels)) / B -- the clean (p - y)
grad = probs.copy()
grad[np.arange(B), labels] -= 1 # subtract 1 at the true class
grad /= B
return loss, grad
exp — state that it's valid because softmax is shift-invariant. (2) Don't compute log(softmax) separately — that re-introduces instability; use log-sum-exp for log-prob, or fuse into one CE op. (3) The +1e-12 guards log(0). (4) Average over the batch for the gradient (the /B). (5) For binary, sigmoid + BCE has the same $(p-y)$ gradient.Problem. A user has a variable-length list of feature IDs (e.g. subreddits they follow). Look up each ID's embedding from a table and pool them (mean/sum) into a single fixed vector — the core of how sparse categorical features enter a ranking model.
import numpy as np
def embedding_bag(table, id_lists, mode="mean"):
# table: (V, d) learned embeddings; id_lists: list of lists of ints
out = np.zeros((len(id_lists), table.shape[1]))
for i, ids in enumerate(id_lists):
if not ids: continue # empty bag -> zeros
vecs = table[ids] # (len, d) fancy indexing
out[i] = vecs.mean(0) if mode == "mean" else vecs.sum(0)
return out
PyTorch gives this directly: nn.EmbeddingBag(V, d, mode='mean'), which fuses the lookup and pool (faster, less memory than gather-then-mean).
Problem. Implement logistic regression training by gradient descent: the sigmoid, the BCE loss, the gradient, and the weight update. This is the most common "train a model from scratch" ask.
The math
$p = \sigma(Xw+b)$, $\sigma(z)=\frac{1}{1+e^{-z}}$. BCE $L=-\frac1n\sum[y\log p+(1-y)\log(1-p)]$. Gradient: $\nabla_w L = \frac1n X^\top(p-y)$, $\nabla_b L = \text{mean}(p-y)$. Update $w \leftarrow w-\eta\nabla_w L$.
import numpy as np
def sigmoid(z): return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500))) # clip: no overflow
def lr_step(X, y, w, b, lr=0.1, l2=0.0):
n = X.shape[0]
p = sigmoid(X @ w + b) # (n,) predictions
err = p - y # (n,) the magic (p - y)
grad_w = X.T @ err / n + l2 * w # (d,) + optional L2
grad_b = err.mean()
w -= lr * grad_w # update weights
b -= lr * grad_b # update bias
loss = -np.mean(y*np.log(p+1e-12) + (1-y)*np.log(1-p+1e-12)) + 0.5*l2*(w@w)
return w, b, loss
exp(-z) doesn't overflow. (2) The gradient is X.T @ (p - y) / n — the /n matters (it's a mean) and the shape is (d,). (3) BCE not MSE — MSE on a sigmoid is non-convex with vanishing gradients. (4) L2 adds l2*w to the gradient. (5) Same (p-y) gradient as softmax — that's the GLM pattern; say it.Problem. Implement single-head scaled dot-product attention: given $Q\in\mathbb{R}^{n\times d}$, $K\in\mathbb{R}^{m\times d}$, $V\in\mathbb{R}^{m\times d_v}$, return $\text{softmax}(QK^\top/\sqrt{d})V$. Bonus: causal mask.
import numpy as np
def softmax_rows(M): # reuse from P6
shifted = M - M.max(axis=-1, keepdims=True)
exp = np.exp(shifted)
return exp / exp.sum(axis=-1, keepdims=True)
def attention(Q, K, V, mask=None):
d = Q.shape[-1]
# 1. similarity of every query to every key, scaled by sqrt(d)
scores = (Q @ K.T) / np.sqrt(d) # (n, m)
# 2. (optional) block forbidden positions BEFORE softmax
if mask is not None:
scores = np.where(mask, scores, -1e9) # -1e9 ~ -inf -> ~0 weight
# 3. turn scores into weights, then take the weighted sum of values
weights = softmax_rows(scores) # (n, m), each row sums to 1
return weights @ V # (n, d_v)
O(n·m·d) — the $n^2$ that drives long-context work.Problem. Write the canonical training loop: shuffle, iterate mini-batches, forward, loss, backward, step, epoch. They want to see you know the skeleton cold and where each piece goes.
zero_grad placement, and batch iteration — the thing you'd actually type at the job.import torch
def train(model, loader, epochs, lr=1e-3):
opt = torch.optim.AdamW(model.parameters(), lr=lr)
loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(epochs):
for xb, yb in loader: # loader already shuffles + batches
opt.zero_grad() # 1. clear old grads (else they accumulate!)
out = model(xb) # 2. forward
loss = loss_fn(out, yb) # 3. loss
loss.backward() # 4. backprop -> .grad
opt.step() # 5. update params
NumPy version of one step (no autograd): p = forward(X,w); g = backward(p,y,X); w -= lr*g. Gradient accumulation = don't zero_grad every step; sum N micro-batch grads, step once → big-batch math on small memory.
zero_grad before backward — PyTorch accumulates gradients, so forgetting it sums across steps (that's actually how you implement gradient accumulation deliberately). (2) Shuffle each epoch (the DataLoader does it). (3) Use model.eval() + torch.no_grad() at validation. (4) The order is fixed: zero → forward → loss → backward → step.Problem. Implement Lloyd's algorithm: assign each point to its nearest centroid, recompute centroids as the mean of their points, repeat to convergence.
The clean version — two obvious steps in a loop
import numpy as np
def euclidean(a, b):
return np.sqrt(np.sum((a - b) ** 2))
def nearest_centroid(point, centroids):
# index of the closest centroid to one point
best_j, best_d = 0, float("inf")
for j in range(len(centroids)):
d = euclidean(point, centroids[j])
if d < best_d:
best_d, best_j = d, j
return best_j
def kmeans(X, k, iters=100):
# init: pick k random points as the starting centroids
centroids = X[np.random.choice(len(X), k, replace=False)].copy()
for _ in range(iters):
# --- ASSIGN: each point to its nearest centroid ---
labels = np.array([nearest_centroid(point, centroids) for point in X])
# --- UPDATE: each centroid = mean of the points assigned to it ---
new_centroids = []
for j in range(k):
members = X[labels == j]
if len(members) > 0:
new_centroids.append(members.mean(axis=0))
else:
new_centroids.append(centroids[j]) # empty cluster: keep it
new_centroids = np.array(new_centroids)
# --- STOP when centroids stop moving ---
if np.allclose(new_centroids, centroids):
break
centroids = new_centroids
return centroids, labels
Two named steps you can narrate: "assign each point to its nearest centroid, then update each centroid to the mean of its points — repeat until they stop moving." No broadcasting gymnastics.
nan. (2) Init matters — random can land in a bad local optimum; k-means++ spreads initial centroids. Run several restarts, keep the lowest inertia. (3) Only finds a local optimum (monotonic decrease, not global). (4) Standardize features (distance-based). (5) k-means = hard-assignment EM on equal spherical Gaussians.A two-tower question decomposes into a few classic DS&A subroutines: build a score matrix, take a top-k, look up/merge sparse indices, sample by weight, normalize. These LeetCode problems train exactly those moves — so even if the ML wrapper is unfamiliar, the inner loop is muscle memory.
Map each piece of the two-tower pipeline to a LeetCode problem and drill it. When the interviewer says "now score the query against all items" or "return the top-k," your hands already know the pattern.
1 · The scoring core — sparse & dense dot products / matmul
The relevance score is a dot product; at scale the features are sparse. These are the literal subroutine.
2 · Top-k retrieval — the ANN lookup, by hand
Serving = "given the query embedding, return the k nearest item embeddings." That's a top-k by a similarity score — heap or quickselect.
3 · Embedding lookups & ID merging — the feature plumbing
User history → embedding-bag pooling, and matching user feature-IDs against item feature-IDs, are hashmap / two-pointer / merge problems.
4 · Negative sampling — by popularity / weight
In-batch negatives and the logQ correction rest on weighted sampling and frequency counting. These drill that.
5 · Normalization & numerical safety — softmax/cosine prep
The max-subtraction (softmax) and running-extreme patterns that keep the loss numerically sane.
If you only have time for a focused block: LC 1570 (the dot product — code it both two-pointer and hashmap) → LC 973 (top-k nearest, heap and quickselect) → LC 528 (weighted sampling = negatives) → then implement P1+P2 on this page (two-tower forward + in-batch loss) from a blank file. That sequence walks the entire pipeline — score → retrieve → sample → train — and is the highest-ROI hour you can spend on this hint.
- Clarify the contract first: "NumPy from scratch or PyTorch?", "should I return gradients too?", "is the input dense or sparse?" — each is a real fork.
- State shapes constantly. "X is (n, d), w is (d,), so X@w is (n,)." Shape errors are the #1 failure and narrating them prevents most.
- Brute force, then optimize. They almost always follow with "make it faster / handle a batch / avoid overflow." Have the vectorized + numerically-stable version ready.
- Name the property. "softmax is shift-invariant," "the (p−y) gradient is the GLM pattern," "cosine drops magnitude" — these one-liners signal you understand, not memorized.
- Connect to the system. End two-tower answers with serving (precompute items → ANN). That's the senior signal: you know how the code lives in production.