CODING INTERVIEW · ML-ALGORITHM CODING · PYTHON / NUMPY / PYTORCH

The ML-algorithm coding round

Top ML teams increasingly replace pure LeetCode with an implement-an-ML-primitive round: a two-tower forward pass, in-batch negatives, a sparse dot product, KNN, softmax + cross-entropy, an attention block, one SGD step. This page gives each one a full NumPy and PyTorch reference, the math, the complexity, the follow-ups interviewers push on, and the gotchas that separate "works" from "hire." Every problem ends with LeetCode look-alikes to drill the underlying data-structure skill.

Headline Two-tower Lang NumPy + PyTorch Bar correct math, clean code, vectorized 11 worked problems + LC drills
What "ML-algorithm coding, like two-tower" actually means — decode the hint

Interviewers who say this are testing whether you can turn ML math into correct, efficient code under time pressure — the day-job skill an MLE uses constantly. It splits into three buckets; prep all three because the hint is deliberately fuzzy:

BucketWhat they hand youThe problems below
Retrieval / embeddings (the "two-tower" core)"Implement the two-tower forward + in-batch negative loss" or "score a query against N items"P1 two-tower · P2 in-batch sampled softmax+logQ · P3 sparse dot product · P5 cosine top-k
Classic-ML from scratch"Implement KNN / logistic-regression training / k-means"P4 KNN · P8 logistic-regression step · P11 k-means
NN primitives"Implement softmax+cross-entropy / an embedding bag / scaled-dot-product attention / one SGD step"P6 softmax+CE · P7 embedding bag · P9 attention · P10 mini-batch SGD

The bar is not "do you know the model" — it's "is your code correct, clear, and numerically safe." They will ask for the brute-force first, then "make it faster / handle a batch / avoid the overflow." Narrate shapes out loud; shape bugs are the #1 failure. A clarifying question that always scores: "NumPy from scratch, or may I use PyTorch autograd?"

How to read the code on this page — write it the boring way

Every solution here is written to be reproduced from memory at a whiteboard, not to be short. That means: named helper functions (euclidean, softmax_rows, l2_normalize), plain loops over clever broadcasting, and one example before the batch. A simple for-loop that you can write correctly under pressure beats a one-line X[:,None,:]-C[None,:,:] tensor trick you'll fumble. Where a fast vectorized version matters, it's in a clearly-labelled "⚡ Speed-up (optional)" box — know it exists, mention it to the interviewer, but lead with the version you can actually reproduce. Clarity is the grade; speed is the bonus.

Part 1 · Retrieval & embeddings (the two-tower core)

If the hint named two-tower specifically, these four are the most likely. P1 and P2 are the headline.

01
Retrieval · HEADLINE

Two-tower model: forward pass + scoring

Problem. Implement a two-tower (dual-encoder) model. A user/query tower and an item tower each map their features to a $d$-dim embedding through an MLP; the relevance score is the dot product (or cosine) of the two embeddings. Given a batch of $B$ users and $B$ items, return the $B\times B$ score matrix.

What it tests: do you understand that retrieval = two independent encoders into a shared space, scored by inner product? Can you keep the tensor shapes straight and vectorize the all-pairs scoring?

The math

$u = f_\theta(x_{\text{user}})\in\mathbb{R}^{d}$, $\;v = g_\phi(x_{\text{item}})\in\mathbb{R}^{d}$. Score $s(u,v)=u^\top v$ (dot) or $\frac{u^\top v}{\lVert u\rVert\,\lVert v\rVert}$ (cosine). For a batch, the score matrix is $S = U V^\top$ where $U\in\mathbb{R}^{B\times d}$, $V\in\mathbb{R}^{B\times d}$, so $S\in\mathbb{R}^{B\times B}$ and $S_{ij}$ = score of user $i$ vs item $j$. The diagonal $S_{ii}$ is the positive pair.

PyTorch reference

import torch, torch.nn as nn, torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)                      # (B, out_dim)

class TwoTower(nn.Module):
    def __init__(self, user_dim, item_dim, hidden=128, emb=64, normalize=True):
        super().__init__()
        self.user_tower = Tower(user_dim, hidden, emb)
        self.item_tower = Tower(item_dim, hidden, emb)
        self.normalize = normalize              # cosine vs raw dot

    def forward(self, user_x, item_x):
        u = self.user_tower(user_x)             # (B, emb)
        v = self.item_tower(item_x)             # (B, emb)
        if self.normalize:                      # cosine = L2-normalize then dot
            u = F.normalize(u, dim=-1)
            v = F.normalize(v, dim=-1)
        scores = u @ v.T                        # (B, B)  S[i,j]=user_i . item_j
        return scores, u, v

NumPy reference (if they say "no framework") — explicit, reproducible

import numpy as np

# ---- small, named helpers you can write without thinking ----
def relu(x):
    return np.maximum(0, x)

def l2_normalize(M):                       # normalize each ROW to length 1
    norms = np.linalg.norm(M, axis=1, keepdims=True)   # (B, 1)
    return M / (norms + 1e-8)              # +eps so a zero row doesn't divide by 0

def tower(x, W1, b1, W2, b2):             # one MLP: (B, in) -> (B, emb)
    hidden = relu(x @ W1 + b1)            # (B, hidden)
    return hidden @ W2 + b2               # (B, emb)

# ---- the two-tower forward + all-pairs scores ----
def two_tower_scores(user_x, item_x, user_params, item_params, cosine=True):
    u = tower(user_x, *user_params)       # (B, emb)  user embeddings
    v = tower(item_x, *item_params)       # (B, emb)  item embeddings
    if cosine:                            # cosine = normalize, then dot
        u = l2_normalize(u)
        v = l2_normalize(v)
    scores = u @ v.T                      # (B, B): scores[i, j] = user_i . item_j
    return scores                         # diagonal scores[i, i] = the positive pairs

Read it top-down: each helper does one obvious thing, and scores = u @ v.T is the only "matrix" line — say "B-by-emb times emb-by-B gives B-by-B" as you write it.

Complexity. Encoders O(B·(in·hidden + hidden·emb)); all-pairs scoring O(B²·emb). At serving time you don't form B²: you precompute item embeddings into an ANN index (FAISS/HNSW) and do an approximate top-k lookup per user in ~O(log N).
Gotchas. (1) Shapes: the score matrix is u @ v.T, not u @ v — say "B-by-emb times emb-by-B gives B-by-B" out loud. (2) Cosine: normalize before the dot, and add an epsilon to the norm or a zero vector divides by zero. (3) Two separate towers — a classic mistake is sharing weights; user and item features live in different spaces, so the towers are independent (weights can be shared only if the inputs are homogeneous, e.g. item-item). (4) The towers must be independent at inference (no cross-features between user and item) — that independence is exactly what lets you precompute item vectors and use ANN. If the interviewer adds cross-features, you've left retrieval and entered ranking.

Follow-ups they will ask

  • "How do you train it?" → in-batch negatives + sampled softmax — that's P2 below.
  • "How do you serve it to 100M items?" → precompute item embeddings → ANN index (FAISS IVF/HNSW); the query tower runs online, the item tower offline/batch.
  • "Dot vs cosine?" → cosine removes magnitude (popularity) so it's safer for retrieval; raw dot lets the model encode popularity in the norm. Many systems use dot with a temperature.
  • "Cold-start item?" → the item tower runs on features, so a brand-new item still gets an embedding (vs. pure ID-based MF which can't).
02
Retrieval · HEADLINE

In-batch negatives: sampled softmax + logQ correction

Problem. Train the two-tower with in-batch negatives: in a batch of $B$ (user, positive-item) pairs, treat the other $B-1$ items in the batch as negatives for each user. Implement the loss. Then add the logQ correction that debiases popular items appearing too often as negatives.

What it tests: the single most-asked two-tower detail. Do you know that the $B\times B$ score matrix turns into a $B$-way classification where the correct class is the diagonal, and do you know why/what logQ corrects?

The idea

From P1 you have $S\in\mathbb{R}^{B\times B}$. For user $i$, the positive is item $i$ (the diagonal); items $j\ne i$ are "free" negatives. So row $i$ is a $B$-way softmax classification with target $i$ — i.e. cross-entropy with labels $[0,1,\dots,B-1]$. logQ correction: popular items show up as in-batch negatives more often than their true rate, so the model over-penalizes them. Subtract $\log Q(j)$ (the log sampling probability of item $j$) from each logit to correct: $s'_{ij} = s_{ij} - \log Q(j)$.

PyTorch reference

import torch, torch.nn.functional as F

def in_batch_loss(u, v, log_q=None, temperature=0.05):
    # u, v: (B, emb)  positives are aligned: u[i] <-> v[i]
    scores = (u @ v.T) / temperature        # (B, B) logits
    if log_q is not None:                   # logQ correction (sampled softmax)
        scores = scores - log_q.unsqueeze(0)   # subtract per-item log sampling prob
    labels = torch.arange(u.size(0), device=u.device)   # diagonal = positives
    return F.cross_entropy(scores, labels)  # B-way softmax, target = i

NumPy reference (spells out softmax + CE; reuses softmax_rows from P6)

import numpy as np

def in_batch_loss_np(u, v, log_q=None, temperature=0.05):
    B = u.shape[0]
    scores = (u @ v.T) / temperature        # (B, B) logits, scores[i,j]=user_i.item_j

    if log_q is not None:                   # logQ correction: lower popular items' logits
        scores = scores - log_q             # log_q is shape (B,), subtracts per column

    probs = softmax_rows(scores)            # softmax over each row (the B-way classes)

    # the positive for user i is item i -> the diagonal is the correct class
    correct_probs = probs[np.arange(B), np.arange(B)]
    return -np.mean(np.log(correct_probs + 1e-12))
Complexity. O(B²·emb) for the scores, O(B²) for the softmax — cheap because the negatives are free (already in the batch). Bigger batch = more negatives = better retrieval, which is why two-tower training pushes huge batches.
Gotchas. (1) The label is the row indexlabels = arange(B), because positive of user $i$ is item $i$. Getting this wrong silently trains garbage. (2) Temperature (~0.05) sharpens the softmax; without it the loss barely moves. (3) logQ sign: you subtract $\log Q$ — popular items have higher $Q$, so subtracting a larger number lowers their logit, undoing the over-sampling. (4) False negatives: if the same item appears as someone else's positive in the batch, it's wrongly a negative — at large batch this is rare and usually ignored, but mention it. (5) Numerical stability: subtract the row-max before exp (P6 covers why).

Follow-ups

  • "Why in-batch negatives at all?" → sampling explicit negatives is expensive; the batch gives you $B-1$ for free, and they're "hard" because they're real items.
  • "What does logQ actually estimate?" → $Q(j)$ = probability item $j$ is sampled as a negative ≈ its frequency in the batch ≈ its popularity. You can estimate it online with a count-min sketch / streaming frequency.
  • "Add hard negatives?" → mix in mined hard negatives (high score, not clicked) alongside the in-batch ones for a harder, better-separated space.
03
Retrieval · LIKELY

Sparse dot product (two-pointer)

Problem. User and item feature vectors are high-dimensional and sparse (millions of dims, a few hundred non-zero). Represent each as a list of (index, value) pairs sorted by index, and compute the dot product efficiently. (This is literally LeetCode 1570, and it's a favorite at RecSys shops because sparse features are everywhere.)

What it tests: do you exploit sparsity instead of materializing a dense million-dim vector? Two-pointer merge on sorted indices.

Two approaches

# Approach A — hashmap: O(n + m), works even if unsorted
def dot_hash(a, b):                  # a, b: list of (idx, val)
    da = {i: v for i, v in a}
    return sum(v * da.get(i, 0) for i, v in b)

# Approach B — two-pointer merge: O(n + m), O(1) extra, needs sorted indices
def dot_two_pointer(a, b):
    i = 0          # pointer into a
    j = 0          # pointer into b
    result = 0
    while i < len(a) and j < len(b):
        idx_a, val_a = a[i]
        idx_b, val_b = b[j]
        if idx_a == idx_b:                 # same feature -> multiply, advance both
            result += val_a * val_b
            i += 1
            j += 1
        elif idx_a < idx_b:                # a is behind -> advance a
            i += 1
        else:                              # b is behind -> advance b
            j += 1
    return result
Complexity. Both O(n+m) time. Two-pointer is O(1) extra space (vs the hashmap's O(n)) and is the answer they want when one vector is huge and the other tiny — or when both are sorted. If one list is much shorter, binary-search its few indices into the long one: O(short·log long).
Gotchas. (1) Only multiply when indices are equal; otherwise advance the pointer at the smaller index. (2) Don't build a dense vector — that's the whole point. (3) Follow-up "what if only one is sparse?" → binary search the sparse one's indices into the dense one. (4) This generalizes to sparse embedding lookups — the same merge underlies sparse-feature scoring in ranking models.
04
Classic ML · LIKELY

K-nearest neighbors from scratch

Problem. Given a training set $X\in\mathbb{R}^{n\times d}$ with labels $y$ and one query point, predict its label by majority vote of the $k$ nearest neighbors (Euclidean).

What it tests: can you write the obvious algorithm cleanly — distance to every point, take the k smallest, vote — and do you know KNN's properties (lazy, scale-sensitive, curse of dimensionality)?

The clean version — write this one (loop you can reproduce)

import numpy as np
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two points
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict_one(X_train, y_train, query, k=5):
    # 1. distance from the query to EVERY training point
    distances = []
    for i in range(len(X_train)):
        d = euclidean(query, X_train[i])
        distances.append((d, y_train[i]))    # (distance, label)

    # 2. sort by distance, take the k closest
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for (_, label) in distances[:k]]

    # 3. majority vote
    return Counter(k_labels).most_common(1)[0][0]

Three steps, each a plain line you can say out loud: "distance to all, sort, take k, vote." This is the version to reproduce under pressure — it's obviously correct.

⚡ Speed-up (optional) — vectorize for a whole batch of queries

Only reach for this if the interviewer says "make it faster" or "score many queries at once." Mention the identity, then write it; don't lead with it.

import numpy as np
from collections import Counter

def knn_predict_batch(X_train, y_train, X_query, k=5):
    # distance to all train points, all queries at once.
    # trick: ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b  (one matmul, no Python loop)
    q_sq = (X_query ** 2).sum(axis=1, keepdims=True)   # (Q, 1)
    t_sq = (X_train ** 2).sum(axis=1)                  # (n,)
    d2   = q_sq + t_sq - 2 * (X_query @ X_train.T)     # (Q, n) squared dists
    nearest = np.argpartition(d2, k, axis=1)[:, :k]    # k smallest per row, O(n)
    return np.array([Counter(y_train[row]).most_common(1)[0][0]
                     for row in nearest])

argpartition (O(n)) beats a full argsort (O(n log n)) — you only need the k smallest, not them ordered.

Complexity. Clean version: O(n·d) distances + O(n log n) sort per query. Batch speed-up: same distances via one matmul + O(n) selection. At scale swap brute force for a KD-tree/ball-tree (low d) or ANN (high d).
Gotchas. (1) Standardize features first — Euclidean distance is scale-sensitive, so an unscaled big-range feature dominates. (2) Even $k$ can tie — use odd $k$ or a tie-break. (3) Weighted KNN: weight each vote by $1/d$. (4) Small $k$ = high variance (jagged), large $k$ = high bias (smooth). (5) KNN does no training — all the work is at query time (it's "lazy"). (6) If they push speed, then mention the squared-norm matmul trick — but say it, don't open with it.
05
Retrieval · LIKELY

Cosine similarity top-k retrieval

Problem. Given a query embedding $q\in\mathbb{R}^{d}$ and an item matrix $V\in\mathbb{R}^{N\times d}$, return the indices of the top-$k$ items by cosine similarity. This is the serving half of two-tower retrieval.

What it tests: normalization, a single matvec instead of a loop, and partial-sort for top-k.
import numpy as np

def l2_normalize_rows(M):
    norms = np.linalg.norm(M, axis=1, keepdims=True)   # length of each row
    return M / (norms + 1e-8)

def topk_cosine(query, items, k=10):
    # 1. normalize so that a dot product = cosine similarity
    q = query / (np.linalg.norm(query) + 1e-8)         # (d,)
    V = l2_normalize_rows(items)                        # (N, d)

    # 2. cosine score of the query against every item (one matrix-vector mult)
    scores = V @ q                                     # (N,)

    # 3. take the top-k, then sort just those k by score (descending)
    top_k = np.argsort(scores)[::-1][:k]               # simple & clear
    return top_k

"Normalize, dot against all, take the top k." The full argsort is O(N log N) — perfectly fine here; if the interviewer wants O(N), say "swap np.argsort for np.argpartition(-scores, k)."

Complexity. O(N·d) for the scores + O(N) for the partial selection + O(k log k) to order the k. Brute force is fine for thousands of items; beyond that, an ANN index. Pre-normalize $V$ once and cosine becomes a plain dot product (the standard serving trick).
Gotchas. (1) Normalize both sides; epsilon-guard the norm. (2) argpartition(-sims, k) for the largest (negate). (3) If $V$ is pre-normalized at index-build time, you skip per-query normalization of items — only the query needs it. (4) Ties / duplicates near the cutoff: decide stable ordering.
Practice the underlying skill on LeetCode
Part 2 · NN primitives from scratch

"Implement X without a framework." The interviewer wants correct math and numerical safety.

06
NN primitive · HIGH FREQ

Numerically-stable softmax + cross-entropy (with gradient)

Problem. Implement softmax and cross-entropy from scratch, numerically stable, for a batch of logits. Bonus they always ask: return the gradient of the loss w.r.t. the logits.

What it tests: the max-subtraction trick (the most common numerical-stability question in ML interviews) and the beautiful $(p - y)$ gradient.

The math & the trick

$\text{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$. Naively, $e^{z}$ overflows for large $z$. Softmax is shift-invariant: subtracting any constant $c$ from every logit leaves it unchanged. Pick $c=\max_j z_j$ so the largest exponent is $e^0=1$ — no overflow. Cross-entropy $L=-\sum_i y_i\log p_i$; for a one-hot target, $L=-\log p_{\text{true}}$. The gradient is famously clean: $\partial L/\partial z = p - y$.

import numpy as np

def softmax_rows(logits):                       # logits: (B, C)
    # subtract each row's max so the biggest exponent is e^0 = 1 (no overflow)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)  # each row sums to 1

def cross_entropy(logits, labels):              # labels: (B,) ints
    probs = softmax_rows(logits)                 # (B, C)
    B = logits.shape[0]

    # loss = average of -log(probability of the correct class)
    correct_probs = probs[np.arange(B), labels]  # pick prob of the true class per row
    loss = -np.mean(np.log(correct_probs + 1e-12))

    # gradient dL/dlogits = (probs - onehot(labels)) / B   -- the clean (p - y)
    grad = probs.copy()
    grad[np.arange(B), labels] -= 1              # subtract 1 at the true class
    grad /= B
    return loss, grad
Gotchas. (1) Subtract the row max before exp — state that it's valid because softmax is shift-invariant. (2) Don't compute log(softmax) separately — that re-introduces instability; use log-sum-exp for log-prob, or fuse into one CE op. (3) The +1e-12 guards log(0). (4) Average over the batch for the gradient (the /B). (5) For binary, sigmoid + BCE has the same $(p-y)$ gradient.
07
NN primitive · RecSys

Embedding lookup & pooling (embedding bag)

Problem. A user has a variable-length list of feature IDs (e.g. subreddits they follow). Look up each ID's embedding from a table and pool them (mean/sum) into a single fixed vector — the core of how sparse categorical features enter a ranking model.

What it tests: embedding tables = a learned lookup, and how variable-length sparse features become fixed-size dense inputs.
import numpy as np

def embedding_bag(table, id_lists, mode="mean"):
    # table: (V, d) learned embeddings; id_lists: list of lists of ints
    out = np.zeros((len(id_lists), table.shape[1]))
    for i, ids in enumerate(id_lists):
        if not ids: continue                 # empty bag -> zeros
        vecs = table[ids]                    # (len, d) fancy indexing
        out[i] = vecs.mean(0) if mode == "mean" else vecs.sum(0)
    return out

PyTorch gives this directly: nn.EmbeddingBag(V, d, mode='mean'), which fuses the lookup and pool (faster, less memory than gather-then-mean).

Gotchas. (1) Handle the empty bag (no IDs) → zero vector, not a crash. (2) Mean vs sum: mean is robust to how many IDs a user has; sum lets count leak in (sometimes wanted). (3) Reserve index 0 as a padding/OOV id. (4) The table is learned — it's parameters, not a fixed feature. (5) This is exactly the "user history → one vector" step that feeds a two-tower user tower.
Practice the underlying skill on LeetCode
08
Classic ML · HIGH FREQ

Logistic regression: one training step from scratch

Problem. Implement logistic regression training by gradient descent: the sigmoid, the BCE loss, the gradient, and the weight update. This is the most common "train a model from scratch" ask.

What it tests: the full loop — forward (sigmoid), loss (BCE), backward (the $(p-y)x$ gradient), update — and that you know it's the same $(p-y)$ gradient as softmax.

The math

$p = \sigma(Xw+b)$, $\sigma(z)=\frac{1}{1+e^{-z}}$. BCE $L=-\frac1n\sum[y\log p+(1-y)\log(1-p)]$. Gradient: $\nabla_w L = \frac1n X^\top(p-y)$, $\nabla_b L = \text{mean}(p-y)$. Update $w \leftarrow w-\eta\nabla_w L$.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))  # clip: no overflow

def lr_step(X, y, w, b, lr=0.1, l2=0.0):
    n = X.shape[0]
    p = sigmoid(X @ w + b)                 # (n,) predictions
    err = p - y                            # (n,) the magic (p - y)
    grad_w = X.T @ err / n + l2 * w        # (d,)  + optional L2
    grad_b = err.mean()
    w -= lr * grad_w          # update weights
    b -= lr * grad_b          # update bias
    loss = -np.mean(y*np.log(p+1e-12) + (1-y)*np.log(1-p+1e-12)) + 0.5*l2*(w@w)
    return w, b, loss
Gotchas. (1) Clip the sigmoid input (or use a stable formulation) so exp(-z) doesn't overflow. (2) The gradient is X.T @ (p - y) / n — the /n matters (it's a mean) and the shape is (d,). (3) BCE not MSE — MSE on a sigmoid is non-convex with vanishing gradients. (4) L2 adds l2*w to the gradient. (5) Same (p-y) gradient as softmax — that's the GLM pattern; say it.
Practice the underlying skill on LeetCode
09
NN primitive · SOTA

Scaled dot-product attention

Problem. Implement single-head scaled dot-product attention: given $Q\in\mathbb{R}^{n\times d}$, $K\in\mathbb{R}^{m\times d}$, $V\in\mathbb{R}^{m\times d_v}$, return $\text{softmax}(QK^\top/\sqrt{d})V$. Bonus: causal mask.

What it tests: the attention formula, why the $\sqrt{d}$ scale exists, and masking — increasingly asked even for RecSys roles.
import numpy as np

def softmax_rows(M):                          # reuse from P6
    shifted = M - M.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V, mask=None):
    d = Q.shape[-1]

    # 1. similarity of every query to every key, scaled by sqrt(d)
    scores = (Q @ K.T) / np.sqrt(d)           # (n, m)

    # 2. (optional) block forbidden positions BEFORE softmax
    if mask is not None:
        scores = np.where(mask, scores, -1e9) # -1e9 ~ -inf -> ~0 weight

    # 3. turn scores into weights, then take the weighted sum of values
    weights = softmax_rows(scores)            # (n, m), each row sums to 1
    return weights @ V                        # (n, d_v)
Gotchas. (1) Divide by $\sqrt{d}$ — without it, large dot products push softmax to a near-one-hot with ~0 gradient. Be ready to explain: dot of two $d$-dim unit-variance vectors has variance $d$, so you rescale by $\sqrt{d}$. (2) Mask before softmax with $-\infty$ (here $-10^9$), not after — masking after softmax breaks normalization. (3) Causal mask = lower-triangular (token $i$ sees only $\le i$). (4) Multi-head = split $d$ into $h$ heads, attention each, concat. (5) Cost is O(n·m·d) — the $n^2$ that drives long-context work.
Practice the underlying skill on LeetCode
10
NN primitive · training

Mini-batch SGD training loop

Problem. Write the canonical training loop: shuffle, iterate mini-batches, forward, loss, backward, step, epoch. They want to see you know the skeleton cold and where each piece goes.

What it tests: the universal loop, zero_grad placement, and batch iteration — the thing you'd actually type at the job.
import torch

def train(model, loader, epochs, lr=1e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for xb, yb in loader:               # loader already shuffles + batches
            opt.zero_grad()                 # 1. clear old grads (else they accumulate!)
            out = model(xb)                 # 2. forward
            loss = loss_fn(out, yb)         # 3. loss
            loss.backward()                 # 4. backprop -> .grad
            opt.step()                      # 5. update params

NumPy version of one step (no autograd): p = forward(X,w); g = backward(p,y,X); w -= lr*g. Gradient accumulation = don't zero_grad every step; sum N micro-batch grads, step once → big-batch math on small memory.

Gotchas. (1) zero_grad before backward — PyTorch accumulates gradients, so forgetting it sums across steps (that's actually how you implement gradient accumulation deliberately). (2) Shuffle each epoch (the DataLoader does it). (3) Use model.eval() + torch.no_grad() at validation. (4) The order is fixed: zero → forward → loss → backward → step.
Practice the underlying skill on LeetCode
11
Classic ML · sometimes

K-means clustering from scratch

Problem. Implement Lloyd's algorithm: assign each point to its nearest centroid, recompute centroids as the mean of their points, repeat to convergence.

What it tests: the assign → update alternation (= hard-EM), and the gotchas (init, empty clusters, local optima). Lead with the readable two-step loop.

The clean version — two obvious steps in a loop

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def nearest_centroid(point, centroids):
    # index of the closest centroid to one point
    best_j, best_d = 0, float("inf")
    for j in range(len(centroids)):
        d = euclidean(point, centroids[j])
        if d < best_d:
            best_d, best_j = d, j
    return best_j

def kmeans(X, k, iters=100):
    # init: pick k random points as the starting centroids
    centroids = X[np.random.choice(len(X), k, replace=False)].copy()

    for _ in range(iters):
        # --- ASSIGN: each point to its nearest centroid ---
        labels = np.array([nearest_centroid(point, centroids) for point in X])

        # --- UPDATE: each centroid = mean of the points assigned to it ---
        new_centroids = []
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids.append(members.mean(axis=0))
            else:
                new_centroids.append(centroids[j])   # empty cluster: keep it
        new_centroids = np.array(new_centroids)

        # --- STOP when centroids stop moving ---
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

Two named steps you can narrate: "assign each point to its nearest centroid, then update each centroid to the mean of its points — repeat until they stop moving." No broadcasting gymnastics.

Gotchas. (1) Empty clusters — a centroid with no points; reinit it (e.g. to the farthest point) instead of nan. (2) Init matters — random can land in a bad local optimum; k-means++ spreads initial centroids. Run several restarts, keep the lowest inertia. (3) Only finds a local optimum (monotonic decrease, not global). (4) Standardize features (distance-based). (5) k-means = hard-assignment EM on equal spherical Gaussians.
Practice the underlying skill on LeetCode
Part 3 · LeetCode-style drills for the two-tower toolkit

A two-tower question decomposes into a few classic DS&A subroutines: build a score matrix, take a top-k, look up/merge sparse indices, sample by weight, normalize. These LeetCode problems train exactly those moves — so even if the ML wrapper is unfamiliar, the inner loop is muscle memory.

12
Two-tower · LEETCODE DRILLS

The exact LeetCode problems behind two-tower

Map each piece of the two-tower pipeline to a LeetCode problem and drill it. When the interviewer says "now score the query against all items" or "return the top-k," your hands already know the pattern.

1 · The scoring core — sparse & dense dot products / matmul

The relevance score is a dot product; at scale the features are sparse. These are the literal subroutine.

2 · Top-k retrieval — the ANN lookup, by hand

Serving = "given the query embedding, return the k nearest item embeddings." That's a top-k by a similarity score — heap or quickselect.

3 · Embedding lookups & ID merging — the feature plumbing

User history → embedding-bag pooling, and matching user feature-IDs against item feature-IDs, are hashmap / two-pointer / merge problems.

4 · Negative sampling — by popularity / weight

In-batch negatives and the logQ correction rest on weighted sampling and frequency counting. These drill that.

5 · Normalization & numerical safety — softmax/cosine prep

The max-subtraction (softmax) and running-extreme patterns that keep the loss numerically sane.

The 30-minute two-tower drill set (do these in order)

If you only have time for a focused block: LC 1570 (the dot product — code it both two-pointer and hashmap) → LC 973 (top-k nearest, heap and quickselect) → LC 528 (weighted sampling = negatives) → then implement P1+P2 on this page (two-tower forward + in-batch loss) from a blank file. That sequence walks the entire pipeline — score → retrieve → sample → train — and is the highest-ROI hour you can spend on this hint.

How to run this round (the meta-strategy)
  • Clarify the contract first: "NumPy from scratch or PyTorch?", "should I return gradients too?", "is the input dense or sparse?" — each is a real fork.
  • State shapes constantly. "X is (n, d), w is (d,), so X@w is (n,)." Shape errors are the #1 failure and narrating them prevents most.
  • Brute force, then optimize. They almost always follow with "make it faster / handle a batch / avoid overflow." Have the vectorized + numerically-stable version ready.
  • Name the property. "softmax is shift-invariant," "the (p−y) gradient is the GLM pattern," "cosine drops magnitude" — these one-liners signal you understand, not memorized.
  • Connect to the system. End two-tower answers with serving (precompute items → ANN). That's the senior signal: you know how the code lives in production.