FOUNDATIONS · ML THEORY

ML theory — the core fundamentals

The questions that distinguish "skilled practitioner" from "deep understanding." Bias-variance, why ReLU, L1 vs L2 geometry, MLE/MAP, calibration, generalization. Every Sr Staff loop has 2-3 of these.

Read: ~45 min · Asked at: Anthropic, OpenAI, DeepMind, Meta, Google · Difficulty: Sr Staff bar · 15 chapters + 80-question quiz
01
FOUNDATIONS · THE MENTAL MODEL

The bias-variance frame — the mental model under everything

TL;DR

Test error = irreducible noise + bias² + variance. Bias is "model class wrong on average," variance is "model jitters across datasets." Every regularizer, every ensemble, every architecture choice trades one against the other. Modern over-parameterized nets break the U-curve via double descent — SGD's implicit bias picks low-norm interpolators.

The decomposition — derive it on the whiteboard

Let y = f(x) + ε with ε ~ N(0, σ²) independent of D. Train predictor f̂(x; D) on dataset D. Expected squared error at point x:

E_D[(y − f̂(x; D))²]
  = E_D[(f(x) + ε − f̂(x; D))²]                                  # substitute
  = E[ε²] + E_D[(f(x) − f̂(x; D))²]                              # cross terms vanish (ε ⊥ D)
  = σ²
  + E_D[(f(x) − E_D[f̂(x; D)])²]                                 # = bias² (squared dist of mean prediction)
  + E_D[(f̂(x; D) − E_D[f̂(x; D)])²]                              # = variance
  ─────────────────────────────────
  = σ²        +    bias²(f̂; x)    +    var(f̂; x)
   irreducible      (model wrong on average)   (model unstable across datasets)
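
A quick way to make the three terms concrete — a minimal numpy sketch (the setup and names are mine, not from the text) that refits a fixed model class on many resampled datasets and estimates bias² and variance at one test point:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # true function
sigma, n, trials, degree = 0.3, 30, 500, 3     # noise, train size, dataset draws, model class
x_test = 0.5

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)         # y = f(x) + eps
    preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)   # f_hat(x_test; D)

bias_sq = (preds.mean() - f(x_test)) ** 2      # (E_D[f_hat] - f)²
variance = preds.var()                         # E_D[(f_hat - E_D[f_hat])²]
print(bias_sq, variance, sigma ** 2)           # the three terms of test MSE at x_test

Sweep degree up and down to watch bias² and variance trade against each other.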

What bias and variance buy you in practice

High bias (underfit)

  • Model class too simple
  • Train and val loss both bad and close
  • Fix: bigger model, richer features, less regularization

High variance (overfit)

  • Model too sensitive to specific sample
  • Train loss low, val loss high
  • Fix: regularize, more data, augment, ensemble, early stop
Irreducible noise — σ², the floor no model can beat. If your data has 10% label noise, no architecture will get past 90% accuracy.

Why bagging and boosting attack different terms

Bagging trains decorrelated models on bootstrap resamples and averages them — averaging shrinks the variance term while leaving bias essentially untouched. Boosting fits weak learners sequentially to the remaining error, so each stage chips away at bias; variance creeps up as the ensemble grows, which is why boosting needs shrinkage and early stopping.

Where deep nets break the textbook U-curve

Classical view: model capacity ↑ → bias ↓, variance ↑ → U-shaped test error.

THE INSIGHT

Double descent — the modern correction

Past the interpolation threshold (model can fit train set perfectly), test error rises near the threshold then falls again. Why: among the infinite interpolating solutions, SGD's implicit bias picks low-norm ones, which generalize. The U-curve is a small-model phenomenon; over-parameterized regime has a different shape entirely.

EXAMPLE — concrete walkthrough

A 2-layer MLP on MNIST with hidden width sweep. Width 32: train acc 95%, test 94% (low bias, low variance). Width 256: 100% train, 92% test (interpolation, peak variance). Width 4096: 100% train, 97% test (over-parameterized, variance fell). The classical "more capacity is worse" story stops at the interpolation threshold.

PITFALL
Conflating "bias of an estimator" (statistics — E[θ̂] − θ_true) with "bias term in the decomposition" (squared distance of mean prediction at a single x). Same word, different objects. Interviewers love this distinction.
REMEMBER
  • Test MSE = irreducible σ² + bias² + variance. Be ready to derive.
  • Bagging kills variance; boosting kills bias. Choose accordingly.
  • Double descent voids the U-curve in the over-parameterized regime — SGD's implicit bias does the regularization for free.
02
OPTIMIZATION · CORE

Optimizers from first principles — GD to Adam to Shampoo

TL;DR

Gradient descent is θ ← θ − η ∇L. On convex-smooth problems it has clean rates. On non-convex deep nets it has none — yet works. SGD generalizes better than full-batch because noise biases toward flat minima. Adam = diagonal preconditioner; AdamW fixed Adam's L2 bug; Shampoo / K-FAC approximate the full Hessian. Conditioning is the secret optimizer ingredient — that's why init, BN/LN, and feature scaling all matter.

The base rule and what convexity buys

For loss L(θ): θ ← θ − η ∇L(θ). Convergence rates depend entirely on the problem class:

Setting                        | Rate                    | Step size
Convex, β-smooth               | O(1/k) in objective     | η = 1/β
μ-strongly convex + β-smooth   | O((1 − μ/β)^k) linear   | η = 2/(μ+β)
Non-convex (deep nets)         | No general guarantee    | Empirical heuristics

The condition number κ = β/μ controls speed; ill-conditioned problems crawl.

Why SGD generalizes better than full-batch

Empirically (Keskar 2017): small-batch SGD finds flatter minima than large-batch, and flatter minima generalize. Two main theoretical lenses:

  • SDE view: SGD ≈ gradient flow plus anisotropic noise whose scale grows with LR/batch-size ratio; the noise kicks the iterate out of sharp basins, so it settles in wide, flat ones.
  • Implicit-regularization view: the expected SGD trajectory behaves like GD on the loss plus a penalty on gradient variance/sharpness — flat minima are robust to parameter perturbation, which links to PAC-Bayes generalization bounds.

EXAMPLE — when "scale LR linearly with batch" fails

You scale ResNet-50 from batch 256 to 8k by multiplying LR 32×. Past ~4k, no further speedup — you've crossed B_crit. Solutions: warmup the LR over the first 5 epochs (smooths Adam moment instability), use LARS/LAMB (per-layer adaptive LRs), or accept that batch beyond B_crit wastes compute.

Momentum, Nesterov, Newton, quasi-Newton

  • Momentum: accumulate a velocity v ← βv + g; damps oscillation across steep directions and accelerates along low-curvature ones.
  • Nesterov: evaluate the gradient at the look-ahead point θ − ηβv; slightly better constants on convex problems.
  • Newton: θ ← θ − H⁻¹∇L; quadratic local convergence, but O(d²) memory and an O(d³) solve — infeasible at deep-net scale.
  • Quasi-Newton (BFGS / L-BFGS): build curvature estimates from gradient differences; L-BFGS is the practical full-batch option.

Adam, AdamW, and the L2 bug

Adam maintains EMAs of grad and grad². Update:

m ← β₁m + (1−β₁) g          # 1st moment EMA
v ← β₂v + (1−β₂) g²         # 2nd moment EMA
m̂ = m / (1 − β₁^t)          # bias correction
v̂ = v / (1 − β₂^t)
θ ← θ − η · m̂ / (√v̂ + ε)

The bug: standard L2 in Adam adds λθ to the gradient, which then gets divided by √v̂ — so high-variance params receive less decay. AdamW (Loshchilov 2019) decouples: apply θ ← θ − η(m̂/√v̂ + λθ) instead. AdamW is the modern default for transformers.
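
In code, the fix is one line — a sketch of a single AdamW step (mirroring the pseudocode above, not any library's internals):

import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    # Moments see only the raw gradient — decay never passes through 1/sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)               # bias correction, as above
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)   # decoupled decay
    return theta, m, v

# Adam + L2 would instead fold lam * theta into grad *before* the moment updates,
# so the decay term gets rescaled by 1/sqrt(v_hat) — that's the bug.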

Why conditioning is the silent optimizer

The Hessian eigenvalue spread κ = λ_max / λ_min determines GD speed. In deep nets, κ can exceed 10⁶. Three tricks fight bad conditioning:

  1. Feature scaling / normalization: gives input dims similar variance → conditions the Hessian's input block.
  2. Weight init (He / Xavier): scales weights so activations and gradients stay comparable across layers — prevents conditioning collapse at depth.
  3. Batch / Layer norm: re-scales activations → smoother loss landscape (Santurkar 2018). The original "internal covariate shift" claim is largely wrong; the real mechanism is Lipschitz smoothing.
PITFALL
Forgetting bias correction in Adam. At step 1 with β₁=0.9, raw m is only 10% of its true value. Without (1 − β^t) correction, your first hundred updates are too small and training drifts.
REMEMBER
  • Convex + β-smooth: O(1/k). Strongly convex: linear. Non-convex deep nets: no guarantee but works.
  • SGD's noise is a feature, not a bug — it picks flat minima.
  • AdamW > Adam for transformers (decoupled weight decay).
  • Init, normalization, and feature scaling all do the same job: they fix Hessian conditioning.
03
ARCHITECTURE · ACTIVATION FUNCTIONS

Why ReLU? — the vanishing gradient story, in numbers

TL;DR

Pre-2010 deep nets used sigmoid/tanh. Both saturate; their max derivative is 0.25 (sigmoid) or 1.0 (tanh at z=0). Stack 20 sigmoid layers and the gradient at layer 1 is multiplied by 0.25²⁰ ≈ 10⁻¹². ReLU's derivative is 1 in the active region — no shrinking. That single fix unlocked deep nets. GELU/SwiGLU smooth the kink for transformers.

Sigmoid's vanishing gradient — the actual numbers

Sigmoid: σ(z) = 1/(1+e⁻ᶻ); σ'(z) = σ(z)(1−σ(z)). Maximum at z=0: σ'(0) = 0.25.

Backprop through L sigmoid layers multiplies L derivatives. Even at the maximum:

Depth L | Best-case gradient at layer 1
5       | 0.25⁵ ≈ 10⁻³
10      | 0.25¹⁰ ≈ 10⁻⁶
20      | 0.25²⁰ ≈ 10⁻¹²

Tanh is slightly better (tanh'(0) = 1.0) but still saturates outside a narrow band.
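
To see the shrinkage directly, multiply per-layer sigmoid derivatives along a depth-L chain — a minimal sketch (random pre-activations, so the realized factors sit below the 0.25 best case):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for L in (5, 10, 20):
    z = rng.normal(0, 1, L)                    # one pre-activation per layer
    factors = sigmoid(z) * (1 - sigmoid(z))    # sigma'(z) <= 0.25 everywhere
    print(L, np.prod(factors))                 # gradient scale reaching layer 1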

ReLU's properties — what changed

ReLU(z) = max(0, z): derivative exactly 1 for z > 0, 0 for z < 0. No saturation in the active half — gradients pass through unshrunk at any depth — and it's nearly free to compute. The cost is a new failure mode:

Dead ReLU — a unit whose pre-activation is always < 0, so its gradient is always 0 and it never recovers. Typically 10-40% of ReLUs end up dead post-training. Mitigations: Leaky ReLU (small negative slope), PReLU (learned slope), ELU, GELU, SiLU.

Why GELU / SwiGLU now in transformers

ReLU's hard cutoff loses info at z near 0. GELU is smooth: x · Φ(x) — small negative inputs still pass a small signal. SwiGLU (Shazeer 2020) adds gating with a Swish nonlinearity: SwiGLU(x) = Swish(x W₁) ⊗ (x W₂), where Swish(z) = z · σ(z). Empirically improves quality at fixed FLOPs.

None of this would scale without residual connections. Residuals give every layer an "identity path" so the gradient never has to multiply through the activation derivative — see chapter 14.

EXAMPLE — counting dead ReLUs in a real model

After training a 4-layer MLP on CIFAR-10 with ReLU, instrument the pre-activations on the val set. Typical result: 25-35% of units fire on <1% of inputs — those are dead. Replace with GELU and the dead-unit count falls to <5%, but throughput drops 5% (GELU is more expensive). For inference-bound deployments, ReLU still wins on Pareto.
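
A sketch of that instrumentation in PyTorch — model and val_loader are hypothetical stand-ins for your trained net and validation data:

import torch

fire_counts, n_seen = {}, 0

def make_hook(name):
    def hook(module, inp, out):
        fired = (out > 0).flatten(1).float().sum(0)      # per-unit fire count this batch
        fire_counts[name] = fire_counts.get(name, 0) + fired
    return hook

for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.ReLU):
        mod.register_forward_hook(make_hook(name))

with torch.no_grad():
    for x, _ in val_loader:
        model(x)
        n_seen += x.shape[0]

for name, fired in fire_counts.items():
    dead_frac = (fired / n_seen < 0.01).float().mean()   # fires on <1% of inputs
    print(f"{name}: {dead_frac.item():.1%} dead")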

REMEMBER
  • Sigmoid's max derivative is 0.25; stack 20 layers and gradient is 10⁻¹². That's the vanishing-gradient story.
  • ReLU's derivative is 1 in the active half — saturation gone.
  • Dead ReLU is a real failure mode (10-40% of units). GELU/SiLU/SwiGLU smooth the kink.
  • Activation choice matters less when you have residuals — they're the real reason 100+ layer nets train.
04
REGULARIZATION · GEOMETRY

L1 vs L2 — geometry, gradient, Bayes, sparsity

TL;DR

L2 is a Gaussian prior; constraint is a smooth ball; gradient pulls weights proportionally; never reaches 0. L1 is a Laplace prior; constraint is a diamond with axis-aligned corners; the corner-touching geometry zeros out coordinates → sparsity. Elastic net combines them. In deep learning, L2 (weight decay) is critical not for "preventing overfitting" but for controlling Lipschitz constant and interacting with batch norm.

L2 (ridge) — the round constraint

Penalty: L = L_0 + λ ||θ||₂² = L_0 + λ Σ θᵢ².

L1 (lasso) — the diamond constraint

Penalty: L = L_0 + λ ||θ||₁ = L_0 + λ Σ |θᵢ|.

Side by side

L2 (ridge)

  • Smooth penalty (differentiable everywhere)
  • Multiplicative shrinkage
  • Closed-form solution exists
  • Gaussian prior
  • Keeps correlated features (small but non-zero)
  • Use as default for prediction

L1 (lasso)

  • Non-smooth penalty (kink at 0)
  • Soft-thresholding: sign(θ) · max(|θ|−λ, 0)
  • Iterative algorithms only (ISTA, LARS)
  • Laplace prior
  • Picks one of correlated features (somewhat arbitrarily)
  • Use for feature selection / interpretability

Why L1 is sparse — visual + math

Visual: imagine the elliptical loss contour tightening toward its minimum. With L2 (round constraint) the first touch is generic. With L1 (diamond) the first touch is most often at a vertex — and vertices have all-but-one coordinate equal to zero.

Math: at a coordinate where |∂L_0/∂θᵢ| < λ, the L1 subgradient zeros that coordinate. L2 only shrinks (multiplicative); L1 thresholds (the soft-threshold operator sign(θ) · max(|θ|−λ, 0)).

EXAMPLE — soft thresholding in one update

Suppose the unregularized gradient step gives θ' = 0.03. With L2 at λ=0.5, the next iterate is θ' / (1 + 2λ) = 0.015 — shrunk but non-zero. With L1 at λ=0.05, soft-threshold gives sign(0.03) · max(0.03 − 0.05, 0) = 0. Coordinate killed.
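
The same arithmetic as code — a sketch of the two proximal (post-gradient) updates on a single coordinate:

import numpy as np

def prox_l2(theta, lam):    # ridge: multiplicative shrink, never exactly zero
    return theta / (1 + 2 * lam)

def prox_l1(theta, lam):    # lasso: soft-threshold, kills small coordinates
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

print(prox_l2(0.03, 0.5))   # 0.015
print(prox_l1(0.03, 0.05))  # 0.0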

Elastic net — the practical compromise

α L1 + (1−α) L2. Combines sparsity with L2's grouping (correlated features both retained or both dropped together). glmnet (Friedman, Hastie, Tibshirani 2010) is the canonical implementation.

Why weight decay works in deep learning (different reason)

In deep nets, weight decay (≈ L2) is critical for generalization, but for different reasons than classical statistics:

  • With batch/layer norm, the function is invariant to weight scale, so decay doesn't "shrink the function" — it shrinks the weight norm, which raises the effective learning rate and keeps optimization in a noisier regime.
  • It bounds the network's Lipschitz constant, which correlates with robustness and margin-based generalization.
  • It keeps logit scale in check, which interacts with calibration.

PITFALL
Saying "L1 is sparse, so use it for feature selection in deep nets." L1 on individual weights doesn't induce structured sparsity (whole-channel removal). For deep nets, use group-lasso, magnitude pruning, or movement pruning instead — naive L1 just makes weights tiny, not zero, and gives no inference speedup.
REMEMBER
  • L2 = Gaussian prior + smooth ball + multiplicative shrinkage. L1 = Laplace prior + diamond + soft-threshold sparsity.
  • L1 produces zeros because its constraint set has axis-aligned corners.
  • In deep learning weight decay's main job is Lipschitz control + interaction with BN/LN, not classical "shrink coefficients."
  • Use AdamW (decoupled), not Adam + L2.
05
PROBABILISTIC ML · UNIFICATION

The probabilistic view — every loss is a negative log-likelihood

TL;DR

MLE picks θ to maximize P(D|θ). Almost every loss in ML is the negative log-likelihood of some distributional assumption: Gaussian → MSE; Bernoulli → binary cross-entropy; Categorical → cross-entropy; Laplace → MAE; Poisson → Poisson loss. Adding a prior P(θ) turns MLE into MAP — and Gaussian prior = ridge, Laplace prior = lasso. Full Bayes integrates θ out.

MLE — the framework

Choose θ to maximize P(D|θ) = ∏ P(xᵢ|θ). Equivalently, minimize −log P(D|θ) = −Σ log P(xᵢ|θ). This is the loss.

Common losses are NLLs in disguise

Likelihood model                   | Negative log-likelihood = Loss                        | Used for
Gaussian noise: y ~ N(f(x), σ²)    | MSE: (1/2σ²) Σ (y − f(x))²                            | Regression
Bernoulli: y ~ Bern(σ(z))          | Binary cross-entropy: −Σ [y log p̂ + (1−y) log(1−p̂)]  | Binary classification
Categorical: y ~ Cat(softmax(z))   | Cross-entropy: −Σ y log p̂                             | Multi-class
Laplace: y ~ Laplace(f(x), b)      | MAE / L1 loss                                         | Robust regression
Poisson: y ~ Poisson(λ(x))         | Poisson loss: Σ (λ − y log λ)                         | Count regression (CTR with rare events)
Negative Binomial                  | NB loss                                               | Over-dispersed counts
THE INSIGHT

"Why MSE for regression?" — the deep answer

MSE is the negative log-likelihood under Gaussian noise. If you assume the noise is heavy-tailed (Laplace), use MAE. If you assume counts (Poisson), use Poisson loss. The loss is a noise-model choice. Picking a loss without thinking about the implicit noise model is what beginners do; staff engineers always know which distribution they're assuming.

Why cross-entropy beats MSE for classification

Three reasons: (1) CE is the actual NLL of the Bernoulli/Categorical model, so minimizing it is MLE; (2) softmax + CE gives the clean gradient p̂ − y, whereas MSE through a sigmoid multiplies in σ'(z) and saturates; (3) CE-trained classifiers tend to be better calibrated.

MAP — adding a prior

argmax P(θ|D) = argmax P(D|θ) P(θ) by Bayes. Adding the prior is adding a regularizer:

MLE

  • Maximize P(D|θ)
  • No prior
  • Asymptotically unbiased
  • Overfits at small N

MAP

  • Maximize P(D|θ) P(θ)
  • Gaussian prior → ridge
  • Laplace prior → lasso
  • Uniform prior → MAP = MLE

Full Bayes — integrating out θ

P(θ|D) ∝ P(D|θ) P(θ); predict by P(y|x, D) = ∫ P(y|x, θ) P(θ|D) dθ. Almost never tractable. Approximations: MCMC (HMC, NUTS), variational inference, Laplace approximation.

Conjugate priors — still asked

Likelihood                  | Conjugate prior
Bernoulli / Binomial        | Beta
Categorical / Multinomial   | Dirichlet
Gaussian (mean, known var)  | Gaussian
Gaussian (precision)        | Gamma
Poisson                     | Gamma
EXAMPLE — Beta-Bernoulli for CTR estimation

Click-through rate for a new ad, observed 2 clicks out of 30 impressions. Pure MLE: 2/30 = 6.7%. With Beta(1,1) prior (uniform), posterior is Beta(3, 29), posterior mean = 3/32 ≈ 9.4%. With Beta(10, 90) prior (informative — average ad is ~10% CTR), posterior is Beta(12, 118), mean = 12/130 ≈ 9.2%. The prior regularizes the small-sample estimate toward a sensible default.
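
The conjugate update is two additions — a tiny sketch:

def beta_posterior(alpha, beta, clicks, impressions):
    # Beta(alpha, beta) prior + Binomial likelihood -> Beta posterior (conjugacy)
    a, b = alpha + clicks, beta + (impressions - clicks)
    return a, b, a / (a + b)                  # posterior params, posterior mean

print(beta_posterior(1, 1, 2, 30))    # uniform prior     -> Beta(3, 29),  mean ≈ 0.094
print(beta_posterior(10, 90, 2, 30))  # informative prior -> Beta(12, 118), mean ≈ 0.092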

REMEMBER
  • Every loss = NLL of a distributional assumption. Pick the noise model first, the loss second.
  • CE beats MSE for classification: matches likelihood, clean gradient (p̂ − y), better calibration.
  • MAP = MLE + prior = MLE + regularization. Gaussian → ridge; Laplace → lasso.
  • Conjugate priors give closed-form posteriors. Beta/Bernoulli, Dirichlet/Cat, Gamma/Poisson.
06
INFORMATION THEORY · TOOLKIT

The information-theoretic toolkit — entropy, KL, MI, JS

TL;DR

Entropy = average bits needed. Cross-entropy = bits if you use the wrong code. KL = excess bits = H(p,q) − H(p). KL is asymmetric: KL(p||q) is mode-covering, KL(q||p) is mode-seeking — that asymmetry decides which one VAEs use vs EP. JS is symmetric and bounded. Mutual information measures shared information; InfoNCE estimates a lower bound and powers contrastive learning. Every distillation loss, every RLHF KL penalty, every VAE ELBO traces back to these.

Entropy and cross-entropy

Entropy — H(p) = −Σ p(x) log p(x). Average bits to encode a sample from p (if log = log₂). Maximum at uniform; 0 for deterministic.
Cross-entropy — H(p, q) = −Σ p(x) log q(x). Bits to encode p when using q's code. Always ≥ H(p).

KL divergence — the workhorse

KL(p || q) = Σ p(x) log(p(x)/q(x)) = H(p, q) − H(p). Asymmetric; not a metric. KL ≥ 0 (Gibbs); = 0 iff p = q. Used everywhere: VAE ELBO, RLHF KL penalty, knowledge distillation.

Why the asymmetry matters — mode-covering vs mode-seeking

KL(p || q) — "forward KL"

  • Weighted by p
  • Penalizes q being small where p is large
  • Mode-covering: q must cover all modes of p
  • Used in expectation propagation, MLE

KL(q || p) — "reverse KL"

  • Weighted by q
  • Penalizes q being large where p is small
  • Mode-seeking: q sticks to one mode of p
  • Used in VAE ELBO, variational inference, RLHF
EXAMPLE — fitting a unimodal q to a bimodal p

Say p is a 50/50 mixture of two Gaussians at ±5 and q is a single Gaussian. Minimizing KL(p||q) places q at mean 0 with huge variance (covers both modes, badly). Minimizing KL(q||p) places q on one of the modes (sharp fit on half the distribution). VAEs use reverse KL → posteriors collapse to single modes if not careful.
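
A numeric sketch of the asymmetry — brute-force fit a single Gaussian q to the bimodal p under each KL direction on a discretized support (illustrative only, not a VI algorithm):

import numpy as np
from scipy.stats import norm

x = np.linspace(-12, 12, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -5, 1) + 0.5 * norm.pdf(x, 5, 1)    # bimodal target

def kl(a, b):
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

best_f = best_r = (np.inf, None, None)
for mu in np.linspace(-8, 8, 81):
    for sd in np.linspace(0.5, 8.0, 76):
        q = norm.pdf(x, mu, sd)
        if kl(p, q) < best_f[0]: best_f = (kl(p, q), mu, sd)   # forward KL(p||q)
        if kl(q, p) < best_r[0]: best_r = (kl(q, p), mu, sd)   # reverse KL(q||p)

print("forward picks mu, sd =", best_f[1:])   # ≈ (0, ~5): covers both modes, badly
print("reverse picks mu, sd =", best_r[1:])   # ≈ (±5, ~1): locks onto one mode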

Mutual information and InfoNCE

I(X; Y) = KL(P(X,Y) || P(X)P(Y)) = H(X) − H(X|Y) = H(Y) − H(Y|X). Reduction in uncertainty about X given Y. InfoNCE (van den Oord 2018) is a tractable lower bound on I(X;Y); it's the loss behind SimCLR, CLIP, and most modern contrastive learning.

Jensen-Shannon divergence

JS(p, q) = ½ KL(p || m) + ½ KL(q || m) where m = ½(p+q). Symmetric; bounded in [0, log 2]; used in original GAN.

Why KL in RLHF?

The KL penalty r − β · KL(π || π_ref) prevents the policy from drifting too far from the reference (SFT model). Without it, RL mode-collapses onto reward-hacking patterns. β trades off reward maximization vs staying-close-to-reference.

PITFALL
Treating KL as a distance. KL(p||q) ≠ KL(q||p), triangle inequality fails, can be infinite if q has zero probability where p doesn't. Use JS or Wasserstein if you need a proper metric.
REMEMBER
  • Cross-entropy = entropy + KL. Minimizing CE = minimizing KL when H(p) is fixed.
  • Forward KL is mode-covering; reverse KL is mode-seeking. VAE / VI / RLHF all use reverse.
  • InfoNCE = tractable lower bound on MI; powers contrastive learning.
  • RLHF's KL penalty is what stops reward hacking.
07
LINEAR ALGEBRA · LOW-RANK

The linear-algebra lens — PCA, SVD, low-rank

TL;DR

SVD X = UΣVᵀ is the swiss-army knife. Columns of V are PCA's principal directions; Σ²/(n−1) are the variances. Truncated SVD gives the best rank-k approximation in Frobenius norm (Eckart-Young). A linear autoencoder with MSE recovers the PCA subspace. LoRA, low-rank adapters, embedding compression — all leverage low-rank structure that SVD makes visible.

PCA via SVD

Center X (subtract column means). SVD: X = U Σ Vᵀ. Then: the columns of V are the principal directions; the scores (projected coordinates) are XV = UΣ; and the variance along component i is σᵢ²/(n−1).

SVD identities to know cold

  • Pseudoinverse: X⁺ = V Σ⁺ Uᵀ (invert the nonzero singular values).
  • Norms: ||X||₂ = σ_max; ||X||_F² = Σ σᵢ².
  • Condition number: κ(X) = σ_max / σ_min.
  • rank(X) = number of nonzero singular values.
  • Eckart-Young: X_k = U_k Σ_k V_kᵀ is the best rank-k approximation in Frobenius (and spectral) norm.

Why PCA matters at interviews

EXAMPLE — LoRA as a low-rank update

LoRA replaces a fine-tune weight delta ΔW ∈ ℝ^{d×d} with BA where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×d}, r ≪ d. For Llama-2 70B (d = 8192) with r=8, that's 2r/d = 16/8192 ≈ 0.2% of the params. Works because the empirical fine-tuning delta has low effective rank — exactly the assumption SVD's truncation theorem gives you. If r = 64 captures 95% of the spectral energy of the true ΔW, the rank-64 LoRA is nearly lossless.
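
The claim is easy to test — a sketch that builds a low-effective-rank ΔW, truncates its SVD, and checks the spectral energy captured at each r (d shrunk from 8192 for speed):

import numpy as np

rng = np.random.default_rng(0)
d, true_r = 512, 16
delta_w = rng.normal(size=(d, true_r)) @ rng.normal(size=(true_r, d))   # low-rank signal
delta_w += 0.05 * rng.normal(size=(d, d))                               # small noise

U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
for r in (4, 16, 64):
    energy = (S[:r] ** 2).sum() / (S ** 2).sum()          # spectral energy captured
    approx = (U[:, :r] * S[:r]) @ Vt[:r]                  # Eckart-Young optimal rank-r
    rel_err = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)
    print(r, round(energy, 3), round(rel_err, 3))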

REMEMBER
  • SVD is the lens for everything low-rank: PCA, pseudoinverse, condition number, LoRA, distillation.
  • Eckart-Young: truncated SVD is the optimal low-rank approximation in Frobenius norm.
  • Linear AE with MSE = PCA. Non-linear AE = (much) more.
08
CLASSICAL ML · MAX-MARGIN

Kernels and SVM — the max-margin story (still asked)

TL;DR

SVM finds the hyperplane that maximizes the margin 2/||w||. Soft-margin adds slack ξ_i with penalty C. The dual depends on data only through inner products → swap inner product for a kernel K(x,x') = ⟨φ(x),φ(x')⟩ and you get a non-linear classifier without computing φ. Hinge loss is the SVM's loss in disguise. Logistic regression and linear SVM scale similarly; kernel SVM is O(N²) memory and dies past ~100k samples.

The kernel trick

Map data to high-dim feature space φ(x); train linear model there. Trick: any algorithm that depends on data only through inner products ⟨x, x'⟩ can be "kernelized" by replacing inner products with a kernel K(x, x') = ⟨φ(x), φ(x')⟩ — without ever computing φ explicitly.

Common kernels

  • Linear: K(x, x') = xᵀx'.
  • Polynomial: K(x, x') = (xᵀx' + c)^d — implicit degree-d feature interactions.
  • RBF / Gaussian: K(x, x') = exp(−γ ||x − x'||²) — infinite-dimensional feature space.
  • Structured-data kernels (string, graph) for non-vector inputs.

SVM derivation — be ready to whiteboard

Maximum-margin classifier. For separable data with labels y ∈ {−1, +1}:

min_{w, b} ½ ||w||²
subject to  y_i (wᵀx_i + b) ≥ 1  for all i

Why ½||w||²? Margin = 2/||w||. Maximize margin ↔ minimize ||w||.

Soft margin (real data is not separable):

min ½||w||² + C Σ ξ_i
s.t. y_i(wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

The Lagrangian dual is what gets kernelized. Dual variables α_i: w = Σ α_i y_i x_i. Only support vectors (α_i > 0) contribute. Replace x_iᵀ x_j with K(x_i, x_j) in the dual → kernel SVM.

Hinge loss — the SVM in loss form

SVM is equivalent to minimizing max(0, 1 − y · f(x)). Hinge gives 0 loss for correctly classified points outside the margin → only "marginal" examples drive learning. Compare to logistic loss log(1 + exp(−y · f(x))) which has nonzero gradient everywhere.

EXAMPLE — when the margin actually matters

For a near-linearly-separable dataset like spam vs ham TF-IDF features, linear SVM and logistic regression give nearly identical accuracy. Where SVM wins: when calibrated probabilities don't matter and you need the maximum-margin decision boundary for adversarial robustness. Where LR wins: when you need probabilities (CTR prediction, ranking), or when classes are not separable and gradient signal everywhere helps optimization.

PITFALL
Trying to scale kernel SVM to a 1M-sample dataset. The Gram matrix is N × N — at N=1M that's 8 TB of float64. Use linear SVM (LIBLINEAR, scikit-learn's LinearSVC) or random Fourier features (Rahimi-Recht) to approximate the kernel.
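
A sketch of the Rahimi-Recht construction for the RBF kernel — random features z(x) whose inner products approximate K(x, x') = exp(−γ ||x − x'||²), so a linear model on z(x) stands in for the kernel machine:

import numpy as np

rng = np.random.default_rng(0)

def rff(X, n_features=2000, gamma=0.5):
    # Sample spectral frequencies of the RBF kernel: w ~ N(0, 2*gamma*I)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(5, 3))
Z = rff(X)
K_approx = Z @ Z.T
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))   # gamma = 0.5
print(np.abs(K_approx - K_exact).max())   # error shrinks as n_features grows
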
REMEMBER
  • Margin = 2/||w||. Maximize margin ↔ minimize ||w||² s.t. functional margin ≥ 1.
  • Kernel trick: replace inner product with K — get non-linear from a linear algorithm.
  • RBF kernel = infinite-dim features. Polynomial kernel = degree-d features.
  • Kernel SVM is O(N²) memory. Beyond ~100k samples, use linear SVM or random features.
09
CLASSICAL ML · TABULAR

Tree-based methods — GBM, XGBoost, LightGBM (still beats NN on tabular)

TL;DR

Trees do greedy axis-aligned splits. Random forest = bagged decorrelated trees → variance kill. Gradient boosting fits sequential trees to negative-gradient pseudo-residuals → bias kill. XGBoost adds Newton-style 2nd-order + regularization + missing-value defaults; LightGBM adds histogram bins + leaf-wise growth + GOSS/EFB for speed; CatBoost adds ordered boosting for categorical features. On tabular with weak signal-to-noise, GBMs still beat neural nets.

Decision trees

Greedy, recursive axis-aligned splits chosen to maximize impurity reduction (Gini / entropy for classification, variance for regression). Unrestricted depth → low bias, high variance; depth limits and pruning are the bias-variance dials.

Random forest (Breiman 2001)

Bagging + feature subsampling: each tree trains on a bootstrap sample, and each split considers a random subset of features. Both tricks decorrelate the trees, so averaging actually cuts variance (the cross-covariance terms in Var(Σ Xᵢ) shrink).

Gradient boosting

Sequentially fit weak learners to the residuals (or, generally, to the negative gradient of the loss). For squared loss, residual = y − ŷ. For other losses, fit pseudo-residuals = −∂L/∂ŷ.

XGBoost (Chen 2016)

Regularized objective with Newton-style 2nd-order; handles missing values via learned default direction; histogram-based splits.

LightGBM (Ke 2017)

Histogram + leaf-wise growth + GOSS (gradient-based one-side sampling) + EFB (exclusive feature bundling). Faster, less memory.

CatBoost

Ordered boosting (mitigates target leakage); native categorical features; usually the easiest to tune.

EXAMPLE — why GBM beats NN on Kaggle tabular

A typical tabular Kaggle dataset has 100-500 features with axis-aligned predictive structure (e.g., "days since last login > 30" matters; "log(income)" matters). Trees pick exactly those splits in O(features × n) per node. A NN must learn axis-alignment from scratch — typically needs 10× the data and careful feature engineering. NNs catch up at scale (TabPFN, FT-Transformer with 100M+ rows).

Common interview Qs on trees

PITFALL
Using one-hot encoding for high-cardinality categorical features in a GBM. Each one-hot column gets its own split candidate; with 10k categories, the trees can't find good splits. Use target encoding (with smoothing) or CatBoost's native handling instead.
REMEMBER
  • RF kills variance via decorrelated bagging. GBM kills bias via sequential residual fitting.
  • Pseudo-residuals = −∂L/∂ŷ generalize residual fitting to any differentiable loss.
  • XGBoost / LightGBM / CatBoost are still the tabular default — pick LightGBM for speed, CatBoost for cats, XGBoost for safe baselines.
  • NNs catch up to trees only at very large data scales.
10
LATENT VARIABLES · EM ↔ ELBO

The EM algorithm and the ELBO connection (EM → VAE)

TL;DR

EM alternates: E-step computes the posterior over latents q(z) = P(z|x, θ_t); M-step maximizes the expected complete-data log-likelihood. Each iteration monotonically increases the marginal likelihood. The same identity is the ELBO — VAEs replace the exact E-step (intractable) with an amortized inference network q_φ(z|x). EM is the unsupervised algorithm that quietly underlies GMMs, HMMs, LDA, and modern variational generative models.

EM in two steps

  • E-step: compute the posterior over latents under the current parameters, q(z) = P(z | x, θ_t).
  • M-step: θ_{t+1} = argmax_θ E_q[log P(x, z | θ)] — maximize the expected complete-data log-likelihood.

Guarantee: each iteration monotonically increases the marginal likelihood log P(x | θ). Converges to local optimum.

THE INSIGHT

EM = ELBO maximization (twice)

EM maximizes the ELBO L(q, θ) = E_q[log P(x, z | θ)] + H(q). The E-step makes the ELBO tight by setting q = posterior; the M-step maximizes the ELBO over θ. This is the same ELBO as in VAEs — a VAE generalizes by using a learned q_φ(z|x) instead of the exact posterior. Once you see this, every modern variational generative model (VAE, β-VAE, normalizing flow latents, diffusion ELBOs) becomes a variant of the same procedure.

EXAMPLE — EM for a 2-component GMM

Initial guess: two Gaussians with random means. E-step: for each point x_i compute γ_i = P(z=1 | x_i, θ) via Bayes. M-step: μ_1 = (Σ γ_i x_i)/(Σ γ_i), μ_2 = (Σ (1−γ_i) x_i)/(Σ (1−γ_i)); update variances and mixing weights similarly. Iterate until log-likelihood plateaus. Local optima abound — initialize from k-means.
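
Those two steps in ~15 lines of numpy — a sketch for 1-D data (synthetic here, so the answer is known):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

mu, sd, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibilities gamma_ik = P(z = k | x_i, theta) via Bayes
    lik = pi * norm.pdf(x[:, None], mu, sd)           # shape (n, 2)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted MLE
    nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(mu, sd, pi)   # ≈ (-2, 3), (1, 1), (0.3, 0.7), up to label swap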

REMEMBER
  • E-step = posterior. M-step = MLE on expected complete-data NLL.
  • ELBO = E_q[log P(x,z|θ)] + H(q). EM coordinate-ascends it.
  • VAE = EM with a learned q_φ instead of the exact posterior. Same identity, scalable.
11
CLASSICAL ML · UNIFICATION

GLMs — the framework that unifies linear, logistic, Poisson

TL;DR

GLMs unify linear regression, logistic regression, Poisson regression, and more under one recipe: pick an exponential-family distribution for y, a linear predictor η = xᵀβ, and a link function g(μ) = η. Canonical link guarantees convex loss and calibrated probabilities. The whole reason logistic regression is a "well-behaved" classifier is GLM theory.

The three components

  1. Random component: y has distribution from exponential family (Gaussian, Bernoulli, Poisson, Gamma, ...).
  2. Systematic component: linear predictor η = xᵀβ.
  3. Link function: g(μ) = η. Connects E[y] to η.
GLM                  | Distribution | Link
Linear regression    | Gaussian     | identity
Logistic regression  | Bernoulli    | logit: log(p/(1−p))
Poisson regression   | Poisson      | log
Multinomial logistic | Categorical  | softmax (multivariate logit)

Why this framework matters

One recipe answers many modeling questions: picking the distribution is picking the noise model (Gaussian → linear, Bernoulli → logistic, Poisson → counts), and the canonical link makes the NLL convex in β and the fitted probabilities calibrated in aggregate. When an interviewer asks "why is logistic regression well-behaved?", GLM theory is the answer.

EXAMPLE — picking the right GLM for CTR

You're modeling click counts per impression bucket. If you treat clicks/impressions as continuous and use linear regression, you'll predict negative values and lose efficiency. Use Poisson regression with offset: log(λ) = log(impressions) + xᵀβ, where the offset converts rate into count. Predicted λ is automatically non-negative and the MLE is convex.
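
A sketch of that model with statsmodels — the data here is synthetic so the fit can be checked against the true coefficients:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))
impressions = rng.integers(50, 5000, n)
true_beta = np.array([-3.0, 0.5, -0.2])
clicks = rng.poisson(np.exp(X @ true_beta) * impressions)   # rate × exposure

model = sm.GLM(clicks, X, family=sm.families.Poisson(),
               offset=np.log(impressions))    # log(lambda) = log(N) + X @ beta
print(model.fit().params)                     # ≈ (-3.0, 0.5, -0.2)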

REMEMBER
  • GLM = exponential-family distribution + linear predictor + link function.
  • Canonical link gives convex NLL and calibrated probabilities for free.
  • Logistic regression is a Bernoulli GLM with logit link. Its niceness comes from being canonical.
  • Poisson regression with log link is the right tool for counts and rates.
12
PROBABILITIES · WHEN THEY MATTER

Calibration — when probabilities matter

TL;DR

A model is calibrated if among samples with predicted probability p, the actual rate is p. Critical for recsys auctions, medical decisions, anything that uses raw probabilities. Modern over-parameterized NNs are systematically over-confident (Guo 2017). Temperature scaling — divide all logits by a single scalar T fit on a held-out set — fixes most of it without retraining and preserves argmax.

Measuring calibration

  • Reliability diagram: bin predictions by confidence and plot observed frequency vs predicted probability; a calibrated model sits on the diagonal.
  • ECE (expected calibration error): bin-weighted average of |accuracy − confidence|.
  • Brier score: mean squared error between predicted probabilities and 0/1 outcomes — a proper scoring rule.

Recalibration techniques

Parametric

  • Platt scaling: fit σ(a · logit + b). 2 params; works for monotonic miscalibration.
  • Beta calibration: parameterized monotonic, more flexible than Platt.
  • Temperature scaling: divide logits by T (1 param). Standard for NN classifiers; preserves argmax → no accuracy change.

Non-parametric

  • Isotonic regression: monotonic non-parametric mapping. Handles arbitrary curves; needs more data; overfits at small N.
  • Histogram binning: average within bin, replace with bin's empirical rate. Simple and weak.

Why neural nets are miscalibrated

(Guo 2017, "On Calibration of Modern Neural Networks") Modern over-parameterized NNs achieve very low cross-entropy but are overconfident. Hypotheses: capacity-driven overfit on confidence; BN; weight decay shifting output scale. Temperature scaling fixes most of it without retraining.

EXAMPLE — temperature scaling in 5 lines

Train your classifier normally. After training, freeze the model and fit a single scalar T by minimizing NLL on the validation set with logits replaced by logits / T. Typically T ≈ 1.5 for ResNet-50 on ImageNet. ECE drops from ~5% to ~1%. Argmax (and therefore accuracy) is unchanged because dividing logits by a positive constant preserves the maximum.
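
Roughly those five lines in PyTorch — val_logits and val_labels are hypothetical held-out tensors from your frozen model:

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    log_t = torch.zeros(1, requires_grad=True)     # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss
    opt.step(closure)
    return log_t.exp().item()   # divide test logits by this T; argmax unchanged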

PITFALL
Reporting accuracy as your single classifier metric for an auction or medical-decision system. Accuracy ignores how confident the wrong predictions were. A model that's 99% confident on its 5% errors will tank an expected-cost calculation. Always also report Brier score or ECE when probabilities are downstream.
REMEMBER
  • Calibrated = predicted probability matches observed rate within each bucket.
  • Modern NNs are overconfident. Temperature scaling fixes it for free.
  • Use ECE / Brier / reliability diagram, not just accuracy, when probabilities matter.
  • Isotonic for arbitrary shape + lots of data; Platt / temperature for small data + monotonic miscalibration.
13
LEARNING THEORY · GENERALIZATION

Generalization theory — VC, Rademacher, PAC-Bayes, NTK

TL;DR

Classical theory (PAC, VC) bounds test error by train error + a complexity term that scales with model capacity. For deep nets it's vacuous — VC ~ #params ≫ n. Rademacher (margin-based) and PAC-Bayes are tighter. NTK explains training dynamics in the infinite-width limit but doesn't capture feature learning. The honest summary: deep-learning generalization theory is still open. Practitioners rely on empirical heuristics + implicit-bias arguments.

PAC learning

A hypothesis class H is PAC-learnable if there's an algorithm such that for any ε, δ > 0, with prob ≥ 1−δ over samples, the learned h has error ≤ ε. Sample complexity bounded by VC dimension.

VC dimension

Largest set of points that H can shatter (label in all 2ⁿ ways). VC of linear classifiers in d-dim = d+1. Generalization bound: err_test ≤ err_train + O(√(VC/n)). Vacuous for over-parameterized NNs (VC ~ # params ≫ n).

Rademacher complexity

Tighter than VC for many cases. R_n(H) = E_σ [sup_{h ∈ H} (1/n) Σ σ_i h(x_i)] where σ_i are Rademacher (±1) random vars. Margin-based bounds (Bartlett) scale with weight norms / Lipschitz constants instead of param count → less vacuous for deep nets.

PAC-Bayes (used by Anthropic for SAEs etc.)

Bound expected loss over a posterior Q: E_Q[err_test] ≤ E_Q[err_train] + O(√(KL(Q || P) / n)). P = prior, Q = posterior. Tight when Q is close to P. Dziugaite & Roy 2017 gave non-vacuous PAC-Bayes bounds for MNIST nets — a milestone.

Implicit bias of SGD / NTK / μP

  • Implicit bias: on linearly separable data, gradient descent on logistic loss converges in direction to the max-margin solution (Soudry 2018); more broadly, SGD tends to select low-norm interpolators among the many that fit.
  • NTK: in the infinite-width limit, training linearizes and the net behaves as kernel regression with a fixed kernel (Jacot 2018) — clean theory, but it misses feature learning.
  • μP: a width-aware parameterization under which feature learning survives at large width and optimal hyperparameters transfer across model sizes (Yang & Hu).

EXAMPLE — VC bound is useless for ResNet

ResNet-50 has 25M params. VC dimension is at least linear in #params, so VC ≥ 25M. ImageNet has 1.2M training images. The bound err_test ≤ err_train + √(VC/n) = err_train + √(25M/1.2M) ≈ err_train + 4.6 — i.e., "test error is at most train error plus 460 percentage points." Useless. This is why margin-based and PAC-Bayes bounds are needed — they scale with norms and KL divergences, not parameter count.

REMEMBER
  • Classical VC bounds are vacuous for over-parameterized NNs.
  • Rademacher (margin / Lipschitz) and PAC-Bayes give non-vacuous bounds in some regimes.
  • NTK explains infinite-width training as kernel regression but misses feature learning.
  • Implicit bias of SGD selects low-norm interpolators — that's the modern best explanation of why deep learning generalizes.
14
DEEP LEARNING · "WHY DOES X WORK?"

The "why does X work?" probes — dropout, BN, augmentation, residuals, depth

TL;DR

Sr Staff favorites. Dropout = ensemble + Bayesian + co-adaptation prevention. BN's real job is loss-landscape smoothing, not "covariate shift." Augmentation = regularization + invariance encoding. Early stopping ≈ L2. Residuals give gradient an identity path → trainable to 100+ layers. Depth beats width because compositional features are exponentially more expressive at fixed param count.

Why does dropout work?

Three complementary stories: (1) ensemble — each mask is a sub-network, and inference approximates averaging exponentially many of them; (2) co-adaptation prevention — units can't rely on specific partners, so each feature must be independently useful; (3) Bayesian — dropout approximates variational inference (Gal & Ghahramani 2016). Largely switched off in modern LLM pretraining.

Why does batch norm work?

Original claim (Ioffe & Szegedy 2015): "internal covariate shift" reduction. Newer view (Santurkar 2018, arxiv 1805.11604): BN makes the loss landscape smoother (smaller Lipschitz constant of gradients), allowing larger LRs. The "covariate shift" claim is largely wrong.

Why does data augmentation work?

It encodes invariances the architecture doesn't — flips, crops, color jitter assert "the label is invariant to these transforms" — and it regularizes by expanding the effective dataset along those invariance directions (a form of noise injection).

Why does early stopping work?

Equivalent to L2 regularization in the linear/quadratic regime (Yao 2007). Stops training before the model fits noise; an implicit constraint on model complexity via the optimization trajectory.

Why does feature normalization help?

Conditioning. The Hessian's eigenvalues depend on input scales. Normalizing features → Hessian eigenvalues become more similar → larger LR works → faster convergence. Without normalization, GD takes tiny steps in the small-eigenvalue directions.

Why does deep beat shallow?

Composition. Layer k builds features of layer k−1's features, and there are function families a deep net represents with polynomially many units that a shallow net needs exponentially many to match (Telgarsky 2016). Per parameter, depth is exponentially more expressive — though at fixed compute, moderate depth + width wins because very deep nets are harder to optimize.

Why does the residual connection matter?

Without residuals, gradient through L layers is a product of L Jacobians → either vanishes or explodes. With residuals, gradient has an "identity path" — ∂(x + f(x))/∂x = I + ∂f/∂x — preserving signal. Lets you train 100+ layers (ResNet, transformers).

Why does softmax saturate / how to avoid?

For very large logits, softmax outputs ≈ 1.0 for the max, ≈ 0 for the rest → gradient through the rest vanishes. Fixes: temperature scaling, label smoothing, gradient clipping. In transformers, the √d_k attention scale is exactly to prevent this.

EXAMPLE — the √d_k trick in attention

For dot-product attention QKᵀ with d_k = 128, raw scores have variance ~128. Softmax of variance-128 logits is essentially one-hot — gradient through all but one position is ~0. Dividing by √d_k = √128 ≈ 11.3 normalizes the score variance back to ~1 → softmax is well-conditioned, gradients flow, training is stable. This is the difference between "transformers train" and "transformers diverge in 5 steps."
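
The effect in a few lines — a sketch comparing softmax sharpness with and without the scale:

import numpy as np

rng = np.random.default_rng(0)
d_k, n = 128, 64
q, K = rng.normal(size=d_k), rng.normal(size=(n, d_k))
scores = K @ q                                   # raw dot products, variance ≈ d_k

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for s, name in [(scores, "unscaled"), (scores / np.sqrt(d_k), "scaled")]:
    p = softmax(s)
    ent = float(-(p * np.log(p + 1e-12)).sum())  # near 0 bits when saturated
    print(name, "max prob:", round(float(p.max()), 3), "entropy:", round(ent, 3))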

REMEMBER
  • Dropout = ensemble + Bayesian + anti-co-adaptation. Off in modern LLM pretraining.
  • BN's real mechanism is loss smoothing, not "covariate shift." LN for sequences.
  • Early stopping ≈ L2. Both implicit constraints on complexity.
  • Residuals make gradient flow via identity path — without them, 100-layer nets don't train.
  • Depth gives exponentially richer functions per param than width — but at fixed compute, "moderate depth + width" wins.
  • The √d_k in attention exists to keep softmax out of saturation.
15
INTERVIEW SIGNAL · DEEP UNDERSTANDING

Tricky theory probes — what tells an interviewer you really get it

TL;DR

The questions below distinguish "knows the formulas" from "understands what's going on." Each one has a clean, deep answer that a Sr Staff candidate should give without hesitation. Skim the questions, sketch your answer, then check.

The 50 deep-understanding questions

  1. Q. "If your validation loss is much higher than train loss, but test accuracy is fine, what's going on?"

    Almost always the model is overconfident on misclassified validation examples — the cross-entropy loss penalizes confidence × wrongness, while accuracy is binary. The model has learned to be very confident on most examples and is occasionally very wrong; that drives loss up while accuracy stays good. Fix: temperature scaling for calibration if probability matters; otherwise it's not a real problem.

  2. Q. "Why isn't ReLU differentiable at 0, and why doesn't that break SGD?"

    ReLU has a kink at 0 — left derivative = 0, right derivative = 1. The probability of a stochastic-gradient update landing exactly at 0 is measure zero. In practice we use either the subgradient (any value in [0,1]) or set the derivative at 0 to 0 by convention. SGD's robustness to non-smoothness comes from finite-precision floating point + stochasticity.

  3. Q. "Why does adding L2 regularization improve a model that's not overfitting?"

    L2 doesn't only fight overfitting. It (1) improves Hessian conditioning (adds λI → eigenvalues bounded below by λ), (2) makes training more stable / faster convergence, (3) bounds Lipschitz constant → robustness, (4) for NNs interacting with batch norm, controls effective angular direction. So even at-capacity models benefit.

  4. Q. "What's wrong with using accuracy on imbalanced classes?"

    If 99% of samples are negative, a model predicting always-negative gets 99% accuracy with zero recall. Use precision/recall, F1, AUC-ROC (threshold-independent), AUC-PR (better for imbalance), Brier score. For ranking, NDCG / MAP. For decisions, expected cost / value.

  5. Q. "AUC-ROC vs AUC-PR — when each?"

    AUC-ROC is invariant to class balance. AUC-PR (precision-recall curve area) is more sensitive to positive-class performance under heavy imbalance — use it when positives are rare and you care about them (rare-event detection, fraud, ad click).

  6. Q. "Your test set AUC is great but the model fails in production. Why?"

    Most common: distribution shift. Train/test were i.i.d. from logged data; production has different traffic mix (covariate shift), different label distribution (prior shift), or selection bias (you only have labels on items the old model showed — exposure bias). Other: temporal leakage (future info in features), train/serve skew (different feature pipelines).

  7. Q. "What is the bias of an estimator?"

    The expected difference between the estimator and the true value: Bias = E[θ̂] − θ_true. Distinct from bias-variance decomposition's "bias" (which is squared at a single x). MLE is asymptotically unbiased under regularity conditions, but at finite n can be biased (e.g., MLE of Gaussian variance underestimates).

  8. Q. "Why does Adam converge to a worse generalization point than SGD on some vision tasks?"

Adam's per-coordinate adaptive LR can find sharper minima with worse generalization than SGD. Empirically observed (Wilson 2017, "The Marginal Value of Adaptive Gradient Methods"). LLMs / transformers go the other way (Adam beats SGD); the differences come down to noise structure + curvature shape per problem class. If you do use Adam, use AdamW with proper weight decay.

  9. Q. "Why does softmax + cross-entropy have a clean gradient?"

    Let p = softmax(z) and L = −Σ y log p. Then ∂L/∂z = p − y. The complex Jacobian of softmax cancels with the log inside CE. This is why CE is preferred over MSE for classification — gradient signal is simple and proportional to error.

  10. Q. "When is Naive Bayes optimal?"

    When features are conditionally independent given the class. In practice, NB often works well even when this assumption is violated, because for classification we only need the argmax of P(class | x) — not the calibrated probability. NB's biased probability estimates can still produce correct argmax.

  11. Q. "Why does the bias term in linear regression matter?"

    Without bias (intercept), the regression line is forced through the origin — generally a worse fit. With bias, the model fits an arbitrary affine function. In NNs with biased layers, removing biases ties the layer's "default" output to zero, which is rarely what you want — though some modern transformers (Llama) drop bias terms in linear layers because RMSNorm makes them redundant.

  12. Q. "Explain log-sum-exp trick."

    log Σ exp(x_i) overflows if any x_i is large. Subtract max: log Σ exp(x_i) = m + log Σ exp(x_i − m) where m = max(x_i). Now exp arguments are ≤ 0, can't overflow. Used in softmax, partition functions, and any "log of a sum of exponentials" computation.

  13. Q. "What's the curse of dimensionality?"

    In high-dim space, (1) volume concentrates on the surface of any region, (2) all pairwise distances become similar (concentration of measure), (3) the number of samples needed to cover the space grows exponentially, (4) most local methods (kNN, kernel density) fail. Why deep learning still works: data lies on a low-dim manifold within the high-dim space.

  14. Q. "Why is GBM more sensitive to LR than RF?"

    GBM's trees are sequential — each fits the residual after applying previous trees with shrinkage η. Small η = small steps, more trees needed, less overfit per step. RF trees are independent — no LR; aggregation reduces variance regardless. GBM's η + n_estimators control bias-variance trade.

  15. Q. "Explain regularization paths."

    The set of solutions w(λ) as the regularization strength λ varies from 0 to ∞. For lasso, the path is piecewise linear in λ (LARS algorithm). For ridge, smooth. Useful for hyperparam selection (CV over the path) and feature ranking (order in which lasso adds features).

  16. Q. "Why does the kernel trick let SVMs handle non-linear data?"

    Map data implicitly to a high-dim feature space φ(x); a linear separator in φ-space corresponds to a non-linear boundary in original space. The trick avoids explicit φ — only inner products are needed, and K(x,x') = ⟨φ(x), φ(x')⟩ is computed directly. RBF kernel corresponds to infinite-dim feature space.

  17. Q. "Difference between generative and discriminative models?"

    Generative models P(x, y) (or P(x|y) and P(y)). Examples: Naive Bayes, GMMs, HMMs, VAEs, diffusion. Can sample x. Discriminative models P(y | x) directly. Examples: logistic regression, SVM, neural classifiers. Generally better classification accuracy with enough data; generative wins with little data or when you need to generate. (Ng & Jordan 2002.)

  18. Q. "What does it mean for an estimator to be 'consistent'?"

    It converges in probability to the true parameter as n → ∞: P(|θ̂_n − θ| > ε) → 0. Distinct from unbiasedness — biased estimators can be consistent (bias goes to 0). MLE is consistent under regularity conditions; biased shrinkage estimators (like ridge) are inconsistent for the true coefficient but can have lower MSE.

  19. Q. "What is the bias of a sample variance? How do you fix it?"

    Sample variance with 1/n divisor is biased: E[s²] = (n−1)/n · σ². Bessel's correction divides by (n−1) instead of n, yielding unbiased s² = (1/(n−1)) Σ (x_i − x̄)². Intuition: x̄ is computed from the sample, so deviations from x̄ are slightly smaller than from true μ.

  20. Q. "Why don't we use accuracy as a loss function?"

Accuracy is non-differentiable (involves argmax). Gradient zero almost everywhere. Need a surrogate (cross-entropy, hinge) with useful (sub)gradients that shares its argmax with accuracy. Surrogate losses' relationship to the target metric is studied as "calibration" / "consistency" in the surrogate-loss literature.

  21. Q. "When is a problem 'convex'? Why do we care?"

    The objective is a convex function and the feasible set is convex. Implications: any local minimum is global; gradient descent converges to optimum; saddle points don't exist (in strict convex case); duality gap is zero. Linear/logistic regression with L2 is convex. Deep NNs are non-convex — much of optimization theory doesn't apply directly, hence the focus on empirical heuristics.

  22. Q. "What does it mean for two random variables to be 'uncorrelated'? Same as 'independent'?"

Uncorrelated: Cov(X, Y) = 0, equivalently E[XY] = E[X] E[Y]. Independent: P(X, Y) = P(X) P(Y) (full distribution factors). Independence implies uncorrelated; the converse holds only for jointly Gaussian X, Y. Classic counterexample: X ~ Uniform(−1, 1), Y = X². Cov(X, Y) = 0 but X and Y are perfectly dependent.

  23. Q. "What is the variance of a sum of correlated random variables?"

    Var(Σ X_i) = Σ Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j). For uncorrelated, the cross-covariance terms vanish. This is why ensembling (random forest) reduces variance more when base learners are decorrelated.

  24. Q. "What's the difference between L1 and L2 from a gradient perspective?"

    L2 gradient: 2λθ — proportional to weight; pulls weights uniformly toward zero, never reaching it. L1 (sub)gradient: λ · sign(θ) — constant pull regardless of magnitude; small weights get pushed to exactly 0 by the soft-threshold operator sign(θ) max(|θ| − λ, 0). That's why L1 induces sparsity.

  25. Q. "What's the role of the Hessian in second-order optimization?"

    Newton's update: θ ← θ − H⁻¹ ∇L. H⁻¹ rescales gradient by the inverse local curvature → quadratic convergence near optimum. Issue: H is N×N (infeasible for N≫1000). Approximations: BFGS (rank-1 updates from gradient history), Adam (diagonal preconditioner using grad²), Shampoo (Kronecker-factored), K-FAC (block-diagonal Fisher).

  26. Q. "Explain Jensen's inequality and one ML application."

    For convex f: f(E[X]) ≤ E[f(X)]. Concave: reverse. ML application — ELBO derivation: log P(x) = log ∫ P(x, z) dz = log ∫ q(z) P(x,z)/q(z) dz ≥ ∫ q(z) log P(x,z)/q(z) dz (by Jensen on log, which is concave). The RHS is the ELBO. This is how the VAE's variational bound is derived.

  27. Q. "Why is cross-entropy non-negative?"

    For probability distributions p (true) and q (predicted): H(p, q) ≥ H(p) ≥ 0. The first inequality is Gibbs' inequality (equivalent to KL ≥ 0). The second is because log p ≤ 0 for p ∈ [0,1]. Cross-entropy = 0 only when p is deterministic AND q matches.

  28. Q. "Why is KL divergence asymmetric?"

    KL(p||q) = Σ p log(p/q) uses p as the weighting. KL(q||p) uses q. They penalize different regions: KL(p||q) heavily penalizes q being small where p is large (covers all modes of p — "mode-covering"); KL(q||p) heavily penalizes q being large where p is small (q sticks to one mode — "mode-seeking"). Choice matters in VI: ELBO uses KL(q||p) → mode-seeking; EP uses KL(p||q) → mode-covering.

  29. Q. "What is the difference between batch norm at train vs eval?"

    Train: normalize using batch statistics (mean/var of current mini-batch). Eval: normalize using running averages accumulated during training. Differences cause subtle bugs: a model in train mode at inference will use one-batch stats and behave differently. Always call model.eval() for inference.

  30. Q. "Why doesn't increasing model capacity always help in deep learning?"

    Classical view: bigger capacity → overfit. But for over-parameterized NNs, double descent shows test error rises near interpolation threshold, then falls. Practical issues: (1) data bottleneck — bigger model with same data doesn't help past a point (Chinchilla scaling); (2) optimization gets harder; (3) inference cost dominates total cost. The right framing is "compute optimal" — fix budget, find best model size + data size jointly.

  31. Q. "What's the cost of training a logistic regression vs an SVM at scale?"

    Logistic regression: convex, smooth — SGD scales linearly with N samples. SVM: convex, non-smooth (hinge) — solved as QP via SMO or via SGD on the hinge loss. Linear SVM and logistic at scale are essentially equivalent in cost; kernel SVM is O(N²) memory for the Gram matrix → infeasible past ~100k samples. Modern recsys/CTR pipelines use logistic regression (or its NN generalization) for this reason.

  32. Q. "Why are random forests said to be 'almost impossible to overfit'?"

    Adding more trees doesn't increase model variance — bagging averages out variance. But each tree can overfit individually if grown unrestricted; the variance of the ensemble is tamed by averaging. RF still can overfit if (a) too few trees, (b) trees too deep relative to data, (c) too many irrelevant features at each split. "Hard to overfit" is a slight overstatement.

  33. Q. "What's the relationship between sigmoid and softmax?"

    Softmax with 2 classes is equivalent to sigmoid: softmax([z, 0]) = [σ(z), 1−σ(z)]. Sigmoid is the binary special case. For binary classification, sigmoid is more efficient (one logit instead of two with the constraint they sum to 1).

  34. Q. "Why does feature engineering still matter when we have deep learning?"

    For tabular data with weak signal-to-noise, feature engineering (interactions, log transforms, target encoding, time-aggregations) outperforms NNs at modest data sizes. NNs need lots of data to learn what good features are; trees + good features beat NNs + raw data on most tabular tasks. Even at scale, domain-specific feature engineering speeds learning and improves robustness.

  35. Q. "What's the difference between standardization, normalization, and whitening?"

    Standardization: zero mean, unit variance per feature. Min-max normalization: rescale to [0,1] (or [−1,1]). Whitening: full covariance is identity (decorrelates features). PCA whitening uses eigenvectors; ZCA whitening preserves the original feature space orientation. NN inputs typically use standardization; whitening is rare for NNs because BN partly accomplishes it inside the network.

  36. Q. "What's wrong with using R² for non-linear models?"

    R² = 1 − SS_res/SS_tot. For linear regression with intercept on training data, R² ∈ [0, 1] and equals squared Pearson correlation between y and ŷ. For non-linear / no-intercept / out-of-sample, R² can be negative (model worse than mean predictor). Better to report MSE/MAE plus context.

  37. Q. "What does 'data leakage' mean? Common sources?"

    Train data contains information about the label or test set that wouldn't be available at prediction time. Sources: (1) target included in features (oops); (2) data preprocessing fit on full dataset before split (e.g., scaling, target encoding); (3) future info in time-series features; (4) duplicates between train and test; (5) selection bias (only labeled examples are easy ones); (6) for recsys: training on logs that the model itself created.

  38. Q. "Explain stratified vs random K-fold cross-validation."

    Stratified: each fold has the same class distribution as the full dataset. Critical for imbalanced classification — random folds can leave a fold with 0 positive examples. For regression, can stratify on binned target. For time-series, neither — use time-series CV (forward-chaining splits).

  39. Q. "Why might precision and recall not be enough?"

    P/R fix a single threshold. AUC summarizes across thresholds. But neither captures: (1) cost asymmetry (false positive cost ≠ false negative cost — use expected cost), (2) ranking quality beyond binary (NDCG), (3) calibration of probabilities, (4) per-segment performance (slicing matters). For recommendation, P@K is misleading without considering position bias.

  40. Q. "What is a 'sufficient statistic'?"

    A function T(X) of the data such that the conditional distribution of X given T(X) doesn't depend on the parameter θ. All info about θ in the data is captured by T(X). For Gaussian with known variance, sample mean is sufficient for μ. Comes up in exponential family / GLM theory.

  41. Q. "What's the connection between linear regression and the projection matrix?"

    OLS solution β̂ = (XᵀX)⁻¹ Xᵀy. Predictions ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y = H y where H = X(XᵀX)⁻¹Xᵀ is the "hat" matrix — the projection onto the column space of X. So OLS projects y onto the column space of X. I − H projects onto the residual space. Sum of leverages = trace(H) = rank(X) = effective parameter count.

  42. Q. "What's the multiple-comparisons problem in A/B testing?"

    Running many tests at α = 0.05, you'll get false positives at rate ~5% per test → with 20 tests, ~63% chance of at least one false positive. Corrections: Bonferroni (divide α by # tests, conservative), Benjamini-Hochberg (controls false discovery rate, less conservative), sequential testing methods. Always pre-register your primary metric and a small set of guardrails.

  43. Q. "What does it mean for a model to be 'identifiable'?"

    Different parameter values yield different distributions: P(x | θ_1) ≠ P(x | θ_2) implies θ_1 ≠ θ_2. Without identifiability, you can't uniquely recover θ from data even with infinite samples. Mixture models are non-identifiable up to label permutation. Deep nets are non-identifiable in the strict sense (many weight configs produce same function); in practice we don't care about θ, only the function.

  44. Q. "Why does adding a hidden layer to a neural net make it 'universal'?"

    Universal approximation theorem (Cybenko 1989, Hornik 1991): a feedforward NN with one hidden layer of arbitrary width and a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy. Theoretical result; practically the required width is often exponential. Why depth helps: depth gives exponentially more efficient representation than width for many functions (Telgarsky 2016).

  45. Q. "What's the empirical risk vs the true risk?"

    True risk: R(f) = E_{(x,y)~P}[L(f(x), y)]. What you actually want — expected loss on the data distribution. Empirical risk: R̂(f) = (1/n) Σ L(f(x_i), y_i). What you compute on the training set. ERM minimizes R̂; learning theory studies how close R̂ is to R (generalization gap).

  46. Q. "What's the difference between online learning and online inference?"

    Online learning: model updates from each example as it arrives (streaming SGD). Online inference: model serves predictions in real time (latency budget, no retraining). They're orthogonal — most production systems do online inference + batch retraining; full online learning is rare due to stability concerns.

  47. Q. "What is the curse of binarization for ranking metrics?"

    Binary metrics (precision@K, recall@K) treat all relevant items as equal. NDCG / graded relevance accounts for varying degrees of relevance. For ranking, the difference between rank-1 and rank-2 of a great item matters more than rank-99 vs rank-100 — discount functions (1/log₂(rank+1) for NDCG) capture this; binary metrics don't.

  48. Q. "Why doesn't increasing regularization always reduce overfitting?"

    Past a point, regularization underfits — increases bias more than it reduces variance. The bias-variance trade-off has a sweet spot; CV finds it. Also, the form of regularization matters — L2 may not produce the right inductive bias for sparse / structured data, where L1 / group-lasso would.

REMEMBER
  • If you can answer all 50 above without hesitation, you're at the Sr Staff theory bar.
  • The next section has 30+ more questions covering classical ML, optimization, and deep learning probes.

Extended quiz — 30+ more theory questions

Cover all of these and you've passed any "ML theory probe" round.

  1. Q. "What is overfitting?"

    Model fits training data better than the underlying distribution → poor generalization. Symptoms: large train-test gap. Causes: high model capacity, low data, noisy labels, leakage. Cures: regularization, more data, augmentation, early stopping, ensembling, simpler model.

  2. Q. "Why does dropout reduce overfitting?"

    Stochastic ensemble interpretation (each mask = sub-network); co-adaptation prevention; Bayesian variational interpretation (Gal & Ghahramani 2016). Equivalent to a noise injection regularizer.

  3. Q. "What is the difference between parameters and hyperparameters?"

    Parameters: learned from data via optimization (weights, biases). Hyperparameters: set before learning, control the learning process (LR, batch size, regularization strength, architecture). Hyperparameter optimization: grid search, random search, Bayesian optimization (Optuna, Hyperopt), Hyperband, BOHB.

  4. Q. "What is gradient clipping and when do you use it?"

    Cap gradient norm at a threshold C: if ||g|| > C, scale g by C/||g||. Used to prevent exploding gradients in RNNs, transformers (rare with proper init + LR), RL. Modern LLM pretraining uses gradient clipping at 1.0 routinely; it's a cheap insurance against loss spikes.
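    A minimal global-norm clipping sketch in numpy (helper name is ours); torch.nn.utils.clip_grad_norm_ does the same in-place:

      import numpy as np

      def clip_by_global_norm(grads, max_norm):
          # Rescale a list of gradient arrays so their joint L2 norm is <= max_norm
          total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
          scale = min(1.0, max_norm / (total + 1e-6))
          return [g * scale for g in grads], total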

  5. Q. "What's a vanishing/exploding gradient and what causes them?"

    Backprop through L layers multiplies L Jacobians. If their product norm is << 1 → vanishing (early-layer weights don't update). If >> 1 → exploding (NaN losses). Causes: bad init (too small / too large), sigmoid/tanh saturation, no normalization, no residual connections. Fixes: He/Xavier init, ReLU/GELU, BN/LN, residuals, gradient clipping.
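    A toy linear-net demo of the effect, assuming nothing beyond numpy (init gain is the knob):

      import numpy as np

      rng = np.random.default_rng(0)
      depth, width = 50, 64
      for gain in (0.5, 1.0, 2.0):              # scale relative to 1/sqrt(fan-in) init
          h = rng.normal(size=width)
          for _ in range(depth):
              W = gain * rng.normal(size=(width, width)) / np.sqrt(width)
              h = W @ h
          print(gain, np.linalg.norm(h))        # 0.5 → ~1e-15 (vanish), 2.0 → ~1e+15 (explode)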

  6. Q. "What's a 'representation' in deep learning?"

    An intermediate layer's activation, viewed as a feature vector for the input. Good representations: linearly separable for downstream tasks; capture invariances; transferable. Self-supervised pretraining (CLIP, SimCLR, MAE, masked LM) explicitly trains for good representations without labels.

  7. Q. "What is contrastive learning?"

    Learn representations by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart in embedding space. InfoNCE loss is the dominant form. Examples: SimCLR (image augmentations as positives), CLIP (image-caption pairs), two-tower retrieval (user-item interactions).
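    A minimal in-batch InfoNCE sketch in numpy (function name is ours):

      import numpy as np

      def info_nce(z1, z2, tau=0.1):
          # z1, z2: (N, d) embeddings of N positive pairs; other rows act as negatives
          z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
          z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
          logits = z1 @ z2.T / tau                             # (N, N); diagonal = positives
          logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
          log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
          return float(-np.mean(np.diag(log_p)))               # cross-entropy toward the diagonal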

  8. Q. "Why is temperature in softmax / contrastive learning important?"

    Temperature τ scales logits before softmax. Low τ → sharper distribution (focuses on top items). High τ → flatter. In contrastive learning, low τ amplifies the gradient on hard negatives but can over-fit; high τ underweights hard examples. CLIP uses learned τ.
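    A quick demo of the sharpening effect:

      import numpy as np

      def softmax(z, tau):
          z = np.asarray(z) / tau
          e = np.exp(z - z.max())
          return e / e.sum()

      softmax([2.0, 1.0, 0.5], tau=0.1)   # ≈ one-hot on the top logit
      softmax([2.0, 1.0, 0.5], tau=10.0)  # ≈ uniform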

  9. Q. "What's the difference between SGD and mini-batch SGD?"

    SGD: gradient on one example. Mini-batch SGD: gradient on a small batch (32-1024 typical). Mini-batch is the standard — vectorized compute, less noisy, can use BN. "SGD" in modern DL almost always means mini-batch SGD.

  10. Q. "What is gradient checkpointing?"

    Trade compute for memory: save only some activations during the forward pass; recompute the rest during the backward pass. Full recomputation costs roughly one extra forward pass (~33% more total FLOPs when backward ≈ 2× forward) and can cut activation memory by an order of magnitude. Critical for training large models on limited GPU memory.
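    A minimal PyTorch sketch, assuming a recent version that exposes the use_reentrant keyword:

      import torch
      from torch.utils.checkpoint import checkpoint

      block = torch.nn.Sequential(
          torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512))
      x = torch.randn(8, 512, requires_grad=True)
      y = checkpoint(block, x, use_reentrant=False)  # block's activations recomputed in backward
      y.sum().backward()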

  11. Q. "Explain teacher forcing vs free running in seq2seq training."

    Teacher forcing: at each decode step, feed the ground-truth previous token (not the model's own output). Stable, fast training. Issue: exposure bias — at inference the model sees its own (potentially wrong) outputs. Free running: feed model's own output. Slower training, can be unstable. Modern fixes: scheduled sampling, distillation, RL fine-tuning.

  12. Q. "What is label smoothing and when does it help?"

    Replace one-hot target with (1 − ε) · onehot + ε / K. The model is no longer rewarded for assigning probability 1 to the correct class → less overconfident, better calibration. Used in original Transformer, Inception nets. Less common in LLM pretraining where calibration isn't a target metric.
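    A minimal target-construction sketch (torch.nn.CrossEntropyLoss exposes the same thing via its label_smoothing argument; helper name is ours):

      import numpy as np

      def smooth_targets(labels, num_classes, eps=0.1):
          onehot = np.eye(num_classes)[labels]
          return (1.0 - eps) * onehot + eps / num_classes  # rows still sum to 1

      smooth_targets(np.array([2]), num_classes=4)  # [[0.025, 0.025, 0.925, 0.025]]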

  13. Q. "What is mixup and why might it work?"

    Train on convex combinations: x̃ = λ x_i + (1 − λ) x_j, ỹ = λ y_i + (1 − λ) y_j, with λ ~ Beta(α, α). Encourages linear behavior between training points. Improves generalization, calibration, robustness. Mostly a vision technique — it doesn't apply directly to discrete tokens, though embedding-space variants exist.
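    A minimal batch-level sketch in numpy (function name is ours):

      import numpy as np

      def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng(0)):
          lam = rng.beta(alpha, alpha)        # mixing coefficient λ ~ Beta(α, α)
          perm = rng.permutation(len(x))
          return (lam * x + (1 - lam) * x[perm],
                  lam * y_onehot + (1 - lam) * y_onehot[perm])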

  14. Q. "What is the manifold hypothesis?"

    Real-world high-dim data (images, text) lies on a low-dimensional manifold within the ambient high-dim space. Implication: ML can succeed despite curse of dimensionality. Most data points are unreachable (gibberish images, random text); the meaningful subset has low intrinsic dimension. Generative models (diffusion, VAE) implicitly learn this manifold.

  15. Q. "Why does the LR schedule matter?"

    Early training: high LR for fast progress, but instability. Late training: low LR for fine convergence. Warmup smooths early instability (large grad magnitudes when Adam moments are noisy). Decay (cosine, WSD, linear) reduces LR over time → finer steps near optimum. Empirical: schedule choice affects final loss by 5-15% with the same total compute.
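    A minimal warmup-plus-cosine schedule, one common choice among many (function name is ours):

      import math

      def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
          # Linear warmup to peak_lr, then cosine decay to min_lr
          if step < warmup_steps:
              return peak_lr * (step + 1) / warmup_steps
          t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
          return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))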

  16. Q. "How do you handle class imbalance?"

    (1) Resampling: oversample minority (SMOTE), undersample majority. (2) Class weights in loss. (3) Focal loss (down-weights easy examples). (4) Threshold tuning at inference. (5) Get more minority data. (6) Different loss / metric (PR-AUC instead of accuracy). For very rare events, anomaly detection framing.
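    A minimal binary focal-loss sketch in numpy, following the Lin et al. 2017 form (helper name is ours):

      import numpy as np

      def focal_loss(p, y, gamma=2.0, alpha=0.25):
          # p = predicted P(y=1); y in {0, 1}; gamma down-weights easy examples
          p_t = np.where(y == 1, p, 1.0 - p)           # prob assigned to the true class
          a_t = np.where(y == 1, alpha, 1.0 - alpha)   # class weighting
          return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))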

  17. Q. "What is the bias of cross-validation as an estimator of test error?"

    Each K-fold CV model trains on a (1 − 1/K) fraction of the data — slightly less than the final model → slightly pessimistic bias. LOOCV (K=N) is nearly unbiased but high-variance. K=5 or 10 is the standard compromise.

  18. Q. "Why might LASSO hurt prediction accuracy compared to ridge?"

    LASSO's sparsity comes at a cost: it picks one feature among correlated ones (somewhat arbitrarily). If both correlated features are predictive, ridge keeps both (small but non-zero coefficients) → better prediction. Elastic net combines both: shrinkage AND selection.

  19. Q. "Explain the concept of 'leakage' in time series."

    Future information appears in features available at prediction time. Examples: rolling statistics computed across the whole dataset (uses future); target aggregates by entity that include the current target; preprocessing fit on full data. Time-series CV (forward-chaining, expanding window) prevents this; "as-of" feature stores enforce point-in-time correctness.
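    A minimal expanding-window splitter (sklearn's TimeSeriesSplit implements the same idea; names here are ours):

      import numpy as np

      def expanding_window_splits(n, n_folds, min_train):
          # Chronological train/val splits; the training window only ever grows
          fold = (n - min_train) // n_folds
          for k in range(n_folds):
              end = min_train + k * fold
              yield np.arange(0, end), np.arange(end, min(end + fold, n))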

  20. Q. "What's the difference between ROC and PR curves?"

    ROC: TPR (recall) vs FPR. Invariant to class balance. PR: precision vs recall. Sensitive to class balance — for rare positives, AUC-PR is more informative than AUC-ROC. PR baseline for random classifier = positive class rate (not 0.5).

  21. Q. "Why do we use stochastic gradient descent instead of full-batch?"

    Three reasons: (1) computational — full-batch gradients are expensive at scale; (2) statistical — gradient noise acts as a regularizer (tends to find flatter minima); (3) memory — the full batch doesn't fit. The "right" batch size is often the largest that doesn't sacrifice generalization, guided by critical-batch-size analysis (McCandlish et al. 2018).

  22. Q. "What's different about training time-series vs i.i.d. data?"

    (1) Sampling order matters — can't shuffle. (2) Train/val/test must be split chronologically. (3) Cross-validation: forward-chaining or expanding window. (4) Stationarity assumption — distribution can shift over time, requires periodic retraining. (5) Lag/lead features need careful construction. (6) Target leakage from rolling stats.

  23. Q. "Explain Maximum Mean Discrepancy (MMD)."

    A kernel-based distance between two distributions: MMD²(P, Q) = E_{x,x'~P}[k(x,x')] − 2 E_{x~P, y~Q}[k(x,y)] + E_{y,y'~Q}[k(y,y')]. Used for two-sample testing, generative model evaluation, distribution shift detection. With a characteristic kernel (RBF), MMD = 0 iff P = Q.
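    A minimal RBF-kernel MMD² estimator in numpy — the biased V-statistic form (names are ours):

      import numpy as np

      def rbf_kernel(x, y, sigma=1.0):
          d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2.0 * sigma ** 2))

      def mmd2(x, y, sigma=1.0):
          return float(rbf_kernel(x, x, sigma).mean()
                       - 2.0 * rbf_kernel(x, y, sigma).mean()
                       + rbf_kernel(y, y, sigma).mean())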

  24. Q. "Why is the Gaussian distribution so prevalent in ML?"

    (1) Central Limit Theorem: sum of independent random variables tends to Gaussian. (2) Maximum entropy distribution given mean and variance constraints. (3) Closed-form posterior in many conjugate models. (4) MLE under Gaussian noise = MSE, the most-used loss. (5) Computationally tractable (multivariate Gaussian closed under marginal/conditional/linear transform).

  25. Q. "Why does Bayes' rule matter?"

    P(θ | D) = P(D | θ) P(θ) / P(D). Posterior ∝ likelihood × prior. Foundation of Bayesian ML, MAP estimation, the naive Bayes classifier, and every probabilistic interpretation in deep learning. The denominator P(D) = ∫ P(D | θ) P(θ) dθ is the marginal likelihood (evidence); intractable in general → motivates MCMC, VI, ABC.
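    A minimal conjugate example, assuming scipy — Beta prior on a coin's bias, binomial likelihood:

      from scipy.stats import beta

      a, b, k, n = 2, 2, 7, 10               # Beta(2, 2) prior; observe 7 heads in 10 flips
      posterior = beta(a + k, b + (n - k))   # conjugacy: posterior is again a Beta
      posterior.mean()                       # ≈ 0.64 — the prior pulls the MLE 0.7 toward 0.5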

Zero → hero reading path

  1. foundation Deep Learning (Goodfellow, Bengio, Courville) — Ch 5 (ML basics), Ch 6 (deep nets), Ch 7 (regularization), Ch 8 (optimization)
  2. foundation Elements of Statistical Learning (Hastie, Tibshirani, Friedman) — Ch 2-7. The most-cited stats-ML reference.
  3. build Pattern Recognition and Machine Learning (Bishop) — Ch 3 (linear regression), Ch 4 (linear classification), Ch 9 (EM)
  4. depth ESL chapters on boosting (10) + SVM (12) for tree/kernel intuition that still shows up at interviews
  5. depth Mathematics for Machine Learning (Deisenroth) — free; brush up on linear algebra, calculus, probability if rusty
  6. depth Probabilistic Machine Learning (Murphy) — current canonical Bayesian-flavored ML textbook
  7. specialty CS 4780 (Cornell) lecture notes — concise, still excellent
  8. specialty Stanford CS 229 lecture notes — Ng's classic

Reading: papers and posts that go deep on each topic