FOUNDATIONS · ML THEORY

ML theory — the core fundamentals

The questions that distinguish "skilled practitioner" from "deep understanding." Bias-variance, why ReLU, L1 vs L2 geometry, MLE/MAP, calibration, generalization. Every Sr Staff loop has 2-3 of these.

Read: ~45 min · Asked at: Anthropic, OpenAI, DeepMind, Meta, Google · Difficulty: Sr Staff bar · 15 chapters + 80-question quiz
01
FOUNDATIONS · THE MENTAL MODEL

The bias-variance frame — the mental model under everything

TL;DR

Test error = irreducible noise + bias² + variance. Bias is "model class wrong on average," variance is "model jitters across datasets." Every regularizer, every ensemble, every architecture choice trades one against the other. Modern over-parameterized nets break the U-curve via double descent — SGD's implicit bias picks low-norm interpolators.

The decomposition — derive it on the whiteboard

Let y = f(x) + ε with ε ~ N(0, σ²) independent of D. Train predictor f̂(x; D) on dataset D. Expected squared error at point x:

E_D[(y − f̂(x; D))²]
  = E_D[(f(x) + ε − f̂(x; D))²]                                  # substitute
  = E[ε²] + E_D[(f(x) − f̂(x; D))²]                              # cross terms vanish (ε ⊥ D)
  = σ²
  + E_D[(f(x) − E_D[f̂(x; D)])²]                                 # = bias² (squared dist of mean prediction)
  + E_D[(f̂(x; D) − E_D[f̂(x; D)])²]                              # = variance
  ─────────────────────────────────
  = σ²        +    bias²(f̂; x)    +    var(f̂; x)
   irreducible      (model wrong on average)   (model unstable across datasets)
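
A quick way to make the three terms concrete — a minimal numpy sketch (the setup and names are mine, not from the text) that refits a fixed model class on many resampled datasets and estimates bias² and variance at one test point:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # true function
sigma, n, trials, degree = 0.3, 30, 500, 3     # noise, train size, dataset draws, model class
x_test = 0.5

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)         # y = f(x) + eps
    preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)   # f_hat(x_test; D)

bias_sq = (preds.mean() - f(x_test)) ** 2      # (E_D[f_hat] - f)²
variance = preds.var()                         # E_D[(f_hat - E_D[f_hat])²]
print(bias_sq, variance, sigma ** 2)           # the three terms of test MSE at x_test

Sweep degree up and down to watch bias² and variance trade against each other.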

What bias and variance buy you in practice

High bias (underfit)

  • Model class too simple
  • Train and val loss both bad and close
  • Fix: bigger model, richer features, less regularization

High variance (overfit)

  • Model too sensitive to specific sample
  • Train loss low, val loss high
  • Fix: regularize, more data, augment, ensemble, early stop
Irreducible noise — σ², the floor no model can beat. If your data has 10% label noise, no architecture will get past 90% accuracy.

Why bagging and boosting attack different terms

Bagging trains decorrelated models on bootstrap resamples and averages them — averaging shrinks the variance term while leaving bias essentially untouched. Boosting fits weak learners sequentially to the remaining error, so each stage chips away at bias; variance creeps up as the ensemble grows, which is why boosting needs shrinkage and early stopping.

Where deep nets break the textbook U-curve

Classical view: model capacity ↑ → bias ↓, variance ↑ → U-shaped test error.

THE INSIGHT

Double descent — the modern correction

Past the interpolation threshold (model can fit train set perfectly), test error rises near the threshold then falls again. Why: among the infinite interpolating solutions, SGD's implicit bias picks low-norm ones, which generalize. The U-curve is a small-model phenomenon; over-parameterized regime has a different shape entirely.

EXAMPLE — concrete walkthrough

A 2-layer MLP on MNIST with hidden width sweep. Width 32: train acc 95%, test 94% (low bias, low variance). Width 256: 100% train, 92% test (interpolation, peak variance). Width 4096: 100% train, 97% test (over-parameterized, variance fell). The classical "more capacity is worse" story stops at the interpolation threshold.

PITFALL
Conflating "bias of an estimator" (statistics — E[θ̂] − θ_true) with "bias term in the decomposition" (squared distance of mean prediction at a single x). Same word, different objects. Interviewers love this distinction.
REMEMBER
  • Test MSE = irreducible σ² + bias² + variance. Be ready to derive.
  • Bagging kills variance; boosting kills bias. Choose accordingly.
  • Double descent voids the U-curve in the over-parameterized regime — SGD's implicit bias does the regularization for free.
02
OPTIMIZATION · CORE

Optimizers from first principles — GD to Adam to Shampoo

TL;DR

Gradient descent is θ ← θ − η ∇L. On convex-smooth problems it has clean rates. On non-convex deep nets it has none — yet works. SGD generalizes better than full-batch because noise biases toward flat minima. Adam = diagonal preconditioner; AdamW fixed Adam's L2 bug; Shampoo / K-FAC approximate the full Hessian. Conditioning is the secret optimizer ingredient — that's why init, BN/LN, and feature scaling all matter.

The base rule and what convexity buys

For loss L(θ): θ ← θ − η ∇L(θ). Convergence rates depend entirely on the problem class:

Setting                        | Rate                    | Step size
Convex, β-smooth               | O(1/k) in objective     | η = 1/β
μ-strongly convex + β-smooth   | O((1 − μ/β)^k) linear   | η = 2/(μ+β)
Non-convex (deep nets)         | No general guarantee    | Empirical heuristics

The condition number κ = β/μ controls speed; ill-conditioned problems crawl.

Why SGD generalizes better than full-batch

Empirically (Keskar 2017): small-batch SGD finds flatter minima than large-batch, and flatter minima generalize. Two main theoretical lenses:

  • SDE view: SGD ≈ gradient flow plus anisotropic noise whose scale grows with LR/batch-size ratio; the noise kicks the iterate out of sharp basins, so it settles in wide, flat ones.
  • Implicit-regularization view: the expected SGD trajectory behaves like GD on the loss plus a penalty on gradient variance/sharpness — flat minima are robust to parameter perturbation, which links to PAC-Bayes generalization bounds.

EXAMPLE — when "scale LR linearly with batch" fails

You scale ResNet-50 from batch 256 to 8k by multiplying LR 32×. Past ~4k, no further speedup — you've crossed B_crit. Solutions: warmup the LR over the first 5 epochs (smooths Adam moment instability), use LARS/LAMB (per-layer adaptive LRs), or accept that batch beyond B_crit wastes compute.

Momentum, Nesterov, Newton, quasi-Newton

  • Momentum: accumulate a velocity v ← βv + g; damps oscillation across steep directions and accelerates along low-curvature ones.
  • Nesterov: evaluate the gradient at the look-ahead point θ − ηβv; slightly better constants on convex problems.
  • Newton: θ ← θ − H⁻¹∇L; quadratic local convergence, but O(d²) memory and an O(d³) solve — infeasible at deep-net scale.
  • Quasi-Newton (BFGS / L-BFGS): build curvature estimates from gradient differences; L-BFGS is the practical full-batch option.

Adam, AdamW, and the L2 bug

Adam maintains EMAs of grad and grad². Update:

m ← β₁m + (1−β₁) g          # 1st moment EMA
v ← β₂v + (1−β₂) g²         # 2nd moment EMA
m̂ = m / (1 − β₁^t)          # bias correction
v̂ = v / (1 − β₂^t)
θ ← θ − η · m̂ / (√v̂ + ε)

The bug: standard L2 in Adam adds λθ to the gradient, which then gets divided by √v̂ — so high-variance params receive less decay. AdamW (Loshchilov 2019) decouples: apply θ ← θ − η(m̂/√v̂ + λθ) instead. AdamW is the modern default for transformers.
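
In code, the fix is one line — a sketch of a single AdamW step (mirroring the pseudocode above, not any library's internals):

import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    # Moments see only the raw gradient — decay never passes through 1/sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)               # bias correction, as above
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)   # decoupled decay
    return theta, m, v

# Adam + L2 would instead fold lam * theta into grad *before* the moment updates,
# so the decay term gets rescaled by 1/sqrt(v_hat) — that's the bug.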

Why conditioning is the silent optimizer

The Hessian eigenvalue spread κ = λ_max / λ_min determines GD speed. In deep nets, κ can exceed 10⁶. Three tricks fight bad conditioning:

  1. Feature scaling / normalization: gives input dims similar variance → conditions the Hessian's input block.
  2. Weight init (He / Xavier): scales weights so activations and gradients stay comparable across layers — prevents conditioning collapse at depth.
  3. Batch / Layer norm: re-scales activations → smoother loss landscape (Santurkar 2018). The original "internal covariate shift" claim is largely wrong; the real mechanism is Lipschitz smoothing.
PITFALL
Forgetting bias correction in Adam. At step 1 with β₁=0.9, raw m is only 10% of its true value. Without (1 − β^t) correction, your first hundred updates are too small and training drifts.
REMEMBER
  • Convex + β-smooth: O(1/k). Strongly convex: linear. Non-convex deep nets: no guarantee but works.
  • SGD's noise is a feature, not a bug — it picks flat minima.
  • AdamW > Adam for transformers (decoupled weight decay).
  • Init, normalization, and feature scaling all do the same job: they fix Hessian conditioning.
03
ARCHITECTURE · ACTIVATION FUNCTIONS

Why ReLU? — the vanishing gradient story, in numbers

TL;DR

Pre-2010 deep nets used sigmoid/tanh. Both saturate; their max derivative is 0.25 (sigmoid) or 1.0 (tanh at z=0). Stack 20 sigmoid layers and the gradient at layer 1 is multiplied by 0.25²⁰ ≈ 10⁻¹². ReLU's derivative is 1 in the active region — no shrinking. That single fix unlocked deep nets. GELU/SwiGLU smooth the kink for transformers.

Sigmoid's vanishing gradient — the actual numbers

Sigmoid: σ(z) = 1/(1+e⁻ᶻ); σ'(z) = σ(z)(1−σ(z)). Maximum at z=0: σ'(0) = 0.25.

Backprop through L sigmoid layers multiplies L derivatives. Even at the maximum:

Depth L | Best-case gradient at layer 1
5       | 0.25⁵ ≈ 10⁻³
10      | 0.25¹⁰ ≈ 10⁻⁶
20      | 0.25²⁰ ≈ 10⁻¹²

Tanh is slightly better (tanh'(0) = 1.0) but still saturates outside a narrow band.
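
To see the shrinkage directly, multiply per-layer sigmoid derivatives along a depth-L chain — a minimal sketch (random pre-activations, so the realized factors sit below the 0.25 best case):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for L in (5, 10, 20):
    z = rng.normal(0, 1, L)                    # one pre-activation per layer
    factors = sigmoid(z) * (1 - sigmoid(z))    # sigma'(z) <= 0.25 everywhere
    print(L, np.prod(factors))                 # gradient scale reaching layer 1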

ReLU's properties — what changed

ReLU(z) = max(0, z): derivative exactly 1 for z > 0, 0 for z < 0. No saturation in the active half — gradients pass through unshrunk at any depth — and it's nearly free to compute. The cost is a new failure mode:

Dead ReLU — a unit whose pre-activation is always < 0, so its gradient is always 0 and it never recovers. Typically 10-40% of ReLUs end up dead post-training. Mitigations: Leaky ReLU (small negative slope), PReLU (learned slope), ELU, GELU, SiLU.

Why GELU / SwiGLU now in transformers

ReLU's hard cutoff loses info at z near 0. GELU is smooth: x · Φ(x) — small negative inputs still pass a small signal. SwiGLU (Shazeer 2020) adds gating with a Swish nonlinearity: SwiGLU(x) = Swish(x W₁) ⊗ (x W₂), where Swish(z) = z · σ(z). Empirically improves quality at fixed FLOPs.

None of this would scale without residual connections. Residuals give every layer an "identity path" so the gradient never has to multiply through the activation derivative — see chapter 14.

EXAMPLE — counting dead ReLUs in a real model

After training a 4-layer MLP on CIFAR-10 with ReLU, instrument the pre-activations on the val set. Typical result: 25-35% of units fire on <1% of inputs — those are dead. Replace with GELU and the dead-unit count falls to <5%, but throughput drops 5% (GELU is more expensive). For inference-bound deployments, ReLU still wins on Pareto.
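
A sketch of that instrumentation in PyTorch — model and val_loader are hypothetical stand-ins for your trained net and validation data:

import torch

fire_counts, n_seen = {}, 0

def make_hook(name):
    def hook(module, inp, out):
        fired = (out > 0).flatten(1).float().sum(0)      # per-unit fire count this batch
        fire_counts[name] = fire_counts.get(name, 0) + fired
    return hook

for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.ReLU):
        mod.register_forward_hook(make_hook(name))

with torch.no_grad():
    for x, _ in val_loader:
        model(x)
        n_seen += x.shape[0]

for name, fired in fire_counts.items():
    dead_frac = (fired / n_seen < 0.01).float().mean()   # fires on <1% of inputs
    print(f"{name}: {dead_frac.item():.1%} dead")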

REMEMBER
  • Sigmoid's max derivative is 0.25; stack 20 layers and gradient is 10⁻¹². That's the vanishing-gradient story.
  • ReLU's derivative is 1 in the active half — saturation gone.
  • Dead ReLU is a real failure mode (10-40% of units). GELU/SiLU/SwiGLU smooth the kink.
  • Activation choice matters less when you have residuals — they're the real reason 100+ layer nets train.
04
REGULARIZATION · GEOMETRY

L1 vs L2 — geometry, gradient, Bayes, sparsity

TL;DR

L2 is a Gaussian prior; constraint is a smooth ball; gradient pulls weights proportionally; never reaches 0. L1 is a Laplace prior; constraint is a diamond with axis-aligned corners; the corner-touching geometry zeros out coordinates → sparsity. Elastic net combines them. In deep learning, L2 (weight decay) is critical not for "preventing overfitting" but for controlling Lipschitz constant and interacting with batch norm.

L2 (ridge) — the round constraint

Penalty: L = L_0 + λ ||θ||₂² = L_0 + λ Σ θᵢ².

L1 (lasso) — the diamond constraint

Penalty: L = L_0 + λ ||θ||₁ = L_0 + λ Σ |θᵢ|.

Side by side

L2 (ridge)

  • Smooth penalty (differentiable everywhere)
  • Multiplicative shrinkage
  • Closed-form solution exists
  • Gaussian prior
  • Keeps correlated features (small but non-zero)
  • Use as default for prediction

L1 (lasso)

  • Non-smooth penalty (kink at 0)
  • Soft-thresholding: sign(θ) · max(|θ|−λ, 0)
  • Iterative algorithms only (ISTA, LARS)
  • Laplace prior
  • Picks one of correlated features (somewhat arbitrarily)
  • Use for feature selection / interpretability

Why L1 is sparse — visual + math

Visual: imagine the elliptical loss contour tightening toward its minimum. With L2 (round constraint) the first touch is generic. With L1 (diamond) the first touch is most often at a vertex — and vertices have all-but-one coordinate equal to zero.

Math: at a coordinate where |∂L_0/∂θᵢ| < λ, the L1 subgradient zeros that coordinate. L2 only shrinks (multiplicative); L1 thresholds (the soft-threshold operator sign(θ) · max(|θ|−λ, 0)).

EXAMPLE — soft thresholding in one update

Suppose the unregularized gradient step gives θ' = 0.03. With L2 at λ=0.5, the next iterate is θ' / (1 + 2λ) = 0.015 — shrunk but non-zero. With L1 at λ=0.05, soft-threshold gives sign(0.03) · max(0.03 − 0.05, 0) = 0. Coordinate killed.
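
The same arithmetic as code — a sketch of the two proximal (post-gradient) updates on a single coordinate:

import numpy as np

def prox_l2(theta, lam):    # ridge: multiplicative shrink, never exactly zero
    return theta / (1 + 2 * lam)

def prox_l1(theta, lam):    # lasso: soft-threshold, kills small coordinates
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

print(prox_l2(0.03, 0.5))   # 0.015
print(prox_l1(0.03, 0.05))  # 0.0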

Elastic net — the practical compromise

α L1 + (1−α) L2. Combines sparsity with L2's grouping (correlated features both retained or both dropped together). glmnet (Friedman, Hastie, Tibshirani 2010) is the canonical implementation.

Why weight decay works in deep learning (different reason)

In deep nets, weight decay (≈ L2) is critical for generalization, but for different reasons than classical statistics:

  • With batch/layer norm, the function is invariant to weight scale, so decay doesn't "shrink the function" — it shrinks the weight norm, which raises the effective learning rate and keeps optimization in a noisier regime.
  • It bounds the network's Lipschitz constant, which correlates with robustness and margin-based generalization.
  • It keeps logit scale in check, which interacts with calibration.

PITFALL
Saying "L1 is sparse, so use it for feature selection in deep nets." L1 on individual weights doesn't induce structured sparsity (whole-channel removal). For deep nets, use group-lasso, magnitude pruning, or movement pruning instead — naive L1 just makes weights tiny, not zero, and gives no inference speedup.
REMEMBER
  • L2 = Gaussian prior + smooth ball + multiplicative shrinkage. L1 = Laplace prior + diamond + soft-threshold sparsity.
  • L1 produces zeros because its constraint set has axis-aligned corners.
  • In deep learning weight decay's main job is Lipschitz control + interaction with BN/LN, not classical "shrink coefficients."
  • Use AdamW (decoupled), not Adam + L2.
05
PROBABILISTIC ML · UNIFICATION

The probabilistic view — every loss is a negative log-likelihood

TL;DR

MLE picks θ to maximize P(D|θ). Almost every loss in ML is the negative log-likelihood of some distributional assumption: Gaussian → MSE; Bernoulli → binary cross-entropy; Categorical → cross-entropy; Laplace → MAE; Poisson → Poisson loss. Adding a prior P(θ) turns MLE into MAP — and Gaussian prior = ridge, Laplace prior = lasso. Full Bayes integrates θ out.

MLE — the framework

Choose θ to maximize P(D|θ) = ∏ P(xᵢ|θ). Equivalently, minimize −log P(D|θ) = −Σ log P(xᵢ|θ). This is the loss.

Common losses are NLLs in disguise

Likelihood model                   | Negative log-likelihood = Loss                        | Used for
Gaussian noise: y ~ N(f(x), σ²)    | MSE: (1/2σ²) Σ (y − f(x))²                            | Regression
Bernoulli: y ~ Bern(σ(z))          | Binary cross-entropy: −Σ [y log p̂ + (1−y) log(1−p̂)]  | Binary classification
Categorical: y ~ Cat(softmax(z))   | Cross-entropy: −Σ y log p̂                             | Multi-class
Laplace: y ~ Laplace(f(x), b)      | MAE / L1 loss                                         | Robust regression
Poisson: y ~ Poisson(λ(x))         | Poisson loss: Σ (λ − y log λ)                         | Count regression (CTR with rare events)
Negative Binomial                  | NB loss                                               | Over-dispersed counts
THE INSIGHT

"Why MSE for regression?" — the deep answer

MSE is the negative log-likelihood under Gaussian noise. If you assume the noise is heavy-tailed (Laplace), use MAE. If you assume counts (Poisson), use Poisson loss. The loss is a noise-model choice. Picking a loss without thinking about the implicit noise model is what beginners do; staff engineers always know which distribution they're assuming.

Why cross-entropy beats MSE for classification

Three reasons: (1) CE is the actual NLL of the Bernoulli/Categorical model, so minimizing it is MLE; (2) softmax + CE gives the clean gradient p̂ − y, whereas MSE through a sigmoid multiplies in σ'(z) and saturates; (3) CE-trained classifiers tend to be better calibrated.

MAP — adding a prior

argmax P(θ|D) = argmax P(D|θ) P(θ) by Bayes. Adding the prior is adding a regularizer:

MLE

  • Maximize P(D|θ)
  • No prior
  • Asymptotically unbiased
  • Overfits at small N

MAP

  • Maximize P(D|θ) P(θ)
  • Gaussian prior → ridge
  • Laplace prior → lasso
  • Uniform prior → MAP = MLE

Full Bayes — integrating out θ

P(θ|D) ∝ P(D|θ) P(θ); predict by P(y|x, D) = ∫ P(y|x, θ) P(θ|D) dθ. Almost never tractable. Approximations: MCMC (HMC, NUTS), variational inference, Laplace approximation.

Conjugate priors — still asked

Likelihood                  | Conjugate prior
Bernoulli / Binomial        | Beta
Categorical / Multinomial   | Dirichlet
Gaussian (mean, known var)  | Gaussian
Gaussian (precision)        | Gamma
Poisson                     | Gamma
EXAMPLE — Beta-Bernoulli for CTR estimation

Click-through rate for a new ad, observed 2 clicks out of 30 impressions. Pure MLE: 2/30 = 6.7%. With Beta(1,1) prior (uniform), posterior is Beta(3, 29), posterior mean = 3/32 ≈ 9.4%. With Beta(10, 90) prior (informative — average ad is ~10% CTR), posterior is Beta(12, 118), mean = 12/130 ≈ 9.2%. The prior regularizes the small-sample estimate toward a sensible default.
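
The conjugate update is two additions — a tiny sketch:

def beta_posterior(alpha, beta, clicks, impressions):
    # Beta(alpha, beta) prior + Binomial likelihood -> Beta posterior (conjugacy)
    a, b = alpha + clicks, beta + (impressions - clicks)
    return a, b, a / (a + b)                  # posterior params, posterior mean

print(beta_posterior(1, 1, 2, 30))    # uniform prior     -> Beta(3, 29),  mean ≈ 0.094
print(beta_posterior(10, 90, 2, 30))  # informative prior -> Beta(12, 118), mean ≈ 0.092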

REMEMBER
  • Every loss = NLL of a distributional assumption. Pick the noise model first, the loss second.
  • CE beats MSE for classification: matches likelihood, clean gradient (p̂ − y), better calibration.
  • MAP = MLE + prior = MLE + regularization. Gaussian → ridge; Laplace → lasso.
  • Conjugate priors give closed-form posteriors. Beta/Bernoulli, Dirichlet/Cat, Gamma/Poisson.
06
INFORMATION THEORY · TOOLKIT

The information-theoretic toolkit — entropy, KL, MI, JS

TL;DR

Entropy = average bits needed. Cross-entropy = bits if you use the wrong code. KL = excess bits = H(p,q) − H(p). KL is asymmetric: KL(p||q) is mode-covering, KL(q||p) is mode-seeking — that asymmetry decides which one VAEs use vs EP. JS is symmetric and bounded. Mutual information measures shared information; InfoNCE estimates a lower bound and powers contrastive learning. Every distillation loss, every RLHF KL penalty, every VAE ELBO traces back to these.

Entropy and cross-entropy

Entropy — H(p) = −Σ p(x) log p(x). Average bits to encode a sample from p (if log = log₂). Maximum at uniform; 0 for deterministic.
Cross-entropy — H(p, q) = −Σ p(x) log q(x). Bits to encode p when using q's code. Always ≥ H(p).

KL divergence — the workhorse

KL(p || q) = Σ p(x) log(p(x)/q(x)) = H(p, q) − H(p). Asymmetric; not a metric. KL ≥ 0 (Gibbs); = 0 iff p = q. Used everywhere: VAE ELBO, RLHF KL penalty, knowledge distillation.

Why the asymmetry matters — mode-covering vs mode-seeking

KL(p || q) — "forward KL"

  • Weighted by p
  • Penalizes q being small where p is large
  • Mode-covering: q must cover all modes of p
  • Used in expectation propagation, MLE

KL(q || p) — "reverse KL"

  • Weighted by q
  • Penalizes q being large where p is small
  • Mode-seeking: q sticks to one mode of p
  • Used in VAE ELBO, variational inference, RLHF
EXAMPLE — fitting a unimodal q to a bimodal p

Say p is a 50/50 mixture of two Gaussians at ±5 and q is a single Gaussian. Minimizing KL(p||q) places q at mean 0 with huge variance (covers both modes, badly). Minimizing KL(q||p) places q on one of the modes (sharp fit on half the distribution). VAEs use reverse KL → posteriors collapse to single modes if not careful.
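
A numeric sketch of the asymmetry — brute-force fit a single Gaussian q to the bimodal p under each KL direction on a discretized support (illustrative only, not a VI algorithm):

import numpy as np
from scipy.stats import norm

x = np.linspace(-12, 12, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -5, 1) + 0.5 * norm.pdf(x, 5, 1)    # bimodal target

def kl(a, b):
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

best_f = best_r = (np.inf, None, None)
for mu in np.linspace(-8, 8, 81):
    for sd in np.linspace(0.5, 8.0, 76):
        q = norm.pdf(x, mu, sd)
        if kl(p, q) < best_f[0]: best_f = (kl(p, q), mu, sd)   # forward KL(p||q)
        if kl(q, p) < best_r[0]: best_r = (kl(q, p), mu, sd)   # reverse KL(q||p)

print("forward picks mu, sd =", best_f[1:])   # ≈ (0, ~5): covers both modes, badly
print("reverse picks mu, sd =", best_r[1:])   # ≈ (±5, ~1): locks onto one mode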

Mutual information and InfoNCE

I(X; Y) = KL(P(X,Y) || P(X)P(Y)) = H(X) − H(X|Y) = H(Y) − H(Y|X). Reduction in uncertainty about X given Y. InfoNCE (van den Oord 2018) is a tractable lower bound on I(X;Y); it's the loss behind SimCLR, CLIP, and most modern contrastive learning.

Jensen-Shannon divergence

JS(p, q) = ½ KL(p || m) + ½ KL(q || m) where m = ½(p+q). Symmetric; bounded in [0, log 2]; used in original GAN.

Why KL in RLHF?

The KL penalty r − β · KL(π || π_ref) prevents the policy from drifting too far from the reference (SFT model). Without it, RL mode-collapses onto reward-hacking patterns. β trades off reward maximization vs staying-close-to-reference.

PITFALL
Treating KL as a distance. KL(p||q) ≠ KL(q||p), triangle inequality fails, can be infinite if q has zero probability where p doesn't. Use JS or Wasserstein if you need a proper metric.
REMEMBER
  • Cross-entropy = entropy + KL. Minimizing CE = minimizing KL when H(p) is fixed.
  • Forward KL is mode-covering; reverse KL is mode-seeking. VAE / VI / RLHF all use reverse.
  • InfoNCE = tractable lower bound on MI; powers contrastive learning.
  • RLHF's KL penalty is what stops reward hacking.
07
LINEAR ALGEBRA · LOW-RANK

The linear-algebra lens — PCA, SVD, low-rank

TL;DR

SVD X = UΣVᵀ is the swiss-army knife. Columns of V are PCA's principal directions; Σ²/(n−1) are the variances. Truncated SVD gives the best rank-k approximation in Frobenius norm (Eckart-Young). A linear autoencoder with MSE recovers the PCA subspace. LoRA, low-rank adapters, embedding compression — all leverage low-rank structure that SVD makes visible.

PCA via SVD

Center X (subtract column means). SVD: X = U Σ Vᵀ. Then: the columns of V are the principal directions; the scores (projected coordinates) are XV = UΣ; and the variance along component i is σᵢ²/(n−1).

SVD identities to know cold

  • Pseudoinverse: X⁺ = V Σ⁺ Uᵀ (invert the nonzero singular values).
  • Norms: ||X||₂ = σ_max; ||X||_F² = Σ σᵢ².
  • Condition number: κ(X) = σ_max / σ_min.
  • rank(X) = number of nonzero singular values.
  • Eckart-Young: X_k = U_k Σ_k V_kᵀ is the best rank-k approximation in Frobenius (and spectral) norm.

Why PCA matters at interviews

EXAMPLE — LoRA as a low-rank update

LoRA replaces a fine-tune weight delta ΔW ∈ ℝ^{d×d} with BA where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×d}, r ≪ d. For Llama-2 70B (d = 8192) with r=8, that's 2r/d = 16/8192 ≈ 0.2% of the params. Works because the empirical fine-tuning delta has low effective rank — exactly the assumption SVD's truncation theorem gives you. If r = 64 captures 95% of the spectral energy of the true ΔW, the rank-64 LoRA is nearly lossless.
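
The claim is easy to test — a sketch that builds a low-effective-rank ΔW, truncates its SVD, and checks the spectral energy captured at each r (d shrunk from 8192 for speed):

import numpy as np

rng = np.random.default_rng(0)
d, true_r = 512, 16
delta_w = rng.normal(size=(d, true_r)) @ rng.normal(size=(true_r, d))   # low-rank signal
delta_w += 0.05 * rng.normal(size=(d, d))                               # small noise

U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
for r in (4, 16, 64):
    energy = (S[:r] ** 2).sum() / (S ** 2).sum()          # spectral energy captured
    approx = (U[:, :r] * S[:r]) @ Vt[:r]                  # Eckart-Young optimal rank-r
    rel_err = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)
    print(r, round(energy, 3), round(rel_err, 3))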

REMEMBER
  • SVD is the lens for everything low-rank: PCA, pseudoinverse, condition number, LoRA, distillation.
  • Eckart-Young: truncated SVD is the optimal low-rank approximation in Frobenius norm.
  • Linear AE with MSE = PCA. Non-linear AE = (much) more.
08
CLASSICAL ML · MAX-MARGIN

Kernels and SVM — the max-margin story (still asked)

TL;DR

SVM finds the hyperplane that maximizes the margin 2/||w||. Soft-margin adds slack ξ_i with penalty C. The dual depends on data only through inner products → swap inner product for a kernel K(x,x') = ⟨φ(x),φ(x')⟩ and you get a non-linear classifier without computing φ. Hinge loss is the SVM's loss in disguise. Logistic regression and linear SVM scale similarly; kernel SVM is O(N²) memory and dies past ~100k samples.

The kernel trick

Map data to high-dim feature space φ(x); train linear model there. Trick: any algorithm that depends on data only through inner products ⟨x, x'⟩ can be "kernelized" by replacing inner products with a kernel K(x, x') = ⟨φ(x), φ(x')⟩ — without ever computing φ explicitly.

Common kernels

  • Linear: K(x, x') = xᵀx'.
  • Polynomial: K(x, x') = (xᵀx' + c)^d — implicit degree-d feature interactions.
  • RBF / Gaussian: K(x, x') = exp(−γ ||x − x'||²) — infinite-dimensional feature space.
  • Structured-data kernels (string, graph) for non-vector inputs.

SVM derivation — be ready to whiteboard

Maximum-margin classifier. For separable data with labels y ∈ {−1, +1}:

min_{w, b} ½ ||w||²
subject to  y_i (wᵀx_i + b) ≥ 1  for all i

Why ½||w||²? Margin = 2/||w||. Maximize margin ↔ minimize ||w||.

Soft margin (real data is not separable):

min ½||w||² + C Σ ξ_i
s.t. y_i(wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

The Lagrangian dual is what gets kernelized. Dual variables α_i: w = Σ α_i y_i x_i. Only support vectors (α_i > 0) contribute. Replace x_iᵀ x_j with K(x_i, x_j) in the dual → kernel SVM.

Hinge loss — the SVM in loss form

SVM is equivalent to minimizing max(0, 1 − y · f(x)). Hinge gives 0 loss for correctly classified points outside the margin → only "marginal" examples drive learning. Compare to logistic loss log(1 + exp(−y · f(x))) which has nonzero gradient everywhere.

EXAMPLE — when the margin actually matters

For a near-linearly-separable dataset like spam vs ham TF-IDF features, linear SVM and logistic regression give nearly identical accuracy. Where SVM wins: when calibrated probabilities don't matter and you need the maximum-margin decision boundary for adversarial robustness. Where LR wins: when you need probabilities (CTR prediction, ranking), or when classes are not separable and gradient signal everywhere helps optimization.

PITFALL
Trying to scale kernel SVM to a 1M-sample dataset. The Gram matrix is N × N — at N=1M that's 8 TB of float64. Use linear SVM (LIBLINEAR, scikit-learn's LinearSVC) or random Fourier features (Rahimi-Recht) to approximate the kernel.
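
A sketch of the Rahimi-Recht construction for the RBF kernel — random features z(x) whose inner products approximate K(x, x') = exp(−γ ||x − x'||²), so a linear model on z(x) stands in for the kernel machine:

import numpy as np

rng = np.random.default_rng(0)

def rff(X, n_features=2000, gamma=0.5):
    # Sample spectral frequencies of the RBF kernel: w ~ N(0, 2*gamma*I)
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(5, 3))
Z = rff(X)
K_approx = Z @ Z.T
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))   # gamma = 0.5
print(np.abs(K_approx - K_exact).max())   # error shrinks as n_features grows
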
REMEMBER
  • Margin = 2/||w||. Maximize margin ↔ minimize ||w||² s.t. functional margin ≥ 1.
  • Kernel trick: replace inner product with K — get non-linear from a linear algorithm.
  • RBF kernel = infinite-dim features. Polynomial kernel = degree-d features.
  • Kernel SVM is O(N²) memory. Beyond ~100k samples, use linear SVM or random features.
09
CLASSICAL ML · TABULAR

Tree-based methods — GBM, XGBoost, LightGBM (still beats NN on tabular)

TL;DR

Trees do greedy axis-aligned splits. Random forest = bagged decorrelated trees → variance kill. Gradient boosting fits sequential trees to negative-gradient pseudo-residuals → bias kill. XGBoost adds Newton-style 2nd-order + regularization + missing-value defaults; LightGBM adds histogram bins + leaf-wise growth + GOSS/EFB for speed; CatBoost adds ordered boosting for categorical features. On tabular with weak signal-to-noise, GBMs still beat neural nets.

Decision trees

Greedy, recursive axis-aligned splits chosen to maximize impurity reduction (Gini / entropy for classification, variance for regression). Unrestricted depth → low bias, high variance; depth limits and pruning are the bias-variance dials.

Random forest (Breiman 2001)

Bagging + feature subsampling: each tree trains on a bootstrap sample, and each split considers a random subset of features. Both tricks decorrelate the trees, so averaging actually cuts variance (the cross-covariance terms in Var(Σ Xᵢ) shrink).

Gradient boosting

Sequentially fit weak learners to the residuals (or, generally, to the negative gradient of the loss). For squared loss, residual = y − ŷ. For other losses, fit pseudo-residuals = −∂L/∂ŷ.

XGBoost (Chen 2016)

Regularized objective with Newton-style 2nd-order; handles missing values via learned default direction; histogram-based splits.

LightGBM (Ke 2017)

Histogram + leaf-wise growth + GOSS (gradient-based one-side sampling) + EFB (exclusive feature bundling). Faster, less memory.

CatBoost

Ordered boosting (mitigates target leakage); native categorical features; usually the easiest to tune.

EXAMPLE — why GBM beats NN on Kaggle tabular

A typical tabular Kaggle dataset has 100-500 features with axis-aligned predictive structure (e.g., "days since last login > 30" matters; "log(income)" matters). Trees pick exactly those splits in O(features × n) per node. A NN must learn axis-alignment from scratch — typically needs 10× the data and careful feature engineering. NNs catch up at scale (TabPFN, FT-Transformer with 100M+ rows).

Common interview Qs on trees

PITFALL
Using one-hot encoding for high-cardinality categorical features in a GBM. Each one-hot column gets its own split candidate; with 10k categories, the trees can't find good splits. Use target encoding (with smoothing) or CatBoost's native handling instead.
REMEMBER
  • RF kills variance via decorrelated bagging. GBM kills bias via sequential residual fitting.
  • Pseudo-residuals = −∂L/∂ŷ generalize residual fitting to any differentiable loss.
  • XGBoost / LightGBM / CatBoost are still the tabular default — pick LightGBM for speed, CatBoost for cats, XGBoost for safe baselines.
  • NNs catch up to trees only at very large data scales.
10
LATENT VARIABLES · EM ↔ ELBO

The EM algorithm and the ELBO connection (EM → VAE)

TL;DR

EM alternates: E-step computes the posterior over latents q(z) = P(z|x, θ_t); M-step maximizes the expected complete-data log-likelihood. Each iteration monotonically increases the marginal likelihood. The same identity is the ELBO — VAEs replace the exact E-step (intractable) with an amortized inference network q_φ(z|x). EM is the unsupervised algorithm that quietly underlies GMMs, HMMs, LDA, and modern variational generative models.

EM in two steps

  • E-step: compute the posterior over latents under the current parameters, q(z) = P(z | x, θ_t).
  • M-step: θ_{t+1} = argmax_θ E_q[log P(x, z | θ)] — maximize the expected complete-data log-likelihood.

Guarantee: each iteration monotonically increases the marginal likelihood log P(x | θ). Converges to local optimum.

THE INSIGHT

EM = ELBO maximization (twice)

EM maximizes the ELBO L(q, θ) = E_q[log P(x, z | θ)] + H(q). The E-step makes the ELBO tight by setting q = posterior; the M-step maximizes the ELBO over θ. This is the same ELBO as in VAEs — a VAE generalizes by using a learned q_φ(z|x) instead of the exact posterior. Once you see this, every modern variational generative model (VAE, β-VAE, normalizing flow latents, diffusion ELBOs) becomes a variant of the same procedure.

EXAMPLE — EM for a 2-component GMM

Initial guess: two Gaussians with random means. E-step: for each point x_i compute γ_i = P(z=1 | x_i, θ) via Bayes. M-step: μ_1 = (Σ γ_i x_i)/(Σ γ_i), μ_2 = (Σ (1−γ_i) x_i)/(Σ (1−γ_i)); update variances and mixing weights similarly. Iterate until log-likelihood plateaus. Local optima abound — initialize from k-means.
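
Those two steps in ~15 lines of numpy — a sketch for 1-D data (synthetic here, so the answer is known):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

mu, sd, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibilities gamma_ik = P(z = k | x_i, theta) via Bayes
    lik = pi * norm.pdf(x[:, None], mu, sd)           # shape (n, 2)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted MLE
    nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(mu, sd, pi)   # ≈ (-2, 3), (1, 1), (0.3, 0.7), up to label swap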

REMEMBER
  • E-step = posterior. M-step = MLE on expected complete-data NLL.
  • ELBO = E_q[log P(x,z|θ)] + H(q). EM coordinate-ascends it.
  • VAE = EM with a learned q_φ instead of the exact posterior. Same identity, scalable.
11
CLASSICAL ML · UNIFICATION

GLMs — the framework that unifies linear, logistic, Poisson

TL;DR

GLMs unify linear regression, logistic regression, Poisson regression, and more under one recipe: pick an exponential-family distribution for y, a linear predictor η = xᵀβ, and a link function g(μ) = η. Canonical link guarantees convex loss and calibrated probabilities. The whole reason logistic regression is a "well-behaved" classifier is GLM theory.

The three components

  1. Random component: y has distribution from exponential family (Gaussian, Bernoulli, Poisson, Gamma, ...).
  2. Systematic component: linear predictor η = xᵀβ.
  3. Link function: g(μ) = η. Connects E[y] to η.
GLM                  | Distribution | Link
Linear regression    | Gaussian     | identity
Logistic regression  | Bernoulli    | logit: log(p/(1−p))
Poisson regression   | Poisson      | log
Multinomial logistic | Categorical  | softmax (multivariate logit)

Why this framework matters

One recipe answers many modeling questions: picking the distribution is picking the noise model (Gaussian → linear, Bernoulli → logistic, Poisson → counts), and the canonical link makes the NLL convex in β and the fitted probabilities calibrated in aggregate. When an interviewer asks "why is logistic regression well-behaved?", GLM theory is the answer.

EXAMPLE — picking the right GLM for CTR

You're modeling click counts per impression bucket. If you treat clicks/impressions as continuous and use linear regression, you'll predict negative values and lose efficiency. Use Poisson regression with offset: log(λ) = log(impressions) + xᵀβ, where the offset converts rate into count. Predicted λ is automatically non-negative and the MLE is convex.
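
A sketch of that model with statsmodels — the data here is synthetic so the fit can be checked against the true coefficients:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))
impressions = rng.integers(50, 5000, n)
true_beta = np.array([-3.0, 0.5, -0.2])
clicks = rng.poisson(np.exp(X @ true_beta) * impressions)   # rate × exposure

model = sm.GLM(clicks, X, family=sm.families.Poisson(),
               offset=np.log(impressions))    # log(lambda) = log(N) + X @ beta
print(model.fit().params)                     # ≈ (-3.0, 0.5, -0.2)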

REMEMBER
  • GLM = exponential-family distribution + linear predictor + link function.
  • Canonical link gives convex NLL and calibrated probabilities for free.
  • Logistic regression is a Bernoulli GLM with logit link. Its niceness comes from being canonical.
  • Poisson regression with log link is the right tool for counts and rates.
12
PROBABILITIES · WHEN THEY MATTER

Calibration — when probabilities matter

TL;DR

A model is calibrated if among samples with predicted probability p, the actual rate is p. Critical for recsys auctions, medical decisions, anything that uses raw probabilities. Modern over-parameterized NNs are systematically over-confident (Guo 2017). Temperature scaling — divide all logits by a single scalar T fit on a held-out set — fixes most of it without retraining and preserves argmax.

Measuring calibration

  • Reliability diagram: bin predictions by confidence and plot observed frequency vs predicted probability; a calibrated model sits on the diagonal.
  • ECE (expected calibration error): bin-weighted average of |accuracy − confidence|.
  • Brier score: mean squared error between predicted probabilities and 0/1 outcomes — a proper scoring rule.

Recalibration techniques

Parametric

  • Platt scaling: fit σ(a · logit + b). 2 params; works for monotonic miscalibration.
  • Beta calibration: parameterized monotonic, more flexible than Platt.
  • Temperature scaling: divide logits by T (1 param). Standard for NN classifiers; preserves argmax → no accuracy change.

Non-parametric

  • Isotonic regression: monotonic non-parametric mapping. Handles arbitrary curves; needs more data; overfits at small N.
  • Histogram binning: average within bin, replace with bin's empirical rate. Simple and weak.

Why neural nets are miscalibrated

(Guo 2017, "On Calibration of Modern Neural Networks") Modern over-parameterized NNs achieve very low cross-entropy but are overconfident. Hypotheses: capacity-driven overfit on confidence; BN; weight decay shifting output scale. Temperature scaling fixes most of it without retraining.

EXAMPLE — temperature scaling in 5 lines

Train your classifier normally. After training, freeze the model and fit a single scalar T by minimizing NLL on the validation set with logits replaced by logits / T. Typically T ≈ 1.5 for ResNet-50 on ImageNet. ECE drops from ~5% to ~1%. Argmax (and therefore accuracy) is unchanged because dividing logits by a positive constant preserves the maximum.
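
Roughly those five lines in PyTorch — val_logits and val_labels are hypothetical held-out tensors from your frozen model:

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    log_t = torch.zeros(1, requires_grad=True)     # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss
    opt.step(closure)
    return log_t.exp().item()   # divide test logits by this T; argmax unchanged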

PITFALL
Reporting accuracy as your single classifier metric for an auction or medical-decision system. Accuracy ignores how confident the wrong predictions were. A model that's 99% confident on its 5% errors will tank an expected-cost calculation. Always also report Brier score or ECE when probabilities are downstream.
REMEMBER
  • Calibrated = predicted probability matches observed rate within each bucket.
  • Modern NNs are overconfident. Temperature scaling fixes it for free.
  • Use ECE / Brier / reliability diagram, not just accuracy, when probabilities matter.
  • Isotonic for arbitrary shape + lots of data; Platt / temperature for small data + monotonic miscalibration.
13
LEARNING THEORY · GENERALIZATION

Generalization theory — VC, Rademacher, PAC-Bayes, NTK

TL;DR

Classical theory (PAC, VC) bounds test error by train error + a complexity term that scales with model capacity. For deep nets it's vacuous — VC ~ #params ≫ n. Rademacher (margin-based) and PAC-Bayes are tighter. NTK explains training dynamics in the infinite-width limit but doesn't capture feature learning. The honest summary: deep-learning generalization theory is still open. Practitioners rely on empirical heuristics + implicit-bias arguments.

PAC learning

A hypothesis class H is PAC-learnable if there's an algorithm such that for any ε, δ > 0, with prob ≥ 1−δ over samples, the learned h has error ≤ ε. Sample complexity bounded by VC dimension.

VC dimension

Largest set of points that H can shatter (label in all 2ⁿ ways). VC of linear classifiers in d-dim = d+1. Generalization bound: err_test ≤ err_train + O(√(VC/n)). Vacuous for over-parameterized NNs (VC ~ # params ≫ n).

Rademacher complexity

Tighter than VC for many cases. R_n(H) = E_σ [sup_{h ∈ H} (1/n) Σ σ_i h(x_i)] where σ_i are Rademacher (±1) random vars. Margin-based bounds (Bartlett) scale with weight norms / Lipschitz constants instead of param count → less vacuous for deep nets.

PAC-Bayes (used by Anthropic for SAEs etc.)

Bound expected loss over a posterior Q: E_Q[err_test] ≤ E_Q[err_train] + O(√(KL(Q || P) / n)). P = prior, Q = posterior. Tight when Q is close to P. Dziugaite & Roy 2017 gave non-vacuous PAC-Bayes bounds for MNIST nets — a milestone.

Implicit bias of SGD / NTK / μP

  • Implicit bias: on linearly separable data, gradient descent on logistic loss converges in direction to the max-margin solution (Soudry 2018); more broadly, SGD tends to select low-norm interpolators among the many that fit.
  • NTK: in the infinite-width limit, training linearizes and the net behaves as kernel regression with a fixed kernel (Jacot 2018) — clean theory, but it misses feature learning.
  • μP: a width-aware parameterization under which feature learning survives at large width and optimal hyperparameters transfer across model sizes (Yang & Hu).

EXAMPLE — VC bound is useless for ResNet

ResNet-50 has 25M params. VC dimension is at least linear in #params, so VC ≥ 25M. ImageNet has 1.2M training images. The bound err_test ≤ err_train + √(VC/n) = err_train + √(25M/1.2M) ≈ err_train + 4.6 — i.e., "test error is at most train error plus 460 percentage points." Useless. This is why margin-based and PAC-Bayes bounds are needed — they scale with norms and KL divergences, not parameter count.

REMEMBER
  • Classical VC bounds are vacuous for over-parameterized NNs.
  • Rademacher (margin / Lipschitz) and PAC-Bayes give non-vacuous bounds in some regimes.
  • NTK explains infinite-width training as kernel regression but misses feature learning.
  • Implicit bias of SGD selects low-norm interpolators — that's the modern best explanation of why deep learning generalizes.
14
DEEP LEARNING · "WHY DOES X WORK?"

The "why does X work?" probes — dropout, BN, augmentation, residuals, depth

TL;DR

Sr Staff favorites. Dropout = ensemble + Bayesian + co-adaptation prevention. BN's real job is loss-landscape smoothing, not "covariate shift." Augmentation = regularization + invariance encoding. Early stopping ≈ L2. Residuals give gradient an identity path → trainable to 100+ layers. Depth beats width because compositional features are exponentially more expressive at fixed param count.

Why does dropout work?

Three complementary stories: (1) ensemble — each mask is a sub-network, and inference approximates averaging exponentially many of them; (2) co-adaptation prevention — units can't rely on specific partners, so each feature must be independently useful; (3) Bayesian — dropout approximates variational inference (Gal & Ghahramani 2016). Largely switched off in modern LLM pretraining.

Why does batch norm work?

Original claim (Ioffe & Szegedy 2015): "internal covariate shift" reduction. Newer view (Santurkar 2018, arxiv 1805.11604): BN makes the loss landscape smoother (smaller Lipschitz constant of gradients), allowing larger LRs. The "covariate shift" claim is largely wrong.

Why does data augmentation work?

It encodes invariances the architecture doesn't — flips, crops, color jitter assert "the label is invariant to these transforms" — and it regularizes by expanding the effective dataset along those invariance directions (a form of noise injection).

Why does early stopping work?

Equivalent to L2 regularization in the linear/quadratic regime (Yao 2007). Stops training before the model fits noise; an implicit constraint on model complexity via the optimization trajectory.

Why does feature normalization help?

Conditioning. The Hessian's eigenvalues depend on input scales. Normalizing features → Hessian eigenvalues become more similar → larger LR works → faster convergence. Without normalization, GD takes tiny steps in the small-eigenvalue directions.

Why does deep beat shallow?

Composition. Layer k builds features of layer k−1's features, and there are function families a deep net represents with polynomially many units that a shallow net needs exponentially many to match (Telgarsky 2016). Per parameter, depth is exponentially more expressive — though at fixed compute, moderate depth + width wins because very deep nets are harder to optimize.

Why does the residual connection matter?

Without residuals, gradient through L layers is a product of L Jacobians → either vanishes or explodes. With residuals, gradient has an "identity path" — ∂(x + f(x))/∂x = I + ∂f/∂x — preserving signal. Lets you train 100+ layers (ResNet, transformers).

Why does softmax saturate / how to avoid?

For very large logits, softmax outputs ≈ 1.0 for the max, ≈ 0 for the rest → gradient through the rest vanishes. Fixes: temperature scaling, label smoothing, gradient clipping. In transformers, the √d_k attention scale is exactly to prevent this.

EXAMPLE — the √d_k trick in attention

For dot-product attention QKᵀ with d_k = 128, raw scores have variance ~128. Softmax of variance-128 logits is essentially one-hot — gradient through all but one position is ~0. Dividing by √d_k = √128 ≈ 11.3 normalizes the score variance back to ~1 → softmax is well-conditioned, gradients flow, training is stable. This is the difference between "transformers train" and "transformers diverge in 5 steps."
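
The effect in a few lines — a sketch comparing softmax sharpness with and without the scale:

import numpy as np

rng = np.random.default_rng(0)
d_k, n = 128, 64
q, K = rng.normal(size=d_k), rng.normal(size=(n, d_k))
scores = K @ q                                   # raw dot products, variance ≈ d_k

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for s, name in [(scores, "unscaled"), (scores / np.sqrt(d_k), "scaled")]:
    p = softmax(s)
    ent = float(-(p * np.log(p + 1e-12)).sum())  # near 0 bits when saturated
    print(name, "max prob:", round(float(p.max()), 3), "entropy:", round(ent, 3))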

REMEMBER
  • Dropout = ensemble + Bayesian + anti-co-adaptation. Off in modern LLM pretraining.
  • BN's real mechanism is loss smoothing, not "covariate shift." LN for sequences.
  • Early stopping ≈ L2. Both implicit constraints on complexity.
  • Residuals make gradient flow via identity path — without them, 100-layer nets don't train.
  • Depth gives exponentially richer functions per param than width — but at fixed compute, "moderate depth + width" wins.
  • The √d_k in attention exists to keep softmax out of saturation.
15
INTERVIEW SIGNAL · DEEP UNDERSTANDING

Tricky theory probes — what tells an interviewer you really get it

TL;DR

The questions below distinguish "knows the formulas" from "understands what's going on." Each one has a clean, deep answer that a Sr Staff candidate should give without hesitation. Skim the questions, sketch your answer, then check.

The 50 deep-understanding questions

  1. Q. "If your validation loss is much higher than train loss, but test accuracy is fine, what's going on?"

    Almost always the model is overconfident on misclassified validation examples — the cross-entropy loss penalizes confidence × wrongness, while accuracy is binary. The model has learned to be very confident on most examples and is occasionally very wrong; that drives loss up while accuracy stays good. Fix: temperature scaling for calibration if probability matters; otherwise it's not a real problem.

  2. Q. "Why isn't ReLU differentiable at 0, and why doesn't that break SGD?"

    ReLU has a kink at 0 — left derivative = 0, right derivative = 1. The probability of a stochastic-gradient update landing exactly at 0 is measure zero. In practice we use either the subgradient (any value in [0,1]) or set the derivative at 0 to 0 by convention. SGD's robustness to non-smoothness comes from finite-precision floating point + stochasticity.

  3. Q. "Why does adding L2 regularization improve a model that's not overfitting?"

    L2 doesn't only fight overfitting. It (1) improves Hessian conditioning (adds λI → eigenvalues bounded below by λ), (2) makes training more stable / faster convergence, (3) bounds Lipschitz constant → robustness, (4) for NNs interacting with batch norm, controls effective angular direction. So even at-capacity models benefit.

  4. Q. "What's wrong with using accuracy on imbalanced classes?"

    If 99% of samples are negative, a model predicting always-negative gets 99% accuracy with zero recall. Use precision/recall, F1, AUC-ROC (threshold-independent), AUC-PR (better for imbalance), Brier score. For ranking, NDCG / MAP. For decisions, expected cost / value.

  5. Q. "AUC-ROC vs AUC-PR — when each?"

    AUC-ROC is invariant to class balance. AUC-PR (precision-recall curve area) is more sensitive to positive-class performance under heavy imbalance — use it when positives are rare and you care about them (rare-event detection, fraud, ad click).

  6. Q. "Your test set AUC is great but the model fails in production. Why?"

    Most common: distribution shift. Train/test were i.i.d. from logged data; production has different traffic mix (covariate shift), different label distribution (prior shift), or selection bias (you only have labels on items the old model showed — exposure bias). Other: temporal leakage (future info in features), train/serve skew (different feature pipelines).

  7. Q. "What is the bias of an estimator?"

    The expected difference between the estimator and the true value: Bias = E[θ̂] − θ_true. Distinct from bias-variance decomposition's "bias" (which is squared at a single x). MLE is asymptotically unbiased under regularity conditions, but at finite n can be biased (e.g., MLE of Gaussian variance underestimates).

  8. Q. "Why does Adam converge to a worse generalization point than SGD on some vision tasks?"

Adam's per-coordinate adaptive LR can find sharper minima with worse generalization than SGD. Empirically observed (Wilson 2017, "The Marginal Value of Adaptive Gradient Methods"). LLMs / transformers go the other way (Adam beats SGD); the differences come down to noise structure + curvature shape per problem class. If you do use Adam, use AdamW with proper weight decay.

  9. Q. "Why does softmax + cross-entropy have a clean gradient?"

    Let p = softmax(z) and L = −Σ y log p. Then ∂L/∂z = p − y. The complex Jacobian of softmax cancels with the log inside CE. This is why CE is preferred over MSE for classification — gradient signal is simple and proportional to error.

  10. Q. "When is Naive Bayes optimal?"

    When features are conditionally independent given the class. In practice, NB often works well even when this assumption is violated, because for classification we only need the argmax of P(class | x) — not the calibrated probability. NB's biased probability estimates can still produce correct argmax.

  11. Q. "Why does the bias term in linear regression matter?"

    Without bias (intercept), the regression line is forced through the origin — generally a worse fit. With bias, the model fits an arbitrary affine function. In NNs with biased layers, removing biases ties the layer's "default" output to zero, which is rarely what you want — though some modern transformers (Llama) drop bias terms in linear layers because RMSNorm makes them redundant.

  12. Q. "Explain log-sum-exp trick."

    log Σ exp(x_i) overflows if any x_i is large. Subtract max: log Σ exp(x_i) = m + log Σ exp(x_i − m) where m = max(x_i). Now exp arguments are ≤ 0, can't overflow. Used in softmax, partition functions, and any "log of a sum of exponentials" computation.

  13. Q. "What's the curse of dimensionality?"

    In high-dim space, (1) volume concentrates on the surface of any region, (2) all pairwise distances become similar (concentration of measure), (3) the number of samples needed to cover the space grows exponentially, (4) most local methods (kNN, kernel density) fail. Why deep learning still works: data lies on a low-dim manifold within the high-dim space.

  14. Q. "Why is GBM more sensitive to LR than RF?"

    GBM's trees are sequential — each fits the residual after applying previous trees with shrinkage η. Small η = small steps, more trees needed, less overfit per step. RF trees are independent — no LR; aggregation reduces variance regardless. GBM's η + n_estimators control bias-variance trade.

  15. Q. "Explain regularization paths."

    The set of solutions w(λ) as the regularization strength λ varies from 0 to ∞. For lasso, the path is piecewise linear in λ (LARS algorithm). For ridge, smooth. Useful for hyperparam selection (CV over the path) and feature ranking (order in which lasso adds features).

  16. Q. "Why does the kernel trick let SVMs handle non-linear data?"

    Map data implicitly to a high-dim feature space φ(x); a linear separator in φ-space corresponds to a non-linear boundary in original space. The trick avoids explicit φ — only inner products are needed, and K(x,x') = ⟨φ(x), φ(x')⟩ is computed directly. RBF kernel corresponds to infinite-dim feature space.

  17. Q. "Difference between generative and discriminative models?"

    Generative models P(x, y) (or P(x|y) and P(y)). Examples: Naive Bayes, GMMs, HMMs, VAEs, diffusion. Can sample x. Discriminative models P(y | x) directly. Examples: logistic regression, SVM, neural classifiers. Generally better classification accuracy with enough data; generative wins with little data or when you need to generate. (Ng & Jordan 2002.)

  18. Q. "What does it mean for an estimator to be 'consistent'?"

    It converges in probability to the true parameter as n → ∞: P(|θ̂_n − θ| > ε) → 0. Distinct from unbiasedness — biased estimators can be consistent (bias goes to 0). MLE is consistent under regularity conditions; biased shrinkage estimators (like ridge) are inconsistent for the true coefficient but can have lower MSE.

  19. Q. "What is the bias of a sample variance? How do you fix it?"

    Sample variance with 1/n divisor is biased: E[s²] = (n−1)/n · σ². Bessel's correction divides by (n−1) instead of n, yielding unbiased s² = (1/(n−1)) Σ (x_i − x̄)². Intuition: x̄ is computed from the sample, so deviations from x̄ are slightly smaller than from true μ.

  20. Q. "Why don't we use accuracy as a loss function?"

Accuracy is non-differentiable (involves argmax). Gradient zero almost everywhere. Need a surrogate (cross-entropy, hinge) with useful (sub)gradients that shares its argmax with accuracy. Surrogate losses' relationship to the target metric is studied as "calibration" / "consistency" in the surrogate-loss literature.

  21. Q. "When is a problem 'convex'? Why do we care?"

    The objective is a convex function and the feasible set is convex. Implications: any local minimum is global; gradient descent converges to optimum; saddle points don't exist (in strict convex case); duality gap is zero. Linear/logistic regression with L2 is convex. Deep NNs are non-convex — much of optimization theory doesn't apply directly, hence the focus on empirical heuristics.

  22. Q. "What does it mean for two random variables to be 'uncorrelated'? Same as 'independent'?"

Uncorrelated: Cov(X, Y) = 0, equivalently E[XY] = E[X] E[Y]. Independent: P(X, Y) = P(X) P(Y) (full distribution factors). Independence implies uncorrelated; the converse holds only for jointly Gaussian X, Y. Classic counterexample: X ~ Uniform(−1, 1), Y = X². Cov(X, Y) = 0 but X and Y are perfectly dependent.

  23. Q. "What is the variance of a sum of correlated random variables?"

    Var(Σ X_i) = Σ Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j). For uncorrelated, the cross-covariance terms vanish. This is why ensembling (random forest) reduces variance more when base learners are decorrelated.

  24. Q. "What's the difference between L1 and L2 from a gradient perspective?"

    L2 gradient: 2λθ — proportional to weight; pulls weights uniformly toward zero, never reaching it. L1 (sub)gradient: λ · sign(θ) — constant pull regardless of magnitude; small weights get pushed to exactly 0 by the soft-threshold operator sign(θ) max(|θ| − λ, 0). That's why L1 induces sparsity.

  25. Q. "What's the role of the Hessian in second-order optimization?"

    Newton's update: θ ← θ − H⁻¹ ∇L. H⁻¹ rescales gradient by the inverse local curvature → quadratic convergence near optimum. Issue: H is N×N (infeasible for N≫1000). Approximations: BFGS (rank-1 updates from gradient history), Adam (diagonal preconditioner using grad²), Shampoo (Kronecker-factored), K-FAC (block-diagonal Fisher).

  26. Q. "Explain Jensen's inequality and one ML application."

    For convex f: f(E[X]) ≤ E[f(X)]. Concave: reverse. ML application — ELBO derivation: log P(x) = log ∫ P(x, z) dz = log ∫ q(z) P(x,z)/q(z) dz ≥ ∫ q(z) log P(x,z)/q(z) dz (by Jensen on log, which is concave). The RHS is the ELBO. This is how the VAE's variational bound is derived.

  27. Q. "Why is cross-entropy non-negative?"

    For probability distributions p (true) and q (predicted): H(p, q) ≥ H(p) ≥ 0. The first inequality is Gibbs' inequality (equivalent to KL ≥ 0). The second is because log p ≤ 0 for p ∈ [0,1]. Cross-entropy = 0 only when p is deterministic AND q matches.

  28. Q. "Why is KL divergence asymmetric?"

    KL(p||q) = Σ p log(p/q) uses p as the weighting. KL(q||p) uses q. They penalize different regions: KL(p||q) heavily penalizes q being small where p is large (covers all modes of p — "mode-covering"); KL(q||p) heavily penalizes q being large where p is small (q sticks to one mode — "mode-seeking"). Choice matters in VI: ELBO uses KL(q||p) → mode-seeking; EP uses KL(p||q) → mode-covering.

  29. Q. "What is the difference between batch norm at train vs eval?"

    Train: normalize using batch statistics (mean/var of current mini-batch). Eval: normalize using running averages accumulated during training. Differences cause subtle bugs: a model in train mode at inference will use one-batch stats and behave differently. Always call model.eval() for inference.

  30. Q. "Why doesn't increasing model capacity always help in deep learning?"

    Classical view: bigger capacity → overfit. But for over-parameterized NNs, double descent shows test error rises near interpolation threshold, then falls. Practical issues: (1) data bottleneck — bigger model with same data doesn't help past a point (Chinchilla scaling); (2) optimization gets harder; (3) inference cost dominates total cost. The right framing is "compute optimal" — fix budget, find best model size + data size jointly.

  31. Q. "What's the cost of training a logistic regression vs an SVM at scale?"

    Logistic regression: convex, smooth — SGD scales linearly with N samples. SVM: convex, non-smooth (hinge) — solved as QP via SMO or via SGD on the hinge loss. Linear SVM and logistic at scale are essentially equivalent in cost; kernel SVM is O(N²) memory for the Gram matrix → infeasible past ~100k samples. Modern recsys/CTR pipelines use logistic regression (or its NN generalization) for this reason.

  32. Q. "Why are random forests said to be 'almost impossible to overfit'?"

    Adding more trees doesn't increase model variance — bagging averages out variance. But each tree can overfit individually if grown unrestricted; the variance of the ensemble is tamed by averaging. RF still can overfit if (a) too few trees, (b) trees too deep relative to data, (c) too many irrelevant features at each split. "Hard to overfit" is a slight overstatement.

  33. Q. "What's the relationship between sigmoid and softmax?"

    Softmax with 2 classes is equivalent to sigmoid: softmax([z, 0]) = [σ(z), 1−σ(z)]. Sigmoid is the binary special case. For binary classification, sigmoid is more efficient (one logit instead of two with the constraint they sum to 1).

  34. Q. "Why does feature engineering still matter when we have deep learning?"

    For tabular data with weak signal-to-noise, feature engineering (interactions, log transforms, target encoding, time-aggregations) outperforms NNs at modest data sizes. NNs need lots of data to learn what good features are; trees + good features beat NNs + raw data on most tabular tasks. Even at scale, domain-specific feature engineering speeds learning and improves robustness.

  35. Q. "What's the difference between standardization, normalization, and whitening?"

    Standardization: zero mean, unit variance per feature. Min-max normalization: rescale to [0,1] (or [−1,1]). Whitening: full covariance is identity (decorrelates features). PCA whitening uses eigenvectors; ZCA whitening preserves the original feature space orientation. NN inputs typically use standardization; whitening is rare for NNs because BN partly accomplishes it inside the network.

  36. Q. "What's wrong with using R² for non-linear models?"

    R² = 1 − SS_res/SS_tot. For linear regression with intercept on training data, R² ∈ [0, 1] and equals squared Pearson correlation between y and ŷ. For non-linear / no-intercept / out-of-sample, R² can be negative (model worse than mean predictor). Better to report MSE/MAE plus context.

  37. Q. "What does 'data leakage' mean? Common sources?"

    Train data contains information about the label or test set that wouldn't be available at prediction time. Sources: (1) target included in features (oops); (2) data preprocessing fit on full dataset before split (e.g., scaling, target encoding); (3) future info in time-series features; (4) duplicates between train and test; (5) selection bias (only labeled examples are easy ones); (6) for recsys: training on logs that the model itself created.

  38. Q. "Explain stratified vs random K-fold cross-validation."

    Stratified: each fold has the same class distribution as the full dataset. Critical for imbalanced classification — random folds can leave a fold with 0 positive examples. For regression, can stratify on binned target. For time-series, neither — use time-series CV (forward-chaining splits).

  39. Q. "Why might precision and recall not be enough?"

    P/R fix a single threshold. AUC summarizes across thresholds. But neither captures: (1) cost asymmetry (false positive cost ≠ false negative cost — use expected cost), (2) ranking quality beyond binary (NDCG), (3) calibration of probabilities, (4) per-segment performance (slicing matters). For recommendation, P@K is misleading without considering position bias.

  40. Q. "What is a 'sufficient statistic'?"

    A function T(X) of the data such that the conditional distribution of X given T(X) doesn't depend on the parameter θ. All info about θ in the data is captured by T(X). For Gaussian with known variance, sample mean is sufficient for μ. Comes up in exponential family / GLM theory.

  41. Q. "What's the connection between linear regression and the projection matrix?"

    OLS solution β̂ = (XᵀX)⁻¹ Xᵀy. Predictions ŷ = X β̂ = X (XᵀX)⁻¹ Xᵀ y = H y where H = X(XᵀX)⁻¹Xᵀ is the "hat" matrix — the projection onto the column space of X. So OLS projects y onto the column space of X. I − H projects onto the residual space. Sum of leverages = trace(H) = rank(X) = effective parameter count.

  42. Q. "What's the multiple-comparisons problem in A/B testing?"

    Running many tests at α = 0.05, you'll get false positives at rate ~5% per test → with 20 tests, ~63% chance of at least one false positive. Corrections: Bonferroni (divide α by # tests, conservative), Benjamini-Hochberg (controls false discovery rate, less conservative), sequential testing methods. Always pre-register your primary metric and a small set of guardrails.

  43. Q. "What does it mean for a model to be 'identifiable'?"

    Different parameter values yield different distributions: P(x | θ_1) ≠ P(x | θ_2) implies θ_1 ≠ θ_2. Without identifiability, you can't uniquely recover θ from data even with infinite samples. Mixture models are non-identifiable up to label permutation. Deep nets are non-identifiable in the strict sense (many weight configs produce same function); in practice we don't care about θ, only the function.

  44. Q. "Why does adding a hidden layer to a neural net make it 'universal'?"

    Universal approximation theorem (Cybenko 1989, Hornik 1991): a feedforward NN with one hidden layer of arbitrary width and a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy. Theoretical result; practically the required width is often exponential. Why depth helps: depth gives exponentially more efficient representation than width for many functions (Telgarsky 2016).

  45. Q. "What's the empirical risk vs the true risk?"

    True risk: R(f) = E_{(x,y)~P}[L(f(x), y)]. What you actually want — expected loss on the data distribution. Empirical risk: R̂(f) = (1/n) Σ L(f(x_i), y_i). What you compute on the training set. ERM minimizes R̂; learning theory studies how close R̂ is to R (generalization gap).

  46. Q. "What's the difference between online learning and online inference?"

    Online learning: model updates from each example as it arrives (streaming SGD). Online inference: model serves predictions in real time (latency budget, no retraining). They're orthogonal — most production systems do online inference + batch retraining; full online learning is rare due to stability concerns.

  47. Q. "What is the curse of binarization for ranking metrics?"

    Binary metrics (precision@K, recall@K) treat all relevant items as equal. NDCG / graded relevance accounts for varying degrees of relevance. For ranking, the difference between rank-1 and rank-2 of a great item matters more than rank-99 vs rank-100 — discount functions (1/log₂(rank+1) for NDCG) capture this; binary metrics don't.

  48. Q. "Why doesn't increasing regularization always reduce overfitting?"

    Past a point, regularization underfits — increases bias more than it reduces variance. The bias-variance trade-off has a sweet spot; CV finds it. Also, the form of regularization matters — L2 may not produce the right inductive bias for sparse / structured data, where L1 / group-lasso would.

REMEMBER
  • If you can answer all 50 above without hesitation, you're at the Sr Staff theory bar.
  • The next section has 30+ more questions covering classical ML, optimization, and deep learning probes.

Extended quiz — 30+ more theory questions

Cover all of these and you've passed any "ML theory probe" round.

  1. Q. "What is overfitting?"

    Model fits training data better than the underlying distribution → poor generalization. Symptoms: large train-test gap. Causes: high model capacity, low data, noisy labels, leakage. Cures: regularization, more data, augmentation, early stopping, ensembling, simpler model.

  2. Q. "Why does dropout reduce overfitting?"

    Stochastic ensemble interpretation (each mask = sub-network); co-adaptation prevention; Bayesian variational interpretation (Gal & Ghahramani 2016). Equivalent to a noise injection regularizer.

  3. Q. "What is the difference between parameters and hyperparameters?"

    Parameters: learned from data via optimization (weights, biases). Hyperparameters: set before learning, control the learning process (LR, batch size, regularization strength, architecture). Hyperparameter optimization: grid search, random search, Bayesian optimization (Optuna, Hyperopt), Hyperband, BOHB.

  4. Q. "What is gradient clipping and when do you use it?"

    Cap gradient norm at a threshold C: if ||g|| > C, scale g by C/||g||. Used to prevent exploding gradients in RNNs, transformers (rare with proper init + LR), RL. Modern LLM pretraining uses gradient clipping at 1.0 routinely; it's a cheap insurance against loss spikes.
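    A minimal global-norm clipping sketch in numpy (helper name is ours); torch.nn.utils.clip_grad_norm_ does the same in-place:

      import numpy as np

      def clip_by_global_norm(grads, max_norm):
          # Rescale a list of gradient arrays so their joint L2 norm is <= max_norm
          total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
          scale = min(1.0, max_norm / (total + 1e-6))
          return [g * scale for g in grads], total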

  5. Q. "What's a vanishing/exploding gradient and what causes them?"

    Backprop through L layers multiplies L Jacobians. If their product norm is << 1 → vanishing (early-layer weights don't update). If >> 1 → exploding (NaN losses). Causes: bad init (too small / too large), sigmoid/tanh saturation, no normalization, no residual connections. Fixes: He/Xavier init, ReLU/GELU, BN/LN, residuals, gradient clipping.
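    A toy linear-net demo of the effect, assuming nothing beyond numpy (init gain is the knob):

      import numpy as np

      rng = np.random.default_rng(0)
      depth, width = 50, 64
      for gain in (0.5, 1.0, 2.0):              # scale relative to 1/sqrt(fan-in) init
          h = rng.normal(size=width)
          for _ in range(depth):
              W = gain * rng.normal(size=(width, width)) / np.sqrt(width)
              h = W @ h
          print(gain, np.linalg.norm(h))        # 0.5 → ~1e-15 (vanish), 2.0 → ~1e+15 (explode)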

  6. Q. "What's a 'representation' in deep learning?"

    An intermediate layer's activation, viewed as a feature vector for the input. Good representations: linearly separable for downstream tasks; capture invariances; transferable. Self-supervised pretraining (CLIP, SimCLR, MAE, masked LM) explicitly trains for good representations without labels.

  7. Q. "What is contrastive learning?"

    Learn representations by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart in embedding space. InfoNCE loss is the dominant form. Examples: SimCLR (image augmentations as positives), CLIP (image-caption pairs), two-tower retrieval (user-item interactions).
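    A minimal in-batch InfoNCE sketch in numpy (function name is ours):

      import numpy as np

      def info_nce(z1, z2, tau=0.1):
          # z1, z2: (N, d) embeddings of N positive pairs; other rows act as negatives
          z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
          z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
          logits = z1 @ z2.T / tau                             # (N, N); diagonal = positives
          logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
          log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
          return float(-np.mean(np.diag(log_p)))               # cross-entropy toward the diagonal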

  8. Q. "Why is temperature in softmax / contrastive learning important?"

    Temperature τ scales logits before softmax. Low τ → sharper distribution (focuses on top items). High τ → flatter. In contrastive learning, low τ amplifies the gradient on hard negatives but can over-fit; high τ underweights hard examples. CLIP uses learned τ.
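    A quick demo of the sharpening effect:

      import numpy as np

      def softmax(z, tau):
          z = np.asarray(z) / tau
          e = np.exp(z - z.max())
          return e / e.sum()

      softmax([2.0, 1.0, 0.5], tau=0.1)   # ≈ one-hot on the top logit
      softmax([2.0, 1.0, 0.5], tau=10.0)  # ≈ uniform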

  9. Q. "What's the difference between SGD and mini-batch SGD?"

    SGD: gradient on one example. Mini-batch SGD: gradient on a small batch (32-1024 typical). Mini-batch is the standard — vectorized compute, less noisy, can use BN. "SGD" in modern DL almost always means mini-batch SGD.

  10. Q. "What is gradient checkpointing?"

    Trade compute for memory: save only some activations during the forward pass; recompute the rest during the backward pass. Full recomputation costs roughly one extra forward pass (~33% more total FLOPs when backward ≈ 2× forward) and can cut activation memory by an order of magnitude. Critical for training large models on limited GPU memory.
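    A minimal PyTorch sketch, assuming a recent version that exposes the use_reentrant keyword:

      import torch
      from torch.utils.checkpoint import checkpoint

      block = torch.nn.Sequential(
          torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512))
      x = torch.randn(8, 512, requires_grad=True)
      y = checkpoint(block, x, use_reentrant=False)  # block's activations recomputed in backward
      y.sum().backward()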

  11. Q. "Explain teacher forcing vs free running in seq2seq training."

    Teacher forcing: at each decode step, feed the ground-truth previous token (not the model's own output). Stable, fast training. Issue: exposure bias — at inference the model sees its own (potentially wrong) outputs. Free running: feed model's own output. Slower training, can be unstable. Modern fixes: scheduled sampling, distillation, RL fine-tuning.

  12. Q. "What is label smoothing and when does it help?"

    Replace one-hot target with (1 − ε) · onehot + ε / K. The model is no longer rewarded for assigning probability 1 to the correct class → less overconfident, better calibration. Used in original Transformer, Inception nets. Less common in LLM pretraining where calibration isn't a target metric.
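    A minimal target-construction sketch (torch.nn.CrossEntropyLoss exposes the same thing via its label_smoothing argument; helper name is ours):

      import numpy as np

      def smooth_targets(labels, num_classes, eps=0.1):
          onehot = np.eye(num_classes)[labels]
          return (1.0 - eps) * onehot + eps / num_classes  # rows still sum to 1

      smooth_targets(np.array([2]), num_classes=4)  # [[0.025, 0.025, 0.925, 0.025]]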

  13. Q. "What is mixup and why might it work?"

    Train on convex combinations: x̃ = λ x_i + (1 − λ) x_j, ỹ = λ y_i + (1 − λ) y_j, with λ ~ Beta(α, α). Encourages linear behavior between training points. Improves generalization, calibration, robustness. Mostly a vision technique — it doesn't apply directly to discrete tokens, though embedding-space variants exist.
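    A minimal batch-level sketch in numpy (function name is ours):

      import numpy as np

      def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng(0)):
          lam = rng.beta(alpha, alpha)        # mixing coefficient λ ~ Beta(α, α)
          perm = rng.permutation(len(x))
          return (lam * x + (1 - lam) * x[perm],
                  lam * y_onehot + (1 - lam) * y_onehot[perm])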

  14. Q. "What is the manifold hypothesis?"

    Real-world high-dim data (images, text) lies on a low-dimensional manifold within the ambient high-dim space. Implication: ML can succeed despite curse of dimensionality. Most data points are unreachable (gibberish images, random text); the meaningful subset has low intrinsic dimension. Generative models (diffusion, VAE) implicitly learn this manifold.

  15. Q. "Why does the LR schedule matter?"

    Early training: high LR for fast progress, but instability. Late training: low LR for fine convergence. Warmup smooths early instability (large grad magnitudes when Adam moments are noisy). Decay (cosine, WSD, linear) reduces LR over time → finer steps near optimum. Empirical: schedule choice affects final loss by 5-15% with the same total compute.
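    A minimal warmup-plus-cosine schedule, one common choice among many (function name is ours):

      import math

      def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
          # Linear warmup to peak_lr, then cosine decay to min_lr
          if step < warmup_steps:
              return peak_lr * (step + 1) / warmup_steps
          t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
          return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))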

  16. Q. "How do you handle class imbalance?"

    (1) Resampling: oversample minority (SMOTE), undersample majority. (2) Class weights in loss. (3) Focal loss (down-weights easy examples). (4) Threshold tuning at inference. (5) Get more minority data. (6) Different loss / metric (PR-AUC instead of accuracy). For very rare events, anomaly detection framing.
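    A minimal binary focal-loss sketch in numpy, following the Lin et al. 2017 form (helper name is ours):

      import numpy as np

      def focal_loss(p, y, gamma=2.0, alpha=0.25):
          # p = predicted P(y=1); y in {0, 1}; gamma down-weights easy examples
          p_t = np.where(y == 1, p, 1.0 - p)           # prob assigned to the true class
          a_t = np.where(y == 1, alpha, 1.0 - alpha)   # class weighting
          return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))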

  17. Q. "What is the bias of cross-validation as an estimator of test error?"

    Each K-fold CV model trains on a (1 − 1/K) fraction of the data — slightly less than the final model → slightly pessimistic bias. LOOCV (K=N) is nearly unbiased but high-variance. K=5 or 10 is the standard compromise.

  18. Q. "Why might LASSO hurt prediction accuracy compared to ridge?"

    LASSO's sparsity comes at a cost: it picks one feature among correlated ones (somewhat arbitrarily). If both correlated features are predictive, ridge keeps both (small but non-zero coefficients) → better prediction. Elastic net combines both: shrinkage AND selection.

  19. Q. "Explain the concept of 'leakage' in time series."

    Future information appears in features available at prediction time. Examples: rolling statistics computed across the whole dataset (uses future); target aggregates by entity that include the current target; preprocessing fit on full data. Time-series CV (forward-chaining, expanding window) prevents this; "as-of" feature stores enforce point-in-time correctness.
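    A minimal expanding-window splitter (sklearn's TimeSeriesSplit implements the same idea; names here are ours):

      import numpy as np

      def expanding_window_splits(n, n_folds, min_train):
          # Chronological train/val splits; the training window only ever grows
          fold = (n - min_train) // n_folds
          for k in range(n_folds):
              end = min_train + k * fold
              yield np.arange(0, end), np.arange(end, min(end + fold, n))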

  20. Q. "What's the difference between ROC and PR curves?"

    ROC: TPR (recall) vs FPR. Invariant to class balance. PR: precision vs recall. Sensitive to class balance — for rare positives, AUC-PR is more informative than AUC-ROC. PR baseline for random classifier = positive class rate (not 0.5).

  21. Q. "Why do we use stochastic gradient descent instead of full-batch?"

    Three reasons: (1) computational — full-batch gradients are expensive at scale; (2) statistical — gradient noise acts as a regularizer (tends to find flatter minima); (3) memory — the full batch doesn't fit. The "right" batch size is often the largest that doesn't sacrifice generalization, guided by critical-batch-size analysis (McCandlish et al. 2018).

  22. Q. "What's different about training time-series vs i.i.d. data?"

    (1) Sampling order matters — can't shuffle. (2) Train/val/test must be split chronologically. (3) Cross-validation: forward-chaining or expanding window. (4) Stationarity assumption — distribution can shift over time, requires periodic retraining. (5) Lag/lead features need careful construction. (6) Target leakage from rolling stats.

  23. Q. "Explain Maximum Mean Discrepancy (MMD)."

    A kernel-based distance between two distributions: MMD²(P, Q) = E_{x,x'~P}[k(x,x')] − 2 E_{x~P, y~Q}[k(x,y)] + E_{y,y'~Q}[k(y,y')]. Used for two-sample testing, generative model evaluation, distribution shift detection. With a characteristic kernel (RBF), MMD = 0 iff P = Q.
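    A minimal RBF-kernel MMD² estimator in numpy — the biased V-statistic form (names are ours):

      import numpy as np

      def rbf_kernel(x, y, sigma=1.0):
          d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2.0 * sigma ** 2))

      def mmd2(x, y, sigma=1.0):
          return float(rbf_kernel(x, x, sigma).mean()
                       - 2.0 * rbf_kernel(x, y, sigma).mean()
                       + rbf_kernel(y, y, sigma).mean())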

  24. Q. "Why is the Gaussian distribution so prevalent in ML?"

    (1) Central Limit Theorem: sum of independent random variables tends to Gaussian. (2) Maximum entropy distribution given mean and variance constraints. (3) Closed-form posterior in many conjugate models. (4) MLE under Gaussian noise = MSE, the most-used loss. (5) Computationally tractable (multivariate Gaussian closed under marginal/conditional/linear transform).

  25. Q. "Why does Bayes' rule matter?"

    P(θ | D) = P(D | θ) P(θ) / P(D). Posterior ∝ likelihood × prior. Foundation of Bayesian ML, MAP estimation, the naive Bayes classifier, and every probabilistic interpretation in deep learning. The denominator P(D) = ∫ P(D | θ) P(θ) dθ is the marginal likelihood (evidence); intractable in general → motivates MCMC, VI, ABC.
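    A minimal conjugate example, assuming scipy — Beta prior on a coin's bias, binomial likelihood:

      from scipy.stats import beta

      a, b, k, n = 2, 2, 7, 10               # Beta(2, 2) prior; observe 7 heads in 10 flips
      posterior = beta(a + k, b + (n - k))   # conjugacy: posterior is again a Beta
      posterior.mean()                       # ≈ 0.64 — the prior pulls the MLE 0.7 toward 0.5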

Zero → hero reading path

  1. foundation Deep Learning (Goodfellow, Bengio, Courville) — Ch 5 (ML basics), Ch 6 (deep nets), Ch 7 (regularization), Ch 8 (optimization)
  2. foundation Elements of Statistical Learning (Hastie, Tibshirani, Friedman) — Ch 2-7. The most-cited stats-ML reference.
  3. build Pattern Recognition and Machine Learning (Bishop) — Ch 3 (linear regression), Ch 4 (linear classification), Ch 9 (EM)
  4. depth ESL chapters on boosting (10) + SVM (12) for tree/kernel intuition that still shows up at interviews
  5. depth Mathematics for Machine Learning (Deisenroth) — free; brush up on linear algebra, calculus, probability if rusty
  6. depth Probabilistic Machine Learning (Murphy) — current canonical Bayesian-flavored ML textbook
  7. specialty CS 4780 (Cornell) lecture notes — concise, still excellent
  8. specialty Stanford CS 229 lecture notes — Ng's classic

Reading: papers and posts that go deep on each topic