SUMMARY NOTES · DEEP LEARNING · PRACTICAL-FIRST

Deep learning — summary notes

A practitioner's crib-sheet: when to reach for ReLU vs GELU, BatchNorm vs LayerNorm, residual connections, gradient accumulation, mixed precision, and the Transformer — with the formulas that matter and the gotchas interviewers probe. Dense, scannable, built to trigger recall.

Format: summary notes
Bias: practical / engineering
Focus: activations · norm · training · Transformers · LLMs
Use: ⚠ clears-up · ◆ probe · ✓ remember · Q-bank
The map
Foundations (neurons, backprop) → building blocks (activations, normalization, residuals, init, losses, regularization) → training (optimizers, the loop, data, debugging) → architectures (CNN, RNN, Transformer, fine-tuning) → scale & deployment. Each page: a hook, a figure, the key formulas, the gotcha, an interview probe, and a bank of ten or more tricky questions. Drill it with the flashcards.
01
PART 0 · HOW TO USE

How to read these notes + the DL mindset

🎯Deep learning = layers + a loss + an optimizer + a data pipeline. Nail the data and the learning rate and you're most of the way there.

A neural network is not magic — it is a data pipeline + architecture + loss + optimizer + regularizer, and the biggest real-world wins come from getting data right and tuning the learning rate, not from exotic architectures. Master the five pillars and you can debug anything.

THE FIVE PILLARS OF ANY DL SYSTEM
Architecture: layers, activations, skip connections (the function family)
Loss: what you're optimizing; must match the task exactly
Optimizer: how you descend (SGD, Adam, AdamW); controls step size and momentum
Regularization: dropout, L1/L2, early stopping; fights overfitting
Data pipeline: quality, normalization, augmentation, shuffling; the highest-leverage lever

When something breaks, suspect them in reverse order: data first, then optimizer, then architecture last.

ACTIVATIONS, INIT, AND THE VANISHING/EXPLODING GRADIENT TRAP

Sigmoid/tanh saturate → gradients shrink exponentially with depth. ReLU fixes that but can die (stuck at 0 forever). Modern defaults: GELU (transformers), ReLU/Leaky-ReLU (CNNs), SiLU/Swish (EfficientNet family).

$$\text{ReLU}(x) = \max(0, x)$$
zero if negative, identity if positive
$$\text{GELU}(x) \approx x \cdot \sigma(1.702x)$$
smooth probabilistic gate (sigmoid approximation); transformer default

Init pairing rule:

Activation | Init | Variance
sigmoid / tanh | Xavier/Glorot | $2/(n_{in}+n_{out})$ (equals $1/n_{in}$ when $n_{in}=n_{out}$)
ReLU family | He | $2/n_{in}$

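All-zeros init fails to break symmetry (every neuron in a layer computes the same output and receives the same gradient), and a badly scaled init can stall training before it starts. Exploding gradients (RNNs especially) → gradient clipping by norm (preserves direction; clip-by-value distorts it).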

NORMALIZATION, RESIDUALS, AND DEPTH

BatchNorm: normalizes over the mini-batch during training, uses frozen running stats at inference. Forgetting model.eval() silently corrupts eval accuracy — the single most common BN bug. LayerNorm: normalizes over the feature dimension, batch-size independent — the transformer standard.
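The train/eval switch is easier to remember once you have seen it in miniature. The sketch below is a toy, single-feature batch norm in plain Python; `BatchNorm1D` is a hypothetical class written for these notes, not the PyTorch layer, and it omits the learnable scale and shift.

```python
class BatchNorm1D:
    """Toy batch norm over one feature (hypothetical; no learnable scale/shift)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def eval(self):
        self.training = False
        return self

    def __call__(self, batch):
        if self.training:
            # Training: normalize with the statistics of this mini-batch ...
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            # ... and fold them into the running estimates.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: use the frozen running statistics.
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in batch]


bn = BatchNorm1D()
for _ in range(200):
    bn([9.0, 10.0, 11.0])            # training batches: mean 10, variance 2/3

# Forgot eval(): a single sample is normalized against itself and collapses to 0.
forgot_eval = BatchNorm1D()
train_mode_out = forgot_eval([12.0])

# Correct: switch to eval so the running statistics are used.
eval_out = bn.eval()([12.0])
```

In train mode the lone sample 12.0 maps to 0.0 whatever its value, which is the "silent corruption" described above; in eval mode it lands about 2.45 standard deviations above the learned mean.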

Residual/skip connections solve the degradation problem: adding plain layers can hurt even training error because layers struggle to learn identity. Skip connections make identity the default, provide a gradient highway, and are the reason 100+ layer nets train at all.

OPTIMIZERS AND LEARNING-RATE STRATEGY
Optimizer | Use when | Gotcha
SGD + momentum | CNN/ResNet, final generalization matters | Needs more LR tuning; finds flatter minima
AdamW | Transformers, NLP, sparse grads, fast prototyping | Plain Adam = coupled weight decay; always use AdamW
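To see what "coupled weight decay" means in the table's last row, here is a minimal single-parameter sketch. The function `adam_step` and its `decoupled` flag are hypothetical names for illustration; the update follows the usual Adam formulas with bias correction, and the hyperparameter values are arbitrary.

```python
import math


def adam_step(w, grad, state, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam update on a scalar weight (illustrative sketch)."""
    if not decoupled:
        # Plain Adam + L2: the decay term is folded into the gradient,
        # so it gets rescaled by the adaptive denominator below.
        grad = grad + weight_decay * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # AdamW: the decay acts directly on the weight, outside the adaptive scaling.
    decay = lr * weight_decay * w if decoupled else 0.0
    return w - lr * m_hat / (math.sqrt(v_hat) + eps) - decay


w0, g0 = 10.0, 0.1
w_adam_l2 = adam_step(w0, g0, {"t": 0, "m": 0.0, "v": 0.0},
                      weight_decay=0.1, decoupled=False)
w_adamw = adam_step(w0, g0, {"t": 0, "m": 0.0, "v": 0.0},
                    weight_decay=0.1, decoupled=True)
```

On this first step the coupled variant moves the weight by about `lr` no matter how large the decay term is, because the decay is normalized away with the gradient. The decoupled variant takes the same adaptive step and then also shrinks the weight by `lr * weight_decay * w`.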

LR schedule: warmup avoids early instability (critical for large batches and transformers); cosine/step decay lets the model settle. A fixed LR is a permanent compromise. Large batch → scale LR linearly + add warmup or you lose accuracy. Small batches: noisier gradients, often better generalization, less memory.
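A minimal sketch of the schedule described above: linear scaling of the peak LR with batch size, linear warmup, then cosine decay. The function names and the numbers (base LR 0.1 at batch 256, 100 warmup steps, 1000 total steps) are assumptions chosen for illustration, not recommended settings.

```python
import math


def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: the LR grows in proportion to the batch size."""
    return base_lr * batch / base_batch


def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


peak = scaled_lr(base_lr=0.1, base_batch=256, batch=1024)   # 4x batch -> 4x LR
schedule = [lr_at(s, total_steps=1000, peak_lr=peak, warmup_steps=100)
            for s in range(1000)]
```

The schedule climbs from 0.004 to the peak of 0.4 over the first 100 steps, then falls smoothly to nearly zero by the last step.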

LOSSES, REGULARIZATION, AND OUTPUT LAYER CHOICES
Task | Output | Loss
Multi-class (mutually exclusive) | Softmax | Categorical cross-entropy
Multi-label | Independent sigmoids | Binary cross-entropy per class
Regression | Linear | MSE / Huber
Imbalanced classes | (as for the task) | Focal loss or class-weighted CE
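The first two rows of the table are the ones most often confused, so here is a small sketch of both pairings in plain Python. The helper names and the example logits are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Categorical cross-entropy for one example with a single true class."""
    return -math.log(softmax(logits)[target_index])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent per-class sigmoids."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


logits = [2.0, 2.0, -3.0]                 # two classes are both strongly indicated
probs_softmax = softmax(logits)           # forced to compete: each just under 0.5
probs_sigmoid = [sigmoid(z) for z in logits]   # independent: both about 0.88
```

With two equally strong logits, softmax cannot give either class more than half the probability mass, while independent sigmoids can say "yes" to both. That is why softmax is the wrong head for multi-label tasks.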

Regularization: L1 (Lasso) → sparsity, zero-out weights; L2 (Ridge / weight decay) → smooth shrinkage, DL default; use AdamW for properly decoupled weight decay. Dropout: p=0.5 in FC layers; lower or none in conv. Inverted dropout rescales the kept activations by 1/(1-p) at train time, so inference needs no rescaling, but the dropout mask itself must still be switched off at test time (model.eval()). Overfitting? More/augmented data beats regularization; regularization beats reducing capacity.
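A minimal sketch of inverted dropout, assuming a flat list of activations. The function name and the `training` flag are hypothetical, chosen to mirror the train/eval distinction above.

```python
import random


def inverted_dropout(activations, p, training, rng=random):
    """Drop each unit with probability p; scale survivors by 1/(1-p) in training."""
    if not training or p == 0.0:
        return list(activations)          # inference: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
x = [1.0] * 10000
train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False)
```

In training each output is either 0.0 or 2.0, so the expected activation stays at 1.0. At inference the input passes through unchanged, which is the whole point of scaling at train time.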

⚠ Clears up — "Adam is always better than SGD" Adam converges faster and is more robust to LR choice, but SGD with momentum frequently finds flatter, better-generalizing minima — which is why many SOTA vision models still train with SGD. Also: always use AdamW, not plain Adam; plain Adam couples weight decay into the adaptive update, which hurts regularization.
◆ Interview probe Q: Training loss is stuck and not decreasing at all. How do you debug it? → A: First sanity-check the data (NaNs, label alignment, normalization). Then overfit a single tiny batch — if you can't drive that loss to near zero, the bug is in model/loss/optimizer wiring. Finally sweep the learning rate and verify gradients are nonzero with a backward pass check. Only then touch architecture or hyperparameters.
Remember   Debug data → overfit one batch → sweep LR; everything else is downstream.
Tricky interview questions 12
Your training loss is not decreasing at all. Walk me through how you debug it.
Sanity-check the data pipeline first (NaNs/Infs, label alignment, normalization, shuffling), then try to overfit a single tiny batch — if you can't drive loss near 0, the bug is in model/loss/optimizer wiring. Then sweep the learning rate and verify gradients are nonzero and flowing backward. Trap: jumping to architecture or hyperparameter tuning before confirming the model can overfit one batch and data is loaded correctly.
When would you pick Adam over SGD, and why do practitioners still train vision models with SGD?
Use AdamW for fast, robust convergence with little tuning — transformers, NLP, sparse gradients. Use SGD with momentum when final generalization matters most (many SOTA CNNs/ResNets) because it tends to find flatter, better-generalizing minima, at the cost of more LR tuning. Trap: claiming Adam is strictly better — it converges faster but can generalize worse; also forgetting AdamW (decoupled weight decay) is the modern default over plain Adam.
What causes vanishing vs exploding gradients, and how do you fix each in practice?
Vanishing: saturating activations (sigmoid/tanh) compounded over depth — fix with ReLU-family activations, residual/skip connections, BatchNorm/LayerNorm, and He/Xavier init. Exploding (common in RNNs): repeated multiplication by large weights — fix with gradient clipping by norm, good init, and normalization. Trap: conflating the two or omitting the matched remedy — interviewers want gradient clipping for exploding and residual connections for vanishing.
What is the dying ReLU problem and how do you address it?
A ReLU neuron whose pre-activation stays negative outputs 0 with zero gradient, so it never updates and is permanently dead. Mitigate with Leaky ReLU / PReLU / ELU / GELU, a lower learning rate, and He initialization. Trap: thinking the neuron recovers on its own — once stuck with all-negative inputs, gradient is 0 and it stays dead.
How does Batch Normalization behave differently at training vs inference, and why does it matter?
Training: BN normalizes with the current mini-batch mean/variance and accumulates running stats. Inference: it uses those fixed running stats. Forgetting model.eval() or using batch size 1 corrupts the running stats and silently tanks eval accuracy. Trap: not mentioning the train/inference statistic switch — this is the most common real-world BN bug.
Dropout: when do you use it, and how is it handled at inference?
Use dropout when FC layers overfit (typically p=0.5 in FC, lower or none in conv layers); at inference all neurons are active. With inverted dropout, activations are scaled during training so inference needs no change — but the dropout mask must be off. Trap: forgetting to disable dropout at test time, or over-applying it in conv layers alongside BatchNorm, where it can be redundant or harmful.
Your train accuracy is high but test accuracy is low. What's happening and what do you do?
That's overfitting — the model memorizes training data. Fix with more/augmented data first, then regularization (dropout, L2/weight decay), early stopping, and reducing capacity only as a last resort. Trap: reaching for a bigger model — overfitting calls for more data or regularization, not more capacity.
L1 vs L2 regularization — practical difference and when do you choose each?
L1 (Lasso) drives weights to exactly zero, giving sparsity and implicit feature selection; L2 (Ridge / weight decay) shrinks all weights smoothly toward zero without zeroing them — the DL default. Choose L1 for sparse/interpretable models, L2 for general smoothing. Trap: saying L2 produces sparsity (that's L1); and with Adam use AdamW so weight decay is decoupled from the adaptive update.
He vs Xavier initialization — when do you use each and why does it matter?
Xavier/Glorot (variance ~1/n_in) suits symmetric saturating activations (sigmoid/tanh); He (variance 2/n_in) suits ReLU-family because ReLU zeros half the inputs. Correct init keeps signal/gradient variance stable across depth. Trap: treating init as unimportant. A badly scaled init, or all-zeros (which fails to break symmetry, so every neuron learns the same thing), can stall training before it begins.
Softmax vs sigmoid on the output layer — when do you use which?
Softmax + categorical cross-entropy for single-label multi-class (mutually exclusive, outputs sum to 1); independent sigmoids + binary cross-entropy per class for binary or multi-label problems where labels are not mutually exclusive. Trap: using softmax for multi-label tasks — it forces probabilities to compete and sum to 1, which is wrong when an example can belong to multiple classes.
How does batch size affect training, and how do you choose it?
Small batches give noisier gradients that often generalize better and use less memory; large batches are faster and more stable but can generalize worse and require LR scaling (linear scaling + warmup). Pick the largest that fits memory (power of 2 for GPU efficiency), then tune the LR. Trap: increasing batch size without re-tuning the learning rate — large-batch training needs LR scaling and warmup or it loses accuracy.
Why does a deeper plain network sometimes do worse than a shallower one, and how do residual connections help?
Very deep plain nets suffer a degradation problem where added layers fail to learn even the identity — this shows up in training error, not just test error. Residual/skip connections make identity the default output and provide a gradient highway, enabling 100+ layer networks to train. Trap: attributing it only to overfitting — the degradation appears in training error too, so it's an optimization/gradient-flow issue.
02
PART 0 · FOUNDATIONS

Neurons, MLPs & what depth buys you

🎯A net is just linear maps glued together by nonlinearities — drop the nonlinearity and depth collapses to a single layer.
[Figure: A multilayer perceptron (fully-connected net): input x → hidden (ReLU) → output]
Each layer is $h=\phi(Wx+b)$: a linear map, then a nonlinearity $\phi$. Stack them and the network can approximate any continuous function (given enough units); the nonlinearity is what makes depth more than one big linear map.

A neural network is just matrix multiplies and a squashing function, stacked until the composition gets rich enough to carve any decision boundary. Depth is the lever — not because more layers means more parameters, but because each layer reuses the previous layer's features, achieving exponentially more coverage per parameter than width alone.

THE BUILDING BLOCKS
Single neuron: scalar output, $h = \phi(w \cdot x + b)$
Single layer: vector output, $h = \phi(Wx + b)$
MLP: chain of layers; each layer's output is the next layer's input
Forward pass: sequence of matrix multiplies + activations; nothing else at inference
$$h = \phi(Wx + b)$$
layer output = activation applied element-wise to (weight matrix × input + bias)
WHY NONLINEARITY IS NON-NEGOTIABLE

Stack two linear layers with no activation: $W_2(W_1 x + b_1) + b_2 = W' x + b'$ — still one linear map. Without $\phi$, depth is an illusion. The activation breaks linearity so each added layer genuinely expands the function class. Universal approximation says a single wide hidden layer can fit any continuous function — but it may need exponentially many neurons. Depth gives the same power with far fewer parameters, because lower layers learn reusable sub-features.
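The collapse can be checked numerically. The sketch below uses small hand-picked 2×2 matrices (arbitrary values, chosen only so the arithmetic is exact) and plain-Python helpers.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]
x = [3.0, -2.0]

# Two stacked linear layers, no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map as a single layer: W' = W2 W1, b' = W2 b1 + b2.
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_merged, x), b_merged)

# Insert a ReLU between the layers and the equivalence is gone.
hidden = [max(0.0, v) for v in add(matvec(W1, x), b1)]
with_relu = add(matvec(W2, hidden), b2)
```

Both linear versions give `[-2.0, 8.5]`; the ReLU version gives `[-1.0, 9.0]`, so no single affine map reproduces it for all inputs.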

ACTIVATIONS — PICK THE RIGHT ONE
$$\text{ReLU}(x) = \max(0, x)$$
dead-simple and nearly free, but it zeroes negative inputs, so a unit stuck in the negative regime stops learning
$$\text{Leaky/PReLU}(x) = \max(\alpha x, x)$$
small slope α keeps dying neurons alive; PReLU learns α
$$\text{GELU}(x) \approx x \cdot \sigma(1.702x)$$
smooth probabilistic gating (sigmoid approximation); default in transformers
$$\sigma(x) = \frac{1}{1+e^{-x}}, \quad \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
sigmoid/tanh saturate → vanishing gradients in deep nets; use only in gates or output layers
Activation | When to reach for it
ReLU | Default hidden layer in CNNs and generic MLPs
Leaky/PReLU | If dying ReLU is a problem; also GANs
GELU | Transformers / BERT-family; smooth gating
Sigmoid | Binary output neuron; LSTM gates
Softmax | Multi-class output (mutually exclusive labels)
DEPTH, WIDTH & PARAMETER COUNT

Parameters per layer = $n_{in} \times n_{out} + n_{out}$ (weights + biases). Total = sum across all layers. Depth grows representational power combinatorially; width grows it linearly. For the same parameter budget, deeper usually wins in practice — but very deep plain networks hit a degradation wall (training error rises), which residual/skip connections solve by making identity the easy default and providing a gradient highway.
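The per-layer formula is easy to apply by hand, but a two-line helper avoids arithmetic slips. The layer sizes below (a 784-256-128-10 MLP) are an assumed example.

```python
def mlp_param_count(layer_sizes):
    """Sum of n_in * n_out weights plus n_out biases over consecutive layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


total = mlp_param_count([784, 256, 128, 10])
# 784*256 + 256 = 200960, 256*128 + 128 = 32896, 128*10 + 10 = 1290
```

For this example the total is 235,146 parameters, and the first layer alone accounts for about 85% of them.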

INITIALIZATION & GRADIENT HEALTH
$$\text{Xavier: } \text{Var}(w) = \frac{2}{n_{in}+n_{out}} \quad \text{He: } \text{Var}(w) = \frac{2}{n_{in}}$$
Xavier for sigmoid/tanh (reduces to $1/n_{in}$ when $n_{in}=n_{out}$); He for ReLU-family (ReLU zeros half the inputs, so it needs 2× the variance)

Wrong init stalls training before it starts; all-zeros is the worst case because it fails to break symmetry, so all neurons in a layer learn the same thing. Vanishing gradients: saturating activations + depth → fix with ReLU, He init, BatchNorm/LayerNorm, skip connections. Exploding gradients: large weight products → fix with gradient clipping by norm (preserves direction, unlike clip-by-value).
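A rough way to see why ReLU wants the 2× variance: push one random input through a stack of ReLU layers and watch the signal scale. This is a sketch under stated assumptions: square layers of width 64, depth 10, Gaussian weights, the Glorot form $2/(n_{in}+n_{out})$ for Xavier, and hypothetical helper names.

```python
import math
import random


def he_std(n_in):
    return math.sqrt(2.0 / n_in)


def xavier_std(n_in, n_out):
    return math.sqrt(2.0 / (n_in + n_out))


def mean_square_after(depth, width, std, seed=0):
    """Mean squared pre-activation of the last of `depth` ReLU layers."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]      # random input vector
    z = h
    for _ in range(depth):
        # One dense layer with freshly sampled N(0, std^2) weights, no bias.
        z = [sum(rng.gauss(0.0, std) * v for v in h) for _ in range(width)]
        h = [max(0.0, v) for v in z]                      # ReLU
    return sum(v * v for v in z) / width


width, depth = 64, 10
he_ms = mean_square_after(depth, width, he_std(width))
xavier_ms = mean_square_after(depth, width, xavier_std(width, width))
```

Because both runs share a seed, the He weights are exactly √2 times the Xavier weights, so the He signal ends up 2^10 = 1024 times larger in mean square after ten layers. Put the other way, Xavier with ReLU halves the signal variance at every layer.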

⚠ Clears up — "Adam is strictly better than SGD" Adam converges faster and needs less LR tuning, making it the default for transformers and sparse problems. But SGD with momentum consistently finds flatter, better-generalizing minima on vision tasks — that's why SOTA ResNets still use it. Also: use AdamW (decoupled weight decay), not plain Adam, to avoid weight decay being absorbed into the adaptive scale.
◆ Interview probe "Training loss is stuck from epoch 1 — where do you look first?" → Overfit a single batch. If you can't drive that mini-batch loss to near-zero, the bug lives in model wiring, loss function, or optimizer setup — not in hyperparameters. Check for NaNs/Infs in data, verify labels align with inputs, confirm gradients are nonzero with a quick backward pass print. Only after that batch overfit succeeds do you tune learning rate and scale up.
Remember   Nonlinearity is what makes depth meaningful — without it, every extra layer cancels out into one linear map.
Tricky interview questions 11
Your training loss is not decreasing at all from epoch 1. Walk me through how you debug it.
Sanity-check the data pipeline first for NaNs, label misalignment, and normalization. Then try to overfit a single tiny batch: if you can't push that loss near zero, the bug is in model/loss/optimizer wiring, not hyperparameters. Then verify gradients are nonzero and flowing via a quick backward pass. Trap: jumping to learning-rate sweeps or architecture changes before confirming the model can memorize even two examples.
Why does stacking linear layers without activations collapse to a single linear map?
$W_2(W_1 x + b_1) + b_2$ is just $W'x + b'$ — one affine transformation. Every additional layer without a nonlinearity is redundant and buys no expressive power. The activation $\phi$ is the only thing that makes depth matter. Trap: thinking that more layers always help — without nonlinearity, the entire stack is equivalent to a single layer regardless of how wide or deep it is.
What is the dying ReLU problem, and how do you address it?
A ReLU unit whose pre-activation stays permanently negative outputs 0 with zero gradient, so it never receives any update — it is effectively dead for the rest of training. Mitigate with Leaky ReLU / PReLU / ELU / GELU, careful He initialization, and avoiding very large learning rates. Trap: thinking the neuron might recover — once all inputs are negative the gradient is identically 0, so the weight never moves and the neuron stays dead.
He initialization vs Xavier/Glorot — when do you use each and why does it matter?
Xavier (variance $\approx 1/n_{in}$) suits sigmoid/tanh where the function is symmetric around zero. He (variance $\approx 2/n_{in}$) suits ReLU-family because ReLU zeros half the inputs, effectively halving signal variance — He compensates with 2×. Correct init keeps gradient variance stable across depth from the first forward pass. Trap: using Xavier with ReLU — it under-scales variance and can cause vanishing gradients from epoch 0, especially in deep nets.
What causes vanishing gradients and what specifically do you do to fix them?
Saturating activations (sigmoid/tanh) compound over depth: gradients multiply by derivatives near 0, shrinking exponentially. Fixes: switch to ReLU-family activations, add residual/skip connections (gradient highway), apply BatchNorm or LayerNorm, and use He/Xavier init. Exploding gradients are the mirror problem — fix with gradient clipping by norm, not by value (norm preserves direction). Trap: conflating vanishing and exploding or giving only one remedy — interviewers want the matched pair.
How does Batch Normalization behave differently at training vs inference, and why does that matter?
During training, BN normalizes each mini-batch with its own mean and variance while updating running estimates. At inference it uses those fixed running statistics. Forgetting model.eval() means the model keeps using batch statistics at test time — silently corrupting predictions, especially with small batches or batch size 1. Trap: not mentioning the train/inference statistic switch — this is one of the most common silent bugs in PyTorch code.
When do you use softmax vs independent sigmoids on the output layer?
Softmax + categorical cross-entropy for single-label multi-class: outputs are mutually exclusive and sum to 1. Independent sigmoids + binary cross-entropy for multi-label problems where each label is independent and an example can belong to multiple classes simultaneously. Trap: using softmax for multi-label — it forces outputs to compete and sum to 1, which is wrong when co-occurrence is valid (e.g., an image tagged both "dog" and "outdoors").
When would you pick Adam over SGD, and why do vision researchers still use SGD?
Use AdamW for fast convergence with minimal tuning — transformers, NLP, sparse gradients, limited compute budget for LR search. Use SGD with momentum when final generalization is paramount: many SOTA CNN/ResNet results use SGD because its noisier updates tend to find flatter, better-generalizing minima. Trap: claiming Adam is strictly better — it converges faster but often generalizes worse; also always use AdamW, not plain Adam, to properly decouple weight decay from the adaptive scaling.
Your train accuracy is high but test accuracy is low. What's happening and what do you reach for first?
Classic overfitting — the model is memorizing training data. Prioritized fixes: more data or data augmentation first (biggest bang), then regularization (dropout, L2/weight decay), early stopping, and reducing model capacity if all else fails. Trap: reaching for a bigger model or more layers — overfitting demands less capacity or more data, not more; and more/better data almost always beats clever regularization.
Why does a deeper plain network sometimes have higher training error than a shallower one, and how do residual connections fix it?
This is the degradation problem — not overfitting. Even on training data, very deep plain nets struggle because added layers fail to learn even the identity mapping; it's an optimization problem compounded by vanishing gradients. Residual connections make identity the trivial solution (just learn zero residual) and provide a direct gradient path to earlier layers, enabling networks 100+ layers deep. Trap: attributing degradation to overfitting — it shows up in training error too, confirming it's optimization/gradient-flow, not generalization.
L1 vs L2 regularization — practical difference and when do you choose each?
L1 (Lasso) adds $\lambda |w|$ and drives weights to exactly zero, giving sparsity and implicit feature selection. L2 (Ridge/weight decay) adds $\lambda w^2$ and shrinks all weights smoothly toward — but not to — zero. Default in deep learning is L2 via AdamW's weight decay. Use L1 when you want a sparse, interpretable model or explicit feature selection. Trap: claiming L2 produces sparsity — that's L1's distinctive property; L2 shrinks but never zeros out a weight.
03
PART 0 · FOUNDATIONS

Backprop & gradient flow

🎯Backprop is the chain rule on a graph; vanishing/exploding gradients are that product shrinking or blowing up.
[Figure: "Why deep nets need ReLU + residuals". x-axis: layer depth (from output), 1 to 20; y-axis: relative gradient size, 0 to 1; curves: sigmoid (vanishes), bad init (explodes), ReLU + residual]
Backprop multiplies a Jacobian per layer. With saturating units (sigmoid) the factors are <1 so the gradient vanishes; with large weights it explodes. ReLU (gradient 1 when active) plus residual connections (an identity path) keep it ≈ constant — that is what makes 100-layer nets trainable.
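A back-of-envelope version of the figure, assuming scalar layers and ignoring the weights entirely: multiply one local derivative per layer and see where the product goes. The factor 1.5 for the exploding case is an arbitrary stand-in for "large weights".

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(local_grads):
    """Product of per-layer local derivatives along one path."""
    scale = 1.0
    for g in local_grads:
        scale *= g
    return scale


depth = 20
sigmoid_chain = gradient_scale([sigmoid_grad(0.0)] * depth)   # best case: 0.25 per layer
relu_chain = gradient_scale([1.0] * depth)                    # active ReLU units
exploding_chain = gradient_scale([1.5] * depth)               # hypothetical factor > 1
```

Even at the sigmoid's best-case input the gradient shrinks to 0.25^20, under 1e-12, after 20 layers. The active-ReLU path stays at 1, and a per-layer factor of only 1.5 grows past 3000.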

Backprop is just the chain rule on a computation graph: one forward pass caches activations, one backward pass multiplies local gradients from loss to params — everything else (vanishing, exploding, dying neurons) is a story about what those products do when they get out of control.

THE MECHANICS
Forward pass: Compute and cache activations at every node. Cost ~1×.
Backward pass: Walk the graph in reverse; multiply local Jacobians. Cost roughly 1-2× the forward pass (about 2× for dense layers, which need gradients for both inputs and weights), so a training step costs roughly 2-3× a forward pass.
Autograd (PyTorch): Builds the computation graph dynamically on each forward pass; flexible, but the graph is rebuilt every step.
$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial a_n}\cdot\frac{\partial a_n}{\partial a_{n-1}}\cdots\frac{\partial a_{k}}{\partial w_i}$$
Chain rule: loss gradient at layer n multiplied back through every layer to weight w_i
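The chain rule above can be written out by hand for a tiny network and checked against finite differences. The sketch assumes a two-weight scalar model, $a_1 = \tanh(w_1 x)$, $a_2 = w_2 a_1$, $L = \tfrac{1}{2}(a_2 - y)^2$, with arbitrary example values.

```python
import math


def forward(w1, w2, x, y):
    a1 = math.tanh(w1 * x)        # hidden activation (cached for the backward pass)
    a2 = w2 * a1                  # output
    loss = 0.5 * (a2 - y) ** 2
    return loss, (a1, a2)


def backward(w1, w2, x, y, cache):
    a1, a2 = cache
    dL_da2 = a2 - y                       # dL/da2
    da2_da1 = w2                          # da2/da1
    da1_dw1 = (1.0 - a1 * a1) * x         # tanh'(w1*x) * x
    grad_w1 = dL_da2 * da2_da1 * da1_dw1  # chain rule: product of local gradients
    grad_w2 = dL_da2 * a1
    return grad_w1, grad_w2


w1, w2, x, y = 0.3, -1.2, 2.0, 0.5
loss, cache = forward(w1, w2, x, y)
grad_w1, grad_w2 = backward(w1, w2, x, y, cache)

# Central finite differences as an independent check.
h = 1e-6
numeric_w1 = (forward(w1 + h, w2, x, y)[0] - forward(w1 - h, w2, x, y)[0]) / (2 * h)
numeric_w2 = (forward(w1, w2 + h, x, y)[0] - forward(w1, w2 - h, x, y)[0]) / (2 * h)
```

The same comparison, analytic gradient versus finite difference, is a handy wiring check when a custom layer or loss is suspected of a bug.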
VANISHING vs EXPLODING — MATCHED FIXES
Vanishing (products < 1): Deep sigmoid/tanh networks. Early layers get ~0 gradient; they stop learning.
Exploding (products > 1): Deep/RNN nets. Gradients grow exponentially → NaNs and weight blowup.

Fix vanishing: ReLU-family activations + residual/skip connections + BatchNorm/LayerNorm + He init.
Fix exploding: gradient clipping by norm (clip-by-value distorts direction — avoid). Clip by norm rescales the whole gradient vector when its $\ell_2$ norm exceeds threshold $\tau$:

$$g \leftarrow g \cdot \frac{\tau}{\max(\|g\|_2,\, \tau)}$$
If gradient norm exceeds τ, rescale to τ; otherwise leave unchanged
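The clipping rule translates directly into code. The sketch below treats the gradient as a flat list of floats and contrasts it with clip-by-value; both function names are illustrative.

```python
import math


def clip_by_norm(grad, tau):
    """g <- g * tau / max(||g||_2, tau): rescale only when the norm exceeds tau."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = tau / max(norm, tau)
    return [g * scale for g in grad]


def clip_by_value(grad, tau):
    """Cap each component independently to [-tau, tau]."""
    return [max(-tau, min(tau, g)) for g in grad]


g = [3.0, 4.0]                                  # norm 5
clipped_norm = clip_by_norm(g, tau=1.0)         # about [0.6, 0.8]: same direction, norm 1
clipped_value = clip_by_value(g, tau=1.0)       # [1.0, 1.0]: the 3:4 ratio is lost
small = clip_by_norm([0.3, 0.4], tau=1.0)       # norm 0.5 is under tau: unchanged
```

Clip-by-norm keeps the 3:4 ratio between components, while clip-by-value flattens it to 1:1. That flattening is the direction distortion the notes warn about.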
ACTIVATIONS & DYING ReLU
ReLU: $\max(0, x)$. Fast, non-saturating for $x>0$. Kills neurons with all-negative inputs.
Leaky ReLU: $\max(\alpha x, x)$, $\alpha\approx 0.01$. Prevents dead neurons.
GELU: Smooth probabilistic gate. Default in transformers.
Sigmoid/Tanh: Saturate → vanishing gradients in deep nets. Use only at output (sigmoid for binary) or in gates.

Dying ReLU: neuron stuck with negative pre-activation → gradient = 0 forever. It does NOT recover on its own. Fix: Leaky/PReLU/ELU/GELU + lower LR + He init.

INITIALIZATION — HE vs XAVIER
Init | Variance | Use when
Xavier/Glorot | $2/(n_{in}+n_{out})$ | Sigmoid / Tanh (symmetric, saturating)
He (Kaiming) | $2/n_{in}$ | ReLU-family (ReLU zeros half inputs, needs 2× boost)
All-zeros | 0 | Never: fails to break symmetry, all neurons learn identically

Goal: keep activation/gradient variance stable across depth. Wrong init can stall training before the first epoch ends.

RESIDUAL CONNECTIONS

A plain deep net can have higher training error than a shallower one — not just overfitting but an optimization failure called the degradation problem. Skip connections let each block learn a residual $F(x)$ around the identity: output = $F(x) + x$. Two benefits: (1) identity is the easy default so blocks only learn deltas; (2) gradients flow directly through the skip path — a gradient highway that bypasses repeated multiplications.

$$y = F(x, \{W_i\}) + x$$
Block output = residual function + shortcut identity. Gradient can bypass F entirely via the + node.
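Both benefits show up in a scalar toy model. The sketch assumes a hypothetical residual function $F(x) = w \tanh(x)$ with one weight per block, and tracks the derivative of the stack's output with respect to its input.

```python
import math


def block(x, w, residual):
    fx = w * math.tanh(x)                 # hypothetical residual function F
    return fx + x if residual else fx


def block_grad(x, w, residual):
    dF = w * (1.0 - math.tanh(x) ** 2)    # F'(x)
    return dF + 1.0 if residual else dF   # the skip path adds the identity's 1


def stack(x, weights, residual):
    """Run x through the blocks; return the output and d(output)/d(input)."""
    grad = 1.0
    for w in weights:
        grad *= block_grad(x, w, residual)
        x = block(x, w, residual)
    return x, grad


weights = [0.1] * 30
_, plain_grad = stack(0.5, weights, residual=False)
_, res_grad = stack(0.5, weights, residual=True)
identity_out, identity_grad = stack(0.5, [0.0] * 30, residual=True)
```

With small weights the plain stack's gradient collapses below 1e-29 after 30 blocks, while the residual stack's gradient stays at or above 1. With all weights at zero the residual stack is exactly the identity, which is what "identity is the easy default" means.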
⚠ Clears up — Batch Norm train vs eval In training, BN normalizes with the current mini-batch mean/variance and accumulates running stats. At inference it uses those fixed running stats. Forgetting model.eval() (or using batch size 1 at test time) silently corrupts the statistics and tanks evaluation accuracy — the most common BN bug in production.
◆ Interview probe Q: Training loss won't decrease at all — what's your first move?
→ Sanity-check data (NaNs, label alignment, normalization). Then try to overfit a single tiny batch: if you can't drive that loss near 0, the bug is in model/loss/optimizer wiring, not hyperparameters. Then sweep LR and verify gradients are non-zero and flowing backward.
Remember   Backprop = chain rule on a graph; every training pathology (vanishing, exploding, dying neurons) reduces to the product of local gradients running away from 1 — know the matched cure for each direction.
Tricky interview questions 11
What causes vanishing vs exploding gradients and how do you fix each — matched?
Vanishing: saturating activations (sigmoid/tanh) compounded over depth make products < 1 → early layers get ~0 gradient. Fix: ReLU-family, residual connections, LayerNorm/BatchNorm, He init. Exploding: large-weight products > 1 blow up → NaNs; common in RNNs. Fix: gradient clipping by norm (not by value — value distorts direction), good init, normalization. Trap: conflating the two or offering only definitions; interviewers want the matched remedy — especially gradient clipping for exploding and residual connections for vanishing.
Your training loss is not decreasing at all. Walk me through your debug steps.
Start with data: check for NaNs/Infs, label alignment, normalization. Then try to overfit a single tiny batch — if loss won't drop to near 0, the bug is in model/loss/optimizer wiring. Then verify gradients are non-zero via hooks, sweep learning rate. Trap: jumping to architecture or hyperparameter changes before confirming the model can even memorize one batch.
What is the dying ReLU problem and how do you fix it?
A ReLU neuron whose pre-activation is always negative outputs 0 with zero gradient — it never updates and stays dead permanently. Fix with Leaky ReLU / PReLU / GELU, a lower learning rate, and He initialization. Trap: thinking it recovers on its own — once stuck with all-negative inputs, the gradient is exactly 0 and it cannot escape without external intervention.
He vs Xavier init — when do you use each and why does it matter?
Xavier/Glorot (variance 2/(n_in + n_out)) suits symmetric saturating activations (sigmoid/tanh). He (variance 2/n_in) suits ReLU-family because ReLU kills half the inputs and needs a 2× variance boost to keep signal scale stable across depth. Wrong init can stall training from epoch 0. Trap: treating init as a cosmetic detail, or using Xavier with ReLU (under-scales variance and reintroduces vanishing).
How does Batch Normalization behave differently at training vs inference?
Training: normalizes with current mini-batch mean/variance, updates running statistics. Inference: uses fixed running statistics accumulated during training. Forgetting model.eval() or running BN with batch size 1 at test time corrupts the stats and silently degrades accuracy. Trap: not mentioning the statistic switch — this is the top real-world BN bug.
When would you pick Adam over SGD, and why do practitioners still use SGD for vision models?
Use AdamW for fast robust convergence with little tuning: transformers, NLP, sparse gradients. Use SGD with momentum when final generalization matters most (SOTA CNNs/ResNets) — it tends to find flatter, better-generalizing minima at the cost of more LR tuning. Trap: claiming Adam is strictly better; it converges faster but can generalize worse. Also always use AdamW (decoupled weight decay) over plain Adam.
L1 vs L2 regularization — practical difference and when to choose each?
L1 (Lasso) drives weights to exactly zero — implicit feature selection, sparse models. L2 (Ridge / weight decay) shrinks all weights smoothly — the default in deep learning. For DL use AdamW so weight decay is decoupled from the adaptive gradient update. Trap: claiming L2 produces sparsity — that is L1. Also forgetting that Adam + plain L2 in the loss doesn't properly decouple decay; use AdamW.
Train accuracy high, test accuracy low — what's happening and what do you do?
Overfitting: the model memorizes training data. Priority order of fixes: more / augmented data first, then regularization (dropout, weight decay), early stopping, and reducing model capacity only if needed. Trap: reaching for a bigger model — overfitting calls for data or regularization, not more capacity.
How does batch size affect training and how do you set it?
Small batches: noisier gradients, often better generalization, memory-efficient but slower. Large batches: faster and stable but can generalize worse and require LR scaling (linear scaling rule + warmup). Start with the largest power-of-2 batch that fits GPU memory, then tune LR accordingly. Trap: increasing batch size without re-scaling the learning rate. An unchanged LR with an 8× batch takes 8× fewer steps of the same size per epoch and usually loses accuracy, while a scaled-up LR without warmup can diverge early.
Why does a deeper plain network sometimes do worse than a shallower one, and how do residual connections fix it?
It's the degradation problem: added plain layers struggle to learn even the identity, so training error — not just test error — increases. Residual connections make identity the easy default (the block only needs to learn F(x) = 0) and provide a gradient highway through the + node, enabling much deeper nets. Trap: attributing it only to overfitting — degradation appears in training error too, so it is an optimization / gradient-flow problem, not a generalization one.
Gradient clipping by value vs by norm — which is preferred and why?
Clip-by-norm rescales the entire gradient vector when its L2 norm exceeds threshold τ, preserving direction. Clip-by-value caps each component independently, distorting the gradient direction. Clip-by-norm is the standard default; needed mainly in RNNs/LSTMs and sometimes transformers to handle occasional spike events. Trap: using clip-by-value as the default — it changes the gradient direction and can introduce bias in the update.
04
PART I · THE BUILDING BLOCKS

Activation functions — all the formulas, and when to use which

🎯Default to ReLU; GELU/SiLU in Transformers. Keep sigmoid/tanh for the output only.
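[Figure, two panels. Left, "Activations": x-axis z from -3 to 3, y-axis from -1 to 2; curves: sigmoid, tanh, ReLU, GELU, SiLU/Swish. Right, "…their gradients": x-axis z from -3 to 3, y-axis φ'(z) from 0 to 1; annotation: σ' ≤ ¼ → vanishes]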
Default to ReLU (cheap, no saturation for $z>0$) or GELU/SiLU in Transformers (smooth, small negative pass-through). Avoid sigmoid/tanh in hidden layers — they saturate and kill gradients; keep sigmoid for a binary output and softmax for multiclass.

Activation functions are the bends in the wire — without them every layer collapses to a single matrix multiply. Pick the wrong one and your network either dies (all-zero neurons) or drowns (vanishing gradients); pick the right one and the signal flows cleanly from input to loss.

THE FORMULAS YOU MUST KNOW
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
Sigmoid: squashes to (0,1); derivative peaks at 0.25.
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
Tanh: zero-centered, range (-1,1); still saturates at the extremes.
$$\text{ReLU}(z) = \max(0, z)$$
ReLU: zero below, identity above; cheap, no positive saturation.
$$\text{LeakyReLU}(z) = \max(\alpha z,\, z), \quad \alpha \approx 0.01$$
LeakyReLU: a small negative slope keeps the gradient alive for negative inputs, so units cannot die.
$$\text{GELU}(z) \approx 0.5z\!\left(1+\tanh\!\left(\sqrt{\tfrac{2}{\pi}}\bigl(z+0.044715z^3\bigr)\right)\right)$$
GELU: smooth gating by the Gaussian CDF (exact form $z\,\Phi(z)$; the tanh expression is the usual approximation); default in Transformers.
$$\text{SiLU/Swish}(z) = z \cdot \sigma(z)$$
SiLU/Swish: smooth, self-gated; empirically close to GELU.
$$\text{SwiGLU}(x, W, V) = \text{Swish}(xW) \otimes (xV)$$
SwiGLU: data-dependent gate in the LLaMA/PaLM FFN; hidden dim cut to 2/3 to keep params equal.
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Softmax: turns logits into a competing probability distribution; use only at the output for multi-class.
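The scalar formulas above translate almost line for line into code. This is a sketch using only the standard library; the function names are illustrative, and `gelu_exact` (via `math.erf`) is included only to show how close the tanh approximation is.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)      # valid for 0 < alpha < 1

def gelu_tanh(z):
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

def gelu_exact(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))   # z * Phi(z)

def silu(z):
    return z * sigmoid(z)
```

Note that `sigmoid` as written is fine for moderate inputs; very negative `z` would overflow `math.exp`, which is one reason frameworks ship their own numerically guarded versions.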
WHEN TO USE WHICH — QUICK MAP
| Layer / task | Reach for | Why |
| --- | --- | --- |
| CNN / MLP hidden layers | ReLU | Cheap, no positive saturation, proven baseline |
| Transformer FFN (BERT/GPT style) | GELU | Smooth, better empirical performance |
| Transformer FFN (LLaMA/PaLM) | SwiGLU | Data-dependent gating, SOTA quality |
| Dead neurons appear | LeakyReLU / PReLU / GELU | Non-zero gradient for negatives |
| Binary output / gate | Sigmoid | Maps to probability (0,1) |
| Multi-class single-label output | Softmax | Outputs compete, sum to 1 |
| Multi-label output | Sigmoid per unit | Classes are independent |
| Regression output | None (linear) | No range constraint |
WHY NOT SIGMOID/TANH IN HIDDEN LAYERS
Saturation zone: Both flatten at the extremes; gradient ≈ 0 blocks backprop deep in the net.
Sigmoid bias: All outputs are positive, so the weight gradients into a next-layer unit all share one sign, which causes zig-zag weight updates.
Tanh advantage: Zero-centered (range -1..1), which avoids the zig-zag, but it still saturates; better than sigmoid, still not ReLU-family.
Max sigmoid derivative: 0.25, so even at the best-case input (z = 0) each sigmoid layer scales the backpropagated gradient by at most ¼.
DEAD RELU & HOW TO FIX IT

A unit is dead when its pre-activation stays negative for every input: ReLU outputs 0, the gradient is 0, and the weights never move. Primary cause: one oversized update (learning rate too high) pushes the weights or bias so far that the pre-activation is negative for every input.

Detect it: Log the fraction of zero activations per layer per batch over time. Healthy sparsity is fine; units that are zero for every input in every batch are the problem.
Fix 1: Lower the learning rate; pair with He initialization (var = 2/n_in) for ReLU.
Fix 2: Add BatchNorm before ReLU to keep pre-activations centered.
Fix 3: Switch to LeakyReLU (α ≈ 0.01), PReLU, ELU, or GELU; all pass gradient for negative inputs.
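One way the detection step could look, as a sketch: given the post-ReLU activations of one layer for a batch (rows are examples, columns are units), separate ordinary sparsity from units that never fire. The helper names and the toy batch are assumptions for illustration; in practice the numbers would come from a forward hook and be tracked across many batches.

```python
def zero_fraction(activations):
    """Overall sparsity: share of all activation entries that are exactly zero."""
    total = sum(len(row) for row in activations)
    zeros = sum(1 for row in activations for a in row if a == 0.0)
    return zeros / total

def dead_fraction(activations):
    """Share of units that are zero for every example in the batch."""
    n_units = len(activations[0])
    dead = sum(
        1 for j in range(n_units)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / n_units

batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.7, 0.5, 0.0],
]
sparsity = zero_fraction(batch)   # many zeros overall, which is normal
dead = dead_fraction(batch)       # only unit 0 is zero for every example
```

A single batch only flags candidates; a unit counts as dead when it stays in this set batch after batch.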
SOFTMAX NUMERICS & TEMPERATURE

Raw softmax overflows for large logits. The standard fix is to subtract the max logit before exponentiating, which is mathematically identical and numerically safe. Always use the framework's fused cross-entropy-from-logits loss; never chain softmax → log manually.
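A sketch of the max-subtraction trick in log space, using only the standard library. The names `log_softmax` and `cross_entropy_from_logits` are illustrative and do not refer to any particular framework's functions.

```python
import math

def log_softmax(logits):
    """log(softmax(z)) computed via log-sum-exp with the max subtracted."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class, straight from logits."""
    return -log_softmax(logits)[target]

logits = [1000.0, 999.0, 0.0]     # math.exp(1000.0) alone would overflow
loss = cross_entropy_from_logits(logits, target=1)
```

After the shift, the largest exponent is `exp(0) = 1`, so nothing can overflow; tiny terms simply underflow to 0 and drop out of the sum.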

Temperature $T$: divide logits by $T$ before softmax. $T < 1$ sharpens (greedy); $T > 1$ flattens (exploratory). Used in LLM sampling, knowledge distillation (high $T$ to expose soft targets), and model calibration.
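A small sketch of temperature scaling, assuming a hypothetical helper `softmax_t` that divides the logits by `T` and then applies a max-shifted softmax. The logits are arbitrary illustration values.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax of logits / T, with the max subtracted for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_t(logits, T=0.5)   # more mass on the top logit
base = softmax_t(logits, T=1.0)
flat = softmax_t(logits, T=2.0)    # closer to uniform
```

The ranking of classes never changes with `T`; only how much probability the top class takes from the rest.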

INIT PAIRING — DON'T MIX THESE UP
He / Kaiming (var = 2/n_in): For the ReLU family; the factor 2 compensates for ReLU zeroing ~50% of units, which halves the signal's second moment.
Xavier / Glorot (var = 2/(n_in+n_out)): For tanh / sigmoid; assumes a symmetric, non-zeroing activation.
Mismatch cost: Xavier + ReLU = signal decays in deep nets; He + sigmoid/tanh = larger pre-activations that push units into saturation.
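The Xavier + ReLU decay can be seen in a toy experiment. The sketch below pushes one random input through a stack of square ReLU layers and reports the mean squared activation at the end; width 64, depth 10, and the seed are arbitrary assumptions. Both runs reuse the same random draws, so the only difference is the weight scale.

```python
import math
import random

def relu_mlp_mean_square(weight_std, width=64, depth=10, seed=0):
    """Mean of h^2 after `depth` bias-free ReLU layers with N(0, weight_std^2) weights."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        h = [
            max(0.0, sum(rng.gauss(0.0, weight_std) * x for x in h))
            for _ in range(width)
        ]
    return sum(x * x for x in h) / width

width = 64
he = relu_mlp_mean_square(math.sqrt(2.0 / width), width)               # var = 2 / n_in
xavier = relu_mlp_mean_square(math.sqrt(2.0 / (width + width)), width)  # var = 2 / (n_in + n_out)
ratio = he / xavier
```

With `n_in = n_out`, Xavier's variance is exactly half of He's, and ReLU is positively homogeneous, so the Xavier signal ends up about 2^depth (here 1024×) smaller in mean square.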
⚠ Clears up: "ReLU is practically linear". ReLU is piecewise-linear but globally non-linear (it bends at zero). That bend is exactly what breaks the collapse to a single matrix multiply. Stacking purely linear layers, with no activation in between, always reduces to one affine map regardless of depth.
◆ Interview probe Why do modern LLMs use SwiGLU instead of GELU in their FFN layers? → SwiGLU adds data-dependent gating: one linear projection is element-wise multiplied by the Swish-activated second projection, letting the network learn which features to pass through. Empirically this lifts quality. Because it requires a third weight matrix, the hidden dimension is cut to ≈2/3 to keep parameter count fair — forgetting this makes the comparison invalid.
Remember   ReLU is the free baseline for CNN/MLP, GELU/SwiGLU for Transformers, and sigmoid/softmax belong at the output (or inside gates), not as hidden-layer activations.
Tricky interview questions (12)
Why do we need non-linear activation functions? What happens if you stack linear layers without one?
Composing linear layers collapses to a single affine map (W₂(W₁x + b₁) + b₂ = Wx + b), so the entire network can only represent linear functions regardless of depth. Non-linearities are what let a deep net approximate arbitrary functions. Trap: Saying ReLU is "linear so it doesn't count" — ReLU is piecewise-linear but globally non-linear (it bends at zero), which is exactly what provides expressivity.
What is the dying/dead ReLU problem and how do you fix it in practice?
If a neuron's pre-activation stays negative for all inputs, ReLU outputs 0 and its gradient is 0, so the unit never updates. Primary cause is a high learning rate spiking the weight update and pushing the bias permanently negative. Fixes: lower LR, use He init, add BatchNorm, or switch to LeakyReLU/PReLU/GELU. Trap: Blaming only initialization — a high learning rate is just as likely the culprit, and people often overlook it.
How do you detect dead neurons during training?
Log the fraction of zero activations per layer across the full batch, or watch activation histograms in TensorBoard/W&B. A unit that is zero for every example in every batch is dead; per-example sparsity is healthy and expected. Trap: Treating any zero activation as a problem — ReLU sparsity is a feature, not a bug; only persistent zero-for-all-inputs signals death.
Why is ReLU preferred over sigmoid/tanh in hidden layers?
Sigmoid and tanh saturate at the extremes (sigmoid derivative peaks at 0.25), causing vanishing gradients in deep nets. ReLU has gradient exactly 1 for positive inputs, preserving signal. It is also cheaper to compute (a max, no exp). Trap: Claiming ReLU "has no gradient problems" — it has the dead-neuron issue and is not zero-centered; it just avoids saturation on the positive side.
Why is tanh preferred over sigmoid when you must use a saturating activation?
Tanh is zero-centered (range -1..1) while sigmoid is all-positive (0..1). All-positive activations force weight gradients into the next layer to share the same sign, causing inefficient zig-zag parameter updates. Zero-centered activations allow gradients of both signs. Trap: Thinking tanh solves vanishing gradients — it is better than sigmoid but still saturates at the extremes; that is why the ReLU family is the default.
Why do Transformers (BERT, GPT) use GELU instead of ReLU?
GELU is smooth and differentiable everywhere, and it allows small negative values to pass weighted by their Gaussian percentile rather than cutting them off at a hard zero. This gives smoother gradients in very deep attention stacks and empirically trains better on NLP benchmarks. Note GELU is more expensive than ReLU and is often evaluated via a tanh approximation. Trap: Overstating a clean theoretical reason — the primary evidence is empirical; do not overfit to a single theoretical story.
What is SwiGLU and why did LLaMA and PaLM adopt it in the FFN?
SwiGLU computes Swish(xW) ⊗ (xV) — one projection is element-wise gated by a Swish-activated second projection, giving data-dependent feature selection. Empirically it improves quality over GELU FFNs. Because it adds a third weight matrix, the hidden dimension is cut to ≈2/3 of the usual 4× width to keep parameter count comparable. Trap: Forgetting the 2/3 hidden-dim adjustment — naively adding the gate inflates FLOPs and parameter count, making comparisons unfair.
How do you choose the output-layer activation for different tasks?
Regression: no activation (linear output). Binary or multi-label classification: sigmoid per output unit. Single-label multi-class: softmax over all logits. The output activation must pair with the correct loss: sigmoid + BCE, softmax + cross-entropy-from-logits. Trap: Using softmax for multi-label problems — softmax forces outputs to compete and sum to 1; multi-label needs independent sigmoids.
Why is naive softmax numerically unstable and how do you stabilize it?
exp() of large logits overflows float32. The fix is to subtract the max logit before exponentiating: softmax(xᵢ) = exp(xᵢ − max x) / Σ exp(xⱼ − max x), which is mathematically identical but overflow-safe. In practice, use the framework's fused log-softmax or cross-entropy-from-logits. Trap: Feeding softmax output into a separate log for the loss. A probability that underflows to 0 turns into log(0) = −inf; the fused kernel stays in log space and avoids this.
What does softmax temperature do, and when would you tune it?
Dividing logits by temperature T before softmax controls distribution sharpness: T < 1 sharpens (more confident, greedy), T > 1 flattens (more uniform, exploratory). Common use cases: LLM sampling (T ≈ 0.7-1.2), knowledge distillation (high T to expose soft targets), calibration experiments. Trap: Confusing temperature with top-k/top-p — temperature reshapes the whole distribution; top-k/top-p truncate the vocabulary tail.
How does activation choice interact with weight initialization?
He/Kaiming init (var = 2/n_in) is designed for the ReLU family because ReLU zeros roughly half its inputs, halving the activation variance — He compensates. Xavier/Glorot (var = 2/(n_in + n_out)) assumes a symmetric, non-zeroing activation and is correct for tanh/sigmoid. Mismatching causes signal decay or explosion and can trigger dead ReLUs from the first forward pass. Trap: Using Xavier with ReLU — it under-scales the weights for ReLU, causing activation variance to decay with depth.
When would you reach for Leaky ReLU, PReLU, or ELU instead of plain ReLU?
Use Leaky ReLU or PReLU when you observe persistent dead neurons — they maintain a small gradient (α ≈ 0.01 fixed or learned) for negative inputs. ELU gives approximately zero-centered outputs and smooth negative saturation, but costs more compute. However, first check LR and initialization; the variant change often masks a tuning issue and rarely moves accuracy much on its own. Trap: Switching activation variants as the first response to dead neurons instead of auditing the learning rate and init.
05
PART I · THE BUILDING BLOCKS

Normalization: BatchNorm, LayerNorm, RMSNorm

🎯 BatchNorm for CNNs, LayerNorm for Transformers: one normalizes down the batch, the other across features.
[Figure: two batch × features grids. BatchNorm normalizes each feature across the batch (good for CNNs / large batches); LayerNorm normalizes each sample across features (good for Transformers / RNNs / any batch size).]
Normalize activations to mean 0, variance 1, then rescale by learned $\gamma,\beta$. BatchNorm uses batch statistics (breaks with small/variable batches; keeps running stats for eval). LayerNorm (and RMSNorm) normalize per-sample, so they are batch-size-independent — the default in Transformers.

Normalization is the layer that tames exploding/vanishing gradients: subtract the mean, divide by std, then hand control back to the network via learned scale γ and shift β. Get the axis wrong and your model either crashes at batch-size 1 or leaks sequence information across samples.

THE UNIVERSAL FORMULA
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \quad y = \gamma \hat{x} + \beta$$
Normalize to zero mean and unit variance, then re-scale and re-center with learned parameters.
γ (scale): Learned per-feature; lets the net recover any desired variance, or even undo normalization.
β (shift): Learned per-feature shift; equivalent to a bias after normalization.
ϵ: Small constant (~1e-5) added under the sqrt; guards against division by zero.
WHICH NORM, WHICH AXIS
| Norm | Stats over | Use when |
| --- | --- | --- |
| BatchNorm | Batch dim, per channel | CNN + large batch (≥32) |
| LayerNorm | Feature dim, per sample | Transformers, RNNs, any batch size |
| RMSNorm | Feature dim, no mean subtraction | LLMs (LLaMA); a cheaper LayerNorm |
| GroupNorm | Feature groups, per sample | Small-batch detection/segmentation |
| InstanceNorm | H×W per channel per sample | Style transfer, GANs |
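The axis difference between the first two rows can be shown on a tiny 2-D array (rows are samples, columns are features). This sketch applies the same normalization along different axes; it assumes γ = 1 and β = 0 and omits running statistics, so it is only the core arithmetic, not a full layer.

```python
import math

def normalize(values, eps=1e-5):
    """(x - mean) / sqrt(var + eps) over one list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(x):
    """Normalize each sample (row) across its features."""
    return [normalize(row) for row in x]

def batch_norm(x):
    """Normalize each feature (column) across the batch."""
    cols = [normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*cols)]

x = [[1.0, 10.0, 100.0],
     [3.0, 30.0, 300.0]]
ln = layer_norm(x)
bn = batch_norm(x)

# With a single sample, batch statistics degenerate: every feature maps to 0.
bn_single = batch_norm([[1.0, 10.0, 100.0]])
ln_single = layer_norm([[1.0, 10.0, 100.0]])
```

The single-sample case is the "crashes at batch-size 1" failure in miniature: BatchNorm erases the input, while LayerNorm behaves exactly as it does for any other batch size.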
BATCHNORM: GREAT POWER, SHARP EDGES

BN normalizes across the batch axis: during training it uses live mini-batch stats and accumulates running EMA estimates; at eval() it freezes to those estimates. This train/eval split is the #1 BN bug source. Other gotchas:

Small batches: Batch mean/var are noisy estimates, so BN becomes unstable. Fix: GroupNorm or SyncBN.
Variable-length / RNNs: Batch stats are ill-defined across timesteps. Fix: LayerNorm.
bias=True before BN: Redundant; BN subtracts the mean, cancelling any preceding bias. Use bias=False.
Regularization: Mini-batch noise is a mild regularizer; heavy dropout right after BN often hurts.
Multi-GPU (DDP): Each GPU normalizes only its shard; use SyncBatchNorm or GroupNorm for small per-GPU batches.
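The train/eval split is easier to reason about with a stripped-down model. The class below is a hypothetical single-feature BatchNorm (no γ/β, biased variance for the running estimate, update rule `running = (1 - momentum) * running + momentum * batch_stat`, which is one common convention). It shows how running statistics that have seen too few steps give badly scaled outputs in eval mode.

```python
import math

class TinyBatchNorm:
    """Single-feature BatchNorm sketch: batch stats in training, running stats in eval."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0
        self.training = True

    def __call__(self, xs):
        if self.training:
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]

bn = TinyBatchNorm()
batch = [8.0, 12.0]              # true mean 10, variance 4

bn(batch)                        # a single training step
bn.training = False
stale = bn([10.0])[0]            # running mean is still near its initial 0

bn.training = True
for _ in range(200):             # let the EMA catch up
    bn(batch)
bn.training = False
warm = bn([10.0])[0]             # now the same input normalizes to about 0
```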
LAYERNORM vs RMSNORM

LayerNorm normalizes over the feature dimension of one sample — no batch dependence, no running stats, identical behavior at train and eval. This is why Transformers default to it.

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2 + \varepsilon}} \cdot \gamma$$
Skip the mean subtraction and β: divide by the RMS only, then scale by a learned gain.

RMSNorm drops re-centering (mean subtraction) and the β bias. Empirically, re-scaling matters more than re-centering; the result is slightly faster, with fewer ops and comparable quality, which is why LLaMA and most modern LLMs use it.
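A direct transcription of the RMSNorm formula for one feature vector, as a sketch (the function name and the sample vector are illustrative). It makes the "no re-centering" point visible: the output has unit mean square but a non-zero mean.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """x / sqrt(mean(x^2) + eps) * gamma, element-wise; no mean subtraction, no beta."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, gamma=[1.0] * len(x))

mean_square = sum(v * v for v in y) / len(y)   # close to 1
mean = sum(y) / len(y)                          # not 0: the input mean is kept
```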

PRE-NORM vs POST-NORM PLACEMENT
| Placement | Position | Practical verdict |
| --- | --- | --- |
| Pre-norm | LN before attention/FFN, inside residual branch | Stable deep training, often no warmup needed; modern default |
| Post-norm | LN after residual add (original Transformer) | Slightly better ceiling but needs careful LR warmup; diverges in deep stacks without it |

Pre-norm keeps the residual path clean (identity shortcut untouched), giving better gradient flow in very deep models.
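The two placements differ only in where the normalization sits relative to the add. In this sketch `sublayer` stands in for attention or the FFN and is an assumption, as is the parameter-free `layer_norm` (γ = 1, β = 0). A branch that outputs zeros makes the "clean identity" property directly observable.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(x, sublayer):
    """y = x + F(LN(x)): the skip path carries x untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_block(x, sublayer):
    """y = LN(x + F(x)): the skip path passes through the norm."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def zero_branch(h):
    return [0.0] * len(h)        # a branch that has learned F = 0

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_block(x, zero_branch)     # exactly x
post = post_norm_block(x, zero_branch)   # LN(x), not x
```

Even when the branch contributes nothing, post-norm still rescales the stream at every block, which is the multiplicative factor on the skip path that pre-norm avoids.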

⚠ Clears up — The train/eval BN accuracy cliff If your model trains well but collapses at model.eval(), blame BatchNorm first: the running mean/variance is stale (too few steps, tiny batch, distribution shift). The fix is not tuning dropout or weights — it is either training longer, increasing batch size, tuning BN momentum, or switching to GroupNorm/LayerNorm entirely.
◆ Interview probe "Does BatchNorm regularize, and does that interact with dropout?" → Yes — mini-batch noise adds stochasticity that acts like mild regularization, so BN models typically need less dropout than non-BN ones. Stacking strong dropout after BN often degrades performance because both mechanisms fight for the same noise budget, and BN's regularization strength changes with batch size.
Remember   BatchNorm normalizes over the batch (great for CNNs, breaks at small/variable batches); LayerNorm normalizes over features per sample (Transformer default, no running stats); RMSNorm is LayerNorm minus mean subtraction (cheaper, LLaMA's choice); default to pre-norm for deep stacks.
Tricky interview questions (12)
Walk me through what BatchNorm does differently at training vs inference time.
During training BN computes mean and variance from the current mini-batch and uses them to normalize, while also updating running EMA estimates. At inference it freezes and uses those running estimates — outputs are deterministic and batch-independent. Learned γ/β are applied in both modes. Trap: Saying BN uses batch stats at inference too — that breaks single-sample inference and makes outputs depend on which other samples are in the batch.
Your CNN trains well but accuracy collapses the moment you call model.eval(). What's the likely cause?
BatchNorm running mean/variance are unreliable — caused by too few training steps to warm the EMA, a tiny batch size, or a train/test distribution mismatch so frozen stats don't match inference data. Fix: train longer, increase batch size, tune BN momentum, or switch to GroupNorm/LayerNorm. Trap: Blaming dropout or weights first — the tell is that only toggling train/eval mode (which flips BN stats and dropout) causes the drop, so BN is the primary suspect.
Why do Transformers use LayerNorm instead of BatchNorm?
LayerNorm normalizes across the feature dimension per token, making it independent of batch size, sequence length, and padding, with identical behavior at train and eval time and no running stats required. BatchNorm's batch-axis statistics are corrupted by variable-length padded sequences and small per-device batches. Trap: Calling LayerNorm "BatchNorm transposed" without noting it removes batch dependence entirely and requires no running estimates.
What's the difference between LayerNorm and RMSNorm, and why do modern LLMs prefer RMSNorm?
RMSNorm skips mean subtraction (re-centering) entirely, dividing activations only by their root-mean-square, then scaling by a learned gain γ (no β). It needs fewer ops than LayerNorm and matches it in quality, because re-scaling invariance drives most of the benefit while re-centering adds little. Trap: Saying RMSNorm computes a different variance. It does not subtract the mean at all, so the denominator is the RMS, not a standard deviation around a mean.
Pre-norm vs post-norm in Transformers — which would you pick for a deep model and why?
Pre-norm (LN before attention/FFN, inside the residual branch) keeps the skip connection as a clean identity path, giving stable gradient flow in deep stacks often without warmup. Post-norm can achieve slightly better final quality but is prone to divergence in deep models without careful warmup. Most modern LLMs use pre-norm. Trap: Treating placement as cosmetic — post-norm without warmup commonly diverges past ~24 layers, which is why pre-norm became the default.
Should you use bias=True on a conv or linear layer immediately before BatchNorm?
No — BN subtracts the batch mean, which cancels any preceding additive bias. Set bias=False on the layer and rely on BN's learned β as the effective shift. This also saves parameters with no accuracy cost. Trap: Leaving bias=True "just in case" — it genuinely has zero effect because BN removes it, so it's wasted memory and compute.
Why does BatchNorm perform poorly with very small batch sizes, and what do you use instead?
Small batches produce high-variance estimates of the mean and variance, making normalization noisy and running stats unreliable. Use GroupNorm (detection/segmentation), LayerNorm, or SyncBatchNorm across GPUs to recover stable statistics. Trap: Just lowering the learning rate — the root issue is statistical estimation quality, not optimization, so GroupNorm/LayerNorm is the real fix.
How do you handle BatchNorm when training across multiple GPUs with small per-GPU batches?
Use SyncBatchNorm to compute mean and variance across all GPUs, so the effective statistics batch is large. Alternatively switch to GroupNorm or LayerNorm, which are batch-size-independent and require no cross-GPU sync. Trap: Assuming standard BN "just works" in DDP — by default each GPU normalizes only its local shard, making stats noisy with small per-GPU batches.
Why does LayerNorm not need running statistics while BatchNorm does?
LayerNorm computes statistics across the feature dimension within a single sample, so those stats are always available from the sample itself at both train and inference time. BN aggregates across the batch, which isn't available for a lone inference example, so it must accumulate running estimates during training. Trap: Saying LayerNorm "also keeps running stats" — it doesn't, which is precisely why it has no train/eval discrepancy.
Does BatchNorm actually reduce internal covariate shift?
That was the original motivation, but Santurkar et al. (2018) showed the dominant effect is smoothing the loss landscape and making gradients more predictable, enabling higher and more stable learning rates. The internal-covariate-shift story is at best incomplete and empirically unverified. Trap: Stating internal covariate shift as the settled mechanism — strong candidates cite the smoother-optimization-landscape explanation as the more defensible one.
What is the ϵ term in normalization and why does its placement matter?
ϵ (typically 1e-5 to 1e-6) is added inside the square root to prevent division by zero when activations have near-zero variance and to improve numerical stability. It must go under the sqrt — placing it outside changes the mathematical behavior and can under-stabilize edge cases. Trap: Putting ϵ outside the sqrt or omitting it — the standard formula is (x-μ) / sqrt(σ² + ϵ), and the ϵ placement is load-bearing.
When would you use GroupNorm or InstanceNorm instead of BatchNorm?
Use GroupNorm when batches are small — object detection, segmentation, video models — since it normalizes within groups of channels per sample with no batch dependence. Use InstanceNorm for style transfer and GANs where per-sample-per-channel normalization removes instance-specific contrast information. Trap: Defaulting to BatchNorm for detection/segmentation where small per-GPU batches make BN statistics unreliable — GroupNorm is the established standard there (e.g., Mask R-CNN).
06
PART I · THE BUILDING BLOCKS

Residual connections & gating

🎯 Residuals give gradients a highway: learn the change, not the whole function.
[Figure: a residual block ("learn the change, keep an identity path"). x → weight → φ → weight → (+) → out, with an identity skip carrying x to the add node; out = x + F(x).]
Use residuals once depth > ~10 layers; they give gradients a direct highway so very deep nets train.
Instead of learning $H(x)$, the block learns the residual $F(x)=H(x)-x$ and outputs $x+F(x)$. The skip path carries gradients unchanged, so depth stops hurting — the core trick behind ResNets and every Transformer block.

Think of a residual block as a scaffold: the elevator shaft (skip) guarantees you can always reach any floor, while the staircase (residual branch) only needs to carry the difference from where you already are. Gating is the door that decides how much of each staircase actually enters the corridor.

THE CORE IDENTITY: RESIDUAL / SKIP
$$\mathbf{y} = \mathbf{x} + F(\mathbf{x})$$
Output = input + learned residual (correction).

The block learns the change F(x), not the full mapping H(x). If the optimal transform is near-identity, driving F→0 is trivial for SGD; learning an exact identity through stacked nonlinearities is not. Use residuals once depth exceeds ~10 layers; they are mandatory in every ResNet and Transformer block.

Degradation problem: Training error rises with more layers in plain nets. This is an optimization failure, not overfitting. The skip makes identity the default, so extra layers need not hurt.
Gradient path: ∂y/∂x = ∂F/∂x + I. The identity term guarantees a gradient floor. Across N stacked blocks the Jacobian is the product ∏(I + ∂Fᵢ/∂x), which expands into a sum of paths that always contains the pure identity term, so the gradient cannot fully vanish.
Dimension mismatch: If skip and branch shapes differ, add a 1×1 conv (strided for spatial) or linear projection on the skip path only. Don't project everywhere; it obstructs the clean identity highway.
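The gradient-path argument can be checked in the scalar case, where each Jacobian is just a number. The sketch below compares the backward factor through a plain stack (a product of branch derivatives) with the factor through a residual stack (a product of `1 + derivative`); the value 0.1 per layer is an arbitrary assumption.

```python
def plain_gradient(branch_derivs):
    """d(out)/d(in) through stacked plain layers: product of the derivatives."""
    g = 1.0
    for d in branch_derivs:
        g *= d
    return g

def residual_gradient(branch_derivs):
    """d(out)/d(in) through stacked residual blocks: product of (1 + derivative)."""
    g = 1.0
    for d in branch_derivs:
        g *= 1.0 + d
    return g

derivs = [0.1] * 20
plain = plain_gradient(derivs)            # about 1e-20: vanished
residual = residual_gradient(derivs)      # about 6.7: signal survives

# Even if every branch derivative is zero, the skip still passes gradient 1.
floor = residual_gradient([0.0] * 20)
```

The same arithmetic shows the flip side: with positive branch derivatives the residual factor grows with depth, which is one reason deep residual stacks are paired with normalization or branch scaling.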
PRE-NORM vs POST-NORM PLACEMENT

Where you put BatchNorm / LayerNorm relative to the skip matters a lot at depth.

| Variant | Formula | Use when |
| --- | --- | --- |
| Post-norm (original Transformer) | y = LN(x + F(x)) | Shallow to medium depth; needs LR warmup; can edge out in peak quality |
| Pre-norm (modern default) | y = x + F(LN(x)) | Deep transformers / LLMs; stable without long warmup; skip stays clean identity |

Pre-norm keeps the residual stream un-normalized so gradients flow cleanly. Gotcha: in very deep pre-norm nets the residual stream variance accumulates (~proportional to depth), so later layers contribute less. Fix with branch scaling: e.g., multiply each branch output by $\sim 1/\sqrt{2N}$, or use ReZero (learnable scalar α initialized to 0 on each branch).
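A back-of-envelope model of that accumulation, under strong assumptions that are not from the source: every branch output has unit variance and is independent of the stream, so variances simply add. The function name and N = 48 are illustrative.

```python
import math

def stream_variance(n_blocks, branch_scale=1.0, input_var=1.0, branch_var=1.0):
    """Residual-stream variance after n_blocks additions of independent branch outputs."""
    var = input_var
    for _ in range(n_blocks):
        var += (branch_scale ** 2) * branch_var
    return var

n = 48
unscaled = stream_variance(n)                                     # grows like 1 + N
scaled = stream_variance(n, branch_scale=1.0 / math.sqrt(2 * n))  # stays O(1)

# Share of the final stream variance contributed by the last block alone.
last_block_share = 1.0 / unscaled
```

In this model the last block accounts for roughly 2% of the unscaled stream's variance, which is the "later layers contribute less" effect; the 1/√(2N) factor keeps the total bounded instead.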

GATING: HIGHWAY, LSTM, GLU/SWIGLU

A gate is a learned sigmoid that decides how much signal passes — a soft, data-dependent skip:

$$\mathbf{y} = T(\mathbf{x})\cdot F(\mathbf{x}) + (1-T(\mathbf{x}))\cdot \mathbf{x}$$
Highway: T = transform gate, (1−T) = carry gate.
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
LSTM cell: forget gate ≈ residual skip through time.
$$\text{GLU}(\mathbf{x}) = (\mathbf{W}_1\mathbf{x}) \odot \sigma(\mathbf{W}_2\mathbf{x})$$
GLU: content path × sigmoid gate; SwiGLU replaces σ with Swish.
Highway vs ResNet: Highway's carry gate can close (→0) and choke gradients; ResNet's identity has weight exactly 1 with zero parameters, and is empirically better past 100+ layers.
LSTM forget gate bias: Initialize to +1 or +2 so the gate opens early; without it long-range gradients vanish before learning begins.
SwiGLU cost: Needs a 3rd projection; shrink the FFN hidden dim to ~2/3 to keep FLOPs matched when benchmarking against a plain FFN.
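The gating formulas above reduce to a few element-wise operations once the linear projections have been applied. In this sketch the inputs (`f_out`, `gate_logits`, `content`, `gate_pre`) are assumed to be the already-projected vectors; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway(x, f_out, gate_logits):
    """y = T * F(x) + (1 - T) * x, with T = sigmoid(gate_logits)."""
    return [sigmoid(t) * f + (1.0 - sigmoid(t)) * xi
            for xi, f, t in zip(x, f_out, gate_logits)]

def glu(content, gate_logits):
    """Content path multiplied by a sigmoid gate."""
    return [c * sigmoid(g) for c, g in zip(content, gate_logits)]

def swiglu(gate_pre, content):
    """Swish(gate_pre) * content, the gate variant used in SwiGLU FFNs."""
    return [(g * sigmoid(g)) * c for g, c in zip(gate_pre, content)]

carried = highway([1.0], [5.0], [-20.0])     # gate closed: output is about x
transformed = highway([1.0], [5.0], [20.0])  # gate open: output is about F(x)
half = glu([2.0], [0.0])                     # sigmoid(0) = 0.5
```

The closed-gate case is the Highway failure mode in miniature: when T saturates the wrong way, one of the two paths is scaled toward zero, whereas a ResNet skip always has weight 1.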
SKIP TOPOLOGY VARIANTS
| Architecture | Skip operation | Best for |
| --- | --- | --- |
| ResNet | Add (identity or 1×1 proj) | Image classification; constant width, parameter-efficient |
| DenseNet | Concat all prior feature maps | Feature reuse, fewer params per layer; memory-hungry |
| U-Net | Concat encoder → decoder at matching resolution | Segmentation; restores spatial detail lost to pooling |
| Transformer block | Add + pre/post-norm | All sequence tasks; two sub-layer skips (attn + FFN) |

Add vs concat: addition mixes into a fixed channel budget (refinement); concatenation preserves all prior features and grows width (reuse). Use concat when fine-grained spatial detail or feature diversity matters and you can afford the memory.

SCALING VERY DEEP NETS WITHOUT NORM

Skips alone don't make arbitrary depth trivially trainable. Options when you want minimal or no normalization:

ReZero: Multiply each residual branch by a scalar α initialized to 0; starts as pure identity, gradually opens. No norm needed.
Fixup: Scale residual-branch weights by a function of depth at init; removes the need for BatchNorm in ResNets.
DeepNet: Scales residuals and uses scaled post-norm; keeps activation variance bounded for 1000-layer transformers.
⚠ Clears up — "skip connections fix vanishing gradients" He et al. (2015) explicitly noted that vanishing gradients were largely addressed by BatchNorm and better initialization; the headline motivation was the degradation problem — training error (not test error) going up as you add more layers. Skip connections fix an optimization problem, not a gradient-norm problem per se.
◆ Interview probe "Your teammate adds 50 more layers and training loss gets worse. Why, and what do you do?" → This is the degradation problem — an optimization failure. Add residual connections so identity is the default, switch to pre-norm, consider branch scaling (ReZero / Fixup). Not a regularization or data problem — do not reach for dropout or more data to fix rising training loss.
Remember   The skip makes identity the free default, so an added layer starts out harmless instead of degrading training. That is what made 100+ layer nets trainable, and, with extra scaling tricks, 1000-layer ones.
Tricky interview questions (12)
What problem do residual connections actually solve — vanishing gradients or something else?
Primarily the degradation problem: plain deep nets get higher training error as depth grows, even though they could in principle match a shallower net. The skip makes identity the default so extra layers can't make things worse. Gradient-flow help is real but secondary. Trap: Saying "they solve vanishing gradients" — He et al. noted BN/init already addressed that; the headline motivation is an optimization problem.
Mechanically, why does y = F(x) + x help gradients flow backward?
Because ∂y/∂x = ∂F/∂x + I. The additive identity term guarantees a gradient floor even when the residual branch's Jacobian goes to zero. Across stacked blocks the gradient becomes a sum containing a near-identity term instead of a long product of small factors. Trap: Forgetting the "+I" is the whole point, or claiming gradients can never vanish — the residual branch itself can still vanish; the skip only guarantees a floor.
Why is learning a residual F(x) easier than learning the full mapping H(x)?
When the optimal transform is near-identity (each layer refines rather than overhauling), driving F→0 is trivial for SGD. Learning an exact identity through stacked nonlinearities is not. The shortcut provides a strong baseline; each block learns a small correction. Trap: Treating it as a math trick — the real point is the optimization landscape; identity is an easy-to-reach default, not a mathematical identity.
Input and output of a residual block have different shapes. How do you implement the skip?
Add a projection shortcut only on the skip path: a 1×1 conv (strided for spatial downsampling) or a linear layer. He et al. found projecting only when shapes mismatch (plain identity otherwise) works best. Trap: Adding projections everywhere "for symmetry" — extra projections on every block add params and obstruct the clean identity path.
What does it mean to keep the shortcut path "clean," and why did He's follow-up paper emphasize it?
Put all norm, activation, and conv on the residual branch, never on the skip path — making the shortcut a pure identity (pre-activation ResNet). This preserves an unobstructed additive gradient highway across all depths, which is why pre-activation ResNets train deeper and better. Trap: Adding a ReLU or BN on the skip — anything that scales or gates the identity reintroduces multiplicative factors and degrades the gradient highway.
What's the difference between a Highway Network gate and a ResNet skip, and why did plain identity win for vision?
Highway: y = T(x)·F(x) + (1−T(x))·x, with learned gates and extra params. ResNet: y = F(x) + x, with a fixed weight-1 identity and zero extra params. The carry gate can close (→0) and choke gradients; the constant identity can't. ResNet also reliably trained past 100+ layers. Trap: Assuming "more learnable = better". Learnable carry gates add a failure mode that the constant identity avoids.
How is an LSTM's cell-state update the same idea as a residual connection?
Cell updates additively: C_t = f_t · C_{t−1} + i_t · C̃_t. With forget gate open (f_t ≈ 1), gradients flow nearly unchanged across timesteps — the "constant error carousel," a skip through time. Trap: Describing all three LSTM gates equally — the forget gate controlling the additive cell path is what gives the residual-like gradient highway.
Why initialize the LSTM forget-gate bias to +1 or +2?
A positive bias makes the gate open (≈1) early in training, so the cell behaves like an identity/skip and long-range gradients propagate from the start. Without it the gate sits near 0.5 or lower and long-range gradients vanish before the model learns to keep them. Trap: Initializing all gate biases to zero — that needlessly cripples early long-range gradient flow.
Pre-norm vs post-norm: when does the choice matter and which should you default to?
Post-norm: LN outside the skip — needs LR warmup, unstable at large depth, may edge out peak quality. Pre-norm: LN inside the branch — skip stays a clean identity, stable without warmup, default for deep LLMs. The tradeoff is stability vs potential final quality. Trap: Saying one is strictly better — pre-norm for depth/stability; post-norm for potentially higher peak performance with careful tuning.
In a deep pre-norm transformer, residual stream variance grows with depth. Why, and how do you fix it?
Each block adds its output to the stream, accumulating variance roughly proportional to depth. Later layers' contributions shrink relative to the large accumulated stream — the net behaves "effectively shallower." Fix: scale residual branches by ~1/√(2N), or use ReZero (learnable α initialized to 0) or DeepNet's scaled post-norm. Trap: Stacking more layers expecting linear gains — without residual scaling, very deep pre-norm nets hit diminishing returns.
DenseNet concatenates instead of adding. What's the practical tradeoff vs ResNet?
ResNet adds (fixed width, parameter-efficient, mixes features into a constant channel budget). DenseNet concatenates all prior feature maps (maximum reuse, strong gradient flow, fewer params per layer) but channel count grows and activation memory balloons. Use addition for constant-width refinement; concatenation when feature diversity matters and you can pay the memory cost. Trap: Thinking concat is "just a bigger ResNet" — they have different inductive biases and memory profiles.
Why does SwiGLU need a hidden-dim reduction, and when is the gating benefit worth it?
GLU/SwiGLU requires a third projection (gate), adding ~50% of the FFN's compute. To compare fairly against a plain FFN, shrink the hidden dim to ~2/3 to keep FLOPs matched. The gating adds multiplicative (higher-order) interactions a plain MLP lacks; empirically it improves transformer quality across scale, making it the default in most modern LLMs. Trap: Benchmarking full-width SwiGLU vs plain FFN and attributing the gain purely to gating — part of the win is just from more compute.
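The 2/3 rule above is simple parameter arithmetic. A minimal sketch, assuming a hypothetical model width of 768, a 4x plain-FFN expansion, and ignoring biases:

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Plain FFN: up-projection + down-projection (biases ignored)."""
    return 2 * d_model * d_hidden

def swiglu_params(d_model: int, d_hidden: int) -> int:
    """SwiGLU FFN: gate + up + down projections (biases ignored)."""
    return 3 * d_model * d_hidden

d_model = 768                              # illustrative width, not from the notes
plain_hidden = 4 * d_model                 # common 4x expansion
swiglu_hidden = (2 * plain_hidden) // 3    # shrink to ~2/3 to match parameters

plain = ffn_params(d_model, plain_hidden)
full_width = swiglu_params(d_model, plain_hidden)   # unfair comparison: 1.5x the weights
matched = swiglu_params(d_model, swiglu_hidden)     # fair comparison: same weight count
```

Matmul FLOPs scale with these counts, so `matched` is the configuration to benchmark against the plain FFN.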
07
PART I · THE BUILDING BLOCKS

Weight initialization

🎯Bad init = a dead or exploding net before step 1. He for ReLU, Xavier for tanh.

Weight initialization is the starting gun for training — get it wrong and every layer either amplifies or smothers the signal before a single gradient step lands. The goal is one thing: keep activation variance roughly constant across depth, so the first forward and backward passes carry useful information everywhere.

THE SYMMETRY PROBLEM

If every weight in a layer is the same constant (even all-zero), every neuron computes the same output and receives the same gradient — they update identically and the layer permanently collapses to one effective neuron. Random asymmetric weights break this. Biases can safely start at zero because the weights are already different.
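A minimal sketch of the symmetry argument on a tiny tanh MLP with hand-written backprop. The network shape, input, and weight values are illustrative assumptions:

```python
import math

def hidden_grads(w_hidden, w_out, x, y):
    """Squared-error gradient w.r.t. each hidden unit's weights in a tiny tanh MLP."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    err = sum(wo * hi for wo, hi in zip(w_out, h)) - y
    return [[2 * err * wo * (1 - hi * hi) * xi for xi in x] for wo, hi in zip(w_out, h)]

x, y = [0.5, -1.0, 2.0], 1.0

# Constant init: every hidden unit starts with the same weights.
constant = hidden_grads([[0.3, 0.3, 0.3]] * 3, [0.3] * 3, x, y)

# Asymmetric init (hand-picked values standing in for random draws).
varied = hidden_grads(
    [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2], [0.2, -0.1, 0.4]],
    [0.3, -0.2, 0.1], x, y,
)

identical_updates = all(g == constant[0] for g in constant)
```

With constant weights every row of `constant` is the same gradient, so the units can never diverge from one another; the asymmetric case gives each unit its own update.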

XAVIER / GLOROT vs HE / KAIMING
Xavier/Glorot: Use with tanh / sigmoid / linear. Averages fan_in and fan_out to balance forward and backward variance.
He/Kaiming: Use with ReLU and its variants (Leaky, ELU, GELU). Uses fan_in only; the factor 2 corrects for ReLU zeroing ~half its inputs.
$$\text{Var}(W) = \dfrac{2}{n_{in}+n_{out}}$$
Xavier: balances forward and backward pass scale
$$\text{Var}(W) = \dfrac{2}{n_{in}}$$
He: the "2" compensates for ReLU killing half the signal

Why not plain $\mathcal{N}(0,1)$? The pre-activation is a sum of $n_{in}$ terms, so its variance grows as $n_{in}$ — saturating or exploding wide layers.

Activation | Init to use | Variance formula
tanh / sigmoid | Xavier/Glorot | $2/(n_{in}+n_{out})$
ReLU / Leaky ReLU | He/Kaiming | $2/n_{in}$
GELU / Swish | He (close enough) | $2/n_{in}$
Linear / none | Xavier | $2/(n_{in}+n_{out})$
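The variance argument can be checked numerically. A pure-Python sketch, assuming a bias-free stack of 10 ReLU layers of width 64 (both numbers are arbitrary choices to keep it fast) and measuring the RMS activation at the output for three weight scales:

```python
import math
import random

def final_rms(weight_std, width=64, depth=10, batch=16, seed=0):
    """RMS of activations after `depth` bias-free ReLU layers with N(0, weight_std^2) weights."""
    rng = random.Random(seed)
    # Unit-variance Gaussian inputs, so the input RMS is about 1.
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    for _ in range(depth):
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        acts = [[max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w] for x in acts]
    total = sum(a * a for x in acts for a in x)
    return math.sqrt(total / (batch * width))

width = 64
he = final_rms(math.sqrt(2.0 / width))                # Var(W) = 2 / n_in
xavier = final_rms(math.sqrt(2.0 / (width + width)))  # Var(W) = 2 / (n_in + n_out)
naive = final_rms(1.0)                                # plain N(0, 1)
```

Under these assumptions `he` should stay within an order of magnitude of 1, `xavier` should shrink by roughly a factor of √2 per layer, and `naive` should blow up by many orders of magnitude.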
SPECIAL CASES: BIASES, OUTPUT LAYER, RESIDUAL NETS, RNNs
Biases: Default zero. Exception: LSTM forget gate bias ≈ 1 (encourages remembering early); ReLU units sometimes get a small positive bias.
Output layer: Small weights, plus a bias set to the log prior: $\log(p/(1-p))$ for a sigmoid output with base rate $p$, or $\log(\text{base rate})$ per class for softmax. The initial loss then sits at the entropy of the class prior, which is $\ln(C)$ for a balanced C-class problem. Skipping this on imbalanced data wastes the first steps just learning the bias and makes the early loss needlessly large.
Residual nets / transformers: Zero-init the last layer of each residual block (BN gamma or final projection) so the block starts as identity. GPT-2 scales residual output projections by $1/\sqrt{N_{layers}}$ to limit variance growth with depth.
RNNs: Orthogonal init for recurrent weights: an orthogonal matrix preserves vector norms under repeated multiplication, preventing vanishing/exploding across many timesteps.
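A small sketch of the output-bias rule for a binary classifier, assuming near-zero output weights (so every initial logit equals the bias) and a made-up base rate of 1% positives:

```python
import math

def bce_with_logit(logit, y):
    """Numerically stable binary cross-entropy on a raw logit: softplus(z) - y*z."""
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - y * logit

def expected_initial_loss(bias, base_rate):
    """Expected BCE at init when every example gets the same logit (the bias)."""
    return (base_rate * bce_with_logit(bias, 1.0)
            + (1.0 - base_rate) * bce_with_logit(bias, 0.0))

base_rate = 0.01                                      # hypothetical: 1% positives
prior_bias = math.log(base_rate / (1.0 - base_rate))  # log-odds of the base rate

loss_zero_bias = expected_initial_loss(0.0, base_rate)
loss_prior_bias = expected_initial_loss(prior_bias, base_rate)
prior_entropy = -(base_rate * math.log(base_rate)
                  + (1.0 - base_rate) * math.log(1.0 - base_rate))
```

With a zero bias the initial loss is ln 2; with the log-odds bias it drops to the entropy of the class prior, which is the best any input-independent prediction can achieve.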
PYTORCH DEFAULTS & OVERRIDING THEM

nn.Linear and nn.Conv2d both default to Kaiming-uniform with a=√5, a legacy variant that does not adapt to your activation. For ReLU networks, apply kaiming_normal_(w, nonlinearity='relu'); for tanh, use xavier_uniform_. The calculate_gain helper in torch.nn.init returns the matching rescaling factor (√2 for ReLU), which you can pass as the gain argument of xavier_uniform_. Fine-tuning: always initialize new heads with small weights so they don't corrupt pretrained features immediately. LoRA zero-inits one matrix so the adapter starts as a no-op.

DOES INIT STILL MATTER WITH BATCHNORM?

BatchNorm rescales activations each layer, making training robust to moderately bad init — a key reason very deep nets became trainable. But init still matters for: the first forward pass, norm-free architectures, residual / output projections, and attention-heavy transformers. Never use "just add BN" as an excuse to skip a sensible init.

Empirical sanity check: after one batch, log per-layer activation and gradient std across depth — they should be roughly flat. Also verify initial loss ≈ $\ln(C)$ for balanced classification. If either is off, fix init before touching the learning rate.

⚠ Clears up — Xavier with ReLU A very common mistake is using Xavier/Glorot with ReLU layers. Because ReLU zeroes ~half its inputs, it halves the activation variance per layer. Xavier's denominator $n_{in}+n_{out}$ doesn't compensate for this, so activations shrink with depth. He init's factor of 2 is the exact fix. If you see activations collapsing to near-zero in deep ReLU nets, check init first.
◆ Interview probe "Why is a 100-layer residual net initialized differently from a shallow one?" → Without identity-preserving init (zero-init the last layer of each block), every residual branch adds randomly scaled noise to the skip path. Those contributions accumulate over 100 blocks, so the output and gradient scale grow with depth, and a learning rate that is safe for a shallow net can diverge or train very slowly. Zero-init makes each block start as a pass-through, so you're effectively training a shallow net at first and gradually engaging depth.
Remember   He init (var = 2/fan_in) for ReLU, Xavier (var = 2/(fan_in+fan_out)) for tanh — and zero-init the last residual layer to start deep nets as identity.
Tricky interview questions 12
Why can't you initialize all weights to zero (or any single constant)?
With identical weights every neuron computes the same output and receives the same gradient, so they update identically and stay identical — the layer collapses to one effective neuron. Biases can safely be zero because the weights are already asymmetric. Trap: saying "zero gives zero gradients" — the real issue is symmetry, and it applies to any constant, not just zero.
What does Xavier/Glorot init do and when should you use it vs He?
Glorot sets weight variance to 2/(fan_in + fan_out), balancing forward and backward variance for symmetric near-linear activations (tanh, sigmoid, linear). He/Kaiming uses 2/fan_in; the factor 2 compensates for ReLU zeroing ~half its inputs. Use Xavier for tanh/sigmoid, He for ReLU and variants. Trap: using Xavier with ReLU — activations shrink with depth because Xavier doesn't account for ReLU's half-zeroing.
Why is the factor 2 in He init not arbitrary?
ReLU zeroes ~half of its inputs, which halves the variance of activations passing through. Without the factor 2, activation variance would decay geometrically with depth. The 2 exactly compensates for that halving, keeping variance stable across a deep ReLU stack. Trap: treating it as a rule-of-thumb — it is a precise analytical correction.
Why not just use standard normal N(0,1) weights?
The pre-activation is a sum of fan_in weighted inputs, so its variance scales as fan_in — huge for wide layers, causing saturation or explosion. Proper init scales variance as ~1/fan_in (Xavier/He) so the weighted sum stays O(1) regardless of width. Trap: saying "it's convention" — the variance-of-sum argument is the actual math.
How do you diagnose vanishing/exploding activations caused by bad init?
Run one forward/backward pass on a batch and log per-layer activation and gradient standard deviations across depth. Good init keeps them roughly constant; shrinking stds signal vanishing, growing stds signal explosion. Also check that the initial loss ≈ ln(C) for a balanced C-class classifier. Trap: reaching for BatchNorm or gradient clipping before confirming init is the cause via statistics.
If BatchNorm is present, does initialization still matter?
Much less — BN rescales activations every layer, which is why very deep nets first became trainable. But init still matters for the very first forward pass, norm-free architectures, residual output projections, and attention-heavy transformers where sensitivity persists. Trap: claiming init is completely irrelevant with BN — early steps and norm-free models remain sensitive.
How should you initialize the output (final) layer of a classifier?
Use small weights so logits start near zero (predictions near uniform), and set the output bias to log(base_rate) so the initial loss is close to ln(C) for balanced problems, or to the entropy of the class prior under imbalance. Starting with large logits saturates softmax and causes tiny early gradients. Trap: ignoring class imbalance. For rare positives, a zero output bias starts the loss far above the prior-entropy baseline, and the first steps are spent only on learning the bias.
How do you initialize a very deep residual net or transformer to train stably?
Zero-init the last layer of each residual block (the final BN gamma or output projection) so each block starts as the identity and the network trains depth gradually. GPT-2 style also scales residual output projections by 1/sqrt(num_layers) to bound variance growth. Trap: treating a 100-layer net like a shallow one. Residual noise from every block adds up, so without identity-preserving init the activation and gradient scale grow with depth and a shallow-net learning rate becomes unsafe.
What is orthogonal initialization and when is it useful?
Orthogonal init sets the weight matrix to an orthogonal matrix (Q from QR decomposition), which preserves vector norms exactly under multiplication — so gradients and activations neither grow nor shrink across many steps. Most valuable for RNN recurrent weights and very deep near-linear stacks. Trap: treating it as universally better — once BatchNorm or residuals stabilize training, orthogonal init adds little over He/Xavier.
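A pure-Python sketch of the norm-preservation property, assuming a linear recurrence h ← Qh with no nonlinearity. It builds Q by Gram-Schmidt on a random Gaussian matrix, a stand-in for the QR-based construction mentioned above; size and step count are arbitrary:

```python
import math
import random

def random_orthogonal(n, seed=0):
    """Matrix with orthonormal rows, via Gram-Schmidt on random Gaussian vectors."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(2):  # a second pass tightens numerical orthogonality
            for q in rows:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        length = math.sqrt(sum(a * a for a in v))
        rows.append([a / length for a in v])
    return rows

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def norm(x):
    return math.sqrt(sum(a * a for a in x))

q = random_orthogonal(16)
h = [1.0] * 16
start_norm = norm(h)
for _ in range(100):        # 100 "timesteps" of the linear recurrence
    h = matvec(q, h)
end_norm = norm(h)
```

The norm after 100 steps matches the starting norm up to floating-point error. A real RNN adds a nonlinearity, so the preservation is only approximate there.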
What are PyTorch's defaults for nn.Linear and nn.Conv2d, and when should you override?
Both use Kaiming-uniform with a=sqrt(5), a legacy He variant that does not adapt to your actual activation. For ReLU networks, apply kaiming_normal_ with nonlinearity='relu'; for tanh use xavier_uniform_. Override whenever your activation is not what the default scale happens to suit, or the model is deep and sensitive. Trap: assuming the default auto-adapts to your activation. It is a fixed formula regardless of the nonlinearity you attach.
How does initialization interact with learning rate choice?
Init sets the starting scale of activations and gradients, so it directly determines what learning rate is safe and effective — a badly scaled init may require 10-100× smaller LR to avoid divergence, or waste warmup steps on near-zero gradients. Proper variance-scaled init is part of what lets standard LRs with linear warmup work out of the box. Trap: shrinking the learning rate to paper over bad init instead of fixing the init — symptoms look similar but the root cause differs.
Does initialization matter when fine-tuning a pretrained model?
The pretrained backbone stays as-is; only newly added layers (heads, adapters) need careful init — usually small weights so they don't immediately distort pretrained representations. LoRA zero-inits one of its low-rank matrices so the adapter starts as a no-op and is engaged gradually. Trap: reinitializing the pretrained backbone, or giving new heads large random weights that corrupt useful pretrained features in the first few steps.
08
PART I · THE BUILDING BLOCKS

Loss functions

🎯Cross-entropy for classes, MSE for numbers — every loss is just a negative log-likelihood.
[Figure: Why cross-entropy beats MSE for classification. x-axis: predicted prob of true class (0 to 1); y-axis: loss (0 to 4). Curves: cross-entropy −log p and squared error (1−p)². Annotation: confident & wrong → huge CE gradient.]
Match the loss to the output: cross-entropy for classification (Bernoulli/softmax MLE — strong gradient when confidently wrong), MSE for regression, Huber for robust regression, label smoothing to stop over-confidence, focal loss for heavy class imbalance.

Your loss function is not just a training signal — it is a probabilistic statement about how you believe your data was generated. Pick the wrong one and the model optimizes the wrong thing, no matter how good the architecture.

THE GOLDEN RULE

Every common loss is a negative log-likelihood. Choosing a loss = choosing a noise/label model. MSE assumes Gaussian noise; MAE assumes Laplacian; CE assumes a categorical distribution. Start there and the formula follows automatically.

$$\mathcal{L} = -\log p_\theta(y \mid x)$$
NLL: minimize this and you maximize the probability of your labels under your model
REGRESSION LOSSES — PICK BY OUTLIER TOLERANCE
Loss | Best for | Gotcha
MSE / L2 | Gaussian noise, clean labels, smooth gradients | One bad outlier can dominate the whole batch
MAE / L1 | Outlier-heavy data, targets the median | Constant ±1 gradient slows convergence near 0
Huber | Best of both: MSE for $|e|\leq\delta$, MAE beyond | Extra hyperparameter; δ must be tuned to the error scale
$$L_\delta(e)=\begin{cases}\tfrac{1}{2}e^2 & |e|\leq\delta \\ \delta|e|-\tfrac{1}{2}\delta^2 & \text{otherwise}\end{cases}$$
Huber: quadratic near zero, linear in the tails
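The Huber formula translates directly into code. A minimal sketch, with δ = 1 as an arbitrary default:

```python
def huber(e, delta=1.0):
    """Huber loss on a residual e: quadratic inside |e| <= delta, linear outside."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * a - 0.5 * delta * delta

def huber_grad(e, delta=1.0):
    """dL/de: equals e in the quadratic zone, clipped to +/- delta in the tails."""
    return max(-delta, min(delta, e))

small = huber(0.5)             # quadratic zone
large = huber(3.0)             # linear zone
outlier_pull = huber_grad(100.0)
```

The clipped gradient is the point: an outlier with residual 100 pulls on the model no harder than one with residual δ.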
CLASSIFICATION LOSSES — AND THE LOGIT RULE
Binary CE: $-[y\log p + (1-y)\log(1-p)]$; sigmoid output; use BCEWithLogitsLoss (fused, numerically stable)
Categorical CE: $-\log p_{\text{true}}$; softmax output; pass raw logits to CrossEntropyLoss
Why not MSE for classification?: MSE gradient through sigmoid/softmax gets multiplied by the activation derivative, which vanishes when the model is most wrong. CE gradient w.r.t. logits is simply $\hat{p}-y$: large when confidently wrong.
Double-softmax bug: Applying softmax before CrossEntropyLoss. It trains but is miscalibrated and slow; classic code-review trap
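To make the "vanishes when most wrong" point concrete, here is a sketch comparing the two gradients with respect to the logit for a single sigmoid output. The logit value −8 is an arbitrary example of a confidently wrong prediction:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_ce(z, y):
    """d/dz of binary cross-entropy on sigmoid(z): simply p - y."""
    return sigmoid(z) - y

def grad_mse(z, y):
    """d/dz of (sigmoid(z) - y)^2: carries the saturating factor p(1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

z, y = -8.0, 1.0          # model says "almost surely 0", label is 1
ce_signal = grad_ce(z, y)
mse_signal = grad_mse(z, y)
```

Cross-entropy returns a gradient of magnitude close to 1, while the squared-error gradient is below 0.001 at the same point, so MSE barely corrects the worst mistakes.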

Sanity-check at init: for a $C$-class classifier, untrained CE ≈ $\ln C$ (0.69 for binary, 2.30 for 10 classes, 6.9 for 1000). If you're far off, you have a bug.

CALIBRATION, IMBALANCE & SPECIAL LOSSES
Label smoothing: Target becomes $(1-\varepsilon)$ for the true class and $\varepsilon/(K-1)$ elsewhere ($\varepsilon \approx 0.1$). Reduces overconfidence, improves calibration. Do not use on a distillation teacher: it destroys the inter-class logit structure ("dark knowledge") the student needs.
Focal loss: $(1-p_t)^\gamma \cdot \text{CE}$. γ (default 2) down-weights easy examples; α balances class frequency. Use for heavy imbalance with many easy negatives (e.g., object detection). γ=0 reduces to CE.
Class-weighted CE: Simple, but extreme weights destabilize training. Pair with threshold tuning; evaluate with PR-AUC/F1, never raw accuracy on imbalanced sets.
Dice loss: Optimizes region overlap (F1-like); robust to foreground/background imbalance; unstable on near-empty masks. Combine CE + Dice for segmentation (nnU-Net pattern).
Triplet / contrastive / InfoNCE: Embeddings & self-supervised learning. Contrastive: pairs with margin. Triplet: anchor/pos/neg with margin. InfoNCE: one positive vs many in-batch negatives (SimCLR); benefits from large batches. Shared pitfall: easy negatives give zero gradient under margin losses and near-zero gradient under InfoNCE, so hard-negative mining (or more negatives) matters.
KL divergence: $D_{KL}(p\|q) = \sum p\log(p/q)$. Used in distillation (teacher vs student soft targets) and VAEs. CE = H(p) + KL(p‖q); with one-hot targets H(p)=0 so CE = KL = NLL. With soft targets (distillation), KL is the right object.
CTC: Unaligned sequence labeling (speech, OCR). Marginalizes over all valid alignments so no frame-level label is required.
$$\text{Focal}(p_t) = -(1-p_t)^\gamma \log(p_t)$$
Focal loss: the $(1-p_t)^\gamma$ term crushes easy examples' contribution
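A minimal sketch of the focal term without the α class weight, comparing how much more a hard example counts than an easy one under plain CE and under focal loss. The probabilities 0.95 and 0.10 are illustrative:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss for the probability p_t assigned to the true class (no alpha weight)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy, hard = 0.95, 0.10

ce_ratio = math.log(hard) / math.log(easy)            # CE(hard) / CE(easy)
focal_ratio = focal_loss(hard) / focal_loss(easy)     # same ratio under focal loss
plain_ce = focal_loss(0.3, gamma=0.0)                 # gamma = 0 recovers cross-entropy
```

With γ = 2 the hard example's relative weight grows by a further factor of (0.9/0.05)² = 324 over plain CE, which is how many easy negatives stop dominating the batch.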

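Distillation loss: $\mathcal{L} = \alpha \cdot \text{CE}(y_\text{hard}) + (1-\alpha)\cdot T^2 \cdot \text{KL}(\text{softmax}(z_\text{teacher}/T) \| \text{softmax}(z_\text{student}/T))$, where $T$ is the temperature and $z_\text{teacher}$, $z_\text{student}$ are the teacher and student logits. Multiply by $T^2$ because softening by $T$ shrinks soft-target gradients by $\sim 1/T^2$; omit it and distillation barely contributes at high temperature.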
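The 1/T² shrinkage can be seen in the analytic gradient of the KL term with respect to the student logits, which is (q − p)/T for softened student and teacher distributions q and p. A sketch with made-up three-class logits:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_grad_norm(teacher_logits, student_logits, temperature, rescale):
    """L2 norm of d KL(teacher || student) / d student_logits at the given temperature."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    scale = temperature ** 2 if rescale else 1.0      # the T^2 correction
    grad = [scale * (qi - pi) / temperature for pi, qi in zip(p, q)]
    return math.sqrt(sum(g * g for g in grad))

teacher = [2.0, 1.0, 0.1]      # hypothetical logits
student = [1.0, 1.5, 0.5]

raw_ratio = kd_grad_norm(teacher, student, 10.0, False) / kd_grad_norm(teacher, student, 20.0, False)
fixed_ratio = kd_grad_norm(teacher, student, 10.0, True) / kd_grad_norm(teacher, student, 20.0, True)
```

Doubling the temperature cuts the raw gradient by roughly 4x, while the T²-rescaled gradient stays nearly constant, so the soft-target term keeps its weight relative to the hard-label term.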

⚠ Clears up — NaN loss & training silent failures

NaN loss: almost always log(0) from a manual softmax+log, or exploding gradients from a high LR. Fix: use fused logit losses (BCEWithLogitsLoss, CrossEntropyLoss), add gradient clipping, assert no NaN in inputs. Loss won't decrease at all: run the overfit-one-batch test first. If a single batch won't reach ~0 loss, you have a wiring bug (outputs detached from the graph, loss on wrong tensor, frozen params), not an LR problem. This single test separates graph bugs from optimization bugs instantly.

◆ Interview probe Why pass logits rather than probabilities to the loss? → Fusing softmax/sigmoid with log into a single op (log-sum-exp trick) avoids exp() overflow and log(0)=-inf underflow. The math is equivalent but numerically stable.
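A pure-Python sketch of why the fused form is safer. The naive version exponentiates raw logits; the stable version subtracts the max first (the log-sum-exp trick). The logit 1000 is an extreme value chosen to force the failure:

```python
import math

def cross_entropy_naive(logits, target):
    """Softmax then log: math.exp overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

def cross_entropy_stable(logits, target):
    """Log-sum-exp form: shift by the max so every exponent is <= 0."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

try:
    cross_entropy_naive([1000.0, 0.0], target=1)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_loss = cross_entropy_stable([1000.0, 0.0], target=1)
uniform_loss = cross_entropy_stable([0.0] * 10, target=3)   # untrained 10-class model
```

The stable version returns a finite loss on the same input, and on all-equal logits it reproduces the ln(C) sanity value.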
Remember   Every loss is a noise model in disguise — choose it by asking "what distribution do I assume over my targets?" and the formula writes itself.
Tricky interview questions 12
Why do we use cross-entropy instead of MSE for classification?
Softmax + CE gives a clean gradient of $\hat{p} - y$ w.r.t. logits — large when the model is confidently wrong. MSE through a sigmoid/softmax multiplies by the saturating activation derivative, so gradients vanish exactly when the model is most wrong. CE also directly minimizes the NLL of the true class. Trap: Saying "CE is for classification, MSE for regression" without the vanishing-gradient/saturation mechanism — that's what interviewers actually want.
What loss should you expect at initialization, and why check it?
An untrained C-class classifier outputs roughly uniform probabilities, so expected CE ≈ ln(C): ~0.69 (binary), ~2.30 (10 classes), ~6.9 (1000 classes). If your initial loss is far off, you have a bug: bad init, wrong logit scale, mislabeled targets, or a bad final-layer bias. Trap: Panicking when the initial loss is below ln(C) on imbalanced data after you initialized the final-layer bias to the log class priors. That is expected behavior, not a bug.
Why pass raw logits to the loss instead of applying softmax/sigmoid first?
Fusing softmax/sigmoid with the log via the log-sum-exp trick avoids exp() overflow and log(0) = -inf underflow. Use BCEWithLogitsLoss and CrossEntropyLoss, never pre-apply the activation. Trap: Applying softmax before CrossEntropyLoss — it still trains, but is miscalibrated and slow: a classic code-review gotcha that's easy to miss.
When would you use focal loss, and what do gamma and alpha each control?
Use focal loss under extreme class imbalance with many easy negatives (dense object detection). Gamma (focusing, default ~2) down-weights easy, well-classified examples so training focuses on hard ones; gamma=0 reduces to CE. Alpha is a per-class balancing weight that handles class frequency. Trap: Conflating the two knobs — gamma handles easy-vs-hard difficulty; alpha handles class frequency imbalance. They are independent and both needed.
What is label smoothing, when does it help, and when does it hurt?
Label smoothing replaces one-hot targets with (1-ε) for the true class and ε/(K-1) elsewhere (ε ≈ 0.1), reducing overconfidence and improving calibration and generalization. It hurts when the model will be a distillation teacher — it collapses the inter-class logit structure ("dark knowledge") the student needs to learn. Trap: Recommending label smoothing for a teacher model in a distillation pipeline — it erases the relative-logit information that makes distillation valuable.
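A short sketch of the smoothed target vector described above, using the (1−ε, ε/(K−1)) convention from these notes. Some libraries instead spread ε over all K classes, so check which convention yours uses:

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Smoothed target: 1 - eps on the true class, eps / (K - 1) on every other class."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off_value for k in range(num_classes)]

target = smooth_labels(true_class=2, num_classes=5)
```

The vector still sums to 1, so it remains a valid distribution for cross-entropy.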
MSE vs MAE vs Huber — when do you pick each for regression?
MSE (assumes Gaussian noise): smooth gradients, outlier-sensitive, targets the mean. MAE (assumes Laplace): outlier-robust, targets the median, but constant ±1 gradient converges slowly near the optimum and is non-smooth at 0. Huber: quadratic for small errors, linear beyond delta — robust like MAE but smooth like MSE. Trap: Saying "MAE is always better because it's robust" — the constant gradient slows convergence near the optimum, and Huber's delta must be tuned to the actual error scale.
How do you handle class imbalance through the loss, and what are the tradeoffs vs resampling?
Class-weighted CE / pos_weight in BCE (simple but extreme weights destabilize training), focal loss (down-weights easy examples, more hyperparameters), or resampling (oversampling risks overfitting duplicates, undersampling discards data). Evaluate with PR-AUC/F1, not accuracy, and tune inference threshold separately. Trap: Stacking class weights on top of oversampling — you double-count the correction and overshoot the minority class.
Your loss goes to NaN during training. What's your debugging checklist?
Likely causes: LR too high (exploding gradients), log(0) from an unstable manual softmax+log, division by zero in a custom loss, or NaN/inf in inputs/targets. Fixes: switch to fused logit losses, add gradient clipping, lower LR, normalize inputs, assert no NaNs in data pipeline. Trap: Jumping straight to lowering LR — the root cause is often a numerically unstable custom loss or NaN in data, not the optimizer.
Training loss won't decrease at all. What is your checklist?
First: try to overfit a single batch to near-zero loss. If that fails, you have a wiring bug: predictions detached from the graph, loss computed on the wrong tensor, no gradient flow, or frozen parameters. Only once a single batch overfits should you check LR, normalization, data pipeline, and model capacity. Trap: Skipping the overfit-one-batch test and going straight to hyperparameter tuning. You'll spend hours on the wrong problem.
Explain the relationship between cross-entropy, KL divergence, and NLL.
H(p,q) = H(p) + KL(p‖q). With one-hot labels H(p)=0, so minimizing CE = minimizing KL = minimizing NLL — all three are identical. When the target is itself a distribution (soft labels, distillation), the entropy term is non-constant and KL is the correct object. Trap: Claiming CE and KL are always interchangeable — they differ by the target entropy term, which is only constant for one-hot targets.
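The identity is easy to verify numerically. A sketch with arbitrary three-class distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [0.5, 0.3, 0.2]            # model prediction
soft = [0.7, 0.2, 0.1]         # soft target, e.g. a teacher distribution
one_hot = [1.0, 0.0, 0.0]      # hard label

gap_soft = cross_entropy(soft, q) - kl(soft, q)           # equals H(soft), not zero
gap_one_hot = cross_entropy(one_hot, q) - kl(one_hot, q)  # equals H(one_hot) = 0
```

For the one-hot target the gap is zero, so CE, KL, and NLL coincide; for the soft target the gap is the target entropy, which is constant with respect to the model but changes the reported loss value.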
How is the knowledge-distillation loss built, and why scale by temperature squared?
The distillation loss is a weighted sum of KL divergence between teacher and student soft targets (both at temperature T) and standard CE on hard labels. Softening by T shrinks the soft-target gradients by ~1/T², so the distillation term is multiplied by T² to restore gradient magnitude parity with the hard-label term. Trap: Omitting the T² rescaling — the soft-target gradient vanishes at high temperature and distillation contributes almost nothing to training.
For segmentation, when do you use Dice loss vs cross-entropy, and why combine them?
Pixel-wise CE is dominated by the large background class. Dice loss (1 minus the F1/Dice overlap) directly optimizes region overlap and handles foreground/background imbalance — but has unstable gradients on near-empty masks. Combining CE (or focal) and Dice gives stable per-pixel supervision plus overlap optimization. This is the nnU-Net default. Trap: Using pure Dice loss and hitting zero/unstable gradients on near-empty masks, or assuming CE alone is sufficient for heavily imbalanced segmentation tasks.
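A minimal soft Dice sketch on flattened masks. The additive `smooth` term is a common convention assumed here, not something these notes specify; it is one way to keep the loss defined on empty masks:

```python
def soft_dice_loss(probs, targets, smooth=1.0):
    """1 - Dice overlap between predicted probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

perfect = soft_dice_loss([1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0])
empty = soft_dice_loss([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
missed = soft_dice_loss([0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```

With `smooth=0` the empty-mask case divides zero by zero, which is the instability on near-empty masks; pairing Dice with a per-pixel CE term keeps a usable gradient there.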
09
PART I · THE BUILDING BLOCKS

Regularization: dropout, weight decay, augmentation

🎯Dropout trains an ensemble for free; weight decay keeps the weights humble.
Dropout: randomly zero units each step (train only)
At test time keep all units but scale by the keep probability $1{-}p$ (or use inverted dropout in training).
Dropout zeros each unit with probability $p$ every step, so the net can't rely on any single neuron; it trains an implicit ensemble and reduces overfitting. Typical $p$: 0.5 for dense layers, 0.1 in Transformers. Off at inference.

Regularization is the art of making a model forget just enough to generalize — every technique below is a different way of injecting controlled noise or constraint so the net can't memorize its way to low training loss. Overfit = high-variance: reach for these knobs, in this order.

DROPOUT — IMPLICIT ENSEMBLE IN ONE PASS

Randomly zero each unit with probability $p$ during training. Modern practice: inverted dropout — surviving units are scaled by $\frac{1}{1-p}$ at train time, so inference runs the full network with no adjustment needed.

$$\tilde{h} = \frac{h \cdot \text{mask}}{1-p}$$
Surviving unit divided by the keep probability at train time; mask is Bernoulli(1-p)
Typical $p$ (FC layers): 0.5, standard for large dense layers after activations
Typical $p$ (Transformers): 0.1–0.3; larger $p$ tends to underfit
Inference: Dropout OFF; call model.eval() or accuracy is garbage
Placement: After activations on high-capacity layers; NEVER on the output layer
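A pure-Python sketch of inverted dropout on a flat list of activations, with p as the drop probability. The function name and the explicit `training` flag are illustrative, not a framework API:

```python
import random

def inverted_dropout(h, p, training, rng):
    """Zero each unit with probability p and rescale survivors by 1/(1-p); identity at eval."""
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)
h = [1.0] * 100_000
train_out = inverted_dropout(h, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(h, p=0.5, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
```

The training-time mean stays close to the original activation, which is why inference needs no rescaling.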
WEIGHT DECAY — AND WHY YOU NEED AdamW

Penalize large weights by adding $\frac{\lambda}{2}\|w\|^2$ to the loss (L2 regularization). For vanilla SGD this is equivalent to weight decay, i.e. shrinking every weight by a factor $(1-\eta\lambda)$ each step. With Adam, this equivalence breaks: the L2 gradient $\lambda w$ is divided by the per-parameter second-moment term $\sqrt{\hat{v}}$, so weights with historically large gradients are under-regularized. AdamW fixes this by decoupling decay:

$$w \leftarrow w(1-\eta\lambda) - \eta\,\dfrac{\hat{m}}{\sqrt{\hat{v}}+\epsilon}$$
Weight shrink is applied directly, independent of Adam's adaptive scaling; $\hat{m}$ and $\hat{v}$ are Adam's bias-corrected moments of the loss gradient alone
L2-Adam (wrong): Decay diluted by $\hat{v}$; common silent bug
AdamW (correct): Uniform regularization; default choice for Transformers
Exclude from decay: Biases, BN/LayerNorm gain & shift; use per-group settings
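A toy single-weight simulation of the difference, assuming a loss gradient that alternates ±grad_scale (zero mean, so any net shrinkage comes from regularization or Adam's warm-up transient). The hyperparameters are arbitrary illustrative values:

```python
import math

def train(grad_scale, decoupled, steps=500, lr=0.01, wd=0.01,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one weight whose loss gradient alternates +/- grad_scale."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale if t % 2 else -grad_scale    # zero-mean loss gradient
        if not decoupled:
            g += wd * w                             # L2 penalty folded into the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        if decoupled:
            w -= lr * wd * w                        # AdamW: decay outside the adaptive step
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

l2_big, l2_small = train(10.0, decoupled=False), train(0.1, decoupled=False)
adamw_big, adamw_small = train(10.0, decoupled=True), train(0.1, decoupled=True)
```

Under L2-in-Adam the weight with large gradients barely shrinks while the small-gradient weight is decayed much harder; under the decoupled update both end at essentially the same value.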
L1 vs L2 — SPARSITY vs SHRINKAGE
L2 gradient: $\lambda w$ (for the penalty $\frac{\lambda}{2}\|w\|^2$); shrinks proportionally, rarely hits exactly zero
L1 gradient: $\lambda \cdot \text{sign}(w)$ (for the penalty $\lambda\|w\|_1$); constant push, drives small weights to exact zero
Want | Use
Uniform shrinkage, smooth loss | L2 / weight decay
Feature selection, sparse weights | L1 (Lasso)
Both | Elastic Net
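A sketch of the two penalties acting alone on one small weight, with no data gradient. The L1 step uses soft-thresholding (the proximal update), a standard way to implement the constant push without overshooting zero; the step size and λ are arbitrary:

```python
def l2_step(w, lr, lam):
    """Gradient step on (lam / 2) * w^2: multiplicative shrink."""
    return w - lr * lam * w

def l1_step(w, lr, lam):
    """Soft-thresholding step for lam * |w|: constant shrink, clipped at exactly zero."""
    shrink = lr * lam
    if abs(w) <= shrink:
        return 0.0
    return w - shrink if w > 0 else w + shrink

w_l2 = w_l1 = 0.05
for _ in range(100):
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)
    w_l1 = l1_step(w_l1, lr=0.1, lam=0.1)
```

After 100 steps the L1 weight is exactly 0.0, while the L2 weight has only decayed geometrically and is still nonzero.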
DATA AUGMENTATION & LABEL-LEVEL TRICKS

For vision, augmentation is usually the strongest regularizer. It needs no extra labels and costs nothing at inference, only some data-loading compute at train time.

Standard vision: Random flips, crops, color jitter
Mixup: $x = \lambda x_i + (1-\lambda)x_j$, same mix on labels; $\lambda \sim \text{Beta}(\alpha,\alpha)$
CutMix: Paste a patch from image $j$ into image $i$; mix labels by patch area
Label smoothing: Targets become $(1-\varepsilon)$ for the true class and $\varepsilon/(K-1)$ elsewhere; improves calibration, hurts distillation
Text augmentation: Back-translation, synonym swap, token masking
Bad augmentation: Class-altering transforms (vertical flip or 180° rotation on digits 6/9, rotation on orientation-critical medical images)
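A minimal mixup sketch on plain Python lists, using the standard library's Beta sampler. The feature vectors and α = 0.4 are illustrative assumptions:

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Blend two examples and their one-hot labels with the same lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

rng = random.Random(0)
x_mix, y_mix, lam = mixup(
    [0.0, 1.0, 2.0], [1.0, 0.0],     # example i, class 0
    [4.0, 5.0, 6.0], [0.0, 1.0],     # example j, class 1
    alpha=0.4, rng=rng,
)
```

The mixed label is a soft target carrying weight λ on class 0 and 1−λ on class 1, so the loss must accept probability targets rather than class indices.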
EARLY STOPPING, BatchNorm & STOCHASTIC DEPTH
Early stopping: Stop when val loss has not improved for `patience` epochs; restore best checkpoint. Cheap implicit regularizer
BatchNorm side-effect: Mini-batch mean/var noise ≈ weak dropout; why BN'd CNNs often skip dropout. Primary job is stable optimization, not regularization
Stochastic depth: Drop entire residual blocks at random at train time; the drop rate usually increases with depth; common in deep ResNets and vision transformers
More data: Always the best regularizer if you can get it
Symptom | Reach for
High train/val gap (overfit) | Augmentation → weight decay → dropout → early stop → more data
High train loss (underfit) | Reduce regularization, more capacity, train longer
Mismatched BN stats at eval | Call model.eval() so BN uses its running statistics and dropout is disabled
⚠ Clears up — Dropout before BatchNorm Never place dropout before a BN layer. Dropout changes activation variance at train time; at inference dropout is off, so BN's stored running statistics no longer match the true distribution — a "variance shift" that silently degrades accuracy. If you use both, put dropout after the last BN, right before the final classifier.
◆ Interview probe "L2 regularization and weight decay are the same thing — agree?" → Only true for SGD. With Adam, L2-in-loss is divided by the per-parameter second moment, so large-gradient weights are under-regularized. AdamW decouples the decay step from adaptive scaling, restoring uniform regularization — that's the entire reason it exists.
Remember   Match the tool to the symptom: overfit → augmentation first, then weight decay (AdamW, not L2-in-Adam), then dropout (inverted, OFF at eval) — and never put dropout before BatchNorm.
Tricky interview questions 12
How does dropout behave differently at training vs. inference time, and how is the scaling handled?
At training, neurons are zeroed with probability $p$ and surviving activations are scaled by $1/(1-p)$ (inverted dropout); at inference dropout is fully off and the full network runs unscaled, so expected activation magnitude matches across phases. Trap: Saying weights are scaled by the keep probability $1-p$ at inference. That is the old non-inverted convention; modern frameworks use inverted dropout. Also forgetting model.eval(), which leaves dropout on during evaluation and destroys accuracy.
Weight decay and L2 regularization are "the same" — when is that false, and why does AdamW exist?
They are equivalent only for vanilla SGD. With Adam, the L2 gradient $\lambda w$ is divided by the square root of the per-parameter second-moment estimate $\hat{v}$, so high-gradient weights are decayed less than intended; AdamW decouples decay by subtracting $\eta\lambda w$ directly from the weights, restoring uniform regularization across all parameters. Trap: Claiming they are always identical, or that AdamW just "tunes $\lambda$ differently." The real point is decoupling decay from adaptive per-parameter learning-rate scaling.
What is the practical difference between L1 and L2 regularization, and why does L1 produce sparsity?
L2 shrinks weights proportionally toward zero but rarely reaches exactly zero; L1's gradient has constant magnitude $\lambda \cdot \text{sign}(w)$ and keeps pushing small weights all the way to zero, giving sparse solutions. The geometry: the L2 constraint region is a circle (no corners), the L1 region is a diamond whose sharp corners sit on the axes, so the optimum often lands on a corner where some weights are exactly zero. Trap: Saying "L1 is just more aggressive L2." The distinction is exact zeros vs. uniform shrinkage, which is a qualitatively different outcome: feature selection vs. weight shrinkage.
Your model has low training loss but much higher validation loss. Walk through your diagnosis and fix.
The train/val gap signals overfitting. Start with the cheapest high-leverage knobs: add/strengthen data augmentation and weight decay, add dropout, use early stopping. If the gap persists, reduce capacity or get more data. If even training loss is high, it is underfitting — increase capacity or reduce regularization. Trap: Jumping straight to architecture changes. First confirm it is really overfitting (not a leaky/mismatched val set or a bad LR), then reach for data/augmentation/regularization before touching the model.
Is batch normalization a regularizer? Why or why not?
Mildly yes — each example's normalization depends on the random mini-batch's mean and variance, injecting stochastic noise into activations at train time, similar in spirit to dropout. This is why heavily BN'd CNNs (e.g., ResNets) often need little or no dropout. Trap: Saying BN's main purpose is regularization. Its primary role is stabilizing and accelerating optimization; the regularization is a side effect that shrinks as batch size grows (larger batches → less noise → less regularization).
Should you combine dropout and batch normalization in the same block? What goes wrong?
Avoid placing dropout before BN: dropout changes activation variance at train time, but at inference dropout is off, so BN's running statistics no longer match the actual distribution — a "variance shift" that silently hurts accuracy. If you use both, put dropout after the last BN layer, e.g., right before the final classifier. Trap: Assuming stacking both always helps. Ordering matters critically; many CNNs use BN alone, while Transformers pair dropout with LayerNorm, which has no train/inference statistic mismatch.
How do you choose a dropout rate and where do you place dropout in a network?
Typical rates are ~0.5 on large fully-connected layers and ~0.1–0.3 on inputs or convolutional features; place it on high-capacity FC layers after activations, never on the output layer. Treat it like any regularizer — raise it if overfitting persists, lower it (or remove it) if training is unstable. Trap: Applying heavy dropout (0.5) everywhere, including conv layers and small networks. Too much dropout on a small or already-regularized model causes underfitting and very noisy gradient estimates.
What is label smoothing, when does it help, and when can it hurt?
Label smoothing replaces hard 0/1 targets with $(1-\varepsilon)$ for the true class and $\varepsilon/(K-1)$ for the others, discouraging overconfident logits and improving calibration and generalization in many-class classification. It hurts when you need sharp confidence for downstream tasks, and especially in knowledge distillation, where a teacher trained with smoothed targets loses the soft inter-class logit structure the student relies on. Trap: Treating label smoothing as universally beneficial. It degrades distillation quality and can hurt tasks where well-separated, confident logits are the learning signal.
Explain mixup/CutMix as regularization and what hyperparameter controls them.
Mixup trains on convex combinations of input pairs and their labels ($x = \lambda x_i + (1-\lambda)x_j$, same for $y$), with $\lambda \sim \text{Beta}(\alpha,\alpha)$; CutMix instead pastes a rectangular patch from one image into another and mixes labels proportionally by patch area. Both smooth decision boundaries and improve calibration; $\alpha$ controls mixing strength: small $\alpha$ pushes $\lambda$ toward 0 or 1, keeping blends near the originals, while large $\alpha$ concentrates $\lambda$ near 0.5, producing near-equal mixtures. Trap: Forgetting that labels must be mixed too (not just inputs), and that very large $\alpha$ can over-smooth and severely slow convergence.
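The mixup rule can be sketched in a few lines of plain Python. This is an illustration only: the function name `mixup`, the toy feature vectors, and the seed are assumptions, and a real pipeline would apply the same λ to whole tensors.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Blend two examples AND their one-hot labels with the same lambda."""
    lam = rng.betavariate(alpha, alpha)    # lambda ~ Beta(alpha, alpha)

    def mix(a, b):
        return [lam * u + (1 - lam) * v for u, v in zip(a, b)]

    return mix(x_i, x_j), mix(y_i, y_j), lam


rng = random.Random(0)                     # fixed seed, illustration only
x, y, lam = mixup([0.0, 1.0, 2.0], [1.0, 0.0],
                  [4.0, 5.0, 6.0], [0.0, 1.0],
                  alpha=0.2, rng=rng)
```

The mixed label `y` is `[lam, 1 - lam]`, which is the part that is easy to forget when only the inputs are blended.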
How does early stopping work and what are its pitfalls in practice?
Monitor a validation metric and stop training when it fails to improve for a set patience number of epochs, then restore the best checkpoint. It is a free implicit regularizer that also saves compute. Trap: Setting patience too small (stops on noisy validation fluctuations); not saving and restoring the best checkpoint (restoring final weights instead); or deciding on a non-representative or leaky validation set, which causes premature or incorrect stopping.
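A small sketch of the patience logic, assuming a lower-is-better validation loss. `EarlyStopper` and the validation curve are made up for illustration; in a real loop the checkpoint would be saved where `best_epoch` is updated.

```python
class EarlyStopper:
    """Tracks the best validation loss and signals when patience runs out."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta         # smaller improvements are ignored
        self.best_loss = float("inf")
        self.best_epoch = None             # checkpoint to restore afterwards
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch        # save the checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation curve: improves, then jitters.
val_losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.69]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

With patience 3 the run stops at epoch 5 and restores epoch 2, never reaching the later 0.69. That is the "patience too small" pitfall in miniature.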
Should weight decay be applied to all parameters, including biases and normalization gains?
No — standard practice excludes biases and BN/LayerNorm scale and shift parameters from weight decay, applying it only to weight matrices and convolution kernels. Decaying these smaller parameters has negligible benefit and can interfere with their intended optimization dynamics. Trap: Setting a single global weight_decay that silently decays biases and norm gains — a subtle bug that can degrade results, especially in Transformers. Use per-parameter-group weight decay settings.
When is data augmentation a bad idea, and how do you pick augmentations?
Choose only label-preserving transforms that reflect real test-time variation. Bad augmentations change the semantic label: 180° rotation on digits (a 6 becomes a 9), horizontal flips on text or characters, rotation on orientation-sensitive medical images, or shuffles that break temporal order in time series. Augmentation also gives diminishing returns when data is already abundant and diverse. Trap: Applying a generic vision recipe blindly across domains. Class-altering or distribution-shifting augmentations inject label noise and can hurt far more than help, so always sanity-check augmented samples visually.
10
PART II · TRAINING

Optimizers & learning-rate schedules

🎯Adam to move fast, SGD+momentum to finish sharp — and always warm up the learning rate.
[Figure: Learning-rate schedule: warmup + cosine decay. x-axis: training step; y-axis: learning rate; marker: end of warmup.]
Warmup (ramp the LR up over the first few hundred to few thousand steps) prevents early divergence while the weights are still random; cosine decay then anneals the LR smoothly to ~0 for a clean finish. The single most important hyperparameter is the peak LR, so tune it first.
[Figure: Optimizers on an ill-conditioned loss. Axes: w₁ (flat direction), w₂ (steep). Trajectories: SGD (zig-zags), + momentum, Adam (rescaled).]
When curvature is uneven, plain SGD zig-zags across the steep direction while crawling along the flat one. Momentum averages successive gradients to damp the oscillation and accelerate down the valley; Adam/RMSProp rescale each coordinate by its own gradient history, taking near-straight steps to the minimum.

An optimizer is the engine that converts gradients into weight updates — pick the wrong one or botch the learning rate schedule and your model either never converges or stalls in a mediocre minimum. Think of it as choosing the right gear for the terrain: Adam is your all-terrain truck, SGD+momentum is a finely tuned racing car.

THE OPTIMIZER FAMILY TREE
  • SGD: Raw gradient step. Noisy but generalizes well in vision with good tuning.
  • +Momentum: EWMA of past gradients (β≈0.9); rolls through ravines, damps oscillations.
  • Nesterov: Look-ahead momentum; computes the gradient at the anticipated next position. Slightly tighter convergence.
  • Adagrad: Per-parameter LR scaled by accumulated squared gradients. Great for sparse features, but the LR decays to zero permanently, which is fatal for deep nets.
  • RMSprop: Adagrad + EWMA of squared grads (fixes the dying-LR problem). Effectively Adam without momentum.
  • Adam: RMSprop + first-moment momentum + bias correction. Fast, robust default.
  • AdamW: Adam + decoupled weight decay. The modern Transformer default.
  • LAMB/LARS: Layer-wise trust-ratio scaling. Designed for extreme batch sizes (≥32k).
ADAM INTERNALS & HYPERPARAMETERS
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$
first moment (momentum) and second moment (uncentered variance) EWMAs
$$\hat m_t = \frac{m_t}{1-\beta_1^t} \quad \hat v_t = \frac{v_t}{1-\beta_2^t}$$
bias-corrected estimates; they matter most early on (the β₁ correction fades within ~50 steps, the β₂=0.999 correction over the first few thousand)
$$\theta \leftarrow \theta - \alpha \frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}$$
parameter update; ε≈1e-8 default, raise to 1e-4 in mixed precision

Defaults: β₁=0.9, β₂=0.999, ε=1e-8. Lower β₂ to ~0.98 for noisier/transformer training. ε is not just a divide-by-zero guard — a larger ε damps adaptivity and acts as a stabilizer.
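The three update equations above translate almost line for line into code. This is a scalar, framework-free sketch; the function name `adam_step` and the example gradient are assumptions for illustration.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment EWMA
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment EWMA
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = adam_step(theta=1.0, grad=5.0, m=0.0, v=0.0, t=1)
first_step = 1.0 - theta               # very close to lr, whatever the gradient scale
uncorrected_ratio = m / math.sqrt(v)   # raw m / sqrt(v) at t = 1, without correction
```

With bias correction the first step has size ≈ lr regardless of the gradient's magnitude. Without it, the raw ratio m/√v at t=1 is 0.1/√0.001 ≈ 3.16, so the first step would be about three times larger than intended.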

AdamW vs Adam+L2: L2 adds λw to the gradient, then divides by √v — high-gradient params get less decay. AdamW subtracts αλw directly from weights, applying decay uniformly. They coincide for SGD but NOT for Adam.
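A back-of-the-envelope sketch of why the two differ. It assumes √v̂ is roughly the parameter's gradient RMS and ignores momentum and the λw term's own contribution to v, so treat the numbers as orders of magnitude, not exact optimizer behavior. Both function names are hypothetical.

```python
def l2_decay_through_adam(w, grad_rms, lr, lam, eps=1e-8):
    """Approximate per-step shrink when lam * w is added to the gradient.

    The decay term is divided by sqrt(v_hat), approximated here by grad_rms.
    """
    return lr * lam * w / (grad_rms + eps)


def decoupled_decay(w, lr, lam):
    """AdamW-style shrink: lr * lam * w, independent of gradient scale."""
    return lr * lam * w


lr, lam, w = 1e-3, 1e-2, 1.0
high_grad = l2_decay_through_adam(w, grad_rms=10.0, lr=lr, lam=lam)
low_grad = l2_decay_through_adam(w, grad_rms=0.01, lr=lr, lam=lam)
adamw = decoupled_decay(w, lr, lam)
```

Two weights of the same size receive roughly 1000× different effective decay under Adam+L2 purely because of their gradient scales, while the decoupled version shrinks both by the same amount.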

WHEN TO USE WHAT
Situation | Reach for
Transformer / NLP / multimodal | AdamW + warmup + cosine
CNN vision, established recipe exists | SGD + momentum + step/cosine decay
Sparse features / embeddings | Adam (per-param LR handles sparsity)
Huge batches (≥32k) | LAMB or LARS
Memory-constrained large models | 8-bit Adam, Adafactor, or ZeRO/FSDP sharding
Fast prototyping, unknown domain | AdamW (low tuning needed)
LR SCHEDULES — THE PEAK LR IS THE #1 HYPERPARAMETER
  • Warmup: Ramp LR from ~0 over first 500–2000 steps. Lets second-moment v stabilize before taking large steps. Essential for Adam + Transformers.
  • Cosine decay: Smoothly anneals to ~0. Modern default. Requires committing to a total step budget up front.
  • Step decay: Drop by factor (e.g., ×0.1) at fixed milestones. Simple; standard in older vision pipelines.
  • One-cycle: Warm up to max LR, then decay sharply. Fast convergence in limited budgets.
  • LR finder: Sweep LR over a mini-run; pick value just before loss explodes.
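Linear warmup followed by cosine decay fits in one small function. This is a sketch under stated assumptions: the name `lr_at`, the `min_lr` floor, and the example step counts are illustrative choices, not a prescribed recipe.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


peak, warmup, total = 3e-4, 1000, 10000
start_lr = lr_at(0, peak, warmup, total)
end_of_warmup_lr = lr_at(warmup - 1, peak, warmup, total)
midway_lr = lr_at((warmup + total) // 2, peak, warmup, total)
final_lr = lr_at(total, peak, warmup, total)
```

Note that `total_steps` has to be known up front, which is the commitment cosine decay asks for.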

Batch size ↔ LR: Linear scaling rule — multiply LR by k when batch × k. Add warmup. At extreme scales switch to sqrt(k) or LAMB.

Gradient clipping pairs naturally: clip-by-global-norm (max norm 1.0) rescales the whole gradient vector, preserving direction. Essential for RNNs and transformer training stability. Constant clipping = LR too high.
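A plain-Python sketch contrasting the two clipping styles. Gradients are represented as a list of per-tensor lists, and both helper names are hypothetical; frameworks ship their own utilities for this.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale ALL gradients by one shared factor so the global norm <= max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    factor = min(1.0, max_norm / (total_norm + eps))
    return [[g * factor for g in tensor] for tensor in grads], total_norm


def clip_by_value(grads, limit):
    """Clamp each element independently to [-limit, limit]."""
    return [[max(-limit, min(limit, g)) for g in tensor] for tensor in grads]


grads = [[3.0, 0.0], [0.0, 40.0]]          # two "tensors", global norm ~40.1
by_norm, norm_before = clip_by_global_norm(grads, max_norm=1.0)
by_value = clip_by_value(grads, limit=1.0)

ratio_norm = by_norm[0][0] / by_norm[1][1]     # still 3/40: direction preserved
ratio_value = by_value[0][0] / by_value[1][1]  # becomes 1: direction distorted
```

Logging `norm_before` every step is also the cheapest way to see whether the clip threshold is being hit constantly.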

MEMORY COST & DEBUGGING

Adam stores m and v — roughly 2× parameter memory, more with fp32 master copies. For LLMs, optimizer state often exceeds model weight memory. Mitigations: 8-bit Adam (bitsandbytes), Adafactor (factorized v), ZeRO-1/2/3, FSDP.
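The arithmetic is easy to run for a concrete model size. This sketch assumes a common mixed-precision layout (16-bit working weights, fp32 master weights, fp32 m and v) and deliberately leaves out gradients and activations; the function name and the 7B example are illustrative.

```python
def adam_training_state_gib(num_params, weight_bytes=2, master_bytes=4, moment_bytes=4):
    """Rough memory (GiB) for weights and Adam state, excluding grads/activations."""
    gib = 1024 ** 3
    return {
        "weights": num_params * weight_bytes / gib,          # bf16/fp16 copy
        "fp32_master": num_params * master_bytes / gib,      # fp32 master weights
        "adam_m_and_v": 2 * num_params * moment_bytes / gib, # two fp32 moments
    }


state = adam_training_state_gib(7_000_000_000)   # a 7B-parameter model
optimizer_side = state["fp32_master"] + state["adam_m_and_v"]
```

Under these assumptions the optimizer-side state is six times the size of the 16-bit weights, which is why sharding or compressing it pays off first.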

Divergence checklist: LR too high → lower it; missing warmup → add it; no clipping → add it; mixed-precision instability → raise ε; late divergence → check schedule or weight decay.

⚠ Clears up — Adam vs SGD generalization Adam converges faster but can land in sharper minima; SGD's gradient noise biases it toward flatter, better-generalizing minima. That is why well-tuned SGD+momentum sometimes beats Adam on vision benchmarks. AdamW narrows the gap. For NLP/transformers the story reverses — Adam-family usually wins on both speed and quality.
◆ Interview probe "Why doesn't adding L2 regularization to Adam work the same as weight decay?" → Because Adam divides the gradient (including the L2 term) by √v, so high-gradient parameters receive proportionally less regularization. AdamW decouples the decay step, applying it uniformly outside the adaptive scaling, which is why it generalizes better and is the standard for transformers.
Remember   Peak LR is your most powerful knob — always pair Adam/AdamW with linear warmup and cosine decay, decouple weight decay with AdamW, and reserve well-tuned SGD+momentum for vision when you can afford the tuning.
Tricky interview questions 12
When would you use AdamW over SGD+momentum, and vice versa?
Use AdamW as the low-tuning robust default for transformers, NLP, attention models, and sparse gradients. Use SGD+momentum when following an established vision/CNN recipe where it is known to generalize slightly better. Trap: Saying Adam is always better because it converges faster — lower training loss does not guarantee better validation accuracy.
Why does Adam need bias correction, and when does it actually matter?
m and v are initialized to zero, so early estimates are biased toward zero; dividing by (1−β^t) corrects this and keeps the effective step size sensible from the start. It matters most early on: the β₁=0.9 correction is within 1% of 1 after ~50 steps, while with β₂=0.999 the v correction is still about 1.6× at step 1,000 and only becomes negligible after a few thousand steps. Trap: Claiming it matters throughout all of training. Both correction terms approach 1 and become negligible once t is several times 1/(1−β).
What is the difference between AdamW and Adam+L2 regularization, and why does it matter?
Adam+L2 adds λw to the gradient and then divides by √v, so high-gradient parameters receive less effective decay. AdamW decouples decay by subtracting αλw directly from weights, applying it uniformly regardless of gradient scale. AdamW generalizes better and is the standard for transformers. Trap: Assuming L2 and weight decay are equivalent — they coincide only for vanilla SGD, not for adaptive optimizers.
What do β₁, β₂, and ε control in Adam, and when would you change the defaults?
β₁ (0.9) controls first-moment momentum decay, β₂ (0.999) controls second-moment variance decay, and ε (1e-8) guards the denominator. Lower β₂ to ~0.98 for noisier or transformer training; raise ε to 1e-4 in mixed precision to prevent instability. Trap: Treating ε as purely numerical. A larger ε damps adaptivity and acts as a stabilizer: for parameters whose √v is small relative to ε, the update approaches m/ε, which is SGD with momentum.
Explain the progression Adagrad → RMSprop → Adam: what limitation does each fix?
Adagrad scales LR per-parameter by accumulated squared gradients (great for sparse), but the ever-growing denominator causes LR to decay monotonically to zero. RMSprop replaces the sum with an EWMA so the LR stops vanishing. Adam adds a momentum first moment on top of RMSprop plus bias correction on both moments. Trap: Forgetting Adagrad's fatal flaw — the non-decaying accumulator, which is exactly what RMSprop's exponential moving average fixes.
Why is learning rate warmup important, especially for transformers with Adam?
Early in training, the second-moment estimate v is high-variance (very few gradient samples), so the adaptive step size can be erratically large and destabilize training. Warmup ramps the LR from ~0 over the first 500–2000 steps to let the moment estimates stabilize. Trap: Saying warmup is just "being safe at the start" without explaining the specific Adam mechanics — the high variance in v before enough gradient statistics accumulate is the precise cause.
How does batch size relate to learning rate, and where does the linear scaling rule break down?
Larger batches produce lower-variance gradients, so LR can scale linearly (LR × k when batch × k) while keeping per-example update statistics roughly constant. It breaks down at very large batch sizes (≥ a few thousand) where you need warmup and may need to switch to sqrt(k) scaling or LAMB. Trap: Scaling LR linearly to extreme batches without warmup — this causes divergence because the early high-variance phase is amplified by the large step.
When and why do you use gradient clipping, and which type?
Use it when loss spikes or NaNs appear, common in RNNs and transformers. Clip-by-global-norm (e.g., max norm 1.0) is preferred because it rescales the entire gradient vector while preserving its direction, unlike clip-by-value which distorts direction. Trap: Using clipping as a cure-all — if you're constantly hitting the clip threshold, the LR is probably too high; clipping is a safety net, not a substitute for proper LR tuning.
Your training loss diverges to NaN. Walk through the optimizer/LR debugging checklist.
First lower the LR (most common cause). Then add or extend warmup, enable gradient clipping, raise ε if using mixed precision, and verify loss scaling. If it diverges only late in training, suspect the decay schedule or excessive weight decay. Trap: Immediately blaming the architecture — the overwhelmingly common cause is an LR too high or missing warmup/clipping, not model structure.
Why might Adam generalize worse than SGD on vision tasks, and how do you close the gap?
Adam's per-parameter adaptive steps can converge to sharper minima; SGD's gradient noise biases toward flatter minima that generalize better. Switching to AdamW (decoupled decay), tuning the LR schedule carefully, or switching to SGD for fine-tuning can close the gap. Trap: Stating it as a universal law — AdamW substantially narrows the gap, and for NLP/transformer tasks Adam-family often generalizes better than SGD.
How much extra memory does Adam use, and what do you do when optimizer states don't fit?
Adam stores m and v per parameter — roughly 2× the model parameter memory (more with fp32 master copies in mixed precision training). Mitigations: 8-bit Adam (bitsandbytes), Adafactor (factorized second moment), ZeRO-1/2/3 optimizer state sharding, or FSDP. Trap: Forgetting optimizer state when GPU memory budgeting — for large models, m + v + fp32 master weights can easily exceed the model weights themselves.
Your validation loss plateaus early even though training loss keeps decreasing. What optimizer/schedule levers do you try?
A flat validation loss under a still-falling training loss is a generalization gap, so start with the regularization levers: increase weight decay, decay the LR more aggressively, or stop at the best checkpoint. If both curves flatten together, suspect the schedule instead: check whether the LR is decaying too fast (schedule ends too early), reduce weight decay if over-regularized, try a higher peak LR or longer warmup, or switch from step decay to cosine. Trap: Only tuning model architecture or data augmentation and ignoring the schedule. An early plateau is frequently just the LR reaching near-zero too soon, starving the optimizer of useful updates.
11
PART II · TRAINING

The training loop: batch size, gradient accumulation, clipping, mixed precision

🎯Out of memory? Accumulate gradients, go bf16, checkpoint activations — same math, less RAM.
[Figure: Gradient accumulation = a big batch on a small GPU. Micro-batches 1–4 each run forward + backward and add to the accumulated gradient (+= grad); optimizer.step() runs once after N micro-batches.]
Don't call step() every micro-batch: accumulate N grads, step once → effective batch = N × micro-batch.
Run several small "micro-batches," sum their gradients, and take one optimizer step — you get the statistics of a large batch without its memory. Pair it with mixed precision (bf16) and gradient checkpointing to fit big models.

The training loop is a pipeline of small decisions that compound into 10x memory savings, 2x speed, and the difference between a stable run and a NaN at step 47. Master the knobs and you can squeeze a 7B-parameter model onto a single 40 GB GPU.

THE LOOP SKELETON

Every step: forward → loss → loss.backward() → clip_grads → optimizer.step() → optimizer.zero_grad(). Calling zero_grad at the end of a step is equivalent to calling it at the start of the next one; the only difference is that calling it at the start keeps the previous step's gradients around for inspection in between. Log the gradient norm every step; it's your pulse monitor.

BATCH SIZE & THE LINEAR SCALING RULE

Bigger batches = more stable gradient estimates and better GPU utilization, but they push toward sharp minima that generalize worse and eat memory fast. The standard heuristic: scale LR linearly with batch size and add warmup.

$$\text{LR}_{\text{new}} = \text{LR}_{\text{base}} \times \frac{B_{\text{new}}}{B_{\text{base}}}$$
multiply LR by the same factor you multiplied the batch

The rule breaks down above a critical batch size — past that, use sqrt-scaling or just cap the LR. Always add a warmup ramp (few hundred–few thousand steps) to let Adam's variance estimates stabilize before the full LR kicks in.

GRADIENT ACCUMULATION

Run k micro-batches, do not call zero_grad between them, divide each micro-batch loss by k, step once. Effective batch = k × micro-batch. Free large-batch simulation with no extra memory — but BatchNorm stats are still computed per micro-batch, not over the effective batch, which is a silent accuracy leak.

$$\mathcal{L}_{\text{step}} = \frac{1}{k}\sum_{i=1}^{k}\mathcal{L}_i \quad\Rightarrow\quad \text{one optimizer.step()}$$
accumulate, then step
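The equivalence can be checked numerically on a toy problem. This sketch uses a one-parameter model y = w·x with a mean-squared-error loss and equal-size micro-batches (with unequal sizes the simple 1/k weighting is no longer exact); all names and data are made up.

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full_batch_grad = mse_grad(w, xs, ys)      # one big batch of 8

k = 4                                      # 4 micro-batches of 2
micro = len(xs) // k
accumulated = 0.0                          # plays the role of param.grad
unscaled = 0.0
for i in range(k):
    chunk = slice(i * micro, (i + 1) * micro)
    g = mse_grad(w, xs[chunk], ys[chunk])
    accumulated += g / k                   # loss divided by k before backward
    unscaled += g                          # what happens if you forget the 1/k
```

`accumulated` matches the full-batch gradient, while `unscaled` is k times too large, which silently multiplies the effective learning rate by k.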
GRADIENT CLIPPING & MIXED PRECISION
  • Clip by global norm: rescales the whole gradient vector, preserving direction; threshold ~1.0 for transformers
  • Clip by value: clips each element independently, which distorts gradient direction; avoid
  • fp16: 5 exponent / 10 mantissa bits. Fast but narrow range; requires loss scaling to prevent gradient underflow
  • bf16: 8 exponent / 7 mantissa bits. Same range as fp32, no loss scaling needed; prefer on A100/H100/TPU
  • Dynamic loss scaling: starts with a large scale factor, halves it (skipping the step) on any Inf/NaN, raises it after a clean streak

Critical gotcha: unscale gradients before clipping and the optimizer step, or you're clipping on the wrong scale.
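The ordering (unscale, check for Inf/NaN, then clip) can be sketched without any framework. `DynamicLossScaler` and `prepare_grads` are hypothetical names for illustration, not a library API; the initial scale and growth interval are assumed values.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, double after a clean streak."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def unscale(self, scaled_grads):
        """Divide out the loss scale and report whether any value is Inf/NaN."""
        grads = [g / self.scale for g in scaled_grads]
        return grads, any(not math.isfinite(g) for g in grads)

    def update(self, found_bad):
        if found_bad:
            self.scale *= 0.5              # back off after an overflow
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps >= self.growth_interval:
                self.scale *= 2.0          # grow again after a clean streak
                self.clean_steps = 0


def clip_global_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    factor = min(1.0, max_norm / (norm + 1e-6))
    return [g * factor for g in grads]


def prepare_grads(scaler, scaled_grads, max_norm=1.0):
    """Unscale first, then clip; return None when the step must be skipped."""
    grads, found_bad = scaler.unscale(scaled_grads)
    scaler.update(found_bad)
    if found_bad:
        return None
    return clip_global_norm(grads, max_norm)


scaler = DynamicLossScaler()
# True gradient (3, 4) has norm 5; backward produced it multiplied by the scale.
ok = prepare_grads(scaler, [3.0 * scaler.scale, 4.0 * scaler.scale])
skipped = prepare_grads(scaler, [float("inf"), 1.0])
```

Clipping the scaled gradients directly would compare a norm of about 5 × 65536 against a threshold of 1.0 and shrink every update by that factor.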

MEMORY TRICKS TOOLKIT
Technique | What it saves | Cost
Gradient accumulation | Peak activation memory per step | None (slower wall-clock)
Mixed precision (bf16/fp16) | ~2× activation + weight memory | Minimal (bf16 ≈ free)
Activation checkpointing | Activation memory (huge for long seqs) | +20-30% compute
ZeRO / FSDP | Optimizer state + params across GPUs | Communication overhead
EMA of weights | No memory savings; better eval stability | Extra weight copy

Activation checkpointing recomputes activations during backward — it only helps when activations are your OOM bottleneck. If optimizer state dominates, reach for ZeRO/FSDP instead.

SANITY CHECKS & STABILITY DRILLS

Before scaling up: (1) verify init loss ≈ $-\log(1/C)$ for C-class softmax; (2) overfit a single batch to near-zero loss; if you can't, the bug is in the pipeline, not the data volume; (3) watch gradient norms for the first 100 steps. BatchNorm is dangerous at small batch sizes (< 8); switch to LayerNorm or GroupNorm. For reproducibility: seed Python/NumPy/torch, set torch.backends.cudnn.deterministic=True and call torch.use_deterministic_algorithms(True).
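Check (1) is a one-liner to verify. The sketch below assumes an untrained classifier emits roughly equal logits, so the softmax is uniform and the cross-entropy equals log(C); the function name and class count are illustrative.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_z - logits[target]


num_classes = 10
init_loss = cross_entropy([0.0] * num_classes, target=3)  # uniform prediction
expected = math.log(num_classes)                           # = -log(1/C)
```

If the measured initial loss is far from `expected`, suspect the initialization, the label encoding, or the loss reduction before touching anything else.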

⚠ Clears up — NaN Loss Debugging Order

The instinct is to add gradient clipping immediately, but clipping is a symptom suppressor. Diagnose first: print per-step gradient norms to pinpoint the blow-up step, then check LR (too high?), loss inputs (log(0)?), and data (NaN/Inf?). For fp16, check for overflow and switch to bf16 or enable dynamic loss scaling. Add clipping only once you understand why gradients explode.

◆ Interview probe Q: You increase your batch size 8× to speed up training. What do you change and what can go wrong? → Scale LR ~8× with warmup (linear scaling rule), verify that BatchNorm stats aren't silently wrong under gradient accumulation, and watch for sharp-minima generalization degradation. Past the critical batch size, linear scaling breaks — use sqrt-scaling or cap the LR. Generalization on the test set is the final arbiter.
Remember   Gradient accumulation, mixed precision, and activation checkpointing are a three-layer stack for fitting big models on small GPUs — but each solves a different memory bucket, so match the tool to the bottleneck.
Tricky interview questions 12
Your training loss goes to NaN after a few steps. Walk through how you debug it.
Print the per-step gradient norm to find exactly when it blows up, then check root causes in order: LR too high (cut 10×), log(0) or divide-by-zero in the loss (add epsilon), NaN/Inf inputs (scan the data), and fp16 overflow (switch to bf16 or enable dynamic loss scaling). Add gradient clipping only after you understand the cause. Trap: jumping straight to clipping — it masks the symptom but a bad LR or log(0) is usually the real root.
You can't fit your desired batch size in GPU memory. What are your options and their tradeoffs?
Gradient accumulation (no quality loss, free), mixed precision bf16 (~2× memory savings), activation checkpointing (+20-30% compute, large activation savings), ZeRO/FSDP (shards optimizer state + params across GPUs), or genuinely reduce the batch. Combine as needed. Trap: thinking gradient accumulation is identical to true large-batch — BatchNorm stats are still computed per micro-batch, not the effective batch, which can quietly hurt accuracy.
When you increase batch size by k, what happens to the learning rate and why?
Scale LR roughly linearly (multiply by k), following Goyal et al., "Accurate, Large Minibatch SGD", combined with warmup to handle early instability when the LR is large. The rule breaks down past a critical batch size; sqrt-scaling or LR saturation works better beyond that. Trap: assuming linear scaling holds indefinitely. It doesn't, and skipping warmup causes early divergence.
What is the difference between fp16 and bf16 for mixed-precision training? When do you need loss scaling?
fp16 has 5 exponent / 10 mantissa bits — more precision but narrow range, so small gradients underflow; loss scaling is required. bf16 has 8 exponent / 7 mantissa bits — same range as fp32, less precision, but typically no loss scaling needed. Prefer bf16 on A100/H100/TPU. Trap: saying mixed precision is always safe — fp16 silently drops gradients without loss scaling, and on old hardware without Tensor Cores you may see no speedup.
Explain dynamic loss scaling. What is the critical implementation detail when combined with gradient clipping?
Loss scaling multiplies the loss by a large factor before backprop so small fp16 gradients stay representable, then unscales before the optimizer step. Dynamic loss scaling raises the factor after a clean streak and halves it (skipping that optimizer step) when Inf/NaN appear. Trap: forgetting to unscale before clipping — clipping on scaled gradients applies the wrong threshold and silently throttles updates.
How does BatchNorm behave differently in train vs eval mode, and why does it matter?
In training, BN normalizes using the current mini-batch mean/variance and updates running stats; in eval it uses stored running statistics for deterministic, batch-independent outputs. Always call model.eval() at inference. Trap: forgetting model.eval() — predictions then depend on the batch composition and degrade badly at batch size 1.
Your model trains fine but performs poorly with small batch sizes. What's likely wrong?
BatchNorm statistics become noisy and unreliable at small batches (e.g., 2-8). Switch to GroupNorm or LayerNorm, which don't depend on batch size, or use synchronized BN across devices. Trap: blaming the learning rate — the real issue is that BN's per-batch statistics are unstable for tiny batches.
Gradient clipping by value vs. by global norm — which do you prefer and how do you pick the threshold?
Prefer clip-by-global-norm: it rescales the entire gradient vector and preserves its direction. Set the threshold by logging gradient norms over early steps and clipping a bit above the typical norm (1.0 is the standard for transformers). Clip-by-value clips each element independently and distorts direction. Trap: setting the threshold so low it clips nearly every step — that throttles learning and hides the real instability.
What is gradient/activation checkpointing and when is it worth using?
Instead of storing all forward-pass activations for backprop, checkpointing discards them and recomputes on the fly during the backward pass. This cuts activation memory substantially at ~20-30% extra compute cost. It's worth it when activations (not optimizer state or parameters) are the OOM bottleneck — common for long sequences or very deep nets. Trap: expecting it to fix every OOM — if optimizer state dominates you need ZeRO/FSDP/offloading instead.
When would you choose AdamW over SGD+momentum, or vice versa?
AdamW is the default for transformers and NLP: fast convergence, little LR tuning, and correct decoupled weight decay. SGD+momentum can generalize better in vision (ResNets) but requires careful LR scheduling. Always use AdamW over plain Adam when you want weight decay, because plain Adam's decay is an L2 term that gets rescaled by the adaptive step. Trap: claiming Adam is universally better. It can converge to sharper minima that generalize worse, and plain Adam's L2 penalty is not equivalent to decoupled weight decay.
How do you verify a new training pipeline is correct before committing to a full run?
Three checks: (1) verify the initial loss matches theory; for C-class softmax cross-entropy it should be near log(C); (2) overfit a single batch to near-zero loss to confirm the model, optimizer, and labels are wired correctly; (3) confirm the smoothed training loss trends downward on the full dataset early on (per-batch loss is noisy, so don't expect it to be strictly monotonic). Fix seeds and inspect a data batch first. Trap: skipping the single-batch overfit. If the model can't memorize 4 examples, scaling up data will not help; the bug is in the pipeline.
Your validation loss is lower than your training loss. Is something broken?
Usually not: dropout is active during training but disabled at eval, and any L2 penalty added to the loss is counted only in the training number, so training loss is artificially inflated; also, training loss is averaged over the epoch while the model was still learning, whereas val is measured at epoch end with the updated weights. It can also happen if val is smaller or easier. Trap: immediately suspecting a data leak. The benign explanations (dropout, measurement timing) are far more common and should be ruled out first.
12
PART II · TRAINING

Data pipeline & preprocessing

🎯Garbage in, garbage out: normalize inputs, augment hard, never leak the test set.

Your model is only as smart as what you feed it — garbage in, garbage out is the iron law of deep learning. Master the pipeline and you win before the optimizer even runs.

NORMALIZATION VS STANDARDIZATION
  • Standardization (z-score): subtract mean, divide by std. Default for NNs, SVM, KNN, PCA, logistic regression
  • Min-max: scale to [0,1] or [-1,1]. Use when bounded range matters (pixel inputs, saturating activations)
  • RobustScaler: uses IQR instead of std. Reach for this with heavy-tailed or outlier-laden data
  • Tree models: scale-invariant; RF, XGBoost, LightGBM need no scaling at all
$$x' = \frac{x - \mu}{\sigma}$$
subtract train mean, divide by train std; compute stats on train only
Golden rule: fit the scaler on train, call transform() on val/test. Fitting on all data leaks test stats into training (optimistic bias). Inside cross-validation, fit inside each fold via a Pipeline.
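The fit-on-train, transform-everywhere pattern looks like this in a stdlib-only sketch. `Standardizer` is a hypothetical stand-in for whatever scaler class your library provides, and the data is made up.

```python
import statistics


class Standardizer:
    """Z-score scaler whose statistics come from the data passed to fit()."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)   # population std of the fit data
        return self

    def transform(self, values):
        return [(x - self.mean) / self.std for x in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 10.0]

scaler = Standardizer().fit(train)     # statistics come from train only
train_z = scaler.transform(train)
test_z = scaler.transform(test)        # reuse train mean/std; never refit on test
```

The test values land outside the training range after scaling, which is expected: the scaler must describe the training distribution, not adapt to whatever it is shown later.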
IMAGE PIPELINE

Standard flow: decode → resize/crop → convert to float → normalize with per-channel ImageNet stats → random augmentations (train only).

  • ImageNet mean/std: [0.485, 0.456, 0.406] / [0.229, 0.224, 0.225]. Must replicate exactly when using pretrained backbones
  • Train augmentations: RandomCrop, horizontal flip, color jitter, RandAugment. Random, train only
  • Val/test transforms: resize, center crop, same normalization. Deterministic, no random augmentation
  • Mixup: blend two images + labels linearly. Good for calibration and general classification
  • CutMix: paste a patch from one image into another, area-proportional labels. Better when spatial/localization cues matter
Technique | Use when
Random augmentation | training set only
Normalization | all splits (train, val, test)
TTA | competition / offline inference; skip in latency-sensitive production
TEXT PIPELINE & TOKENIZATION

Word-level vocabularies explode and can't handle unseen words. Subword tokenizers (BPE, WordPiece, SentencePiece) cap vocab size while representing rare words as known pieces — nearly eliminating OOV and capturing morphology.

  • BPE: GPT-family; merges the most frequent byte pairs iteratively
  • WordPiece: BERT; maximizes language model likelihood at each merge
  • SentencePiece/Unigram: T5, multilingual; language-agnostic, no whitespace assumptions

After tokenizing: pad to a common length per batch (dynamic padding + bucketing beats global max), pass an attention mask (1 for real tokens, 0 for pad). Without the mask, pad tokens pollute attention and corrupt loss.
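Dynamic padding and the attention mask can be sketched with plain lists. The token ids, `pad_id=0`, and both helper names are assumptions for illustration; the masked softmax assumes each row has at least one real token.

```python
import math


def pad_batch(sequences, pad_id=0):
    """Pad to the longest sequence IN THIS BATCH and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return ids, mask


def masked_softmax(scores, mask):
    """Softmax over attention scores with padded positions forced to -inf."""
    masked = [s if m else -math.inf for s, m in zip(scores, mask)]
    mx = max(masked)
    exps = [math.exp(s - mx) for s in masked]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


batch = [[101, 7, 8, 102], [101, 9, 102]]       # made-up token ids
ids, mask = pad_batch(batch)
weights = masked_softmax([0.5, 1.0, 0.2, 3.0], mask[1])  # large score on the pad slot
```

Even though the padded position carries the largest raw score, its attention weight is exactly zero once the mask is applied.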

TABULAR: CATEGORICALS, MISSING VALUES, IMBALANCE
Problem | Go-to solution | Caveat
Low-cardinality categorical | One-hot encoding | Explodes with many categories
High-cardinality categorical | Learned embeddings or frequency encoding | Target encoding needs fold-level fit to prevent leakage
Unseen category at inference | Reserved <UNK> bucket or hashing | Pipeline crashes without this
Skewed / outlier numerics | Median imputation + RobustScaler | Log-transform then scale for heavy tails
Class imbalance (moderate) | Class weights in loss | Keep val/test imbalanced to reflect the real distribution
Class imbalance (extreme, detection) | Focal loss | Pair with PR-AUC, not accuracy
SMOTE | Tabular train fold only | Meaningless on images/text; breaks in high dimensions
$$FL(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t)$$
focal loss: down-weights easy negatives; γ=2 with α tuned per class is the default
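A direct transcription of the focal-loss formula for a single example, to show how strongly the (1 − p_t)^γ factor suppresses easy cases. The function name and the probabilities are illustrative; `p_t` is the model's probability for the true class.

```python
import math


def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example, given the probability of its true class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy, hard = 0.9, 0.1                        # confident-correct vs. badly wrong
ce_easy = focal_loss(easy, gamma=0.0)        # gamma = 0 is plain cross-entropy
fl_easy = focal_loss(easy)                   # scaled by (1 - 0.9)^2 = 0.01
fl_hard = focal_loss(hard)                   # scaled by (1 - 0.1)^2 = 0.81
easy_weight = fl_easy / ce_easy
hard_weight = fl_hard / focal_loss(hard, gamma=0.0)
```

With γ=2 the easy example keeps about 1% of its cross-entropy loss while the hard one keeps about 81%, so the gradient budget shifts toward the hard cases.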
Always add a binary "was-missing" indicator when missingness correlates with the target. Fit imputers on train only.
INPUT PIPELINE PERFORMANCE & DATA LEAKAGE

GPU starvation is a silent killer. The fix: shuffle → batch → prefetch (tf.data / DataLoader with num_workers > 0 and pin_memory=True on GPU). Overlap CPU preprocessing with GPU compute so the accelerator is never waiting.

Leakage checklist:

  • Never compute scaler/imputer stats on val or test
  • Stratify splits by class; split by time or group when samples are correlated
  • SMOTE and oversampling: train fold only
  • Target encoding: fit inside each CV fold, not globally
  • Serve with the exact same frozen preprocessing artifacts versioned alongside the model — re-implementing in the serving stack creates silent training/serving skew
⚠ Clears up — "just normalize everything" Tree models (RF, XGBoost, LightGBM) are scale-invariant and gain nothing from normalization. Applying it wastes time but usually doesn't hurt — the real trap is forgetting to normalize for NNs, SVMs, and KNN, where it does matter critically.
◆ Interview probe "Your model accuracy drops sharply in production despite great offline metrics — what's your first suspicion?" → Training/serving skew: the serving stack re-implements preprocessing or re-computes stats, shifting the input distribution. Check that scaler mean/std, vocab, and tokenizer config loaded at inference match the training artifacts exactly. Also verify val/test were never accidentally augmented or re-normalized with test-set stats.
Remember   Fit all preprocessing on training data only, replicate it exactly at inference, and you've eliminated the most common source of phantom accuracy gains and production failures.
Tricky interview questions 12
When do you use min-max scaling vs standardization, and which models don't need either?
Standardize (zero mean, unit variance) for gradient-based models (NNs), SVM, KNN, PCA, and logistic regression. Use min-max when a bounded range is required (pixel inputs to [0,1], saturating activations). Tree-based models (RF, XGBoost, LightGBM) are scale-invariant and need neither. Trap: saying "always normalize" — applying it to trees wastes time, and using min-max on heavy-tailed data lets outliers squash everything else (use RobustScaler instead).
Why must you fit the scaler on the training set only, and what breaks if you don't?
Fitting on all data leaks test-set statistics (mean, std, min/max) into training, giving optimistically biased metrics that don't generalize to production. Fit on train, call transform() on val/test; inside cross-validation, fit inside each fold via a Pipeline. Trap: calling fit_transform on the whole dataset before splitting, or fitting the scaler before cross-validation folds instead of inside each fold.
How do you handle class imbalance in a deep classifier, and how do you choose your tool?
For neural nets, start with class weights in the loss (weight inversely to class frequency). For extreme imbalance (object detection, fraud), use focal loss. SMOTE is a tabular-only tool — meaningless on images or text where linear pixel/token interpolation is semantically invalid. Always pair these with PR-AUC, F1, or recall rather than accuracy. Trap: reporting accuracy on a 99/1 split and calling the model good, or keeping val/test balanced when the real world is imbalanced.
Should data augmentation be applied to validation and test sets?
Random augmentations (flips, color jitter, random crop) apply to training only; val/test must reflect real, unmodified data for stable evaluation. Deterministic transforms (resize, center crop, normalization with train statistics) DO apply to all splits. Test-time augmentation (TTA) is a deliberate, labeled exception used at inference time with known latency cost. Trap: accidentally augmenting the test set, or over-correcting by dropping resize/normalization on val/test so the input distribution no longer matches training.
When using a pretrained ImageNet backbone, what preprocessing must you replicate exactly?
Match the backbone's exact preprocessing: same resize/crop size, channel order (RGB), scaling to [0,1], and per-channel ImageNet mean [0.485, 0.456, 0.406] / std [0.229, 0.224, 0.225]. Mismatched normalization silently shifts the input distribution away from what the model learned and degrades accuracy even if the rest of the fine-tuning is perfect. Trap: using your own dataset's mean/std with an ImageNet-pretrained net, or feeding BGR/0–255 images when the model expects RGB/0–1.
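A small sketch of the per-pixel arithmetic, using the mean/std values quoted above. The helper names are hypothetical, and a real pipeline would do this on whole tensors.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_uint8):
    """Scale an (R, G, B) pixel from 0-255 to [0, 1], then standardize per channel."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD))

def bgr_to_rgb(bgr):
    """Reverse channel order for images loaded as BGR."""
    return bgr[::-1]

white = normalize_pixel((255, 255, 255))
from_bgr = normalize_pixel(bgr_to_rgb((10, 20, 30)))  # reorder BEFORE normalizing
```

Skipping either step (the /255 scaling or the channel reorder) still yields valid-looking numbers, which is why the mismatch is silent.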
Why use subword tokenization (BPE/WordPiece/SentencePiece) instead of word-level or char-level?
Subword tokenizers cap vocabulary size while representing rare and unseen words as sequences of known pieces, nearly eliminating OOV and capturing morphology. Word-level explodes the vocab and chokes on unseen words; char-level makes sequences impractically long. BPE is used by GPT, WordPiece by BERT, SentencePiece/Unigram by T5. Trap: claiming subword tokenizers have zero OOV (rare scripts or bytes can still be unknown), or assuming you can swap a tokenizer without retraining the model's embedding table.
How do you batch variable-length sequences, and what breaks without an attention mask?
Pad sequences to a common length per batch: dynamic padding pads only to the longest sequence in each batch, and bucketing groups similar lengths to minimize waste. Pass an attention mask (1 for real tokens, 0 for padding) so padded positions are excluded: set their attention scores to -∞ before the softmax and drop them from the loss. Without masking, pad tokens leak into attention scores and corrupt both training and inference. Trap: padding to a global max length (wasteful GPU memory), or omitting the mask so the model learns to attend to and predict padding.
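The masking step can be sketched for a single query row in plain Python. `masked_softmax` is a hypothetical helper, and it assumes at least one real (unmasked) token per row.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores where mask == 1; padded slots get weight 0."""
    masked = [s if m == 1 else -math.inf for s, m in zip(scores, mask)]
    peak = max(masked)                            # subtract max for stability
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]   # last two positions are padding
mask = [1, 1, 0, 0]
weights = masked_softmax(scores, mask)
```

Note that the padded position with the largest raw score (3.0) would have dominated the row without the mask.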
How do you handle high-cardinality categoricals and categories unseen at inference?
Use learned embeddings or frequency/target encoding for high-cardinality columns. For unseen categories at inference, map to a reserved <UNK> bucket or use feature hashing so the pipeline doesn't crash. Target encoding must be fit on training folds only to prevent label leakage. Trap: label-encoding nominal categories as ordered integers (implies false ordinal relationship), one-hot encoding a million-category column (memory explosion), or a pipeline that errors on any novel category in production.
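A sketch of the reserved-bucket idea for unseen categories. The names `build_vocab` and `encode` and the example city values are hypothetical.

```python
UNK = "<UNK>"

def build_vocab(train_categories):
    """Build the index from training data; index 0 is reserved for unseen values."""
    vocab = {UNK: 0}
    for cat in train_categories:
        vocab.setdefault(cat, len(vocab))
    return vocab

def encode(category, vocab):
    """Map a category to its index, falling back to the <UNK> bucket."""
    return vocab.get(category, vocab[UNK])

vocab = build_vocab(["paris", "tokyo", "paris", "lima"])
seen_id = encode("tokyo", vocab)
unseen_id = encode("oslo", vocab)     # never seen in training, maps to 0
```

The resulting integer ids can index an embedding table whose row 0 is trained to represent "unknown".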
Why should SMOTE be applied only to the training fold, and when is it a bad idea?
Resampling before splitting puts synthetic/duplicated copies of the same point in both train and test, leaking labels and inflating evaluation scores. SMOTE also breaks down in high dimensions and on images, text, or audio — interpolating raw pixels or tokens produces semantically meaningless samples. Use class weights or focal loss there instead. Trap: running SMOTE on the full dataset, or proposing SMOTE for image or embedding data where linear interpolation between inputs is invalid.
What is the difference between Mixup and CutMix, and when would you choose each?
Mixup linearly blends two images and their one-hot labels (good for general classification and improving calibration). CutMix pastes a rectangular patch from one image onto another, assigning labels proportional to patch area (better when spatial and localization cues matter, e.g. fine-grained recognition). Label smoothing replaces one-hot targets with soft targets to reduce overconfidence without mixing inputs. Trap: applying Mixup/CutMix to tasks where blended labels are nonsensical (detection with tight bounding boxes), or using label smoothing with a distillation teacher (it destroys the dark knowledge in teacher logits).
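The label arithmetic for both methods, sketched on toy vectors. The mixing weight is fixed here for clarity; sampling it from a Beta distribution is the usual practice but is an assumption, not something stated above. Function names are hypothetical.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two inputs and their one-hot labels with the same weight lam."""
    def blend(a, b):
        return [lam * u + (1 - lam) * v for u, v in zip(a, b)]
    return blend(x_a, x_b), blend(y_a, y_b)

def cutmix_label(y_a, y_b, patch_area, image_area):
    """Label weights follow the pasted patch's share of the image area."""
    lam = 1 - patch_area / image_area      # fraction of image A that survives
    return [lam * u + (1 - lam) * v for u, v in zip(y_a, y_b)]

x_mix, y_mix = mixup([0.2, 0.8], [1.0, 0.0], [0.6, 0.4], [0.0, 1.0], lam=0.75)
y_cut = cutmix_label([1.0, 0.0], [0.0, 1.0],
                     patch_area=16 * 16, image_area=32 * 32)
```

Both produce the same soft label here (0.75 / 0.25); the difference is whether the input pixels are averaged everywhere or swapped inside a patch.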
How do you prevent training/serving skew in your preprocessing pipeline?
Persist all fitted preprocessing artifacts (scaler mean/std, vocabulary, tokenizer config, encoders) and apply the exact same code at inference — ideally by bundling preprocessing inside the model graph or a shared library versioned alongside the model weights. Never re-implement preprocessing separately in the serving stack or recompute stats at serve time. Trap: re-implementing normalization or tokenization in a serving microservice with slightly different logic or recomputed statistics — this silently shifts the input distribution from what the model trained on, causing mysterious production degradation.
How do you build an input pipeline that doesn't starve the GPU?
Use shuffle → batch → prefetch (tf.data) or DataLoader with num_workers > 0 and pin_memory=True (PyTorch) to overlap CPU preprocessing with GPU compute. Use bucketing for variable-length sequences to minimize padding waste. Profile with tools like PyTorch Profiler or TensorBoard to confirm the GPU utilization is high and data loading isn't the bottleneck. Trap: using num_workers=0 (single-threaded, GPU idles while CPU preprocesses) or prefetching but forgetting to set pin_memory, which forces an extra memory copy per batch.
13
PART II · TRAINING

Hyperparameter tuning & debugging

🎯First debugging move: overfit a single batch. If you can't, it's a bug, not the model.
[Figure: Learning curves, training vs true risk. x-axis: training-set size n; y-axis: expected error. Curves: training error, test / true risk. Annotations: generalization gap, irreducible (Bayes) error.]
More data shrinks the generalization gap: training error rises and test error falls until both hit the irreducible Bayes error. The gap, not the training error, is what regularization and more data attack.

Debugging a deep learning model is like diagnosing a sick engine: most failures aren't tuning problems — they're wiring problems. Fix the bugs first, then tune one knob at a time, starting with the one that moves the needle most: the learning rate.

TUNE IN THIS ORDER — ALWAYS
| Priority | Hyperparameter | Why it matters most |
| --- | --- | --- |
| 1st | Learning rate | Biggest single effect; wrong LR breaks everything else |
| 2nd | Batch size | Affects noise, throughput, and must pair with LR scaling |
| 3rd | Architecture size | Capacity sets the bias-variance ceiling |
| 4th | Regularization | Only useful once the model can actually overfit |
| 5th | LR schedule / warmup | Squeeze out final performance |

Use random search over grid search — only a few hyperparameters actually matter, and random sampling covers those critical dimensions far better per trial. Use Bayesian / ASHA / Hyperband when trials are expensive.

THE DEBUGGING PLAYBOOK (6 STEPS)
Step 1: overfit a single batch. The single most powerful sanity check. A correct model must drive a tiny batch to ~0 loss. If it can't, the bug is in data, loss, or graph, not hyperparameters. Loss rising = wrong sign or LR too high; NaN = numerical issue; flat = LR too low or frozen params; oscillating = corrupted labels or LR too high.
Step 2: check init loss. For C-class balanced softmax, expect $-\log(1/C) = \log C$. Wildly off? Wrong loss, bad init, or wrong label encoding. Takes 10 seconds to check.
Step 3: hunt NaNs. Lower LR, add gradient clipping, add $\epsilon$ in logs/divisions, use log-softmax/logsumexp, enable anomaly detection, add fp16 loss scaling.
Step 4: train vs. val curves. Large gap = overfitting (high variance). Both high and flat = underfitting (high bias). Guides the exact fix.
Step 5: normalize inputs & verify labels. Mismatched normalization or shuffled labels torpedo everything silently.
Step 6: set seeds. Python, NumPy, PyTorch/TF, CUDA: pin them all. Enable deterministic ops.
LOSS FORMULAS — KNOW THESE COLD
$$\mathcal{L}_\text{init} \approx \log C$$
Expected cross-entropy at random init for C balanced classes; e.g. ~2.30 for 10 classes, ~4.61 for 100.
$$\text{LR}_\text{new} = \text{LR}_\text{base} \times \frac{B_\text{new}}{B_\text{base}}$$
Linear scaling rule: when batch size grows k×, scale LR k× and add warmup.
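Both formulas are one-liners, sketched here so the reference values can be checked. The base LR of 0.1 at batch size 256 is an illustrative assumption, not a recommendation from these notes.

```python
import math

def expected_init_loss(num_classes):
    """Cross-entropy of a uniform prediction over balanced classes: log C."""
    return math.log(num_classes)

def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: LR grows in proportion to batch size."""
    return base_lr * new_batch / base_batch

loss_10 = expected_init_loss(10)      # about 2.30
loss_100 = expected_init_loss(100)    # about 4.61
lr_4x = scaled_lr(base_lr=0.1, base_batch=256, new_batch=1024)  # 4x batch, 4x LR
```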

LR range test: ramp LR exponentially over ~300 steps, pick the value just below where loss starts diverging. Default for Adam: 3e-4. Tune on a log scale.
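A sketch of the range-test bookkeeping: an exponential LR ramp and a simple rule for "just below where loss starts diverging". The divergence rule (loss exceeding 4× the best loss so far) and the toy loss values are assumptions for illustration; `num_steps` must be at least 2.

```python
def lr_range_schedule(lr_min, lr_max, num_steps):
    """Exponentially spaced learning rates from lr_min to lr_max."""
    ratio = (lr_max / lr_min) ** (1 / (num_steps - 1))
    return [lr_min * ratio ** i for i in range(num_steps)]

def pick_lr(lrs, losses, blowup=4.0):
    """Return the last LR before the loss exceeds blowup x the best loss so far."""
    best = losses[0]
    for i in range(1, len(losses)):
        if not losses[i] <= blowup * best:   # also catches NaN
            return lrs[i - 1]
        best = min(best, losses[i])
    return lrs[-1]                           # never diverged in the tested range

lrs = lr_range_schedule(1e-6, 1.0, 7)        # 7 steps here; ~300 in practice
losses = [2.3, 2.2, 2.0, 1.5, 1.2, 9.0, 50.0]
picked = pick_lr(lrs, losses)
```

In this toy run the loss blows up at the sixth step, so the picked value is the fifth LR, roughly 1e-2.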

BIAS VS. VARIANCE — THE CORE DIAGNOSTIC
| Symptom | Diagnosis | Fix |
| --- | --- | --- |
| Train loss high, val loss high | Underfitting / high bias | More capacity, less regularization, higher LR, longer training |
| Train loss low, val loss rising | Overfitting / high variance | More/augmented data, dropout, weight decay, early stopping, smaller model |
| Val fine, test bad | Distribution shift | Collect test-like data; check for leakage |
| Good val score, but a zeroed-input baseline matches it | Data leak or majority-class exploit | Fix split, recompute normalization stats on train only |
VANISHING / EXPLODING GRADIENTS

Instrument per-layer gradient norms (log or histogram every N steps). Near-zero early-layer norms = vanishing; blowing up or NaN = exploding.

Vanishing fixes: ReLU-family activations, He/Xavier init, BatchNorm/LayerNorm, residual connections.
Exploding fixes: gradient clipping (clip by global norm, threshold ~1.0), lower LR, loss scaling for fp16.
Train vs. eval mode: Dropout and BatchNorm behave differently in each; always call model.eval() at inference or your numbers are garbage.
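Clip-by-global-norm can be sketched on nested lists standing in for per-layer gradients. The function name is hypothetical, and the small epsilon in the denominator is an assumption added for numerical safety.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    total = math.sqrt(sum(g * g for layer in grads for g in layer))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [[g * scale for g in layer] for layer in grads], total

grads = [[3.0, 4.0], [12.0]]          # two "layers"; joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for layer in clipped for g in layer))
```

Every entry is multiplied by the same factor, so ratios between components (the gradient direction) are unchanged; only the length shrinks. `norm_before` is also the quantity worth logging every N steps.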
⚠ Clears up — "The model isn't learning, let me tune the LR" The #1 wrong reflex. Flat loss from step 0 almost always means a broken pipeline: detached graph, frozen params, wrong loss reduction, shuffled labels, or inputs never reaching the model. Overfit a single batch first — if it can't, no LR value will save you.
◆ Interview probe Q: You just started training and the loss is completely flat. What's your first move? → Don't touch the LR. Check that the init loss equals $\log C$, then try to overfit a single batch with no augmentation and no dropout. If it still can't overfit, inspect that gradients are non-zero and non-NaN, the optimizer is updating the right params, and labels are correctly paired with inputs.
Remember   Debug before you tune: overfit one batch, confirm init loss, hunt NaNs — then and only then touch the learning rate (first) and everything else (later).
Tricky interview questions 12
Your model isn't learning at all — loss flat from step 0. Walk me through how you debug it.
Check that the init loss equals log(C) for C classes, confirm inputs/labels are paired and normalized, then try to overfit a single batch with dropout/augmentation off. If it can't overfit, the bug is in the model/loss/data pipeline — inspect that gradients are non-zero and non-NaN, and that the optimizer is updating the right parameters. Trap: jumping straight to tuning LR or architecture when the real culprit is a detached graph, frozen layers, wrong loss reduction, or shuffled labels.
Why is "overfit a single batch" the first debugging step, and what do different failure modes mean?
A correctly wired model with enough capacity must drive a tiny batch to ~0 loss; failure means something is broken before any tuning matters. Loss rising = flipped sign or LR too high; NaN = numerical issue; flat plateau = LR too low or blocked gradients; oscillating = corrupted labels or LR too high. Trap: declaring a bug when the batch won't reach ~0 loss but you forgot to turn off dropout or augmentation, which can prevent reaching zero loss on their own.
Your loss suddenly goes NaN partway through training. What are the likely causes and fixes?
Likely causes: exploding gradients or LR too high, log(0) or divide-by-zero in a custom loss, corrupted inputs (NaN/Inf), or fp16 overflow. Fixes: lower LR, gradient clipping, add ε in logs/divisions, use log-softmax/logsumexp, enable anomaly detection to locate the op, add fp16 loss scaling, and clean the data. Trap: only lowering the LR when the real culprit is log(0) in a custom loss or a single bad sample — or ignoring fp16 loss scaling entirely.
What value should the loss be at initialization, and why bother checking it?
For balanced C-class cross-entropy it should be ~log(C) (e.g., ~2.30 for 10 classes, ~4.61 for 100). A wildly different value instantly reveals a wrong loss function, mis-scaled logits, bad weight init, or incorrect label encoding — bugs that would otherwise be misdiagnosed as tuning problems. Trap: never checking it; a 10× off init loss is a free bug signal that takes 10 seconds to read.
Training loss drops but validation loss starts rising. What's happening and what do you do?
Classic overfitting (high variance): the model memorizes training data. Fix in order: more/augmented data, then regularization (weight decay, dropout), early stopping at val-loss minimum, and finally a smaller model if nothing else works. Trap: reaching for a bigger model or more epochs — both make it worse — or confusing it with a genuine train/test distribution shift.
Both training and validation loss are high and plateau. How do you fix it?
This is underfitting (high bias): the model lacks capacity to fit training data. Increase model width/depth, reduce regularization, train longer, raise LR, fix any blocked-gradient issue, and consider a better-suited architecture. Trap: adding regularization or collecting more data — those address variance, not bias — and making underfitting worse.
How do you pick a learning rate in practice?
Start from Adam ~3e-4 as a default. Run an LR-range test: increase LR exponentially over ~300 steps and pick the value just below where loss diverges. Always tune on a log scale via random search, since LR is the single most impactful hyperparameter and its effect is multiplicative. Trap: tuning LR linearly or with grid search, or hard-coding one value across very different batch sizes, optimizers, or architectures.
How do you detect vanishing vs. exploding gradients, and how do you fix each?
Log per-layer gradient norms or histograms: near-zero early-layer norms = vanishing; norms blowing up or NaN = exploding. Fix vanishing with ReLU-family activations, He/Xavier init, normalization layers, and residual connections. Fix exploding with gradient clipping (clip by global norm, ~1.0) and a lower LR. Trap: naming the symptom without actually instrumenting gradient norms, or applying gradient clipping (an exploding-gradient fix) to a vanishing-gradient problem.
How does batch size affect training, and what must you adjust when you scale it up?
Larger batches give lower-variance gradients, higher throughput, and more stable steps but can generalize slightly worse (sharp minima "generalization gap"). Smaller batches inject noise that can regularize. When raising batch size k×, scale LR roughly k× linearly and add warmup to stabilize early training. Trap: increasing batch size while keeping the same LR, then concluding "big batches don't work" when the real issue is an unscaled, un-warmed-up LR.
Why use LR warmup, and when is it most critical?
Early in training, weights are random and Adam's moment estimates are unreliable — a high LR can blow up those early updates. Warmup ramps LR from ~0 over the first few hundred to a thousand steps to stabilize things. It's especially important for transformers, large batch sizes, and adaptive optimizers. Trap: treating warmup as optional everywhere — skipping it on a transformer or large-batch run often causes early divergence or NaNs.
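A sketch of a warmup schedule. The linear ramp shape, the 1000-step length, and the 3e-4 base LR are illustrative assumptions; the text above only says the LR ramps up from near zero.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear ramp from ~0 to base_lr over warmup_steps, then hold base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Sample the schedule at a few steps of a hypothetical run.
schedule = [warmup_lr(s, base_lr=3e-4, warmup_steps=1000)
            for s in (0, 499, 999, 5000)]
```

In a real run the held value would usually be followed by a decay schedule; that part is omitted here.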
Grid search vs. random search vs. Bayesian/Hyperband — which and when?
Random search beats grid search because only a few hyperparameters matter; random sampling covers those critical dimensions far better per trial budget. Use coarse-to-fine random search first, then Bayesian optimization or Hyperband/ASHA when each trial is expensive and you want sample efficiency. Trap: defaulting to exhaustive grid search, which wastes most of the budget jointly varying unimportant hyperparameters.
What is an input-independent baseline and why run one?
Train or evaluate with inputs zeroed or randomly shuffled: performance should be clearly worse than with real inputs. If zeroed-input performance matches real-input performance, your pipeline isn't connecting features to the loss — you have a data leak, a label imbalance exploit, or a broken data loader. Trap: never running it and shipping a model whose "good" accuracy is just majority-class prediction or a leaked target.
14
PART III · ARCHITECTURES

Convolutional networks (CNNs)

🎯A CNN slides one small filter everywhere — shared weights, translation built in.
[Figure: Convolution: slide a small kernel, share weights. Panels: input + 3×3 kernel → feature map, each output = ⟨kernel, patch⟩. Same kernel everywhere → translation equivariance + few params. Stack convs → bigger receptive field.]
A conv layer slides one small set of shared weights over the image; each output is the dot product of the kernel with a local patch. That gives translation equivariance and far fewer parameters than a dense layer — pooling/stride shrink the map and grow the receptive field.

A CNN is a parameter-efficient feature detector: instead of every pixel talking to every neuron, a tiny kernel slides across the image sharing the same weights everywhere — so a cat detector stays a cat detector whether the cat is left or right. Stack those detectors deeper and you go from edges → textures → parts → objects.

THE CORE MECHANICS
$$\text{output}[i,j] = \sum_{u,v,c} W[u,v,c]\,x[i \cdot S+u,\;j \cdot S+v,\;c] + b$$
Dot product of kernel with local patch, slid by stride S.
Kernel (filter): shared weight grid of size F×F×C_in; the layer has C_out such filters.
Stride S: step size; S=2 halves the spatial size.
Padding P: "same" means P=(F−1)/2 for stride 1, keeping H×W intact.
Output size: ⌊(W − F + 2P)/S⌋ + 1 per spatial dim.
Parameters: C_out × (F²×C_in + 1), independent of input H/W.
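The two formulas above as a quick calculator, a sketch with hypothetical function names. The layer shapes in the examples are illustrative.

```python
def conv_output_size(w, f, s=1, p=0):
    """floor((W - F + 2P) / S) + 1 along one spatial dimension."""
    return (w - f + 2 * p) // s + 1

def conv_params(c_in, c_out, f):
    """C_out * (F*F*C_in + 1); the +1 is the per-filter bias."""
    return c_out * (f * f * c_in + 1)

same_3x3 = conv_output_size(224, f=3, s=1, p=1)      # "same" padding keeps 224
strided_7x7 = conv_output_size(224, f=7, s=2, p=3)   # stride 2 halves to 112
params = conv_params(c_in=3, c_out=64, f=3)          # 64 * (27 + 1)
```

Changing the 224 input size changes the output sizes but leaves `params` untouched.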

Translation equivariance: shift input → shift feature map by same amount. Translation invariance comes later, from pooling or the global head.

RECEPTIVE FIELD, POOLING & DILATION

Receptive field (RF) = the input region one output unit "sees". It grows with depth, kernel size, stride, and dilation. Two stacked 3×3 convs match the RF of one 5×5 but cost less: 2×9C² vs 25C² params, plus an extra non-linearity — the core VGG argument.
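The receptive-field and parameter claims can be checked with a short sketch. The recurrence used (RF grows by (kernel − 1) × dilation × the product of earlier strides) is the standard one; the function names and the choice of C = 64 are assumptions.

```python
def receptive_field(layers):
    """layers: (kernel, stride, dilation) tuples ordered from input to output."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride                     # spacing of this layer's outputs in input pixels
    return rf

def stack_weights(kernels, channels):
    """Weight count (biases ignored) for convs with C_in = C_out = channels."""
    return sum(k * k * channels * channels for k in kernels)

rf_two_3x3 = receptive_field([(3, 1, 1), (3, 1, 1)])   # 5
rf_one_5x5 = receptive_field([(5, 1, 1)])              # 5
rf_dilated = receptive_field([(3, 1, 2)])              # 5, with only 9 taps
w_two_3x3 = stack_weights([3, 3], channels=64)         # 18 * C^2
w_one_5x5 = stack_weights([5], channels=64)            # 25 * C^2
```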

Max pooling: keeps strongest activation; local translation invariance; being replaced by strided convs in modern nets.
Strided conv: learnable downsampling, preferred in ResNet, EfficientNet.
Dilated (atrous) conv: gaps between taps grow RF without downsampling; key for segmentation (DeepLab); watch for gridding artifacts at high dilation.
KEY LAYER TYPES & ARCHITECTURE MILESTONES
| Tool | What it does | When to use |
| --- | --- | --- |
| 1×1 conv | Channel-wise linear projection, no spatial mixing | Bottleneck: cut channels before expensive 3×3 (ResNet, Inception) |
| Depthwise sep. conv | Spatial per-channel + pointwise channel-mix; ~8–9× cheaper for 3×3 kernels | On-device / latency-constrained (MobileNet) |
| Residual block | Output = F(x) + x; identity shortcut fixes gradient flow | Nets deeper than ~20 layers; addresses the degradation problem |
| Global Avg Pool | Collapses H×W to 1×1; replaces flatten+FC | Any-size input; fewer overfit params; standard modern head |
| Batch Norm | Normalize per-channel with batch stats in training; use running stats at inference | Standard in CNNs; swap for GroupNorm at tiny batch sizes |

Lineage: LeNet → AlexNet → VGG (3×3 stacks) → ResNet (residuals, still the workhorse) → MobileNet (depthwise sep.) → EfficientNet (compound scaling).

TRANSFER LEARNING & PRACTICAL DEFAULTS
Freeze early layers: edges/textures are generic; fine-tune later layers and head.
LR for pretrained weights: 10×–100× smaller than the new head's LR.
Train from scratch: only with large data or very different domain.
Overfitting fixes (in order): data aug first (flips, crops, color jitter, CutMix/MixUp), then weight decay, then dropout in the head.
Debug a stuck loss: overfit a single batch first; if it can't memorize 8 samples, the bug is in model/loss/labels, not hyperparameters.

CNNs are not scale- or rotation-invariant by default — bake it in via augmentation or multi-scale design.

⚠ Clears up — "parameter count scales with resolution" Conv layer params = C_out × (F²×C_in + 1). They do NOT multiply by output H×W — weight sharing is the whole point. Multiply by H×W only for FLOPs, not params. Confusing params with FLOPs is the classic interview slip.
◆ Interview probe "Why can't you just use one big kernel instead of stacking 3×3s?" → Two stacked 3×3 convs cover the same 5×5 receptive field as a single 5×5, but cost 2×9C² = 18C² params vs 25C², and add a non-linearity between them, increasing model expressiveness. For a fixed parameter budget, deeper with smaller kernels is usually the better trade.
Remember   Shared weights + local connectivity give CNNs their power; stack 3×3s for efficiency, add residuals for depth, and always debug by overfitting one batch before touching architecture.
Tricky interview questions 12
Why use convolutions for images instead of fully-connected layers?
Convolutions exploit local connectivity and weight sharing: an FC layer on a 200×200×3 image needs ~120K weights per neuron and ignores spatial layout; a conv reuses the same small kernel everywhere, giving far fewer parameters and built-in translation equivariance. The inductive bias (locality + equivariance) is why CNNs generalize on images with less data. Trap: Mentioning only parameter count and forgetting the spatial inductive bias — equivariance is the deeper reason.
Given input W×W, filter F, stride S, padding P — what is the output size, and how do you pick P for "same" output?
Output = ⌊(W − F + 2P)/S⌋ + 1. For stride-1 "same" output set P = (F−1)/2, which is why odd-sized kernels (3×3, 5×5) are standard — even kernels don't have a clean center. Trap: Forgetting the +1, or applying the same-padding formula when stride > 1.
How many parameters does a convolutional layer have?
Per filter: F² × C_in + 1 (bias). Total: C_out × (F² × C_in + 1). Crucially this does NOT depend on input H or W — weight sharing means the same filter is reused at every spatial location. Trap: Multiplying by output H×W (confusing params with FLOPs).
What does a 1×1 convolution do and why is it useful?
A 1×1 conv is a learned linear combination across the channel dimension at each pixel — it changes depth (channel count) without touching spatial size or mixing spatial information. It's the bottleneck trick: cut channels cheaply before an expensive 3×3 to reduce FLOPs. Trap: Calling it a "no-op" — it still performs a full projection over C_in channels.
What is a depthwise separable convolution and when would you use it?
It factors a standard conv into a depthwise step (one F×F spatial filter per input channel) plus a pointwise 1×1 step (channel mixing), cutting compute by roughly 8–9× for 3×3 kernels. Use it when latency or compute is constrained — e.g., MobileNet on mobile devices. Trap: Assuming it's free accuracy — it trades representational capacity for speed and may underperform if not properly scaled.
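The "roughly 8–9×" figure follows from counting multiply-accumulates per output position, sketched below. The channel width of 256 is an illustrative assumption; the ratio approaches F² = 9 as C_out grows.

```python
def standard_conv_macs(f, c_in, c_out):
    """MACs per output position for a standard F x F conv."""
    return f * f * c_in * c_out

def separable_conv_macs(f, c_in, c_out):
    """Depthwise (F*F per input channel) plus pointwise 1x1 (C_in * C_out)."""
    return f * f * c_in + c_in * c_out

std = standard_conv_macs(3, 256, 256)
sep = separable_conv_macs(3, 256, 256)
speedup = std / sep        # equals 1 / (1/C_out + 1/F^2)
```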
What is a dilated (atrous) convolution and why use it instead of pooling?
Dilation inserts gaps between kernel taps so the receptive field grows rapidly without downsampling or adding parameters. This lets segmentation models (DeepLab) see large context while keeping full output resolution. Trap: Ignoring gridding artifacts from aggressive dilation; alternating dilation rates or hybrid dilated convolution (HDC) schedules are needed to mitigate this.
Why stack 3×3 kernels instead of using large kernels?
Two stacked 3×3 convs match the 5×5 receptive field but cost 18C² vs 25C² params, and add an extra non-linearity — making them more expressive and efficient. VGG validated this empirically; it's now the default design choice. Trap: Thinking a single larger kernel captures more context at equal cost — it doesn't, and loses the extra non-linearity.
How do residual (skip) connections enable very deep CNNs?
A residual block outputs F(x) + x: the identity shortcut lets gradients flow directly to earlier layers, solving the degradation problem where plain very-deep nets trained worse than shallower ones despite more capacity. Near-identity mappings are trivially learnable. Trap: Saying ResNet solves overfitting — it primarily fixes an optimization/gradient-flow problem, not a generalization one.
How does batch norm behave at train vs. inference in a CNN, and what breaks with small batches?
At training, BN normalizes per-channel using the current mini-batch's mean and variance; at inference it uses running averages accumulated during training, so you must call model.eval() to switch modes. With batch size 1–2 the batch statistics are too noisy and BN hurts; use GroupNorm or LayerNorm instead. Trap: Forgetting eval mode. The layer keeps normalizing with the current batch's statistics (and keeps updating its running averages), so a sample's prediction depends on whatever else is in the batch.
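A toy single-channel batch norm makes the train/eval difference concrete. This is a simplified sketch, not a library implementation: it has no learnable scale/shift, uses the biased variance for the running estimate, and the class name `TinyBatchNorm` is hypothetical.

```python
class TinyBatchNorm:
    """Single-channel batch norm over a list of floats (no learnable affine)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, batch):
        if self.training:
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]

bn = TinyBatchNorm()
for _ in range(200):                  # "training": running stats converge
    bn([4.0, 6.0, 5.0])

bn.training = False                   # the equivalent of model.eval()
eval_a = bn([4.0, 6.0, 5.0])[0]
eval_b = bn([4.0, 10.0, 10.0])[0]     # same sample 4.0, different batch mates

bn.training = True                    # forgot eval mode
train_a = bn([4.0, 6.0, 5.0])[0]
train_b = bn([4.0, 10.0, 10.0])[0]
```

In eval mode the sample 4.0 gets the same output regardless of its batch mates; in train mode its output changes with the batch.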
How do you adapt a CNN to variable input sizes or dense prediction tasks?
Replace the flatten+FC head with Global Average Pooling (or convert FC layers to 1×1 convs) making the network fully convolutional — it then accepts any spatial size. For dense tasks (segmentation, detection) also avoid strided downsampling or use dilated convs + upsampling (U-Net, FPN). Trap: Keeping a fixed-size flatten layer and being stuck at one resolution.
Are CNNs translation invariant? What about rotation and scale?
Convolution is translation equivariant (shift input → shift output by same amount); approximate translation invariance comes from pooling and the global head. CNNs are NOT inherently rotation- or scale-invariant — you must bake that in via data augmentation or multi-scale architecture design. Trap: Conflating equivariance with invariance, or claiming rotation invariance is automatic.
Your CNN is overfitting on a small dataset — what's your prioritized fix list?
In order: (1) data augmentation (flips, crops, color jitter, CutMix/MixUp — highest leverage); (2) transfer learning from a pretrained backbone; (3) weight decay (L2); (4) dropout in the head only; (5) reduce model capacity. Freeze early backbone layers and use a 10–100× smaller LR for pretrained weights. Trap: Adding dropout inside conv layers — it interacts poorly with BatchNorm and spatial correlations; augmentation first.
15
PART III · ARCHITECTURES

RNNs, LSTMs & GRUs

🎯An LSTM is a conveyor belt with gates; Transformers replaced it by looking everywhere at once.
[Figure: LSTM cell: a gated memory that survives long sequences. Cell state C runs along the top (× then +); gates f (σ, forget), i (σ, input), o (σ, output) and candidate g (tanh). Gates (sigmoids in [0,1]) decide what to forget, write, and read → gradients flow along C, dodging vanishing.]
An LSTM keeps a cell state and three sigmoid gates — forget, input, output — that control what to erase, add, and emit. The additive cell-state path is what lets gradients survive long sequences (a GRU merges the gates into two). Largely superseded by Transformers, but still asked.

An RNN is a loop that hands memory forward one step at a time — powerful in theory, crippled in practice by gradients that die (or explode) over long sequences. LSTMs and GRUs bolt on gates to keep a separate, addition-friendly memory lane alive across hundreds of steps.

VANILLA RNN — THE LOOP AND ITS CURSE
$$h_t = \phi(W_h h_{t-1} + W_x x_t)$$
New hidden = activation( recurrent weight × old hidden + input weight × current input ).

Weights are shared across every timestep: compact, but BPTT multiplies by the same $W_h$ (and the activation's derivative) over and over. If the largest singular value of $W_h$ is <1, gradients vanish exponentially; if it is >1, they can explode. This is a temporal depth problem (sequence length), not stacked-layer depth.
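The repeated-multiplication effect is easiest to see in a scalar, linear toy recurrence. This sketch drops the activation (so φ is the identity) and uses a scalar weight, both simplifying assumptions; the function name is hypothetical.

```python
def gradient_scale(w, steps):
    """|d h_T / d h_0| for the linear scalar recurrence h_t = w * h_{t-1}."""
    return abs(w) ** steps

vanishing = gradient_scale(0.9, 100)   # about 2.7e-5
exploding = gradient_scale(1.1, 100)   # about 1.4e4
```

A 10% deviation from 1 in either direction changes the gradient by roughly nine orders of magnitude between the two cases over 100 steps.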

Vanishing gradient fix: switch to LSTM/GRU (architectural), not just a larger hidden size.
Exploding gradient fix: gradient clipping by global norm (clip threshold ≈ 1–5); preserves gradient direction, unlike clip-by-value.
LSTM — GATES AND THE ADDITIVE CELL STATE

Four learned transforms, each combining $h_{t-1}$ and $x_t$:

Forget gate $f$: sigmoid → what to erase from cell state.
Input gate $i$: sigmoid → how much new info to write.
Candidate $\tilde{C}$: tanh → new info proposal (±1 range).
Output gate $o$: sigmoid → what portion of cell to expose as hidden state.
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Cell state = forget-gate × old cell + input-gate × candidate.
$$h_t = o_t \odot \tanh(C_t)$$
Hidden output = output-gate × squashed cell.

The additive path means the backprop gradient through $C$ is approximately $f_t$ (not a product of saturating derivatives). With $f_t \approx 1$, gradients travel many steps nearly unchanged: the "gradient highway." Tanh bounds the candidate and the exposed cell output; sigmoid provides 0–1 soft masking. Never swap them.
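One LSTM step written out for scalars, as a sketch of the equations above. The parameter layout (a dict of `(w_x, w_h, b)` triples keyed by gate name) and the specific weight values are hypothetical; a large forget bias is chosen so that $f_t$ sits near 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. p maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))              # forget gate
    i = sigmoid(pre("i"))              # input gate
    o = sigmoid(pre("o"))              # output gate
    c_tilde = math.tanh(pre("g"))      # candidate
    c = f * c_prev + i * c_tilde       # additive cell update
    h = o * math.tanh(c)               # squashed cell, gated
    return h, c, f

params = {"f": (0.0, 0.0, 5.0), "i": (0.0, 0.0, -5.0),
          "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c, f = lstm_step(x=0.0, h_prev=0.0, c_prev=1.0, p=params)
highway = f ** 100                     # direct-path gradient over 100 steps
```

With $f \approx 0.993$ the direct cell-state path still passes about half the gradient after 100 steps, compared with about 2.7e-5 for a factor of 0.9 per step.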

Param count: $4 \times [(d_{in} + d_h) \times d_h + d_h]$ — the factor 4 covers all three gates plus the candidate.

GRU — LIGHTER ALTERNATIVE

Two gates: update (merge of forget + input) and reset (how much past to expose when computing the candidate). No separate cell state — hidden state doubles as memory.

Params vs LSTM: ≈25% fewer (factor 3 not 4).
Pick GRU when: smaller data, shorter sequences, tight compute or latency.
Pick LSTM when: more data, very long dependencies, more tuning budget.
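The factor-4 vs factor-3 count can be checked directly from the parameter formula given for the LSTM. The layer sizes are illustrative, and the formula assumes one bias vector per gate block; some libraries keep two, so their reported counts run slightly higher.

```python
def gated_rnn_params(d_in, d_h, num_blocks):
    """num_blocks * [(d_in + d_h) * d_h + d_h]: 4 blocks for LSTM, 3 for GRU."""
    return num_blocks * ((d_in + d_h) * d_h + d_h)

lstm_params = gated_rnn_params(d_in=128, d_h=256, num_blocks=4)
gru_params = gated_rnn_params(d_in=128, d_h=256, num_blocks=3)
saving = 1 - gru_params / lstm_params     # fraction of parameters saved by the GRU
```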

Empirically near-identical performance — benchmark both and decide.

PRACTICAL WIRING
| Task topology | What you do | Loss wiring |
| --- | --- | --- |
| Many-to-one (sentiment) | Use final hidden state only | Single output loss |
| One-to-many (captioning) | Seed decoder with one input | Per-step loss |
| Aligned many-to-many (NER/POS) | return_sequences=True, per-step output | Per-step loss + padding mask |
| Seq2seq (translation) | Encoder last state → decoder init | Per-step decoder loss |
return_sequences: True → 3D output (all timesteps); False → 2D (last only). Stacking recurrent layers? It must be True on the lower layers.
Bidirectional: concatenate forward + backward states; doubles output dim. Cannot be used for generation or streaming (future not available).
Variable-length batches: pad + mask; PyTorch: pack_padded_sequence / pad_packed_sequence. Unmasked padding corrupts hidden state and loss.
Teacher forcing: feed the ground-truth previous token during training. Faster, but causes exposure bias: at inference the model conditions on its own, possibly wrong, predictions, which it never saw in training. Fix: scheduled sampling.
Dropout: standard on inputs/outputs; recurrent (variational) dropout uses a single fixed mask per sequence on the hidden-to-hidden connections; resampling per step destroys memory.
Stateful LSTM: carries state across batches for one long split sequence; reset state between independent sequences and don't shuffle data.
RNNs vs TRANSFORMERS — WHEN TO USE WHICH
| Axis | RNN/LSTM/GRU | Transformer |
| --- | --- | --- |
| Parallelism | Sequential; can't parallelize across timesteps | Fully parallel over sequence |
| Long-range | Mitigated by gates; still fades over hundreds of steps | Direct attention to any position |
| Memory footprint | O(sequence length) computation, O(hidden) state | O(n²) attention; hurts on very long seqs |
| Still use RNNs for | Streaming/online inference, edge devices, small data, tight latency | Everything else at scale |
⚠ Clears up — "gates fix vanishing gradients" Saying "the gates fix it" is imprecise. The real fix is the additive cell-state update, which creates a near-identity gradient highway. Gates control what flows through that highway; without the additive path the gates alone would not prevent multiplicative gradient decay. Also: LSTMs mitigate vanishing gradients — they don't eliminate them entirely.
◆ Interview probe "Why does clipping fix exploding but not vanishing gradients?" → Clipping rescales the gradient vector when its norm is too large — it only fires when gradients are too big. Vanishing gradients are already near zero, so clipping never triggers; the cure there is architectural (LSTM cell state, attention), not numerical.
Remember   LSTMs beat vanilla RNNs because their cell-state update is additive, not multiplicative — gradients flow through it like a highway, not through a stack of squashing functions.
Tricky interview questions 12
Why do vanilla RNNs suffer from vanishing/exploding gradients?
BPTT multiplies by the same recurrent weight matrix $W_h$ and the activation's Jacobian at every timestep, so over $T$ steps the gradient scales roughly like $\|W_h\|^T$ (a $T$-fold product, not a transpose), exponentially shrinking or blowing up with sequence length. It's a temporal depth problem. Trap: blaming stacked layer depth instead of the repeated multiplication by the same weight over time.
How does the LSTM cell state actually fix vanishing gradients — be precise.
The cell-state update is additive: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. During backprop the gradient through $C$ has a local derivative of $f_t$ (the forget gate), not a product of tanh/sigmoid derivatives. With $f_t \approx 1$ the gradient flows many steps nearly unchanged. Trap: saying "the gates fix it" without explaining that it's the additive path, not the gates per se, that creates the gradient highway.
What activation does each LSTM gate use and why?
Gates (forget, input, output) use sigmoid (range 0–1) to act as soft on/off masks; the candidate cell uses tanh (range −1 to 1) so the LSTM can add or subtract information. The output $h_t = o_t \odot \tanh(C_t)$ — the cell state is squashed before being exposed. Trap: claiming the output gate emits the raw cell state, or swapping which activation each component uses.
LSTM vs GRU — when do you pick each?
GRU merges forget + input into one update gate, has no separate cell state, uses ≈25% fewer parameters, and trains faster. Performance is empirically similar. Lean toward GRU for smaller data, shorter sequences, or tight compute; LSTM for large data and very long dependencies. In practice, benchmark both. Trap: claiming one is universally better — it depends on data size, sequence length, and compute budget.
How do you handle exploding gradients in practice?
Use gradient clipping by global norm (e.g., PyTorch's clip_grad_norm_ with threshold 1.0–5.0), which rescales the entire gradient vector proportionally, preserving direction. Symptoms to watch: NaNs, loss spikes. Also lower the learning rate. Trap: using clip-by-value instead of clip-by-norm — clip-by-value distorts gradient direction. Also: clipping fixes exploding, not vanishing, gradients.
How do you handle variable-length sequences in a batch?
Pad to a common length and mask padded positions so they don't contribute to the loss or hidden-state updates. In PyTorch, use pack_padded_sequence / pad_packed_sequence so the RNN skips padding entirely. Bucket similar-length sequences to reduce waste. Trap: padding without masking — unmasked padding corrupts the final hidden state, the loss, and metrics.
What is return_sequences and when should it be True?
return_sequences=True outputs the hidden state at every timestep (3D tensor); False outputs only the last step (2D). Use True when stacking another recurrent layer or for aligned many-to-many tasks (NER, POS tagging); False for many-to-one (classification). Trap: stacking LSTM layers without setting return_sequences=True on lower layers — the next layer needs a sequence, not a single vector.
What is teacher forcing and what problem does it create?
Teacher forcing feeds the ground-truth previous token as the decoder input during training, which speeds and stabilizes learning. At inference, the model uses its own predictions, so errors compound — this train/test mismatch is called exposure bias. Mitigation: scheduled sampling (gradually switching to model predictions during training). Trap: not knowing about exposure bias, or not offering a mitigation.
When can you use a bidirectional LSTM, and when can't you?
BiLSTMs run forward and backward passes and concatenate states, giving each token full past + future context — great for classification, NER, and tagging. You cannot use them for autoregressive generation or real-time/streaming inference because future input isn't available yet. Trap: proposing BiLSTM for a generative or streaming task. Also: BiLSTM doubles the output dimension.
How do you apply dropout correctly to an LSTM?
Standard dropout on inputs and outputs (feedforward connections) is fine. For recurrent connections, use variational/recurrent dropout: sample one fixed mask per sequence and reuse it across all timesteps, rather than resampling at every step. Trap: independently resampling the dropout mask on hidden-to-hidden connections at each step — this destroys the hidden state's memory continuity.
Your RNN only learns short-range patterns and ignores early context. How do you debug and fix it?
Classic vanishing-gradient / limited-memory symptom. Inspect per-timestep gradient norms to confirm decay. Fixes: switch from vanilla RNN to LSTM/GRU, ensure BPTT window isn't truncated too short, add gradient clipping, add attention, verify no premature sequence resets or masking errors. Trap: just increasing hidden size or training longer — the root cause is gradient/memory decay, so architectural fixes matter more than capacity.
Why did Transformers largely replace RNNs, and when do RNNs still make sense?
RNNs are inherently sequential — no parallelism across timesteps — and still struggle with very long dependencies. Transformers parallelize over the full sequence and access any position via direct attention, scaling far better on modern hardware. RNNs still fit streaming/online inference, edge/low-latency devices, tight-memory settings, and small-data regimes. Trap: declaring RNNs obsolete with no nuance. The real differentiator is parallelism and direct long-range access, not just accuracy.
16
PART III · ARCHITECTURES

Transformers & attention

🎯Attention is a soft lookup: every token asks every token, and softmax decides who to listen to.
[Figure: Scaled dot-product attention, a soft, content-based lookup. Q (query), K (keys), V (values) → scores = QKᵀ/√d → softmax → weights → weighted sum × V → out. Each token asks (Q) every token's key (K); softmax turns match scores into weights that mix the values (V).]
Attention is a differentiable dictionary: $\text{softmax}(QK^\top/\sqrt{d})\,V$. The $\sqrt{d}$ keeps the dot products from saturating softmax. Multi-head runs several in parallel to capture different relations; self-attention lets every token see every other in one step (O(n²)).
[Figure: A Transformer block (pre-norm). input tokens → LayerNorm → Multi-Head Attn → + (residual) → LayerNorm → Feed-Forward (MLP) → + (residual) → next block, ×N.]
The repeating unit: LayerNorm → attention → residual, then LayerNorm → MLP → residual. Attention mixes information across tokens; the MLP transforms each token; residuals + norm keep the deep stack stable. Stack N of these = a GPT/BERT.

A transformer is a stack of "attend then transform" blocks: every token asks a question (Q), scans all keys (K) to find what's relevant, and pulls the corresponding values (V) — then a per-token MLP digests the result. Repeat 12–96 times and you have a modern LLM.

SELF-ATTENTION: THE CORE OPERATION
$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
(softmax of scaled dot-products, applied to values)

Every token attends to every other token in one shot. Why divide by $\sqrt{d_k}$? Dot products of $d_k$-dim unit-variance vectors have variance $\sim d_k$; without scaling, logits explode and softmax collapses to near-one-hot → vanishing gradients. Dividing by $\sqrt{d_k}$ keeps logit variance $\approx 1$.
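The variance argument above is easy to check empirically. The sketch below draws random unit-variance vectors and measures the variance of their dot products with and without the $\sqrt{d_k}$ scaling; the sample count, seed, and function name are arbitrary choices for illustration.

```python
import random
import statistics

def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Return (variance of raw q.k, variance of q.k / sqrt(d_k)) for N(0, 1) entries."""
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / d_k ** 0.5 for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

raw_var, scaled_var = dot_product_variance(d_k=64)
```

With $d_k = 64$ the raw variance should land near 64 and the scaled variance near 1, up to sampling noise.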

Q (Query)"What am I looking for?" K (Key)"What do I advertise for matching?" V (Value)"What do I actually pass on?"

Q, K, V are separate learned projections — this lets attention be asymmetric (i attending to j ≠ j attending to i) and decouples matching space from content space.
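To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention on lists of lists. It assumes Q, K, V have already been produced by the learned projections and ignores batching, masking, and multiple heads.

```python
import math

def softmax(row):
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v. Returns (outputs, attention weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # match score of this query against every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # weighted sum of the value vectors
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[10.0, 0.0]]                 # one query that strongly matches the first key
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out, attn = attention(Q, K, V)
```

Because the query aligns with the first key, nearly all the weight goes to the first value vector, which is the "soft lookup" behavior described above.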

MULTI-HEAD ATTENTION & THE TRANSFORMER BLOCK

Multi-head: run $h$ attention heads in parallel, each on a $d_k = d_{model}/h$ subspace, then concat + linear project. With $d_{model}$ fixed this costs the same FLOPs but each head learns a different relation type (syntax, coreference, position, …).
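The "same cost" claim follows from simple arithmetic. The sketch below counts projection weights for a multi-head attention layer under the assumption $d_k = d_{model}/h$; the function name and the example width of 768 are illustrative.

```python
def mha_param_count(d_model, n_heads, bias=False):
    """Weights in Q, K, V and output projections when each head has width d_model / n_heads."""
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    per_head = 3 * d_model * d_head        # one Q, K, V projection per head
    out_proj = d_model * d_model           # concat of heads -> d_model
    params = n_heads * per_head + out_proj
    if bias:
        params += 4 * d_model              # one bias vector per projection
    return params

single_head = mha_param_count(768, 1)
twelve_heads = mha_param_count(768, 12)
```

Both calls return $4 \times 768^2$: splitting into heads changes how the width is partitioned, not how many weights there are.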

The Pre-LN block (standard today):

$$x \leftarrow x + \text{MHA}(\text{LN}(x))$$
(residual + multi-head attn on layer-normed input)
$$x \leftarrow x + \text{FFN}(\text{LN}(x))$$
(residual + position-wise MLP)

FFN width is typically $4\times d_{model}$ with GELU or SwiGLU; it operates per token independently (cross-token mixing is only in attention). LayerNorm not BatchNorm: normalizes across features per token — no batch-size or sequence-length dependence, identical at train and inference.
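The two Pre-LN update equations can be sketched structurally in plain Python. This is a toy under stated assumptions: `layer_norm` has no learned gain or bias, and `mean_mix` and `relu_mlp` are hypothetical stand-ins for real attention and a real FFN, used only to show where cross-token mixing and per-token transformation happen.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize ONE token's features to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(tokens, mix_tokens, per_token_mlp):
    """x <- x + mix(LN(x)), then x <- x + mlp(LN(x)); tokens is a list of feature vectors."""
    mixed = mix_tokens([layer_norm(t) for t in tokens])            # cross-token mixing
    tokens = [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixed)]
    # the MLP sees each token independently
    return [[a + b for a, b in zip(t, per_token_mlp(layer_norm(t)))] for t in tokens]

def mean_mix(normed_tokens):
    """Toy stand-in for attention: every token receives the average of all tokens."""
    n = len(normed_tokens)
    avg = [sum(col) / n for col in zip(*normed_tokens)]
    return [avg[:] for _ in normed_tokens]

def relu_mlp(vec):
    """Toy stand-in for the FFN."""
    return [max(0.0, v) for v in vec]

tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 5.0, 2.0]]
block_out = pre_ln_block(tokens, mean_mix, relu_mlp)
```

If both sublayers output zeros, the block returns its input unchanged: the residual path is an identity highway, which is the stability argument in a nutshell.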

POSITIONAL ENCODINGS (ATTENTION IS PERMUTATION-EQUIVARIANT)
Scheme | How | Extrapolates? | Use today?
Sinusoidal | Fixed trig added to embeddings | Somewhat | Rare (classic)
Learned absolute | Trainable emb per position | No (fails past max) | BERT-era
RoPE | Rotate Q & K by position angle | Yes (relative) | LLaMA and most recent open LLMs
ALiBi | Subtract linear bias from attn scores | Yes | Some LLMs

Sinusoidal/learned encodings are added to input embeddings; RoPE is applied inside attention to Q and K — a common interview trip-wire.
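A small sketch shows why RoPE counts as a relative scheme: after rotating Q and K by their positions, the dot product depends only on the position offset. The code assumes the interleaved pairing (x0, x1), (x2, x3), ... and the conventional base of 10000; some libraries use a split-half layout instead, so treat the layout here as one possible convention.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs by angles proportional to the token position."""
    d = len(vec)
    assert d % 2 == 0
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)      # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
score_near_start = dot(rope(q, 3), rope(k, 1))    # offset 2
score_shifted = dot(rope(q, 10), rope(k, 8))      # same offset, later in the sequence
```

The two scores agree up to rounding, even though the absolute positions differ.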

ARCHITECTURE VARIANTS & COMPLEXITY
Variant | Attention | Best for | Example
Encoder-only | Bidirectional | Classification, NER, embeddings | BERT
Decoder-only | Causal mask | Open-ended generation, LLM scaling | GPT, LLaMA
Encoder-decoder | Bidirectional + cross-attn | Translation, summarization | T5, BART

Causal mask: add $-\infty$ to upper-triangular logits before softmax so future tokens get weight 0. Bug: setting to 0 (not $-\infty$) still gets nonzero weight; applying after softmax does nothing.
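The masking bugs above are easy to reproduce. In this minimal sketch the masked logits are overwritten with the mask value (equivalent to adding $-\infty$ to them), and the function names are illustrative.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores, mask_value=NEG_INF):
    """scores: n x n logits. Entries with j > i (the future) are replaced by mask_value."""
    n = len(scores)
    masked = [[scores[i][j] if j <= i else mask_value for j in range(n)] for i in range(n)]
    return [softmax(row) for row in masked]

scores = [[1.0, 2.0, 3.0]] * 3
correct = causal_attention_weights(scores)                  # future weight is exactly 0
buggy = causal_attention_weights(scores, mask_value=0.0)    # future tokens still get weight
all_masked = softmax([NEG_INF, NEG_INF])                    # -inf minus -inf is NaN
```

The last line also shows the all-masked-row failure: a row with no visible position produces NaN weights rather than zeros.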

Complexity: $O(n^2)$ compute and memory in sequence length $n$ — the core bottleneck. FlashAttention is exact but IO-aware (tiles computation, never materializes the full $n \times n$ matrix → memory $O(n)$ vs $O(n^2)$, no quality loss). Sparse/sliding-window attention is approximate.
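The "exact but tiled" idea rests on an online softmax: process keys in blocks while carrying a running max, a running normalizer, and a running weighted sum. The sketch below shows only that idea for a single query in plain Python; it is not the actual FlashAttention kernel, which also tiles queries and keeps blocks in on-chip SRAM. The block size and data are arbitrary.

```python
import math

def naive_attention_row(q, K, V):
    """Reference: materialize all scores for one query, then softmax and mix values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, V)) / total for c in range(len(V[0]))]

def tiled_attention_row(q, K, V, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk, v_blk = K[start:start + block_size], V[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)   # rescale old state; 0.0 on the first block
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.3, -0.2, 0.5, 0.1]
K = [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, 0.5, 0.5], [-2.0, 1.0, 0.0, 3.0],
     [0.0, 0.0, 1.0, 1.0], [4.0, -1.0, 0.5, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [3.0, -3.0]]
full, tiled = naive_attention_row(q, K, V), tiled_attention_row(q, K, V)
```

The two results agree up to floating-point rounding, since the summation order differs between the two paths.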

KV cache: during autoregressive decode, K and V for all prior positions are fixed, so cache them and attend one new query per step instead of recomputing. Cost: memory grows as 2 (K and V) × batch × layers × kv_heads × seq_len × head_dim × bytes per element; at long context this dominates. MQA/GQA share K/V across query heads to shrink the cache; GQA (Llama 3, Mistral) is the standard tradeoff.
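A back-of-envelope calculator makes the cache cost tangible. The configuration below (32 layers, 32 query heads, head_dim 128, 8192-token context, 2-byte elements) is a hypothetical 7B-class setup chosen for round numbers, not any specific model's spec.

```python
def kv_cache_bytes(batch, layers, kv_heads, seq_len, head_dim, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, KV head, and position."""
    return 2 * batch * layers * kv_heads * seq_len * head_dim * bytes_per_elem  # 2 = K and V

GIB = 2 ** 30
mha = kv_cache_bytes(batch=1, layers=32, kv_heads=32, seq_len=8192, head_dim=128)  # one KV head per query head
gqa = kv_cache_bytes(batch=1, layers=32, kv_heads=8, seq_len=8192, head_dim=128)   # 4 query heads share a KV head
mqa = kv_cache_bytes(batch=1, layers=32, kv_heads=1, seq_len=8192, head_dim=128)   # all query heads share one

sizes_gib = {"MHA": mha / GIB, "GQA": gqa / GIB, "MQA": mqa / GIB}
```

For this setup the cache is 4 GiB with full multi-head K/V, 1 GiB with 8 KV heads, and 0.125 GiB with a single shared KV head, per sequence in the batch.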

TRAINING STABILITY & PRACTICAL DEFAULTS
Pre-LN vs Post-LN: Pre-LN gives better gradient flow and needs little/no warmup; Post-LN often reaches slightly better final quality but needs careful warmup.
Gradient clipping: norm ≤ 1.0 is standard; at the first sign of trouble, check gradient norms before touching the LR.
fp16 vs bf16: bf16 preferred for LLMs (wider dynamic range, fewer NaN/overflow issues).
FFN activation: GELU (classic), SwiGLU (modern, better perplexity at same params).
NaN mid-training: Check in order: gradient clipping → fp16 overflow (switch to bf16 or scale the loss) → all-masked attention rows (softmax over all $-\infty$ = NaN).
⚠ Clears up — "multi-head adds capacity" With $d_{model}$ fixed ($d_k = d_{model}/h$), multi-head attention has essentially the same parameter count and FLOPs as single-head. The gain is diversity of attention patterns across subspaces, not raw capacity. Don't claim it adds parameters.
◆ Interview probe "FlashAttention is faster. Does it change the model's outputs?" → No. FlashAttention computes exactly the same function as standard attention; it reorders the computation with tiling to avoid materializing the $n \times n$ matrix in HBM, cutting memory traffic. Outputs match standard attention up to floating-point rounding (the summation order differs, so they are not guaranteed to be bit-identical). It reduces IO cost, not asymptotic FLOPs.
Remember   The whole transformer is two ideas stacked: attention mixes across tokens (O(n²), needs positional encoding), and the FFN transforms each token in place — Pre-LN residuals tie them together stably.
Tricky interview questions 12
Why are attention scores divided by √d_k before softmax? What breaks if you don't?
Two d_k-dim unit-variance vectors have a dot product with variance ~d_k. Without scaling, logits grow with dimension and push softmax into a saturated near-one-hot regime with near-zero gradients. Dividing by √d_k keeps logit variance ≈ 1 so gradients stay healthy throughout training. Trap: saying it's "for normalization" generically, or dividing by d_k instead of √d_k — variance scales with d_k, so you take the square root.
What does multi-head attention buy you over a single big head with the same d_model?
Splitting into h heads lets the model jointly attend to different representation subspaces and relationship types (syntax, coreference, position) in parallel, each with its own Q/K/V projection. A single head forms only one attention distribution per query. With d_model fixed, parameter count is essentially unchanged — the gain is diversity of patterns, not capacity. Trap: claiming multi-head adds parameters or compute; the head dimension shrinks proportionally.
Why do transformers need positional encodings? How do sinusoidal, learned, and RoPE compare?
Self-attention without positional information is permutation-equivariant (often loosely called permutation-invariant): shuffling the input tokens just shuffles the outputs, so the model cannot distinguish word order. Sinusoidal is parameter-free and extrapolates somewhat; learned absolute embeddings fit the training distribution but fail past the trained max length; RoPE injects relative position by rotating Q and K vectors and extrapolates better, which is why modern LLMs prefer it. Trap: saying positional info is added to attention scores. Sinusoidal/learned encodings are added to input embeddings, while RoPE is applied inside attention to Q and K.
What is the KV cache, why does it speed up generation, and what is its main cost?
During autoregressive decoding, K and V for all prior positions are unchanged each step, so caching them avoids recomputing attention over the whole prefix — each step only attends one new query against the cache. The cost is memory: cache size grows with batch × layers × kv_heads × seq_len × head_dim and becomes the dominant bandwidth bottleneck at long context. Trap: thinking the KV cache helps training or prefill — it only benefits incremental decode.
What is the time and memory complexity of self-attention, and how do FlashAttention vs sparse attention differ?
Standard self-attention is O(n²) in both compute and memory because it materializes an n×n attention matrix. FlashAttention is exact and IO-aware — it tiles computation to avoid writing the full matrix to HBM, so memory drops to O(n) with no quality loss. Sparse/sliding-window attention is approximate, restricting which tokens attend to which. Trap: confusing FlashAttention (exact, only IO-efficient) with sparse attention (approximate, fewer FLOPs) — FlashAttention does not reduce asymptotic FLOPs.
Explain causal masking: how is it implemented and what bugs can arise?
Causal masking prevents position i from attending to positions j > i. It is implemented by adding -∞ (a large negative constant) to the upper-triangular logits before softmax, so those positions get weight ≈ 0. Bugs: setting masked entries to 0 instead of -∞ (0 still gets nonzero softmax weight), applying the mask after softmax (does nothing), or off-by-one masking that prevents a token from seeing itself. Trap: using 0 as the mask value — it is not large enough to zero out the softmax output.
When would you choose encoder-only vs decoder-only vs encoder-decoder?
Encoder-only (BERT) with bidirectional attention suits understanding tasks — classification, NER, embeddings — where the full input is known. Decoder-only (GPT) with causal attention suits open-ended generation and dominates LLM scaling. Encoder-decoder (T5) fits seq2seq tasks like translation and summarization where the source and target are distinct and cross-attention is beneficial. Trap: defaulting to "decoder-only is always best" — for pure understanding or dense retrieval, a bidirectional encoder is more parameter-efficient and accurate.
What are MQA and GQA, and what problem do they solve?
Multi-Query Attention (MQA) shares a single K/V head across all query heads; Grouped-Query Attention (GQA) shares K/V across small groups of query heads. Both reduce KV cache size and memory bandwidth at decode time — the binding constraint for long-context inference. GQA is the standard middle ground: nearly MHA quality at close to MQA speed (used in Llama 3, Mistral). Trap: saying MQA/GQA reduce query heads — they reduce only K/V heads. Query heads remain unchanged.
Why do transformers use LayerNorm instead of BatchNorm?
LayerNorm normalizes across features for each token independently, so it works for any batch size, any sequence length, and is identical at train and inference. BatchNorm normalizes across the batch dimension — unstable for small or variable batches, awkward for variable-length sequences, and introduces a train/inference mismatch via running statistics. Trap: just saying BatchNorm "doesn't work" — the precise reason is that per-sample normalization removes batch and length dependence and the statistics mismatch.
Pre-LN vs Post-LN: what changes, and does it affect learning-rate warmup?
Post-LN (original "Attention is All You Need") places LayerNorm after the sublayer and residual addition; gradients near the output can be large early in training, so aggressive LR warmup is needed. Pre-LN places LayerNorm before the sublayer input, giving more stable gradient flow through depth; models can train with little or no warmup and are more robust to hyperparameters, at a small potential cost to final perplexity. Trap: treating LR warmup as always mandatory. It is largely a workaround for Post-LN instability, and Pre-LN needs far less of it.
Why are Q, K, V separate learned projections rather than raw embeddings?
Separate projections let the model learn distinct subspaces for querying, matching, and content transmission. They also make attention asymmetric: i attending to j can differ from j attending to i because Q and K live in different spaces. Raw embeddings would force symmetric matching and conflate the matching signal with the transmitted content. Trap: claiming Q and K could be the same matrix — tying them makes attention symmetric and removes directional relationship learning.
Your transformer's training loss goes NaN mid-run. How do you debug it?
Check in order: (1) gradient norms — add clipping at norm 1.0 if exploding; (2) fp16 overflow — switch to bf16 or enable dynamic loss scaling; (3) all-masked attention rows — if every logit in a row is -∞, softmax outputs NaN; (4) data issues — bad tokens, padding, or a corrupted batch. Reduce LR last, after ruling out the above. Trap: jumping straight to "lower the LR" without first checking gradient clipping, fp16 overflow, or all-masked attention rows.
17
PART III · ARCHITECTURES

Transfer learning & fine-tuning (LoRA / PEFT)

🎯Don't retrain the giant — freeze it and bolt on a tiny LoRA adapter.
[Figure: LoRA: freeze the big weights, train a tiny low-rank patch. W (pretrained, frozen, d × d, millions of params) in parallel with the low-rank path A (r × d) then B (d × r); output = Wx + BAx. Train only A, B (rank r ≪ d) → <1% of the params, tiny checkpoints, no extra inference cost once merged.]
Full fine-tuning updates billions of weights. LoRA freezes $W$ and learns a low-rank update $\Delta W = BA$ (rank $r\ll d$), cutting trainable params by 100-1000×. The dominant parameter-efficient fine-tuning (PEFT) method; QLoRA adds 4-bit quantization of the frozen base.

A pretrained model is a compressed encyclopedia of the world — fine-tuning just teaches it your dialect. Get this hierarchy right and you'll spend pennies instead of millions on compute while still beating from-scratch baselines.

THE THREE-TIER LADDER
Option | When to use | Cost
Feature extraction (frozen backbone) | Small data, domain near pretraining | Cheapest; one head per task
PEFT / LoRA | Medium data, want efficiency; LLM adaptation | ~1% params trained; tiny checkpoints
Full fine-tuning | Large data, very different domain, max quality | Full copy per task; catastrophic forgetting risk

Rule of thumb: always start frozen, unfreeze top layers only when val loss plateaus, and only unfreeze everything if data is abundant.

LoRA IN ONE PARAGRAPH

Freeze the pretrained weight matrix $W \in \mathbb{R}^{d \times k}$. Learn a low-rank correction:

$$\Delta W = B A, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)$$
($\Delta W$ equals $B$ times $A$, with rank $r$ much smaller than the weight's dimensions)

The forward pass adds $\frac{\alpha}{r} BA$ to $W$, i.e. $h = Wx + \frac{\alpha}{r}BAx$. After training, merge: $W' = W + \frac{\alpha}{r}BA$, so there is zero inference overhead. You train <1% of parameters, and checkpoints are megabytes, not gigabytes.
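The merge claim can be verified on a toy example: running the adapter as a side path gives the same output as folding it into the weight. The matrices, $\alpha$, and $r$ below are made-up small values, and the helper names are illustrative.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def lora_forward(W, A, B, x, alpha, r):
    """Unmerged path: frozen W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def merge(W, A, B, alpha, r):
    """Fold the adapter into the weight: W' = W + (alpha / r) * B A."""
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0], [3.0, 0.0, 0.0, 1.0]]   # d=3, k=4, frozen
B = [[0.5], [-1.0], [2.0]]                                               # d x r, r=1
A = [[1.0, 2.0, 0.0, -1.0]]                                              # r x k
x = [1.0, 2.0, 3.0, 4.0]

unmerged = lora_forward(W, A, B, x, alpha=2.0, r=1)
merged = matvec(merge(W, A, B, alpha=2.0, r=1), x)
trainable, frozen = len(B) * len(B[0]) + len(A) * len(A[0]), len(W) * len(W[0])
```

Both paths yield the same vector, and the merged model is a single matrix multiply with the original shape, which is where the zero-overhead property comes from.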

rank $r$: 4–8 for formatting/style; 16–64 for hard domain shift.
alpha $\alpha$: commonly set to $2r$; the ratio $\alpha/r$ is the real effective scale.
target modules: default q_proj, v_proj; extend to all linear layers (MLP, k, o) for more capacity.

QLoRA = LoRA on a 4-bit NF4 frozen base + double quantization (saves ~0.37 bits/param on quant constants) + paged optimizers (absorb memory spikes that would otherwise OOM). Enables ~65B fine-tuning on a single 48 GB GPU. The base stays quantized; only the 16-bit adapters (bf16 in the QLoRA paper) get gradient updates.
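The ~0.37 bits/param figure can be reproduced with a little arithmetic. The block sizes and bit widths below (64-weight blocks with 32-bit constants, then 8-bit constants regrouped in blocks of 256 with a 32-bit second-level constant) are the settings reported in the QLoRA paper as best I recall them, so treat them as assumptions; the function names are illustrative.

```python
def quant_constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight when each block stores one full-precision scaling constant."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, first_bits=8, second_block=256, second_bits=32):
    """Overhead after quantizing the constants themselves (assumed QLoRA settings)."""
    return first_bits / block_size + second_bits / (block_size * second_block)

def base_weight_gib(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 2 ** 30

plain = quant_constant_overhead_bits()          # 0.5 bits per parameter
double = double_quant_overhead_bits()           # about 0.127 bits per parameter
saving = plain - double
base_65b_gib = base_weight_gib(65e9, 4 + double)   # frozen 4-bit base only, no activations
```

Under these assumptions the frozen 65B base takes roughly 31 GiB, which leaves headroom on a 48 GB card for adapters, optimizer state, and activations.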

PEFT FAMILY AT A GLANCE
Method | Trainable params | Inference cost | Sweet spot
LoRA / QLoRA | ~0.1–1% | Zero (merged) | Default choice for LLMs
Adapter modules | ~0.5–5% | Added latency (unless fused) | NLP tasks with multiple heads
Prefix / prompt tuning | Tiny (soft tokens) | Longer KV cache | Very small footprint; weaker for big behavior shifts
Full fine-tuning | 100% | Zero overhead; one copy/task | Max quality, ample data
PRACTICAL FINE-TUNING CHECKLIST
Learning rate: 10–100× lower than pretraining; typically 1e-5 to 1e-4; always use warmup.
Discriminative LR: Earlier layers get smaller LR (generic features); later layers get larger LR (task-specific).
Epochs: 1–3 on small data; early stop on val loss.
Data quality: 50–500 clean examples beat 10k noisy ones for style/format tasks.
Catastrophic forgetting: Mitigate with PEFT, lower LR, replay data, EWC, or distillation from base.
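Discriminative learning rates and warmup from the checklist can be sketched as two small schedule functions. The geometric decay factor of 0.8 per layer and the linear warmup shape are illustrative assumptions, not values from these notes.

```python
def layerwise_lrs(n_layers, top_lr=1e-4, decay=0.8):
    """LR per layer, index 0 = closest to the input. Earlier layers get smaller LRs."""
    return [top_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]

def warmup_lr(step, peak_lr, warmup_steps):
    """Linear ramp from near zero to peak_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

lrs = layerwise_lrs(n_layers=4)                       # smallest LR first, top_lr last
schedule = [warmup_lr(s, peak_lr=1e-4, warmup_steps=100) for s in (0, 49, 99, 500)]
```

In a real framework each entry of `lrs` would be attached to its own optimizer parameter group, and the warmup factor would multiply all groups together.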
LLM ALIGNMENT PIPELINE: SFT → DPO / RLHF
SFT: Supervised fine-tuning on labeled demonstrations; teaches format and behavior by imitation.
RLHF: Train a reward model on human preference rankings, then optimize the policy with PPO; powerful but complex and unstable.
DPO: Direct Preference Optimization; skips the reward model and optimizes a classification-style loss on preference pairs; simpler, cheaper, often matches RLHF.

Typical order: pretrain → SFT → DPO/RLHF. Never skip SFT and jump straight to preference optimization — DPO assumes a reasonable starting policy.
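The "classification-style loss" behind DPO fits in a few lines. This sketch takes summed sequence log-probabilities for the chosen and rejected responses under the policy and under the frozen reference (normally the SFT model); the numeric inputs and $\beta = 0.1$ are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(policy - ref) on chosen  minus  (policy - ref) on rejected])."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)   # policy shifted toward the chosen answer
worse = dpo_loss(-11.0, -11.0, -10.0, -12.0)     # policy shifted toward the rejected answer
```

At initialization the loss is $\log 2$ because the policy equals the reference, which is one way to see why DPO leans on a sensible SFT starting point: the loss only measures movement relative to it.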

Serving many LoRA variants: keep adapters separate and swap them per request (S-LoRA / Punica batch across adapters with custom CUDA kernels), which serves thousands of variants from one copy of the base model on a single GPU. Merge only when you serve a single task and want zero latency overhead.

⚠ Clears up — Fine-tuning vs. RAG for knowledge injection Fine-tuning does not reliably inject new facts — it bakes in stale information, hallucinates confidently, and is expensive to refresh. Use RAG when you need fresh, factual, or citable knowledge. Use fine-tuning to change behavior: style, format, tone, output structure, or to teach a skill the model lacks. They compose well — fine-tune the model to better use retrieved context (RAFT).
◆ Interview probe "Your LoRA-tuned model aces the eval set but regresses on general capabilities in production." → Check for catastrophic forgetting by benchmarking the merged model against the base on general tasks. Root causes: over-training on narrow data, no replay, or evaluating only on the fine-tune distribution. Fix: add general-domain replay, reduce epochs/LR, switch to LoRA-without-merge for modularity, and build a regression suite of general-capability inputs to catch degradation before deployment.
Remember   LoRA freezes the base and learns $\Delta W = BA$ at rank $r \ll d$ — <1% params, mergeable, no inference tax; QLoRA does the same on a 4-bit base; for LLMs the recipe is pretrain → SFT → DPO, and fine-tune for behavior, RAG for facts.
Tricky interview questions 12
When would you freeze the backbone versus fully fine-tune, and how do you decide how many layers to unfreeze?
Freeze when data is small or domain is close to pretraining — fewer trainable params means less overfitting and lower cost. Unfreeze progressively from the top (task-specific layers) down while watching val loss, stopping when it stops improving. Full fine-tuning pays off only when data is plentiful and the domain differs substantially. Trap: Unfreezing the whole network on a small dataset destroys pretrained features and overfits; assuming more unfreezing is always better.
What is LoRA and why does it add zero inference latency once deployed?
LoRA freezes pretrained weights W and trains a low-rank update ΔW = BA (rank r ≪ d), touching fewer than 1% of parameters. After training you merge: W' = W + (α/r)BA. The adapter disappears into the weight matrix, so the deployed model has identical architecture and latency to the base. Trap: Saying LoRA always adds latency; that's only true if you leave adapters unmerged. Also, a small adapter doesn't shrink the frozen base's memory footprint: the full base weights must still be loaded.
How do you choose LoRA rank r and alpha, and which modules should you target?
Use r = 4–8 for simple style/format tasks; r = 16–64 for hard domain adaptation. Set alpha ≈ 2r (the ratio α/r is the real effective scale). Default targets are q_proj and v_proj; extend to all linear layers (MLP, k, o projections) when you need more capacity. Trap: Cranking r very high loses LoRA's efficiency and regularization benefit; raising r without adjusting alpha silently changes the effective update magnitude.
Explain QLoRA — what do NF4, double quantization, and paged optimizers each contribute?
QLoRA trains LoRA adapters while backpropagating through a frozen 4-bit base. NF4 is a 4-bit data type calibrated for normally-distributed weights. Double quantization quantizes the quantization constants themselves, saving ~0.37 bits per param. Paged optimizers use unified GPU/CPU memory to absorb optimizer-state spikes, preventing OOM — together enabling ~65B fine-tuning on a single 48 GB GPU. Trap: The base weights are never updated — only the fp16 adapters get gradients. You cannot cleanly merge an fp16 adapter into a still-4-bit base without dequantizing first.
What is catastrophic forgetting and how do you mitigate it?
Catastrophic forgetting is when fine-tuning on a narrow task degrades the model's prior general capabilities. Mitigate with PEFT/LoRA (base stays frozen), very low LR and few epochs, rehearsal/replay of general-domain data, EWC regularization toward original weights, or knowledge distillation from the base model. Trap: Assuming LoRA fully prevents forgetting — it greatly reduces it, but merging an over-trained adapter can still degrade general performance; always run a general-capability regression benchmark.
When should you fine-tune versus use RAG versus just prompt engineer?
Start with prompt engineering (free, fast). Use RAG when you need fresh, factual, or citable knowledge that changes over time. Fine-tune to change persistent behavior: style, format, tone, structured outputs, or to teach a skill that can't be demonstrated in context. They compose: fine-tuning a model to better use retrieved context (RAFT) often beats either alone. Trap: Reaching for fine-tuning to inject facts — fine-tuning bakes in stale data and is the wrong tool for dynamic knowledge; that's RAG's job.
Why use a lower learning rate when fine-tuning, and what are discriminative learning rates?
Use an LR 10–100× smaller than pretraining (typically 1e-5 to 1e-4) to adapt without destroying pretrained features. Discriminative LRs (from ULMFiT) assign smaller LRs to earlier layers (which hold generic, transferable features) and larger LRs to later layers (task-specific). Always include a warmup schedule to avoid instability in early steps. Trap: Using the original pretraining LR wrecks pretrained weights in the first few steps; skipping warmup causes gradient spikes.
How much data do you actually need to fine-tune, and how do you think about quality vs. quantity?
Style/format tasks can improve with 50–500 high-quality examples; domain adaptation typically needs 1k–10k+; new reasoning capabilities need more. A few hundred clean, diverse, consistently-formatted examples routinely beat thousands of noisy ones. Start small, evaluate on a held-out set, and add data only if the metric hasn't plateaued. Trap: Assuming more data is always better — bad labels and format inconsistency hurt more than small dataset size.
Distinguish SFT, RLHF, and DPO. When would you reach for each?
SFT trains on labeled demonstrations to teach format and behavior by imitation — always the first step. RLHF trains a reward model on human preference rankings then optimizes a policy with PPO — powerful but complex, expensive, and prone to reward hacking. DPO skips the reward model and optimizes a classification-style loss directly on preference pairs — simpler, cheaper, and often matches RLHF quality. Typical pipeline: pretrain → SFT → DPO (or RLHF for high-stakes alignment). Trap: Skipping SFT and jumping straight to DPO/RLHF — both assume a reasonable starting policy that SFT provides.
How do you serve thousands of fine-tuned LoRA variants efficiently at inference time?
Load the base model once and keep adapters separate. Systems like S-LoRA and Punica batch requests across different adapters using custom CUDA kernels, serving thousands of variants from a single GPU. Merge the adapter into weights only when you serve a single fixed task and want zero added latency, at the cost of losing modularity. Trap: Merging when you actually need to swap tasks, or trying to merge an fp16 LoRA adapter directly into a still-quantized 4-bit base without dequantizing first.
Your fine-tuned model aces the eval set but regresses on general capabilities in production. How do you debug it?
First confirm catastrophic forgetting by benchmarking the merged model against the base on standard general-capability tasks (MMLU, HellaSwag, etc.). Then check for distribution mismatch between your eval set and real traffic. Fixes: add general-domain replay data to the fine-tune mix, reduce LR and epochs, switch to LoRA (keeps base frozen), and build a regression suite of general inputs that runs on every fine-tune iteration. Trap: Trusting a narrow eval set as proof of quality; not maintaining a general-capability regression benchmark.
Should you freeze or train pretrained embeddings, and what changes your answer?
Freeze embeddings when data is small or vocabulary overlaps heavily with pretraining — avoids overfitting and preserves the learned semantic geometry. Train embeddings (at a smaller LR) when you have ample in-domain data or many out-of-vocabulary / domain-specific tokens. A common middle ground: freeze for the first N epochs, then unfreeze late in training at a reduced LR. Trap: Training embeddings from scratch on a small dataset destroys pretrained semantic structure and overfits rare tokens.
18
PART IV · SCALE & SOTA

Scaling, pretraining & modern LLMs

🎯Loss falls as a power law in compute; scale data and params together (Chinchilla).
[Figure: Scaling laws: loss falls as a power of compute. x-axis: log compute (FLOPs); y-axis: log loss; a straight line on a log-log plot, loss ∝ C^(−α).]
Test loss drops as a power law in model size, data, and compute — a straight line on a log-log plot. Chinchilla: for a compute budget, scale params and tokens together (~20 tokens/param) rather than just making the model bigger. This predictability is why labs bet on scale.

Modern LLMs are giant decoder-only Transformers eating trillions of tokens — and the game is knowing how to balance parameters, data, and compute so every FLOP lands where it matters. Nail the scaling math, the parallelism ladder, and the alignment pipeline and you understand how every frontier model is built.

SCALING LAWS & CHINCHILLA

Test loss falls as a power law in model size $N$, dataset size $D$, and compute $C$. Log-log plot → straight line. The Chinchilla rule: for a fixed compute budget, scale parameters and tokens together.

$$C \approx 6ND, \qquad D^* \approx 20N$$
(FLOPs ≈ 6 × params × tokens; the empirically compute-optimal token count is ≈ 20 per parameter)
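Combining the two relations gives a closed form: with $D = 20N$, $C \approx 120N^2$, so $N^* \approx \sqrt{C/120}$. The sketch below applies it; the 70B-parameter / 1.4T-token sanity check uses Chinchilla's own published size, and the function names are illustrative.

```python
import math

def training_flops(n_params, n_tokens):
    """C ≈ 6 N D."""
    return 6.0 * n_params * n_tokens

def compute_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * (tokens_per_param * N) for N, then D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

budget = training_flops(70e9, 1.4e12)        # about 5.9e23 FLOPs
n_opt, d_opt = compute_optimal(budget)       # recovers roughly 70B params and 1.4T tokens
```

Doubling the compute budget therefore raises both the optimal parameter count and the optimal token count by about $\sqrt{2}$, rather than doubling either one alone.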

Practical twist: 20 tokens/param is the training-compute optimum, not the total-cost-of-ownership optimum. Llama-style models deliberately overtrain a smaller model on far more tokens so inference is cheaper — when you serve billions of requests, training cost is amortized away.

Kaplan (2020): scale model size faster than data; led to large undertrained models.
Chinchilla (2022): scale both equally; by this rule GPT-3 (175B params on ~300B tokens, under 2 tokens/param) was heavily undertrained.
Emergent abilities: capabilities that appear abruptly past a scale threshold (e.g. few-shot arithmetic).
DISTRIBUTED TRAINING: PARALLELISM LADDER

Escalate only as far as you must — each step adds communication overhead.

Strategy | What it shards | Use when
Data parallelism (DDP) | Data batch across replicas | Model fits on one GPU
ZeRO-1/2/3 (FSDP) | Optimizer states / gradients / params | Optimizer state or params overflow GPU
Tensor parallelism | Matmuls within a layer, across GPUs | Single layer too large; keep intra-node (NVLink)
Pipeline parallelism | Layer groups as sequential stages | Depth exceeds node memory; use micro-batches to reduce bubbles
3D / Megatron-style | Data + tensor + pipeline combined | 100B+ models across many nodes
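A rough way to decide how far up the ZeRO ladder to climb is to estimate per-GPU model-state memory at each stage. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for FP32 master weights plus two moment buffers); activations and temporary buffers are deliberately left out, and the function name is hypothetical.

```python
def model_state_bytes_per_gpu(n_params: float, n_gpus: int, zero_stage: int = 0) -> float:
    """Per-GPU bytes for params + grads + Adam states under mixed precision.

    Assumes 2 bytes/param for half-precision weights, 2 for gradients, and
    12 for optimizer state (FP32 master weights, momentum, variance).
    Activations and buffers are NOT included.
    """
    weights, grads, optim = 2.0 * n_params, 2.0 * n_params, 12.0 * n_params
    if zero_stage >= 1:  # ZeRO-1: shard optimizer states
        optim /= n_gpus
    if zero_stage >= 2:  # ZeRO-2: also shard gradients
        grads /= n_gpus
    if zero_stage >= 3:  # ZeRO-3 / FSDP: also shard parameters
        weights /= n_gpus
    return weights + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.2f} GB per GPU")
```

Pick the lowest stage whose estimate, plus activation memory, fits the device.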

LR scaling: effective batch = per-GPU batch × GPUs × accumulation steps. Scale LR roughly linearly (or $\sqrt{\text{batch}}$) with warmup to stabilize early high-variance gradients.
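The batch and learning-rate bookkeeping above can be sketched in a few lines. The helper names and the linear-ramp warmup shape are illustrative assumptions; which scaling rule works best is an empirical question for a given optimizer and model.

```python
import math


def effective_batch(per_gpu_batch: int, n_gpus: int, accum_steps: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu_batch * n_gpus * accum_steps


def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Rescale a learning rate tuned at base_batch for new_batch."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")


def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear ramp up to peak_lr over warmup_steps, then constant."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)
```

For example, `effective_batch(8, 256, 4)` gives 8192 samples per step, and the rescaled peak LR would then be fed through `warmup_lr` for the first steps.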

MEMORY TRICKS: PRECISION, CHECKPOINTING, ACCUMULATION
BF16: FP32 exponent range → stable without loss scaling; fewer mantissa bits; default on A100/H100
FP16: narrow range → needs dynamic loss scaling; use on V100 or older
Gradient checkpointing: recompute activations on backward pass; ~20-30% extra compute, big activation-memory savings
Gradient accumulation: run N micro-batches before optimizer step → larger effective batch without memory increase

These solve different problems: checkpointing cuts activation memory; accumulation raises effective batch size. A single micro-batch still uses its full activation memory even with accumulation.
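Why accumulation gives "big-batch math" can be checked on a toy problem. This sketch uses a one-parameter linear model with an MSE loss (an illustrative stand-in, not a framework API) and assumes equal-size micro-batches: averaging the micro-batch gradients reproduces the full-batch gradient.

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over one (micro-)batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average of per-micro-batch gradients (equal-size micro-batches assumed)."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Dividing by n_micro mirrors dividing the loss by N before backward().
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
full = mse_grad(0.5, xs, ys)                                # one big batch
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)   # four micro-batches
```

Only one micro-batch of activations is alive at a time, which is why accumulation raises the effective batch without raising peak activation memory; it does nothing for the memory of that single micro-batch.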

MIXTURE-OF-EXPERTS (MoE)

Each token is routed to top-$k$ experts out of $E$ total FFN blocks. Only the selected experts run, so FLOPs ≈ dense model of active size, while total parameter count is much larger.

Compute: scales with active params (e.g. ~13B active in Mixtral 8×7B)
Memory: scales with total params (~47B must reside in VRAM)
Load-balancing loss: auxiliary term penalizing expert utilization imbalance; without it, a few experts dominate and the rest are wasted
Capacity factor: multiplier that caps the tokens each expert accepts per batch; overflow tokens skip the FFN via residual

MoE wins when you have ample VRAM and many devices and want high quality per FLOP. Dense wins in memory-constrained settings.
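A minimal sketch of the routing step and of the compute-versus-memory split. Normalizing the gate weights with a softmax over only the selected logits is one common convention and an assumption here, as are the function names and the toy parameter counts.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest logits.

    Gate weights are a softmax over the selected logits only, one common
    convention; other routers normalize differently.
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]


def moe_param_counts(shared_params, params_per_expert, n_experts, k):
    """Total params (drives memory) vs active params per token (drives FLOPs)."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active


route = top_k_route([0.1, 2.0, -1.0, 2.0, 0.5, 0.0, 0.3, 1.0], k=2)
total, active = moe_param_counts(1.0, 5.0, n_experts=8, k=2)  # toy units
```

With 8 experts and k=2, the toy model holds 41 units of parameters in memory but touches only 11 per token.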

ALIGNMENT: SFT → RLHF / DPO

Three-stage pipeline: (1) SFT on human demonstrations, (2) train a reward model on preference comparisons, (3) optimize policy against the reward model.

PPO (RLHF): online RL with KL penalty to frozen SFT reference; stable but complex
DPO: offline; directly optimizes on preference pairs without a separate reward model
KL penalty: keeps policy close to SFT reference, limiting reward hacking and fluency collapse
Reward hacking: policy exploits reward-model flaws (verbosity, sycophancy); rising reward, flat human eval
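To see how DPO gets by without a reward model, here is a sketch of its per-pair loss: the negative log-sigmoid of a scaled difference of log-ratios against the frozen reference. The inputs are assumed to be summed token log-probabilities of each full response, and `beta=0.1` is just an illustrative default.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair.

    Inputs are summed token log-probabilities of each full response under the
    policy and under the frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the reference, and `beta` plays the role of the KL leash.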

Inference: autoregressive, one token at a time. KV-cache stores past key/value tensors to avoid recomputation; memory grows linearly with context length × batch size × layers × KV heads × head dimension.

⚠ Clears up — "bigger is always better" Scaling laws are power laws (straight lines on a log-log plot): each doubling of params cuts the loss by a fixed, predictable (small) fraction, so absolute returns diminish. What looks like an emergent ability is usually smooth improvement crossing a task-specific threshold. And MoE's extra parameters come with a memory bill: the compute discount does not carry over to VRAM, because total params must all be resident even though only a fraction run per token.
◆ Interview probe "Why does Llama train on more tokens than Chinchilla says is optimal?" → Chinchilla minimizes training FLOPs only. When a model serves massive inference traffic, overtraining a smaller model on extra tokens is cheaper overall: you pay once to train and save on every inference call forever. The 20-tokens-per-param rule ignores serving cost entirely.
Remember   Chinchilla balances params and tokens for training FLOPs, but production models break that rule intentionally — smaller models overtrained on more data are cheaper to serve at scale.
Tricky interview questions 12
What are the Chinchilla scaling laws, and how do you pick model size and token count for a fixed compute budget?
Loss falls as a power law in params N, tokens D, and compute C ≈ 6ND; the compute-optimal point scales N and D together, landing near ~20 tokens per parameter. In practice you fit the curves on small runs and extrapolate to choose N and D for your budget. Trap: citing the older Kaplan laws that favored bigger models over more data — many early large models (e.g. GPT-3) were badly undertrained because tokens were not scaled with size.
Chinchilla says ~20 tokens/param is compute-optimal. Why do Llama-style models train on far more tokens than that?
Chinchilla minimizes training cost only. If a model serves huge inference volume it pays to overtrain a smaller model on more tokens so each inference is cheaper and faster — training cost is amortized over a large inference lifetime. Trap: treating 20:1 as a hard universal rule; it is the training-compute optimum, not the total-cost-of-ownership optimum, and ignores serving cost entirely.
A model fits on a single GPU but you want to train across 256 GPUs. Which parallelism do you use and how do you adjust the learning rate?
Use data parallelism (DDP or FSDP): replicate the model, split the batch, all-reduce gradients. Effective batch = per-GPU batch × GPUs × accumulation; scale LR roughly linearly with batch size, add a warmup phase to stabilize early high-variance steps. Trap: reaching for tensor or pipeline parallelism when data parallelism suffices, or scaling batch without scaling LR/adding warmup — leaving accuracy on the table or causing early divergence.
Explain tensor vs pipeline parallelism and when you would use each.
Tensor parallelism splits individual layer matmuls across GPUs — heavy per-step communication so keep it intra-node on NVLink. Pipeline parallelism splits the model into sequential stages across nodes with less communication but introduces idle bubbles mitigated by micro-batching. Use tensor parallelism when a layer is too big for one GPU; pipeline when total depth exceeds node memory; combine both with data parallelism (3D parallelism) at massive scale. Trap: spreading tensor parallelism across nodes over slow interconnect — communication dominates and throughput collapses.
Walk through ZeRO stages 1, 2, and 3 (and FSDP). When do you escalate?
ZeRO-1 shards optimizer states, ZeRO-2 also shards gradients, ZeRO-3/FSDP also shards parameters and all-gathers them just-in-time per layer — each stage saves more memory at the cost of more communication. Escalate only as far as needed to fit: start stage 1, move to 2 then 3 as optimizer state and params stop fitting, since stage 3 has the highest comm overhead. Trap: jumping straight to ZeRO-3 when stage 1 or 2 would fit; with many tiny layers, ZeRO-3 can be communication-bound unless you tune prefetch and bucketing carefully.
When do you reach for gradient checkpointing vs gradient accumulation, and what does each cost?
Gradient checkpointing recomputes activations in the backward pass instead of storing them — ~20-30% extra compute, big memory savings. Gradient accumulation runs N micro-batches before an optimizer step to reach a large effective batch without holding it all in memory. Use checkpointing when activations are the bottleneck; accumulation when you need a bigger effective batch than fits. Trap: confusing the two — accumulation does not reduce the activation memory of a single micro-batch, and checkpointing does not change effective batch size.
FP16 vs BF16 for training large models: which do you pick and why?
BF16 has FP32's exponent range, so it is far more stable and needs no loss scaling; FP16 has more mantissa precision but a narrow range requiring dynamic loss scaling to avoid gradient underflow. On A100/H100 use BF16 by default; FP16 is mainly for older hardware (V100) lacking good BF16 support. Trap: using FP16 without loss scaling (silent underflow to zero gradients), or assuming BF16 needs no safeguards at all: with so few mantissa bits, FP32 master weights and FP32 accumulation for sensitive sums still help.
Your large-model training loss suddenly spikes and sometimes diverges. How do you diagnose and fix it?
Check the usual culprits: too-high LR or missing warmup, a bad data batch (corrupted/anomalous), FP16 overflow, exploding logits or attention weights. Mitigations: gradient clipping, longer LR warmup, switch FP16 to BF16, add QK-normalization or z-loss on logits, resume from a pre-spike checkpoint while skipping the offending data shard. Trap: just lowering the global LR and restarting from scratch — spikes are often data- or precision-induced, not purely an LR problem, and resuming from checkpoint is far cheaper.
Why does a Mixture-of-Experts model save compute but not memory, and when is MoE the right choice?
Only the top-k selected experts run per token so FLOPs scale with active (not total) parameters; but all experts must be resident in VRAM so memory scales with total parameters. MoE wins when you have ample VRAM and many devices and want high quality per FLOP; dense models win in memory-constrained settings. Trap: claiming MoE reduces memory like it reduces compute — Mixtral 8×7B needs VRAM for ~47B params even though only ~13B are active per token.
What is the load-balancing auxiliary loss in MoE and what breaks without it?
The auxiliary loss encourages the router to spread tokens evenly across experts; without it a few experts get favored in a self-reinforcing loop, leaving others undertrained and wasting capacity. A capacity factor (which drops overflow tokens past each expert's limit) works alongside it to keep utilization balanced. Trap: over-weighting the aux loss — it can hurt quality by forcing unnatural routing, and too-low capacity factor silently drops tokens that skip the FFN via the residual stream.
Describe the RLHF pipeline and the role of the KL penalty.
Three stages: SFT on demonstrations, training a reward model on human preference comparisons, then PPO to optimize the policy against the reward model (DPO is the offline alternative that skips the separate reward model and optimizes directly on preference pairs). The KL penalty to the frozen SFT reference keeps the policy from drifting too far, preserving fluency and limiting reward hacking. Trap: dropping or under-weighting the KL term; the policy then exploits reward-model quirks (verbosity, sycophancy) and produces high-reward but low-quality text.
What is reward hacking in RLHF, and how do you detect and mitigate it?
Reward hacking is the policy exploiting flaws in the reward model to get high scores without real quality gains — verbosity, sycophancy, formatting tricks. Detect via rising reward curve alongside flat or declining human evaluation and growing KL divergence. Mitigate with a stronger KL penalty, reward-model ensembling, retraining the reward model on fresh on-policy data, and length/format normalization. Trap: trusting the reward curve as the success metric — reward going up while held-out human preference stalls is the classic sign of reward-model overfitting.
19
PART IV · SCALE & SOTA

Efficiency & deployment: quantization, distillation, pruning, KV-cache

🎯Quantize, distill, prune, and cache the KV — same model, a fraction of the cost.

Your model is brilliant but too slow and fat to ship — quantization, distillation, pruning, and a well-managed KV cache are the four levers that collapse it into something production can actually run. Master which lever to pull for which constraint (latency vs. memory vs. cost) and you're ahead of most ML engineers.

QUANTIZATION — FEWER BITS, SAME MODEL

Swap fp16 → int8/int4 for weights (and optionally activations). The two regimes (PTQ and QAT), plus the common LLM-specific variants:

Technique | When to use | Cost / note
PTQ (Post-Training Quantization) | Fast deployment, no training pipeline, small accuracy drop OK | ~hours, small calibration set
QAT (Quantization-Aware Training) | INT4 or tiny models where accuracy is critical | Full fine-tune required
GPTQ / AWQ (weight-only INT4) | LLM decode: memory- and bandwidth-bound single-request latency | Cheap calibration
SmoothQuant (W8A8) | LLM prefill / large-batch throughput via INT8 tensor cores | Migrates outliers into weights
FP8 (E4M3/E5M2) | Hopper/Blackwell GPUs: native tensor-core support, handles outliers better than INT8 without calibration tricks | Needs FP8-capable hardware

Key debug move: INT8 accuracy drop? Profile per-layer sensitivity, switch to per-channel weight quant, smooth activation outliers (SmoothQuant), or keep first/last layers in fp16 (mixed precision).

$$\hat{w} = \text{round}\!\left(\frac{w}{s}\right) \cdot s, \quad s = \frac{\max|w|}{2^{b-1}-1}$$
scale-and-round: pick scale s from weight range, round to b-bit integer, dequantize for matmul
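The scale-and-round formula translates almost line for line into code. This is a per-tensor, symmetric sketch in plain Python with hypothetical function names; production kernels work per channel or per group and pack the integer codes.

```python
def quantize_symmetric(weights, bits=8):
    """Per-tensor symmetric quantization: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # s = max|w| / (2^(b-1) - 1)
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real weights (w_hat = code * s)."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_symmetric(weights, bits=8)
restored = dequantize(codes, scale)
```

The round-trip error per weight is at most half a step (`scale / 2`), and a single outlier inflates `scale` for the whole tensor, which is why per-channel scales and outlier smoothing help.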
KV-CACHE — THE REAL MEMORY HOG AT SERVING TIME

Cache past keys and values so each decode step doesn't recompute them for the whole prefix; this reduces the attention work per new token from $O(n^2)$ to $O(n)$.

$$\text{KV mem} \approx 2 \times L \times H \times d_h \times S \times B \times \text{dtype\_bytes}$$
2 tensors (K,V) × layers × KV heads × head-dim × seq-len × batch × bytes; grows with context AND batch
PagedAttention: eliminates KV fragmentation; underlies vLLM
GQA / MQA: share KV heads across query heads; slashes KV size at the cost of slight quality; can't bolt on without retraining
KV-cache quant: store cached K/V in INT8/FP8; ~2x KV memory savings vs fp16
Continuous batching: evict finished sequences between decode steps, fill slots immediately; raises throughput, no single-request benefit
FlashAttention: exact (not approximate) IO-aware kernel; tiles computation so full N×N matrix never lands in HBM; O(N) memory, big speed win
Speculative decoding: draft model proposes tokens, target verifies in one pass; identical output distribution; only helps low-batch latency with a well-matched draft
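Plugging numbers into the KV memory formula shows how quickly the cache grows and how much GQA and KV quantization claw back. The configuration below (80 layers, head-dim 128, batch 8, 4096 tokens) is a hypothetical example, not any particular model's spec sheet.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """2 tensors (K and V) x layers x KV heads x head dim x tokens x batch x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes


# Hypothetical 80-layer model with head_dim 128, batch 8, 4096-token context.
mha = kv_cache_bytes(80, 64, 128, 4096, 8)         # one KV head per query head, fp16
gqa = kv_cache_bytes(80, 8, 128, 4096, 8)          # 8 shared KV heads, fp16
gqa_int8 = kv_cache_bytes(80, 8, 128, 4096, 8, 1)  # 8 KV heads, 1-byte entries

if __name__ == "__main__":
    for name, size in [("MHA fp16", mha), ("GQA fp16", gqa), ("GQA int8", gqa_int8)]:
        print(f"{name}: {size / 1e9:.1f} GB")
```

Going from 64 to 8 KV heads cuts the cache 8×, and 1-byte entries halve it again, all before any change to the weights.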
DISTILLATION — SHRINK THE ARCHITECTURE ITSELF

Train a small student on the soft logits (not hard labels) of a big teacher. Soft targets carry richer signal: the teacher's uncertainty propagates cross-class knowledge.

$$\mathcal{L}_{\text{KD}} = \alpha \cdot T^2 \cdot \text{KL}(\sigma(z_T/T)\,\|\,\sigma(z_S/T)) + (1-\alpha)\cdot \mathcal{L}_{\text{CE}}$$
temperature T softens distributions; α balances distillation vs hard-label loss

Use when you want a genuinely different (smaller/faster) architecture and have training data + a strong teacher. Requires retraining — unlike PTQ. Combine: distill first, then quantize the student (they're complementary, not mutually exclusive).
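The distillation loss above, written out for a single example in plain Python. The function names and the default `T=2.0`, `alpha=0.5` are illustrative assumptions; a real training loop would compute this over batches of tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])  # hard-label term at T = 1
    return alpha * T * T * kl + (1.0 - alpha) * ce
```

If the student already matches the teacher, the KL term vanishes and only the weighted hard-label loss remains; the T² factor keeps the soft-target gradients on a comparable scale as T changes.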

PRUNING — CUT WEIGHTS OR STRUCTURE
Type | What's removed | Real speedup?
Unstructured | Individual near-zero weights (sparse) | No; needs sparse kernels/hardware (NVIDIA 2:4 semi-structured is the exception)
Structured | Whole channels, attention heads, layers | Yes; produces a smaller dense model, faster everywhere

Gotcha: "90% sparsity" ≠ 10x speedup on a standard GPU. Structured pruning is the practical choice for deployment.
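For a feel of what the 2:4 semi-structured pattern means, here is a sketch that keeps the two largest-magnitude weights in every contiguous group of four and zeroes the rest. Magnitude is just one simple selection criterion (an assumption here), and the code only produces the pattern: any actual speedup depends on hardware and kernels that exploit it.

```python
def prune_2_of_4(weights):
    """Zero the 2 smallest-magnitude values in every contiguous group of 4."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
sparse_row = prune_2_of_4(row)
```

The result is exactly 50% sparse with a regular layout, which is what lets supporting hardware skip the zeros; arbitrary 90% unstructured sparsity has no such regularity.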

BANDWIDTH VS COMPUTE — PICK THE RIGHT LEVER
Phase | Bottleneck | Best lever
Prefill (prompt processing) | Compute-bound (large parallel matmuls) | FlashAttention, FP8/INT8 tensor cores, raw FLOPs
Decode (single-token generation) | Memory-bandwidth-bound (reload all weights per token) | Weight quantization, larger batch, GQA/MQA, continuous batching
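A back-of-envelope bound makes the decode row concrete: if every step must stream all weights from memory once, step time is at least weight bytes divided by memory bandwidth. The sketch ignores KV-cache reads and compute, and the 7B-parameter model and 1 TB/s bandwidth used below are hypothetical round numbers.

```python
def decode_step_seconds_lower_bound(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Time to stream all weights once; ignores KV-cache reads and compute."""
    return n_params * bytes_per_param / mem_bandwidth_bytes_per_s


def max_tokens_per_second(n_params, bytes_per_param, mem_bandwidth_bytes_per_s, batch=1):
    """Upper bound on decode throughput while weight streaming dominates.

    One pass over the weights serves every sequence in the batch, so the bound
    scales with batch until compute or KV traffic takes over (not modeled).
    """
    step = decode_step_seconds_lower_bound(n_params, bytes_per_param, mem_bandwidth_bytes_per_s)
    return batch / step


fp16_rate = max_tokens_per_second(7e9, 2.0, 1e12)          # 2 bytes per weight
int4_rate = max_tokens_per_second(7e9, 0.5, 1e12)          # 0.5 bytes per weight
batched_rate = max_tokens_per_second(7e9, 2.0, 1e12, batch=8)
```

Under these assumptions INT4 weights raise the ceiling 4× and a batch of 8 raises it 8×, which is why weight quantization and batching are the decode levers while extra FLOPs are not.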

Parallelism for huge models: Tensor parallelism (split each layer's matmuls across GPUs, all-reduce per layer): use within a node on NVLink; fast but communication-heavy. Pipeline parallelism (split the layer stack into sequential stages): use across nodes; lower per-step communication, at the price of pipeline bubbles and higher latency.

⚠ Clears up — "any quantization speeds up inference" Weight-only INT4 (GPTQ/AWQ) helps decode latency by reducing bytes moved for weights — but the matmul itself still runs in higher precision, so it adds dequantization overhead and may not help (or can hurt) compute-bound prefill or large-batch regimes. True arithmetic speedup requires weight+activation quant (INT8 tensor cores, FP8) and a hardware path that supports it.
◆ Interview probe Q: Your 70B model OOMs at batch=8, sequence=4096, yet the weights alone fit fine. What's happening and how do you fix it? → A: The KV cache is the culprit: $2 \times L \times H \times d_h \times 4096 \times 8 \times \text{dtype\_bytes}$ bytes sits on top of the weights and grows with context × batch (roughly 86 GB in fp16 for a hypothetical config with 80 layers, 64 KV heads, head-dim 128). Fix: enable GQA/MQA (or use a model already trained with it), quantize the KV cache to INT8/FP8, use PagedAttention to kill fragmentation, and/or reduce context via a sliding window. Continuously batching smaller requests also helps if you're mixing long and short sequences.
Remember   Match the lever to the constraint: weight-only INT4 for decode bandwidth, W8A8/FP8 for compute throughput, structured pruning for real speedup, distillation for a new architecture — and watch the KV cache, not just the weights, when you size GPU memory.
Tricky interview questions 12
When would you choose PTQ over QAT, and vice versa?
Use PTQ when you need fast deployment, have no training pipeline access, and can tolerate a small accuracy drop — it calibrates scales on a small data sample in hours. Use QAT when you're pushing INT4 or a small model where accuracy is critical and you can afford fine-tuning, because weights adapt to quantization noise during training. Trap: saying QAT is always better — for most large pretrained LLMs, PTQ methods like GPTQ/AWQ are the practical default because the training pipeline is unavailable.
Your INT8 model lost significant accuracy after PTQ. How do you debug and recover it?
Profile per-layer sensitivity, keep the most sensitive layers (first/last, LayerNorm, attention output projections) in fp16, switch from per-tensor to per-channel weight quantization, use representative calibration data, and handle activation outliers via SmoothQuant or fallback to QAT. Trap: blaming the data or just lowering precision further — activation outliers in a few channels are almost always the culprit, and per-tensor quantization amplifies them.
What's the difference between weight-only and weight+activation quantization, and when do you pick each?
Weight-only (INT4 GPTQ/AWQ) shrinks the model and reduces bandwidth for the memory-bound decode phase, but the matmul still runs in higher precision. Weight+activation quant (INT8/FP8) actually engages integer tensor cores and speeds up compute-bound prefill and large-batch throughput. Trap: assuming any quantization speeds up compute — weight-only INT4 helps low-batch latency via bandwidth reduction, not arithmetic, and can add dequant overhead in compute-heavy regimes.
Explain GPTQ vs AWQ vs SmoothQuant and when you'd reach for each.
GPTQ does layer-wise second-order error-compensating weight quantization — great for INT4 weight-only. AWQ protects the most salient weight channels based on activation magnitude — robust INT4 with cheap calibration. SmoothQuant migrates activation outliers into weights to enable W8A8 (INT8 weight+activation) for throughput. Trap: treating them as interchangeable — GPTQ/AWQ target latency/memory wins; SmoothQuant targets compute throughput via tensor cores.
What is the KV cache, why does it dominate memory at serving time, and how do you reduce it?
The KV cache stores past keys and values so each decode step skips recomputing them for earlier tokens; its size scales as 2 × layers × KV heads × head-dim × seq-len × batch × dtype-bytes, so it can rival or exceed weight memory at long context or high concurrency. Reduce it with GQA/MQA (fewer KV heads), KV-cache quantization (INT8/FP8), PagedAttention to eliminate fragmentation, and shorter context. Trap: sizing GPU memory only for weights; at long context and high concurrency the KV cache, not the weights, is often the binding constraint.
Why is LLM decoding memory-bandwidth bound while prefill is compute bound, and what optimization follows?
Prefill processes all prompt tokens in parallel (large matmuls, high arithmetic intensity = compute bound). Decode generates one token at a time and reloads all weights for tiny compute (low arithmetic intensity = bandwidth bound). So batching and weight quantization help decode most; prefill benefits from FlashAttention and raw FLOPs. Trap: adding more FLOPs to speed up decode — it's bandwidth-limited, so you win by batching more requests and reducing bytes moved.
How does continuous batching differ from static batching, and why does it help?
Static batching waits for a full batch and runs until the slowest sequence finishes, wasting GPU on padding and idle slots. Continuous batching adds/evicts requests at the token-step level so finished sequences free slots immediately for new ones, raising throughput and lowering latency under load. Trap: thinking it improves single-request latency — with one request it does nothing; gains are purely from concurrency, and it needs careful KV memory management.
Structured vs unstructured pruning — which gives real inference speedups and why?
Structured pruning removes whole channels, attention heads, or layers, producing a smaller dense model that runs faster on any standard hardware. Unstructured pruning zeros individual weights and yields high sparsity but needs specialized sparse kernels or hardware (NVIDIA 2:4 semi-structured) — on standard GPUs it gives memory savings but not wall-clock speedup. Trap: reporting "90% sparsity" as a 10x speedup — without sparse hardware support it almost never is.
What does FlashAttention optimize, and is it an approximation?
FlashAttention is an exact, IO-aware attention kernel that tiles computation using online softmax so the full N×N attention matrix never materializes in HBM, cutting memory to O(N) and reducing HBM reads/writes dramatically. It is mathematically identical to standard attention — not an approximation. Trap: calling it approximate or sparse attention — it changes memory access patterns, not the math.
How does speculative decoding speed up generation, and when does it fail to help?
A small fast draft model proposes several tokens that the large target model verifies in one parallel forward pass, accepting the longest matching prefix — yielding multiple tokens per expensive step with identical output distribution. It helps when you're latency-bound (small batch) and the draft acceptance rate is high; it backfires at large batch (target already saturated) or with a poorly aligned draft. Trap: claiming it changes output quality — done correctly it is output-equivalent to the target model.
When should you use tensor parallelism vs pipeline parallelism for a model too large for one GPU?
Use tensor parallelism (split each layer across GPUs) within a node on fast NVLink for low latency — it parallelizes every layer but needs an all-reduce per layer. Use pipeline parallelism (split layers into stages) across slower inter-node links for very large models, accepting pipeline bubbles and higher latency for less per-step communication. Trap: applying tensor parallelism across nodes — its per-layer all-reduce saturates slow inter-node bandwidth and tanks performance.
When would you choose distillation over quantization or pruning to hit a latency target?
Choose distillation when you want a fundamentally smaller or faster architecture and have training data plus a strong teacher — it can change the model's structure and recover accuracy that quantization or pruning can't. Use quantization for the cheapest retraining-free win; they're complementary (distill, then quantize the student). Trap: treating them as mutually exclusive — in practice you stack them, and distillation's unique cost is it requires retraining and labeled/unlabeled data unlike PTQ.
20
PART IV · PRACTICE

DL interview rapid-fire & gotchas

🎯If an answer isn't instant, that's the chapter to reread.

The lightning round. If any answer below doesn't come out in one breath, that's the chapter to reread — this page is a recognition drill, not new material.

THE ONE-BREATH ANSWERS
No activation? stacked linear layers collapse to one linear map; depth wasted.
ReLU > sigmoid? gradient is 1 (not ≤¼), no positive saturation, cheap; risk = dead units.
Vanishing vs exploding: product of Jacobian factors <1 shrinks / >1 blows up; fix: ReLU, residuals, norm, clipping.
BN train vs eval: batch stats in train, frozen running stats at eval; forgetting eval() corrupts accuracy.
BN vs LN: BN normalizes down the batch (CNNs); LN across features per sample (Transformers, any batch size).
Why residuals? identity highway for gradients; learn the change F(x), so depth stops hurting.
He vs Xavier: He (2/n) for ReLU, Xavier (1/n) for tanh/sigmoid; never init all-equal.
CE not MSE: cross-entropy is the classification MLE, convex in the logits, strong gradient when confidently wrong.
Dropout at test? off; use all units (inverted dropout already scaled during training).
AdamW vs Adam: AdamW decouples weight decay; L2-in-Adam is not true weight decay.
Why warmup? random early weights + large adaptive steps diverge; ramp LR up first.
Grad accumulation: sum grads over N micro-batches, step once → big-batch math on small memory.
√d in attention: keeps QKᵀ dot products from saturating softmax into one-hot.
CNN edge: weight sharing + locality → translation equivariance, few params.
LSTM gates: forget/input/output gates + additive cell state carry gradients across time.
LoRA: freeze W, learn low-rank ΔW = BA → <1% params, no inference cost once merged.
Debug first move: overfit a single batch; if you can't, it's a bug, not the model.
fp16 needs… loss scaling (underflow); bf16 doesn't (wider exponent).
⚠ Clears up — the four most-confused pairs BN vs LN (batch axis vs feature axis), Adam vs AdamW (coupled vs decoupled decay), vanishing vs exploding (shrink vs blow up), and parameters vs FLOPs (MoE adds params at ~constant compute/token). If you can state the distinction in each pair, you're interview-ready.
◆ Interview probe Q: Your model trains fine but is terrible at inference — what's your first guess? → A: You left it in train mode: BatchNorm is using batch stats and Dropout is still on. Call model.eval() (and torch.no_grad()). Next suspects: train/serve preprocessing mismatch, or a data leak that inflated the offline metric.
Remember   If an answer isn't instant, that's the chapter to reread — fluency here is the whole point.
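The train-versus-eval switch behind the interview probe is easy to see in a toy inverted dropout. This is a plain-Python sketch with a hypothetical `dropout` helper, not a framework API: kept units are scaled by 1/(1-p) during training so that nothing needs rescaling at eval.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: scale kept units by 1/(1-p) in training, identity at eval.

    Assumes 0 <= p < 1. rng is anything with a random() method, so a seeded
    random.Random can be passed for reproducibility.
    """
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]


activations = [1.0] * 1000
train_out = dropout(activations, p=0.5, training=True, rng=random.Random(0))
eval_out = dropout(activations, p=0.5, training=False)
```

In training mode roughly half the units are zeroed and the rest doubled, so the expected activation is unchanged; in eval mode the input passes through untouched, which is the behavior a forgotten eval() call silently loses.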
Tricky interview questions 12
Gradients are exploding to NaN a few hundred steps in. What's your fix list?
Add gradient clipping by global norm (e.g. 1.0), lower the learning rate, add/verify normalization, and check for log(0) or divide-by-zero in the loss; if using fp16, verify loss scaling. Trap: clipping by value instead of by norm — it distorts the gradient direction. Clip by norm to preserve direction.
You switch from batch size 32 to 1024 and accuracy drops. Why, and what do you do?
Large batches take fewer, less-noisy steps and can land in sharper minima; the noise that aided generalization is gone. Scale the LR up (roughly linearly), add warmup, and consider more epochs. Trap: keeping the same LR — a bigger batch needs a bigger step or it underfits.
Why is a plain L2 penalty inside Adam not the same as weight decay?
Adam divides every gradient (including the L2 term) by the per-parameter RMS, so large-gradient weights get decayed less — the decay becomes uneven. AdamW applies the decay directly to the weights, decoupled from the adaptive step. Trap: assuming "add L2 to the loss" gives correct regularization under Adam — use AdamW.
Self-attention is O(n²) in sequence length. How do people make long context feasible?
FlashAttention (IO-aware exact attention, no n² memory), sparse/sliding-window attention, linear-attention approximations, and a KV-cache so decoding is O(n) per new token. Trap: saying FlashAttention changes the O(n²) compute — it cuts memory to O(n) and is faster, but the compute is still quadratic (exact).
Why do we scale by √d_k inside softmax attention?
Dot products of d_k-dim vectors have variance that grows with d_k (typical magnitude ~√d_k); large logits push softmax toward a one-hot with near-zero gradient. Dividing by √d_k keeps variance ~1 so gradients stay healthy. Trap: calling it a normalization of the output; it's about keeping the softmax in its sensitive range.
When would you pick SGD+momentum over AdamW?
Vision/CNN/ResNet training where final generalization matters — well-tuned SGD often finds flatter minima that generalize better. AdamW wins for Transformers/NLP, sparse gradients, and fast prototyping. Trap: "Adam is always better" — it converges faster and is more robust to LR, but not always better at the final number.
Your validation loss is far above training loss. Concrete fixes in order?
That's overfitting (high variance): get more/augmented data first, then add regularization (dropout, weight decay, early stopping), then reduce capacity. Trap: reaching for a bigger model or more features — that worsens variance; those fix the opposite problem (high bias).
How do you fit a model that's just barely too big for your GPU?
Gradient accumulation with a smaller micro-batch (large effective batch on small memory), mixed precision (bf16 roughly halves activation memory), and gradient/activation checkpointing (recompute in backward). Trap: forgetting to divide the accumulated loss by N (or not zeroing grads at the right time); it rescales your effective LR.
Why does adding more layers sometimes increase training error (not just test error)?
The degradation problem: very deep plain nets are hard to optimize because layers can't easily learn the identity, so even training error rises. Residual connections make identity the default and fix it. Trap: calling it overfitting — overfitting raises test error while training error falls; here training error itself goes up.
What exactly breaks if you forget model.eval() at inference?
BatchNorm keeps using the current batch's statistics instead of the frozen running stats (so single-sample or shifted-distribution inference is wrong), and Dropout stays active, randomly zeroing units. Trap: thinking it only affects Dropout — the BatchNorm stat switch is the bigger and subtler accuracy killer.
Cross-entropy vs MSE for classification — why not MSE?
With a sigmoid/softmax output, MSE is non-convex in the logits and its gradient vanishes when the model is confidently wrong, so learning stalls; cross-entropy (the Bernoulli/categorical MLE) is convex in the logits and keeps a strong gradient on confident mistakes. Trap: "MSE just trains slower"; it can actively get stuck, not merely lag.
What's the single best first step when a network won't learn at all?
Try to overfit one tiny batch. If you can drive its loss to ~0, the pipeline works and you have a capacity/regularization/LR issue; if you can't, there's a bug in the data, loss, or gradient wiring. Trap: tuning hyperparameters before this check — you'd be tuning around a bug.
That's the toolkit
Building blocks → training mechanics → architectures → scale & deployment, each page practical-first with a hook, a figure, the formulas, the gotcha, and a 10-question bank. Revise with the flashcards.