Distributed training
DDP fits a 7B model on one node. Past 70B you need ZeRO/FSDP. Past 100B you need 3D parallelism. Past 405B you need failure-handling for hardware that breaks every few hours. This page is the full memory-bandwidth-bubble triangulation a Sr Staff candidate is expected to do live.
What you'll learn
- The memory budget — why DDP doesn't fit at 70B+
- ZeRO & FSDP — sharding optimizer state, grads, params
- Tensor parallel (Megatron) — splitting a matmul across GPUs
- Pipeline parallel — stages, microbatches, the bubble
- Sequence & context parallel — when sequences are too long
- Expert parallelism for MoE — and the all-to-all bottleneck
- Mixed precision — FP16 / BF16 / FP8 (the H100 era)
- Activation checkpointing & gradient accumulation
- The collectives toolkit — AllReduce, AllGather, AllToAll
- Hardware reality — NVLink, NVSwitch, IB, NCCL, SHARP
- Checkpointing strategies — sync, async, sharded
- Failures at 10k+ GPU scale — what kills naive runs
Adam mixed-precision needs 16 bytes per parameter of training state (weights, gradients, and optimizer moments) before activations. A 70B model = 1.12 TB of state. No single GPU holds that. Every distributed-training scheme exists to spread those 16 bytes/param across many GPUs without crushing the network.
DDP — the floor everyone starts from
Replicate the model on each GPU, shard the batch. After backward, all-reduce gradients across replicas. Memory-heavy: each GPU holds the full model + grads + optimizer state. Comm cost = one all-reduce per step on the gradient tensor.
DDP is the right answer when the entire training state fits on one GPU with room for activations. With the 16 bytes/param Adam recipe, that means only a few billion parameters on an 80 GB H100; above that, you're allocating optimizer state you can't afford.
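A minimal DDP loop as a sketch, assuming a torchrun launch (`torchrun --nproc_per_node=8`); the linear layer and random batch are placeholders for a real model and dataloader:

```python
# Minimal DDP sketch: replicate the model on every rank, shard the batch,
# let DDP all-reduce gradients during backward.
# Launch with: torchrun --nproc_per_node=8 ddp_min.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model -- stands in for the full training state.
    model = torch.nn.Linear(4096, 4096).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")   # this rank's shard of the global batch
        loss = model(x).pow(2).mean()
        loss.backward()                           # DDP overlaps the gradient all-reduce here
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```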
The 16 bytes/param breakdown
For Adam mixed-precision (the dominant LLM recipe), the per-parameter memory accounting from the ZeRO paper (Table 1):
Where the bytes go in Adam mixed-precision
| State | Bytes / param | Why |
|---|---|---|
| FP32 master weights | 4 | Source of truth for the optimizer step |
| FP32 Adam m (1st moment) | 4 | EMA of gradient |
| FP32 Adam v (2nd moment) | 4 | EMA of squared gradient |
| BF16 weights | 2 | Used in forward/backward |
| BF16 grads | 2 | Output of backward |
| Total | 16 | Per parameter, before activations |
That ignores activations and KV cache. For a 70B model: 70e9 × 16 = 1.12 TB. An H100 has 80 GB. Even a B200 has 192 GB. You cannot fit a 70B model on one GPU with Adam. Distributed training is not a choice; it is a memory ultimatum.
State to fit per replica = 1.12 TB. Per-GPU budget on an 8×H100 node = 8 × 80 GB = 640 GB total HBM.
DDP: each GPU holds the full 1.12 TB. Won't fit. Game over.
ZeRO-3 (state sharded across 8 ranks): per-GPU state = 1.12 TB / 8 = 140 GB. Still won't fit: that already exceeds the 80 GB of HBM before any activations or KV.
ZeRO-3 + optimizer-state offload (ZeRO-Offload) + activation checkpointing: the 12 bytes/param of optimizer state moves to CPU RAM, leaving ~35 GB of sharded BF16 weights + grads per GPU; with checkpointed activations the footprint sits around ~50–60 GB and fits with room for a moderate batch.
Lesson: at the 70B/8-GPU edge, you are sharding state, offloading what still doesn't fit, and trading FLOPs for activation memory.
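The same accounting as a quick back-of-envelope script (the 16 bytes/param recipe from the table above; the offload variant mirrors the recipe just described):

```python
# Back-of-envelope "will it fit?" check using the 16 bytes/param Adam recipe.
BYTES_PER_PARAM = {
    "fp32_master": 4, "adam_m": 4, "adam_v": 4,   # optimizer state (12 B)
    "bf16_weights": 2, "bf16_grads": 2,           # model + grads (4 B)
}

def state_gb(n_params: float) -> float:
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

def per_gpu_state_gb(n_params: float, shard_degree: int, offload_optimizer: bool = False) -> float:
    per_param = sum(BYTES_PER_PARAM.values())
    if offload_optimizer:        # ZeRO-Offload: the 12 B/param of optimizer state moves to CPU RAM
        per_param -= 12
    return n_params * per_param / shard_degree / 1e9

if __name__ == "__main__":
    n = 70e9
    print(f"total state: {state_gb(n):.0f} GB")                      # ~1120 GB
    print(f"ZeRO-3 x8:   {per_gpu_state_gb(n, 8):.0f} GB/GPU")       # ~140 GB -> over the 80 GB budget
    print(f"+ offload:   {per_gpu_state_gb(n, 8, True):.0f} GB/GPU") # ~35 GB + activations
```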
- Adam mixed-precision = 16 bytes/param. Memorize this.
- 70B model = 1.12 TB of state. No GPU holds that.
- Distributed training exists to shard those 16 bytes across DP ranks.
- "Will it fit?" is the first question, before any throughput discussion.
ZeRO partitions the 16 bytes/param across DP ranks in three stages. ZeRO-1 shards optimizer state; ZeRO-2 adds gradients; ZeRO-3 adds parameters. Stages 1 and 2 are free (same comm as DDP). Stage 3 costs ~1.5× DDP comm. FSDP is PyTorch's production ZeRO-3.
The three ZeRO stages
From Rajbhandari et al. 2019 (arxiv 1910.02054). ZeRO progressively partitions training state across DP ranks instead of replicating.
| Stage | What's sharded | Comm vs DDP | Per-rank state (N ranks) |
|---|---|---|---|
| ZeRO-1 | Optimizer state only | Identical (1× all-reduce on grads) | 4 + 12/N bytes/param |
| ZeRO-2 | Optimizer state + gradients | Identical (all-reduce → reduce-scatter, same volume) | 2 + 14/N bytes/param |
| ZeRO-3 | Optimizer + grads + parameters | 1.5× DDP (per-layer all-gather × 2 + reduce-scatter) | 16/N bytes/param |
The 1.5× number is from ZeRO paper Table 5 — frequently misquoted as "2×". Forward needs an all-gather on each layer's params (used, then freed). Backward needs another all-gather (re-fetch what was freed) plus a reduce-scatter on grads. Three collective passes vs DDP's one all-reduce, but the reduce-scatter and all-reduce cost the same per byte.
FSDP — PyTorch's production ZeRO-3
FSDP wraps modules; manages all-gather/reduce-scatter automatically per-wrap-unit. FSDP-2 (PyTorch 2.x) introduces per-parameter sharding (rather than per-flat-buffer), which composes cleanly with TP and is more memory-efficient for irregular shapes.
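A sketch using the classic FSDP wrapper API (FSDP-2's per-parameter `fully_shard` is the same idea with a different entry point); the model, sizes, and auto-wrap threshold are placeholders, and it assumes a torchrun launch:

```python
# FSDP (ZeRO-3-style) sketch: params, grads, and optimizer state are sharded across ranks;
# each wrap unit is all-gathered just before use and re-sharded right after.
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000),
    use_orig_params=True,
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # optimizer state is built on the shards

x = torch.randn(8, 4096, device="cuda")
model(x).pow(2).mean().backward()   # per-unit all-gather in fwd/bwd + reduce-scatter of grads
opt.step()
dist.destroy_process_group()
```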
ZeRO-Offload and ZeRO-Infinity
When even ZeRO-3 doesn't fit, push state out of HBM:
- ZeRO-Offload: optimizer state and parts of gradient live in CPU RAM. Optimizer step runs on CPU. Slow but unblocks training a 13B model on a single 24 GB GPU.
- ZeRO-Infinity: extends to NVMe. Used as an absolute last resort; CPU/NVMe bandwidth caps throughput hard.
DDP — when it's right
- Model + state fit on one GPU (sub-10B BF16)
- Want simplest comm pattern (single all-reduce/step)
- Tight latency budget where extra collectives hurt
FSDP / ZeRO-3 — when it's right
- Model state exceeds single-GPU memory
- You can absorb 1.5× DDP comm
- NVLink within node is fast enough for the all-gathers
- ZeRO-1 and ZeRO-2 are "free": same comm as DDP, less memory.
- ZeRO-3 / FSDP costs 1.5× DDP comm — not 2×.
- FSDP-2's per-parameter sharding is now the default in modern PyTorch.
- Offload/Infinity are escape hatches, not strategies.
Shard each weight matrix across GPUs so a single matmul becomes a parallel matmul + all-reduce. Megatron's column-then-row pattern needs exactly one all-reduce per direction per MLP block. Bandwidth-hungry — keep TP within a single NVLink domain.
The column-then-row trick
From Shoeybi et al. 2019 (arxiv 1909.08053). For a two-matmul block Z = GeLU(X · A) · B, choose the shard axes so the GeLU is local:
- Column-parallel A: each GPU holds A_i (a column slice). Computes Y_i = GeLU(X · A_i). No comm — GeLU is elementwise.
- Row-parallel B: each GPU holds B_i (a row slice). Computes partial Z_i. All-reduce (sum) to combine.
Net cost: one all-reduce in forward, one in backward, per MLP block. For attention, shard heads across GPUs (each head is independent), then row-parallel the output projection — same pattern.
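A single-process simulation of the column-then-row sharding: TP ranks become a Python loop, and the final sum over partial outputs stands in for the one all-reduce. Shapes are illustrative:

```python
# Simulate Megatron's column-then-row MLP sharding on one process.
import torch

def gelu(x):
    return torch.nn.functional.gelu(x)

tp = 4
X = torch.randn(16, 512)          # [tokens, hidden]
A = torch.randn(512, 2048)        # up-projection
B = torch.randn(2048, 512)        # down-projection

# Reference: dense computation on one device.
ref = gelu(X @ A) @ B

# Column-parallel A: "rank" i holds A[:, slice i]; GeLU is elementwise, so no comm needed.
# Row-parallel B: "rank" i holds B[slice i, :]; partial outputs are summed (the all-reduce).
A_shards = A.chunk(tp, dim=1)
B_shards = B.chunk(tp, dim=0)
partials = [gelu(X @ A_shards[i]) @ B_shards[i] for i in range(tp)]
Z = sum(partials)                 # <- the one all-reduce per MLP block

print(torch.allclose(Z, ref, rtol=1e-3, atol=1e-3))  # True
```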
Why TP is a within-node story
Each TP all-reduce moves the full hidden-state activation tensor. For a Llama 70B forward pass with hidden=8192, BF16, 4M tokens/step, the per-step volume is gigabytes per layer. NVLink (900 GB/s on H100, 1.8 TB/s on B200) absorbs it. InfiniBand at 400 Gb/s would crush throughput.
Rule of thumb: TP degree ≤ GPUs per NVLink domain. On H100 nodes that's 8. On NVL72 racks it's 72.
Llama 3 405B: TP=8 within an 8×H100 node (NVLink), PP=16 spanning 16 nodes per replica. Picking TP=16 would require all-reduces over IB — moving multi-GB tensors over 400 Gb/s links each layer. Even with SHARP, the latency tax destroys throughput. NVLink is what makes TP cheap.
- MLP block = column-then-row → one all-reduce per direction.
- Attention = head-parallel + row-parallel output proj — same idea.
- TP degree ≤ GPUs per NVLink domain. Cross-node TP is malpractice.
- Sequence parallel (next chapter) cuts TP's activation memory further.
Split layers into stages, each on a different GPU. Microbatches flow stage-to-stage like an assembly line. The "pipeline bubble" is the idle time at fill and drain. More microbatches → smaller bubble. Modern schedules (1F1B, interleaved, zero-bubble) attack the bubble from different angles.
The bubble formula
For S stages and M microbatches:
bubble fraction = (S − 1) / (S − 1 + M)
Want bubble < 5%? M ≈ 20·(S−1). With S=16 stages, you need ~300 microbatches per pipeline flush — which constrains your effective batch size and global batch.
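The same arithmetic as a tiny helper (pure Python, nothing assumed beyond the formula above):

```python
# Pipeline bubble arithmetic: fraction = (S - 1) / (S - 1 + M).
import math

def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (stages - 1 + microbatches)

def microbatches_needed(stages: int, target_bubble: float) -> int:
    # (S-1)/(S-1+M) <= target  =>  M >= (S-1) * (1/target - 1)
    return math.ceil((stages - 1) * (1 / target_bubble - 1))

print(bubble_fraction(16, 300))       # ~0.048
print(microbatches_needed(16, 0.05))  # 285 -> "roughly 300 microbatches per flush"
```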
Schedule evolution
- GPipe: all forwards, then all backwards. Holds activations for the full pipeline depth → activation memory blows up.
- 1F1B (one-forward-one-backward): alternate forward and backward as soon as a microbatch's backward becomes available. Activation memory bounded by stage depth, not pipeline depth. Megatron default.
- Interleaved 1F1B: each stage holds non-contiguous layers (layers 1, 5, 9, 13 on stage 0; 2, 6, 10, 14 on stage 1). More micro-microbatches in flight per macro → smaller bubble.
- Zero Bubble Pipeline (Qi 2023): split backward into "input grad" and "weight grad" — reorder so weight grads fill what would otherwise be bubble slots. Near-zero overhead in idealized cases.
- DualPipe (DeepSeek V3): bidirectional pipeline — microbatches flow forward AND backward through the stages simultaneously, overlapping compute with the all-to-all comm of MoE. Custom kernels (DeepEP).
Why PP costs little bandwidth
Unlike TP, PP only sends the activation tensor between adjacent stages once per microbatch. That's MBs, not GBs. PP comfortably crosses InfiniBand. PP is your across-node parallelism.
- Bubble = (S−1) / (S−1+M). Memorize this.
- 1F1B is the default schedule. Interleaved 1F1B is what Megatron actually runs.
- PP is the across-node parallelism — it sends megabytes, not gigabytes.
- Bubble shrinks with more microbatches; activation checkpointing makes that possible.
"Sequence parallel" extends TP by also sharding the sequence dim through LayerNorm/dropout/residual — saves activation memory cheaply. "Context parallel" goes further: shards the sequence inside attention, enabling million-token training. Two competing approaches: ring attention (latency-bound) and DeepSpeed-Ulysses (bandwidth-bound).
Sequence parallel — the cheap activation-memory win
In TP regions, the hidden state is sharded along the feature dim. But LayerNorm, dropout, and residual add are not TP-parallelized — so each rank holds the full activation through them. SP fixes that by also sharding the sequence dim through those ops, with all-gather and reduce-scatter at the SP/TP boundaries. Net effect: cuts activation memory roughly in proportion to TP degree, with negligible throughput cost.
Context parallel — sharding inside attention
True end-to-end sequence sharding, including across the attention computation itself. This is what unlocks multi-million-token training. Two competing approaches you should be able to distinguish:
Ring Attention (Liu 2023)
- K/V chunks circulate around a ring of GPUs (point-to-point).
- Each GPU computes partial attention against the K/V it currently holds; accumulates via online softmax.
- Latency-bound — P2P send/recv each step.
- Parallelism degree isn't capped by head count, so it's the choice for very long contexts.
- Combines with FlashAttention block-wise softmax.
DeepSpeed-Ulysses
- Two all-to-all collectives swap parallelism dim between sequence (in attention) and head (in MLP).
- Bandwidth-bound — every token crosses the all-to-all twice.
- Better for shorter contexts with many heads.
- Capped by number of attention heads.
USP (Unified Sequence Parallel) combines them: partition the GPU group into a 2D grid, run Ulysses across one axis and ring across the other. State of the art for very long context training in 2025.
Striped Attention: load-balanced ring — causal mask creates triangular work, so naive ring has stragglers; striped reorders chunks so each GPU computes about the same amount.
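A single-process simulation of the accumulation ring attention performs as K/V chunks arrive: the online softmax folds in one chunk at a time and matches full attention at the end. No causal mask, no actual P2P, just the math:

```python
# Simulate ring attention's per-chunk online-softmax accumulation on one process.
import torch

def ring_attention_sim(q, k, v, n_chunks):
    d = q.shape[-1]
    m = torch.full((q.shape[0],), float("-inf"))   # running max
    l = torch.zeros(q.shape[0])                    # running softmax denominator
    acc = torch.zeros_like(q)                      # running numerator (weighted V sum)
    for k_c, v_c in zip(k.chunk(n_chunks), v.chunk(n_chunks)):
        s = q @ k_c.T / d**0.5                     # scores against the chunk this "GPU" holds now
        m_new = torch.maximum(m, s.max(dim=-1).values)
        scale = torch.exp(m - m_new)               # rescale previous accumulator
        p = torch.exp(s - m_new[:, None])
        acc = acc * scale[:, None] + p @ v_c
        l = l * scale + p.sum(dim=-1)
        m = m_new
    return acc / l[:, None]

q, k, v = (torch.randn(32, 64) for _ in range(3))
ref = torch.softmax(q @ k.T / 64**0.5, dim=-1) @ v
out = ring_attention_sim(q, k, v, n_chunks=4)
print(torch.allclose(out, ref, atol=1e-5))  # True
```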
- Sequence parallel = cheap activation-memory win on top of TP.
- Context parallel = ring (latency-bound) vs Ulysses (bandwidth-bound) vs USP (both).
- Ring attention + FlashAttention is what enables million-token training.
- If you're training at >128k tokens, you're using one of these.
Place experts on different GPUs. Each token routes to top-k experts → two all-to-alls per layer (dispatch + combine). At 256 experts on 256 GPUs the all-to-all dominates compute. Mitigations: capacity factor, topology-aware routing, comm/compute overlap (DualPipe).
Why MoE adds a fifth parallelism dim
An MoE layer replaces one dense FFN with N experts (typically 8 to 256+) and a router that picks the top-k for each token. Different experts hold different weights — so different GPUs hold different experts. Tokens must be shipped to the GPU holding their assigned expert, then the result shipped back. That's an all-to-all dispatch and an all-to-all combine, per MoE layer.
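A toy sketch of that routing step and the resulting dispatch volume; the expert count, top-k, and linear router here are illustrative assumptions, not any particular model's config:

```python
# Toy top-k router: which experts each token is sent to, and how many tokens each
# expert's GPU receives in the dispatch all-to-all.
import torch

n_tokens, hidden, n_experts, top_k = 4096, 1024, 8, 2
x = torch.randn(n_tokens, hidden)
router = torch.nn.Linear(hidden, n_experts, bias=False)

gate = router(x).softmax(dim=-1)                      # [tokens, experts]
gate_weights, expert_idx = gate.topk(top_k, dim=-1)   # weights scale each expert's output on combine

# Dispatch volume: every token's hidden state is shipped to top_k expert GPUs and back.
tokens_per_expert = torch.bincount(expert_idx.flatten(), minlength=n_experts)
print(tokens_per_expert.tolist())                     # the imbalance that capacity factors cap
print(f"rows crossing the all-to-all: {n_tokens * top_k} each way")
```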
Mitigations
- Capacity factor: cap how many tokens any single expert can take. Overflow tokens are dropped (or routed to a fallback). Without this, a popular expert serializes the layer.
- Topology-aware routing: prefer experts on GPUs in the same node / NVLink domain. Reduces IB pressure.
- DeepSeek V3 DualPipe: bidirectional pipeline that overlaps the all-to-all with compute on the opposite-direction microbatches. Custom kernels (DeepEP) hide ~all the comm.
- Expert-level shared weights (DeepSeek MoE): always-active "shared" experts handle generic patterns; routed experts specialize.
671B total params, 37B activated per token. 256 routed experts + 1 shared. EP degree = 64 (across nodes). DualPipe + DeepEP overlaps the all-to-all so well that comm overhead is ~zero on H800-class hardware. This is the canonical 2025 MoE-at-scale recipe — reference it explicitly in interviews.
- MoE adds expert parallelism — a fifth dim on top of DP/TP/PP/CP.
- Two all-to-alls per MoE layer: dispatch + combine.
- Capacity factor + topology routing + comm overlap make MoE viable.
- DeepSeek V3's DualPipe is the public 2025 reference recipe.
BF16 is the modern default — same dynamic range as FP32, no loss scaling needed. FP16 needs loss scaling. FP8 (E4M3 forward, E5M2 backward) doubles throughput on H100/Blackwell but requires fine-grained scaling and FP32 master weights. FP4 is experimental on Blackwell.
The format zoo
| Format | Range | Notes |
|---|---|---|
| FP16 | [6e-5, 65504] | Easy overflow on large activations/grads. Loss scaling needed. |
| BF16 | same exponent as FP32, 7 mantissa bits | Wider range; no loss scaling needed. Default for LLMs. |
| FP8 E4M3 | 4-bit exp, 3-bit mantissa | Forward + weights. H100/Blackwell. |
| FP8 E5M2 | 5-bit exp, 2-bit mantissa | Backward (needs wider range). Per-tensor scales. |
| FP4 (Blackwell) | experimental | Microscaling formats (MX). |
FP8 — the H100 unlock
FP8 doubles tensor-core throughput vs BF16 and halves activation memory. The cost is operational complexity: per-tensor or per-tile scaling factors must be tracked, FP32 master weights kept, partial sums accumulated in higher precision.
The canonical 2025 recipe is DeepSeek V3's: per-tile scaling (1×128 for activations, 128×128 for weights), online scale computation, FP32 partial-sum promotion every 128 elements, BF16 fallback for embeddings/output head/normalization. Reference this in interviews — it's the public state of the art.
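A hedged sketch of per-tile scaling in the spirit of that recipe (1×128 activation tiles): it only simulates the quantize/dequantize round trip, assumes PyTorch ≥ 2.1 for `torch.float8_e4m3fn`, and omits weight tiling and the FP32 partial-sum promotion:

```python
# Per-tile FP8 (E4M3) scaling sketch: one scale per 1x128 tile, chosen so the tile's
# absmax maps to E4M3's max representable value (448).
import torch

E4M3_MAX = 448.0

def quantize_per_tile(x: torch.Tensor, tile: int = 128):
    m, k = x.shape                                    # k assumed to be a multiple of tile
    tiles = x.view(m, k // tile, tile)
    scale = tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (tiles / scale).to(torch.float8_e4m3fn)       # scaled values now fit E4M3's range
    return q, scale

def dequantize(q, scale):
    return (q.to(torch.float32) * scale).view(q.shape[0], -1)

x = torch.randn(16, 1024) * 5
q, scale = quantize_per_tile(x)
err = (dequantize(q, scale) - x).abs().max().item()
print(f"max abs error after FP8 round-trip: {err:.4f}")
```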
- BF16 is the default. FP16 only if you're stuck on Volta/older.
- FP8 = E4M3 forward, E5M2 backward. Per-tile scaling is mandatory at scale.
- Critical layers (embedding, output, LN) stay BF16.
- FP8 doubles tensor-core throughput — the dominant 2025 H100 efficiency lever.
Activation checkpointing trades FLOPs for activation memory — recompute during backward instead of storing. Selective recomputation (Megatron) keeps expensive ops, recomputes cheap ones — ~33% extra FLOPs for ~10× memory savings. Gradient accumulation lets you grow effective batch beyond what fits per step.
Activation checkpointing
Save a subset of activations during forward; recompute the rest during backward. Two flavors:
- Full block checkpointing: save only block boundaries, recompute everything inside. Simple. Costs ~33% extra FLOPs.
- Selective recomputation (Korthikanti 2022): keep expensive ops (attention output, matmul outputs), recompute cheap ops (LayerNorm, GeLU, dropout). Megatron's standard. Roughly 10× activation memory reduction at < 5% throughput cost.
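A minimal full-block checkpointing sketch with `torch.utils.checkpoint` (the block structure and sizes are placeholders):

```python
# Full-block activation checkpointing: store only block-boundary activations,
# recompute the inside of each block during backward.
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024))
     for _ in range(4)]
)

def forward(x, use_ckpt: bool):
    for block in blocks:
        # With use_reentrant=False, activations inside the block are discarded after
        # forward and recomputed during backward (~33% extra FLOPs for the re-forward).
        x = checkpoint(block, x, use_reentrant=False) if use_ckpt else block(x)
    return x

x = torch.randn(8, 1024, requires_grad=True)
forward(x, use_ckpt=True).sum().backward()
```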
Gradient accumulation
Effective batch = micro-batch × accumulation steps × DP. Run multiple micro-batches sequentially per optimizer step, summing grads, before the all-reduce. Lets you fit when memory caps per-step batch but you still want a large effective batch (good for stable Adam moments).
You want effective batch = 4M tokens. Per-GPU memory limits you to 2k tokens/microbatch. With 8 GPUs DP and a single accumulation step, that's 16k tokens/step — not 4M.
Solution: 256 accumulation steps per optimizer update. 2k × 8 × 256 = 4.1M tokens/step. The optimizer step and gradient all-reduce fire once per 256 microbatches, so DP comm amortizes to near zero relative to compute.
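A minimal single-process sketch of the accumulation loop; under DDP the non-final microbatches would additionally sit inside `model.no_sync()` so the gradient all-reduce fires only on the last one:

```python
# Gradient accumulation sketch: many microbatches per optimizer step.
import torch

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 256            # 2k tokens/microbatch x 8 DP ranks x 256 ~= 4M tokens/step

for micro in range(accum_steps):
    x = torch.randn(4, 1024)                         # one microbatch
    loss = model(x).pow(2).mean() / accum_steps      # scale so grads average over microbatches
    loss.backward()                                  # grads accumulate in .grad

opt.step()                   # one optimizer update for the whole effective batch
opt.zero_grad(set_to_none=True)
```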
- Selective recomputation is the modern default — keep attention outputs, recompute LN/GeLU.
- Full-block checkpointing is the fallback if memory still doesn't fit.
- Gradient accumulation is how you hit a "4M-token batch" with limited per-step memory.
- These two tricks are what let pipeline parallel push microbatch count high enough to shrink the bubble.
Every distributed-training scheme is a recipe of collective ops. AllReduce dominates DP. AllGather and ReduceScatter are how ZeRO-3 works. AllToAll is the MoE bottleneck. Cost in ring all-reduce is 2(N−1)/N · M per rank — close to 2M for large N.
| Primitive | Result | Cost (ring, M bytes, N ranks) |
|---|---|---|
| AllReduce | every rank ends with sum (or other reduction) | 2(N−1)/N · M per rank |
| AllGather | every rank ends with concatenation | (N−1)/N · M |
| ReduceScatter | reduce + each rank gets a chunk | (N−1)/N · M |
| Broadcast | one rank → all ranks | O(M log N) |
| AllToAll | every rank sends a chunk to every other | (N−1)/N · M. Critical for MoE. |
The identity to memorize: AllReduce = ReduceScatter + AllGather. This is why ZeRO-2 is "free" relative to DDP — replacing one all-reduce with a reduce-scatter (then later an all-gather where the data was needed anyway) keeps total volume identical.
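A tiny single-process simulation of that identity, with each "rank" as a list entry and tensor chunks standing in for the sharded collectives:

```python
# Simulate AllReduce = ReduceScatter + AllGather with plain tensors.
import torch

n_ranks, numel = 4, 16
grads = [torch.randn(numel) for _ in range(n_ranks)]          # each rank's local gradient

# AllReduce: every rank ends with the full sum.
all_reduce = [sum(grads) for _ in range(n_ranks)]

# ReduceScatter: rank r ends with only its chunk of the sum...
chunks = [sum(g.chunk(n_ranks)[r] for g in grads) for r in range(n_ranks)]
# ...AllGather: every rank concatenates all chunks back into the full sum.
rs_then_ag = [torch.cat(chunks) for _ in range(n_ranks)]

print(all(torch.allclose(a, b) for a, b in zip(all_reduce, rs_then_ag)))  # True
```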
- Ring AllReduce: ~2M per rank, regardless of N. Bandwidth-bound.
- AllReduce = ReduceScatter + AllGather (the fundamental identity).
- AllToAll is the MoE / Ulysses primitive — N² messages but linear total volume.
- Tree algorithms beat ring for very small messages (latency-bound regime).
NVLink (within node) and InfiniBand (across node) form a two-tier bandwidth hierarchy. NVSwitch makes 8-GPU nodes look like one big GPU; NVL72 extends that to 72 GPUs in a rack. NCCL is the collectives library that abstracts it all. SHARP does in-network reduction on IB switches.
- NCCL: NVIDIA's collectives library. GPU-direct via NVLink/PCIe/IB. The substrate everything sits on.
- NVLink: GPU-to-GPU direct. H100: 900 GB/s aggregate per GPU. B200: 1.8 TB/s.
- NVSwitch: full-mesh of NVLinks within a node (8 GPUs typical). Every GPU talks to every other at full NVLink bandwidth.
- NVL72 (Blackwell): NVLink Switch extends NVSwitch fabric to 72 GPUs in a single rack. Lets you do TP=72 (vs 8 on H100) or fit much larger TP×PP combos within one rack.
- InfiniBand: across-node interconnect. ConnectX-7 NDR = 400 Gb/s. Topology is fat-tree, often "rail-optimized" (each GPU has its own NIC).
- SHARP: in-network reduction on IB switches. The switch sums fragments and broadcasts the result, eliminating one half of all-reduce traffic. Big win for small messages, less so for huge ones.
The bandwidth cliff
The reason TP stays within node is the bandwidth cliff: NVLink is ~20× the bandwidth of a single IB link. Every parallelism choice is a placement problem on this two-tier hierarchy.
- NVLink within node, IB across node. Two tiers.
- NVSwitch = full mesh inside a node. NVL72 = full mesh inside a rack.
- SHARP halves all-reduce traffic — useful for small messages.
- Rail-optimized fat-tree is the standard cluster topology in 2025.
Sync checkpointing pauses training for 10–30 minutes on a 70B model — unacceptable at scale. Async (NVMe + background upload) cuts pause to under a minute. Sharded (each rank writes its own slice) is mandatory for trillion-param models. Frequency is set by MTBF — every 30 min to a few hours.
- Sync: all ranks pause, write to shared FS. 70B model write ~10–30 min on slow FS. Only acceptable for small jobs.
- Async: dump to local NVMe (< 1 min for 70B), separate process uploads to S3/GCS. Training continues immediately.
- Sharded: each rank writes its own shard. Resume requires same parallelism config — or a re-shard tool. Megatron's dist-ckpt and TorchSnapshot handle resharding.
- In-memory replication: state copied to other nodes for fast recovery without disk. Emerging at frontier labs.
Frequency tuning — MTBF dictates cadence
At 10k+ GPUs, hardware fails every few hours. Checkpoint cadence should be set so the expected lost work per failure is acceptable. The Llama 3 405B writeup checkpoints roughly every 30 minutes to an hour; keeping that failure/recovery machinery healthy dominates the engineering pain budget.
Step N completes. Sharded NVMe dump fires: for a 70B model sharded 8 ways, each rank writes its ~17.5 GB slice of the BF16 weights plus its optimizer-state shard. Training continues with no observable pause. Background process tars and uploads to S3 over the next few minutes. Step N+200 completes before the upload finishes — that's fine. Recovery reads from NVMe if the node survived; from S3 otherwise.
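A minimal sketch of that two-tier flow, assuming a local scratch directory stands in for the NVMe mount and a placeholder uploader stands in for the S3/GCS client:

```python
# Async checkpoint sketch: block training only for the fast local write, then hand the
# durable upload to a background thread.
import os
import threading
import torch

def upload_to_object_store(local_path: str) -> None:   # hypothetical uploader, swap in your client
    ...  # e.g. a multipart upload that finishes minutes after training has moved on

def async_checkpoint(state: dict, step: int, nvme_dir: str = "ckpt_nvme") -> None:
    os.makedirs(nvme_dir, exist_ok=True)
    local_path = os.path.join(nvme_dir, f"step_{step}.pt")
    torch.save(state, local_path)                       # tier 1: fast local write
    threading.Thread(target=upload_to_object_store,     # tier 2: durable copy, off the
                     args=(local_path,), daemon=True).start()  # training critical path

model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters())
async_checkpoint({"model": model.state_dict(), "opt": opt.state_dict(), "step": 100}, step=100)
```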
- Async + sharded + NVMe-tier-1 + S3-tier-2 is the modern recipe.
- Sync to shared FS is fine for <1k-GPU runs, malpractice at 10k+.
- Checkpoint every ~30 min — set by MTBF, not by intuition.
- Resharding tools matter when you change parallelism config mid-run.
At 10k+ GPUs, the failure rate is once every few hours. NCCL hangs, silent data corruption, ECC errors, network stragglers. A 70B run on 100 GPUs is "build it and run it"; a 405B run on 16k GPUs is "build the failure-handling system that happens to also train a model". This chapter is the public failure-handling toolkit from Llama 3 + xAI Colossus writeups.
The failure modes
- NCCL hang: one rank dies mid-collective; the rest spin forever. Without a watchdog, the whole job stalls indefinitely.
- Silent data corruption (SDC): H100 SDC rates are non-trivial at fleet scale. A rank computes the wrong result without any error signal.
- Stragglers: ring all-reduce is bottlenecked by the slowest rank. A degraded NIC or thermally throttled GPU drags the whole job.
- OOM cascades: a single OOM on rank N takes down the whole job, requires restart from checkpoint, possibly with different parallelism config.
- Loss spikes: a bad batch, a numerical instability, or rank corruption causes loss to diverge.
The mitigation toolkit
- NCCL watchdog: monitors collective progress; aborts hung ranks with timeout. Without this, debugging is impossible.
- Heartbeat-based stale rank detection: per-rank heartbeats on a sidecar; missing beat → abort + replace.
- SDC detection: gradient-norm spike checks, hash-based comparison across DP replicas, periodic loss-on-validation-batch sanity checks. A sudden 10× gradient norm with normal-looking loss is a strong SDC signal.
- ECC + scrubbing: in-band ECC on HBM auto-corrects single-bit errors; out-of-band scrubbing reports double-bit errors so you can preemptively replace the GPU.
- Graceful node replacement: hot spares standing by; on failure, restart the affected stage from last checkpoint while other stages continue.
- Async checkpoint to NVMe + lazy upload: NVMe write is < 1 min; S3/GCS upload happens in background.
- Last-known-good restart: if loss diverges, restart from the most recent healthy checkpoint, not just the most recent.
- Network straggler detection: per-link latency monitoring; rebalance job placement when degraded links found.
(1) Autoscaling lag: GPU spin-up takes minutes. If your hot-spare strategy assumes instant replacement, you'll lose hours per failure. Pre-warm spares and keep them in the same NCCL group.
(2) Silent data corruption: an H100 fleet at 10k+ scale will have undetected bit-flips. Loss curves look fine. You only notice via hash-comparison across DP replicas or periodic validation-batch sanity. Naming SDC unprompted is a strong signal in OpenAI/Anthropic loops.
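A sketch of that cross-replica comparison, assuming a torchrun-launched data-parallel job where replicas should hold bit-identical weights after each optimizer step; the fingerprint here is an illustrative stand-in for a real hash:

```python
# Cross-replica SDC check sketch: a fingerprint that drifts on one rank while loss still
# looks normal is the classic silent-data-corruption signature. Call on every rank
# (it is a collective); assumes the NCCL process group is already initialized.
import torch
import torch.distributed as dist

def param_fingerprint(model: torch.nn.Module) -> torch.Tensor:
    # Cheap, order-stable stand-in for a real hash: float64 sum of |weights|.
    total = torch.zeros(1, dtype=torch.float64, device=next(model.parameters()).device)
    for p in model.parameters():
        total += p.detach().double().abs().sum()
    return total

def replicas_in_sync(model: torch.nn.Module) -> bool:
    local = param_fingerprint(model)
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return all(torch.equal(g, gathered[0]) for g in gathered)  # False -> suspect SDC / divergence
```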
- 10k+ GPU runs fail every few hours. Plan for it.
- NCCL watchdog + heartbeats + hot spares are non-negotiable.
- SDC is real; detect via cross-DP hash compare and grad-norm spikes.
- "Last-known-good restart" beats "most-recent restart" when loss diverges.
0 → hero reading path for distributed training
- foundation HF docs — Multi-GPU training
- foundation DeepSpeed tutorials — ZeRO 1/2/3 explained
- foundation PyTorch FSDP docs
- build nanoGPT with DDP — modify to FSDP — modify to TP
- build Run a 7B fine-tune on 8 GPUs with FSDP; profile with PyTorch profiler
- depth ZeRO paper (Rajbhandari 2019)
- depth Megatron-LM (Shoeybi 2019)
- depth Sequence parallelism (Korthikanti 2022)
- depth Ring Attention (Liu 2023)
- depth DeepSpeed-Ulysses
- depth DeepSeek V3 technical report — DualPipe, FP8 recipe
- depth Llama 3 paper — practical 16k-GPU training
- depth Simon Boehm — Data parallelism overview
- depth Horace He — Thonking.ai on PyTorch internals
Distributed training quiz — readiness check
- You have a 70B model on 8 H100s. What's your parallelism plan?
Single node, NVLink. ZeRO-3 / FSDP across 8 GPUs. Per-GPU memory: BF16 weights ~17.5 GB + grads ~17.5 GB sharded, but the FP32 Adam state is still ~105 GB/GPU even sharded, so offload it to CPU (ZeRO-Offload). Add activation checkpointing for moderate batch.
- Now 405B on 16384 H100s — what's the plan?
3D: TP=8 (within node), PP=16 (across 16 nodes per replica), DP=128 (16384/8/16). Microbatches per pipeline = enough to keep bubble < 5%. Selective activation recomputation. ZeRO-1 on top for optimizer-state sharding.
- DDP vs FSDP — when each?
DDP if model + optimizer state fit on one GPU (small models, < ~10B). FSDP otherwise. FSDP costs ~1.5× DDP comm.
- Why does TP need high bandwidth?
Two all-reduces per layer over the full hidden state. For Llama 70B (hidden=8192, bf16, batch=4M tokens/step), each TP all-reduce moves gigabytes per layer. NVLink (900 GB/s) handles this; cross-node IB would crush it.
- What's the pipeline bubble formula?
Bubble fraction = (p − 1) / (p − 1 + m), where p = stages, m = microbatches. Increase m to reduce. Interleaved 1F1B partitions layers non-contiguously to shrink the bubble further.
- Explain MoE all-to-all bottleneck.
Each token routed to top-k experts on different GPUs. Two all-to-alls per layer (dispatch + combine). At 256 experts on 256 GPUs, all-to-all dominates compute. Mitigations: capacity factor, topology-aware routing, comm/compute overlap (DualPipe).
- Bandwidth requirements for Llama 405B training?
TP all-reduces: NVLink (900 GB/s) handles within-node. PP cross-node: only activation tensor between adjacent stages (~MB), low BW. DP gradient all-reduce: 405B params bf16 = 810 GB; over 16k GPUs in fat-tree IB = seconds per step.
- What goes wrong at 10k+ GPU scale that doesn't at 100?
Failures every few hours (NIC, GPU ECC, OOM). Need fast checkpointing + restart, watchdog on NCCL hangs, async retries, hot-swap of failed nodes. Loss-spike detection that pauses training. Network straggler detection. Silent data corruption (SDC) at H100 fleet scale.
- What does sequence parallel shard?
In TP regions, also shard the sequence dim during LN/dropout/residual (where TP doesn't help). All-gather + reduce-scatter on TP/SP boundaries. Cuts activation memory; doesn't change throughput much.
- Ring attention vs DeepSpeed-Ulysses — when each?
Ring: P2P circulation of K/V chunks; latency-bound; great for very long contexts. Ulysses: 2 all-to-alls swap parallelism dim (sequence ↔ head); bandwidth-bound; better for short contexts with many heads. USP combines both via 2D grid.
- 16 bytes/param breakdown?
fp32 master weights (4) + fp32 Adam m (4) + fp32 Adam v (4) + bf16 weights (2) + bf16 grads (2) = 16. ZeRO sharding distributes these across DP ranks.
- Why is FP8 training tricky?
FP8 has only 8 bits — limited dynamic range. Two formats (E4M3 fwd, E5M2 bwd) for different ranges. Need careful per-tensor or per-tile scaling, FP32 master weights, FP32 partial-sum accumulation. DeepSeek V3 fine-grained per-tile scaling is the canonical recipe.
- NCCL hang — diagnose and fix.
Symptom: training stalls; one rank not making progress on a collective. Causes: dead rank (GPU ECC, OOM), network partition, deadlock from mismatched collectives on different ranks. Fix: the NCCL watchdog timeout aborts the job; restart from checkpoint, replace the bad node.
- Why interleaved 1F1B instead of vanilla 1F1B?
Each stage holds non-contiguous layers (e.g., layers 1, 5, 9 on stage 0; 2, 6, 10 on stage 1). More micro-microbatches in flight per macro-microbatch → smaller pipeline bubble. Megatron uses this.
- Difference between gradient accumulation and increasing batch size?
Effective batch = micro-batch × accumulation × DP. Same effective batch, but accumulation lets you fit when memory limits per-step batch. Can be combined with mixed precision and activation checkpointing for further memory reduction.
- What's selective activation recomputation?
Recompute only cheap ops (LN, GELU, dropout) during backward; keep expensive ones (attention, matmul outputs) in memory. Megatron's standard. A few percent extra FLOPs (vs ~33% for full-block recompute), with ~10× memory savings on activations.
- Difference between ZeRO-Infinity, ZeRO-Offload, and ZeRO-3?
ZeRO-3: shard params/grads/optimizer across DP ranks (all in HBM). ZeRO-Offload: offload optimizer state + parts of gradient to CPU RAM. ZeRO-Infinity: extends to NVMe. Used when model truly doesn't fit in HBM; slow due to CPU/NVMe bandwidth.
- NVL72 vs PCIe — what changes?
NVLink Switch (NVL72) gives 72 GPUs per "node" with full NVLink bandwidth. Lets you do TP=72 (vs 8 on H100), or much larger TP × PP combos within one rack. Fewer cross-IB hops needed for many parallelism plans.
- What is SHARP and why does it matter?
NVIDIA's in-network reduction on InfiniBand switches. The switch performs the reduction (sum) and broadcasts the result, eliminating one half of all-reduce comm time. Speeds up small all-reduces; less impact on huge ones.
- Async checkpointing — how does it work?
Dump model state to local NVMe (fast — < 1 min for 70B). Separate process uploads to durable storage (S3/GCS) in background. Training continues. Reduces pause from ~30 min (sync to slow FS) to ~1 min.