Distributed training
DDP fits a 7B model on one node. Past 70B you need ZeRO/FSDP. Past 100B you need 3D parallelism. Past 405B you need failure-handling for hardware that breaks every few hours. This page is the full memory-bandwidth-bubble triangulation a Sr Staff candidate is expected to do live.
What you'll learn
- The memory budget — why DDP doesn't fit at 70B+
- ZeRO & FSDP — sharding optimizer state, grads, params
- Tensor parallel (Megatron) — splitting a matmul across GPUs
- Pipeline parallel — stages, microbatches, the bubble
- Sequence & context parallel — when sequences are too long
- Expert parallelism for MoE — and the all-to-all bottleneck
- Mixed precision — FP16 / BF16 / FP8 (the H100 era)
- Activation checkpointing & gradient accumulation
- The collectives toolkit — AllReduce, AllGather, AllToAll
- Hardware reality — NVLink, NVSwitch, IB, NCCL, SHARP
- Checkpointing strategies — sync, async, sharded
- Failures at 10k+ GPU scale — what kills naive runs
Adam mixed-precision needs 16 bytes per parameter of training state (weights, gradients, and optimizer moments) before activations. A 70B model = 1.12 TB of state. No single GPU holds that. Every distributed-training scheme exists to spread those 16 bytes/param across many GPUs without crushing the network.
DDP — the floor everyone starts from
Replicate the model on each GPU, shard the batch. After backward, all-reduce gradients across replicas. Memory-heavy: each GPU holds the full model + grads + optimizer state. Comm cost = one all-reduce per step on the gradient tensor.
DDP is the right answer when the entire training state fits on one GPU with room for activations. With the 16 bytes/param Adam recipe, that means only a few billion parameters on an 80 GB H100; above that, you're allocating optimizer state you can't afford.
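A minimal DDP loop as a sketch, assuming a torchrun launch (`torchrun --nproc_per_node=8`); the linear layer and random batch are placeholders for a real model and dataloader:

```python
# Minimal DDP sketch: replicate the model on every rank, shard the batch,
# let DDP all-reduce gradients during backward.
# Launch with: torchrun --nproc_per_node=8 ddp_min.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model -- stands in for the full training state.
    model = torch.nn.Linear(4096, 4096).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")   # this rank's shard of the global batch
        loss = model(x).pow(2).mean()
        loss.backward()                           # DDP overlaps the gradient all-reduce here
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```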
The 16 bytes/param breakdown
For Adam mixed-precision (the dominant LLM recipe), the per-parameter memory accounting from the ZeRO paper (Table 1):
Where the bytes go in Adam mixed-precision
| State | Bytes / param | Why |
|---|---|---|
| FP32 master weights | 4 | Source of truth for the optimizer step |
| FP32 Adam m (1st moment) | 4 | EMA of gradient |
| FP32 Adam v (2nd moment) | 4 | EMA of squared gradient |
| BF16 weights | 2 | Used in forward/backward |
| BF16 grads | 2 | Output of backward |
| Total | 16 | Per parameter, before activations |
That ignores activations and KV cache. For a 70B model: 70e9 × 16 = 1.12 TB. An H100 has 80 GB. Even a B200 has 192 GB. You cannot fit a 70B model on one GPU with Adam. Distributed training is not a choice; it is a memory ultimatum.
State to fit per replica = 1.12 TB. Per-GPU budget on an 8×H100 node = 8 × 80 GB = 640 GB total HBM.
DDP: each GPU holds the full 1.12 TB. Won't fit. Game over.
ZeRO-3 (state sharded across 8 ranks): per-GPU state = 1.12 TB / 8 = 140 GB. Still won't fit: that already exceeds the 80 GB of HBM before any activations or KV.
ZeRO-3 + optimizer-state offload (ZeRO-Offload) + activation checkpointing: the 12 bytes/param of optimizer state moves to CPU RAM, leaving ~35 GB of sharded BF16 weights + grads per GPU; with checkpointed activations the footprint sits around ~50–60 GB and fits with room for a moderate batch.
Lesson: at the 70B/8-GPU edge, you are sharding state, offloading what still doesn't fit, and trading FLOPs for activation memory.
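The same accounting as a quick back-of-envelope script (the 16 bytes/param recipe from the table above; the offload variant mirrors the recipe just described):

```python
# Back-of-envelope "will it fit?" check using the 16 bytes/param Adam recipe.
BYTES_PER_PARAM = {
    "fp32_master": 4, "adam_m": 4, "adam_v": 4,   # optimizer state (12 B)
    "bf16_weights": 2, "bf16_grads": 2,           # model + grads (4 B)
}

def state_gb(n_params: float) -> float:
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

def per_gpu_state_gb(n_params: float, shard_degree: int, offload_optimizer: bool = False) -> float:
    per_param = sum(BYTES_PER_PARAM.values())
    if offload_optimizer:        # ZeRO-Offload: the 12 B/param of optimizer state moves to CPU RAM
        per_param -= 12
    return n_params * per_param / shard_degree / 1e9

if __name__ == "__main__":
    n = 70e9
    print(f"total state: {state_gb(n):.0f} GB")                      # ~1120 GB
    print(f"ZeRO-3 x8:   {per_gpu_state_gb(n, 8):.0f} GB/GPU")       # ~140 GB -> over the 80 GB budget
    print(f"+ offload:   {per_gpu_state_gb(n, 8, True):.0f} GB/GPU") # ~35 GB + activations
```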
- Adam mixed-precision = 16 bytes/param. Memorize this.
- 70B model = 1.12 TB of state. No GPU holds that.
- Distributed training exists to shard those 16 bytes across DP ranks.
- "Will it fit?" is the first question, before any throughput discussion.
ZeRO partitions the 16 bytes/param across DP ranks in three stages. ZeRO-1 shards optimizer state; ZeRO-2 adds gradients; ZeRO-3 adds parameters. Stages 1 and 2 are free (same comm as DDP). Stage 3 costs ~1.5× DDP comm. FSDP is PyTorch's production ZeRO-3.
The three ZeRO stages
From Rajbhandari et al. 2019 (arxiv 1910.02054). ZeRO progressively partitions training state across DP ranks instead of replicating.
| Stage | What's sharded | Comm vs DDP | Per-rank state (N ranks) |
|---|---|---|---|
| ZeRO-1 | Optimizer state only | Identical (1× all-reduce on grads) | 4 + 12/N bytes/param |
| ZeRO-2 | Optimizer state + gradients | Identical (all-reduce → reduce-scatter, same volume) | 2 + 14/N bytes/param |
| ZeRO-3 | Optimizer + grads + parameters | 1.5× DDP (per-layer all-gather × 2 + reduce-scatter) | 16/N bytes/param |
The 1.5× number is from ZeRO paper Table 5 — frequently misquoted as "2×". Forward needs an all-gather on each layer's params (used, then freed). Backward needs another all-gather (re-fetch what was freed) plus a reduce-scatter on grads. Three collective passes vs DDP's one all-reduce, but the reduce-scatter and all-reduce cost the same per byte.
FSDP — PyTorch's production ZeRO-3
FSDP wraps modules; manages all-gather/reduce-scatter automatically per-wrap-unit. FSDP-2 (PyTorch 2.x) introduces per-parameter sharding (rather than per-flat-buffer), which composes cleanly with TP and is more memory-efficient for irregular shapes.
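A sketch using the classic FSDP wrapper API (FSDP-2's per-parameter `fully_shard` is the same idea with a different entry point); the model, sizes, and auto-wrap threshold are placeholders, and it assumes a torchrun launch:

```python
# FSDP (ZeRO-3-style) sketch: params, grads, and optimizer state are sharded across ranks;
# each wrap unit is all-gathered just before use and re-sharded right after.
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000),
    use_orig_params=True,
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # optimizer state is built on the shards

x = torch.randn(8, 4096, device="cuda")
model(x).pow(2).mean().backward()   # per-unit all-gather in fwd/bwd + reduce-scatter of grads
opt.step()
dist.destroy_process_group()
```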
ZeRO-Offload and ZeRO-Infinity
When even ZeRO-3 doesn't fit, push state out of HBM:
- ZeRO-Offload: optimizer state and parts of gradient live in CPU RAM. Optimizer step runs on CPU. Slow but unblocks training a 13B model on a single 24 GB GPU.
- ZeRO-Infinity: extends to NVMe. Used as an absolute last resort; CPU/NVMe bandwidth caps throughput hard.
DDP — when it's right
- Model + state fit on one GPU (sub-10B BF16)
- Want simplest comm pattern (single all-reduce/step)
- Tight latency budget where extra collectives hurt
FSDP / ZeRO-3 — when it's right
- Model state exceeds single-GPU memory
- You can absorb 1.5× DDP comm
- NVLink within node is fast enough for the all-gathers
- ZeRO-1 and ZeRO-2 are "free": same comm as DDP, less memory.
- ZeRO-3 / FSDP costs 1.5× DDP comm — not 2×.
- FSDP-2's per-parameter sharding is now the default in modern PyTorch.
- Offload/Infinity are escape hatches, not strategies.
Shard each weight matrix across GPUs so a single matmul becomes a parallel matmul + all-reduce. Megatron's column-then-row pattern needs exactly one all-reduce per direction per MLP block. Bandwidth-hungry — keep TP within a single NVLink domain.
The column-then-row trick
From Shoeybi et al. 2019 (arxiv 1909.08053). For a two-matmul block Z = GeLU(X · A) · B, choose the shard axes so the GeLU is local:
- Column-parallel A: each GPU holds A_i (a column slice). Computes Y_i = GeLU(X · A_i). No comm — GeLU is elementwise.
- Row-parallel B: each GPU holds B_i (a row slice). Computes partial Z_i. All-reduce (sum) to combine.
Net cost: one all-reduce in forward, one in backward, per MLP block. For attention, shard heads across GPUs (each head is independent), then row-parallel the output projection — same pattern.
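A single-process simulation of the column-then-row sharding: TP ranks become a Python loop, and the final sum over partial outputs stands in for the one all-reduce. Shapes are illustrative:

```python
# Simulate Megatron's column-then-row MLP sharding on one process.
import torch

def gelu(x):
    return torch.nn.functional.gelu(x)

tp = 4
X = torch.randn(16, 512)          # [tokens, hidden]
A = torch.randn(512, 2048)        # up-projection
B = torch.randn(2048, 512)        # down-projection

# Reference: dense computation on one device.
ref = gelu(X @ A) @ B

# Column-parallel A: "rank" i holds A[:, slice i]; GeLU is elementwise, so no comm needed.
# Row-parallel B: "rank" i holds B[slice i, :]; partial outputs are summed (the all-reduce).
A_shards = A.chunk(tp, dim=1)
B_shards = B.chunk(tp, dim=0)
partials = [gelu(X @ A_shards[i]) @ B_shards[i] for i in range(tp)]
Z = sum(partials)                 # <- the one all-reduce per MLP block

print(torch.allclose(Z, ref, rtol=1e-3, atol=1e-3))  # True
```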
Why TP is a within-node story
Each TP all-reduce moves the full hidden-state activation tensor. For a Llama 70B forward pass with hidden=8192, BF16, 4M tokens/step, the per-step volume is gigabytes per layer. NVLink (900 GB/s on H100, 1.8 TB/s on B200) absorbs it. InfiniBand at 400 Gb/s would crush throughput.
Rule of thumb: TP degree ≤ GPUs per NVLink domain. On H100 nodes that's 8. On NVL72 racks it's 72.
Llama 3 405B: TP=8 within an 8×H100 node (NVLink), PP=16 spanning 16 nodes per replica. Picking TP=16 would require all-reduces over IB — moving multi-GB tensors over 400 Gb/s links each layer. Even with SHARP, the latency tax destroys throughput. NVLink is what makes TP cheap.
- MLP block = column-then-row → one all-reduce per direction.
- Attention = head-parallel + row-parallel output proj — same idea.
- TP degree ≤ GPUs per NVLink domain. Cross-node TP is malpractice.
- Sequence parallel (next chapter) cuts TP's activation memory further.
Split layers into stages, each on a different GPU. Microbatches flow stage-to-stage like an assembly line. The "pipeline bubble" is the idle time at fill and drain. More microbatches → smaller bubble. Modern schedules (1F1B, interleaved, zero-bubble) attack the bubble from different angles.
The bubble formula
For S stages and M microbatches:
bubble fraction = (S − 1) / (S − 1 + M)
Want bubble < 5%? M ≈ 20·(S−1). With S=16 stages, you need ~300 microbatches per pipeline flush — which constrains your effective batch size and global batch.
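The same arithmetic as a tiny helper (pure Python, nothing assumed beyond the formula above):

```python
# Pipeline bubble arithmetic: fraction = (S - 1) / (S - 1 + M).
import math

def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (stages - 1 + microbatches)

def microbatches_needed(stages: int, target_bubble: float) -> int:
    # (S-1)/(S-1+M) <= target  =>  M >= (S-1) * (1/target - 1)
    return math.ceil((stages - 1) * (1 / target_bubble - 1))

print(bubble_fraction(16, 300))       # ~0.048
print(microbatches_needed(16, 0.05))  # 285 -> "roughly 300 microbatches per flush"
```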
Schedule evolution
- GPipe: all forwards, then all backwards. Holds activations for the full pipeline depth → activation memory blows up.
- 1F1B (one-forward-one-backward): alternate forward and backward as soon as a microbatch's backward becomes available. Activation memory bounded by stage depth, not pipeline depth. Megatron default.
- Interleaved 1F1B: each stage holds non-contiguous layers (layers 1, 5, 9, 13 on stage 0; 2, 6, 10, 14 on stage 1). More micro-microbatches in flight per macro → smaller bubble.
- Zero Bubble Pipeline (Qi 2023): split backward into "input grad" and "weight grad" — reorder so weight grads fill what would otherwise be bubble slots. Near-zero overhead in idealized cases.
- DualPipe (DeepSeek V3): bidirectional pipeline — microbatches flow forward AND backward through the stages simultaneously, overlapping compute with the all-to-all comm of MoE. Custom kernels (DeepEP).
Why PP costs little bandwidth
Unlike TP, PP only sends the activation tensor between adjacent stages once per microbatch. That's MBs, not GBs. PP comfortably crosses InfiniBand. PP is your across-node parallelism.
- Bubble = (S−1) / (S−1+M). Memorize this.
- 1F1B is the default schedule. Interleaved 1F1B is what Megatron actually runs.
- PP is the across-node parallelism — it sends megabytes, not gigabytes.
- Bubble shrinks with more microbatches; activation checkpointing makes that possible.
"Sequence parallel" extends TP by also sharding the sequence dim through LayerNorm/dropout/residual — saves activation memory cheaply. "Context parallel" goes further: shards the sequence inside attention, enabling million-token training. Two competing approaches: ring attention (latency-bound) and DeepSpeed-Ulysses (bandwidth-bound).
Sequence parallel — the cheap activation-memory win
In TP regions, the hidden state is sharded along the feature dim. But LayerNorm, dropout, and residual add are not TP-parallelized — so each rank holds the full activation through them. SP fixes that by also sharding the sequence dim through those ops, with all-gather and reduce-scatter at the SP/TP boundaries. Net effect: cuts activation memory roughly in proportion to TP degree, with negligible throughput cost.
Context parallel — sharding inside attention
True end-to-end sequence sharding, including across the attention computation itself. This is what unlocks multi-million-token training. Two competing approaches you should be able to distinguish:
Ring Attention (Liu 2023)
- K/V chunks circulate around a ring of GPUs (point-to-point).
- Each GPU computes partial attention against the K/V it currently holds; accumulates via online softmax.
- Latency-bound — P2P send/recv each step.
- Parallelism degree isn't capped by head count, so it's the choice for very long contexts.
- Combines with FlashAttention block-wise softmax.
DeepSpeed-Ulysses
- Two all-to-all collectives swap parallelism dim between sequence (in attention) and head (in MLP).
- Bandwidth-bound — every token crosses the all-to-all twice.
- Better for shorter contexts with many heads.
- Capped by number of attention heads.
USP (Unified Sequence Parallel) combines them: partition the GPU group into a 2D grid, run Ulysses across one axis and ring across the other. State of the art for very long context training in 2025.
Striped Attention: load-balanced ring — causal mask creates triangular work, so naive ring has stragglers; striped reorders chunks so each GPU computes about the same amount.
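A single-process simulation of the accumulation ring attention performs as K/V chunks arrive: the online softmax folds in one chunk at a time and matches full attention at the end. No causal mask, no actual P2P, just the math:

```python
# Simulate ring attention's per-chunk online-softmax accumulation on one process.
import torch

def ring_attention_sim(q, k, v, n_chunks):
    d = q.shape[-1]
    m = torch.full((q.shape[0],), float("-inf"))   # running max
    l = torch.zeros(q.shape[0])                    # running softmax denominator
    acc = torch.zeros_like(q)                      # running numerator (weighted V sum)
    for k_c, v_c in zip(k.chunk(n_chunks), v.chunk(n_chunks)):
        s = q @ k_c.T / d**0.5                     # scores against the chunk this "GPU" holds now
        m_new = torch.maximum(m, s.max(dim=-1).values)
        scale = torch.exp(m - m_new)               # rescale previous accumulator
        p = torch.exp(s - m_new[:, None])
        acc = acc * scale[:, None] + p @ v_c
        l = l * scale + p.sum(dim=-1)
        m = m_new
    return acc / l[:, None]

q, k, v = (torch.randn(32, 64) for _ in range(3))
ref = torch.softmax(q @ k.T / 64**0.5, dim=-1) @ v
out = ring_attention_sim(q, k, v, n_chunks=4)
print(torch.allclose(out, ref, atol=1e-5))  # True
```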
- Sequence parallel = cheap activation-memory win on top of TP.
- Context parallel = ring (latency-bound) vs Ulysses (bandwidth-bound) vs USP (both).
- Ring attention + FlashAttention is what enables million-token training.
- If you're training at >128k tokens, you're using one of these.
Place experts on different GPUs. Each token routes to top-k experts → two all-to-alls per layer (dispatch + combine). At 256 experts on 256 GPUs the all-to-all dominates compute. Mitigations: capacity factor, topology-aware routing, comm/compute overlap (DualPipe).
Why MoE adds a fifth parallelism dim
An MoE layer replaces one dense FFN with N experts (typically 8 to 256+) and a router that picks the top-k for each token. Different experts hold different weights — so different GPUs hold different experts. Tokens must be shipped to the GPU holding their assigned expert, then the result shipped back. That's an all-to-all dispatch and an all-to-all combine, per MoE layer.
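A toy sketch of that routing step and the resulting dispatch volume; the expert count, top-k, and linear router here are illustrative assumptions, not any particular model's config:

```python
# Toy top-k router: which experts each token is sent to, and how many tokens each
# expert's GPU receives in the dispatch all-to-all.
import torch

n_tokens, hidden, n_experts, top_k = 4096, 1024, 8, 2
x = torch.randn(n_tokens, hidden)
router = torch.nn.Linear(hidden, n_experts, bias=False)

gate = router(x).softmax(dim=-1)                      # [tokens, experts]
gate_weights, expert_idx = gate.topk(top_k, dim=-1)   # weights scale each expert's output on combine

# Dispatch volume: every token's hidden state is shipped to top_k expert GPUs and back.
tokens_per_expert = torch.bincount(expert_idx.flatten(), minlength=n_experts)
print(tokens_per_expert.tolist())                     # the imbalance that capacity factors cap
print(f"rows crossing the all-to-all: {n_tokens * top_k} each way")
```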
Mitigations
- Capacity factor: cap how many tokens any single expert can take. Overflow tokens are dropped (or routed to a fallback). Without this, a popular expert serializes the layer.
- Topology-aware routing: prefer experts on GPUs in the same node / NVLink domain. Reduces IB pressure.
- DeepSeek V3 DualPipe: bidirectional pipeline that overlaps the all-to-all with compute on the opposite-direction microbatches. Custom kernels (DeepEP) hide ~all the comm.
- Expert-level shared weights (DeepSeek MoE): always-active "shared" experts handle generic patterns; routed experts specialize.
671B total params, 37B activated per token. 256 routed experts + 1 shared. EP degree = 64 (across nodes). DualPipe + DeepEP overlaps the all-to-all so well that comm overhead is ~zero on H800-class hardware. This is the canonical 2025 MoE-at-scale recipe — reference it explicitly in interviews.
- MoE adds expert parallelism — a fifth dim on top of DP/TP/PP/CP.
- Two all-to-alls per MoE layer: dispatch + combine.
- Capacity factor + topology routing + comm overlap make MoE viable.
- DeepSeek V3's DualPipe is the public 2025 reference recipe.
BF16 is the modern default — same dynamic range as FP32, no loss scaling needed. FP16 needs loss scaling. FP8 (E4M3 forward, E5M2 backward) doubles throughput on H100/Blackwell but requires fine-grained scaling and FP32 master weights. FP4 is experimental on Blackwell.
The format zoo
| Format | Range | Notes |
|---|---|---|
| FP16 | [6e-5, 65504] | Easy overflow on large activations/grads. Loss scaling needed. |
| BF16 | same exponent as FP32, 7 mantissa bits | Wider range; no loss scaling needed. Default for LLMs. |
| FP8 E4M3 | 4-bit exp, 3-bit mantissa | Forward + weights. H100/Blackwell. |
| FP8 E5M2 | 5-bit exp, 2-bit mantissa | Backward (needs wider range). Per-tensor scales. |
| FP4 (Blackwell) | experimental | Microscaling formats (MX). |
FP8 — the H100 unlock
FP8 doubles tensor-core throughput vs BF16 and halves activation memory. The cost is operational complexity: per-tensor or per-tile scaling factors must be tracked, FP32 master weights kept, partial sums accumulated in higher precision.
The canonical 2025 recipe is DeepSeek V3's: per-tile scaling (1×128 for activations, 128×128 for weights), online scale computation, FP32 partial-sum promotion every 128 elements, BF16 fallback for embeddings/output head/normalization. Reference this in interviews — it's the public state of the art.
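A hedged sketch of per-tile scaling in the spirit of that recipe (1×128 activation tiles): it only simulates the quantize/dequantize round trip, assumes PyTorch ≥ 2.1 for `torch.float8_e4m3fn`, and omits weight tiling and the FP32 partial-sum promotion:

```python
# Per-tile FP8 (E4M3) scaling sketch: one scale per 1x128 tile, chosen so the tile's
# absmax maps to E4M3's max representable value (448).
import torch

E4M3_MAX = 448.0

def quantize_per_tile(x: torch.Tensor, tile: int = 128):
    m, k = x.shape                                    # k assumed to be a multiple of tile
    tiles = x.view(m, k // tile, tile)
    scale = tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (tiles / scale).to(torch.float8_e4m3fn)       # scaled values now fit E4M3's range
    return q, scale

def dequantize(q, scale):
    return (q.to(torch.float32) * scale).view(q.shape[0], -1)

x = torch.randn(16, 1024) * 5
q, scale = quantize_per_tile(x)
err = (dequantize(q, scale) - x).abs().max().item()
print(f"max abs error after FP8 round-trip: {err:.4f}")
```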
- BF16 is the default. FP16 only if you're stuck on Volta/older.
- FP8 = E4M3 forward, E5M2 backward. Per-tile scaling is mandatory at scale.
- Critical layers (embedding, output, LN) stay BF16.
- FP8 doubles tensor-core throughput — the dominant 2025 H100 efficiency lever.
Activation checkpointing trades FLOPs for activation memory — recompute during backward instead of storing. Selective recomputation (Megatron) keeps expensive ops, recomputes cheap ones — ~33% extra FLOPs for ~10× memory savings. Gradient accumulation lets you grow effective batch beyond what fits per step.
Activation checkpointing
Save a subset of activations during forward; recompute the rest during backward. Two flavors:
- Full block checkpointing: save only block boundaries, recompute everything inside. Simple. Costs ~33% extra FLOPs.
- Selective recomputation (Korthikanti 2022): keep expensive ops (attention output, matmul outputs), recompute cheap ops (LayerNorm, GeLU, dropout). Megatron's standard. Roughly 10× activation memory reduction at < 5% throughput cost.
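A minimal full-block checkpointing sketch with `torch.utils.checkpoint` (the block structure and sizes are placeholders):

```python
# Full-block activation checkpointing: store only block-boundary activations,
# recompute the inside of each block during backward.
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024))
     for _ in range(4)]
)

def forward(x, use_ckpt: bool):
    for block in blocks:
        # With use_reentrant=False, activations inside the block are discarded after
        # forward and recomputed during backward (~33% extra FLOPs for the re-forward).
        x = checkpoint(block, x, use_reentrant=False) if use_ckpt else block(x)
    return x

x = torch.randn(8, 1024, requires_grad=True)
forward(x, use_ckpt=True).sum().backward()
```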
Gradient accumulation
Effective batch = micro-batch × accumulation steps × DP. Run multiple micro-batches sequentially per optimizer step, summing grads, before the all-reduce. Lets you fit when memory caps per-step batch but you still want a large effective batch (good for stable Adam moments).
You want effective batch = 4M tokens. Per-GPU memory limits you to 2k tokens/microbatch. With 8 GPUs DP and a single accumulation step, that's 16k tokens/step — not 4M.
Solution: 256 accumulation steps per optimizer update. 2k × 8 × 256 = 4.1M tokens/step. The optimizer step and gradient all-reduce fire once per 256 microbatches, so DP comm amortizes to near zero relative to compute.
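A minimal single-process sketch of the accumulation loop; under DDP the non-final microbatches would additionally sit inside `model.no_sync()` so the gradient all-reduce fires only on the last one:

```python
# Gradient accumulation sketch: many microbatches per optimizer step.
import torch

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 256            # 2k tokens/microbatch x 8 DP ranks x 256 ~= 4M tokens/step

for micro in range(accum_steps):
    x = torch.randn(4, 1024)                         # one microbatch
    loss = model(x).pow(2).mean() / accum_steps      # scale so grads average over microbatches
    loss.backward()                                  # grads accumulate in .grad

opt.step()                   # one optimizer update for the whole effective batch
opt.zero_grad(set_to_none=True)
```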
- Selective recomputation is the modern default — keep attention outputs, recompute LN/GeLU.
- Full-block checkpointing is the fallback if memory still doesn't fit.
- Gradient accumulation is how you hit a "4M-token batch" with limited per-step memory.
- These two tricks are what let pipeline parallel push microbatch count high enough to shrink the bubble.
Every distributed-training scheme is a recipe of collective ops. AllReduce dominates DP. AllGather and ReduceScatter are how ZeRO-3 works. AllToAll is the MoE bottleneck. Cost in ring all-reduce is 2(N−1)/N · M per rank — close to 2M for large N.
| Primitive | Result | Cost (ring, M bytes, N ranks) |
|---|---|---|
| AllReduce | every rank ends with sum (or other reduction) | 2(N−1)/N · M per rank |
| AllGather | every rank ends with concatenation | (N−1)/N · M |
| ReduceScatter | reduce + each rank gets a chunk | (N−1)/N · M |
| Broadcast | one rank → all ranks | O(M log N) |
| AllToAll | every rank sends a chunk to every other | (N−1)/N · M. Critical for MoE. |
The identity to memorize: AllReduce = ReduceScatter + AllGather. This is why ZeRO-2 is "free" relative to DDP — replacing one all-reduce with a reduce-scatter (then later an all-gather where the data was needed anyway) keeps total volume identical.
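A tiny single-process simulation of that identity, with each "rank" as a list entry and tensor chunks standing in for the sharded collectives:

```python
# Simulate AllReduce = ReduceScatter + AllGather with plain tensors.
import torch

n_ranks, numel = 4, 16
grads = [torch.randn(numel) for _ in range(n_ranks)]          # each rank's local gradient

# AllReduce: every rank ends with the full sum.
all_reduce = [sum(grads) for _ in range(n_ranks)]

# ReduceScatter: rank r ends with only its chunk of the sum...
chunks = [sum(g.chunk(n_ranks)[r] for g in grads) for r in range(n_ranks)]
# ...AllGather: every rank concatenates all chunks back into the full sum.
rs_then_ag = [torch.cat(chunks) for _ in range(n_ranks)]

print(all(torch.allclose(a, b) for a, b in zip(all_reduce, rs_then_ag)))  # True
```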
- Ring AllReduce: ~2M per rank, regardless of N. Bandwidth-bound.
- AllReduce = ReduceScatter + AllGather (the fundamental identity).
- AllToAll is the MoE / Ulysses primitive — N² messages but linear total volume.
- Tree algorithms beat ring for very small messages (latency-bound regime).
NVLink (within node) and InfiniBand (across node) form a two-tier bandwidth hierarchy. NVSwitch makes 8-GPU nodes look like one big GPU; NVL72 extends that to 72 GPUs in a rack. NCCL is the collectives library that abstracts it all. SHARP does in-network reduction on IB switches.
- NCCL: NVIDIA's collectives library. GPU-direct via NVLink/PCIe/IB. The substrate everything sits on.
- NVLink: GPU-to-GPU direct. H100: 900 GB/s aggregate per GPU. B200: 1.8 TB/s.
- NVSwitch: full-mesh of NVLinks within a node (8 GPUs typical). Every GPU talks to every other at full NVLink bandwidth.
- NVL72 (Blackwell): NVLink Switch extends NVSwitch fabric to 72 GPUs in a single rack. Lets you do TP=72 (vs 8 on H100) or fit much larger TP×PP combos within one rack.
- InfiniBand: across-node interconnect. ConnectX-7 NDR = 400 Gb/s. Topology is fat-tree, often "rail-optimized" (each GPU has its own NIC).
- SHARP: in-network reduction on IB switches. The switch sums fragments and broadcasts the result, eliminating one half of all-reduce traffic. Big win for small messages, less so for huge ones.
The bandwidth cliff
The reason TP stays within node is the bandwidth cliff: NVLink is ~20× the bandwidth of a single IB link. Every parallelism choice is a placement problem on this two-tier hierarchy.
- NVLink within node, IB across node. Two tiers.
- NVSwitch = full mesh inside a node. NVL72 = full mesh inside a rack.
- SHARP halves all-reduce traffic — useful for small messages.
- Rail-optimized fat-tree is the standard cluster topology in 2025.
Sync checkpointing pauses training for 10–30 minutes on a 70B model — unacceptable at scale. Async (NVMe + background upload) cuts pause to under a minute. Sharded (each rank writes its own slice) is mandatory for trillion-param models. Frequency is set by MTBF — every 30 min to a few hours.
- Sync: all ranks pause, write to shared FS. 70B model write ~10–30 min on slow FS. Only acceptable for small jobs.
- Async: dump to local NVMe (< 1 min for 70B), separate process uploads to S3/GCS. Training continues immediately.
- Sharded: each rank writes its own shard. Resume requires same parallelism config — or a re-shard tool. Megatron's dist-ckpt and TorchSnapshot handle resharding.
- In-memory replication: state copied to other nodes for fast recovery without disk. Emerging at frontier labs.
Frequency tuning — MTBF dictates cadence
At 10k+ GPUs, hardware fails every few hours. Checkpoint cadence should be set so the expected lost work per failure is acceptable. The Llama 3 405B writeup checkpoints roughly every 30 minutes to an hour; keeping that failure/recovery machinery healthy dominates the engineering pain budget.
Step N completes. Sharded NVMe dump fires: for a 70B model sharded 8 ways, each rank writes its ~17.5 GB slice of the BF16 weights plus its optimizer-state shard. Training continues with no observable pause. Background process tars and uploads to S3 over the next few minutes. Step N+200 completes before the upload finishes — that's fine. Recovery reads from NVMe if the node survived; from S3 otherwise.
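A minimal sketch of that two-tier flow, assuming a local scratch directory stands in for the NVMe mount and a placeholder uploader stands in for the S3/GCS client:

```python
# Async checkpoint sketch: block training only for the fast local write, then hand the
# durable upload to a background thread.
import os
import threading
import torch

def upload_to_object_store(local_path: str) -> None:   # hypothetical uploader, swap in your client
    ...  # e.g. a multipart upload that finishes minutes after training has moved on

def async_checkpoint(state: dict, step: int, nvme_dir: str = "ckpt_nvme") -> None:
    os.makedirs(nvme_dir, exist_ok=True)
    local_path = os.path.join(nvme_dir, f"step_{step}.pt")
    torch.save(state, local_path)                       # tier 1: fast local write
    threading.Thread(target=upload_to_object_store,     # tier 2: durable copy, off the
                     args=(local_path,), daemon=True).start()  # training critical path

model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters())
async_checkpoint({"model": model.state_dict(), "opt": opt.state_dict(), "step": 100}, step=100)
```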
- Async + sharded + NVMe-tier-1 + S3-tier-2 is the modern recipe.
- Sync to shared FS is fine for <1k-GPU runs, malpractice at 10k+.
- Checkpoint every ~30 min — set by MTBF, not by intuition.
- Resharding tools matter when you change parallelism config mid-run.
At 10k+ GPUs, the failure rate is once every few hours. NCCL hangs, silent data corruption, ECC errors, network stragglers. A 70B run on 100 GPUs is "build it and run it"; a 405B run on 16k GPUs is "build the failure-handling system that happens to also train a model". This chapter is the public failure-handling toolkit from Llama 3 + xAI Colossus writeups.
The failure modes
- NCCL hang: one rank dies mid-collective; the rest spin forever. Without a watchdog, the whole job stalls indefinitely.
- Silent data corruption (SDC): H100 SDC rates are non-trivial at fleet scale. A rank computes the wrong result without any error signal.
- Stragglers: ring all-reduce is bottlenecked by the slowest rank. A degraded NIC or thermally throttled GPU drags the whole job.
- OOM cascades: a single OOM on rank N takes down the whole job, requires restart from checkpoint, possibly with different parallelism config.
- Loss spikes: a bad batch, a numerical instability, or rank corruption causes loss to diverge.
The mitigation toolkit
- NCCL watchdog: monitors collective progress; aborts hung ranks with timeout. Without this, debugging is impossible.
- Heartbeat-based stale rank detection: per-rank heartbeats on a sidecar; missing beat → abort + replace.
- SDC detection: gradient-norm spike checks, hash-based comparison across DP replicas, periodic loss-on-validation-batch sanity checks. A sudden 10× gradient norm with normal-looking loss is a strong SDC signal.
- ECC + scrubbing: in-band ECC on HBM auto-corrects single-bit errors; out-of-band scrubbing reports double-bit errors so you can preemptively replace the GPU.
- Graceful node replacement: hot spares standing by; on failure, restart the affected stage from last checkpoint while other stages continue.
- Async checkpoint to NVMe + lazy upload: NVMe write is < 1 min; S3/GCS upload happens in background.
- Last-known-good restart: if loss diverges, restart from the most recent healthy checkpoint, not just the most recent.
- Network straggler detection: per-link latency monitoring; rebalance job placement when degraded links found.
(1) Autoscaling lag: GPU spin-up takes minutes. If your hot-spare strategy assumes instant replacement, you'll lose hours per failure. Pre-warm spares and keep them in the same NCCL group.
(2) Silent data corruption: an H100 fleet at 10k+ scale will have undetected bit-flips. Loss curves look fine. You only notice via hash-comparison across DP replicas or periodic validation-batch sanity. Naming SDC unprompted is a strong signal in OpenAI/Anthropic loops.
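A sketch of that cross-replica comparison, assuming a torchrun-launched data-parallel job where replicas should hold bit-identical weights after each optimizer step; the fingerprint here is an illustrative stand-in for a real hash:

```python
# Cross-replica SDC check sketch: a fingerprint that drifts on one rank while loss still
# looks normal is the classic silent-data-corruption signature. Call on every rank
# (it is a collective); assumes the NCCL process group is already initialized.
import torch
import torch.distributed as dist

def param_fingerprint(model: torch.nn.Module) -> torch.Tensor:
    # Cheap, order-stable stand-in for a real hash: float64 sum of |weights|.
    total = torch.zeros(1, dtype=torch.float64, device=next(model.parameters()).device)
    for p in model.parameters():
        total += p.detach().double().abs().sum()
    return total

def replicas_in_sync(model: torch.nn.Module) -> bool:
    local = param_fingerprint(model)
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return all(torch.equal(g, gathered[0]) for g in gathered)  # False -> suspect SDC / divergence
```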
- 10k+ GPU runs fail every few hours. Plan for it.
- NCCL watchdog + heartbeats + hot spares are non-negotiable.
- SDC is real; detect via cross-DP hash compare and grad-norm spikes.
- "Last-known-good restart" beats "most-recent restart" when loss diverges.
0 → hero reading path for distributed training
- foundation HF docs — Multi-GPU training
- foundation DeepSpeed tutorials — ZeRO 1/2/3 explained
- foundation PyTorch FSDP docs
- build nanoGPT with DDP — modify to FSDP — modify to TP
- build Run a 7B fine-tune on 8 GPUs with FSDP; profile with PyTorch profiler
- depth ZeRO paper (Rajbhandari 2019)
- depth Megatron-LM (Shoeybi 2019)
- depth Sequence parallelism (Korthikanti 2022)
- depth Ring Attention (Liu 2023)
- depth DeepSpeed-Ulysses
- depth DeepSeek V3 technical report — DualPipe, FP8 recipe
- depth Llama 3 paper — practical 16k-GPU training
- depth Simon Boehm — Data parallelism overview
- depth Horace He — Thonking.ai on PyTorch internals
Distributed training quiz — readiness check
- You have a 70B model on 8 H100s. What's your parallelism plan?
Single node, NVLink. ZeRO-3 / FSDP across 8 GPUs. Per-GPU memory: BF16 weights ~17.5 GB + grads ~17.5 GB sharded, but the FP32 Adam state is still ~105 GB/GPU even sharded, so offload it to CPU (ZeRO-Offload). Add activation checkpointing for moderate batch.
- Now 405B on 16384 H100s — what's the plan?
3D: TP=8 (within node), PP=16 (across 16 nodes per replica), DP=128 (16384/8/16). Microbatches per pipeline = enough to keep bubble < 5%. Selective activation recomputation. ZeRO-1 on top for optimizer-state sharding.
- DDP vs FSDP — when each?
DDP if model + optimizer state fit on one GPU (small models, < ~10B). FSDP otherwise. FSDP costs ~1.5× DDP comm.
- Why does TP need high bandwidth?
Two all-reduces per layer over the full hidden state. For Llama 70B (hidden=8192, bf16, batch=4M tokens/step), each TP all-reduce moves gigabytes per layer. NVLink (900 GB/s) handles this; cross-node IB would crush it.
- What's the pipeline bubble formula?
Bubble fraction = (p − 1) / (p − 1 + m), where p = stages, m = microbatches. Increase m to reduce. Interleaved 1F1B partitions layers non-contiguously to shrink the bubble further.
- Explain MoE all-to-all bottleneck.
Each token routed to top-k experts on different GPUs. Two all-to-alls per layer (dispatch + combine). At 256 experts on 256 GPUs, all-to-all dominates compute. Mitigations: capacity factor, topology-aware routing, comm/compute overlap (DualPipe).
- Bandwidth requirements for Llama 405B training?
TP all-reduces: NVLink (900 GB/s) handles within-node. PP cross-node: only activation tensor between adjacent stages (~MB), low BW. DP gradient all-reduce: 405B params bf16 = 810 GB; over 16k GPUs in fat-tree IB = seconds per step.
- What goes wrong at 10k+ GPU scale that doesn't at 100?
Failures every few hours (NIC, GPU ECC, OOM). Need fast checkpointing + restart, watchdog on NCCL hangs, async retries, hot-swap of failed nodes. Loss-spike detection that pauses training. Network straggler detection. Silent data corruption (SDC) at H100 fleet scale.
- What does sequence parallel shard?
In TP regions, also shard the sequence dim during LN/dropout/residual (where TP doesn't help). All-gather + reduce-scatter on TP/SP boundaries. Cuts activation memory; doesn't change throughput much.
- Ring attention vs DeepSpeed-Ulysses — when each?
Ring: P2P circulation of K/V chunks; latency-bound; great for very long contexts. Ulysses: 2 all-to-alls swap parallelism dim (sequence ↔ head); bandwidth-bound; better for short contexts with many heads. USP combines both via 2D grid.
- 16 bytes/param breakdown?
fp32 master weights (4) + fp32 Adam m (4) + fp32 Adam v (4) + bf16 weights (2) + bf16 grads (2) = 16. ZeRO sharding distributes these across DP ranks.
- Why is FP8 training tricky?
FP8 has only 8 bits — limited dynamic range. Two formats (E4M3 fwd, E5M2 bwd) for different ranges. Need careful per-tensor or per-tile scaling, FP32 master weights, FP32 partial-sum accumulation. DeepSeek V3 fine-grained per-tile scaling is the canonical recipe.
- NCCL hang — diagnose and fix.
Symptom: training stalls; one rank not making progress on a collective. Causes: dead rank (GPU ECC, OOM), network partition, deadlock from mismatched collectives on different ranks. Fix: the NCCL watchdog timeout aborts the job; restart from checkpoint, replace the bad node.
- Why interleaved 1F1B instead of vanilla 1F1B?
Each stage holds non-contiguous layers (e.g., layers 1, 5, 9 on stage 0; 2, 6, 10 on stage 1). More micro-microbatches in flight per macro-microbatch → smaller pipeline bubble. Megatron uses this.
- Difference between gradient accumulation and increasing batch size?
Effective batch = micro-batch × accumulation × DP. Same effective batch, but accumulation lets you fit when memory limits per-step batch. Can be combined with mixed precision and activation checkpointing for further memory reduction.
- What's selective activation recomputation?
Recompute only cheap ops (LN, GELU, dropout) during backward; keep expensive ones (attention, matmul outputs) in memory. Megatron's standard. A few percent extra FLOPs (vs ~33% for full-block recompute), with ~10× memory savings on activations.
- Difference between ZeRO-Infinity, ZeRO-Offload, and ZeRO-3?
ZeRO-3: shard params/grads/optimizer across DP ranks (all in HBM). ZeRO-Offload: offload optimizer state + parts of gradient to CPU RAM. ZeRO-Infinity: extends to NVMe. Used when model truly doesn't fit in HBM; slow due to CPU/NVMe bandwidth.
- NVL72 vs PCIe — what changes?
NVLink Switch (NVL72) gives 72 GPUs per "node" with full NVLink bandwidth. Lets you do TP=72 (vs 8 on H100), or much larger TP × PP combos within one rack. Fewer cross-IB hops needed for many parallelism plans.
- What is SHARP and why does it matter?
NVIDIA's in-network reduction on InfiniBand switches. The switch performs the reduction (sum) and broadcasts the result, eliminating one half of all-reduce comm time. Speeds up small all-reduces; less impact on huge ones.
- Async checkpointing — how does it work?
Dump model state to local NVMe (fast — < 1 min for 70B). Separate process uploads to durable storage (S3/GCS) in background. Training continues. Reduces pause from ~30 min (sync to slow FS) to ~1 min.