PERCEPTION PILLAR · GEN MODELS + VLMS + ROBOTS

Multimodal — diffusion, VLMs, world models

Generative video and robotics labs assume you can derive DDPM, justify SigLIP over CLIP, and compare LLaVA-style fusion to Flamingo. This chapter is the working vocabulary for any vision-team loop in 2026.

Read time ~45 min · Asked at World Labs, Black Forest, Luma, Physical Intelligence, Midjourney, Anthropic, OpenAI, Google · Difficulty: Sr/Staff bar
Part I — Diffusion
01
DIFFUSION · FOUNDATIONS

DDPM — forward noising, reverse denoising, ε prediction

TL;DR

Denoising Diffusion Probabilistic Models (Ho 2020) define a fixed forward process that gradually noises data and learn a network to reverse it. The clean parameterization: predict the noise ε and optimize an MSE — the math collapses into a one-line training loss.

Forward (noising) process

Gradually corrupt clean data x_0 with Gaussian noise over T steps:

q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I)

Closed-form for any t: q(x_t | x_0) = N(x_t; √(ᾱ_t) x_0, (1-ᾱ_t) I) where ᾱ_t = Π_{s=1}^{t} (1 − β_s). So you can sample any x_t directly from x_0 via x_t = √(ᾱ_t) x_0 + √(1-ᾱ_t) ε, ε ~ N(0,I).

Reverse (denoising) process

Learn p_θ(x_{t-1} | x_t). Parameterize it as predicting noise:

L_simple = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]

The trained ε_θ can predict the noise component of any noisy x_t. Sample by iteratively denoising from x_T ~ N(0,I).
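
EXAMPLE — DDPM training step (PyTorch sketch)

A minimal sketch of one training step under the definitions above — a hedged illustration, not a reference implementation; the linear β schedule and the eps_model(x_t, t) interface are assumptions.

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)             # linear β schedule (Ho-2020-style)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)    # ᾱ_t = Π_s (1 − β_s)

    def ddpm_loss(eps_model, x0):
        """One step of L_simple: sample t and ε, form x_t in closed form, regress ε."""
        b = x0.shape[0]
        t = torch.randint(0, T, (b,), device=x0.device)
        eps = torch.randn_like(x0)
        a_bar = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps        # q(x_t | x_0) in one line
        return torch.nn.functional.mse_loss(eps_model(x_t, t), eps)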

REMEMBER
  • Forward = closed-form Gaussian; sample any x_t in one line.
  • Train an ε-predictor with MSE — that's the whole loss.
  • Sample = iteratively denoise from pure noise over T steps.
02
DIFFUSION · SAMPLING

DDIM & classifier-free guidance — sampling tricks

TL;DR

DDIM (Song 2020) replaces stochastic sampling with a deterministic ODE — same model, ~50 steps instead of 1000. CFG (Ho 2022) trains one model to be both conditional and unconditional, then composes the two scores at sample time to push the generation harder toward the prompt.

DDIM — deterministic sampling

Replaces stochastic reverse process with a deterministic ODE solver. Same trained model, fewer steps (~50 instead of 1000). Enables interpolation in latent space.

THE INSIGHT

Classifier-Free Guidance — score arithmetic at sample time

Train one model to be both conditional on c and unconditional (drop c with prob 0.1 during training). At sampling, compose:

ε̃_θ(x, c) = ε_θ(x, ∅) + w · (ε_θ(x, c) − ε_θ(x, ∅))

w > 1 sharpens conditioning at the cost of diversity. w ≈ 7.5 typical for Stable Diffusion. Why it works: ε_θ(x, c) − ε_θ(x, ∅) is, up to scale, the implicit classifier gradient ∇_x log p(c | x); amplifying it by w pushes generation toward samples where the condition is more likely.
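
EXAMPLE — classifier-free guidance at sample time (PyTorch sketch)

A hedged sketch of the composition step only; eps_model, cond, and null_cond are illustrative placeholders for a conditional ε-predictor, the prompt embedding, and the "dropped" (empty) conditioning.

    import torch

    @torch.no_grad()
    def cfg_eps(eps_model, x_t, t, cond, null_cond, w=7.5):
        """ε̃ = ε(x, ∅) + w · (ε(x, c) − ε(x, ∅)); w = 7.5 is the typical SD default."""
        eps_uncond = eps_model(x_t, t, null_cond)   # ε_θ(x, ∅)
        eps_cond = eps_model(x_t, t, cond)          # ε_θ(x, c)
        return eps_uncond + w * (eps_cond - eps_uncond)

In practice the two predictions are usually computed in one batched forward pass (concatenate conditional and unconditional inputs), at the cost of doubling compute per step.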

REMEMBER
  • DDIM = deterministic ODE sampler, fewer steps, same model.
  • CFG = train both conditional and unconditional; extrapolate at sample time.
  • Higher CFG scale = more on-prompt, less diverse.
03
DIFFUSION · 2024 WINNER

Flow matching & rectified flow — the 2024 winner

TL;DR

Flow matching (Lipman 2023) trains a velocity field to transport noise to data along an ODE — simpler than DDPM, no stochastic schedule. Rectified flow (Liu 2022) shapes the trajectories to be straight lines, enabling 1–4 step sampling. Together they powered SD3 and Flux.

Flow matching

Train a velocity field v_θ(x, t) that transports samples from a simple distribution to the data distribution along an ODE:

dx/dt = v_θ(x, t)

Loss: regress v_θ(x_t, t) against the target velocity that would move x_t toward x_1 (data) along a chosen probability path. Common choice: linear interpolation x_t = (1-t) x_0 + t x_1, target velocity v* = x_1 − x_0.

Rectified flow

Variant of flow matching that trains the velocity field to produce straight-line trajectories. Why care? Straight trajectories can be sampled in 1–4 steps without quality loss (vs ~50 for DDPM). Used by SD3, Flux.
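
EXAMPLE — flow matching loss + few-step Euler sampling (PyTorch sketch)

A hedged sketch of the rectified-flow recipe described above: linear path x_t = (1−t)·x0 + t·x1, target velocity x1 − x0, then Euler integration of the ODE at sample time. v_model(x, t) is an assumed interface; step counts are illustrative.

    import torch

    def flow_matching_loss(v_model, x1):
        """Regress the velocity field against x1 − x0 along the linear path."""
        x0 = torch.randn_like(x1)                            # noise endpoint
        t = torch.rand(x1.shape[0], device=x1.device)
        t_b = t.view(-1, *([1] * (x1.dim() - 1)))
        x_t = (1 - t_b) * x0 + t_b * x1
        return torch.nn.functional.mse_loss(v_model(x_t, t), x1 - x0)

    @torch.no_grad()
    def sample(v_model, shape, steps=4, device="cpu"):
        """Integrate dx/dt = v_θ(x, t) from noise (t=0) toward data (t=1) in a few steps."""
        x = torch.randn(shape, device=device)
        for i in range(steps):
            t = torch.full((shape[0],), i / steps, device=device)
            x = x + v_model(x, t) / steps
        return x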

REMEMBER
  • Flow matching = velocity-field ODE; no stochastic schedule.
  • Rectified flow = straight trajectories → ultra-few-step sampling.
  • SD3, Flux, modern frontier image models use rectified flow.
04
DIFFUSION · ARCHITECTURE

DiT & MM-DiT — transformers eat U-Net

TL;DR

DiT (Peebles & Xie 2022) replaces the U-Net backbone with a transformer over patch tokens, conditioned via adaLN-Zero. It scales like an LLM. MM-DiT extends this with joint attention over text and image tokens — the architecture inside SD3 and Flux.

DiT — Diffusion Transformer

Replaces the U-Net backbone with a transformer operating on patch tokens. Conditioning via adaLN-Zero: the timestep + class embedding regresses per-channel scale, shift, and gate parameters applied around each block's LayerNorm, with gates zero-initialized so every block starts as the identity. Scales like LLMs (predictable loss vs compute).
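
EXAMPLE — adaLN-Zero block (PyTorch sketch)

A minimal sketch of the adaLN-Zero idea under the description above — the single Linear modulation head and layer layout are simplifications of the actual DiT block, and dimensions are illustrative.

    import torch
    import torch.nn as nn

    class AdaLNZeroBlock(nn.Module):
        def __init__(self, dim, n_heads):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.ada = nn.Linear(dim, 6 * dim)     # → shift/scale/gate for attn and MLP branches
            nn.init.zeros_(self.ada.weight)        # zero init ⇒ every block starts as the identity
            nn.init.zeros_(self.ada.bias)

        def forward(self, x, cond):                # x: (B, N, D) patch tokens, cond: (B, D) time+class
            shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
            h = self.norm1(x) * (1 + scale1) + shift1
            x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
            h = self.norm2(x) * (1 + scale2) + shift2
            return x + gate2 * self.mlp(h)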

MM-DiT — joint text + image attention

Used in SD3: joint attention over text + image tokens — text tokens and image patches share the same transformer. Enables tighter prompt-image alignment than cross-attention U-Nets.

REMEMBER
  • DiT = transformer backbone for diffusion; conditioned via adaLN-Zero.
  • MM-DiT = joint attention over text + image tokens.
  • Scales like an LLM — that's the whole point.
05
DIFFUSION · STATE OF THE ART

The modern image-gen stack — SD3, Flux, Sora, Veo

TL;DR

The 2025-26 image-gen frontier is MM-DiT + rectified flow + multiple text encoders (CLIP-L, CLIP-G, T5-XXL). Video generation extends DiT to spacetime patches (Sora). Cascaded approaches (Imagen) still ship but are no longer dominant.

Image generators

Model | Architecture | Training objective
Stable Diffusion 1/2 | U-Net + cross-attention to CLIP text | DDPM ε-prediction
SDXL | U-Net (larger) + 2 text encoders | DDPM
SD3 / SD3.5 | MM-DiT + 3 text encoders (CLIP-L, CLIP-G, T5-XXL) | Rectified flow
Flux (Black Forest Labs) | MM-DiT (12B) + T5-XXL + CLIP-L | Rectified flow
Imagen 3 (Google) | Cascaded text-to-image (low → high res) | DDPM-style
DALL-E 3 / 4o image | Closed; likely DiT + autoregressive elements | Closed

Video generators

EXAMPLE — Sora architecture

DiT over spacetime patches: a video chunk is patched in (t, h, w), each patch becomes a token. The transformer is sequence-length flexible — it handles variable resolution and duration with the same weights. Positional encoding generalizes across resolutions. Demonstrates emergent world-model properties (object permanence, physics — imperfect).
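
EXAMPLE — spacetime patchification (PyTorch sketch)

A toy sketch of turning a video tensor into spacetime-patch tokens as described above. Patch sizes and the raw-pixel input are assumptions — Sora's actual tokenizer (and its latent space) is not public.

    import torch

    def spacetime_patchify(video, pt=2, ph=16, pw=16):
        """(B, C, T, H, W) → (B, num_patches, patch_dim); length varies with res/duration."""
        B, C, T, H, W = video.shape
        x = video.unfold(2, pt, pt).unfold(3, ph, ph).unfold(4, pw, pw)
        x = x.permute(0, 2, 3, 4, 1, 5, 6, 7).reshape(B, -1, C * pt * ph * pw)
        return x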

REMEMBER
  • SD3 / Flux = MM-DiT + rectified flow + 2-3 text encoders.
  • Sora = DiT over spacetime patches; variable resolution + duration.
  • T5-XXL is what makes long, structured prompts work.
Part II — VLMs
06
VLM · FOUNDATIONS

ViT — the patch tokenization that started it all

TL;DR

Vision Transformer (Dosovitskiy 2020) treats an image as a sequence of 16×16 patches, projects each into a token, and feeds them through a standard transformer encoder. No CNN inductive bias — and that's the point: with enough data, attention learns spatial structure.

Patch the image into 16×16 (or 14×14) non-overlapping patches; flatten + linear-project each into a token; add learnable positional encoding; prepend a [CLS] token; standard transformer encoder. The [CLS] token's final state is the image representation.
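
EXAMPLE — ViT patch embedding (PyTorch sketch)

A minimal sketch of the tokenization step just described; the strided conv is the standard trick equivalent to flatten + linear projection. Dimensions are illustrative.

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        def __init__(self, img=224, patch=16, in_ch=3, dim=768):
            super().__init__()
            n = (img // patch) ** 2
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)   # 16×16 patches → tokens
            self.cls = nn.Parameter(torch.zeros(1, 1, dim))                      # learnable [CLS]
            self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))                  # learnable positions

        def forward(self, imgs):                                  # (B, 3, 224, 224)
            x = self.proj(imgs).flatten(2).transpose(1, 2)        # (B, 196, dim)
            cls = self.cls.expand(x.shape[0], -1, -1)
            return torch.cat([cls, x], dim=1) + self.pos          # (B, 197, dim) → transformer encoder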

AnyRes / dynamic resolution — modern VLMs (LLaVA-1.6, GPT-4V, Qwen-VL) handle variable input resolutions by chunking large images into multiple ViT-sized tiles plus a global thumbnail. Trades token budget for resolution.
REMEMBER
  • ViT = patches → tokens → transformer encoder; [CLS] is the image embedding.
  • No CNN bias; needs scale to win.
  • AnyRes = tile-based variable resolution.
07
VLM · CONTRASTIVE

CLIP — contrastive softmax baseline

TL;DR

CLIP (Radford 2021) trains an image and text encoder so matched (image, caption) pairs have high cosine similarity, unmatched pairs low. 400M web pairs, contrastive softmax loss, learnable temperature. Enables zero-shot classification by encoding class names as text.

The contrastive softmax loss

L = −(1/2N) Σ_i [ log(e^{s_ii / τ} / Σ_j e^{s_ij / τ})
                  + log(e^{s_ii / τ} / Σ_j e^{s_ji / τ}) ]

where s_ij = cos(image_emb_i, text_emb_j) and τ is a learned temperature.

Trained on 400M (image, text) pairs scraped from the web. Enables zero-shot classification: encode class names as text, compute similarity, argmax.
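
EXAMPLE — symmetric contrastive loss (PyTorch sketch)

A hedged sketch of the loss above for one batch of N matched pairs; log_temp stands in for CLIP's learnable logit-scale parameter.

    import torch
    import torch.nn.functional as F

    def clip_loss(img_emb, txt_emb, log_temp):
        """Cross-entropy over rows (image→text) and columns (text→image), averaged."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() * log_temp.exp()           # s_ij / τ, shape (N, N)
        targets = torch.arange(img_emb.shape[0], device=img_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))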

REMEMBER
  • CLIP = symmetric contrastive softmax over a batch of N pairs.
  • Big batches needed for many negatives.
  • Zero-shot via encode-class-as-text + argmax cosine.
08
VLM · 2023 UPGRADE

SigLIP — why sigmoid loss scales better

TL;DR

SigLIP (Zhai 2023) replaces CLIP's softmax with a per-pair sigmoid loss. No cross-batch normalization, no all-gather of embeddings across data-parallel ranks, works at small batch sizes. SigLIP-2 is the current SOTA open vision-language encoder.

The sigmoid loss

L = −(1/N²) Σ_{i,j} log σ(z_ij · (s_ij / τ + b))
where z_ij = +1 if i==j else -1
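
EXAMPLE — sigmoid loss (PyTorch sketch)

A hedged sketch of the same loss in PyTorch — log_temp and bias stand in for the paper's learnable temperature and bias. Every (i, j) cell is an independent binary problem, which is exactly why no batch-wide normalization is needed.

    import torch
    import torch.nn.functional as F

    def siglip_loss(img_emb, txt_emb, log_temp, bias):
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() * log_temp.exp() + bias             # s_ij / τ + b
        labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1  # +1 on diag, −1 off
        return -F.logsigmoid(labels * logits).mean()                       # mean over all N² pairs
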
THE INSIGHT

Why per-pair sigmoid scales where softmax doesn't

  • Softmax requires the full N×N similarity matrix to normalize. SigLIP's sigmoid is per-pair → no cross-batch normalization.
  • You can train at tiny batch sizes without quality loss — CLIP needs huge batches for many negatives, which means cross-rank embedding all-gathers.
  • Cleanly composes across data-parallel ranks (no all-gather of embeddings needed for the loss).
  • Better behavior with noisy / partially-matched data.

SigLIP-2 (2024-25) is the current SOTA open vision-language encoder used in many VLM stacks.

CLIP | SigLIP
Softmax over N×N similarity matrix | Per-pair sigmoid; no normalization
Needs huge batches for many negatives | Trains well at small batch sizes
All-gather embeddings across DP ranks | No cross-rank gather needed
Sensitive to noisy pairs | Robust to partial matches
REMEMBER
  • SigLIP loss is per-pair → no cross-rank normalization.
  • Scales better than CLIP at small batches and noisy data.
  • SigLIP-2 = today's open VLM vision encoder default.
09
VLM · ARCHITECTURES

VLM fusion architectures — adapter, cross-attention, native

TL;DR

Three families: (a) project vision into the LLM's token space (LLaVA), (b) interleave cross-attention layers (Flamingo), (c) feed all modalities into one decoder (Gemini, GPT-4o). Training cost rises and fusion strength grows as you move from (a) to (c). Q-Former is the BLIP-2 trick for compressing variable-length vision tokens to a fixed K.

(a) LLaVA — adapter / projection

  • Frozen CLIP/SigLIP → small MLP → LLM
  • Image tokens enter the LLM's embedding space
  • Cheap to train; strong quality
  • Used by LLaVA, MiniGPT-4, IDEFICS variants

(b) Flamingo — cross-attention

  • Vision tokens feed cross-attention layers interleaved in the LLM
  • Vision and text branches stay separate
  • Preserves visual representation
  • Used by Flamingo, IDEFICS-2/3

(c) Native multimodal (Gemini, GPT-4o)

Tokens of all modalities (image patches, audio frames, text) are interleaved in a single sequence. One decoder transformer processes them. No separate vision encoder — image patches go through learned input projection only.

Stronger fusion, harder to train (need huge multimodal data + careful tokenization).

Q-Former (BLIP-2) — token compression

Lightweight transformer with learnable query tokens that cross-attend to vision features. Reduces the variable-length vision token sequence to a fixed K queries (e.g., 32). Plug those into the LLM.
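
EXAMPLE — Q-Former-style compression (PyTorch sketch)

A minimal sketch of the compression idea only: K learnable queries cross-attend to a variable-length sequence of frozen vision features and come out as exactly K tokens for the LLM. BLIP-2's real Q-Former also has self-attention layers and text conditioning; dimensions here are illustrative.

    import torch
    import torch.nn as nn

    class QFormerSketch(nn.Module):
        def __init__(self, vis_dim=1024, dim=768, k=32, n_heads=12):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(1, k, dim) * 0.02)   # K learnable query tokens
            self.kv_proj = nn.Linear(vis_dim, dim)
            self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.out = nn.Linear(dim, dim)

        def forward(self, vision_feats):              # (B, N_patches, vis_dim), N varies per image
            kv = self.kv_proj(vision_feats)
            q = self.queries.expand(vision_feats.shape[0], -1, -1)
            fused, _ = self.cross_attn(q, kv, kv)     # queries attend to vision features
            return self.out(fused)                    # (B, K, dim) → feed to the LLM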

PITFALL — VLM TRAINING TRAPS
Common training failures:
  • Freezing the vision encoder too aggressively starves the adapter of useful gradients.
  • Skipping the alignment-only stage causes the LLM to drift from text quality.
  • Tiny visual instruction-tuning sets cause overfitting to LLaVA-style answer formats.
  • AnyRes blows up token budgets — set a tile cap and a global-thumbnail policy.
  • For native-fusion training, mismatched tokenizer scales between modalities cause optimization instability.
REMEMBER
  • LLaVA = projection (cheap). Flamingo = cross-attention (preserves vision). Native = one decoder (strongest fusion).
  • Q-Former = compress variable visual tokens to fixed K queries.
  • Most teams start with LLaVA-style and graduate to native later.
10
VLM · DATA

Multimodal training data & pipelines

TL;DR

Stack: web image-text pairs (LAION, COYO) → interleaved web docs (OBELICS) → visual instruction tuning (LLaVA-Instruct, ShareGPT4V) → optional re-captioning with a strong VLM. Video adds HD-VILA, WebVid. The recipe is staged: pretrain encoders, align adapter, then visual instruction tune.

EXAMPLE — canonical VLM training pipeline
  1. Pretrain vision encoder (SigLIP) on web image-text pairs.
  2. Pretrain LLM separately.
  3. Stage 1 alignment: train projection adapter only on caption data; LLM frozen.
  4. Stage 2 instruction tuning: visual-instruct data with full LLM fine-tune.
  5. Optional: RLHF / DPO on visual preference data.
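
A hedged sketch of the Stage-1 alignment step above: only the projection adapter receives gradients, with the vision encoder and LLM frozen. The model interfaces and batch keys are illustrative placeholders, not a specific library's API.

    import torch

    def stage1_alignment_step(vision_encoder, projector, llm, batch, optimizer):
        with torch.no_grad():
            vis_tokens = vision_encoder(batch["images"])      # frozen SigLIP features
        img_embeds = projector(vis_tokens)                    # trainable MLP adapter
        loss = llm(image_embeds=img_embeds,                   # frozen LLM, caption LM loss
                   input_ids=batch["caption_ids"],
                   labels=batch["caption_ids"]).loss
        loss.backward()                                       # gradients reach only the projector
        optimizer.step()
        optimizer.zero_grad()                                 # optimizer holds projector params only
        return loss.item()
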
REMEMBER
  • Two-stage training is the default: align adapter, then full instruction tune.
  • Re-caption with a strong VLM to improve label quality cheaply.
  • OBELICS is the go-to interleaved doc dataset.
Part III — World Models & Robots
11
WORLD MODELS · TWO MEANINGS

World models — generative video and action-conditioned

TL;DR

"World model" in 2026 means one of two things: a generative video model with implicit physics (Sora, Veo, Cosmos) or an action-conditioned latent dynamics model that an agent can plan inside (Dreamer V3, Genie 2, GAIA-1). Robotics labs care about both.

The two meanings

  1. Generative video with implicit physics — Sora, Veo, Cosmos.
  2. Action-conditioned latent dynamics an agent can plan inside — Dreamer V3, Genie 2, GAIA-1.

Genie 2 (DeepMind 2024)

Action-conditioned world model. From a single image, generates a navigable 3D-like environment that responds to keyboard input. Trained on internet video with self-supervised action labels (inferring what actions occurred between frames).

GAIA-1 (Wayve)

Driving world model trained on real-world driving footage. Conditioned on action (steering, acceleration). Used to imagine counterfactual scenarios for AV planning.

Cosmos (NVIDIA 2025)

Foundation models for physical AI. Pre-trained on 20M hours of robotics/driving video. Tokenizers + autoregressive + diffusion variants.

World Labs (Fei-Fei Li)

Generative spatial intelligence. Marble: generates persistent 3D scenes from a single image / video / text prompt. Targeting use in robotics, gaming, simulation.

REMEMBER
  • "World model" = generative video OR action-conditioned latent dynamics.
  • Genie 2 = playable from a single image; GAIA-1 = driving counterfactuals.
  • Cosmos = pretraining at 20M hours scale for physical AI.
12
ROBOTICS · VLA

VLA — Vision-Language-Action for robotics

TL;DR

VLA = take a VLM and add an action head. RT-2 discretized actions as tokens. OpenVLA built the open Llama 2 7B + SigLIP version. π0 swapped token-discretization for a flow-matching action head — continuous distributions, real-robot quality.

EXAMPLE — π0 robot policy

VLM backbone provides vision + language understanding. Add a flow-matching action head that produces continuous action distributions (vs token-discretized actions in RT-2 / OpenVLA). Trained on real robot demos. Outputs chunks of continuous low-level actions (joint / end-effector commands) at high frequency. The combination of strong perception (VLM) + smooth continuous control (flow matching) is the state of the art for general-purpose robot policies.
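
A hedged sketch of the decoding idea only — integrate a learned velocity field from Gaussian noise to a continuous action chunk, conditioned on the VLM's observation embedding. v_head, the horizon, action dimension, and step count are all illustrative assumptions, not π0's actual interfaces.

    import torch

    @torch.no_grad()
    def sample_action_chunk(v_head, obs_embedding, horizon=50, action_dim=7, steps=10):
        """Euler-integrate da/dt = v_θ(a, t | obs) from noise to an action chunk."""
        b = obs_embedding.shape[0]
        a = torch.randn(b, horizon, action_dim, device=obs_embedding.device)
        for i in range(steps):
            t = torch.full((b,), i / steps, device=obs_embedding.device)
            a = a + v_head(a, t, obs_embedding) / steps
        return a                                   # (B, horizon, action_dim) continuous actions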

REMEMBER
  • RT-2 = action-as-token; OpenVLA = open Llama version; π0 = flow-matching action head.
  • Continuous action distributions beat token-discretized for fine motor control.
  • VLM backbone is the unifying substrate.

0 → hero multimodal path

  1. foundation Lilian Weng — What are Diffusion Models?
  2. foundation Yang Song — Generative Modeling by Estimating Gradients
  3. foundation Jay Alammar — Illustrated Stable Diffusion
  4. build denoising-diffusion-pytorch — train DDPM on MNIST
  5. build OpenAI CLIP — read training code
  6. depth DDPM (Ho 2020)
  7. depth DDIM (Song 2020)
  8. depth Classifier-Free Guidance (Ho 2022)
  9. depth Flow Matching (Lipman 2023)
  10. depth Rectified Flow (Liu 2022)
  11. depth DiT (Peebles & Xie 2022)
  12. depth CLIP (Radford 2021)
  13. depth SigLIP (Zhai 2023)
  14. depth ViT (Dosovitskiy 2020)
  15. depth LLaVA

Multimodal quiz — readiness check

  1. Why does SigLIP scale better than CLIP?
    Show answer

    Sigmoid loss is per-pair (no cross-batch normalization), composes cleanly across data-parallel ranks (no cross-rank embedding gather), works at small batch sizes (CLIP needs huge batches for many negatives).

  2. Walk through DDPM training and sampling.
    Show answer

    Forward: noise x_0 incrementally with a known schedule, closed form for any t: x_t = √(ᾱ_t) x_0 + √(1−ᾱ_t) ε. Train ε_θ(x_t, t) to predict the noise. Sample: iteratively denoise from N(0, I) over T steps using the learned ε_θ.

  3. What is classifier-free guidance and why does it work?
    Show answer

Train one model both conditionally (on prompt c) and unconditionally (drop c with prob 0.1). At sampling: ε̃ = ε(x, ∅) + w · (ε(x, c) − ε(x, ∅)). The difference ε(x, c) − ε(x, ∅) acts as an implicit classifier gradient toward the condition; scaling it by w > 1 sharpens conditioning at the cost of diversity.

  4. Difference between flow matching and DDPM?
    Show answer

    Flow matching learns a velocity field for an ODE: dx/dt = v_θ(x, t). DDPM learns noise prediction for an SDE. Flow matching with rectified flow can sample in 1–4 steps; DDPM needs 50+. SD3 / Flux use rectified flow.

  5. How does Sora handle variable resolution + duration?
    Show answer

    DiT (diffusion transformer) over spacetime patches. Each video chunk is patched in (t, h, w) and tokenized; the transformer is sequence-length flexible. Positional encoding generalizes across resolutions and lengths.

  6. Compare LLaVA-style adapter vs Flamingo cross-attention vs native multimodal.
    Show answer

    LLaVA: project vision into LLM's token space; cheap, good. Flamingo: cross-attention layers interleaved with LLM; preserves visual representation; more compute. Native (Gemini, GPT-4o): all tokens in one stream — strongest fusion, hardest to train.

  7. Why does SD3 use 3 text encoders?
    Show answer

CLIP-L (rich semantics), CLIP-G (broader knowledge), T5-XXL (long-text understanding, structured prompts). Concatenated; MM-DiT attends jointly over text + image tokens. Each encoder contributes complementary information.

  8. Design a VLM training pipeline.
    Show answer

    (1) Pretrain vision encoder (SigLIP) on web image-text pairs. (2) Pretrain LLM. (3) Stage 1: train projection adapter only on caption data. (4) Stage 2: visual instruction tuning with full LLM fine-tune. (5) Optional RLHF/DPO on visual preference data.

  9. What's a "world model" in the 2026 AI sense?
    Show answer

    Two related meanings: (1) Generative video model (Sora, Veo, Genie) — generates plausible video conditioned on text/image/action; implicit physics. (2) Action-conditioned latent dynamics model used by RL agents (Dreamer, Genie 2, GAIA-1) — for planning/imagination.

  10. VLA (Vision-Language-Action) — how does π0 work?
    Show answer

VLM backbone provides vision + language understanding. Add a flow-matching action head that produces continuous action distributions (vs token-discretized actions in RT-2 / OpenVLA). Trained on real robot demos. Outputs chunks of continuous low-level actions (joint / end-effector commands) at high frequency.

  11. What's a Q-Former (BLIP-2)?
    Show answer

    Lightweight transformer with K learnable query tokens that cross-attend to vision features. Reduces variable-length vision tokens to fixed K (e.g., 32). Plug those into the LLM. Cheap and effective adapter.

  12. Why is patch tokenization useful?
    Show answer

Treats images as sequences of N patches. Each patch is linearly projected to a token. Avoids the inductive biases of CNNs (locality, translation equivariance), letting attention learn whatever spatial relations are needed. ViT proved this scales beautifully with data.

  13. What's AnyRes / dynamic resolution in modern VLMs?
    Show answer

    Instead of one fixed input resolution, chunk large images into multiple ViT-sized tiles + a global thumbnail. Each tile contributes its own tokens. Used by LLaVA-1.6, GPT-4V, Qwen-VL. Trades token budget for resolution per request.

  14. Why is text-to-video harder than text-to-image?
    Show answer

    (1) Temporal consistency — objects must persist coherently. (2) Compute scales with frames × resolution. (3) Less aligned (text, video) data than (text, image). (4) Physics — incorrect dynamics are immediately obvious. (5) Audio synchronization. Sora and Veo are state of the art.

  15. Explain DiT's adaLN-Zero conditioning.
    Show answer

Conditioning info (timestep + class embedding) modulates each transformer block via regressed per-channel scale, shift, and gate parameters: scale + shift the LayerNorm output, gate the residual branch. The gates are initialized to zero (so each block is the identity at init), letting the model learn how much conditioning to use. Standard in modern diffusion transformers.