Scaling rectified flow transformers for high-resolution image synthesis — straight-line trajectories from noise to data, a multimodal DiT with separate text/image streams joined by attention, and predictable scaling laws.
By early 2024, diffusion models dominated image generation. SDXL, DALL-E 3, and Midjourney could produce stunning images from text. But three stubborn problems remained:
All diffusion-style models share the same core idea: define a forward process that turns data into noise, then learn to reverse it. The question is what path the forward process takes.
Given a data sample x0 and noise ε ~ N(0, I), define the noisy sample at time t as:
zt = (1−t)x0 + tε
At t = 0, z0 = x0 (pure data). At t = 1, z1 = ε (pure noise). For any t in between, zt is a linear interpolation between data and noise. This is the defining property of rectified flow — the path from data to noise is a straight line.
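The interpolation can be sketched in a few lines of NumPy. The function name `interpolate` and the toy 4-dimensional vectors are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def interpolate(x0, eps, t):
    """Straight-line forward process: z_t = (1 - t) * x0 + t * eps."""
    return (1.0 - t) * x0 + t * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))    # stand-in for a data sample
eps = rng.normal(size=(4,))   # Gaussian noise sample

z_start = interpolate(x0, eps, 0.0)  # equals x0 (pure data)
z_mid = interpolate(x0, eps, 0.5)    # midpoint of the segment
z_end = interpolate(x0, eps, 1.0)    # equals eps (pure noise)
```

Every intermediate zt lies on the segment between x0 and ε, which is what "straight path" means here.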
Compare this to DDPM, which uses a variance-preserving schedule: zt = αt x0 + σt ε, where αt and σt follow specific curved schedules. The DDPM path from data to noise is curved — the signal and noise don't mix linearly.
Why does this matter? Because at inference time, you need to follow these paths backwards. A curved path requires many small steps to trace accurately (like driving on a winding mountain road). A straight path can theoretically be traversed in a single step (like a highway).
The network vΘ learns the velocity field by minimizing the conditional flow matching loss:
L(Θ) = E[ ‖vΘ(zt, t) − (ε − x0)‖² ], with the expectation taken over t, x0, and ε.
The target velocity ε − x0 is constant for a given (x0, ε) pair — it doesn't depend on t. This makes the learning problem simpler than DDPM's time-dependent noise prediction.
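A minimal sketch of how one batch of this objective might be computed, assuming the loss is the plain (unweighted) squared error between predicted and target velocity. `velocity_fn` is a hypothetical stand-in for the network vΘ, and the zero-predicting "model" exists only to make the example runnable.

```python
import numpy as np

def cfm_loss(velocity_fn, x0, eps, t):
    """Monte Carlo estimate of the flow matching loss for one batch.

    velocity_fn(z, t) is a hypothetical stand-in for the network.
    x0, eps: arrays of shape (batch, dim); t: array of shape (batch,).
    """
    t_col = t[:, None]                        # broadcast t over the feature dim
    z_t = (1.0 - t_col) * x0 + t_col * eps    # point on the straight path
    target = eps - x0                         # constant velocity, independent of t
    pred = velocity_fn(z_t, t)
    return np.mean(np.sum((pred - target) ** 2, axis=1))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 4))
eps = rng.normal(size=(8, 4))
t = rng.uniform(size=(8,))

# A trivial "model" that always predicts zero velocity, for illustration only.
zero_model = lambda z, t: np.zeros_like(z)
loss = cfm_loss(zero_model, x0, eps, t)
```

Note that `t` only enters through the input zt; the regression target is the same at every timestep.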
Rectified flow (teal) takes a straight path from data to noise. DDPM (orange) follows a curved variance-preserving path. The straight path needs fewer solver steps.
Previous text-to-image models (Stable Diffusion 1/2, SDXL) use a U-Net backbone with cross-attention: the image features attend to text features, but text features are frozen — they never get to see what the image is doing. This is a one-way street.
SD3 replaces the U-Net with a Multimodal Diffusion Transformer (MMDiT), which treats text and image as two parallel streams that interact through joint attention.
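The MMDiT block has two independent pathways: an image stream that carries the latent patch tokens and a text stream that carries the encoded prompt tokens.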
Each stream has its own LayerNorm, linear projections for Q/K/V, and MLP. But for the attention operation, the two sequences are concatenated and attention is computed jointly. This means every image patch can attend to every text token, and every text token can attend to every image patch.
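The following is a simplified single-head sketch of that joint attention step, not the actual SD3 implementation. It omits multi-head splitting, the output projection, normalization, and residual connections; the weight dictionaries and toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(img, txt, w_img, w_txt):
    """Single-head joint attention over concatenated text and image tokens.

    img: (n_img, d), txt: (n_txt, d).
    w_img / w_txt: dicts with separate 'q', 'k', 'v' matrices of shape (d, d).
    """
    n_txt = txt.shape[0]
    # Each stream projects with its own weights...
    q = np.concatenate([txt @ w_txt["q"], img @ w_img["q"]], axis=0)
    k = np.concatenate([txt @ w_txt["k"], img @ w_img["k"]], axis=0)
    v = np.concatenate([txt @ w_txt["v"], img @ w_img["v"]], axis=0)
    # ...but attention runs over the combined sequence.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    # Split the result back into the two streams.
    return out[n_txt:], out[:n_txt], attn

rng = np.random.default_rng(0)
d, n_img, n_txt = 8, 6, 3
img = rng.normal(size=(n_img, d))
txt = rng.normal(size=(n_txt, d))
w_img = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in "qkv"}
w_txt = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in "qkv"}

img_out, txt_out, attn = joint_attention(img, txt, w_img, w_txt)
```

The attention matrix is (n_txt + n_img) square, so its off-diagonal blocks are exactly the text-to-image and image-to-text interactions that one-way cross-attention lacks.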
Both streams are modulated by the diffusion timestep t and a pooled text embedding y (from CLIP). These modulation parameters (α, β, γ) scale and shift the hidden states — identical to DiT's adaptive layer norm (adaLN). Each stream gets its own set of modulation parameters, so the same timestep signal can affect text and image processing differently.
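A sketch of adaLN-style modulation for one stream, under stated assumptions: the three parameters are treated as shift, scale, and gate following the DiT convention, the conditioning vector `cond` stands in for the combined timestep and pooled-text embedding, and the zero initialization mirrors DiT's adaLN-Zero. None of the names below come from the SD3 code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def modulate(x, shift, scale):
    """adaLN-style modulation: normalize, then scale and shift per channel."""
    return layer_norm(x) * (1.0 + scale) + shift

def modulation_params(cond, w, b):
    """Map the conditioning vector to (shift, scale, gate), each of width d."""
    shift, scale, gate = np.split(cond @ w + b, 3, axis=-1)
    return shift, scale, gate

rng = np.random.default_rng(0)
d, n_tokens = 8, 5
cond = rng.normal(size=(d,))            # stand-in for embed(t) + embed(y)
x_img = rng.normal(size=(n_tokens, d))  # image-stream hidden states

# Zero-initialized projection (the adaLN-Zero convention from DiT).
w_img, b_img = np.zeros((d, 3 * d)), np.zeros(3 * d)
shift, scale, gate = modulation_params(cond, w_img, b_img)
h = modulate(x_img, shift, scale)       # reduces to plain LayerNorm here
x_out = x_img + gate * h                # gate = 0, so the block starts as identity
```

A text stream would use its own `w` and `b`, which is how the same conditioning signal can affect the two modalities differently.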
The model size is controlled by a single parameter: depth d (number of MMDiT blocks). The hidden dimension is 64d, the MLP expands to 4 × 64d, and there are d attention heads. This gives a clean scaling axis: d=15 is ~500M params, d=24 is ~2B, d=38 is ~8B.
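The sizing rule is simple enough to write down directly. The helper below (a hypothetical name) only derives layer widths from depth; it does not attempt to reproduce the parameter counts quoted above.

```python
def mmdit_widths(depth):
    """Derive layer widths from depth d using the rule in the text."""
    hidden = 64 * depth
    return {
        "blocks": depth,
        "hidden_dim": hidden,
        "mlp_dim": 4 * hidden,
        "num_heads": depth,
        "head_dim": hidden // depth,  # always 64 under this rule
    }

for d in (15, 24, 38):
    print(d, mmdit_widths(d))
```

Because the head count grows with depth while the hidden size is 64d, every configuration ends up with 64-dimensional attention heads.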
Two parallel streams (image in teal, text in orange) with separate weights for LayerNorm, Q/K/V projections, and MLPs. They merge only for the joint attention computation, then split back into their own streams.
When finetuning on high resolutions, the attention logits can grow uncontrollably — the largest attention values explode, causing entropy collapse (all attention weight concentrates on one token). SD3 applies RMSNorm to Q and K before computing attention, which stabilizes training and allows efficient bf16 mixed-precision even at 1024×1024.
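To see why normalizing Q and K bounds the logits, here is a small NumPy sketch. The `rms_norm` helper, the unit gain, and the deliberately inflated activations are assumptions chosen to make the effect visible; a real model would learn the gain.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Rescale each vector to unit root-mean-square, then apply a gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

rng = np.random.default_rng(0)
n, head_dim = 4, 64
q = rng.normal(size=(n, head_dim)) * 50.0   # deliberately large activations
k = rng.normal(size=(n, head_dim)) * 50.0
gain = np.ones(head_dim)                    # learnable in a real model

raw_logits = q @ k.T / np.sqrt(head_dim)
normed_logits = rms_norm(q, gain) @ rms_norm(k, gain).T / np.sqrt(head_dim)
```

With unit gain, each normalized vector has length at most √head_dim, so no logit can exceed √head_dim in magnitude regardless of how large the raw activations become.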
SD3 doesn't rely on a single text encoder. It uses three pretrained text models, each capturing different aspects of the prompt:
CLIP-L/14: OpenAI's CLIP with a ViT-L/14 image encoder. Produces a 77-token sequence embedding and a pooled 768-dim vector. Good at visual-semantic alignment: it knows what words "look like" because it was trained on image-text pairs.
CLIP-G/14: A larger CLIP variant (OpenCLIP ViT-bigG/14). Also produces 77 tokens plus a pooled vector. The larger model captures more nuanced visual concepts.
T5-XXL: The 4.7B-parameter encoder of Google's text-to-text transformer. Unlike CLIP (which is trained on image-text pairs), T5 is a pure language model. It handles complex sentence structure, negation, spatial relationships, and counting better, which are precisely the things CLIP struggles with.
The two CLIP models provide the pooled text embedding y (concatenated, then projected), which together with the timestep drives the modulation in every MMDiT block. The sequence outputs of all three encoders form the context sequence c: the two CLIP outputs are concatenated along the channel axis (77 tokens), zero-padded to T5's channel width, and then joined with the T5 tokens along the sequence axis. This context sequence is the text stream in MMDiT's joint attention.
Two key training innovations separate SD3 from vanilla rectified flow: logit-normal timestep sampling and resolution-dependent shifting.
In standard rectified flow, timesteps t are sampled uniformly from [0, 1]. But not all timesteps are equally informative. At t ≈ 0, zt is almost pure data — easy to predict. At t ≈ 1, zt is almost pure noise — the prediction target is roughly the dataset mean. The hardest (most informative) predictions are at intermediate timesteps.
SD3 samples timesteps from a logit-normal distribution with density:
π(t; m, s) = 1 / (s√(2π)) · 1 / (t(1−t)) · exp(−(logit(t) − m)² / (2s²))
where logit(t) = log(t / (1−t)). With m = 0 and s = 1, this puts most sampling weight on intermediate timesteps while still covering the endpoints with some probability.
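Sampling from a logit-normal is straightforward: draw a Gaussian and pass it through a sigmoid, which is the inverse of logit. The sketch below assumes this standard construction; the function name and sample size are illustrative.

```python
import numpy as np

def sample_logit_normal(rng, size, m=0.0, s=1.0):
    """Draw t in (0, 1) such that logit(t) ~ N(m, s^2)."""
    u = rng.normal(loc=m, scale=s, size=size)
    return 1.0 / (1.0 + np.exp(-u))  # sigmoid is the inverse of logit

rng = np.random.default_rng(0)
t = sample_logit_normal(rng, size=100_000)

# Fraction of samples in the middle half of the interval.
middle_fraction = np.mean((t > 0.25) & (t < 0.75))
```

With m = 0 and s = 1, roughly 73% of the samples fall in (0.25, 0.75), compared with 50% under uniform sampling.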
Uniform sampling (flat) vs logit-normal (peaked in the middle). The logit-normal biases training toward the most informative intermediate timesteps; m shifts the location of the peak and s controls its width.
Higher resolution images have more pixels, so they need more noise to destroy the signal. Think of it this way: a 1024×1024 image has 16× more pixels than 256×256. If you average the noisy pixels (as a rough estimate of the clean image), the law of large numbers gives you a much better estimate with more pixels. So the same amount of noise is "less noisy" at higher resolutions.
SD3 addresses this with a resolution-dependent shift. Given a model pretrained at resolution n and finetuned at resolution m (both measured in pixels), the timestep mapping is:
tm = α tn / (1 + (α − 1) tn), with α = √(m/n)
For 1024×1024 finetuning from 256×256, the formula gives α = √16 = 4. The paper instead selects α = 3.0 based on a human preference study. Because the mapping moves every intermediate timestep toward t = 1, the sample at each step is noisier, which compensates for the higher resolution.
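A small sketch of the shift, assuming the mapping takes the form t' = αt / (1 + (α − 1)t). The function name is hypothetical.

```python
def shift_timestep(t, alpha):
    """Map t in [0, 1] toward the noisy end; alpha = 1 leaves t unchanged."""
    return alpha * t / (1.0 + (alpha - 1.0) * t)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f} -> {shift_timestep(t, alpha=3.0):.3f}")
```

With α = 3.0, the midpoint t = 0.5 maps to 0.75 while the endpoints stay fixed, so the model spends more of the schedule at high noise levels.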
| Component | Parameters | Status |
|---|---|---|
| CLIP-L/14 | ~124M | Frozen (pretrained by OpenAI) |
| CLIP-G/14 | ~1.8B | Frozen (pretrained, OpenCLIP) |
| T5-XXL encoder | 4.7B | Frozen (pretrained by Google) |
| VAE (16-ch) | ~80M | Pretrained separately, frozen during MMDiT training |
| MMDiT (d=38) | ~8B | Trained from scratch |
Total system: ~14.7B parameters. Of those, only the 8B MMDiT is trained for the diffusion task. The 6.7B in text encoders and VAE are pretrained and frozen — they provide the "language" and "pixel" expertise, while MMDiT learns to compose them.
Three concrete reasons:
One of SD3's most important contributions is demonstrating that MMDiT follows predictable scaling trends, just like language models.
The paper trains MMDiT models at depths d = 15, 18, 21, 24, 30, and 38 (roughly 500M to 8B parameters) for 500k steps each on 256² images. The validation loss decreases smoothly and predictably as model size increases. Crucially, the curves show no sign of flattening over this range, which suggests even larger models would continue to improve.
That lower validation loss means better images sounds obvious but isn't guaranteed. Many generative models show poor correlation between training loss and sample quality. SD3 demonstrates a strong, monotonic relationship: lower validation loss consistently correlates with better CLIP scores, better FID, and better human preference ratings.
The 8B model (d=38) loses only 2.71% CLIP score when reducing from 50 to 5 sampling steps. The 500M model (d=15) loses 4.30%. Larger models learn straighter trajectories, so they tolerate aggressive step reduction better. This means bigger models are not only better but also faster per quality level.
Validation loss decreases smoothly with model depth. Larger models also maintain more performance when using fewer sampling steps.
SD3's 8B model achieves state-of-the-art performance across multiple benchmarks, outperforming both open-source and closed-source alternatives.
GenEval tests compositional text-to-image generation: can the model render the right number of objects, in the right colors, in the right positions? SD3 (d=38, 1024², DPO-aligned) scores 0.74 overall, compared to DALL-E 3's 0.67 and SDXL's 0.55.
The breakdown is striking:
In head-to-head human evaluations on PartiPrompts, SD3 wins against SDXL Turbo, SDXL, Pixart-α, and DALL-E 3 across visual quality, prompt following, and typography. The typography advantage is particularly large — a direct benefit of MMDiT's joint attention allowing the model to "see" text tokens while generating image patches.
Overall GenEval scores measuring compositional text-to-image generation. SD3 (8B + DPO) outperforms all competitors including DALL-E 3.
Understanding what changed from SDXL to SD3 reveals why each design choice matters.
DALL-E 2 (OpenAI) used a two-stage approach: a prior model generates CLIP image embeddings from text, then a diffusion decoder generates images from those embeddings. DALL-E 3's architecture has not been published in full detail. SD3 has no separate prior stage: the text encoders feed directly into MMDiT.
DALL-E 3's key innovation was training on highly descriptive synthetic captions. SD3 adopts this insight (50/50 original + CogVLM synthetic captions) while also advancing the architecture and noise formulation.
Few-step sampling is where rectified flow shines. SD3's performance degrades gracefully with fewer steps, while DDPM-style formulations degrade much faster. At 5 steps, rectified flow formulations still produce coherent images, whereas traditional formulations tend to produce blurry results.
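The link between straight paths and few-step sampling can be checked on a toy problem. The sketch below is an illustration under a strong assumption: the "dataset" is a single point x0, so the exact velocity field is known in closed form as v(z, t) = (z − x0) / t. A trained network would only approximate its velocity field, and real data gives curved marginal paths.

```python
import numpy as np

def euler_sample(velocity_fn, z1, num_steps):
    """Integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0 (data)."""
    z = z1.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity_fn(z, t)
    return z

# Toy setup: one data point, so every path from noise to data is a straight line.
x0 = np.array([2.0, -1.0])
exact_velocity = lambda z, t: (z - x0) / t

rng = np.random.default_rng(0)
z1 = rng.normal(size=2)
one_step = euler_sample(exact_velocity, z1, num_steps=1)
ten_steps = euler_sample(exact_velocity, z1, num_steps=10)
```

In this idealized case a single Euler step lands on x0, the same place ten steps reach. The closer a learned field comes to this straight-line behavior, the fewer solver steps it needs.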
| Metric | SD3-2B (d=24) | SD3-8B (d=38) |
|---|---|---|
| MMDiT params | ~2B | ~8B |
| Total system params | ~8.7B | ~14.7B |
| Training resolution | 256², then 1024² finetune | 256², then 1024² finetune |
| Training steps | 500K+ | 500K+ |
| VRAM for inference (bf16) | ~18 GB | ~30 GB |
| Steps for good quality | 28-50 Euler | 20-50 Euler |
| Inference time (A100) | ~5-7s | ~10-14s |
The VRAM requirement is dominated by the text encoders at inference: T5-XXL alone needs ~9.4 GB in bf16. For memory-constrained deployment, T5 can be dropped at inference (at some cost to adherence on complex prompts and to typography) to save nearly 10 GB, or quantized to int8 (~4.7 GB).
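The T5 figures follow from simple arithmetic on parameter count and bytes per parameter. The helper below is an illustrative sketch that counts weights only; activations, caches, and framework overhead are not included.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory for weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

t5_params = 4.7e9  # T5-XXL encoder size quoted in the text
for dtype in ("bf16", "int8"):
    print(dtype, weight_memory_gb(t5_params, dtype))
```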
Like its predecessors, SD3 operates in a compressed latent space, not in pixel space. A pretrained autoencoder maps images to and from this latent representation.
A 1024×1024 RGB image has ~3.1M values (about 1M pixels). Running a transformer with full attention over roughly a million pixel tokens is computationally infeasible. The autoencoder compresses the image by a factor of 8 in each spatial dimension: 1024×1024×3 → 128×128×16, about 92% fewer values (a 12× reduction). The diffusion model works in this compact space, then the decoder reconstructs the final image.
The 16-channel latent is a key upgrade from SDXL (4 channels) and previous Stable Diffusion versions. More channels give a richer latent representation and higher reconstruction quality. The paper shows this systematically:
The 16-channel autoencoder reduces reconstruction FID by 56% compared to 4-channel. This is crucial because the autoencoder's reconstruction quality is an upper bound on the final image quality — the diffusion model can never generate images better than the autoencoder can reconstruct.
The 128×128×16 latent is further divided into 2×2 patches, yielding 64×64 = 4096 tokens, each of dimension 16×4 = 64. These tokens form the image sequence in MMDiT. This patch size matches DiT's design and keeps the sequence length manageable for attention.
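The token count can be verified with a short patchify sketch. The channels-last layout and the function name are assumptions for illustration; a real model would follow this with a learned linear projection from 64 dimensions to the hidden width.

```python
import numpy as np

def patchify(latent, patch=2):
    """Turn an (H, W, C) latent into (H/p * W/p, p*p*C) patch tokens."""
    h, w, c = latent.shape
    x = latent.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return x.reshape((h // patch) * (w // patch), patch * patch * c)

latent = np.zeros((128, 128, 16), dtype=np.float32)
tokens = patchify(latent)
print(tokens.shape)
```

Each token gathers the 16 channels of a 2×2 neighborhood, so the sequence has 4096 entries of 64 values each.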
The 16-channel autoencoder is trained independently before any diffusion training begins. It uses a combination of:
Once trained, the VAE is frozen for all subsequent diffusion training. This two-stage approach is inherited from the original Latent Diffusion paper (Rombach et al., 2022). The VAE has ~80M parameters — small compared to the 8B MMDiT it supports.
Why the VAE is a bottleneck: The autoencoder's reconstruction quality is an absolute ceiling on final image quality. No matter how perfect the MMDiT becomes, it cannot generate details the VAE cannot reconstruct. This is why upgrading from 4 to 16 channels matters so much — it lifts the ceiling. Fine details like thin lines, small text, and fabric textures that were permanently lost in a 4-channel latent are now preserved.
SD3 sits at the intersection of several research threads. Understanding these connections clarifies both where SD3 came from and where the field is going.
MMDiT is a direct descendant of DiT, which first showed that replacing U-Net with a transformer backbone for diffusion models works and scales. DiT used class-conditional generation with adaLN modulation. MMDiT extends this to text-to-image by adding the text stream and joint attention. In fact, DiT with concatenated text+image tokens is a special case of MMDiT with shared weights.
DDPM established the modern diffusion framework: define a forward noising process, train a network to reverse it. SD3's rectified flow is a different choice of forward process (straight lines instead of curved variance-preserving paths) but the same fundamental idea: learn to undo the noise.
Rectified flow is a special case of the flow matching framework. Flow matching provides the theoretical foundation: you can train velocity fields by matching conditional vector fields along specified probability paths. Rectified flow simply chooses the straight-line path zt = (1−t)x0 + tε. SD3's contribution is showing this works at scale with the right timestep sampling.
Several SD3 authors later founded Black Forest Labs and created Flux, which builds on the same MMDiT foundation but introduces a hybrid architecture: early layers use MMDiT-style joint attention, while later layers use single-stream attention (text and image tokens are fully merged). Flux also drops the second CLIP encoder (CLIP-G), keeping CLIP-L and T5, and simplifies the modulation mechanism.
SD3 inherits the core LDM insight: train the diffusion model in a compressed latent space, not in pixel space. The key upgrade is moving from 4 to 16 latent channels and using a transformer backbone instead of a U-Net.
| Transition | What changed | What it improved |
|---|---|---|
| DDPM → LDM | Pixel space → latent space | 9x compute reduction, same quality |
| LDM → DiT | U-Net → transformer backbone | Clean scaling laws, SOTA FID |
| DiT → SD3 | Class-cond → text-cond (MMDiT), DDPM → rectified flow, 4ch → 16ch VAE | Text generation, 4x fewer steps, sharper images |
| SD3 → Flux | Dual-stream → hybrid single/dual, simplified modulation | Further efficiency gains, production deployment |