DiT — Veanors

Chapter 0: The Problem

By 2022, diffusion models had become the dominant paradigm for image generation. DALL-E 2, Imagen, Stable Diffusion — all of them produced stunning images. And all of them used the same backbone: a U-Net.

The U-Net was inherited from early pixel-level models (PixelCNN++) and adapted by Ho et al. for DDPM. It's convolutional, with ResNet blocks at multiple resolutions and self-attention sprinkled in at lower resolutions. Dhariwal and Nichol (ADM) ablated some architectural choices — channel counts, normalization layers — but the high-level design remained essentially unchanged from 2020 to 2022.

Meanwhile, a different story was playing out everywhere else in deep learning. Transformers had taken over NLP, vision (ViT), reinforcement learning, and even protein folding. The reason? Transformers scale. Double the compute, get predictably better performance. Clean scaling laws. No architectural ceilings.

But diffusion models were stuck with U-Nets. Nobody had seriously asked: what happens if you replace the U-Net with a transformer?

The core tension: U-Nets work well for diffusion, but they are a bespoke convolutional architecture with domain-specific inductive biases (skip connections across resolutions, multi-scale processing). Transformers have proven more scalable in every other domain. Can we bring the scaling properties of transformers to diffusion models?

There were practical concerns too. U-Nets are awkward to scale — you can add channels or attention heads, but there's no clean "just make it bigger" knob like there is for transformers (where you simply increase depth and width). The architecture is also hard to unify across modalities, making it difficult to share training recipes or transfer insights from language modeling.

Why is the U-Net backbone potentially limiting for diffusion model scaling?

U-Nets are bespoke convolutional architectures without a clean scaling axis: unlike transformers, there's no simple "depth x width" knob that yields predictable performance gains
U-Nets can't generate high-resolution images
U-Nets don't support conditioning on class labels

Chapter 1: The Key Insight

Peebles and Xie's insight is beautifully simple: use a standard Vision Transformer as the denoising backbone in a latent diffusion model.

The recipe:

Take a noised latent representation (from a pretrained VAE)
Patchify it into a sequence of tokens, just like ViT does with images
Add positional embeddings
Process through a stack of transformer blocks
Decode back to predict noise and covariance

That's it. No multi-scale processing. No skip connections across resolutions. No convolutional layers (except inside the frozen VAE). Just a plain transformer operating on a flat sequence of patches.

Parameter budget breakdown (DiT-XL/2)

Component	Parameters	Notes
Patch embedding	~20K	Linear: (p×p×4) → 1152 = 16 × 1152 weights + 1152 biases
Positional embedding	0 (fixed)	256 positions × 1152 dims ≈ 295K values, sine-cosine, not learned
28 DiT blocks	~669M	Per block ≈ 23.9M: MHSA (4 × 1152²) + FFN (2 × 1152 × 4608) + adaLN modulation (1152 × 6·1152)
adaLN modulation	~8M per block	Already counted in the block total; maps the t+c embedding to six 1152-dim vectors per block
Final layer	~2.7M	adaLN modulation (1152 → 2 × 1152) plus the linear decode 1152 → p×p×2C = 32 (~37K)
Label embedding	~1.2M	1000 classes × 1152 dim lookup table
Timestep MLP	~1.6M	Sinusoidal embed → 2-layer MLP → 1152 dim (assuming a 256-dim sinusoidal input, as in the reference implementation)
Total DiT-XL/2	675M	All trained from scratch

The overwhelming majority (99%+) of parameters sit in the 28 transformer blocks. The conditioning, embedding, and decode layers are negligible. This is what makes DiT so clean: scale the blocks and everything else is noise.
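The block count can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not the reference implementation: it assumes each block holds a QKV projection, an attention output projection, a 4x-wide FFN, and one linear adaLN modulation layer producing six width-sized vectors, all with biases. The helper names are made up for illustration.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional biases of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def dit_block_params(d, mlp_ratio=4):
    """Rough parameter count of one adaLN-Zero block of width d."""
    attention = linear_params(d, 3 * d) + linear_params(d, d)  # QKV + output projection
    ffn = linear_params(d, mlp_ratio * d) + linear_params(mlp_ratio * d, d)
    modulation = linear_params(d, 6 * d)  # assumption: one linear layer per block
    return attention + ffn + modulation


WIDTH, DEPTH, TOTAL = 1152, 28, 675e6  # DiT-XL/2 figures quoted above
per_block = dit_block_params(WIDTH)
all_blocks = DEPTH * per_block
print(f"per block : {per_block / 1e6:.1f}M")
print(f"all blocks: {all_blocks / 1e6:.1f}M ({all_blocks / TOTAL:.1%} of 675M)")
```

Under these assumptions the blocks account for a little over 99% of the 675M total, which matches the claim that everything outside the blocks is negligible.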

Why this is surprising: The U-Net's multi-resolution structure — downsampling, processing, upsampling with skip connections — seemed essential. It lets the network reason at multiple scales simultaneously. Removing all of that and replacing it with a flat sequence of patches processed at a single resolution shouldn't work as well. But it does. The self-attention mechanism in the transformer implicitly learns to handle multi-scale reasoning without the architectural scaffolding.

The paper calls this architecture DiT: Diffusion Transformer. The key finding is that DiTs exhibit the same clean scaling behavior that makes transformers so powerful in language: more Gflops = lower FID, with strong correlation (-0.93). This means you can improve image quality simply by making the model bigger, with no architectural changes needed.

The best model, DiT-XL/2, achieves a state-of-the-art FID of 2.27 on class-conditional ImageNet 256x256, outperforming all prior diffusion models while using fewer Gflops than pixel-space U-Net models like ADM.

What is the central finding of the DiT paper?

Replacing the U-Net with a standard Vision Transformer backbone in latent diffusion yields clean compute scaling (more Gflops = lower FID) and state-of-the-art image quality
U-Nets are always better than transformers for diffusion
DiT uses a hybrid CNN-transformer architecture

Chapter 2: Latent Diffusion Setup

DiT doesn't operate on raw pixels. It uses the Latent Diffusion Model (LDM) framework from Rombach et al. — the same framework behind Stable Diffusion. This is a two-stage approach:

Stage 1: Learn a VAE

A variational autoencoder is trained to compress images into a compact latent space. For a 256x256x3 RGB image, the VAE encoder E produces a latent z = E(x) with shape 32x32x4 — an 8x spatial downsampling with 4 channels. DiT uses the off-the-shelf VAE from Stable Diffusion, frozen during DiT training.

Stage 2: Diffuse in latent space

The forward diffusion process adds Gaussian noise to z:

z_t = √(ᾱ_t) · z₀ + √(1 − ᾱ_t) · ε, ε ~ N(0, I)

The DiT model learns to reverse this process — given a noisy latent z_t and timestep t, predict the noise ε. After denoising, the clean latent z₀ is decoded back to an image via the VAE decoder: x = D(z).
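The forward process above can be sketched on a toy latent in a few lines. The schedule values here (1000 steps, β from 1e-4 to 2e-2, linear) are assumptions chosen for illustration, the latent is a flat list of floats rather than a 32x32x4 tensor, and `q_sample` is a hypothetical helper name.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * i for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps; t is a 0-based index."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal, noise = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [signal * z + noise * e for z, e in zip(z0, eps)], eps


ALPHA_BAR = cumulative_alphas(linear_betas())
z_t, eps = q_sample([0.5, -1.0, 2.0], t=500, alpha_bar=ALPHA_BAR, rng=random.Random(0))
print(z_t)
```

The denoiser's training target is the `eps` returned here: given `z_t` and `t`, it should recover the noise that was mixed in.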

Why latent space? Training diffusion directly on 256x256x3 pixels is expensive. ADM, a pixel-space U-Net model, requires 1120 Gflops per forward pass. By compressing to 32x32x4 latents first, DiT-XL/2 achieves better results with only 118.6 Gflops — a 9.4x reduction in compute. The VAE does the heavy lifting of learning pixel-level details; the diffusion model only needs to learn the high-level structure.

The complete data flow with exact shapes

Let's trace a single image through the full DiT pipeline, tracking every tensor shape:

Input image

256 × 256 × 3 (RGB pixels)

↓

VAE Encoder E(x)

256 × 256 × 3 → 32 × 32 × 4 latent (8× spatial downsample, 4 channels)

↓

Forward diffusion

32 × 32 × 4 → 32 × 32 × 4 noised latent z_t

↓

Patchify (p=2)

32 × 32 × 4 → 256 tokens of dim d=1152 (each token = 2×2×4 = 16 values, linearly projected to d)

↓

+ Positional embedding

256 × 1152 (sine-cosine embeddings added)

↓

28 × DiT Block (adaLN-Zero)

256 × 1152 → 256 × 1152 (self-attention over all 256 tokens per block)

↓

Final layer norm + Linear decode

256 × 1152 → 256 tokens × (2×2×8) = 32 × 32 × 8 (noise + diagonal covariance)

↓

Split output

32 × 32 × 4 predicted noise ε + 32 × 32 × 4 predicted variance Σ

↓

VAE Decoder D(z₀)

32 × 32 × 4 clean latent → 256 × 256 × 3 output image

Frozen vs. trained: The VAE (both encoder and decoder) is pretrained from Stable Diffusion and completely frozen during DiT training. Only the DiT transformer weights are trained. This separation is critical — it means DiT never needs to learn pixel-level details, only the high-level latent structure. The VAE has ~80M parameters; DiT-XL has 675M parameters, all trained from scratch on ImageNet.
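The shape bookkeeping in the trace above is easy to reproduce. The following sketch only does integer arithmetic on shapes; the function and key names are hypothetical, and the defaults are the DiT-XL/2 values from the trace.

```python
def dit_shapes(image_size=256, vae_downsample=8, latent_channels=4, patch=2, hidden=1152):
    """Return the tensor shapes at each stage of the pipeline described above."""
    side = image_size // vae_downsample          # latent height and width
    tokens = (side // patch) ** 2                # sequence length after patchify
    return {
        "image": (image_size, image_size, 3),
        "latent": (side, side, latent_channels),
        "values_per_patch": patch * patch * latent_channels,
        "tokens": (tokens, hidden),
        "decoded_tokens": (tokens, patch * patch * 2 * latent_channels),
        "output": (side, side, 2 * latent_channels),  # noise + variance channels
    }


for stage, shape in dit_shapes().items():
    print(f"{stage:>16}: {shape}")
```

Calling `dit_shapes(image_size=512)` shows why higher resolutions are costly: the latent becomes 64x64x4 and the sequence grows to 1024 tokens at p=2.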

What is the shape of the latent representation z for a 256x256x3 input image in DiT?

32x32x4: the VAE encoder downsamples spatially by 8x and uses 4 latent channels
128x128x3: a 2x downsample
16x16x8: a 16x downsample with 8 channels

Chapter 3: DiT Architecture

Here's the complete forward pass of DiT, step by step. This is the core of the paper.

Step 1: Patchify

The noised latent z_t (shape 32x32x4) is divided into non-overlapping patches of size p x p. Each patch is linearly embedded into a d-dimensional token vector. For patch size p, the number of tokens is:

T = (I / p)²

Where I = 32 (the spatial dimension of the latent). With p=2, you get T = 256 tokens. With p=8, just T = 16 tokens. Halving p quadruples T and therefore at least quadruples the Gflops: the linear layers scale linearly with T, and the attention-score computation scales quadratically.
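A minimal patchify sketch on nested Python lists makes the token count concrete. It stops before the linear projection to d dimensions, and the order in which values are flattened inside a patch is an arbitrary choice made here, not something the text specifies.

```python
def patchify(latent, p):
    """Split an H x W x C nested list into (H/p) * (W/p) flat patches of p*p*C values."""
    height, width = len(latent), len(latent[0])
    assert height % p == 0 and width % p == 0, "patch size must divide the latent size"
    tokens = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            patch = []
            for dy in range(p):
                for dx in range(p):
                    patch.extend(latent[top + dy][left + dx])  # C channel values
            tokens.append(patch)
    return tokens


latent = [[[0.0] * 4 for _ in range(32)] for _ in range(32)]  # 32 x 32 x 4 dummy latent
for p in (8, 4, 2):
    tokens = patchify(latent, p)
    print(f"p={p}: T={len(tokens)} tokens, {len(tokens[0])} values each")
```

Smaller patches trade fewer values per token for a longer sequence: the total number of values stays at 32 × 32 × 4 = 4096 in every case.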

Step 2: Positional embeddings

Standard sine-cosine frequency-based positional embeddings are added to all tokens. This is the fixed sine-cosine variant used with ViTs rather than a learned embedding table. Nothing fancy.
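One common way to build fixed sine-cosine embeddings for a 2D grid is sketched below. The layout is an assumption for illustration (half the channels encode the row, half the column, with base 10000 for the frequencies); the text does not pin down these details, and the function names are made up.

```python
import math


def sincos_1d(pos, dim):
    """Fixed embedding of one coordinate: dim/2 sines followed by dim/2 cosines."""
    assert dim % 2 == 0
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]


def sincos_2d(grid, dim):
    """One embedding per grid cell, in row-major order (assumed layout)."""
    assert dim % 4 == 0
    return [
        sincos_1d(row, dim // 2) + sincos_1d(col, dim // 2)
        for row in range(grid)
        for col in range(grid)
    ]


pos_embed = sincos_2d(grid=16, dim=1152)  # 16 x 16 patches for p=2 on a 32 x 32 latent
print(len(pos_embed), len(pos_embed[0]))
```

Because the table is computed rather than learned, it adds no trainable parameters and can be regenerated for a different grid size.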

Step 3: N transformer blocks

The token sequence passes through N DiT blocks. Each block contains multi-head self-attention and a pointwise feedforward network (MLP), with layer normalization. The key modification is how conditioning information (timestep t and class label c) enters the block — via adaptive layer normalization with zero initialization (adaLN-Zero). We'll detail this in the next chapter.

Step 4: Decode

After the final block, a final adaptive layer norm is applied, followed by a linear projection that maps each d-dimensional token to a p x p x 2C tensor (predicting both noise and diagonal covariance). The decoded tokens are rearranged back into spatial layout to produce the final noise prediction ε_θ and covariance Σ_θ.

The design philosophy: DiT is intentionally as close to a standard ViT as possible. No multi-scale processing, no skip connections between blocks at different depths, no convolutional layers. This faithfulness to the vanilla transformer architecture is what gives DiT its scaling properties — you can simply increase N (depth) and d (width) following standard ViT configs (S, B, L, XL).

Engineering decisions: why these choices?

Why patchify instead of pixel-level tokens? A 32×32 latent has 1024 spatial positions. Attention scores are O(N²), so with 1024 tokens the attention matrices would be 16× larger than with 256 tokens (p=2), on top of 4× more work in every linear layer. Patchification is a compute-quality tradeoff. With p=2, you get 256 tokens, which is manageable for attention, while still preserving fine-grained spatial information within each patch via the linear projection.

Why predict both noise AND variance? Standard DDPM predicts only noise ε and uses a fixed variance schedule. But Nichol & Dhariwal (2021) showed that learning the variance improves log-likelihood and sample quality, especially with fewer sampling steps. DiT outputs 2C = 8 channels: 4 for noise prediction, 4 for an interpolation parameter v that blends between the fixed DDPM upper and lower variance bounds. The variance prediction adds almost no parameters (it only doubles the small output projection, roughly 18K extra weights) but meaningfully helps quality.
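The interpolation can be sketched as follows, using the log-space form from Nichol & Dhariwal: the upper bound is β_t and the lower bound is the posterior variance β̃_t. This sketch assumes v is already in [0, 1]; how a network output is mapped into that range is left out, and the function names are hypothetical.

```python
import math


def posterior_variance(beta_t, alpha_bar_t, alpha_bar_prev):
    """Lower bound: beta_tilde_t = beta_t * (1 - abar_{t-1}) / (1 - abar_t)."""
    return beta_t * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)


def learned_variance(v, beta_t, beta_tilde_t):
    """Blend the two bounds in log space: exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))


beta_t, alpha_bar_t = 0.02, 0.5                 # illustrative values, not from the paper
alpha_bar_prev = alpha_bar_t / (1.0 - beta_t)   # since abar_t = abar_{t-1} * (1 - beta_t)
beta_tilde_t = posterior_variance(beta_t, alpha_bar_t, alpha_bar_prev)
for v in (0.0, 0.5, 1.0):
    print(v, learned_variance(v, beta_t, beta_tilde_t))
```

Setting v = 1 recovers the upper bound and v = 0 the lower bound, so the network only has to choose a point between two known quantities rather than predict a variance from scratch.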

Why adaLN-Zero instead of cross-attention for conditioning? Cross-attention adds roughly 15% Gflops overhead (extra QKV projections + attention computation) for a length-2 conditioning sequence. adaLN-Zero adds negligible overhead (a small MLP that regresses six modulation vectors per block: γ₁, β₁, α₁, γ₂, β₂, α₂) yet achieves lower FID. The insight: timestep and class are global properties of the image (every patch should denoise the same amount), so the conditioning doesn't need per-token flexibility and one modulation shared by all tokens suffices.

Model configurations

Model	Layers N	Hidden dim d	Heads	Gflops (p=4)
DiT-S	12	384	6	1.4
DiT-B	12	768	12	5.6
DiT-L	24	1024	16	19.7
DiT-XL	28	1152	16	29.1

With patch size p=2, DiT-XL/2 reaches 118.6 Gflops. The naming convention is DiT-{size}/{patch_size}.
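The configuration table and naming convention translate directly into a small lookup. The parameter estimate below is a back-of-the-envelope assumption (about 18·d² weights per block: 4d² for attention, 8d² for the FFN, 6d² for adaLN modulation, biases and embeddings ignored), and the helper names are invented for this sketch.

```python
# (layers N, hidden dim d, heads) copied from the table above
CONFIGS = {
    "DiT-S": (12, 384, 6),
    "DiT-B": (12, 768, 12),
    "DiT-L": (24, 1024, 16),
    "DiT-XL": (28, 1152, 16),
}


def model_name(size, patch):
    """Naming convention: DiT-{size}/{patch_size}."""
    return f"DiT-{size}/{patch}"


def num_tokens(patch, latent_side=32):
    return (latent_side // patch) ** 2


def approx_block_params(depth, width):
    """Rough estimate: 18 * d^2 weights per block (assumption, see lead-in)."""
    return depth * 18 * width * width


for size in ("S", "B", "L", "XL"):
    depth, width, heads = CONFIGS[f"DiT-{size}"]
    print(
        model_name(size, 2),
        f"tokens={num_tokens(2)}",
        f"head_dim={width // heads}",
        f"~{approx_block_params(depth, width) / 1e6:.0f}M block params",
    )
```

Note that the patch size changes only the token count, not the estimate from `approx_block_params`: DiT-XL/2, DiT-XL/4 and DiT-XL/8 share the same transformer weights.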

What does decreasing the patch size p do to the number of tokens and compute?

Halving p quadruples the number of tokens T = (I/p)^2, which at least quadruples the Gflops, since every layer now processes four times as many tokens
Halving p doubles the number of tokens
Patch size doesn't affect compute

Chapter 4: Conditioning Mechanisms

A diffusion model needs to know two things beyond the noised input: the timestep t (how noisy is this?) and the class label c (what should this image be?). The question is: how do you inject this conditioning information into each transformer block?

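DiT explores four approaches. All of them first embed t and c as vectors (t through a sinusoidal embedding followed by a small MLP, c through a learned embedding table). The difference is how these embeddings enter each transformer block: as extra tokens, as a cross-attention context, or summed into a single conditioning vector that modulates layer norm.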

1. In-Context Conditioning

Simply append the conditioning embeddings as two extra tokens in the input sequence. The transformer processes them alongside image tokens with no architectural changes. After the final block, remove the extra tokens. Gflops overhead: negligible.

2. Cross-Attention

Add a cross-attention layer after self-attention in each block. The image tokens attend to the conditioning tokens (a length-2 sequence of t and c embeddings). This is similar to how the original transformer decoder attends to encoder outputs, and how LDM conditions on text. Gflops overhead: ~15%.

3. Adaptive Layer Norm (adaLN)

Replace the standard learnable scale (γ) and shift (β) parameters in layer norm with ones regressed from the conditioning vector. Instead of learning fixed normalization parameters, the model computes them as a function of t + c. This applies the same transformation to all tokens (unlike cross-attention, which can apply different weights per token). Gflops overhead: minimal.

4. adaLN-Zero (the winner)

Same as adaLN, but with a critical addition: regress additional scaling parameters α that are applied immediately before each residual connection. The α parameters are initialized to zero, which means each DiT block starts as the identity function — it passes the input straight through. The network gradually learns to "turn on" each block during training.

h = x + α₁ · MHSA(adaLN(γ₁, β₁, x))
out = h + α₂ · FFN(adaLN(γ₂, β₂, h))

Where γ, β, α are all regressed from the conditioning embedding by a small MLP. Each is a d-dimensional vector applied channel-wise, identically to every token.
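The two equations above can be exercised on a single token to see the identity-at-initialization property. This is a toy sketch: the attention and FFN sublayers are replaced by arbitrary stand-in functions, adaLN is taken to be γ · LayerNorm(x) + β, and the conditioning vectors are passed in directly instead of being regressed by an MLP.

```python
import math


def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ada_ln(gamma, beta, x):
    """Layer norm whose scale and shift come from the conditioning (assumed form)."""
    return [g * n + b for g, n, b in zip(gamma, layer_norm(x), beta)]


def adaln_zero_block(x, cond, attn, ffn):
    """h = x + a1 * attn(adaLN(x)); out = h + a2 * ffn(adaLN(h))."""
    gamma1, beta1, alpha1, gamma2, beta2, alpha2 = cond
    h = [xi + a * yi for xi, a, yi in zip(x, alpha1, attn(ada_ln(gamma1, beta1, x)))]
    return [hi + a * yi for hi, a, yi in zip(h, alpha2, ffn(ada_ln(gamma2, beta2, h)))]


x = [0.5, -1.0, 2.0, 0.0]
zeros, ones = [0.0] * 4, [1.0] * 4
stand_in_attn = lambda v: [2.0 * t for t in v]   # placeholder sublayers, not real attention/FFN
stand_in_ffn = lambda v: [t + 1.0 for t in v]

at_init = adaln_zero_block(x, (ones, zeros, zeros, ones, zeros, zeros), stand_in_attn, stand_in_ffn)
later = adaln_zero_block(x, (ones, zeros, ones, ones, zeros, ones), stand_in_attn, stand_in_ffn)
print(at_init == x, later == x)
```

With both α vectors at zero the block returns its input unchanged no matter what the sublayers compute; once α moves away from zero the residual branches start to contribute.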

Why zero initialization matters: This is inspired by a trick from ResNet training: Goyal et al. found that zero-initializing the final batch norm scale factor γ in each residual block accelerates large-scale training. U-Net diffusion models use a similar trick (zero-initializing the final convolution in each block). For DiT, zero-initializing α means every residual branch contributes nothing at initialization, so each DiT block starts as the identity function. Gradients flow cleanly through the residual connections from the start, giving the network a stable foundation to learn from.

The verdict

At 400K training steps, adaLN-Zero achieves roughly half the FID of in-context conditioning. Cross-attention is better than in-context but worse than adaLN-Zero, despite costing 15% more Gflops. Vanilla adaLN (without zero init) is also worse than adaLN-Zero, confirming that the zero initialization matters. All subsequent DiT models use adaLN-Zero.

Why does adaLN-Zero outperform vanilla adaLN?

The zero-initialized scaling parameters α make each block start as the identity function, giving stable gradients and a clean learning signal from the beginning of training
adaLN-Zero uses more parameters
adaLN-Zero applies different conditioning to each token

Chapter 5: Scaling Laws

The most important result in the paper isn't the final FID number — it's the scaling behavior. Peebles and Xie train 12 DiT models spanning 4 model sizes (S, B, L, XL) and 3 patch sizes (8, 4, 2), then plot FID against Gflops.

Finding 1: More Gflops = Lower FID

Across all 12 models, there is a -0.93 correlation between model Gflops and FID-50K at 400K training steps. This is remarkably clean: to a good approximation, you can predict a DiT model's quality from its Gflops alone.

Finding 2: Gflops matter more than parameters

Here's a subtle but crucial point. When you decrease the patch size (say from p=4 to p=2), you quadruple the number of tokens and thus the Gflops. But the parameter count barely changes — the transformer weights are the same, you're just processing more tokens. Yet FID improves substantially. This means compute (Gflops), not parameter count, is the true driver of quality.

Models with similar Gflops achieve similar FID regardless of how they get there (bigger model vs. smaller patches). For example, DiT-S/2 and DiT-B/4 have similar Gflops and similar FID.

Finding 3: Larger models are more compute-efficient

When you plot FID against total training compute (Gflops x batch size x steps x 3), larger models reach any given FID threshold with less total compute. A small model trained for a long time is eventually overtaken by a large model trained for fewer steps. This mirrors the compute-optimal scaling behavior seen in language models (Chinchilla).
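The training-compute estimate in the parenthesis above is a one-line product. The sketch below applies it to DiT-XL/2 at 400K steps; the batch size of 256 is taken from the training setup described later in this article, and the function name is hypothetical.

```python
def training_compute_gflops(model_gflops, batch_size, steps, factor=3):
    """Estimate from the text: Gflops x batch size x steps x 3."""
    return model_gflops * batch_size * steps * factor


xl2_400k = training_compute_gflops(118.6, batch_size=256, steps=400_000)
print(f"DiT-XL/2, 400K steps: {xl2_400k:.2e} Gflops")
```

Plotting FID against this quantity instead of against training steps is what puts models of different sizes on a common compute axis.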

Training efficiency across model sizes

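A key subtlety: larger DiT models reach low FID targets that small models only reach late, or never. Consider what it takes to reach FID 50, using the training-compute estimate above (Gflops × batch size 256 × steps × 3):

DiT-S/2 (6.06 Gflops): Never reaches FID 50 in 400K steps; it is still at about FID 68 when training stops.
DiT-B/2 (23.0 Gflops): Reaches FID 50 at roughly 250K steps. Total compute: 23.0 × 256 × 250K × 3 ≈ 4.4 × 10⁹ Gflops.
DiT-XL/2 (118.6 Gflops): Reaches FID 50 at roughly 80K steps. Total compute: 118.6 × 256 × 80K × 3 ≈ 7.3 × 10⁹ Gflops.

DiT-XL/2 uses about 5x more compute per step but reaches the target in about a third of the steps, so at this loose target the two totals land within a factor of two of each other. The gap opens up at stricter targets: unlike DiT-S, the XL model keeps improving well beyond FID 50 and has not saturated, while smaller models need ever more steps or never get there. This echoes compute-optimal scaling analyses for language models such as Chinchilla: for a fixed compute budget, model size and training length have to be balanced, and an undersized model trained for a very long time is a poor use of compute.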

This is the paper's legacy: Before DiT, it wasn't clear that diffusion models could exhibit clean scaling laws. U-Nets don't have an obvious "make it bigger" axis, and their performance doesn't correlate as cleanly with compute. DiT shows that once you adopt the transformer architecture, diffusion models inherit the same predictable scaling that has driven progress in language modeling. This is what made DiT so influential — not just the FID number, but the promise of a clear path to ever-better generative models.

What is the correlation between DiT model Gflops and FID-50K?

-0.93: a strong negative correlation, meaning more Gflops reliably yields lower (better) FID
+0.93: more Gflops means higher FID
-0.5: a weak correlation

Chapter 6: Results

After the scaling analysis, Peebles and Xie train their best model — DiT-XL/2 — for 7 million steps (up from the 400K used in ablations). The results speak for themselves.

256x256 ImageNet (class-conditional)

With guidance (DiT and LDM use classifier-free guidance at cfg scale = 1.50; the ADM-G rows use classifier guidance):

Model	FID ↓	sFID ↓	IS ↑	Precision	Recall
ADM-G	4.59	5.25	186.7	0.82	0.52
ADM-G + ADM-U	3.94	6.14	215.8	0.83	0.53
LDM-4-G (cfg=1.50)	3.60	—	247.7	0.87	0.48
DiT-XL/2-G (cfg=1.50)	2.27	4.60	278.2	0.83	0.57

DiT-XL/2 achieves 2.27 FID, beating the previous best diffusion result of 3.60 from LDM-4 by a large margin. It also achieves the highest Inception Score in the table (278.2) and the highest Recall (0.57) at a Precision of 0.83; LDM-4 has higher Precision (0.87) but lower Recall (0.48).

512x512 ImageNet

DiT-XL/2 also sets a new state-of-the-art at 512x512 resolution with an FID of 3.04, outperforming ADM-G + ADM-U (3.85 FID) while being substantially more compute-efficient.

Compute efficiency

DiT-XL/2 uses 118.6 Gflops per forward pass. Compare this to:

ADM: 1120 Gflops (pixel space) — 9.4x more expensive
ADM-U: 742 Gflops (pixel space with upsampler)
LDM-4: 103.6 Gflops (latent space, similar efficiency)

DiT achieves better FID than all of these despite comparable or lower compute cost.

What degrades and when

The scaling analysis reveals clear degradation patterns:

Smaller model: DiT-S/2 (33M params, 6.06 Gflops) → FID 68. DiT-XL/2 (675M params, 118.6 Gflops) → FID 19 at 400K steps. About 20x more compute buys 3.6x better FID.
Larger patch size (all at 400K steps): DiT-XL/2 (p=2, 256 tokens) → FID 19. DiT-XL/4 (p=4, 64 tokens) → FID 43. DiT-XL/8 (p=8, 16 tokens) → FID 106. Coarser patches lose spatial detail catastrophically.
Fewer training steps: DiT-XL/2 at 400K steps → FID 19. Same model at 7M steps → FID 9.62 (without CFG). 17.5x more training yields another 2x improvement.
Fewer sampling steps: 250 DDPM steps are used for all final results. Reducing to 50 steps degrades FID noticeably. This is a limitation of DDPM-style sampling; later models (SD3, Flux) adopt rectified flow, which is designed to need fewer sampling steps.
No CFG: Without classifier-free guidance, FID goes from 2.27 to 9.62 — a 4.2x degradation. CFG is essential, not optional.

Concrete training numbers

Training budget: DiT-XL/2 was trained for 7M steps at batch size 256 on ImageNet (1.28M images, 1000 classes). That's ~1.79 billion images seen. At 118.6 Gflops per forward pass × 3 (the backward pass is counted as roughly twice the cost of the forward pass), total training compute is approximately 6.4 × 10²⁰ FLOPs (about 0.64 zettaFLOPs). For reference, training GPT-3 took ~3.1 × 10²³ FLOPs, so DiT-XL/2 is roughly 500x cheaper. Inference: 250 sampling steps × 118.6 Gflops × 2 (CFG) = ~59,300 Gflops per image, or ~30 seconds on a single A100 GPU at an assumed effective throughput of ~2,000 Gflops/s.

What FID does DiT-XL/2 achieve on class-conditional ImageNet 256x256 with classifier-free guidance?

2.27: a new state-of-the-art, beating the previous best of 3.60 from LDM
4.59: matching ADM
10.56: competitive but not SOTA

Chapter 7: CFG and Sampling

DiT's strong results depend on classifier-free guidance (CFG), a technique that dramatically improves sample quality at the cost of some diversity. Let's understand how it works.

The intuition

During sampling, you want images x where the class probability p(c|x) is high — if you asked for "golden retriever," the image should clearly look like a golden retriever, not an ambiguous blob. By Bayes' rule:

∇_x log p(c|x) ∝ ∇_x log p(x|c) − ∇_x log p(x)

The class-conditional score minus the unconditional score points toward images that strongly belong to class c.

The CFG formula

Classifier-free guidance modifies the noise prediction during sampling:

ε̂_θ(z_t, c) = ε_θ(z_t, ∅) + s · (ε_θ(z_t, c) − ε_θ(z_t, ∅))

Where s > 1 is the guidance scale (s = 1 recovers standard sampling). Each sampling step requires two forward passes: one conditioned on c, one unconditional (with a learned null embedding ∅). The unconditional pass is enabled by randomly dropping the class label during training (replacing it with ∅).
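The guidance formula is a per-element extrapolation, sketched here on flat lists of floats. The two inputs stand for the outputs of the conditional and unconditional forward passes; the function name is hypothetical.

```python
def cfg_noise(eps_cond, eps_uncond, scale):
    """Guided prediction: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [1.0, -2.0, 0.5]    # toy conditional prediction
eps_uncond = [0.0, -1.0, 0.5]  # toy unconditional prediction
for s in (0.0, 1.0, 1.5):
    print(s, cfg_noise(eps_cond, eps_uncond, s))
```

At s = 1 the result is just the conditional prediction; values above 1 push further along the direction in which the class label changed the prediction, and elements where the two passes agree are left untouched.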

The precision-recall tradeoff

Higher guidance scale s pushes the model toward higher-fidelity but lower-diversity samples. For DiT-XL/2:

Without CFG (s=1): FID = 9.62, Recall = 0.67
With CFG (s=1.50): FID = 2.27, Recall = 0.57

CFG reduces FID by over 4x but slightly reduces diversity (recall drops from 0.67 to 0.57). This tradeoff is well-known and consistent across all diffusion model architectures.

Training for CFG: During training, the class label c is randomly replaced with a learned null embedding ∅ 10% of the time. This teaches the model to produce both conditional and unconditional noise predictions, enabling CFG at inference time with no architectural changes. DiT uses 250 DDPM sampling steps with a standard linear noise schedule.
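Label dropout during training can be sketched in a few lines. The id used for the null class (1000, one past the last ImageNet class) and the function name are assumptions for illustration; the 10% rate is the one stated above.

```python
import random

NULL_CLASS = 1000  # hypothetical id whose embedding plays the role of the null embedding


def drop_labels(labels, drop_prob=0.1, rng=None):
    """Replace each label with NULL_CLASS independently with probability drop_prob."""
    rng = rng or random.Random()
    return [NULL_CLASS if rng.random() < drop_prob else label for label in labels]


batch = [207, 207, 3, 981, 42, 42, 7, 500]
print(drop_labels(batch, rng=random.Random(0)))
```

Because the null class is just one more row in the label embedding table, the same network serves both the conditional and the unconditional pass at sampling time.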

Inference cost breakdown

With CFG, every sampling step requires two forward passes (one conditioned, one unconditional). The full cost:

Per-step cost: 2 × 118.6 = 237.2 Gflops
Total (250 steps): 250 × 237.2 = 59,300 Gflops per image
Wall-clock: ~25-30 seconds on a single A100 (at ~2000 Gflops/s effective throughput)
VAE decode: ~0.1 seconds (negligible)

Compare to ADM (pixel-space U-Net) under the same sampling budget: 250 steps × 2 × 1120 = 560,000 Gflops, about 9.4x more expensive for worse results. This is the fundamental efficiency win of latent diffusion: the 9.4x per-pass saving from working in latent space applies at every one of the 250 steps.
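The cost breakdown above reduces to a short calculation. The throughput figure is the same rough assumption used in the list (about 2,000 Gflops/s effective), not a measured number, and the function name is invented for this sketch.

```python
def sampling_cost_gflops(model_gflops, steps=250, passes_per_step=2):
    """Total forward-pass cost of sampling one image."""
    return model_gflops * steps * passes_per_step


dit_cost = sampling_cost_gflops(118.6)   # DiT-XL/2 with CFG
adm_cost = sampling_cost_gflops(1120.0)  # ADM under the same step and pass budget
assumed_throughput = 2000.0              # Gflops per second, an assumption
print(f"DiT-XL/2: {dit_cost:,.0f} Gflops, ~{dit_cost / assumed_throughput:.0f} s")
print(f"ADM     : {adm_cost:,.0f} Gflops, {adm_cost / dit_cost:.1f}x DiT-XL/2")
```

The ratio between the two models is independent of the step count and of the guidance passes, since both factors cancel.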

Why does classifier-free guidance require two forward passes per sampling step?

One pass predicts noise conditioned on class c, the other predicts unconditional noise with the null embedding; the difference is amplified by guidance scale s to steer toward the target class
One pass generates the image, the other classifies it
One pass is for the encoder, the other for the decoder

Chapter 8: Why Transformers Win

DiT's success isn't just about one paper's results. It's about why transformers are fundamentally better suited as a backbone for scaling generative models. Let's unpack the structural advantages.

1. Clean scaling axes

Transformers have two orthogonal axes for scaling: depth (number of layers N) and width (hidden dimension d). Doubling either increases Gflops predictably. U-Nets have channels, resolution levels, attention layers at specific resolutions — but these interact in complex, non-linear ways. There's no clean "make it 2x bigger" knob.

2. Hardware efficiency

Modern GPUs and TPUs are optimized for the dense matrix multiplications that dominate transformer computation (attention QKV projections, FFN layers). U-Nets mix convolutions at various spatial resolutions with attention at lower resolutions — this heterogeneous compute profile is harder to optimize and often leaves hardware underutilized.

3. Architecture unification

If your image generator, text model, and video model all use transformers, you can share training recipes, optimization tricks, and infrastructure. The community's collective knowledge about transformer training (learning rate schedules, initialization, regularization, mixed precision) transfers directly. With U-Nets, every insight had to be re-discovered within the diffusion community.

4. Flexibility for conditioning

Transformers naturally handle variable-length sequences and cross-attention over conditioning tokens. This makes it straightforward to condition on text embeddings, multiple images, or any other modality. U-Nets require bespoke injection points for each conditioning type.

The big picture: DiT showed that the U-Net's inductive biases (multi-scale processing, skip connections) are not necessary for high-quality diffusion. The transformer's ability to learn arbitrary interactions between patches via self-attention is sufficient. And unlike the U-Net, the transformer comes with a proven playbook for scaling that has been validated across language, vision, and every other domain.

What key inductive bias of U-Nets did DiT show is NOT necessary for high-quality diffusion?

Multi-scale processing with downsampling/upsampling and skip connections across resolutions: a flat sequence of patches processed at a single resolution suffices
Using convolutional layers at all
Batch normalization

Chapter 9: Connections

DiT sits at a pivotal point in the evolution of generative models. Here's how it connects to the broader landscape.

Predecessors

DDPM (Ho et al., 2020) — Established the U-Net as the default diffusion backbone. DiT replaces it.
Latent Diffusion / Stable Diffusion (Rombach et al., 2022) — Introduced the VAE + diffusion-in-latent-space framework that DiT builds on. DiT keeps the VAE, swaps the U-Net for a transformer.
Vision Transformer (ViT) (Dosovitskiy et al., 2020) — DiT directly inherits the patchify + positional embedding + transformer block design from ViT. The model configs (S, B, L, XL) also follow ViT conventions.
ADM (Dhariwal & Nichol, 2021) — The strongest U-Net diffusion model before DiT. Established adaptive normalization, classifier guidance, and many training recipes that DiT retains.

Contemporaries and successors

U-ViT (Bao et al., 2023) — A concurrent work that also replaces U-Net with a transformer but keeps skip connections between early and late blocks (a "U-shaped" transformer). DiT's fully flat architecture proved more influential.
Sora (OpenAI, 2024) — OpenAI's video generation model is built on a "diffusion transformer" architecture, directly citing DiT. Sora extends DiT to video by treating space-time patches as tokens, validating DiT's thesis that transformers scale for visual generation.
Stable Diffusion 3 (Esser et al., 2024) — Stability AI's SD3 uses a modified DiT architecture called "MM-DiT" with separate transformer streams for text and image tokens that interact via attention. This is DiT's principles applied to text-to-image generation.
Flux (Black Forest Labs, 2024) — Built by former Stability AI researchers, Flux uses a DiT-based backbone with flow matching (instead of DDPM). It further validates that transformer-based diffusion architectures scale to production-quality text-to-image models.
SiT (Ma et al., 2024) — Scalable Interpolant Transformers explore different interpolation formulations while keeping DiT's transformer backbone, showing the architecture is compatible with multiple diffusion formulations.

DiT's lasting impact: Nearly every major image and video generation system released after DiT has adopted a transformer-based backbone. The paper's contribution wasn't just a better FID number — it was a paradigm shift. It showed the diffusion community that the U-Net era was over, and that the path to better generative models runs through the same scaling playbook that transformed NLP. Today, "diffusion transformer" is the default architecture for visual generation, exactly as DiT predicted.

Which of these systems directly builds on DiT's transformer-based diffusion architecture?

Sora, Stable Diffusion 3, and Flux: all adopt transformer backbones for diffusion, validating DiT's thesis that transformers scale for visual generation
Only Stable Diffusion 3
DALL-E 2 and Imagen

Scalable Diffusion Models with Transformers