Part 1 gave us autoregressive models and VAEs. Now the heavyweights: adversarial training, denoising diffusion, and the modern latent diffusion stack that powers DALL-E, Stable Diffusion, and Sora.
Every generative model answers one question: given some training images, how do we produce new images that look like they came from the same distribution? The approaches differ in how explicitly they model the data distribution p(x).
We can split generative models along a single axis: does the model give you an explicit density function, or does it just give you samples?
| Category | Density | Examples |
|---|---|---|
| Explicit (tractable) | You can compute pθ(x) exactly | Autoregressive (PixelCNN, GPT), Normalizing Flows |
| Explicit (approximate) | You optimize a bound on pθ(x) | VAEs (ELBO) |
| Implicit (direct) | No density; just a sampler | GANs |
| Implicit (indirect) | Learns a score/velocity field | Diffusion, Flow Matching |
Autoregressive models factor the joint distribution into a product of conditionals. PixelCNN generates images one pixel at a time, left-to-right, top-to-bottom. GPT generates text one token at a time. The key equation is the chain rule of probability:
$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1})$$
Strengths: exact likelihood, stable training, no mode collapse. Weakness: sequential generation is painfully slow for images (one pixel at a time).
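Variational Autoencoders (VAEs) introduce a latent variable z and optimize a lower bound (the ELBO) on log p(x):
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$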
Strengths: fast sampling (just decode a random z), smooth latent space. Weakness: blurry outputs because the decoder must hedge across all plausible images consistent with z.
Part 1 gave us density-based models (autoregressive, VAE). Today we cover the rest: GANs (implicit, adversarial), Diffusion (score-based, iterative denoising), Latent Diffusion (the modern hybrid), and the text-to-image/video pipelines that combine everything.
Here's a completely different approach. Forget about modeling p(x) explicitly. Instead, train a neural network to generate samples that are so realistic a second neural network can't tell them from real data. This is the GAN idea (Goodfellow et al., 2014).
Start with a simple distribution we can easily sample from — a standard Gaussian z ~ N(0, I). The generator G is a neural network that maps z → x: it takes random noise and produces an image. The discriminator D is a second neural network that takes an image and outputs a probability: "is this image real or fake?"
Generator G: a neural network that maps latent noise z ~ p(z) to a synthetic image. Its goal is to produce outputs that the discriminator classifies as real. G defines an implicit distribution pG(x), the distribution of images it generates.
Discriminator D: a neural network that outputs D(x) ∈ [0, 1], the probability that x is a real image. D is trained to output 1 for real data and 0 for generated (fake) data.
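Training is a two-player game. D wants to correctly classify real vs. fake. G wants to fool D. Formally:
$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log(1 - D(G(z)))\big]$$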
Read the two terms. Term 1: for real images, D wants D(x) ≈ 1, so log D(x) ≈ 0, the largest value the log of a probability can take. Term 2: for fake images G(z), D wants D(G(z)) ≈ 0, so log(1 − D(G(z))) ≈ 0, again its maximum. Meanwhile, G wants D(G(z)) ≈ 1 (fool the discriminator), which drives log(1 − D(G(z))) → −∞. D maximizes V; G minimizes V.
Early in training, G produces garbage. D easily classifies everything correctly: D(G(z)) ≈ 0. Write D = σ(a), where a is the discriminator's pre-sigmoid logit (the standard setup). The gradient of log(1 − σ(a)) with respect to a is −σ(a) = −D(G(z)), which is essentially zero when D(G(z)) ≈ 0: the sigmoid has saturated. Every gradient that reaches G passes through a, so G receives almost no learning signal.
When D is confident that G's outputs are fake, ∂/∂θG log(1 − D(G(z))) ≈ 0. The generator is stuck: it knows it's bad, but gets no gradient to improve. This is called the saturating loss.
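The fix: instead of minimizing log(1 − D(G(z))), train G to maximize log D(G(z)). Same fixed points, but the gradient with respect to D's logit is now 1 − D(G(z)), which is close to 1 exactly when D(G(z)) ≈ 0. This is the non-saturating GAN loss used in practice:
$$\max_G \; \mathbb{E}_{z \sim p(z)}\big[\log D(G(z))\big]$$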
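To see the difference numerically, here is a minimal sketch that compares the two generator gradients with respect to the discriminator's logit. It assumes D = sigmoid(a); the function names are illustrative, not taken from any library.

```python
import math

def sigmoid(a):
    """Map a discriminator logit a to D = sigmoid(a), a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): the original minimax generator loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a): the non-saturating objective."""
    return 1.0 - sigmoid(a)

# Very negative logits mean D is confident the sample is fake.
for a in (-8.0, -4.0, 0.0, 4.0):
    print(f"D(G(z)) = {sigmoid(a):.4f}   "
          f"saturating grad = {grad_saturating(a):+.4f}   "
          f"non-saturating grad = {grad_non_saturating(a):+.4f}")
```

When D is confident (a very negative), the saturating gradient is nearly zero while the non-saturating one is close to 1, which is the regime where the generator most needs a signal.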
For a fixed G, what discriminator maximizes V(G, D)? We can solve this analytically.
For each x, V contains the integrand: pdata(x) log D(x) + pG(x) log(1 − D(x)). Take the derivative with respect to D(x) and set to zero:
pdata(x) / D(x) − pG(x) / (1 − D(x)) = 0
Solving: D*(x) = pdata(x) / (pdata(x) + pG(x))
When pG = pdata, D*(x) = 1/2 everywhere. The discriminator is maximally confused — it can't tell real from fake. This is the equilibrium.
Plug D* back into V(G, D*) and simplify:
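$$V(G, D^*) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)}\right]$$
Let M = (pdata + pG)/2. Then:
$$V(G, D^*) = \mathbb{E}_{p_{\text{data}}}\left[\log \frac{p_{\text{data}}}{2M}\right] + \mathbb{E}_{p_G}\left[\log \frac{p_G}{2M}\right]$$
$$= D_{\mathrm{KL}}(p_{\text{data}} \,\|\, M) + D_{\mathrm{KL}}(p_G \,\|\, M) - 2\log 2$$
$$= 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_G) - 2\log 2$$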
Since JSD ≥ 0 with equality iff pdata = pG, the global minimum of V(G, D*) = −2 log 2, achieved when the generator perfectly matches the data distribution.
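The derivation can be checked numerically on small discrete distributions. The sketch below is an illustration only: the two four-outcome distributions are made up, and all probabilities are kept strictly positive so every logarithm is defined.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_at_optimum(p_data, p_g):
    """V(G, D*) = E_data[log D*] + E_G[log(1 - D*)], as a finite sum."""
    d_star = optimal_discriminator(p_data, p_g)
    return sum(pd * math.log(d) + pg * math.log(1.0 - d)
               for pd, pg, d in zip(p_data, p_g, d_star))

def kl(p, q):
    """KL divergence between two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture M = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.1, 0.2, 0.3, 0.4]   # toy "data" distribution (made up)
p_g = [0.4, 0.3, 0.2, 0.1]      # toy generator distribution (made up)

print("D*            :", optimal_discriminator(p_data, p_g))
print("V(G, D*)      :", value_at_optimum(p_data, p_g))
print("2 JSD - 2 ln 2:", 2.0 * jsd(p_data, p_g) - 2.0 * math.log(2.0))
print("V when p_G = p_data:", value_at_optimum(p_data, p_data))
```

The second and third printed values agree, and the last one equals −2 log 2, the global minimum.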
Because D and G are adversaries, neither loss decreases monotonically. D's loss goes up when G improves; G's loss goes up when D improves. There is no single number you can watch to know if training is succeeding. You have to look at generated samples. This makes GAN training notoriously tricky.
DCGAN (Radford et al., 2015) established the convolutional architecture recipe: fractional-strided convolutions in G (upsampling), strided convolutions in D (downsampling), batch normalization in both networks (except G's output layer and D's input layer), ReLU in G with a tanh output, LeakyReLU in D. No fully connected hidden layers.
StyleGAN (Karras et al., 2019) set the high-water mark for GAN image quality. Three ingredients: (1) a mapping network that transforms z into an intermediate latent w, giving a more disentangled space. (2) Adaptive Instance Normalization (AdaIN) injecting w at each resolution layer, controlling "style" at different scales. (3) Progressive growing, inherited from the earlier Progressive GAN (Karras et al., 2018): train at 4×4, then 8×8, gradually increasing resolution.
Latent space interpolation: Walk smoothly between two z vectors and the generated images morph smoothly — a man gains a smile, sunglasses appear gradually. This smooth latent space is a hallmark of well-trained GANs.
| Pros | Cons |
|---|---|
| Beautiful, sharp samples | No density estimation (can't compute p(x)) |
| Fast single-pass generation | Training instability (mode collapse, oscillation) |
| Smooth, interpolable latent space | No loss curve to monitor convergence |
| Conceptually elegant (game theory) | Hyperparameter-sensitive, hard to scale |
GANs dominated image generation from 2014–2021. StyleGAN produced some of the most photorealistic faces seen up to that point. But GANs never fully solved their core problems: mode collapse, training instability, and difficulty scaling to diverse, multi-modal datasets. Diffusion models largely avoid all three, and by 2022 they had superseded GANs for most image generation.
Forget everything about adversarial training. Diffusion models take a completely different approach, and it starts with a beautifully simple idea.
Pick any noise distribution pnoise — a standard Gaussian works perfectly. Now imagine two processes:
Forward process (corruption): Take a real image x and gradually add noise to it over "time" t ∈ [0, 1]. At t = 0, it's the clean image. At t = 1, it's pure noise. At any t in between, it's a noisy version of the image — call it xt.
Reverse process (denoising): Start from pure noise x1 ~ pnoise and gradually remove the noise, stepping backward from t = 1 to t = 0. If you can do this perfectly, you've turned random noise into a realistic image.
Corrupting data is trivial — just add noise. The hard part is removing noise. But here's the trick: if you have pairs of (noisy image, clean image), you can train a neural network to predict how to denoise. That's it. Train a denoiser, then run it iteratively to generate new images from random noise.
There's a deep theoretical justification. The score function of a distribution p(x) is defined as:
$$s(x) = \nabla_x \log p(x)$$
This is a vector field: at every point x, the score points in the direction of increasing probability density. If you're at a low-density region, the score tells you which direction to walk to reach higher-density areas — toward the data.
The remarkable connection: the optimal denoiser is intimately related to the score function. When you train a network to remove noise from corrupted data, you're implicitly learning the score ∇x log p(xt) at each noise level t. This is called denoising score matching (Hyvärinen 2005, Song & Ermon 2019).
Consider data corrupted by Gaussian noise: xt = x + σt ε where ε ~ N(0, I). Tweedie's formula gives the optimal denoiser: E[x | xt] = xt + σt² ∇xt log p(xt). Rearranging, E[ε | xt] = −σt ∇xt log p(xt), so predicting the noise ε is equivalent to estimating the score ∇ log p(xt), up to scaling.
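A one-dimensional Gaussian makes the denoiser/score relationship easy to verify, because both sides have closed forms. The sketch below assumes toy data x ~ N(mu, s2), so the noisy marginal is N(mu, s2 + sigma2); the parameter values are arbitrary.

```python
import math

def score_noisy_marginal(x_t, mu, s2, sigma2):
    """Score of p(x_t) = N(mu, s2 + sigma2): d/dx_t log p(x_t)."""
    return -(x_t - mu) / (s2 + sigma2)

def tweedie_denoise(x_t, mu, s2, sigma2):
    """E[x | x_t] computed from the score: x_t + sigma^2 * score."""
    return x_t + sigma2 * score_noisy_marginal(x_t, mu, s2, sigma2)

def posterior_mean_closed_form(x_t, mu, s2, sigma2):
    """E[x | x_t] from the standard Gaussian conjugacy formula."""
    return (s2 * x_t + sigma2 * mu) / (s2 + sigma2)

def expected_noise(x_t, mu, s2, sigma2):
    """E[eps | x_t] = -sigma * score: noise prediction is a rescaled score."""
    return -math.sqrt(sigma2) * score_noisy_marginal(x_t, mu, s2, sigma2)

mu, s2, sigma2 = 2.0, 0.25, 1.0   # arbitrary toy values
for x_t in (-1.0, 0.0, 2.0, 5.0):
    print(x_t,
          tweedie_denoise(x_t, mu, s2, sigma2),
          posterior_mean_closed_form(x_t, mu, s2, sigma2))
```

Both columns match for every x_t, and the denoised value is always pulled from x_t toward the data mean.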
Training: Sample a clean image x from the dataset. Sample a noise level t ~ Uniform(0, 1). Create the noisy version xt. Train the network fθ(xt, t) to predict how to clean it up (predict the noise, predict the clean image, or predict a velocity — more on this in Chapter 4).
Inference: Sample x1 ~ N(0, I). Run fθ iteratively: x0.99 = denoise(x1, t=1), x0.98 = denoise(x0.99, t=0.99), ... until you reach x0. Each step removes a little noise. After enough steps, you have a clean, realistic image.
Unlike GANs (one forward pass to generate), diffusion models require many forward passes (50–1000 steps). Each step is a full neural network evaluation. This makes generation slower. The entire field of diffusion acceleration (distillation, consistency models, few-step methods) exists to solve this.
The original diffusion formulation (DDPM, Ho et al. 2020) uses a discrete Markov chain with a hand-designed noise schedule, later recast as a variance-preserving SDE (Song et al. 2021). Modern practice often uses a much cleaner formulation called Rectified Flow (Liu et al. 2022) or Flow Matching (Lipman et al. 2022). The core training loop is just a few lines of code.
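The idea: connect each data point x to a noise point z with a straight line. At time t ∈ [0, 1], the interpolated point is:
$$x_t = (1 - t)\,x + t\,z, \qquad z \sim \mathcal{N}(0, I)$$
The velocity along this straight line is constant:
$$v = \frac{dx_t}{dt} = z - x$$
Train a neural network fθ(xt, t) to predict this velocity:
$$\mathcal{L}(\theta) = \mathbb{E}_{x,\, z,\, t}\,\big\| f_\theta(x_t, t) - (z - x) \big\|^2$$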
This is the entire training algorithm. No complex noise schedules. No forward/reverse SDE. No KL divergence. Just: mix data with noise, predict the velocity, regress with MSE. The power of diffusion models comes from scale (big networks, big datasets), not algorithmic complexity.
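The loop is short enough to write out in full. The following is a toy sketch, not a real image model: the "data" is a 1D Gaussian N(2, 0.5²), and the network is replaced by a hypothetical four-weight model on hand-picked features so the example runs with the standard library alone.

```python
import random

def features(x_t, t):
    """Hand-picked features standing in for a neural network (an assumption)."""
    return [1.0, x_t, t, x_t * t]

def predict(w, x_t, t):
    """Toy velocity model f_theta(x_t, t): a linear function of the features."""
    return sum(wi * fi for wi, fi in zip(w, features(x_t, t)))

def sample_pair(rng):
    """Mix data with noise and return (x_t, t, velocity target z - x)."""
    x = rng.gauss(2.0, 0.5)          # clean sample from the toy data distribution
    z = rng.gauss(0.0, 1.0)          # noise sample
    t = rng.random()                 # t ~ Uniform(0, 1)
    x_t = (1.0 - t) * x + t * z      # straight-line interpolation
    return x_t, t, z - x

def mean_loss(w, batch):
    """Mean squared error between predicted and true velocity."""
    return sum((predict(w, x_t, t) - v) ** 2 for x_t, t, v in batch) / len(batch)

def train(steps=5000, lr=0.01, seed=0):
    """Plain single-sample SGD on the MSE velocity loss."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        x_t, t, v = sample_pair(rng)
        err = predict(w, x_t, t) - v
        for i, fi in enumerate(features(x_t, t)):
            w[i] -= lr * 2.0 * err * fi   # gradient of err^2 with respect to w[i]
    return w

eval_rng = random.Random(123)
eval_batch = [sample_pair(eval_rng) for _ in range(2000)]
loss_before = mean_loss([0.0] * 4, eval_batch)
w = train()
loss_after = mean_loss(w, eval_batch)
print(f"loss before: {loss_before:.3f}   loss after: {loss_after:.3f}")
```

The loss does not go to zero even for a perfect model: many different (x, z) pairs pass through the same xt, so the best the network can do is predict their average velocity.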
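At inference, we want to go from noise (t = 1) to data (t = 0). We follow the learned velocity field backward in time, one small step Δt at a time:
$$x_{t - \Delta t} = x_t - \Delta t \cdot f_\theta(x_t, t)$$
This is just Euler integration of the ODE dx/dt = fθ(x, t), run backward from t = 1 to t = 0 (hence the minus sign in the update). More steps = more accurate integration = better samples. Fewer steps = faster, but with larger discretization error.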
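Here is a sketch of the sampler on a toy problem where the ideal velocity field is known in closed form, so no training is needed. It assumes 1D Gaussian data N(2, 0.5²) and the interpolation xt = (1 − t)x + tz; `optimal_velocity` is the exact conditional expectation E[z − x | xt] for that case and stands in for a trained fθ.

```python
import random
import statistics

M, S2 = 2.0, 0.25   # mean and variance of the toy data distribution (assumption)

def optimal_velocity(x, t):
    """Closed-form E[z - x | x_t = x] for Gaussian data and standard normal noise."""
    var_t = (1.0 - t) ** 2 * S2 + t ** 2     # Var(x_t)
    cov = t - (1.0 - t) * S2                 # Cov(z - x, x_t)
    return -M + cov / var_t * (x - (1.0 - t) * M)

def euler_sample(x1, num_steps=50):
    """Integrate dx/dt = v(x, t) backward from t = 1 (noise) to t = 0 (data)."""
    dt = 1.0 / num_steps
    x, t = x1, 1.0
    for _ in range(num_steps):
        x = x - dt * optimal_velocity(x, t)  # one Euler step toward the data
        t -= dt
    return x

rng = random.Random(0)
samples = [euler_sample(rng.gauss(0.0, 1.0)) for _ in range(2000)]
print("sample mean:", statistics.mean(samples))
print("sample std :", statistics.stdev(samples))
```

The printed statistics land close to the data's mean of 2 and standard deviation of 0.5; the small remaining gap in the standard deviation is Euler discretization error and shrinks as `num_steps` grows.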
Earlier diffusion models (DDPM, VP-SDE) use curved paths between data and noise. Curved paths require more integration steps to follow accurately: small errors in the Euler step accumulate along curves. Rectified Flow uses straight lines, which the Euler method can follow with fewer steps. Lipman et al. (2022) arrived at the same straight-line conditional paths from an optimal-transport view, and Liu et al. (2022) showed that retraining on the model's own noise-data pairs ("reflow") straightens the learned trajectories further. In practice this makes 10–20 step generation feasible with modest quality loss.
Even 50 steps is slow for real-time applications. Distillation trains a student model to jump directly from noise to data in fewer steps (4, 2, or even 1 step). The teacher model runs the full sampling trajectory; the student learns to shortcut it. Consistency models (Song et al. 2023) learn to map any point on the trajectory directly to x0, enabling one-step generation.
So far, our diffusion model generates random images from the training distribution. But we want conditional generation: "generate a photo of a golden retriever" or "generate a sunset over mountains." How do we steer the diffusion process toward a specific condition y?
The simplest approach: make the network condition-aware. Instead of fθ(xt, t), train fθ(xt, y, t) where y is a class label, text embedding, or any conditioning signal. The training loop is identical; you just feed y as additional input.
This works, but the samples are "okay" — diverse but not strongly aligned with the condition. We want sharper conditioning: images that clearly match the prompt, even at the cost of some diversity.
During training, randomly replace y with a null token y∅ (empty conditioning) some fraction of the time (e.g., 10%). This means the same network learns both:
• Unconditional generation: fθ(xt, y∅, t) — "generate any image"
• Conditional generation: fθ(xt, y, t) — "generate an image matching y"
At sampling time, compute both velocities and combine them:
$$v_\varnothing = f_\theta(x_t, y_\varnothing, t), \qquad v_y = f_\theta(x_t, y, t)$$
Think of v∅ as pointing toward "any plausible image" and vy as pointing toward "images matching condition y." The difference (vy − v∅) is the direction that makes the image more y-like. CFG adds an extra push in that direction:
vcfg = vy + w · (vy − v∅)
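When w = 0, you get plain conditional sampling. When w > 0, you amplify whatever makes the image specifically match y, pushing samples toward higher p(y|x). In score-function terms, vcfg corresponds to the guided score ∇x log p(xt) + (1+w) · ∇x log p(y|xt). Note that many libraries report a guidance scale s = 1 + w instead, where s = 1 means no extra guidance.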
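Both halves of CFG fit in a few lines. This is a minimal sketch with made-up names: `NULL` is a hypothetical stand-in for the null token y∅, velocities are plain Python lists, and the two input velocities are invented numbers rather than network outputs.

```python
import random

NULL = None   # hypothetical null-condition token (stands in for y_null)

def maybe_drop_condition(y, p_drop, rng):
    """Training side: replace y with the null token with probability p_drop."""
    return NULL if rng.random() < p_drop else y

def cfg_velocity(v_cond, v_uncond, w):
    """Sampling side: v_cfg = v_y + w * (v_y - v_null), element by element."""
    return [vc + w * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]

# Training: roughly 10% of the labels are dropped.
rng = random.Random(0)
labels = [maybe_drop_condition("cat", 0.1, rng) for _ in range(10000)]
drop_fraction = sum(1 for y in labels if y is NULL) / len(labels)
print("fraction dropped:", drop_fraction)

# Sampling: two forward passes per step would produce these two velocities.
v_y = [1.0, 2.0]      # conditional velocity (illustrative numbers)
v_null = [0.5, 2.5]   # unconditional velocity (illustrative numbers)
for w in (0.0, 3.0):
    print("w =", w, "->", cfg_velocity(v_y, v_null, w))
```

With w = 0 the output equals the conditional velocity; with w = 3 each component moves three extra multiples of (vy − v∅) away from the unconditional prediction.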
Prompt: "a photo of a cat." At w = 0: diverse images, some clearly cats, some ambiguous. At w = 3: all images are unambiguously cats, sharp and detailed, but less variety (similar poses/angles). At w = 10: even more cat-like but oversaturated colors, artifacts appear. At w = 20: extreme artifacts, the image is "too much cat."
Typical production values: w = 3–7 for text-to-image. There's a quality-diversity tradeoff: higher w = more aligned with the prompt but less diverse and potentially lower quality.
Earlier work (Dhariwal & Nichol, 2021) used a separately-trained classifier p(y|x) to guide diffusion. This required training an extra model and running it at every denoising step. CFG eliminates the classifier entirely — the single diffusion model serves as both the unconditional and conditional generator. The name "classifier-free" distinguishes it from the earlier "classifier-guided" approach.
Every sampling step requires two forward passes: one unconditional (v∅) and one conditional (vy). This doubles the compute per step. Some methods batch the two passes together to use GPU parallelism, but the FLOPs still double.
The velocity predictor fθ(xt, y, t) needs a neural network architecture. Two families have dominated: the U-Net (inherited from image segmentation) and the Diffusion Transformer (DiT) (adapted from the vision transformer).
The U-Net has an encoder-decoder structure with skip connections:
Encoder: Stack of downsampling convolutional blocks. Resolution halves at each level (64→32→16→8). The network sees the image at multiple scales.
Decoder: Stack of upsampling blocks. Resolution doubles at each level (8→16→32→64). Skip connections concatenate encoder features at matching resolutions, preserving fine details.
Timestep injection: The timestep t is embedded via sinusoidal positional encoding (like transformers), then injected into each block via scale-and-shift (also called FiLM conditioning): given embedding e(t), compute scale γ and shift β, then transform features as γ · h + β. When the features are layer-normalized first, this is called Adaptive Layer Normalization (AdaLN); convolutional U-Nets usually apply it after group normalization instead.
Text conditioning: Text embeddings from CLIP or T5 are injected via cross-attention layers. At each resolution, the image features attend to the text token sequence, allowing each spatial location to "look at" relevant words in the prompt.
Instead of fixed scale/shift in layer norm, predict them from conditioning: γ, β = MLP(e(t)). Apply as: AdaLN(h) = γ · LayerNorm(h) + β. This modulates the network's behavior based on the current timestep (and optionally class label).
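A small sketch of the mechanism on a single feature vector. Two simplifications are assumptions of this example: the MLP that produces γ and β is reduced to one linear layer each, and the weights are set by hand rather than learned.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed a scalar timestep as cosines and sines at geometrically spaced frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def layer_norm(h, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def linear(weights, bias, v):
    """Dense layer: one output per row of weights."""
    return [sum(wi * xi for wi, xi in zip(row, v)) + b for row, b in zip(weights, bias)]

def ada_ln(h, emb, w_gamma, b_gamma, w_beta, b_beta):
    """AdaLN(h) = gamma * LayerNorm(h) + beta, with gamma and beta predicted from emb."""
    gamma = linear(w_gamma, b_gamma, emb)
    beta = linear(w_beta, b_beta, emb)
    return [g * x + b for g, x, b in zip(gamma, layer_norm(h), beta)]

h = [1.0, 2.0, 3.0, 4.0]
emb = sinusoidal_embedding(0.5, 8)
zeros = [[0.0] * 8 for _ in range(4)]

# With zero weights, gamma = 1 and beta = 0: AdaLN reduces to plain LayerNorm.
out = ada_ln(h, emb, zeros, [1.0] * 4, zeros, [0.0] * 4)
print(out)
```

Nonzero weights would make γ and β change with t, so the same block can behave differently at high and low noise levels.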
Peebles & Xie (2023) replaced the U-Net with a standard Vision Transformer. The key insight: transformers scale better with compute than convolutions. As you increase model size, DiT quality improves more smoothly and predictably.
Patchification: Split the image (or latent) into non-overlapping patches, flatten each patch into a vector, add positional embeddings. These patch tokens become the input sequence for the transformer — exactly like ViT for classification, but now for generation.
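Patchification is easiest to see on a tiny array. This sketch uses nested Python lists indexed as [row][column][channel] and omits the positional embeddings and the linear projection that a real DiT applies to each patch.

```python
def patchify(image, p):
    """Split an H x W x C nested list into non-overlapping p x p patches.

    Returns one flattened vector of length p * p * C per patch, in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % p or width % p:
        raise ValueError("image height and width must be divisible by the patch size")
    tokens = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            patch = []
            for i in range(top, top + p):
                for j in range(left, left + p):
                    patch.extend(image[i][j])   # append all channels of this pixel
            tokens.append(patch)
    return tokens

# A 4x4 single-channel "image" whose pixel value encodes its position.
image = [[[float(4 * i + j)] for j in range(4)] for i in range(4)]
tokens = patchify(image, 2)
print(len(tokens), "tokens:", tokens)

# A 32x32x4 latent with patch size 2 becomes a sequence of 256 tokens of length 16.
latent = [[[0.0] * 4 for _ in range(32)] for _ in range(32)]
latent_tokens = patchify(latent, 2)
print(len(latent_tokens), "tokens of length", len(latent_tokens[0]))
```

Halving the patch size quadruples the sequence length, which is why patch size is one of the main cost knobs in a DiT.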
Conditioning injection: Timestep (and class label) via AdaLN-Zero: like AdaLN, but with an extra per-block gate initialized to zero, so every residual block starts out as the identity function. Text via cross-attention layers interleaved with self-attention.
U-Nets have inductive biases (locality, translation equivariance) that help at small scale but limit scaling. Transformers have no spatial inductive bias — they learn everything from data. This is worse at small scale (need more data) but better at large scale. Since modern image generation uses billions of training images and billions of parameters, the transformer's superior scaling wins.
Modern text-to-image models (FLUX, SD3) use MM-DiT (Multi-Modal DiT). Instead of separate self-attention for image tokens and cross-attention for text, MM-DiT concatenates all tokens (image patches + text tokens) into a single sequence and runs joint self-attention over everything. Each image patch can attend to every text token and vice versa. This is simpler and empirically better than the cross-attention design.
U-Net (LDM / Stable Diffusion v1): ~860M params. Conv encoder-decoder + self-attention at low resolutions + cross-attention for text. Trained on 256×256 images first, then at 512×512.
DiT-XL/2: ~675M params. 28 transformer blocks, patch size 2. Trained on ImageNet 256×256. Achieved FID 2.27 (SOTA at the time).
FLUX.1: ~12B params. MM-DiT with joint attention. T5-XXL + CLIP text encoders. Trained on billions of text-image pairs.
Running diffusion in pixel space is expensive. A 512×512×3 image has 786,432 dimensions. Every denoising step runs the full network over all of them. At 50 steps, that's about 39 million pixel values pushed through the network per image, each costing many FLOPs. Can we work in a smaller space?
Stage 1: Compression. Train a VAE (variational autoencoder) to compress images into a low-dimensional latent space. The encoder maps a 256×256×3 image (196,608 values) to a 32×32×16 latent (16,384 values). That's 8× downsampling along each spatial side and a 12× reduction in total dimensions. The decoder maps the latent back to the image.
Stage 2: Diffusion on latents. Train the diffusion model (U-Net or DiT) on the 32×32×16 latents, not the 256×256×3 images. Everything is identical (add noise to latents, predict velocity, denoise), but now each step processes only 16,384 dimensions instead of 196,608.
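The size bookkeeping for a 256×256×3 image and a 32×32×16 latent can be checked in a few lines. The helper name is illustrative.

```python
def num_values(shape):
    """Total number of scalar entries in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

image_shape = (256, 256, 3)
latent_shape = (32, 32, 16)

image_values = num_values(image_shape)
latent_values = num_values(latent_shape)
side_downsampling = image_shape[0] // latent_shape[0]

print("image values :", image_values)
print("latent values:", latent_values)
print("downsampling per side:", side_downsampling)
print("total reduction      :", image_values / latent_values)
```

The spatial grid shrinks by 8× per side (64× fewer positions), but each position carries 16 channels instead of 3, so the total number of values drops by 12×.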
The autoencoder isn't a plain VAE — its decoder is trained with a GAN discriminator (perceptual + adversarial loss). This is critical: pure MSE reconstruction produces blurry images; the discriminator forces the decoder to produce sharp, detailed outputs. The encoder has a mild KL penalty to keep the latent space smooth.
It's a three-part system: (1) A VAE encoder compresses images to latents. (2) A diffusion model generates in latent space. (3) A GAN-enhanced VAE decoder maps latents back to sharp images. Three of the four generative model families from Chapter 1, working together.
Perceptual compression removes imperceptible high-frequency details (like JPEG). The VAE's first job: collapse the pixel space by removing information humans can't see. Semantic compression captures the meaningful content: objects, layout, style. The diffusion model operates on semantically-rich latents, not redundant pixels. This separation is why LDMs are both fast and high-quality.
Input: text prompt "a cyberpunk city at night, neon lights, rain." Text encoders: CLIP-ViT-L + OpenCLIP-ViT-bigG (dual encoders for richer text understanding). Latent size: 128×128×4 (for 1024×1024 output). Denoiser: ~2.6B params (these figures match SDXL, whose denoiser is a U-Net rather than a DiT). 50 Euler steps with CFG w=7.5. The entire generation: ~4 seconds on an A100 GPU.
Latent diffusion + CFG + DiT = the complete text-to-image stack. Let's trace the full pipeline from text prompt to pixels, then extend to video.
Step 1: Text encoding. The text prompt goes through one or more frozen text encoders. Common choices: CLIP (contrastive vision-language model, typically used for a single pooled embedding) and T5-XXL (the encoder half of an encoder-decoder language model, which produces a sequence of token embeddings). Using both captures complementary information: CLIP gives global semantics, T5 gives fine-grained token-level detail.
Step 2: Diffusion generation. The text embeddings condition a DiT via cross-attention (or joint attention in MM-DiT). Starting from random noise in latent space, T denoising steps with CFG produce a clean latent.
Step 3: Decoding. The VAE decoder maps the latent to pixels.
Text encoders: T5-XXL (4.7B params, produces 256 token embeddings) + CLIP-ViT-L (produces 1 global embedding). DiT: 12B params, MM-DiT with joint attention, patch size 2, 8×8 latent downsampling. VAE: 16 latent channels, 8× spatial downsampling. Trained on billions of captioned images.
The conceptual leap from image to video generation is smaller than you'd think. An image is a 2D grid of latents (h × w × c). A video is a 3D grid of latents (t × h × w × c), where t is the time dimension. The diffusion model operates on this 3D tensor.
A video VAE compresses both spatially and temporally. A 16-frame 256×256 video might compress to a 4×32×32×16 latent (4× temporal, 8× spatial downsampling). The diffusion model generates in this compressed 3D space, and the video decoder inflates it back to frames.
Full 3D self-attention over (t × h × w) tokens is quadratic in the total count — prohibitively expensive for long videos. The standard solution: factorized attention.
Spatial attention: Within each frame, all (h × w) patches attend to each other. This captures spatial relationships (objects, layout).
Temporal attention: For each spatial position, the tokens across all T frames attend to each other. This captures motion and temporal coherence.
These alternate in the transformer blocks: spatial-attention → temporal-attention → spatial → temporal → ...
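Counting query-key pairs shows where the savings come from. The sketch below counts pairs for one attention layer and ignores heads, feature width, and constant factors; the 4×32×32 latent is an illustrative size.

```python
def full_attention_pairs(t, h, w):
    """Every token attends to every token: (t * h * w)^2 pairs."""
    n = t * h * w
    return n * n

def factorized_attention_pairs(t, h, w):
    """Spatial attention within each frame plus temporal attention at each position."""
    spatial = t * (h * w) ** 2     # t frames, each with (h * w)^2 pairs
    temporal = (h * w) * t ** 2    # h * w positions, each with t^2 pairs
    return spatial + temporal

t, h, w = 4, 32, 32
full = full_attention_pairs(t, h, w)
factorized = factorized_attention_pairs(t, h, w)
print("full 3D attention   :", full)
print("factorized attention:", factorized)
print("ratio               :", full / factorized)
```

The saving grows with the number of frames: full attention scales with t², while the dominant spatial term of the factorized version scales with t.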
Scale: 30B parameter DiT. Latent space: 8×8×8 downsampling (spatial 8×, temporal 8×). A 16-second 768×768 video at 16fps = 256 frames → 32×96×96 latent = 294,912 latent positions, or ~73K tokens after 2×2 spatial patchification. Training data: Hundreds of millions of video clips. Text conditioning: T5-XXL + MetaCLIP. Result: Photorealistic, temporally coherent videos from text prompts.
| Year | Model | Key Innovation |
|---|---|---|
| 2024 Feb | Sora (OpenAI) | Spacetime patches, long coherent videos |
| 2024 Jun | Gen-3 (Runway) | Commercial text-to-video |
| 2024 Oct | MovieGen (Meta) | 30B DiT, 73K tokens, audio generation |
| 2024 Dec | HunyuanVideo (Tencent) | Open-source, dual-stream DiT |
| 2025 | Cosmos (NVIDIA) / Wan (Alibaba) | World models, open weights |
Video models still struggle with: physics (objects passing through each other, gravity violations), counting (wrong number of fingers, objects), long-term coherence (character appearance drifting over time), and text rendering (garbled text on signs). These are active research areas.
Rectified Flow is one specific choice. The general framework parameterizes the interpolation and prediction target with four functions:
$$x_t = a(t)\,x + b(t)\,z, \qquad \text{target}_t = c(t)\,x + d(t)\,z$$
Different choices of a, b, c, d recover all the famous diffusion formulations:
| Method | a(t) | b(t) | c(t) | d(t) | Predicts |
|---|---|---|---|---|---|
| Rectified Flow | 1−t | t | −1 | 1 | Velocity v = z−x |
| VP (DDPM) | √α(t) | √(1−α(t)) | 0 | 1 | Noise ε |
| VE (Score SDE) | 1 | σ(t) | 0 | 1 | Noise ε |
| x-prediction | a(t) | b(t) | 1 | 0 | Clean data x |
| v-prediction | a(t) | b(t) | −b(t) | a(t) | Velocity v |
Despite the different parameterizations, all these methods learn to transport the noise distribution to the data distribution along some path. The differences are in: (1) the shape of the path (straight vs. curved), (2) what the network predicts (noise, data, velocity), (3) the resulting noise schedule. Given a perfectly trained network and exact integration, the prediction targets can be converted into one another; what changes is how training errors are weighted across noise levels. In practice, rectified flow (straight paths, velocity prediction) tends to work best with fewer steps.
There's a beautiful connection to the VAE framework. Think of the entire forward trajectory {x0, x0.01, x0.02, ..., x1} as a hierarchy of latent variables. The forward process is the "encoder" q(x1:T|x0). The reverse process is the "decoder" pθ(x0:T-1|xT). The ELBO decomposes into a sum of per-step denoising losses, each of which reduces to a weighted MSE prediction loss of the same form as the velocity regression used for training.
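The table can be turned into a small lookup. This sketch assumes the four functions are read as xt = a(t)·x + b(t)·z with regression target c(t)·x + d(t)·z, and it uses a cosine-squared α(t) purely as an illustrative VP schedule; neither the schedule nor the names come from a specific library.

```python
import math

def alpha(t):
    """Illustrative VP schedule (an assumption): alpha(0) = 1 is clean, alpha(1) = 0 is noise."""
    return math.cos(0.5 * math.pi * t) ** 2

def a_vp(t):
    return math.sqrt(alpha(t))

def b_vp(t):
    return math.sqrt(1.0 - alpha(t))

# Each entry maps a method name to its (a, b, c, d) functions of t.
SCHEMES = {
    "rectified_flow": (lambda t: 1.0 - t, lambda t: t, lambda t: -1.0, lambda t: 1.0),
    "vp_noise": (a_vp, b_vp, lambda t: 0.0, lambda t: 1.0),
    "vp_x_prediction": (a_vp, b_vp, lambda t: 1.0, lambda t: 0.0),
    "vp_v_prediction": (a_vp, b_vp, lambda t: -b_vp(t), lambda t: a_vp(t)),
}

def interpolate_and_target(scheme, x, z, t):
    """Return (x_t, target) for one data value x, one noise value z, and time t."""
    a, b, c, d = SCHEMES[scheme]
    return a(t) * x + b(t) * z, c(t) * x + d(t) * z

x, z, t = 1.0, 3.0, 0.25
for name in SCHEMES:
    x_t, target = interpolate_and_target(name, x, z, t)
    print(f"{name:16s} x_t = {x_t:.4f}   target = {target:.4f}")
```

Swapping the row changes both the noisy input and the regression target, while the training loop around it stays the same.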
An alternative to continuous diffusion: use a VQ-VAE (Vector-Quantized VAE) to encode images into a grid of discrete tokens, then train an autoregressive transformer to generate the tokens left-to-right, top-to-bottom. This is the approach behind DALL-E 1 (Ramesh et al. 2021) and recent models like LlamaGen.
The remaining weakness of diffusion: sampling speed. Several approaches compress the multi-step process:
| Method | Steps | Key Idea |
|---|---|---|
| DDIM | 10–50 | Deterministic sampling, skip timesteps |
| Progressive Distillation | 4–8 | Student learns to combine 2 teacher steps into 1 |
| Consistency Models | 1–2 | Map any point on trajectory directly to x0 |
| Adversarial Distillation | 1–4 | GAN loss forces student to produce sharp images in few steps |
Key idea: learn a function fθ such that fθ(xt, t) = x0 for all t along the same trajectory. In other words, no matter where you are on the noisy-to-clean path, the model predicts the endpoint directly. Self-consistency condition: fθ(xt, t) = fθ(xt', t') for any t, t' on the same trajectory. Training enforces this via a consistency loss between adjacent timesteps. Result: one-step generation with quality approaching 50-step diffusion.
| Property | Autoregressive | VAE | GAN | Diffusion/Flow |
|---|---|---|---|---|
| Density | Exact | Lower bound | None | Score-based |
| Training | Stable (MLE) | Stable (ELBO) | Unstable (minimax) | Stable (MSE) |
| Sample quality | Good (text), Fair (images) | Blurry | Sharp | Best |
| Diversity | High | High | Low (mode collapse) | High |
| Speed | Slow (sequential) | Fast (one pass) | Fast (one pass) | Slow (multi-step) |
| Scalability | Excellent (GPT-4) | Limited | Poor | Excellent (FLUX, Sora) |
| Era | 2016–present | 2013–2018 | 2014–2021 | 2020–present |
A modern text-to-image model like Stable Diffusion 3 is not one technique — it's five, stacked together:
(1) VAE encoder compresses images to latents. (2) Diffusion/Flow Matching generates in latent space. (3) DiT architecture parameterizes the denoiser. (4) CFG sharpens conditional generation. (5) Distillation reduces the number of sampling steps. Each piece was invented separately; together they're greater than the sum.
| Era | Dominant Method | Milestone Models |
|---|---|---|
| 2014–2017 | GANs + VAEs | GAN, DCGAN, VAE |
| 2017–2021 | GANs | StyleGAN, BigGAN, StyleGAN2 |
| 2020–2022 | Diffusion emergence | DDPM, Score SDE, ADM |
| 2022–2023 | Latent Diffusion | Stable Diffusion, DALL-E 2, Imagen |
| 2023–present | DiT + Flow Matching | DiT, SD3, FLUX, Sora, MovieGen |
| # | Paper |
|---|---|
| 1 | Goodfellow et al. "Generative Adversarial Nets." NeurIPS, 2014. |
| 2 | Radford et al. "Unsupervised Representation Learning with DCGANs." ICLR, 2016. |
| 3 | Karras et al. "A Style-Based Generator Architecture for GANs." CVPR, 2019. |
| 4 | Ho et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. |
| 5 | Song et al. "Score-Based Generative Modeling through SDEs." ICLR, 2021. |
| 6 | Dhariwal & Nichol. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS, 2021. |
| 7 | Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR, 2022. |
| 8 | Lipman et al. "Flow Matching for Generative Modeling." ICLR, 2023. |
| 9 | Liu et al. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR, 2023. |
| 10 | Peebles & Xie. "Scalable Diffusion Models with Transformers." ICCV, 2023. |
| 11 | Song et al. "Consistency Models." ICML, 2023. |
| 12 | Ho & Salimans. "Classifier-Free Diffusion Guidance." NeurIPS Workshop, 2022. |
| 13 | Ramesh et al. "Zero-Shot Text-to-Image Generation." ICML, 2021. |
| 14 | Polyak et al. "Movie Gen: A Cast of Media Foundation Models." Meta, 2024. |
| 15 | Esser et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML, 2024. |
| 16 | Black Forest Labs. "FLUX.1: Open-Weight Flow Matching Models." 2024. |
Modern image and video generation = compress (VAE), denoise (diffusion), condition (CFG), and scale (transformers). Everything else is optimization.