Holderrieth & Erives, Chapter 5

Guidance: How to Condition on a Prompt

An unconditional model generates some image. Guidance makes it generate the image you asked for.

Prerequisites: Flow matching (Ch 3) + Score functions (Ch 4). That's it.

Chapter 0: Why Guide?

You type "a corgi wearing sunglasses on a beach" into an image generator. A few seconds later, a photo-realistic corgi appears, exactly matching your description. But in Chapter 3, we learned to train flow models that sample from pdata(x) — the unconditional distribution of all images. An unconditional model might give you a car, a cat, a sunset, anything. How do we make it listen to a prompt?

The answer is guidance: we modify the generative process so that instead of sampling from the full data distribution pdata(x), we sample from a conditional distribution pdata(x|y), where y is a text prompt, class label, or any conditioning variable. The key insight of this chapter is that there are two fundamentally different ways to achieve this — and the one that works best in practice (classifier-free guidance) is actually a heuristic, not a mathematically exact procedure.

Terminology note. The book uses the word guided instead of "conditional" to avoid confusion with the conditional probability paths from Chapter 2. When we say "guided generation" or "guided vector field," we mean conditioning on a prompt y (text, class label, etc.). When we say "conditional probability path," we mean conditioning on a data point z ∼ pdata.

Let's see the problem visually. In the simulation below, an unconditional model generates points from a mixture of two Gaussians (blue cluster and orange cluster). We want to generate points from only one of the clusters. That's guidance.

Unconditional vs. Guided Generation

Click "Generate" to see unconditional sampling (both clusters). Then select a class to see guided sampling (one cluster only).


When you select a class, you see samples cluster in only one region. The guided model learned to associate each class label with a specific part of the distribution. This is exactly what happens in large-scale image generators — except the "class" is a rich text prompt and the distribution lives in a million-dimensional pixel space.

An unconditional generative model samples from pdata(x). A guided model samples from the conditional distribution pdata(x|y).

Chapter 1: Vanilla Guidance

The simplest approach to guidance is almost embarrassingly straightforward: just give the prompt to the neural network as an extra input. This is called vanilla guidance.

Recall from Chapter 3 that an unconditional flow model has a neural network uθt(x) that takes a noisy point x and time t, and outputs a velocity vector. For vanilla guidance, we simply add the prompt y as a third input:

uθt(x|y) : Rd × Y × [0,1] → Rd

The network now takes three things: the noisy image x, the conditioning variable y (from some space Y), and the time t. It outputs a velocity vector uθt(x|y) ∈ Rd.

What is Y? The conditioning space Y can be anything. If y is a class label (like "cat" or "dog"), then Y = {0, 1, ..., N}. If y is a text prompt, then Y is the space of all text strings. If y is another image (for image-to-image translation), then Y = Rd'. The framework makes no assumptions about Y.

At inference time, sampling works exactly as before, but with y provided at every step:

Initialize
X0 ∼ pinit (Gaussian noise)
Fix prompt
Choose y (e.g. "a corgi on a beach")
Simulate ODE
dXt = uθt(Xt|y) dt, from t=0 to t=1
Output
X1 ∼ pdata(·|y)

The ODE is identical to the unconditional case from Chapter 2 — the only difference is that the neural network now also sees y at every evaluation. Think of y as a steering wheel: the vector field changes direction depending on what prompt you provide.
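To make the sampling loop concrete without a trained network, here is a minimal sketch in plain Python. It assumes a toy one-dimensional setup that is not from the book: each class y has data z ~ N(mu_y, s^2), and the path is the linear one, x_t = t*z + (1-t)*eps. Under those assumptions the guided velocity has a closed form, which stands in for the network uθt(x|y). The names u_guided and sample_guided, and the cluster values, are illustrative.

```python
import random

def u_guided(x, t, mu_y, s):
    """Closed-form E[z - eps | x_t = x, y] for z|y ~ N(mu_y, s^2), x_t = t*z + (1-t)*eps."""
    var_t = (t * s) ** 2 + (1.0 - t) ** 2
    return ((1.0 - t) * mu_y + (t * s * s - (1.0 - t)) * x) / var_t

def sample_guided(mu_y, s, rng, n_steps=100):
    x = rng.gauss(0.0, 1.0)                 # X_0 ~ N(0, 1)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x += dt * u_guided(x, t, mu_y, s)   # Euler step; y enters through mu_y
    return x

rng = random.Random(0)
class_means = {"blue": -2.0, "orange": 2.0}  # toy "steering wheel": y picks the cluster
samples = [sample_guided(class_means["orange"], 0.5, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = (sum((v - mean) ** 2 for v in samples) / len(samples)) ** 0.5
print(f"mean={mean:.2f}, std={std:.2f}")
```

With these settings the sample mean and standard deviation should land close to the orange cluster's values (2.0 and 0.5), up to Euler and Monte Carlo error. Switching the key to "blue" changes only the conditioning input, not the loop.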

Vanilla Guided Vector Field

This shows the velocity field for a mixture of two Gaussians. Toggle the class to see how the vector field changes. With y="blue", vectors point toward the blue cluster. With y="orange", they point toward the orange cluster.


Notice how the arrows in the vector field change direction depending on which class you select. With no guidance, arrows point toward both clusters. With guidance, they converge toward just one — the network has learned to route different prompts to different parts of the data distribution.

Worked example: class-conditional generation. Suppose we have 10 classes (digits 0–9). The conditioning space is Y = {0, 1, ..., 9}. When we want to generate a "7", we set y=7 and pass it to the network at every ODE step. The network has learned that y=7 means "converge toward the region of image space where sevens live."

How does the network "see" y? For class labels, y is typically converted to a learned embedding vector yemb ∈ Rd via an embedding table (just like word embeddings in NLP). For text prompts, y is processed by a pretrained text encoder like CLIP. We'll cover this in detail in Chapter 6.

python
# Vanilla guided inference (class-conditional)
import torch

@torch.no_grad()  # no gradients are needed when sampling
def guided_sample(model, class_label, n_steps=50):
    x = torch.randn(1, 3, 32, 32)  # X_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor([i * dt])  # current time, shape (1,)
        # Network sees x, t, AND the class label y
        velocity = model(x, t, class_label)
        x = x + dt * velocity  # Euler step
    return x  # X_1, approximately a sample from p_data(x | y=class_label)
The analogy. Think of the conditioning variable y as a destination address. The unconditional model generates directions to a random house in the city. The guided model generates directions to a specific house — the one specified by the address y. The neural network has learned a map from addresses to destinations.
In vanilla guidance, how is the prompt y incorporated into the model?

Chapter 2: Training Guided Models

How do we train the guided network uθt(x|y)? The answer builds directly on the conditional flow matching (CFM) loss from Chapter 3. We simply add y to the picture.

Recall the unconditional CFM loss from Chapter 3:

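$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{z \sim p_{\text{data}},\; t \sim \text{Unif}[0,1],\; x \sim p_t(\cdot \mid z)} \left\lVert u_t^{\theta}(x) - u_t^{\text{target}}(x \mid z) \right\rVert^2$$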

For the guided version, we make two changes:

Change 1: Sample pairs (z, y) from the data distribution. Instead of just sampling images z, we now sample an image z together with its associated label/prompt y. In a PyTorch dataloader, each batch returns both the image tensor and the conditioning information.

Change 2: Feed y into the neural network. The network gets y as input and outputs a velocity conditioned on that prompt.

This gives us the guided conditional flow matching loss:

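$$\mathcal{L}_{\text{CFM}}^{\text{guided}}(\theta) = \mathbb{E}_{(z,y) \sim p_{\text{data}}(z,y),\; t \sim \text{Unif}[0,1],\; x \sim p_t(\cdot \mid z)} \left\lVert u_t^{\theta}(x \mid y) - u_t^{\text{target}}(x \mid z) \right\rVert^2$$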
Key observation. The conditional probability path pt(·|z) and the conditional vector field utargett(x|z) do not depend on y at all. The label y only affects the network's input: it tells the network which part of the data distribution to aim for, but the target velocity is still the same α̇t z + β̇t ε that moves noise toward z (a straight line for the linear schedule αt = t, βt = 1 − t). In other words, the paths are unchanged; only the network's "address book" changes.

Let's write this out as a training algorithm:

1. Sample data
(z, y) ∼ pdata — image z with its label/prompt y
2. Sample time
t ∼ Unif[0,1]
3. Sample noise
ε ∼ N(0, Id)
4. Construct noisy point
x = αt z + βt ε
5. Compute target
utarget = α̇t z + β̇t ε
6. Compute loss
L = ||uθt(x|y) − utarget||2
7. Update
Gradient descent on θ
↻ repeat
python
# Guided Conditional Flow Matching training loop
# alpha, beta, alpha_dot, beta_dot: the path schedulers and their time derivatives
import torch

for z, y in dataloader:           # (z, y) pairs from dataset
    t = torch.rand(z.shape[0], device=z.device)  # random time per sample
    eps = torch.randn_like(z)       # Gaussian noise
    t_b = t.view(-1, 1, 1, 1)       # reshape so t broadcasts over (C, H, W)

    # Construct noisy point on probability path
    x_t = alpha(t_b) * z + beta(t_b) * eps

    # Target velocity: alpha_dot * z + beta_dot * eps (does not involve y)
    u_target = alpha_dot(t_b) * z + beta_dot(t_b) * eps

    # Network prediction, conditioned on y
    u_pred = model(x_t, t, y)

    # MSE loss
    loss = ((u_pred - u_target) ** 2).mean()
    optimizer.zero_grad()           # clear gradients from the previous step
    loss.backward()
    optimizer.step()

The only difference from unconditional training is on the line u_pred = model(x_t, t, y) — we pass y to the network. Everything else is identical.

Worked example: shapes of the tensors. Let's trace the exact data flow for training on 32×32 images with 10 classes:

| Variable | Shape | Source |
| --- | --- | --- |
| z (clean image) | (B, 3, 32, 32) | Dataloader |
| y (class label) | (B,), integers in {0,...,9} | Dataloader |
| t (time) | (B,), floats in [0,1] | Random |
| ε (noise) | (B, 3, 32, 32) | N(0, I) |
| x (noisy image) | (B, 3, 32, 32) | αt z + βt ε |
| utarget (target velocity) | (B, 3, 32, 32) | α̇t z + β̇t ε |
| $u_t^{\theta}(x \mid y)$ (prediction) | (B, 3, 32, 32) | Network forward pass |

Notice that the target velocity utarget is computed from z and ε only — it doesn't involve y at all. The label y enters only through the network's forward pass.

Sanity check. The guided loss LguidedCFM differs from the corresponding marginal flow matching loss only by a constant that does not depend on θ, so the two have the same gradients and the same minimizer. This follows from Theorem 12 (Chapter 3), by the same "expand ‖a−b‖², swap integrals" proof. The only difference is that we now sample (z, y) jointly instead of z alone, and feed y to the network. Everything else (the probability path, the target, the loss structure) is unchanged.
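One way to see why regressing on the per-sample target utargett(x|z) still learns the guided field: the minimizer of a squared-error loss is a conditional mean, so at a given (x, t, y) the network is pulled toward the average of z − ε over all pairs that could have produced x. The sketch below checks this by Monte Carlo in a toy one-dimensional setting that is assumed here for illustration (class data z ~ N(mu_y, s^2), linear path x_t = t*z + (1-t)*eps); the function names are hypothetical.

```python
import random

def u_guided(x, t, mu_y, s):
    """Closed-form guided velocity E[z - eps | x_t = x, y] for the toy Gaussian class."""
    var_t = (t * s) ** 2 + (1.0 - t) ** 2
    return ((1.0 - t) * mu_y + (t * s * s - (1.0 - t)) * x) / var_t

def mc_average_target(x0, t, mu_y, s, n=200_000, half_width=0.05, seed=0):
    """Average the conditional target z - eps over samples whose x_t falls near x0."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        z = rng.gauss(mu_y, s)           # data point drawn from class y
        eps = rng.gauss(0.0, 1.0)        # noise
        x_t = t * z + (1.0 - t) * eps    # point on the conditional path
        if abs(x_t - x0) < half_width:
            total += z - eps             # per-sample target; y itself is not used here
            count += 1
    return total / count, count

estimate, n_used = mc_average_target(x0=0.5, t=0.5, mu_y=2.0, s=0.5)
exact = u_guided(0.5, 0.5, 2.0, 0.5)
print(f"Monte Carlo: {estimate:.3f} from {n_used} samples, closed form: {exact:.3f}")
```

The two numbers should agree up to sampling noise and the small bias from the finite bin width.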
Does the conditioning variable y affect the conditional probability path pt(x|z) or the target velocity utargett(x|z)?

Chapter 3: The Prompt Problem

In theory, vanilla guidance should work perfectly: the network learns utargett(x|y) and generates samples from pdata(x|y). In practice, something goes wrong. Generated images often don't match the prompt well enough. A model asked to generate a "corgi" might produce something vaguely dog-like but not clearly a corgi. Why?

There are several reasons this happens:

1. Underfitting. The model might not have enough capacity or training time to faithfully learn the true conditional vector field. Neural networks are imperfect approximators — the learned uθt(x|y) is only an approximation of the true utargett(x|y).

2. Noisy data pairings. Real-world datasets like LAION (text-image pairs scraped from the web) have many mismatched pairs. An image of a cat might be labeled "my furry friend" — the conditioning signal is weak and noisy.

3. The diversity problem. Even with perfect training, the conditional distribution pdata(x|y = "dog") contains enormous variety: different breeds, poses, backgrounds, lighting, art styles. The model samples from this full conditional distribution, which often means the samples are "correct" but not strongly reflective of the prompt.

The core tension. Sampling exactly from pdata(x|y) gives high diversity but lower prompt fidelity. To get images that strongly match the prompt, we need to artificially amplify the conditioning signal — even though this means we are no longer sampling from the true conditional distribution.

This is the motivation for classifier-free guidance (CFG): a controlled way to trade diversity for prompt adherence. Before we get there, we need to understand its predecessor, classifier guidance.

Quantifying the problem. Researchers measure two competing metrics:

FID (Fréchet Inception Distance): measures how similar the generated distribution is to the real data distribution. Lower = better. Penalizes both poor quality and low diversity.

CLIP score: measures how well each generated image matches its text prompt, using the CLIP embedding similarity. Higher = better prompt adherence.

With vanilla guidance (which corresponds to a guidance scale of w=1, defined in Chapter 4), FID is often good because the sampled distribution is the correct one, but CLIP scores are mediocre. The whole point of guidance scaling is to push CLIP scores higher, even if it costs some FID.

Diversity vs. Fidelity Tradeoff

Each dot is a generated sample. The orange target region represents "ideal" samples for the prompt. Drag the guidance scale w to see how increasing it concentrates samples near the target but reduces diversity.


At w=1 (vanilla guidance, no amplification), samples spread across the full conditional distribution. As w increases, they collapse toward the prompt target. But notice: at very high w, the samples lose variety and all look nearly identical. This is the fundamental tradeoff.

Why does vanilla guidance (w=1) often produce samples that don't match the prompt well?

Chapter 4: Classifier Guidance

The first approach to boosting prompt fidelity was classifier guidance. It starts from a simple idea: decompose the guided vector field into an unconditional part plus a classifier gradient, then scale up the classifier part.

Recall from Chapter 4 that for Gaussian probability paths, the vector field can be written in terms of the score function (Proposition 1):

utargett(x|y) = at ∇ log pt(x|y) + bt x
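For reference, the coefficients at and bt can be recovered from the Gaussian path x = αt z + βt ε used in the training algorithm of Chapter 2. The short derivation below is reconstructed from those definitions as a reminder; it is not a quotation of Proposition 1, and the book's own statement should be treated as authoritative.

```latex
\begin{aligned}
\nabla \log p_t(x \mid z) &= -\frac{x - \alpha_t z}{\beta_t^2}
  \quad\Longrightarrow\quad
  z = \frac{\beta_t^2\, \nabla \log p_t(x \mid z) + x}{\alpha_t}, \\[4pt]
u_t^{\text{target}}(x \mid z)
  &= \dot{\alpha}_t z + \dot{\beta}_t\, \frac{x - \alpha_t z}{\beta_t}
   = \underbrace{\Big(\beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t\Big)}_{a_t}
     \nabla \log p_t(x \mid z)
   + \underbrace{\frac{\dot{\alpha}_t}{\alpha_t}}_{b_t}\, x .
\end{aligned}
```

Averaging both sides over the posterior of z given x (and y) yields the same identity for the marginal and guided fields. For the linear schedule αt = t, βt = 1 − t this gives at = (1 − t)/t and bt = 1/t.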

Now comes the key move. We apply Bayes' rule to the score:

∇ log pt(x|y) = ∇ log pt(x) + ∇ log pt(y|x)

Let's derive this step by step. Starting from Bayes' theorem:

pt(x|y) = pt(x) · pt(y|x) / pt(y)

Take the log of both sides:

log pt(x|y) = log pt(x) + log pt(y|x) − log pt(y)

Take the gradient ∇ with respect to x. Since pt(y) doesn't depend on x, its gradient vanishes:

∇x log pt(x|y) = ∇x log pt(x) + ∇x log pt(y|x)
Read it as two forces. The guided score is the unconditional score (where the data lives, regardless of prompt) plus a classifier gradient (how likely is the prompt y given a noisy image x). The first term pulls toward high-density regions generally; the second pulls specifically toward regions where the prompt matches.
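The decomposition can be checked numerically. The sketch below assumes a toy snapshot of pt at one fixed time, with two equal-variance Gaussian classes and equal priors (values chosen for illustration, not taken from the book), and compares finite-difference gradients of the three log-densities.

```python
import math

# Assumed toy snapshot: p_t(x|y) = N(M[y], V), equal class priors.
M = {"blue": -1.0, "orange": 1.0}
V = 1.0

def log_p_x_given_y(x, y):
    return -0.5 * math.log(2.0 * math.pi * V) - (x - M[y]) ** 2 / (2.0 * V)

def log_p_x(x):
    return math.log(sum(0.5 * math.exp(log_p_x_given_y(x, y)) for y in M))

def log_p_y_given_x(x, y):
    """Classifier log-probability; for equal-variance Gaussians it is logistic in x."""
    other = "blue" if y == "orange" else "orange"
    logit = (M[y] - M[other]) * x / V - (M[y] ** 2 - M[other] ** 2) / (2.0 * V)
    return -math.log1p(math.exp(-logit))

def grad(f, x, h=1e-5):
    """Central finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x, y = 0.3, "orange"
guided_score = grad(lambda v: log_p_x_given_y(v, y), x)
uncond_score = grad(log_p_x, x)
classifier_grad = grad(lambda v: log_p_y_given_x(v, y), x)
print(f"guided: {guided_score:.6f}")
print(f"unconditional + classifier: {uncond_score + classifier_grad:.6f}")
```

At this point the unconditional score is close to zero (the two clusters nearly cancel), so almost all of the guided score comes from the classifier term, which is the "second force" described above.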

Substituting back into the vector field decomposition:

utargett(x|y) = utargett(x) + at ∇ log pt(y|x)

The guided vector field = the unconditional vector field + a classifier gradient. To amplify prompt fidelity, we scale up the classifier term with a guidance scale w > 1:

ũt(x|y) = utargett(x) + w · at ∇ log pt(y|x)

At w=1, this is the true guided vector field. At w>1, we over-emphasize the classifier, pushing generated samples to match the prompt more strongly — at the cost of moving away from the true data distribution.

To use this in practice, we need to train a classifier pt(y|x) on noisy data (since x = αtz + βtε is noisy at intermediate times). This classifier tells us: given this noisy image x at time t, what label y is most likely?

Worked example: what does scaled classifier guidance look like? Consider a Gaussian mixture with two classes (blue and orange). The unconditional vector field utarget(x) points toward the average of both clusters. The classifier gradient ∇ log p(y="blue"|x) points toward the blue cluster specifically. The guided field combines both:

ũ(x|y="blue") = utarget(x) + w · at ∇ log p(y="blue"|x)

At w=1, we get the true guided field (exactly right). At w=3, the classifier gradient is tripled — the field points much more aggressively toward blue, producing samples that are clearly blue but with less diversity. At w=10, nearly all samples collapse to the center of the blue cluster.

Problem with classifier guidance. Training a separate classifier alongside the generative model is expensive (two networks instead of one). Worse, if y is a text prompt rather than a class label, learning pt(y|x) is extremely hard — how do you classify a noisy image into the space of all possible text prompts? Finally, the classifier must work on noisy images at all timesteps, not just clean images — a specialized model is needed.
In classifier guidance, the guided vector field is decomposed as utarget(x|y) = utarget(x) + at∇ log pt(y|x). What does the term ∇ log pt(y|x) represent?

Chapter 5: Classifier-Free Guidance

Classifier guidance requires a separate classifier. Classifier-free guidance (CFG) achieves the same amplification effect using only the generative model itself — no extra network needed. This is the method used by virtually every modern image and video generator.

The derivation is elegant. Start from the classifier guidance formula:

ũt(x|y) = utargett(x) + w · at ∇ log pt(y|x)

We want to eliminate the classifier gradient ∇ log pt(y|x). From Bayes' rule (Chapter 4), we know:

∇ log pt(y|x) = ∇ log pt(x|y) − ∇ log pt(x)

Substitute this into the classifier guidance formula:

ũt(x|y) = utargett(x) + w · at [∇ log pt(x|y) − ∇ log pt(x)]

Now, recall that utargett(x) = at∇ log pt(x) + btx (score-velocity relationship). Similarly, utargett(x|y) = at∇ log pt(x|y) + btx. Let's expand and simplify:

ũt(x|y) = [at∇ log pt(x) + btx] + w[at∇ log pt(x|y) − at∇ log pt(x)]

Group the terms with ∇ log pt(x) and ∇ log pt(x|y):

ũt(x|y) = at(1−w)∇ log pt(x) + btx + w · at∇ log pt(x|y)

Recognize that at∇ log pt(x) + btx = utargett(x) and at∇ log pt(x|y) + btx = utargett(x|y). Add and subtract wbtx:

ũt(x|y) = (1−w) · utargett(x) + w · utargett(x|y)
The CFG formula. The classifier-free guided vector field is a linear interpolation (and extrapolation when w > 1) between the unconditional and the guided vector fields:
ũt(x|y) = (1 − w) uθt(x|∅) + w · uθt(x|y)
At w=1 we recover vanilla guidance. At w>1 we amplify the prompt. We use ∅ to denote "no conditioning" — the unconditional model.
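As a sanity check on the algebra, the sketch below compares the classifier guidance form with the CFG form on a toy example. The setup is an assumption made for illustration: two one-dimensional Gaussian classes with equal priors and the linear path x_t = t*z + (1-t)*eps, for which at = (1 − t)/t and all fields are available in closed form. The classifier gradient is computed by finite differences of log pt(y|x), so the comparison does not simply reuse the same formula twice.

```python
import math

MU = {0: -1.0, 1: 1.0}  # toy class means (assumption)
S = 1.0                 # shared class standard deviation (assumption)

def var_t(t):
    return (t * S) ** 2 + (1.0 - t) ** 2

def u_cond(x, t, y):
    """Exact guided velocity u_t(x|y) for class y."""
    return ((1.0 - t) * MU[y] + (t * S * S - (1.0 - t)) * x) / var_t(t)

def log_posterior(y, x, t):
    """log p_t(y|x) for equal class priors."""
    logits = {k: -(x - t * MU[k]) ** 2 / (2.0 * var_t(t)) for k in MU}
    m = max(logits.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return logits[y] - log_norm

def u_uncond(x, t):
    """Unconditional velocity: posterior-weighted average of the class velocities."""
    return sum(math.exp(log_posterior(k, x, t)) * u_cond(x, t, k) for k in MU)

def classifier_guided(x, t, y, w, h=1e-5):
    a_t = (1.0 - t) / t  # score coefficient for alpha_t = t, beta_t = 1 - t
    grad = (log_posterior(y, x + h, t) - log_posterior(y, x - h, t)) / (2.0 * h)
    return u_uncond(x, t) + w * a_t * grad

def cfg(x, t, y, w):
    return (1.0 - w) * u_uncond(x, t) + w * u_cond(x, t, y)

x, t, y, w = 0.3, 0.6, 1, 3.0
lhs = classifier_guided(x, t, y, w)
rhs = cfg(x, t, y, w)
print(f"classifier guidance: {lhs:.6f}, CFG: {rhs:.6f}")
```

The two values should match up to finite-difference error, for any choice of x, t in (0, 1), and w.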

This is remarkable. We don't need a classifier at all! We just need two things: an unconditional model uθt(x|∅) and a guided model uθt(x|y). And as we'll see next, we can train both in a single network.

Verification: w=1 recovers vanilla guidance. Plug w=1 into the CFG formula: ũ(x|y) = (1−1)u(x|∅) + 1·u(x|y) = u(x|y). The unconditional term vanishes entirely. This confirms that CFG is a strict generalization of vanilla guidance.

What does w=0 mean? At w=0: ũ(x|y) = u(x|∅). The model completely ignores the prompt and generates unconditionally. Between w=0 and w=1, we interpolate between unconditioned and conditioned generation. Beyond w=1, we extrapolate, amplifying the prompt signal beyond what the true conditional distribution dictates.

Remark: CFG for general probability paths. The construction ũ(x|y) = (1−w)utarget(x) + w·utarget(x|y) can be written down for any probability path, not just Gaussian ones, and at w=1 it always reduces to the true guided field. Our derivation through Bayes' rule and classifier guidance was specific to Gaussian paths and served to build intuition: there, the formula is exactly equivalent to amplifying a classifier gradient. For other paths that interpretation is only an analogy, and the formula is used as the same heuristic.

CFG Vector Field Interpolation

The guided field points toward the blue cluster, the unconditional field toward both. Drag w to see how CFG interpolates (0 ≤ w ≤ 1) or extrapolates (w>1) between them.


At w=0, the field is fully unconditional (pointing toward both clusters). At w=1, it's the exact guided field (pointing toward blue). At w>1, it overshoots — the arrows point even more aggressively toward the target. This overshoot is what produces sharper, more prompt-adherent images, but it also shrinks the region of generated outputs.

The CFG formula is: ũ(x|y) = (1−w)u(x|∅) + w·u(x|y). What happens when w=1?

Chapter 6: CFG Training

The CFG formula requires both uθt(x|y) (guided) and uθt(x|∅) (unconditional). Training two separate networks would be wasteful. The trick is to train a single network that handles both cases.

The idea is beautifully simple: during training, with some probability η, we drop the label — we replace y with a special "null" token ∅. The network then learns to handle both y=∅ (unconditional) and y=real prompt (guided) seamlessly.

Label dropping = one model, two modes. By training with random label dropping, the same network learns both the conditional and unconditional vector fields. At inference, we evaluate the network twice per step — once with y and once with ∅ — and combine them via the CFG formula.

The CFG conditional flow matching objective is:

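$$\mathcal{L}_{\text{CFM}}^{\text{CFG}}(\theta) = \mathbb{E} \left\lVert u_t^{\theta}(x \mid y) - u_t^{\text{target}}(x \mid z) \right\rVert^2$$

where the expectation is taken over (z, y) ∼ pdata(z, y), t ∼ Unif[0,1], and x ∼ pt(·|z), with y replaced by ∅ with probability η.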

Here's the full training algorithm:

python
# CFG training with label dropping
import torch

eta = 0.1  # probability of dropping the label

for z, y in dataloader:
    t = torch.rand(z.shape[0], device=z.device)
    eps = torch.randn_like(z)
    t_b = t.view(-1, 1, 1, 1)  # reshape so t broadcasts over (C, H, W)
    x_t = alpha(t_b) * z + beta(t_b) * eps
    u_target = alpha_dot(t_b) * z + beta_dot(t_b) * eps

    # KEY STEP: randomly drop labels
    mask = torch.rand(z.shape[0], device=y.device) < eta
    y_dropped = y.clone()
    y_dropped[mask] = NULL_TOKEN  # replace with empty label (a reserved class id)

    u_pred = model(x_t, t, y_dropped)
    loss = ((u_pred - u_target) ** 2).mean()
    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()
    optimizer.step()

And the inference procedure with CFG:

python
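# CFG inference (2 network evaluations per step)
# Assumes model, y, NULL_TOKEN, batch, and d are already defined.
import torch

w = 4.0            # guidance scale
n_steps = 50       # number of Euler steps
dt = 1.0 / n_steps

x = torch.randn(batch, d)  # X_0 ~ N(0, I)
with torch.no_grad():
    for i in range(n_steps):
        t = torch.full((batch,), i * dt)    # current time for every sample
        # Evaluate network twice
        u_uncond = model(x, t, NULL_TOKEN)  # unconditional
        u_cond   = model(x, t, y)           # guided

        # CFG combination
        u_cfg = (1 - w) * u_uncond + w * u_cond

        # Euler step
        x = x + dt * u_cfg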
Cost of CFG at inference. Because we evaluate the network twice per step (once with y, once with ∅), CFG roughly doubles the inference cost. This is why various "distillation" techniques have been proposed to approximate CFG with a single forward pass, but the basic two-evaluation version remains dominant.
Label Dropping During Training

Each training sample is a (z, y) pair. With probability η, the label y is replaced with ∅. Green = label kept, red = label dropped. Adjust η to see the effect.


A typical value is η = 0.1, meaning 10% of the time the label is dropped. This is enough for the network to learn a good unconditional mode while still spending most of its capacity on the conditional task.

Why does one network learn both modes? Think of it this way: when y is a real label, the network learns "given this class, where should the velocity point?" When y = ∅, the network learns "with no class information, where should the velocity point on average?" The null token ∅ is just another entry in the vocabulary: with classes {0, 1, ..., N}, the embedding table has N+2 entries (the N+1 classes plus the null token). The network doesn't know which entries are "real" classes and which is the null; it just learns the appropriate velocity for each conditioning value.

Implementation detail: what is the null token? Common choices:

• For class labels: a special integer ID (e.g., class N+1) with its own learned embedding.

• For text prompts: an empty string "" or a special [PAD] sequence that produces a fixed embedding from the frozen text encoder.

• For image conditioning: a tensor of zeros.

The computational cost of CFG. At inference, every ODE step requires two network evaluations: once with y and once with ∅. For a 50-step sampler, that's 100 network forward passes. This doubles the wall-clock time compared to vanilla guidance. Strategies to reduce this cost include:

• Guidance distillation: train a student model to produce CFG-quality outputs in one pass.

• Batched evaluation: evaluate both the conditional and unconditional inputs in a single batched forward pass (the default in practice).

• Reduced guidance steps: only apply CFG for the first N steps, then switch to vanilla guidance.

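The batched-evaluation idea can be sketched in plain Python. ToyModel below is a hypothetical stand-in for the network (it only counts calls and applies an arbitrary rule), and NULL_TOKEN = -1 is an assumed id. Note that batching reduces the number of network calls, not the amount of arithmetic, so the benefit comes from better hardware utilization. In PyTorch the same pattern would concatenate tensors along the batch dimension.

```python
NULL_TOKEN = -1  # hypothetical id for the null label

class ToyModel:
    """Stand-in for u_theta: maps a batch of (x, y) pairs to velocities and counts calls."""
    def __init__(self):
        self.calls = 0

    def __call__(self, xs, t, ys):
        self.calls += 1
        # Arbitrary toy rule: drift toward a label-dependent target.
        target = {0: -1.0, 1: 1.0, NULL_TOKEN: 0.0}
        return [target[y] - x for x, y in zip(xs, ys)]

def cfg_step_two_calls(model, xs, t, ys, w, dt):
    u_uncond = model(xs, t, [NULL_TOKEN] * len(xs))
    u_cond = model(xs, t, ys)
    return [x + dt * ((1 - w) * uu + w * uc) for x, uu, uc in zip(xs, u_uncond, u_cond)]

def cfg_step_batched(model, xs, t, ys, w, dt):
    n = len(xs)
    # One call on a doubled batch: first half unconditional, second half conditional.
    out = model(xs + xs, t, [NULL_TOKEN] * n + list(ys))
    u_uncond, u_cond = out[:n], out[n:]
    return [x + dt * ((1 - w) * uu + w * uc) for x, uu, uc in zip(xs, u_uncond, u_cond)]

xs, ys = [0.2, -0.5, 1.3], [1, 0, 1]
model_a, model_b = ToyModel(), ToyModel()
stepped_two = cfg_step_two_calls(model_a, xs, 0.0, ys, w=4.0, dt=0.02)
stepped_one = cfg_step_batched(model_b, xs, 0.0, ys, w=4.0, dt=0.02)
print(model_a.calls, model_b.calls, stepped_two == stepped_one)
```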
In CFG training, labels are randomly dropped with probability η. Why?

Chapter 7: Guidance Scale

The guidance scale w is the most important hyperparameter at inference time. It controls the tradeoff between diversity and prompt fidelity. Let's build deep intuition for what it does.

w = 0: The CFG formula becomes ũ(x|y) = u(x|∅) — purely unconditional. The model ignores the prompt entirely.

w = 1: We get ũ(x|y) = u(x|y) — vanilla guidance. The model uses the prompt but doesn't amplify it.

w > 1: The prompt signal is amplified. The model generates samples that match the prompt more strongly, at the cost of reduced diversity and potential artifacts.

Very large w: The model "over-amplifies" the prompt. Images become saturated, artifact-ridden, and identical to each other. At extreme values, the ODE becomes unstable.
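The effect of w on the samples can be reproduced on a toy problem without a trained network. The sketch below assumes two overlapping one-dimensional Gaussian classes with equal priors and the linear path x_t = t*z + (1-t)*eps, so that both the guided and the unconditional fields are known exactly; all names and values are illustrative. It integrates the CFG field with Euler steps and reports the sample mean and spread for a few guidance scales.

```python
import math
import random

MU = {0: -1.0, 1: 1.0}  # toy class means (assumption)
S = 1.0                 # shared class standard deviation (assumption)

def var_t(t):
    return (t * S) ** 2 + (1.0 - t) ** 2

def u_cond(x, t, y):
    """Exact guided velocity u_t(x|y)."""
    return ((1.0 - t) * MU[y] + (t * S * S - (1.0 - t)) * x) / var_t(t)

def u_uncond(x, t):
    """Exact unconditional velocity: posterior-weighted average over classes."""
    logits = {k: -(x - t * MU[k]) ** 2 / (2.0 * var_t(t)) for k in MU}
    m = max(logits.values())
    weights = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(weights.values())
    return sum(weights[k] / total * u_cond(x, t, k) for k in MU)

def cfg_sample(y, w, rng, n_steps=100):
    x = rng.gauss(0.0, 1.0)  # X_0 ~ N(0, 1)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        u = (1.0 - w) * u_uncond(x, t) + w * u_cond(x, t, y)  # CFG field
        x += dt * u
    return x

def summarize(y, w, n=2000, seed=0):
    rng = random.Random(seed)  # same seed, so every w starts from the same noise
    xs = [cfg_sample(y, w, rng) for _ in range(n)]
    mean = sum(xs) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in xs) / n)
    return mean, std

stats = {w: summarize(y=1, w=w) for w in (0.0, 1.0, 3.0)}
for w, (mean, std) in stats.items():
    print(f"w={w}: mean={mean:.2f}, std={std:.2f}")
```

In this toy, w=0 spreads samples over both clusters, w=1 approximately reproduces the target class, and w=3 pushes samples further toward the target side while shrinking their spread.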

CFG is a heuristic. For w ≠ 1, the CFG vector field ũt(x|y) is not the true guided vector field utargett(x|y). The generated distribution is no longer pdata(x|y); it is a sharpened version. CFG is justified entirely by its empirical success: almost every AI-generated image you've seen was produced with w > 1.
Guidance Scale: Quality vs. Diversity

Watch how the generated sample distribution changes as w increases. Low w = diverse but sometimes off-prompt. High w = on-prompt but repetitive and potentially distorted.


In the simulation above, observe how increasing w causes the sample cloud to shrink and concentrate. The "sweet spot" is typically w ∈ [2, 7], depending on the application:

| Application | Typical w | Why |
| --- | --- | --- |
| Stable Diffusion 3 | 2.0 – 5.0 | High-quality photorealistic images |
| DALL-E 3 | 3.0 – 7.5 | Creative illustration |
| Class-conditional (MNIST) | 1.5 – 4.0 | Simple, clear digit generation |
| Video generation | 4.0 – 8.0 | Temporal consistency matters more |

A mathematical perspective. Think of what CFG does to the effective distribution. If pdata(x|y) ∝ pdata(x) · p(y|x), then the CFG distribution roughly corresponds to:

$$p_{\text{CFG}}(x \mid y) \propto p_{\text{data}}(x) \cdot p(y \mid x)^{w}$$

Raising the likelihood p(y|x) to the power w sharpens the distribution around high-likelihood regions — a "temperature scaling" effect on the classifier part.
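The sharpening effect can be seen by evaluating this density on a grid. The sketch below assumes a toy one-dimensional mixture of two equal-weight Gaussians at ±1 with unit variance (values chosen for illustration) and computes the mean and standard deviation of p(x) · p(y|x)^w for the right-hand class. It describes the heuristic target density only; the samples that the CFG ODE actually produces need not follow it exactly.

```python
import math

def normal_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def sharpened_stats(w, mu=1.0, s=1.0, lo=-8.0, hi=8.0, n=4001):
    """Mean and std of the density proportional to p(x) * p(y=1|x)^w on a grid."""
    dx = (hi - lo) / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    dens = []
    for x in xs:
        p1 = 0.5 * normal_pdf(x, mu, s)    # joint density of x and class 1
        p0 = 0.5 * normal_pdf(x, -mu, s)   # joint density of x and class 0
        p_x = p0 + p1                      # unconditional density
        p_y_given_x = p1 / p_x             # classifier probability for class 1
        dens.append(p_x * p_y_given_x ** w)
    total = sum(dens)
    mean = sum(x * d for x, d in zip(xs, dens)) / total
    var = sum((x - mean) ** 2 * d for x, d in zip(xs, dens)) / total
    return mean, math.sqrt(var)

mean_1, std_1 = sharpened_stats(w=1.0)  # w=1 recovers the true conditional p(x|y=1)
mean_4, std_4 = sharpened_stats(w=4.0)
print(f"w=1: mean={mean_1:.3f}, std={std_1:.3f}")
print(f"w=4: mean={mean_4:.3f}, std={std_4:.3f}")
```

At w=1 the result matches the class itself (mean 1, standard deviation 1); at w=4 the mass shifts further from the other cluster and the spread shrinks.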

Worked example: CFG on MNIST. The book shows results for class-conditional MNIST digit generation at various guidance scales:

| w | FID (quality) | Visual effect |
| --- | --- | --- |
| 1.0 | Moderate | Digits are recognizable but some are ambiguous or poorly formed |
| 2.0 | Best | Digits are sharp, clear, and well-formed; good variety of styles |
| 4.0 | Good | Digits are very uniform; less style variety but extremely clear |
| 8.0 | Degrading | Digits start to look "over-sharpened" with artifacts around edges |

The sweet spot (w ≈ 2–4 for MNIST) balances clarity with diversity. For more complex data like ImageNet or text-to-image models, higher w values (4–7) are typically needed because the conditioning signal is weaker relative to the data complexity.

Guidance for diffusion models (SDEs). Everything we derived for flow models extends directly to diffusion models. At inference, simply replace uθt(x|y) with ũθt(x|y) in the SDE simulation. The CFG formula is the same — we just apply it to the drift term of the SDE rather than the ODE velocity. The noise term σtdWt remains unchanged.

Rewriting the CFG formula in alternative forms. The formula ũ(x|y) = (1−w)u(x|∅) + w·u(x|y) can be rearranged into a more intuitive form:

ũ(x|y) = u(x|∅) + w · [u(x|y) − u(x|∅)]

Read it as: start with the unconditional velocity, then add w times the "guidance direction" (the difference between conditional and unconditional). This form makes it clear that CFG amplifies the difference between the guided and unconditional models. When w=1, the guidance direction is added once (vanilla guidance). When w=4, it's added four times.

Negative guidance (w < 0). What if we set w negative? In the form ũ(x|y) = u(x|∅) + w · [u(x|y) − u(x|∅)], a negative w subtracts the guidance direction, which pushes samples away from the prompt. A related idea is negative prompting: keep w > 0 but replace the unconditional term u(x|∅) with a negative-prompt term u(x|yneg), so that sampling is steered toward y and away from undesired content (e.g. "blurry, low quality, text watermark").

CFG ODE Sampling: Full Simulation

Watch particles flow from noise to data with different guidance scales. Each particle starts at a random noise point and follows the CFG vector field. Higher w = particles converge more tightly to the target class.

A user generates images with guidance scale w=10 and notices severe artifacts and lack of diversity. What should they do?

Chapter 8: Connections

Guidance is the bridge between mathematical generative modeling and practical AI systems. Without it, diffusion and flow models would generate beautiful but random images. With it, they follow your instructions.

Summary of the key ideas:

| Method | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Vanilla guidance | Feed y into the network directly | Simple, exact sampling from $p(x \mid y)$ | Often poor prompt fidelity in practice |
| Classifier guidance | Train a separate classifier, scale its gradient | Can amplify prompt signal | Needs extra classifier, hard for text prompts |
| Classifier-free guidance | Train one network with label dropping, combine at inference | No extra model, works with any conditioning type | 2× inference cost, is a heuristic |

What lies ahead. Chapter 6 will show how CFG is integrated into large-scale generators like Stable Diffusion 3 and Meta Movie Gen. The architecture details (DiT blocks, VAEs, text embeddings) are all designed to work with the CFG framework we developed here. Chapter 7 will extend these ideas to discrete data, where "guidance" takes a different form.

Guidance extends to diffusion models too. Everything in this chapter applies equally to diffusion models (SDEs). Just replace the ODE simulation with SDE simulation; the CFG formula is the same. The formula itself can also be applied to any probability path, although the derivation via Bayes' rule in Chapter 5 relied on Gaussian paths.

Open research directions:

Guidance distillation: Training models that produce CFG-quality outputs in a single forward pass, halving inference cost.

Dynamic guidance schedules: Varying w over time during inference (e.g. high w early for global structure, low w late for fine detail).

Negative prompting: Using CFG with negative prompts — the unconditional term u(x|∅) is replaced with u(x|yneg) to steer away from undesired content.

The Full CFG Pipeline: Step by Step

To consolidate everything from this chapter, here is the complete pipeline for classifier-free guided generation in a single, annotated algorithm:

python
# ====== TRAINING ======
# Data: dataset of (image, label/prompt) pairs
# Model: neural network u_theta(x, t, y) -> velocity vector
# Hyperparameters: eta (label drop rate), learning rate
import torch

for epoch in range(num_epochs):
    for z, y in dataloader:
        B = z.shape[0]  # batch size

        # Step 1: Sample random time
        t = torch.rand(B, device=z.device)

        # Step 2: Sample noise and construct noisy x (alpha_t = t, beta_t = 1 - t)
        eps = torch.randn_like(z)
        x_t = t.view(-1,1,1,1) * z + (1-t).view(-1,1,1,1) * eps

        # Step 3: Compute target velocity (straight-line path)
        u_target = z - eps  # alpha_dot * z + beta_dot * eps

        # Step 4: LABEL DROPPING (the CFG trick)
        drop_mask = torch.rand(B, device=y.device) < eta
        y = y.clone()  # avoid modifying the dataloader's tensor in place
        y[drop_mask] = NULL_TOKEN

        # Step 5: Forward pass and loss
        u_pred = model(x_t, t, y)
        loss = ((u_pred - u_target)**2).mean()
        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()
        optimizer.step()

# ====== INFERENCE ======
# Choose guidance scale w (e.g. 4.0)
# Choose prompt y (e.g. "a corgi on a beach")

x = torch.randn(B, C, H, W)      # Pure noise
with torch.no_grad():
    for i in range(n_steps):
        t = torch.full((B,), i / n_steps)  # current time for every sample
        # TWO forward passes per step:
        u_uncond = model(x, t, NULL_TOKEN)
        u_cond   = model(x, t, y)
        # CFG combination:
        u = (1 - w) * u_uncond + w * u_cond
        x = x + (1.0/n_steps) * u
# x is now the generated image!

That's the entire algorithm. Everything in this chapter — Bayes' rule decomposition, classifier guidance, the CFG formula, label dropping — distills down to: (1) randomly drop labels during training, (2) evaluate the network twice at inference and mix the outputs.

Summary of Key Equations

For reference, here are the essential equations from this chapter in one place:

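| Name | Equation | Where used |
| --- | --- | --- |
| Guided CFM loss | $\mathbb{E} \lVert u_t^{\theta}(x \mid y) - u_t^{\text{target}}(x \mid z) \rVert^2$ | Training (vanilla) |
| Bayes' rule on score | $\nabla \log p_t(x \mid y) = \nabla \log p_t(x) + \nabla \log p_t(y \mid x)$ | Classifier guidance derivation |
| Classifier guidance | $\tilde{u}_t(x \mid y) = u_t^{\text{target}}(x) + w \cdot a_t \nabla \log p_t(y \mid x)$ | Inference (requires classifier) |
| CFG formula | $\tilde{u}_t(x \mid y) = (1-w)\, u_t^{\theta}(x \mid \varnothing) + w \cdot u_t^{\theta}(x \mid y)$ | Inference (no classifier needed) |
| CFG training loss | $\mathbb{E} \lVert u_t^{\theta}(x \mid y) - u_t^{\text{target}}(x \mid z) \rVert^2$ with $y \to \varnothing$ with probability $\eta$ | Training (with label dropping) |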

The logical flow: Vanilla guidance (Ch 1–2) → prompt problem motivates amplification (Ch 3) → Bayes' rule decomposes into unconditional + classifier (Ch 4) → scale up classifier = classifier guidance → eliminate classifier using Bayes' rule again = CFG (Ch 5) → label dropping trains one model for both modes (Ch 6) → w controls fidelity/diversity (Ch 7).

Key historical context. Classifier guidance was introduced by Dhariwal & Nichol (2021) in their paper "Diffusion Models Beat GANs on Image Synthesis." It required training a separate noisy classifier alongside the diffusion model. Classifier-free guidance was introduced by Ho & Salimans (2022), who recognized that the same effect could be achieved within a single model using label dropping. This simplification was a major breakthrough: it made guidance practical for arbitrary conditioning types (text, images, audio) without needing specialized classifiers.

CFG in modern systems. As of 2026, classifier-free guidance is used in nearly all production-grade image and video generators:

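| System | Year | CFG w range | Drop rate η |
| --- | --- | --- | --- |
| DALL-E 2 | 2022 | 2.0 – 4.0 | 0.1 |
| Stable Diffusion 1.5 | 2022 | 7.0 – 12.0 | 0.1 |
| Imagen | 2022 | 4.0 – 10.0 | 0.1 |
| Stable Diffusion 3 | 2024 | 2.0 – 5.0 | 0.1 |
| FLUX | 2024 | 1.5 – 4.0 | 0.1 |
| Movie Gen Video | 2024 | 4.0 – 8.0 | 0.1 |
| VEO-3 | 2025 | 3.0 – 7.0 | 0.1 |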

Notice that different systems use different w ranges. Earlier models (SD 1.5) needed very high w (7–12) because their conditioning was weaker. Later models with better text encoders and architectures (SD3, FLUX) can achieve strong prompt adherence with lower w values (2–5), producing more natural-looking outputs.

Advanced Topics in Guidance

Dynamic guidance schedules. Instead of using a fixed w for all timesteps, some systems vary w over the ODE trajectory. A common approach is to use high w early (for global composition) and low w late (for fine detail):

python
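import torch

# Dynamic guidance schedule: linearly anneal w from w_max (t=0) to w_min (t=1).
# Assumes model, x, y, NULL_TOKEN and n_steps are defined as in the sampling loop above.
def guidance_schedule(t, w_max=7.0, w_min=1.5):
    # High guidance early (global structure)
    # Low guidance late (fine details)
    return w_max + (w_min - w_max) * t

dt = 1.0 / n_steps
for i in range(n_steps):
    t = i / n_steps
    w = guidance_schedule(t)
    t_in = torch.tensor([t])
    # Two forward passes per step, exactly as with a fixed w
    u_uncond = model(x, t_in, NULL_TOKEN)
    u_cond   = model(x, t_in, y)
    u_cfg = (1 - w) * u_uncond + w * u_cond
    x = x + dt * u_cfg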

This often produces better results than a fixed w because early steps determine the broad composition (where objects are, overall color scheme) while late steps handle textures and fine patterns that benefit from more diversity.

Multi-prompt guidance. CFG can be extended to multiple prompts simultaneously. For example, guide toward "a corgi" AND "on a beach" AND "photorealistic" by combining multiple guidance terms:

$$\tilde{u}(x \mid y_1, y_2) = u(x \mid \varnothing) + w_1\,[u(x \mid y_1) - u(x \mid \varnothing)] + w_2\,[u(x \mid y_2) - u(x \mid \varnothing)]$$

Each prompt has its own guidance scale, allowing fine-grained control over how strongly each aspect of the prompt is enforced.
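The formula above generalizes directly to any number of prompts. The sketch below is a minimal illustration: the function name `multi_prompt_cfg` and the 2-D velocity values are made up for the example, and plain Python lists stand in for the tensors a real model would return.

```python
def multi_prompt_cfg(u_uncond, u_conds, weights):
    """Combine one unconditional velocity with several prompt-conditioned ones.

    Computes u(x|∅) + sum_k w_k * [u(x|y_k) - u(x|∅)] componentwise.
    """
    if len(u_conds) != len(weights):
        raise ValueError("need exactly one weight per prompt")
    out = list(u_uncond)
    for u_k, w_k in zip(u_conds, weights):
        for d in range(len(out)):
            out[d] += w_k * (u_k[d] - u_uncond[d])
    return out


# Hypothetical 2-D velocities, for illustration only.
u_null  = [1.0, 0.5]    # u(x|∅)
u_corgi = [3.0, -1.0]   # u(x|y1)
u_beach = [0.0, 2.0]    # u(x|y2)

u = multi_prompt_cfg(u_null, [u_corgi, u_beach], weights=[3.0, 2.0])
print(u)
```

With a single prompt this reduces to the ordinary CFG formula. Note the cost: K prompts require K + 1 forward passes per ODE step (one per prompt plus the unconditional one).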

GLASS Flows (Holderrieth et al., 2025). A recent extension called GLASS (Generalized Learning of Aligned Score and Sampling) addresses the fact that CFG with w≠1 produces samples from a different distribution than pdata(x|y). GLASS introduces a principled way to sample from a target distribution that is close to the CFG distribution but has better theoretical properties, including improved diversity at high guidance scales.

Deriving CFG from First Principles: Complete Algebraic Steps

For completeness, let's write out the entire derivation of CFG without skipping any steps. We want to get from classifier guidance to the classifier-free formula.

Start: Classifier guidance with scale w:

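$$\tilde{u}(x \mid y) = u_{\text{target}}(x) + w\, a_t \nabla \log p_t(y \mid x) \qquad (1)$$

Apply Bayes' rule to replace the classifier gradient. Since $\log p_t(y \mid x) = \log p_t(x \mid y) + \log p_t(y) - \log p_t(x)$ and $\log p_t(y)$ does not depend on $x$, its gradient with respect to $x$ vanishes:

$$\nabla \log p_t(y \mid x) = \nabla \log p_t(x \mid y) - \nabla \log p_t(x) \qquad (2)$$

Substitute (2) into (1):

$$\tilde{u}(x \mid y) = u_{\text{target}}(x) + w\, a_t \left[\nabla \log p_t(x \mid y) - \nabla \log p_t(x)\right] \qquad (3)$$

Expand $u_{\text{target}}(x) = a_t \nabla \log p_t(x) + b_t x$:

$$\tilde{u}(x \mid y) = a_t \nabla \log p_t(x) + b_t x + w\, a_t \nabla \log p_t(x \mid y) - w\, a_t \nabla \log p_t(x) \qquad (4)$$

Group the $\nabla \log p_t(x)$ terms:

$$\tilde{u}(x \mid y) = (1 - w)\, a_t \nabla \log p_t(x) + b_t x + w\, a_t \nabla \log p_t(x \mid y) \qquad (5)$$

Recognize the vector fields:

$$u_{\text{target}}(x) = a_t \nabla \log p_t(x) + b_t x$$
$$u_{\text{target}}(x \mid y) = a_t \nabla \log p_t(x \mid y) + b_t x$$

Add and subtract $w\, b_t x$ in (5), i.e. write $b_t x = (1 - w)\, b_t x + w\, b_t x$:

$$\tilde{u}(x \mid y) = (1 - w)\left[a_t \nabla \log p_t(x) + b_t x\right] + w \left[a_t \nabla \log p_t(x \mid y) + b_t x\right] \qquad (6)$$

Final result:

$$\tilde{u}(x \mid y) = (1 - w)\, u_{\text{target}}(x) + w\, u_{\text{target}}(x \mid y) \qquad (7)$$ ✓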

Every step is elementary algebra. The key insight is step (2): using Bayes' rule to replace the classifier gradient with the difference of two score functions, both of which are already learned by the generative model.
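A quick numerical sanity check of the algebra: the sketch below draws arbitrary scalar stand-ins for the coefficients and the two scores at a single (x, t), then confirms that the classifier-guidance form (3) and the classifier-free form (7) give the same number. All values are random placeholders chosen for illustration; nothing here comes from a trained model.

```python
import random

rng = random.Random(0)

# Arbitrary stand-ins for the quantities at one (x, t); 1-D for simplicity.
a_t = rng.uniform(0.1, 2.0)
b_t = rng.uniform(-1.0, 1.0)
w = 3.0
x = rng.gauss(0.0, 1.0)
score_uncond = rng.gauss(0.0, 1.0)  # stands in for ∇ log p_t(x)
score_cond = rng.gauss(0.0, 1.0)    # stands in for ∇ log p_t(x|y)

# The two vector fields, written in score form.
u_uncond = a_t * score_uncond + b_t * x
u_cond = a_t * score_cond + b_t * x

# Equation (3): classifier guidance after the Bayes' rule substitution.
lhs = u_uncond + w * a_t * (score_cond - score_uncond)
# Equation (7): classifier-free guidance.
rhs = (1 - w) * u_uncond + w * u_cond

print(abs(lhs - rhs) < 1e-12)
```

The b_t x term cancels out of the comparison, which is why the identity holds for any choice of a_t and b_t.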

Numerical Example of CFG

Let's compute a concrete CFG velocity. Suppose at some point x at time t:

u(x|y) = [3.0, −1.0]   (guided velocity: points toward the target)
u(x|∅) = [1.0, 0.5]   (unconditional velocity: points toward average)

At w=1 (vanilla guidance):

ũ = (1−1)[1.0, 0.5] + 1·[3.0, −1.0] = [3.0, −1.0]

At w=3 (amplified):

ũ = (1−3)[1.0, 0.5] + 3·[3.0, −1.0] = [−2.0, −1.0] + [9.0, −3.0] = [7.0, −4.0]

The amplified velocity [7.0, −4.0] is much more aggressive — it points strongly toward the target, overshooting the vanilla guidance direction. The magnitude has increased from 3.16 to 8.06, and the direction has shifted away from the unconditional "average" direction.

Alternative form: ũ = u(x|∅) + w·[u(x|y) − u(x|∅)] = [1.0, 0.5] + 3·[2.0, −1.5] = [1.0, 0.5] + [6.0, −4.5] = [7.0, −4.0] ✓

At w=0 (unconditional):

ũ = (1−0)[1.0, 0.5] + 0·[3.0, −1.0] = [1.0, 0.5]

The prompt is completely ignored. The velocity is purely unconditional.

At w=−1 (negative guidance / anti-prompt):

ũ = (1−(−1))[1.0, 0.5] + (−1)·[3.0, −1.0] = [2.0, 1.0] + [−3.0, 1.0] = [−1.0, 2.0]

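The velocity now points away from the prompt target. This is the intuition behind "negative prompting" (steering away from unwanted content), although in practice negative prompts are usually implemented by placing the unwanted prompt in the unconditional slot u(x|∅) with w > 1 rather than by setting w < 0.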

These numerical examples make the CFG formula concrete. The key pattern: w scales the difference between guided and unconditional velocities. At w=1, you add the difference once (vanilla). At w=3, you add it three times (amplified). At w=−1, you subtract it (anti-guidance).
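The four cases above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper name `cfg` is an assumption for the example, and lists stand in for the velocity tensors.

```python
import math


def cfg(u_uncond, u_cond, w):
    """Classifier-free guidance: (1 - w) * u(x|∅) + w * u(x|y), componentwise."""
    return [(1 - w) * a + w * b for a, b in zip(u_uncond, u_cond)]


u_cond = [3.0, -1.0]    # u(x|y), guided velocity from the example
u_uncond = [1.0, 0.5]   # u(x|∅), unconditional velocity from the example

for w in (1, 3, 0, -1):
    u = cfg(u_uncond, u_cond, w)
    print(f"w={w:>2}: u={u}, |u|={math.hypot(*u):.2f}")
```

Changing `w` in the loop is an easy way to see how the magnitude and direction of the combined velocity respond to the guidance scale.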

"The best guidance is one that amplifies the signal without destroying the distribution." — Ho & Salimans, 2022

What is the core advantage of classifier-free guidance over classifier guidance?