An unconditional model generates some image. Guidance makes it generate the image you asked for.
You type "a corgi wearing sunglasses on a beach" into an image generator. A few seconds later, a photo-realistic corgi appears, exactly matching your description. But in Chapter 3, we learned to train flow models that sample from pdata(x) — the unconditional distribution of all images. An unconditional model might give you a car, a cat, a sunset, anything. How do we make it listen to a prompt?
The answer is guidance: we modify the generative process so that instead of sampling from the full data distribution pdata(x), we sample from a conditional distribution pdata(x|y), where y is a text prompt, class label, or any conditioning variable. The key insight of this chapter is that there are two fundamentally different ways to achieve this — and the one that works best in practice (classifier-free guidance) is actually a heuristic, not a mathematically exact procedure.
Let's see the problem visually. In the simulation below, an unconditional model generates points from a mixture of two Gaussians (blue cluster and orange cluster). We want to generate points from only one of the clusters. That's guidance.
Click "Generate" to see unconditional sampling (both clusters). Then select a class to see guided sampling (one cluster only).
When you select a class, you see samples cluster in only one region. The guided model learned to associate each class label with a specific part of the distribution. This is exactly what happens in large-scale image generators — except the "class" is a rich text prompt and the distribution lives in a million-dimensional pixel space.
The simplest approach to guidance is almost embarrassingly straightforward: just give the prompt to the neural network as an extra input. This is called vanilla guidance.
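Recall from Chapter 3 that an unconditional flow model has a neural network uθt(x) that takes a noisy point x and time t, and outputs a velocity vector. For vanilla guidance, we simply add the prompt y as a third input:

$$u^\theta : \mathbb{R}^d \times \mathcal{Y} \times [0,1] \to \mathbb{R}^d, \qquad (x, y, t) \mapsto u^\theta_t(x \mid y)$$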
The network now takes three things: the noisy image x, the conditioning variable y (from some space Y), and the time t. It outputs a velocity vector uθt(x|y) ∈ Rd.
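At inference time, sampling works exactly as before, but with y provided at every step:

$$X_0 \sim \mathcal{N}(0, I_d), \qquad \frac{\mathrm{d}}{\mathrm{d}t} X_t = u^\theta_t(X_t \mid y), \qquad t \in [0, 1]$$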
The ODE is identical to the unconditional case from Chapter 2 — the only difference is that the neural network now also sees y at every evaluation. Think of y as a steering wheel: the vector field changes direction depending on what prompt you provide.
This shows the velocity field for a mixture of two Gaussians. Toggle the class to see how the vector field changes. With y="blue", vectors point toward the blue cluster. With y="orange", they point toward the orange cluster.
Notice how the arrows in the vector field change direction depending on which class you select. With no guidance, arrows point toward both clusters. With guidance, they converge toward just one — the network has learned to route different prompts to different parts of the data distribution.
Worked example: class-conditional generation. Suppose we have 10 classes (digits 0–9). The conditioning space is Y = {0, 1, ..., 9}. When we want to generate a "7", we set y=7 and pass it to the network at every ODE step. The network has learned that y=7 means "converge toward the region of image space where sevens live."
How does the network "see" y? For class labels, y is typically converted to a learned embedding vector yemb ∈ Rd via an embedding table (just like word embeddings in NLP). For text prompts, y is processed by a pretrained text encoder like CLIP. We'll cover this in detail in Chapter 6.
```python
import torch

# Vanilla guided inference (class-conditional)
@torch.no_grad()                            # sampling needs no gradients
def guided_sample(model, class_label, n_steps=50):
    x = torch.randn(1, 3, 32, 32)           # X_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor([i * dt])
        # Network sees x, t, AND the class label y
        velocity = model(x, t, class_label)
        x = x + dt * velocity               # Euler step
    return x                                # X_1 ~ p_data(x | y=class_label), approximately
```
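To watch the sampling loop above do something concrete without a trained network, here is a minimal pure-Python sketch. It assumes a deliberately tiny setting that is not from the text: the data are one-dimensional, each class is a single point (`class_means` is a made-up dictionary), and the path is the straight line x_t = t·z + (1−t)·ε. Under those assumptions the guided velocity has the closed form (z − x)/(1 − t), so the same Euler loop can be run exactly.

```python
def toy_velocity(x, t, y, class_means):
    """Guided velocity when class y is a single point at class_means[y].

    For the path x_t = t*z + (1-t)*eps with z fixed, the velocity that
    transports noise to z is (z - x) / (1 - t).
    """
    return (class_means[y] - x) / (1.0 - t)


def guided_sample_toy(x0, y, class_means, n_steps=50):
    """Euler integration of dx/dt = u_t(x | y) from t=0 to t=1."""
    x = x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # t stays strictly below 1, so 1 - t never reaches zero
        x = x + dt * toy_velocity(x, t, y, class_means)
    return x


class_means = {"blue": -2.0, "orange": 2.0}  # hypothetical class locations
start = 0.3                                  # one fixed "noise" draw, reused for both prompts
x_blue = guided_sample_toy(start, "blue", class_means)
x_orange = guided_sample_toy(start, "orange", class_means)
print(x_blue, x_orange)
```

The same starting point ends up at a different location depending only on y, which is the "steering wheel" behaviour described earlier. A real model replaces `toy_velocity` with a learned network.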
How do we train the guided network uθt(x|y)? The answer builds directly on the conditional flow matching (CFM) loss from Chapter 3. We simply add y to the picture.
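Recall the unconditional CFM loss from Chapter 3:

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{z \sim p_{\mathrm{data}},\; t \sim \mathrm{Unif}[0,1],\; x \sim p_t(\cdot \mid z)} \left\| u^\theta_t(x) - u^{\mathrm{target}}_t(x \mid z) \right\|^2$$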
For the guided version, we make two changes:
Change 1: Sample pairs (z, y) from the data distribution. Instead of just sampling images z, we now sample an image z together with its associated label/prompt y. In a PyTorch dataloader, each batch returns both the image tensor and the conditioning information.
Change 2: Feed y into the neural network. The network gets y as input and outputs a velocity conditioned on that prompt.
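This gives us the guided conditional flow matching loss:

$$\mathcal{L}^{\mathrm{guided}}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{(z,y) \sim p_{\mathrm{data}}(z,y),\; t \sim \mathrm{Unif}[0,1],\; x \sim p_t(\cdot \mid z)} \left\| u^\theta_t(x \mid y) - u^{\mathrm{target}}_t(x \mid z) \right\|^2$$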
Let's write this out as a training algorithm:
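```python
# Guided Conditional Flow Matching training loop
# Assumes model, optimizer, dataloader and the schedule functions
# alpha, beta, alpha_dot, beta_dot are defined elsewhere.
import torch

for z, y in dataloader:                                 # (z, y) pairs from dataset
    t = torch.rand(z.shape[0], device=z.device)         # random time per sample
    eps = torch.randn_like(z)                           # Gaussian noise
    tb = t.view(-1, 1, 1, 1)                            # reshape so t broadcasts over (B, C, H, W)

    # Construct noisy point on probability path
    x_t = alpha(tb) * z + beta(tb) * eps
    # Target velocity (straight line: alpha_dot * z + beta_dot * eps)
    u_target = alpha_dot(tb) * z + beta_dot(tb) * eps

    # Network prediction, conditioned on y
    u_pred = model(x_t, t, y)

    # MSE loss
    loss = ((u_pred - u_target) ** 2).mean()
    optimizer.zero_grad()                               # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```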
The only difference from unconditional training is on the line u_pred = model(x_t, t, y) — we pass y to the network. Everything else is identical.
Worked example: shapes of the tensors. Let's trace the exact data flow for training on 32×32 images with 10 classes:
| Variable | Shape | Source |
|---|---|---|
| z (clean image) | (B, 3, 32, 32) | Dataloader |
| y (class label) | (B,) integers in {0,...,9} | Dataloader |
| t (time) | (B,) floats in [0,1] | Random |
| ε (noise) | (B, 3, 32, 32) | N(0, I) |
| x (noisy image) | (B, 3, 32, 32) | αtz + βtε |
| utarget (target velocity) | (B, 3, 32, 32) | α̇tz + β̇tε |
| uθ(x|y) (prediction) | (B, 3, 32, 32) | Network forward pass |
Notice that the target velocity utarget is computed from z and ε only — it doesn't involve y at all. The label y enters only through the network's forward pass.
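The point that y never enters the target can be checked with a few lines of plain Python. This is only a sketch: it assumes the straight-line schedule αt = t, βt = 1−t used in the chapter's end-to-end listing, and it uses short lists of floats as stand-ins for image tensors. The helper name `noisy_point_and_target` is made up for illustration.

```python
import random


def noisy_point_and_target(z, eps, t):
    """Return (x_t, u_target) for the straight-line path alpha_t=t, beta_t=1-t.

    Note the signature: no label y is needed to build either quantity.
    """
    x_t = [t * zi + (1.0 - t) * ei for zi, ei in zip(z, eps)]
    u_target = [zi - ei for zi, ei in zip(z, eps)]  # alpha_dot*z + beta_dot*eps
    return x_t, u_target


rng = random.Random(0)
z = [rng.uniform(-1.0, 1.0) for _ in range(4)]   # stand-in for a clean image
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]    # Gaussian noise
t = 0.3

x_t, u_target = noisy_point_and_target(z, eps, t)

# Following the target velocity for the remaining time (1 - t) lands on z.
endpoint = [xi + (1.0 - t) * ui for xi, ui in zip(x_t, u_target)]
max_error = max(abs(a - b) for a, b in zip(endpoint, z))
print(max_error)
```

The label only decides which network output is compared against this target; the regression target itself is the same as in unconditional training.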
In theory, vanilla guidance should work perfectly: the network learns $u^{\mathrm{target}}_t(x \mid y)$ and generates samples from $p_{\mathrm{data}}(x \mid y)$. In practice, something goes wrong. Generated images often don't match the prompt well enough. A model asked to generate a "corgi" might produce something vaguely dog-like but not clearly a corgi. Why?
There are several reasons this happens:
1. Underfitting. The model might not have enough capacity or training time to faithfully learn the true conditional vector field. Neural networks are imperfect approximators: the learned $u^\theta_t(x \mid y)$ is only an approximation of the true $u^{\mathrm{target}}_t(x \mid y)$.
2. Noisy data pairings. Real-world datasets like LAION (text-image pairs scraped from the web) have many mismatched pairs. An image of a cat might be labeled "my furry friend" — the conditioning signal is weak and noisy.
3. The diversity problem. Even with perfect training, the conditional distribution pdata(x|y = "dog") contains enormous variety: different breeds, poses, backgrounds, lighting, art styles. The model samples from this full conditional distribution, which often means the samples are "correct" but not strongly reflective of the prompt.
This is the motivation for classifier-free guidance (CFG): a controlled way to trade diversity for prompt adherence. Before we get there, we need to understand its predecessor, classifier guidance.
Quantifying the problem. Researchers measure two competing metrics:
• FID (Fréchet Inception Distance): measures how similar the generated distribution is to the real data distribution. Lower = better. Penalizes both poor quality and low diversity.
• CLIP score: measures how well each generated image matches its text prompt, using the CLIP embedding similarity. Higher = better prompt adherence.
With vanilla guidance (w=1), FID is often good (the distribution is correct) but CLIP scores are mediocre. The whole point of guidance scaling is to push CLIP scores higher, even if it costs some FID.
Each dot is a generated sample. The orange target region represents "ideal" samples for the prompt. Drag the guidance scale w to see how increasing it concentrates samples near the target but reduces diversity.
At w=1 (vanilla guidance, no amplification), samples spread across the full conditional distribution. As w increases, they collapse toward the prompt target. But notice: at very high w, the samples lose variety and all look nearly identical. This is the fundamental tradeoff.
The first approach to boosting prompt fidelity was classifier guidance. It starts from a simple idea: decompose the guided vector field into an unconditional part plus a classifier gradient, then scale up the classifier part.
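Recall from Chapter 4 that for Gaussian probability paths, the vector field can be written in terms of the score function (Proposition 1):

$$u^{\mathrm{target}}_t(x \mid y) = a_t \nabla \log p_t(x \mid y) + b_t x$$

Now comes the key move. We apply Bayes' rule to the score:

$$\nabla \log p_t(x \mid y) = \nabla \log p_t(x) + \nabla \log p_t(y \mid x)$$

Let's derive this step by step. Starting from Bayes' theorem:

$$p_t(x \mid y) = \frac{p_t(x)\, p_t(y \mid x)}{p_t(y)}$$

Take the log of both sides:

$$\log p_t(x \mid y) = \log p_t(x) + \log p_t(y \mid x) - \log p_t(y)$$

Take the gradient ∇ with respect to x. Since pt(y) doesn't depend on x, its gradient vanishes:

$$\nabla \log p_t(x \mid y) = \nabla \log p_t(x) + \nabla \log p_t(y \mid x)$$

Substituting back into the vector field decomposition:

$$u^{\mathrm{target}}_t(x \mid y) = a_t \left[ \nabla \log p_t(x) + \nabla \log p_t(y \mid x) \right] + b_t x = u^{\mathrm{target}}_t(x) + a_t \nabla \log p_t(y \mid x)$$

The guided vector field = the unconditional vector field + a classifier gradient. To amplify prompt fidelity, we scale up the classifier term with a guidance scale w > 1:

$$\tilde{u}_t(x \mid y) = u^{\mathrm{target}}_t(x) + w\, a_t \nabla \log p_t(y \mid x)$$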
At w=1, this is the true guided vector field. At w>1, we over-emphasize the classifier, pushing generated samples to match the prompt more strongly — at the cost of moving away from the true data distribution.
To use this in practice, we need to train a classifier pt(y|x) on noisy data (since x = αtz + βtε is noisy at intermediate times). This classifier tells us: given this noisy image x at time t, what label y is most likely?
Worked example: what does scaled classifier guidance look like? Consider a Gaussian mixture with two classes (blue and orange). The unconditional vector field utarget(x) points toward the average of both clusters. The classifier gradient ∇ log p(y="blue"|x) points toward the blue cluster specifically. The guided field combines both:

ũ(x|y="blue") = utarget(x) + w·at∇ log p(y="blue"|x)
At w=1, we get the true guided field (exactly right). At w=3, the classifier gradient is tripled — the field points much more aggressively toward blue, producing samples that are clearly blue but with less diversity. At w=10, nearly all samples collapse to the center of the blue cluster.
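The worked example can be reproduced numerically in one dimension. The sketch below makes several simplifying assumptions that are not in the text: the cluster centers in `MEANS` are made up, both clusters have unit variance and equal prior, the densities are those of clean data rather than a noisy pt, and everything is done at the level of scores (the factors at and bt are left out). It checks Bayes' rule on the score and then scales the classifier term by w.

```python
import math

# Hypothetical cluster centers; both clusters have unit variance and equal prior.
MEANS = {"blue": -1.5, "orange": 1.5}


def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def posterior(x, y):
    """Exact classifier p(y | x) for the mixture."""
    total = sum(normal_pdf(x, m) for m in MEANS.values())
    return normal_pdf(x, MEANS[y]) / total


def score_uncond(x):
    """d/dx log p(x): posterior-weighted pull toward each cluster."""
    return sum(posterior(x, k) * (MEANS[k] - x) for k in MEANS)


def score_cond(x, y):
    """d/dx log p(x | y) for a unit-variance Gaussian."""
    return MEANS[y] - x


def classifier_grad(x, y, h=1e-5):
    """d/dx log p(y | x), estimated by central finite differences."""
    return (math.log(posterior(x + h, y)) - math.log(posterior(x - h, y))) / (2.0 * h)


def guided_score(x, y, w):
    """Unconditional score plus w times the classifier gradient."""
    return score_uncond(x) + w * classifier_grad(x, y)


x = 0.4
print("conditional score:       ", score_cond(x, "blue"))
print("uncond + classifier grad:", guided_score(x, "blue", w=1.0))
for w in (1.0, 3.0, 10.0):
    print(f"w={w}: guided score = {guided_score(x, 'blue', w):.3f}")
```

At w=1 the two printed scores agree up to finite-difference error. For larger w the guided score at this point becomes more negative, meaning a stronger pull toward the blue cluster on the left.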
Classifier guidance requires a separate classifier. Classifier-free guidance (CFG) achieves the same amplification effect using only the generative model itself — no extra network needed. This is the method used by virtually every modern image and video generator.
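The derivation is elegant. Start from the classifier guidance formula:

$$\tilde{u}_t(x \mid y) = u^{\mathrm{target}}_t(x) + w\, a_t \nabla \log p_t(y \mid x)$$

We want to eliminate the classifier gradient ∇ log pt(y|x). From Bayes' rule (Chapter 4), we know:

$$\nabla \log p_t(y \mid x) = \nabla \log p_t(x \mid y) - \nabla \log p_t(x)$$

Substitute this into the classifier guidance formula:

$$\tilde{u}_t(x \mid y) = u^{\mathrm{target}}_t(x) + w\, a_t \left[ \nabla \log p_t(x \mid y) - \nabla \log p_t(x) \right]$$

Now, recall that $u^{\mathrm{target}}_t(x) = a_t \nabla \log p_t(x) + b_t x$ (score-velocity relationship). Similarly, $u^{\mathrm{target}}_t(x \mid y) = a_t \nabla \log p_t(x \mid y) + b_t x$. Let's expand and simplify:

$$\tilde{u}_t(x \mid y) = a_t \nabla \log p_t(x) + b_t x + w\, a_t \nabla \log p_t(x \mid y) - w\, a_t \nabla \log p_t(x)$$

Group the terms with ∇ log pt(x) and ∇ log pt(x|y):

$$\tilde{u}_t(x \mid y) = (1 - w)\, a_t \nabla \log p_t(x) + w\, a_t \nabla \log p_t(x \mid y) + b_t x$$

Recognize that $a_t \nabla \log p_t(x) + b_t x = u^{\mathrm{target}}_t(x)$ and $a_t \nabla \log p_t(x \mid y) + b_t x = u^{\mathrm{target}}_t(x \mid y)$. Add and subtract $w\, b_t x$:

$$\tilde{u}_t(x \mid y) = (1 - w) \left[ a_t \nabla \log p_t(x) + b_t x \right] + w \left[ a_t \nabla \log p_t(x \mid y) + b_t x \right] = (1 - w)\, u^{\mathrm{target}}_t(x) + w\, u^{\mathrm{target}}_t(x \mid y)$$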
This is remarkable. We don't need a classifier at all! We just need two things: an unconditional model uθt(x|∅) and a guided model uθt(x|y). And as we'll see next, we can train both in a single network.
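Because the step from classifier guidance to CFG is pure algebra, it can be sanity-checked with arbitrary numbers. The sketch below is only an illustration: the function names and the scalar values for the scores, at, bt and x are made up, and the classifier gradient is written as the difference of the two scores, as Bayes' rule allows.

```python
def classifier_guidance_velocity(x, score_uncond, score_cond, a_t, b_t, w):
    """u(x) + w * a_t * grad log p(y|x), with the classifier gradient
    written as score_cond - score_uncond (Bayes' rule)."""
    u_uncond = a_t * score_uncond + b_t * x
    return u_uncond + w * a_t * (score_cond - score_uncond)


def cfg_velocity(x, score_uncond, score_cond, a_t, b_t, w):
    """(1 - w) * u(x) + w * u(x|y), built from the two vector fields."""
    u_uncond = a_t * score_uncond + b_t * x
    u_cond = a_t * score_cond + b_t * x
    return (1.0 - w) * u_uncond + w * u_cond


# Arbitrary made-up numbers: the identity is algebraic, so any values work.
args = dict(x=0.7, score_uncond=-0.2, score_cond=1.3, a_t=0.5, b_t=2.0)
for w in (0.0, 1.0, 4.0):
    print(w, classifier_guidance_velocity(w=w, **args), cfg_velocity(w=w, **args))
```

Both columns agree for every w, which is why a model that can produce the unconditional and the guided velocity never needs a separate classifier.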
What does w=0 mean? At w=0: ũ(x|y) = u(x|∅). The model completely ignores the prompt and generates unconditionally. Between w=0 and w=1, we interpolate between unconditioned and conditioned generation. Beyond w=1, we extrapolate, amplifying the prompt signal beyond what the true conditional distribution dictates.
Remark: CFG for general probability paths. The construction ũ(x|y) = (1−w)utarget(x) + w·utarget(x|y) is valid for any probability path, not just Gaussian ones. Our derivation through Bayes' rule and classifier guidance was specific to Gaussian paths (to build intuition), but the final CFG formula applies universally. This is because the formula is equivalent to amplifying a hypothetical classifier regardless of the path choice.
The guided field points toward the blue cluster, the unconditional field toward both. Drag w to see how CFG interpolates (0 ≤ w ≤ 1) or extrapolates (w > 1) between them.
At w=0, the field is fully unconditional (pointing toward both clusters). At w=1, it's the exact guided field (pointing toward blue). At w>1, it overshoots — the arrows point even more aggressively toward the target. This overshoot is what produces sharper, more prompt-adherent images, but it also shrinks the region of generated outputs.
The CFG formula requires both uθt(x|y) (guided) and uθt(x|∅) (unconditional). Training two separate networks would be wasteful. The trick is to train a single network that handles both cases.
The idea is beautifully simple: during training, with some probability η, we drop the label — we replace y with a special "null" token ∅. The network then learns to handle both y=∅ (unconditional) and y=real prompt (guided) seamlessly.
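The CFG conditional flow matching objective is:

$$\mathcal{L}^{\mathrm{CFG}}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{\square} \left\| u^\theta_t(x \mid y) - u^{\mathrm{target}}_t(x \mid z) \right\|^2$$

where the subscript □ on the expectation stands for: (z,y) ∼ pdata(z,y), t ∼ Unif[0,1], x ∼ pt(·|z), and y is replaced with ∅ with probability η.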
Here's the full training algorithm:
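```python
# CFG training with label dropping
# Assumes model, optimizer, dataloader, NULL_TOKEN and the schedule
# functions alpha, beta, alpha_dot, beta_dot are defined elsewhere.
import torch

eta = 0.1  # probability of dropping the label

for z, y in dataloader:
    t = torch.rand(z.shape[0], device=z.device)
    eps = torch.randn_like(z)
    tb = t.view(-1, 1, 1, 1)                            # broadcast t over (B, C, H, W)
    x_t = alpha(tb) * z + beta(tb) * eps
    u_target = alpha_dot(tb) * z + beta_dot(tb) * eps

    # KEY STEP: randomly drop labels
    mask = torch.rand(z.shape[0], device=y.device) < eta
    y_dropped = y.clone()
    y_dropped[mask] = NULL_TOKEN                        # replace with empty label

    u_pred = model(x_t, t, y_dropped)
    loss = ((u_pred - u_target) ** 2).mean()
    optimizer.zero_grad()                               # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```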
And the inference procedure with CFG:
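```python
# CFG inference (2 network evaluations per step)
# Assumes model, NULL_TOKEN, the prompt y, and the sizes batch and d are defined elsewhere.
import torch

w = 4.0                                     # guidance scale
n_steps = 50                                # number of Euler steps
dt = 1.0 / n_steps
x = torch.randn(batch, d)                   # X_0 ~ N(0, I)

with torch.no_grad():
    for i in range(n_steps):
        t = torch.full((batch,), i * dt)
        # Evaluate network twice
        u_uncond = model(x, t, NULL_TOKEN)  # unconditional
        u_cond = model(x, t, y)             # guided
        # CFG combination
        u_cfg = (1 - w) * u_uncond + w * u_cond
        # Euler step
        x = x + dt * u_cfg
```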
Each training sample is a (z, y) pair. With probability η, the label y is replaced with ∅. Green = label kept, red = label dropped. Adjust η to see the effect.
A typical value is η = 0.1, meaning 10% of the time the label is dropped. This is enough for the network to learn a good unconditional mode while still spending most of its capacity on the conditional task.
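The effect of η is easy to check in isolation. The following sketch uses only the standard library; `NULL_TOKEN = -1` and the helper `drop_labels` are hypothetical names chosen for illustration, and the labels are random digits rather than a real dataset.

```python
import random

NULL_TOKEN = -1  # hypothetical id reserved for the "no label" case


def drop_labels(labels, eta, rng):
    """Replace each label with NULL_TOKEN independently with probability eta."""
    return [NULL_TOKEN if rng.random() < eta else y for y in labels]


rng = random.Random(0)
labels = [rng.randrange(10) for _ in range(10_000)]  # digits 0-9
dropped = drop_labels(labels, eta=0.1, rng=rng)

fraction_null = sum(y == NULL_TOKEN for y in dropped) / len(dropped)
print(f"fraction replaced by the null token: {fraction_null:.3f}")
```

The printed fraction should sit close to η. The original list is left untouched, which mirrors the `y.clone()` in the training loop: the dataset keeps its labels and only the copy fed to the network is masked.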
Why does one network learn both modes? Think of it this way: when y is a real label, the network learns "given this class, where should the velocity point?" When y = ∅, the network learns "with no class information, where should the velocity point on average?" The null token ∅ is just another entry in the vocabulary: the embedding table has N+1 entries (N classes plus the null token), or N+2 if a padding entry is also reserved. The network doesn't know which entries are "real" classes and which is the null; it just learns the appropriate velocity for each conditioning value.
Implementation detail: what is the null token? Common choices:
• For class labels: a special integer ID (e.g., class N+1) with its own learned embedding.
• For text prompts: an empty string "" or a special [PAD] sequence that produces a fixed embedding from the frozen text encoder.
• For image conditioning: a tensor of zeros.
The guidance scale w is the most important hyperparameter at inference time. It controls the tradeoff between diversity and prompt fidelity. Let's build deep intuition for what it does.
w = 0: The CFG formula becomes ũ(x|y) = u(x|∅) — purely unconditional. The model ignores the prompt entirely.
w = 1: We get ũ(x|y) = u(x|y) — vanilla guidance. The model uses the prompt but doesn't amplify it.
w > 1: The prompt signal is amplified. The model generates samples that match the prompt more strongly, at the cost of reduced diversity and potential artifacts.
Very large w: The model "over-amplifies" the prompt. Images become saturated, artifact-ridden, and identical to each other. At extreme values, the ODE becomes unstable.
Watch how the generated sample distribution changes as w increases. Low w = diverse but sometimes off-prompt. High w = on-prompt but repetitive and potentially distorted.
In the simulation above, observe how increasing w causes the sample cloud to shrink and concentrate. The "sweet spot" is typically w ∈ [2, 7], depending on the application:
| Application | Typical w | Why |
|---|---|---|
| Stable Diffusion 3 | 2.0 – 5.0 | High-quality photorealistic images |
| DALL-E 3 | 3.0 – 7.5 | Creative illustration |
| Class-conditional (MNIST) | 1.5 – 4.0 | Simple, clear digit generation |
| Video generation | 4.0 – 8.0 | Temporal consistency matters more |
A mathematical perspective. Think of what CFG does to the effective distribution. If pdata(x|y) ∝ pdata(x) · p(y|x), then the CFG distribution roughly corresponds to:

p̃(x|y) ∝ pdata(x) · p(y|x)^w
Raising the likelihood p(y|x) to the power w sharpens the distribution around high-likelihood regions — a "temperature scaling" effect on the classifier part.
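The sharpening effect can be made visible on a one-dimensional toy problem. This is a sketch under made-up assumptions: two unit-variance clusters at −1 ("blue") and +1 ("orange") with equal prior, the exact posterior as the "classifier", and a simple grid sum in place of an integral. It computes the mean and variance of the density proportional to p(x)·p(blue|x)^w.

```python
import math


def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def tilted_moments(w, mean_blue=-1.0, mean_orange=1.0, lo=-8.0, hi=8.0, n=4001):
    """Mean and variance of p(x) * p(blue|x)**w, normalized on a grid."""
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    weights = []
    for x in xs:
        p_blue = normal_pdf(x, mean_blue)
        p_orange = normal_pdf(x, mean_orange)
        p_x = 0.5 * p_blue + 0.5 * p_orange      # unconditional density
        p_y_given_x = 0.5 * p_blue / p_x         # exact "classifier" for blue
        weights.append(p_x * p_y_given_x ** w)
    total = sum(weights)
    mean = sum(x * wt for x, wt in zip(xs, weights)) / total
    var = sum((x - mean) ** 2 * wt for x, wt in zip(xs, weights)) / total
    return mean, var


for w in (0.0, 1.0, 3.0):
    mean, var = tilted_moments(w)
    print(f"w={w}: mean={mean:.3f}, variance={var:.3f}")
```

At w=0 the result is the full mixture, at w=1 it is the blue cluster itself, and at w=3 the mass moves further toward the blue side while the variance shrinks. That is the diversity-for-fidelity trade in miniature.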
Worked example: CFG on MNIST. The book shows results for class-conditional MNIST digit generation at various guidance scales:
| w | FID (quality) | Visual effect |
|---|---|---|
| 1.0 | Moderate | Digits are recognizable but some are ambiguous or poorly formed |
| 2.0 | Best | Digits are sharp, clear, and well-formed; good variety of styles |
| 4.0 | Good | Digits are very uniform; less style variety but extremely clear |
| 8.0 | Degrading | Digits start to look "over-sharpened" with artifacts around edges |
The sweet spot (w ≈ 2–4 for MNIST) balances clarity with diversity. For more complex data like ImageNet or text-to-image models, higher w values (4–7) are typically needed because the conditioning signal is weaker relative to the data complexity.
Guidance for diffusion models (SDEs). Everything we derived for flow models extends directly to diffusion models. At inference, simply replace uθt(x|y) with ũθt(x|y) in the SDE simulation. The CFG formula is the same — we just apply it to the drift term of the SDE rather than the ODE velocity. The noise term σtdWt remains unchanged.
Rewriting the CFG formula in alternative forms. The formula ũ(x|y) = (1−w)u(x|∅) + w·u(x|y) can be rearranged into a more intuitive form:

ũ(x|y) = u(x|∅) + w·[u(x|y) − u(x|∅)]
Read it as: start with the unconditional velocity, then add w times the "guidance direction" (the difference between conditional and unconditional). This form makes it clear that CFG amplifies the difference between the guided and unconditional models. When w=1, the guidance direction is added once (vanilla guidance). When w=4, it's added four times.
Negative guidance (w < 0). What if we set w negative? The formula ũ(x|y) = (1−w)u(x|∅) + w·u(x|y) is unchanged, but the coefficient on the conditional term is now negative, so the field is pushed away from the prompt. Negative prompting builds on the same idea: keep w > 0 but replace the unconditional term u(x|∅) with a negative-prompt term u(x|yneg), which steers samples away from undesired content (e.g. "blurry, low quality, text watermark").
Watch particles flow from noise to data with different guidance scales. Each particle starts at a random noise point and follows the CFG vector field. Higher w = particles converge more tightly to the target class.
Guidance is the bridge between mathematical generative modeling and practical AI systems. Without it, diffusion and flow models would generate beautiful but random images. With it, they follow your instructions.
Summary of the key ideas:
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Vanilla guidance | Feed y into the network directly | Simple, exact sampling from p(x|y) | Often poor prompt fidelity in practice |
| Classifier guidance | Train a separate classifier, scale its gradient | Can amplify prompt signal | Needs extra classifier, hard for text prompts |
| Classifier-free guidance | Train one network with label dropping, combine at inference | No extra model, works with any conditioning type | 2× inference cost, is a heuristic |
What lies ahead. Chapter 6 will show how CFG is integrated into large-scale generators like Stable Diffusion 3 and Meta Movie Gen. The architecture details (DiT blocks, VAEs, text embeddings) are all designed to work with the CFG framework we developed here. Chapter 7 will extend these ideas to discrete data, where "guidance" takes a different form.
Open research directions:
• Guidance distillation: Training models that produce CFG-quality outputs in a single forward pass, halving inference cost.
• Dynamic guidance schedules: Varying w over time during inference (e.g. high w early for global structure, low w late for fine detail).
• Negative prompting: Using CFG with negative prompts — the unconditional term u(x|∅) is replaced with u(x|yneg) to steer away from undesired content.
To consolidate everything from this chapter, here is the complete pipeline for classifier-free guided generation in a single, annotated algorithm:
```python
import torch

# ====== TRAINING ======
# Data: dataset of (image, label/prompt) pairs
# Model: neural network u_theta(x, t, y) -> velocity vector
# Hyperparameters: eta (label drop rate), learning rate
# Assumes model, optimizer, dataloader, NULL_TOKEN, eta, num_epochs are defined.
for epoch in range(num_epochs):
    for z, y in dataloader:
        B = z.shape[0]
        # Step 1: Sample random time
        t = torch.rand(B)
        # Step 2: Sample noise and construct noisy x
        eps = torch.randn_like(z)
        x_t = t.view(-1, 1, 1, 1) * z + (1 - t).view(-1, 1, 1, 1) * eps
        # Step 3: Compute target velocity (straight-line path)
        u_target = z - eps                  # alpha_dot * z + beta_dot * eps
        # Step 4: LABEL DROPPING (the CFG trick)
        drop_mask = torch.rand(B) < eta
        y = y.clone()                       # leave the dataloader's tensor untouched
        y[drop_mask] = NULL_TOKEN
        # Step 5: Forward pass and loss
        u_pred = model(x_t, t, y)
        loss = ((u_pred - u_target) ** 2).mean()
        optimizer.zero_grad()               # clear gradients from the previous step
        loss.backward()
        optimizer.step()

# ====== INFERENCE ======
# Choose guidance scale w (e.g. 4.0)
# Choose prompt y (e.g. "a corgi on a beach")
# Assumes w, y, n_steps and the sizes B, C, H, W are defined.
x = torch.randn(B, C, H, W)                 # Pure noise
with torch.no_grad():
    for i in range(n_steps):
        t = torch.full((B,), i / n_steps)
        # TWO forward passes per step:
        u_uncond = model(x, t, NULL_TOKEN)
        u_cond = model(x, t, y)
        # CFG combination:
        u = (1 - w) * u_uncond + w * u_cond
        x = x + (1.0 / n_steps) * u
# x is now the generated image!
```
That's the entire algorithm. Everything in this chapter — Bayes' rule decomposition, classifier guidance, the CFG formula, label dropping — distills down to: (1) randomly drop labels during training, (2) evaluate the network twice at inference and mix the outputs.
For reference, here are the essential equations from this chapter in one place:
| Name | Equation | Where used |
|---|---|---|
| Guided CFM loss | E ||uθ(x|y) − utarget(x|z)||2 | Training (vanilla) |
| Bayes' rule on score | ∇log p(x|y) = ∇log p(x) + ∇log p(y|x) | Classifier guidance derivation |
| Classifier guidance | ũ(x|y) = u(x) + w·at∇log p(y|x) | Inference (requires classifier) |
| CFG formula | ũ(x|y) = (1−w)u(x|∅) + w·u(x|y) | Inference (no classifier needed) |
| CFG training loss | E ||uθ(x|y) − utarget(x|z)||2 with y→∅ w.p. η | Training (with label dropping) |
The logical flow: Vanilla guidance (Sec. 1–2) → prompt problem motivates amplification (Sec. 3) → Bayes' rule decomposes into unconditional + classifier (Sec. 4) → scale up classifier = classifier guidance → eliminate classifier using Bayes' rule again = CFG (Sec. 5) → label dropping trains one model for both modes (Sec. 6) → w controls fidelity/diversity (Sec. 7).
Key historical context. Classifier guidance was introduced by Dhariwal & Nichol (2021) in their paper "Diffusion Models Beat GANs on Image Synthesis." It required training a separate noisy classifier alongside the diffusion model. Classifier-free guidance was introduced by Ho & Salimans (2022), who recognized that the same effect could be achieved within a single model using label dropping. This simplification was a major breakthrough: it made guidance practical for arbitrary conditioning types (text, images, audio) without needing specialized classifiers.
CFG in modern systems. As of 2026, classifier-free guidance is used in virtually every production-grade generative model:
| System | Year | CFG w range | Drop rate η |
|---|---|---|---|
| DALL-E 2 | 2022 | 2.0 – 4.0 | 0.1 |
| Stable Diffusion 1.5 | 2022 | 7.0 – 12.0 | 0.1 |
| Imagen | 2022 | 4.0 – 10.0 | 0.1 |
| Stable Diffusion 3 | 2024 | 2.0 – 5.0 | 0.1 |
| FLUX | 2024 | 1.5 – 4.0 | 0.1 |
| Movie Gen Video | 2024 | 4.0 – 8.0 | 0.1 |
| VEO-3 | 2025 | 3.0 – 7.0 | 0.1 |
Notice that different systems use different w ranges. Earlier models (SD 1.5) needed very high w (7–12) because their conditioning was weaker. Later models with better text encoders and architectures (SD3, FLUX) can achieve strong prompt adherence with lower w values (2–5), producing more natural-looking outputs.
Dynamic guidance schedules. Instead of using a fixed w for all timesteps, some systems vary w over the ODE trajectory. A common approach is to use high w early (for global composition) and low w late (for fine detail):
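```python
import torch

# Dynamic guidance schedule
def guidance_schedule(t, w_max=7.0, w_min=1.5):
    # High guidance early (global structure)
    # Low guidance late (fine details)
    return w_max + (w_min - w_max) * t      # linear: w_max at t=0, w_min at t=1

# Assumes model, x, y, NULL_TOKEN and n_steps are defined as in the CFG inference loop.
dt = 1.0 / n_steps
with torch.no_grad():
    for i in range(n_steps):
        t = i / n_steps
        w = guidance_schedule(t)
        t_batch = torch.full((x.shape[0],), t)
        u_uncond = model(x, t_batch, NULL_TOKEN)
        u_cond = model(x, t_batch, y)
        u_cfg = (1 - w) * u_uncond + w * u_cond
        x = x + dt * u_cfg
```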
This often produces better results than a fixed w because early steps determine the broad composition (where objects are, overall color scheme) while late steps handle textures and fine patterns that benefit from more diversity.
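Multi-prompt guidance. CFG can be extended to multiple prompts simultaneously. For example, guide toward "a corgi" AND "on a beach" AND "photorealistic" by combining multiple guidance terms. One common form adds a separately scaled guidance direction per prompt:

ũ(x|y1,…,yK) = u(x|∅) + Σi wi·[u(x|yi) − u(x|∅)]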
Each prompt has its own guidance scale, allowing fine-grained control over how strongly each aspect of the prompt is enforced.
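As a sketch of how per-prompt scales could be combined, the function below assumes one common composition rule, which may differ from what a given system actually uses: start from the unconditional velocity and add each prompt's guidance direction u(x|yi) − u(x|∅) with its own weight wi. The velocities are made-up two-dimensional numbers standing in for network outputs, and `multi_prompt_cfg` is a hypothetical helper name.

```python
def multi_prompt_cfg(u_uncond, u_conds, scales):
    """u_uncond + sum_i w_i * (u_cond_i - u_uncond), elementwise on lists."""
    if len(u_conds) != len(scales):
        raise ValueError("need exactly one guidance scale per prompt")
    out = list(u_uncond)
    for u_cond, w in zip(u_conds, scales):
        for j, (uc, uu) in enumerate(zip(u_cond, u_uncond)):
            out[j] += w * (uc - uu)
    return out


u_uncond = [1.0, 0.5]   # made-up output for the null prompt
u_corgi = [3.0, -1.0]   # made-up output for "a corgi"
u_beach = [0.0, 2.5]    # made-up output for "on a beach"

both = multi_prompt_cfg(u_uncond, [u_corgi, u_beach], scales=[3.0, 1.5])
single = multi_prompt_cfg(u_uncond, [u_corgi], scales=[3.0])
print(both)
print(single)  # with one prompt this reduces to the usual CFG formula
```

Note the cost: each extra prompt adds one more network evaluation per step, so K prompts need K+1 forward passes.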
GLASS Flows (Holderrieth et al., 2025). A recent extension called GLASS (Generalized Learning of Aligned Score and Sampling) addresses the fact that CFG with w≠1 produces samples from a different distribution than pdata(x|y). GLASS introduces a principled way to sample from a target distribution that is close to the CFG distribution but has better theoretical properties, including improved diversity at high guidance scales.
For completeness, let's write out the entire derivation of CFG without skipping any steps. We want to get from classifier guidance to the classifier-free formula.
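Start: Classifier guidance with scale w:

$$\tilde{u}_t(x \mid y) = u^{\mathrm{target}}_t(x) + w\, a_t \nabla \log p_t(y \mid x) \tag{1}$$

Apply Bayes' rule to replace the classifier gradient:

$$\nabla \log p_t(y \mid x) = \nabla \log p_t(x \mid y) - \nabla \log p_t(x) \tag{2}$$

Substitute (2) into (1):

$$\tilde{u}_t(x \mid y) = u^{\mathrm{target}}_t(x) + w\, a_t \left[ \nabla \log p_t(x \mid y) - \nabla \log p_t(x) \right] \tag{3}$$

Expand $u^{\mathrm{target}}_t(x) = a_t \nabla \log p_t(x) + b_t x$:

$$\tilde{u}_t(x \mid y) = a_t \nabla \log p_t(x) + b_t x + w\, a_t \nabla \log p_t(x \mid y) - w\, a_t \nabla \log p_t(x) \tag{4}$$

Group the $\nabla \log p_t(x)$ terms:

$$\tilde{u}_t(x \mid y) = (1 - w)\, a_t \nabla \log p_t(x) + w\, a_t \nabla \log p_t(x \mid y) + b_t x \tag{5}$$

Recognize the vector fields:

$$u^{\mathrm{target}}_t(x) = a_t \nabla \log p_t(x) + b_t x, \qquad u^{\mathrm{target}}_t(x \mid y) = a_t \nabla \log p_t(x \mid y) + b_t x \tag{6}$$

Add and subtract $w\, b_t x$ in (5):

$$\tilde{u}_t(x \mid y) = (1 - w) \left[ a_t \nabla \log p_t(x) + b_t x \right] + w \left[ a_t \nabla \log p_t(x \mid y) + b_t x \right] \tag{7}$$

Final result:

$$\tilde{u}_t(x \mid y) = (1 - w)\, u^{\mathrm{target}}_t(x) + w\, u^{\mathrm{target}}_t(x \mid y) \tag{8}$$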
Every step is elementary algebra. The key insight is step (2): using Bayes' rule to replace the classifier gradient with the difference of two score functions, both of which are already learned by the generative model.
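Let's compute a concrete CFG velocity. Suppose at some point x at time t the two network evaluations return u(x|∅) = [1.0, 0.5] and u(x|y) = [3.0, −1.0].
At w=1 (vanilla guidance): ũ = (1−1)·[1.0, 0.5] + 1·[3.0, −1.0] = [3.0, −1.0]
At w=3 (amplified): ũ = (1−3)·[1.0, 0.5] + 3·[3.0, −1.0] = [−2.0, −1.0] + [9.0, −3.0] = [7.0, −4.0]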
The amplified velocity [7.0, −4.0] is much more aggressive — it points strongly toward the target, overshooting the vanilla guidance direction. The magnitude has increased from 3.16 to 8.06, and the direction has shifted away from the unconditional "average" direction.
Alternative form: ũ = u(x|∅) + w·[u(x|y) − u(x|∅)] = [1.0, 0.5] + 3·[2.0, −1.5] = [1.0, 0.5] + [6.0, −4.5] = [7.0, −4.0] ✓
At w=0 (unconditional): ũ = (1−0)·[1.0, 0.5] + 0·[3.0, −1.0] = [1.0, 0.5]
The prompt is completely ignored. The velocity is purely unconditional.
At w=−1 (negative guidance / anti-prompt): ũ = (1−(−1))·[1.0, 0.5] + (−1)·[3.0, −1.0] = [2.0, 1.0] − [3.0, −1.0] = [−1.0, 2.0]
The velocity now points away from the prompt target — useful for "negative prompting" (avoiding certain content).
These numerical examples make the CFG formula concrete. The key pattern: w scales the difference between guided and unconditional velocities. At w=1, you add the difference once (vanilla). At w=3, you add it three times (amplified). At w=−1, you subtract it (anti-guidance).
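The arithmetic above is short enough to script. The sketch below uses the values implied by the alternative-form line, u(x|∅) = [1.0, 0.5] and u(x|y) − u(x|∅) = [2.0, −1.5], so u(x|y) = [3.0, −1.0]. The function names `cfg` and `cfg_alt` are made up for illustration.

```python
import math


def cfg(u_uncond, u_cond, w):
    """(1 - w) * u_uncond + w * u_cond, elementwise."""
    return [(1.0 - w) * uu + w * uc for uu, uc in zip(u_uncond, u_cond)]


def cfg_alt(u_uncond, u_cond, w):
    """Equivalent form: u_uncond + w * (u_cond - u_uncond)."""
    return [uu + w * (uc - uu) for uu, uc in zip(u_uncond, u_cond)]


u_uncond = [1.0, 0.5]    # u(x|∅)
u_cond = [3.0, -1.0]     # u(x|y)

for w in (-1.0, 0.0, 1.0, 3.0):
    u = cfg(u_uncond, u_cond, w)
    print(f"w={w:+.0f}: u={u}, |u|={math.hypot(*u):.2f}")
```

Both forms give the same vector for every w. The norms at w=1 and w=3 reproduce the 3.16 and 8.06 quoted in the text.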
"The best guidance is one that amplifies the signal without destroying the distribution." — Ho & Salimans, 2022