Building language models with diffusion: from continuous vectors to discrete tokens via continuous-time Markov chains.
Everything we've learned about flow matching and diffusion operates in continuous Euclidean space Rd. We start with noise (a Gaussian), define a probability path, learn a vector field, and simulate an ODE to generate data. The math is beautiful and the results are stunning for images, videos, and audio.
But what about text?
Text is not a point in Rd. "The cat sat on the mat" is a sequence of discrete tokens. There's no meaningful notion of "halfway between cat and dog" in token space. You can't define a continuous ODE trajectory from noise to "The cat sat on the mat" because the space isn't continuous — it's discrete.
The key insight is that the principles of flow matching transfer perfectly to discrete spaces:
| Continuous (images) | Discrete (text) |
|---|---|
| State space Rd | State space S = Vd (sequences of tokens) |
| ODE/SDE dynamics | CTMC dynamics |
| Vector field ut(x) | Rate matrix Qt(y|x) |
| Gaussian noise → data | Uniform/masked noise → data |
| Regression loss (MSE) | Classification loss (cross-entropy) |
By the end of this chapter, you'll see that training a discrete diffusion model reduces to training a simple token classifier — predict which token should go in each position, given a partially corrupted sequence. This is remarkably similar to how BERT-style masked language models work, but with a rigorous generative framework underneath.
Let's formalize what "discrete data" means. We start with a vocabulary V = {v1, v2, ..., vV} of V possible tokens. For language, these could be words, subwords (like BPE tokens), or characters. For DNA, they'd be the four bases {A, C, G, T}.
A sequence of d tokens lives in the state space:
A single element x ∈ S looks like x = (x1, x2, ..., xd), where each xj ∈ V. For example, with vocabulary {"the", "cat", "sat", "on", "mat"} and d=5:
The size problem. The state space S = Vd is enormous. With a vocabulary of V=50,000 tokens and a sequence length of d=1,024:
This is astronomically larger than any continuous distribution we've dealt with. We can't enumerate states or store full probability vectors. Every algorithm must work with factorized representations that scale as d · V rather than Vd.
Now, let Xt be a stochastic process on S: a random trajectory X: [0,1] → S that maps continuous time to discrete states. Unlike ODEs (where Xt moves smoothly), Xt jumps between states. At most times it stays put; occasionally it jumps to a neighboring state.
Concrete example: language. Consider generating a 7-word sentence from a vocabulary of 50,000 tokens. The state space has |S| = 50,0007 ≈ 1033 possible sentences. The generative process starts from a random (or masked) sequence and progressively refines it, jumping one token at a time, until a coherent sentence emerges.
Contrast with autoregressive models. GPT-style models generate text left-to-right: they sample token 1, then token 2 conditioned on token 1, and so on. Discrete diffusion models generate all tokens simultaneously: at each step, any position can be updated. This parallelism is a key advantage — the model can revise earlier tokens based on later context, something autoregressive models cannot do.
In continuous flow matching, the dynamics are governed by a vector field ut(x) that tells us "which direction to move." In discrete space, there are no directions — only jumps. The analogue of a vector field is a rate matrix Qt(y|x) that tells us "how fast to jump from state x to state y."
Formally, a rate matrix is a function Q: S × S × [0,1] → R that satisfies two conditions:
Condition 1: Non-negative off-diagonal. For x ≠ y:
The rate of jumping from x to a different state y can only be non-negative. (Not jumping just means rate = 0.)
Condition 2: Rows sum to zero.
The diagonal entry (rate of "staying") equals the negative sum of all outgoing rates. This is a consistency condition: you either stay at x or leave to some y — there's no third option.
Worked example: two-state system. Consider S = {a, b} with a constant rate λ > 0. The rate matrix is:
This says: when at a, jump to b at rate λ. When at b, jump to a at rate λ. The transition probabilities over a time increment h are:
As h → ∞, the memory decays exponentially and the chain converges to uniform: P(a) = P(b) = 1/2. Higher λ means faster convergence.
Let's verify the transition probabilities satisfy the rate equation. Take the derivative at h=0:
And for staying at a:
The rate matrix conditions are satisfied. This gives us confidence that the rate matrix formalism correctly describes the dynamics.
A two-state CTMC with rate λ. The particle starts at state A and jumps between states. Adjust λ to see how it affects the jump frequency and equilibrium.
A Continuous-Time Markov Chain (CTMC) is a stochastic process Xt on a discrete state space S that has the Markov property: the future depends only on the present, not the past.
Think of a particle sitting on one of the states in S. At random times, it jumps to another state. The rate of jumping is governed by the rate matrix Qt. The fundamental relationship is:
This says: the instantaneous rate of change of the transition probability equals the rate matrix. It's the discrete analogue of "the velocity field determines the ODE trajectory."
The Kolmogorov Forward Equation. Just as the continuity equation governs how probability densities evolve under ODEs in continuous space, the Kolmogorov Forward Equation (KFE) governs how probability distributions evolve under CTMCs:
This says: the rate of change of the probability of being in state x equals the sum of "incoming flow" from all other states y (weighted by their probabilities and the rate of jumping from y to x). In matrix form:
This is a linear ODE on the vector space of probability distributions. Its uniqueness follows from standard ODE theory. The KFE is the key tool for proving that marginal rate matrices correctly generate the desired data distribution.
Compare with the continuous case:
| Concept | Continuous (ODEs) | Discrete (CTMCs) |
|---|---|---|
| Evolution equation | Continuity eq: ∂p/∂t + div(p u) = 0 | KFE: dp/dt = Q p |
| Nature | PDE on continuous density | Linear ODE on probability vector |
| Uniqueness | Picard-Lindelof theorem | Theorem 33 (rate matrix to unique CTMC) |
| Simulation | Euler: x = x + h u(x,t) | Euler: sample from delta + h Q |
How do we simulate a trajectory? For small step size h, we can approximate the transition probability:
In words: at each step, the particle either stays put (with probability 1 − h∑y≠xQ(y|x)) or jumps to some neighbor y (with probability h · Q(y|x)). We sample from this discrete distribution at each time step.
python # Euler simulation of a CTMC def simulate_ctmc(Q_fn, x0, n_steps=100): x = x0 h = 1.0 / n_steps for i in range(n_steps): t = i * h # Get rate matrix column for current state x rates = Q_fn(x, t) # shape: (|S|,) # Euler transition probabilities probs = h * rates # off-diagonal probs[x] = 1 - h * rates[rates > 0].sum() # diagonal # Sample next state x = np.random.choice(len(probs), p=probs) return x
A particle jumps between 5 states according to a rate matrix. Watch how it spends time in different states. The bar chart shows the empirical distribution (how often each state is visited).
A CTMC model parameterizes the rate matrix with a neural network: x → {Qθt(y|x)}y∈S. But there's a fatal problem: the output has size |S| = Vd. For V=50,000 and d=1,024, that's 50,0001,024 values. No computer can store this.
The solution is factorization: restrict the rate matrix so that the process can only change one token at a time.
This means if x = (x1, x2, ..., xd), the only reachable states from x are those that differ in exactly one token. We call these the neighbors N(x) of x.
With factorization, the network output has shape d × V instead of Vd:
Here, Qθt(vi, j | x) is the rate of replacing the j-th token of x with vocabulary item vi. This scales linearly in the sequence length, not exponentially.
The per-position rate matrix must satisfy the usual conditions:
A transformer fits perfectly: input a sequence of d tokens, output d × V logits. The rate conditions can be enforced with a simple softmax followed by negation for the diagonal.
python # Factorized CTMC model architecture class DiscreteDiffusionModel(nn.Module): def __init__(self, vocab_size, seq_len, d_model=768): super().__init__() self.embed = nn.Embedding(vocab_size, d_model) self.time_embed = FourierFeatures(d_model) self.transformer = TransformerEncoder(d_model, n_layers=12) self.head = nn.Linear(d_model, vocab_size) # output logits def forward(self, x, t): # x: (B, d) token indices # t: (B,) scalar times h = self.embed(x) # (B, d, d_model) h = h + self.time_embed(t) # add time conditioning h = self.transformer(h) # (B, d, d_model) logits = self.head(h) # (B, d, V) per-position logits return logits # To get rate matrix: softmax(logits) gives probabilities # Rate = kappa_dot/(1-kappa) * (softmax(logits) - one_hot(x))
The output shape is (B, d, V) — a probability distribution over V tokens for each of the d positions. This is exactly the shape of a standard language model output, making standard transformer architectures directly applicable.
Worked example: factorized rate matrix for d=2, V=3. Suppose our vocabulary is V={A, B, C} and our sequence length is d=2. The full state space has |S| = 32 = 9 states: AA, AB, AC, BA, BB, BC, CA, CB, CC. A full rate matrix would be 9×9 = 81 entries.
With factorization, only single-token changes are allowed. From state AB, the neighbors are: {AB, BB, CB, AA, AB, AC} (change position 1 to any value, or change position 2 to any value). That's at most d×V = 2×3 = 6 rates to store, not 9.
The diagonal entries (current tokens) are negative: Q(A,1|x) = −(Q(B,1|x) + Q(C,1|x)) and Q(B,2|x) = −(Q(A,2|x) + Q(C,2|x)).
With factorization, simulation becomes embarrassingly parallel: at each step, update each token independently:
python # Parallel per-token Euler step for j in range(d): # in parallel! rates_j = Q_theta(x, t)[j] # (V,) rates for position j # Euler transition: stay or jump probs_j = h * rates_j probs_j[x[j]] = 1 - h * rates_j[rates_j > 0].sum() x[j] = sample_categorical(probs_j)
Just like in continuous flow matching, we need a probability path that interpolates between noise and data. The key difference: instead of smoothly blending two distributions (like a Gaussian path does), a discrete probability path teleports probability mass — fading one distribution in and another out.
A discrete conditional probability path pt(x|z) satisfies:
Notice the similarity to continuous flow matching (Chapter 3): there we had p0(·|z) = N(0, I) (Gaussian noise) and p1(·|z) = δz (point mass at data). Here it's the same structure, just with discrete distributions.
Choices for the initial distribution pinit.
| Choice | pinit | Effect | Used by |
|---|---|---|---|
| Uniform | Unif(V) per position | Each masked token replaced by random vocabulary item | General DFM |
| Mask token | δ[MASK] per position | Each masked token replaced by special [MASK] | MDLM |
| Data distribution | Empirical marginal per position | Masked tokens look like "typical" tokens | Some variants |
The mask-token approach (MDLM) is the most popular because it gives the model a clear signal: "this position is unknown." With uniform noise, the model must distinguish between a token that is real data and one that is random noise — a harder discrimination task.
Worked example: sampling from the factorized mixture path. Data z = (T, h, e), scheduler κt = 0.4, noise = [MASK].
For each position j = 1, 2, 3:
• Sample mj ∼ Bernoulli(0.4): probability 0.4 of keeping the data token, 0.6 of masking.
• If mj = 1: xj = zj (keep data). If mj = 0: xj = [MASK].
Possible outcomes (each with its probability):
| m | x | P |
|---|---|---|
| (0,0,0) | [M] [M] [M] | 0.63 = 0.216 |
| (1,0,0) | T [M] [M] | 0.4 · 0.62 = 0.144 |
| (0,1,0) | [M] h [M] | 0.144 |
| (0,0,1) | [M] [M] e | 0.144 |
| (1,1,0) | T h [M] | 0.42 · 0.6 = 0.096 |
| (1,0,1) | T [M] e | 0.096 |
| (0,1,1) | [M] h e | 0.096 |
| (1,1,1) | T h e | 0.43 = 0.064 |
At κ = 0.4, the most likely outcome is fully masked (21.6%), and the data is fully revealed only 6.4% of the time. As κ increases toward 1, the probabilities shift toward more tokens being revealed.
Expected number of revealed tokens. At time t with scheduler κt, each position is independently revealed with probability κt. The expected number of revealed tokens is d·κt. For d=1024 and κ=0.4: expected 410 tokens revealed, 614 masked. At κ=0.9: expected 922 revealed, 102 masked. The model must predict these 102 remaining tokens from the 922 visible ones — a much easier task than predicting all 1024 from scratch.
This progressive difficulty is key to training: at high κ (easy), the model learns simple associations (local context). At low κ (hard), it learns global structure (long-range dependencies). The uniform time sampling t ∼ Unif[0,1] ensures the model is trained at all difficulty levels.
The loss at different noise levels. The DFM loss ∑j −log pθ(zj|x) behaves differently at different times:
| Time t | κt | Masked fraction | Loss character |
|---|---|---|---|
| t ≈ 0 | ~0 | ~100% masked | Very hard. Model must guess tokens from nearly zero context. Loss is high. |
| t ≈ 0.3 | ~0.3 | ~70% masked | Hard. Some context available. Model learns global patterns. |
| t ≈ 0.7 | ~0.7 | ~30% masked | Moderate. Most tokens visible. Model learns local corrections. |
| t ≈ 1 | ~1 | ~0% masked | Easy. Almost all tokens visible. Loss is low (just copy the input). |
The loss contribution is weighted uniformly across all t values. Some practitioners apply importance weighting (spending more training time at intermediate t values where the loss gradient is most informative) to improve sample quality.
Connection to denoising autoencoders. At each training step, the model receives a corrupted sequence x and must predict the clean sequence z. This is precisely the framework of a denoising autoencoder, but with a continuous corruption schedule and a rigorous generative model underneath. The continuum of corruption levels (parameterized by t) is what transforms a simple denoising task into a full generative model.
The neural network for discrete diffusion is typically a standard transformer encoder (not a decoder, since there's no causal masking). The key architectural decisions:
1. Bidirectional attention. Every token can attend to every other token, including tokens to its right. This is fundamentally different from GPT-style autoregressive models. Architecturally, it's identical to BERT.
2. Time conditioning. The scalar time t is embedded using Fourier features (identical to Chapter 6) and injected via AdaLN (adaptive layer normalization) at every transformer layer. This tells the model the current noise level.
3. Output head. A linear layer maps the transformer's hidden state at each position to logits over the vocabulary V. A softmax produces the predicted clean-token probabilities pθ1|t(v|x).
python # Discrete diffusion transformer (detailed) class DiffusionLM(nn.Module): def __init__(self, V, d, d_model, n_layers, n_heads): super().__init__() self.tok_emb = nn.Embedding(V + 1, d_model) # +1 for [MASK] self.pos_emb = nn.Embedding(d, d_model) self.time_mlp = nn.Sequential( FourierFeatures(d_model), nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model) ) self.layers = nn.ModuleList([ TransformerBlock(d_model, n_heads, use_adaln=True) for _ in range(n_layers) ]) self.head = nn.Linear(d_model, V) # predict V tokens (not [MASK]) def forward(self, x, t): h = self.tok_emb(x) + self.pos_emb(torch.arange(x.size(1))) t_emb = self.time_mlp(t) for layer in self.layers: h = layer(h, t_emb) # AdaLN with time embedding logits = self.head(h) # (B, d, V) return logits
4. Scale. LLaDA 2.0 uses a 100B-parameter transformer with 80 layers, d_model=12288, and 96 attention heads. The architecture is essentially identical to a standard LLM (like LLaMA) but with bidirectional attention and AdaLN time conditioning.
5. Training data. Like autoregressive LLMs, discrete diffusion models are trained on web-scale text: Common Crawl, Wikipedia, books, code, and other text corpora. The data pipeline is identical — only the loss function changes from next-token prediction (cross-entropy with causal masking) to masked-token prediction (cross-entropy with random masking at variable rates).
Classifier-free guidance (Chapter 5) extends naturally to discrete models. The rate matrix combines conditional and unconditional predictions:
Training uses the same label-dropping trick: with probability η, replace the prompt y with a null token ∅. At inference, evaluate the model twice per step (once with y, once with ∅) and combine the predicted probabilities.
python # CFG for discrete diffusion def cfg_discrete_step(model, x, t, y, null_y, w): # Two forward passes logits_cond = model(x, t, y) # (B, d, V) logits_uncond = model(x, t, null_y) # (B, d, V) # Combine probabilities via CFG probs_cond = F.softmax(logits_cond, dim=-1) probs_uncond = F.softmax(logits_uncond, dim=-1) probs_cfg = (1 - w) * probs_uncond + w * probs_cond # Renormalize (CFG can push probs outside [0,1]) probs_cfg = probs_cfg.clamp(min=0) probs_cfg = probs_cfg / probs_cfg.sum(dim=-1, keepdim=True) return probs_cfg
Note the renormalization step: since CFG is a linear combination with w>1, the resulting "probabilities" can go negative or exceed 1. We clamp and renormalize to get a valid probability distribution. This is analogous to how CFG in continuous space can produce velocity vectors that are "too large" — a consequence of the heuristic nature of guidance.
Typical guidance scales for text generation. Discrete diffusion models typically use lower guidance scales than image models. For class-conditional text generation, w ∈ [1.5, 3.0] is typical. For text-conditional generation (e.g., prompt-guided story completion), w ∈ [2.0, 5.0] works well. Higher values tend to produce repetitive, less diverse text.
Discrete diffusion models are one of several approaches to generating discrete sequences. Here is how they compare:
| Model | Generation mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Autoregressive (GPT) | Left-to-right, one token per step | Excellent quality, simple training | Sequential (slow), no revision, no infilling |
| Discrete diffusion (MDLM) | All positions updated in parallel via CTMC | Parallel generation, natural infilling, revision | Less mature, needs many sampling steps |
| Non-autoregressive (NAT) | All tokens generated in one pass | Very fast inference | Poor quality without iterative refinement |
| Energy-based models | MCMC sampling from learned energy function | Flexible, principled | Slow mixing, hard to train |
| VAE-based | Latent variable z → decode to sequence | Fast generation, smooth latent space | Posterior collapse, blurry outputs |
Discrete diffusion occupies a sweet spot: it's more flexible than autoregressive models (parallel generation, infilling) while being more principled and higher-quality than non-autoregressive models. As scaling and training recipes improve, discrete diffusion models are increasingly competitive with autoregressive LLMs on standard language modeling benchmarks.
The future convergence. There is growing evidence that the distinction between autoregressive and diffusion-based language models may eventually blur. Some recent work combines both approaches: use autoregressive generation for the initial draft, then refine with discrete diffusion. Others use diffusion-based models for planning and autoregressive models for execution. The principles learned in this chapter — CTMCs, rate matrices, and simulation-free training — provide the mathematical foundation for all of these approaches.
The most common choice is the factorized mixture path. It works independently per token position with a scheduler κt ∈ [0,1] (where κ0=0, κ1=1):
In words: for each position j independently, with probability κt the token equals the data zj, and with probability 1−κt it's drawn from the noise distribution. We can sample x ∼ pt(·|z) by flipping a coin per position:
The marginal probability path integrates over all data points z:
This satisfies p0 = pinit and p1 = pdata — exactly the interpolation from noise to data that we need.
A 12-token sequence being progressively unmasked. At κ=0, all tokens are noise (random). At κ=1, all tokens equal the data. Drag κ to see the interpolation.
Now we need a rate matrix whose CTMC "follows" the conditional probability path. This is the conditional rate matrix Qzt(y|x) — the discrete analogue of the conditional vector field utargett(x|z) from Chapter 3.
For the factorized mixture path, the conditional rate matrix has a beautifully simple form:
Let's unpack this case by case for a single position j:
| Condition | Rate | Interpretation |
|---|---|---|
| xj = zj (already correct) | 0 for all vi | Already at the data token — don't jump |
| xj ≠ zj, vi = zj (jump to correct) | κ̇t / (1 − κt) | Jump to the data token at rate κ̇/(1−κ) |
| xj ≠ zj, vi ≠ zj (jump to wrong) | 0 | Never jump to the wrong token |
Worked example: conditional rate matrix for d=1. Consider a single position with vocabulary V = {a, b, c} and data token z = b. Current state is x = a (incorrect). With κ̇t = 1 and κt = 0.5:
The conditional rate matrix for this single position:
| From ↓ / To → | a | b | c |
|---|---|---|---|
| a (current) | −2.0 | 2.0 | 0 |
| b (target) | 0 | 0 | 0 |
| c | 0 | 2.0 | −2.0 |
Row a: from state a, jump to b (the target) at rate 2.0, never jump to c. Row b: already at target, all rates zero. Row c: jump to b at rate 2.0, never jump to a. The conditional process moves only toward the data.
Why does the rate increase as κ → 1? At early times (κ small), 1−κ is close to 1, so the rate κ̇/(1−κ) is moderate. As t increases and κ approaches 1, the rate diverges — the process becomes increasingly urgent about reaching the target. This makes sense: near the end, any remaining incorrect tokens must be fixed quickly.
Just as in continuous flow matching (Theorem 12), the marginal rate matrix is a weighted average of conditional rate matrices:
where p1|t(z|x) = pt(x|z) pdata(z) / pt(x) is the posterior probability that the clean data is z, given the noisy observation x at time t.
For the factorized mixture path, the marginal rate matrix factorizes beautifully (Theorem 38):
The proof mirrors the continuous case (Theorem 12) but uses the KFE instead of the continuity equation. We need to show that the marginal rate matrix Qt(y|x) = ∑z Qzt(y|x) p1|t(z|x) satisfies the KFE for the marginal probability path pt(x) = ∑z pt(x|z) pdata(z).
Step 1: Differentiate the marginal path:
Step 2: Apply the conditional KFE (Qzt generates pt(·|z)):
Step 3: Multiply and divide by pt(y):
Step 4: Recognize the marginal rate matrix:
This is exactly the KFE for Qt and pt. By the sufficiency of the KFE (Proposition 2), the CTMC with rate matrix Qt generates the marginal probability path. Since p1 = pdata, the model generates from the data distribution at t=1. □
Let's work out the marginal rate matrix for the factorized mixture path in detail. Starting from Theorem 38:
Step 1: Start with the conditional rate matrix for position j (Example 37):
Step 2: Apply the marginalization trick Qt(vi,j|x) = ∑z Qzt(vi,j|x) p1|t(z|x):
Step 3: The δxj(vi) term doesn't depend on z, so ∑z δxj(vi) p1|t(z|x) = δxj(vi).
Step 4: For the first term, ∑z δzj(vi) p1|t(z|x) = p1|t(zj=vi|x) by marginalization over all positions except j.
Step 5: Combining: Qt(vi, j | x) = (κ̇t/(1−κt)) (p1|t(zj=vi|x) − δxj(vi)). □
The result is remarkable in its simplicity: the entire rate matrix reduces to the posterior token probabilities p1|t(zj=vi|x), which are exactly what a standard classifier network outputs (logits → softmax → probabilities).
Let's verify that the marginal rate matrix Qt(v, j|x) actually satisfies the rate matrix conditions:
Condition 1: Non-negative off-diagonal (v ≠ xj).
Since κ̇ ≥ 0, (1−κ) > 0 for t < 1, and probabilities are non-negative.
Condition 2: Diagonal equals negative sum of off-diagonal (v = xj).
Both conditions verified. The marginal rate matrix is a valid rate matrix for all t ∈ [0,1).
Numerical verification. Suppose at position j, xj = a, and the model predicts p1|t(zj=v|x) = {a: 0.1, b: 0.7, c: 0.2} with κ̇/(1−κ) = 3.0. Then:
Check: Q(a) + Q(b) + Q(c) = −2.7 + 2.1 + 0.6 = 0 ✓ (rows sum to zero).
The model predicts b is the most likely clean token (probability 0.7), so the rate of jumping to b (2.1) is highest. This is exactly what we want: the CTMC jumps preferentially toward the model's best guess of the correct token.
What happens in an Euler step? With step size h=0.01:
At each step, there's a 2.1% chance of jumping to the predicted clean token b, a 0.6% chance of jumping to c, and a 97.3% chance of staying at a. Over 100 steps (h=0.01 each), the cumulative probability of eventually jumping to b is approximately 1−0.973100 ≈ 93%. The process naturally converges to the most likely clean token.
We now have everything we need to train a discrete generative model. The rate matrix is determined by the posterior probabilities p1|t(zj = v | x). We parameterize these with a neural network and train with cross-entropy loss.
The Discrete Flow Matching (DFM) loss is:
This is exactly a cross-entropy classification loss. For each position j, predict which token zj was the original clean token, given the noisy sequence x at time t. Sum over all d positions.
The complete training algorithm:
python # Discrete Flow Matching Training for z in dataloader: # z: (B, d) token sequences t = torch.rand(B) # random time per sample kappa = scheduler(t) # kappa_t in [0,1] # Sample noisy x from factorized mixture path mask = torch.bernoulli(kappa.unsqueeze(1).expand(B, d)) # (B,d) noise = torch.randint(0, V, (B, d)) # random tokens x = mask * z + (1 - mask) * noise # keep or corrupt # Predict clean tokens from noisy sequence logits = model(x, t) # (B, d, V) logits per position # Cross-entropy loss per position loss = F.cross_entropy( logits.view(-1, V), z.view(-1), reduction='mean' ) loss.backward() optimizer.step()
Let's break down why this is so elegant. In continuous flow matching, the training target is the conditional vector field utargett(x|z) = α̇tz + β̇tε — a simple linear function that tells us "which direction to go." Here, the training target is the posterior probability p1|t(zj|x) — a simple classification target that tells us "which token should be here." Both are easy to compute from the training data, and both lead to simulation-free training.
Inference uses Algorithm 7: start from pure noise (or all [MASK] tokens), simulate the factorized CTMC for n steps using the Euler approximation, outputting the final sequence X1.
python # Discrete Flow Matching Inference x = torch.randint(0, V, (B, d)) # X_0 ~ p_init h = 1.0 / n_steps for i in range(n_steps): t = i * h logits = model(x, t) # (B, d, V) probs = F.softmax(logits, dim=-1) # p_theta(v|x,t) # Per-position Euler step kappa_dot = scheduler_deriv(t) rate_scale = kappa_dot / (1 - scheduler(t) + 1e-8) for j in range(d): # parallel in practice jump_probs = h * rate_scale * probs[:, j, :] jump_probs[:, x[:, j]] += 1 - jump_probs.sum(dim=-1, keepdim=True) x[:, j] = torch.multinomial(jump_probs, 1).squeeze()
A particularly elegant instantiation of discrete flow matching is the Masked Diffusion Language Model (MDLM). The idea: extend the vocabulary with a special [MASK] token, and use it as the initial noise state. The generation process starts from an all-masked sequence and progressively unmasks tokens.
Formally, we set:
The initial state is always the fully masked sequence [MASK][MASK]...[MASK]. The factorized mixture path then progressively reveals tokens: at time t, each position has probability κt of showing the true data token and probability 1−κt of being [MASK].
The generation process looks like this:
| Time | Sequence | Status |
|---|---|---|
| t = 0 | [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] | Fully masked |
| t = 0.25 | [MASK] [MASK] [MASK] on [MASK] [MASK] [MASK] | Some tokens revealed |
| t = 0.75 | [MASK] cat [MASK] on the mat [MASK] | Most tokens revealed |
| t = 1 | The cat sat on the mat . | Fully unmasked = generated text |
Watch a sequence progressively unmask from pure [MASK] to complete text. Press Play to animate. The progress bar shows time t from 0 to 1.
Why is this different from BERT? BERT masks a fixed fraction (~15%) of tokens and trains on predicting them. MDLM uses a continuous schedule: the masking fraction varies from 0% (κ=1, all data) to 100% (κ=0, all masked) during training. At each training step, a random time t is sampled, giving a random masking level κt. This continuous schedule is what makes MDLM a rigorous generative model — it learns the rate matrix at all noise levels, not just one.
The inference procedure for MDLM:
State-of-the-art. Current discrete diffusion models like LLaDA 2.0 (2025) use this recipe with transformer backbones trained on web-scale text data, achieving competitive performance with autoregressive language models. The advantages of the diffusion approach include parallel generation (all positions can be updated simultaneously) and natural handling of infilling tasks (fill in the middle of a sequence, not just left-to-right).
Advantages of discrete diffusion over autoregressive LLMs:
| Feature | Autoregressive (GPT) | Discrete Diffusion (MDLM) |
|---|---|---|
| Generation order | Left-to-right only | Any order (parallel) |
| Infilling | Requires special training | Natural (mask the middle) |
| Revision | Cannot change earlier tokens | Can revise any position at any step |
| Speed (sequential) | d steps (one per token) | n steps (tunable, often n ≪ d) |
| Training | Causal masking | Random masking at varying levels |
| Maturity | Highly optimized (2020–2026) | Rapidly improving (2024–2026) |
This chapter extended flow matching from continuous (images) to discrete (text) data. The principles transferred remarkably cleanly:
| Concept | Continuous (Rd) | Discrete (Vd) |
|---|---|---|
| Dynamics | ODE/SDE | CTMC |
| Generator | Vector field ut(x) | Rate matrix Qt(y|x) |
| Conditional path | Gaussian: pt(x|z) = N(αz, β2I) | Mixture: (1−κ)pinit + κδz |
| Marginalization trick | ut(x) = E[ut(x|z) | x] | Qt(y|x) = E[Qz(y|x) | x] |
| Training loss | MSE (regression) | Cross-entropy (classification) |
| Network architecture | DiT / U-Net | Transformer |
| Simulation | Euler ODE solver | Euler CTMC sampler |
This concludes the book. We have traveled from basic probability (Chapter 1) through flow matching (Chapter 3), score functions (Chapter 4), guidance (Chapter 5), large-scale architectures (Chapter 6), and now discrete diffusion (Chapter 7). The complete pipeline — from mathematical foundations to production systems generating images, videos, and text — rests on a single idea: define a path from noise to data, derive the conditional dynamics, and train a neural network to approximate the marginal dynamics.
Open research directions:
• Scaling discrete diffusion to match autoregressive LLMs on benchmarks (LLaDA 2.0 with 100B parameters is a promising start).
• Multimodal generation: joint text-and-image generation using mixed continuous-discrete state spaces — generate text and images in a single model.
• Structured discrete data: protein design (amino acid sequences), drug discovery (molecular graphs), code generation.
• Inference speed: reducing the number of sampling steps while maintaining quality. Autoregressive models need d steps for d tokens; can discrete diffusion do better?
• Consistency models: learning to denoise in a single step, distilling the multi-step CTMC sampling into one forward pass.
Looking back across all 7 chapters, the complete generative modeling pipeline rests on three pillars:
Every model we studied — flow matching, DDPM, score matching, Stable Diffusion 3, Movie Gen, MDLM — is an instantiation of this template. The differences are in the choice of probability path, the architecture, and the engineering details. The mathematical framework is universal.
Number of sampling steps. Autoregressive models need exactly d steps to generate a sequence of length d (one token per step). Discrete diffusion models can use any number of steps n. With n < d, multiple tokens can be updated per step, enabling faster generation. Empirically, n ≈ 100–1000 steps gives good results for text, while some methods achieve decent quality with as few as 10–50 steps using improved sampling strategies.
Guidance for text. CFG extends to discrete diffusion just as it extends to continuous models. We train with label dropping (replacing the prompt with ∅) and combine conditional and unconditional rate matrices at inference:
Temperature and top-p sampling. At each step, instead of sampling from the full Euler transition distribution, we can apply temperature scaling or nucleus (top-p) sampling to the token probabilities. This provides additional control over the diversity/quality tradeoff, complementing guidance.
Training efficiency. The cross-entropy loss for discrete flow matching is computed over all d positions simultaneously, making efficient use of GPU parallelism. Unlike autoregressive models (which need causal masking), discrete diffusion models process the full sequence with bidirectional attention — every position can attend to every other position.
Choice of scheduler κt. The scheduler controls the rate of unmasking. Common choices include:
| Scheduler | κt | Behavior |
|---|---|---|
| Linear | t | Uniform unmasking rate |
| Cosine | (1 − cos(πt/2)) | Slow start, fast middle, slow end |
| Square root | √t | Fast early unmasking, slow refinement |
| Sigmoid | sigmoid(a(t−0.5)) | S-curve, tunable sharpness |
The cosine schedule is most common, as it spends more time at intermediate noise levels where the model has the most to learn, rather than wasting time at extremely noisy or nearly clean states.
Applications beyond text. Discrete diffusion models have been applied to several domains beyond language:
• Protein design: amino acid sequences (V=20 amino acids, d=100–1000 residues). Models like EvoDiff generate novel protein sequences that fold into stable structures.
• DNA/RNA design: nucleotide sequences (V=4 bases, d=thousands). Discrete diffusion naturally handles the discrete nature of genetic sequences.
• Code generation: program tokens with bidirectional attention, enabling infilling and completion tasks that autoregressive models struggle with.
• Molecular design: discrete representations of molecules (SMILES strings or token-based molecular graphs).
The state-of-the-art discrete diffusion language model is LLaDA 2.0 (2025), scaling to 100 billion parameters. Here is how it compares to autoregressive models:
python # Autoregressive generation (GPT-style) for i in range(d): # d sequential steps logits = gpt(x[:i]) # causal attention, O(i^2) x[i] = sample(logits[-1]) # next-token prediction # Cannot go back and revise x[0..i-1]! # Discrete diffusion generation (MDLM-style) x = [MASK] * d # all masked for step in range(n): # n steps (n can be << d) logits = mdlm(x, t) # bidirectional attention, O(d^2) # Update ALL positions in parallel for j in range(d): # parallel! if should_update(j, t): x[j] = sample(logits[j]) # Can revise any position at any step!
The discrete diffusion approach trades sequential token generation for iterative refinement. Each step processes the full sequence with bidirectional attention, allowing the model to consider the full context (both left and right) when making predictions. This is particularly valuable for tasks like infilling (filling in the middle of a sequence), translation, and editing.
For quick reference, all essential formulas from this chapter:
| Concept | Formula |
|---|---|
| State space | S = Vd, |S| = Vd |
| Rate matrix conditions | Q(y|x) ≥ 0 for y≠x; Q(x|x) = −∑y≠xQ(y|x) |
| Rate ↔ transition | (d/dh)pt+h|t(y|x)|h=0 = Qt(y|x) |
| Euler simulation | p(y|x) ≈ δy=x + h·Q(y|x) |
| KFE | (d/dt)pt(x) = ∑yQt(x|y)pt(y) |
| Factorized mixture path | pt(x|z) = ∏j[(1−κ)pinit(xj) + κδzj(xj)] |
| Conditional rate matrix | Qz(v,j|x) = (κ̇/(1−κ))(δzj(v) − δxj(v)) |
| Marginal rate matrix | Q(v,j|x) = (κ̇/(1−κ))(p1|t(zj=v|x) − δxj(v)) |
| DFM loss | L = E[∑j −log pθ1|t(zj|x)] |
Let's verify that we can compute the DFM loss efficiently. For each training example:
1. Sample data z ∼ pdata: A sequence of d tokens. Cost: O(1) (read from dataset).
2. Sample time t ∼ Unif[0,1]: One random number. Cost: O(1).
3. Sample noisy x ∼ pt(·|z): For each position j, flip a Bernoulli(κt) coin: heads = keep zj, tails = sample from pinit. Cost: O(d).
4. Forward pass: Run the transformer on (x, t) to get logits ∈ Rd×V. Cost: O(d2 · dmodel + d · dmodel · V) — standard transformer cost.
5. Compute loss: Cross-entropy between logits and z at each position. Cost: O(d · V).
The total per-example cost is dominated by the transformer forward pass, which is the same as training any language model. No CTMC simulation is needed during training — only at inference. This "simulation-free training" property is inherited directly from flow matching.
Let's trace through one training step of discrete flow matching to make everything concrete.
Setup: Vocabulary V = {a, b, c, d, e, [MASK]}, sequence length d=4, scheduler κt = t (linear).
Step 1: Sample data z = (b, a, d, c) from the dataset.
Step 2: Sample time t = 0.6. So κ0.6 = 0.6.
Step 3: Sample noisy x from pt(·|z).
For each position j, flip a coin with P(heads) = 0.6:
• Position 1: heads → x1 = z1 = b (kept)
• Position 2: tails → x2 = [MASK] (masked)
• Position 3: heads → x3 = z3 = d (kept)
• Position 4: tails → x4 = [MASK] (masked)
Noisy sequence: x = (b, [MASK], d, [MASK])
Step 4: Forward pass. The transformer receives x = (b, [MASK], d, [MASK]) and t=0.6. It outputs 4 probability distributions, one per position:
| Pos | P(a) | P(b) | P(c) | P(d) | P(e) | Target zj |
|---|---|---|---|---|---|---|
| 1 | 0.05 | 0.80 | 0.05 | 0.05 | 0.05 | b |
| 2 | 0.60 | 0.10 | 0.10 | 0.10 | 0.10 | a |
| 3 | 0.05 | 0.05 | 0.05 | 0.80 | 0.05 | d |
| 4 | 0.10 | 0.10 | 0.50 | 0.10 | 0.20 | c |
Step 5: Compute loss. Cross-entropy at each position:
The loss is highest at position 4 (where the model is least confident about the correct token c). Gradient descent will push the model to increase its confidence at position 4. Over many training steps, the model learns to predict the clean token at each position from the noisy context.
That's it — one training step of discrete flow matching. The beauty is in its simplicity: corrupt a sequence, predict the originals, compute cross-entropy. No ODEs, no SDEs, no simulation. Just classification.
Now let's trace one inference step. We have a partially unmasked sequence x = (b, [MASK], d, [MASK]) at time t = 0.6 and want to advance to t + h = 0.62 (step size h = 0.02).
Step 1: Get the model's predicted probabilities (same as training table above).
Step 2: Compute the rate matrix entries for each position. With κ̇t = 1 (linear schedule, derivative = 1) and 1−κt = 0.4:
Step 3: Compute jump probabilities for position 2 (currently [MASK]):
So at this step, position 2 has a 95% chance of staying [MASK] and a 3% chance of jumping to the correct token "a". Over many steps, the cumulative probability of unmasking increases, and by t≈1, all positions will be unmasked.
Similarly, for position 4 (currently [MASK], model predicts P(c)=0.50):
Position 4 also stays masked with 95% probability. But when it does unmask, "c" (the correct token) is the most likely destination at 2.5%, followed by "e" at 1%. Over many steps, the tokens gradually unmask, with the model's highest-confidence predictions unmasking first.
For positions already at the correct data token (positions 1 and 3 in our example), the rate matrix produces:
If the model is confident that xj is already correct (p1|t(zj=xj|x) ≈ 1), then Q(xj, j|x) ≈ (κ̇/(1−κ))(1 − 1) = 0 and Q(v, j|x) ≈ (κ̇/(1−κ))(0 − 0) = 0 for v ≠ xj. All rates are near zero — the token stays put. This is exactly the behavior we want: correct tokens should not be disturbed.
Step 4: Sample from this categorical distribution for each position independently. Most positions stay unchanged; occasionally one unmasks. This is the parallel Euler approximation from Algorithm 7.
python # Complete Discrete Flow Matching inference def generate(model, seq_len, vocab_size, n_steps=256): # Initialize: all [MASK] tokens mask_id = vocab_size # [MASK] = last token in vocab x = torch.full((1, seq_len), mask_id, dtype=torch.long) h = 1.0 / n_steps for i in range(n_steps): t = torch.tensor([i * h]) kappa = scheduler(t) kappa_dot = scheduler_deriv(t) rate_scale = kappa_dot / (1 - kappa + 1e-8) # Forward pass: predict clean token probabilities logits = model(x, t) # (1, d, V+1) probs = F.softmax(logits[:, :, :vocab_size], dim=-1) # exclude [MASK] from targets # Per-position Euler step for j in range(seq_len): # parallel in practice if x[0, j] == mask_id: # only update masked positions jump_probs = h * rate_scale * probs[0, j] stay_prob = 1 - jump_probs.sum() all_probs = torch.cat([jump_probs, stay_prob.unsqueeze(0)]) sampled = torch.multinomial(all_probs, 1) if sampled < vocab_size: x[0, j] = sampled # unmask! return x # fully generated sequence
"Creating noise from data is easy; creating data from noise is generative modeling." — Holderrieth & Erives, 2026