Holderrieth & Erives, Chapter 7

Discrete Diffusion Models

Building language models with diffusion: from continuous vectors to discrete tokens via continuous-time Markov chains.

Prerequisites: Flow matching (Ch 3) + Basic probability. That's it.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Discrete?

Everything we've learned about flow matching and diffusion operates in continuous Euclidean space Rd. We start with noise (a Gaussian), define a probability path, learn a vector field, and simulate an ODE to generate data. The math is beautiful and the results are stunning for images, videos, and audio.

But what about text?

Text is not a point in Rd. "The cat sat on the mat" is a sequence of discrete tokens. There's no meaningful notion of "halfway between cat and dog" in token space. You can't define a continuous ODE trajectory from noise to "The cat sat on the mat" because the space isn't continuous — it's discrete.

The fundamental challenge. SDEs and ODEs don't exist on discrete state spaces. We can't use vector fields or score functions. We need a new mathematical framework that plays the same role as ODEs/SDEs but operates on discrete sequences. That framework is the continuous-time Markov chain (CTMC).

The key insight is that the principles of flow matching transfer perfectly to discrete spaces:

Continuous (images)Discrete (text)
State space RdState space S = Vd (sequences of tokens)
ODE/SDE dynamicsCTMC dynamics
Vector field ut(x)Rate matrix Qt(y|x)
Gaussian noise → dataUniform/masked noise → data
Regression loss (MSE)Classification loss (cross-entropy)

By the end of this chapter, you'll see that training a discrete diffusion model reduces to training a simple token classifier — predict which token should go in each position, given a partially corrupted sequence. This is remarkably similar to how BERT-style masked language models work, but with a rigorous generative framework underneath.

Why can't we directly apply continuous flow matching to text data?

Chapter 1: State Space

Let's formalize what "discrete data" means. We start with a vocabulary V = {v1, v2, ..., vV} of V possible tokens. For language, these could be words, subwords (like BPE tokens), or characters. For DNA, they'd be the four bases {A, C, G, T}.

A sequence of d tokens lives in the state space:

S = Vd,   where d is the sequence length, V is the vocabulary size

A single element x ∈ S looks like x = (x1, x2, ..., xd), where each xj ∈ V. For example, with vocabulary {"the", "cat", "sat", "on", "mat"} and d=5:

x = ("the", "cat", "sat", "on", "mat") ∈ V5

The size problem. The state space S = Vd is enormous. With a vocabulary of V=50,000 tokens and a sequence length of d=1,024:

|S| = Vd = 50,0001,024 ≈ 104,813

This is astronomically larger than any continuous distribution we've dealt with. We can't enumerate states or store full probability vectors. Every algorithm must work with factorized representations that scale as d · V rather than Vd.

Key point. A "generative model" for discrete sequences means learning a distribution pdata over S. Sampling from this distribution produces new sequences — new sentences, new DNA strands, new code snippets — that look like the training data.

Now, let Xt be a stochastic process on S: a random trajectory X: [0,1] → S that maps continuous time to discrete states. Unlike ODEs (where Xt moves smoothly), Xt jumps between states. At most times it stays put; occasionally it jumps to a neighboring state.

Concrete example: language. Consider generating a 7-word sentence from a vocabulary of 50,000 tokens. The state space has |S| = 50,0007 ≈ 1033 possible sentences. The generative process starts from a random (or masked) sequence and progressively refines it, jumping one token at a time, until a coherent sentence emerges.

Contrast with autoregressive models. GPT-style models generate text left-to-right: they sample token 1, then token 2 conditioned on token 1, and so on. Discrete diffusion models generate all tokens simultaneously: at each step, any position can be updated. This parallelism is a key advantage — the model can revise earlier tokens based on later context, something autoregressive models cannot do.

No diffusion in "discrete diffusion." Despite the name, there is no mathematical diffusion process in discrete state spaces. SDEs (stochastic differential equations) require continuous spaces to define derivatives and Brownian motion. The term "discrete diffusion" is a historical convention — what we actually have is a CTMC (continuous-time Markov chain) that mimics the role of diffusion in the discrete setting. Think of it as "discrete flow matching" (the more accurate term used by Gat et al., 2024).
For a vocabulary V of size 100 and sequence length d=10, how large is the state space S = Vd?

Chapter 2: Rate Matrices

In continuous flow matching, the dynamics are governed by a vector field ut(x) that tells us "which direction to move." In discrete space, there are no directions — only jumps. The analogue of a vector field is a rate matrix Qt(y|x) that tells us "how fast to jump from state x to state y."

Formally, a rate matrix is a function Q: S × S × [0,1] → R that satisfies two conditions:

Condition 1: Non-negative off-diagonal. For x ≠ y:

Qt(y|x) ≥ 0   whenever x ≠ y

The rate of jumping from x to a different state y can only be non-negative. (Not jumping just means rate = 0.)

Condition 2: Rows sum to zero.

Qt(x|x) = − ∑y≠x Qt(y|x)   for all x

The diagonal entry (rate of "staying") equals the negative sum of all outgoing rates. This is a consistency condition: you either stay at x or leave to some y — there's no third option.

Read the rate matrix as a transition intensity. If Qt(y|x) = 3, it means "at time t, the process jumps from x to y at a rate of 3 per unit time." A high rate means frequent jumps; rate 0 means this jump never happens. The diagonal Qt(x|x) is always ≤ 0 and tells you the total rate of leaving state x.

Worked example: two-state system. Consider S = {a, b} with a constant rate λ > 0. The rate matrix is:

Q =
  a: [−λ, λ]
  b: [λ, −λ]

This says: when at a, jump to b at rate λ. When at b, jump to a at rate λ. The transition probabilities over a time increment h are:

P(Xt+h = a | Xt = a) = (1 + e−2λh) / 2
P(Xt+h = b | Xt = a) = (1 − e−2λh) / 2

As h → ∞, the memory decays exponentially and the chain converges to uniform: P(a) = P(b) = 1/2. Higher λ means faster convergence.

Let's verify the transition probabilities satisfy the rate equation. Take the derivative at h=0:

(d/dh) P(Xt+h=b | Xt=a) |h=0 = (d/dh) [(1−e−2λh)/2] |h=0 = 2λe−2λ·0/2 = λ = Q(b|a) ✓

And for staying at a:

(d/dh) P(Xt+h=a | Xt=a) |h=0 = (d/dh) [(1+e−2λh)/2] |h=0 = −2λe0/2 = −λ = Q(a|a) ✓

The rate matrix conditions are satisfied. This gives us confidence that the rate matrix formalism correctly describes the dynamics.

Rate Matrix Visualizer

A two-state CTMC with rate λ. The particle starts at state A and jumps between states. Adjust λ to see how it affects the jump frequency and equilibrium.

Rate λ 3.0
In a rate matrix Q, if Qt(y|x) = 5 for x ≠ y, what does this mean?

Chapter 3: CTMCs

A Continuous-Time Markov Chain (CTMC) is a stochastic process Xt on a discrete state space S that has the Markov property: the future depends only on the present, not the past.

P(Xt+h | Xt, Xt1, ..., Xtk) = P(Xt+h | Xt)   for all 0 ≤ t1 < ... < tk < t

Think of a particle sitting on one of the states in S. At random times, it jumps to another state. The rate of jumping is governed by the rate matrix Qt. The fundamental relationship is:

(d/dh) pt+h|t(Xt+h = y | Xt = x) |h=0 = Qt(y|x)

This says: the instantaneous rate of change of the transition probability equals the rate matrix. It's the discrete analogue of "the velocity field determines the ODE trajectory."

Existence and uniqueness (Theorem 33). For any bounded, time-continuous rate matrix Qt, there exists a unique Markov chain whose transition probabilities satisfy the equation above. This is the discrete analogue of the Picard-Lindelöf theorem for ODEs. For machine learning, it means we can freely parameterize Qt with a neural network and know that a unique CTMC corresponds to it.

The Kolmogorov Forward Equation. Just as the continuity equation governs how probability densities evolve under ODEs in continuous space, the Kolmogorov Forward Equation (KFE) governs how probability distributions evolve under CTMCs:

(d/dt) pt(x) = ∑y∈S Qt(x|y) pt(y)

This says: the rate of change of the probability of being in state x equals the sum of "incoming flow" from all other states y (weighted by their probabilities and the rate of jumping from y to x). In matrix form:

(d/dt) pt = Qt pt

This is a linear ODE on the vector space of probability distributions. Its uniqueness follows from standard ODE theory. The KFE is the key tool for proving that marginal rate matrices correctly generate the desired data distribution.

Compare with the continuous case:

ConceptContinuous (ODEs)Discrete (CTMCs)
Evolution equationContinuity eq: ∂p/∂t + div(p u) = 0KFE: dp/dt = Q p
NaturePDE on continuous densityLinear ODE on probability vector
UniquenessPicard-Lindelof theoremTheorem 33 (rate matrix to unique CTMC)
SimulationEuler: x = x + h u(x,t)Euler: sample from delta + h Q

Simulating a CTMC (Euler Method)

How do we simulate a trajectory? For small step size h, we can approximate the transition probability:

pt+h|t(y|x) ≈ δy=x + h · Qt(y|x)

In words: at each step, the particle either stays put (with probability 1 − h∑y≠xQ(y|x)) or jumps to some neighbor y (with probability h · Q(y|x)). We sample from this discrete distribution at each time step.

python
# Euler simulation of a CTMC
def simulate_ctmc(Q_fn, x0, n_steps=100):
    x = x0
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        # Get rate matrix column for current state x
        rates = Q_fn(x, t)  # shape: (|S|,)
        # Euler transition probabilities
        probs = h * rates      # off-diagonal
        probs[x] = 1 - h * rates[rates > 0].sum()  # diagonal
        # Sample next state
        x = np.random.choice(len(probs), p=probs)
    return x
CTMC Particle Jumping

A particle jumps between 5 states according to a rate matrix. Watch how it spends time in different states. The bar chart shows the empirical distribution (how often each state is visited).

The Euler approximation for CTMC simulation uses p(y|x) ≈ δy=x + h·Q(y|x). What does δy=x represent?

Chapter 4: Factorized Models

A CTMC model parameterizes the rate matrix with a neural network: x → {Qθt(y|x)}y∈S. But there's a fatal problem: the output has size |S| = Vd. For V=50,000 and d=1,024, that's 50,0001,024 values. No computer can store this.

The solution is factorization: restrict the rate matrix so that the process can only change one token at a time.

Qθt(y|x) = 0   whenever y and x differ in more than one position

This means if x = (x1, x2, ..., xd), the only reachable states from x are those that differ in exactly one token. We call these the neighbors N(x) of x.

With factorization, the network output has shape d × V instead of Vd:

x → (Qθt(vi, j | x))i=1..V, j=1..d

Here, Qθt(vi, j | x) is the rate of replacing the j-th token of x with vocabulary item vi. This scales linearly in the sequence length, not exponentially.

d × V = manageable. For V=50,000 and d=1,024, the output is 50,000 × 1,024 ≈ 51 million values. Large, but tractable. Compare this to Vd = 50,0001,024 ≈ 104,813. Factorization turns an impossible problem into a practical one.

The per-position rate matrix must satisfy the usual conditions:

Qθt(v, j | x) ≥ 0   if v ≠ xj
Qθt(xj, j | x) = −∑v≠xj Qθt(v, j | x)

A transformer fits perfectly: input a sequence of d tokens, output d × V logits. The rate conditions can be enforced with a simple softmax followed by negation for the diagonal.

python
# Factorized CTMC model architecture
class DiscreteDiffusionModel(nn.Module):
    def __init__(self, vocab_size, seq_len, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.time_embed = FourierFeatures(d_model)
        self.transformer = TransformerEncoder(d_model, n_layers=12)
        self.head = nn.Linear(d_model, vocab_size)  # output logits

    def forward(self, x, t):
        # x: (B, d) token indices
        # t: (B,) scalar times
        h = self.embed(x)             # (B, d, d_model)
        h = h + self.time_embed(t)    # add time conditioning
        h = self.transformer(h)       # (B, d, d_model)
        logits = self.head(h)         # (B, d, V) per-position logits
        return logits
        # To get rate matrix: softmax(logits) gives probabilities
        # Rate = kappa_dot/(1-kappa) * (softmax(logits) - one_hot(x))

The output shape is (B, d, V) — a probability distribution over V tokens for each of the d positions. This is exactly the shape of a standard language model output, making standard transformer architectures directly applicable.

Worked example: factorized rate matrix for d=2, V=3. Suppose our vocabulary is V={A, B, C} and our sequence length is d=2. The full state space has |S| = 32 = 9 states: AA, AB, AC, BA, BB, BC, CA, CB, CC. A full rate matrix would be 9×9 = 81 entries.

With factorization, only single-token changes are allowed. From state AB, the neighbors are: {AB, BB, CB, AA, AB, AC} (change position 1 to any value, or change position 2 to any value). That's at most d×V = 2×3 = 6 rates to store, not 9.

x = (A, B) → network outputs a 2×3 rate matrix:
Position 1: [Q(A,1|x), Q(B,1|x), Q(C,1|x)] — rates for changing first token
Position 2: [Q(A,2|x), Q(B,2|x), Q(C,2|x)] — rates for changing second token

The diagonal entries (current tokens) are negative: Q(A,1|x) = −(Q(B,1|x) + Q(C,1|x)) and Q(B,2|x) = −(Q(A,2|x) + Q(C,2|x)).

Factorized Simulation

With factorization, simulation becomes embarrassingly parallel: at each step, update each token independently:

python
# Parallel per-token Euler step
for j in range(d):  # in parallel!
    rates_j = Q_theta(x, t)[j]  # (V,) rates for position j
    # Euler transition: stay or jump
    probs_j = h * rates_j
    probs_j[x[j]] = 1 - h * rates_j[rates_j > 0].sum()
    x[j] = sample_categorical(probs_j)
Why must the rate matrix be factorized for practical CTMC models?

Chapter 5: Probability Paths

Just like in continuous flow matching, we need a probability path that interpolates between noise and data. The key difference: instead of smoothly blending two distributions (like a Gaussian path does), a discrete probability path teleports probability mass — fading one distribution in and another out.

A discrete conditional probability path pt(x|z) satisfies:

p0(·|z) = pinit   (noise, independent of z)
p1(·|z) = δz   (all mass on the data point z)

Notice the similarity to continuous flow matching (Chapter 3): there we had p0(·|z) = N(0, I) (Gaussian noise) and p1(·|z) = δz (point mass at data). Here it's the same structure, just with discrete distributions.

Choices for the initial distribution pinit.

ChoicepinitEffectUsed by
UniformUnif(V) per positionEach masked token replaced by random vocabulary itemGeneral DFM
Mask tokenδ[MASK] per positionEach masked token replaced by special [MASK]MDLM
Data distributionEmpirical marginal per positionMasked tokens look like "typical" tokensSome variants

The mask-token approach (MDLM) is the most popular because it gives the model a clear signal: "this position is unknown." With uniform noise, the model must distinguish between a token that is real data and one that is random noise — a harder discrimination task.

Worked example: sampling from the factorized mixture path. Data z = (T, h, e), scheduler κt = 0.4, noise = [MASK].

For each position j = 1, 2, 3:

• Sample mj ∼ Bernoulli(0.4): probability 0.4 of keeping the data token, 0.6 of masking.

• If mj = 1: xj = zj (keep data). If mj = 0: xj = [MASK].

Possible outcomes (each with its probability):

mxP
(0,0,0)[M] [M] [M]0.63 = 0.216
(1,0,0)T [M] [M]0.4 · 0.62 = 0.144
(0,1,0)[M] h [M]0.144
(0,0,1)[M] [M] e0.144
(1,1,0)T h [M]0.42 · 0.6 = 0.096
(1,0,1)T [M] e0.096
(0,1,1)[M] h e0.096
(1,1,1)T h e0.43 = 0.064

At κ = 0.4, the most likely outcome is fully masked (21.6%), and the data is fully revealed only 6.4% of the time. As κ increases toward 1, the probabilities shift toward more tokens being revealed.

Expected number of revealed tokens. At time t with scheduler κt, each position is independently revealed with probability κt. The expected number of revealed tokens is d·κt. For d=1024 and κ=0.4: expected 410 tokens revealed, 614 masked. At κ=0.9: expected 922 revealed, 102 masked. The model must predict these 102 remaining tokens from the 922 visible ones — a much easier task than predicting all 1024 from scratch.

This progressive difficulty is key to training: at high κ (easy), the model learns simple associations (local context). At low κ (hard), it learns global structure (long-range dependencies). The uniform time sampling t ∼ Unif[0,1] ensures the model is trained at all difficulty levels.

The loss at different noise levels. The DFM loss ∑j −log pθ(zj|x) behaves differently at different times:

Time tκtMasked fractionLoss character
t ≈ 0~0~100% maskedVery hard. Model must guess tokens from nearly zero context. Loss is high.
t ≈ 0.3~0.3~70% maskedHard. Some context available. Model learns global patterns.
t ≈ 0.7~0.7~30% maskedModerate. Most tokens visible. Model learns local corrections.
t ≈ 1~1~0% maskedEasy. Almost all tokens visible. Loss is low (just copy the input).

The loss contribution is weighted uniformly across all t values. Some practitioners apply importance weighting (spending more training time at intermediate t values where the loss gradient is most informative) to improve sample quality.

Connection to denoising autoencoders. At each training step, the model receives a corrupted sequence x and must predict the clean sequence z. This is precisely the framework of a denoising autoencoder, but with a continuous corruption schedule and a rigorous generative model underneath. The continuum of corruption levels (parameterized by t) is what transforms a simple denoising task into a full generative model.

Architecture Choices for Discrete Diffusion

The neural network for discrete diffusion is typically a standard transformer encoder (not a decoder, since there's no causal masking). The key architectural decisions:

1. Bidirectional attention. Every token can attend to every other token, including tokens to its right. This is fundamentally different from GPT-style autoregressive models. Architecturally, it's identical to BERT.

2. Time conditioning. The scalar time t is embedded using Fourier features (identical to Chapter 6) and injected via AdaLN (adaptive layer normalization) at every transformer layer. This tells the model the current noise level.

3. Output head. A linear layer maps the transformer's hidden state at each position to logits over the vocabulary V. A softmax produces the predicted clean-token probabilities pθ1|t(v|x).

python
# Discrete diffusion transformer (detailed)
class DiffusionLM(nn.Module):
    def __init__(self, V, d, d_model, n_layers, n_heads):
        super().__init__()
        self.tok_emb = nn.Embedding(V + 1, d_model)  # +1 for [MASK]
        self.pos_emb = nn.Embedding(d, d_model)
        self.time_mlp = nn.Sequential(
            FourierFeatures(d_model),
            nn.Linear(d_model, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model)
        )
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads, use_adaln=True)
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(d_model, V)  # predict V tokens (not [MASK])

    def forward(self, x, t):
        h = self.tok_emb(x) + self.pos_emb(torch.arange(x.size(1)))
        t_emb = self.time_mlp(t)
        for layer in self.layers:
            h = layer(h, t_emb)  # AdaLN with time embedding
        logits = self.head(h)  # (B, d, V)
        return logits

4. Scale. LLaDA 2.0 uses a 100B-parameter transformer with 80 layers, d_model=12288, and 96 attention heads. The architecture is essentially identical to a standard LLM (like LLaMA) but with bidirectional attention and AdaLN time conditioning.

5. Training data. Like autoregressive LLMs, discrete diffusion models are trained on web-scale text: Common Crawl, Wikipedia, books, code, and other text corpora. The data pipeline is identical — only the loss function changes from next-token prediction (cross-entropy with causal masking) to masked-token prediction (cross-entropy with random masking at variable rates).

Guidance for Discrete Diffusion

Classifier-free guidance (Chapter 5) extends naturally to discrete models. The rate matrix combines conditional and unconditional predictions:

t(v, j | x, y) = (κ̇/(1−κ)) · ((1−w) p1|t(zj=v|x,∅) + w · p1|t(zj=v|x,y) − δxj(v))

Training uses the same label-dropping trick: with probability η, replace the prompt y with a null token ∅. At inference, evaluate the model twice per step (once with y, once with ∅) and combine the predicted probabilities.

python
# CFG for discrete diffusion
def cfg_discrete_step(model, x, t, y, null_y, w):
    # Two forward passes
    logits_cond = model(x, t, y)          # (B, d, V)
    logits_uncond = model(x, t, null_y)   # (B, d, V)

    # Combine probabilities via CFG
    probs_cond = F.softmax(logits_cond, dim=-1)
    probs_uncond = F.softmax(logits_uncond, dim=-1)
    probs_cfg = (1 - w) * probs_uncond + w * probs_cond

    # Renormalize (CFG can push probs outside [0,1])
    probs_cfg = probs_cfg.clamp(min=0)
    probs_cfg = probs_cfg / probs_cfg.sum(dim=-1, keepdim=True)

    return probs_cfg

Note the renormalization step: since CFG is a linear combination with w>1, the resulting "probabilities" can go negative or exceed 1. We clamp and renormalize to get a valid probability distribution. This is analogous to how CFG in continuous space can produce velocity vectors that are "too large" — a consequence of the heuristic nature of guidance.

Typical guidance scales for text generation. Discrete diffusion models typically use lower guidance scales than image models. For class-conditional text generation, w ∈ [1.5, 3.0] is typical. For text-conditional generation (e.g., prompt-guided story completion), w ∈ [2.0, 5.0] works well. Higher values tend to produce repetitive, less diverse text.

Relationship to Other Discrete Generative Models

Discrete diffusion models are one of several approaches to generating discrete sequences. Here is how they compare:

ModelGeneration mechanismStrengthsWeaknesses
Autoregressive (GPT)Left-to-right, one token per stepExcellent quality, simple trainingSequential (slow), no revision, no infilling
Discrete diffusion (MDLM)All positions updated in parallel via CTMCParallel generation, natural infilling, revisionLess mature, needs many sampling steps
Non-autoregressive (NAT)All tokens generated in one passVery fast inferencePoor quality without iterative refinement
Energy-based modelsMCMC sampling from learned energy functionFlexible, principledSlow mixing, hard to train
VAE-basedLatent variable z → decode to sequenceFast generation, smooth latent spacePosterior collapse, blurry outputs

Discrete diffusion occupies a sweet spot: it's more flexible than autoregressive models (parallel generation, infilling) while being more principled and higher-quality than non-autoregressive models. As scaling and training recipes improve, discrete diffusion models are increasingly competitive with autoregressive LLMs on standard language modeling benchmarks.

The future convergence. There is growing evidence that the distinction between autoregressive and diffusion-based language models may eventually blur. Some recent work combines both approaches: use autoregressive generation for the initial draft, then refine with discrete diffusion. Others use diffusion-based models for planning and autoregressive models for execution. The principles learned in this chapter — CTMCs, rate matrices, and simulation-free training — provide the mathematical foundation for all of these approaches.

The most common choice is the factorized mixture path. It works independently per token position with a scheduler κt ∈ [0,1] (where κ0=0, κ1=1):

pt(x|z) = ∏j=1d [(1 − κt) pinit(xj) + κt δzj(xj)]

In words: for each position j independently, with probability κt the token equals the data zj, and with probability 1−κt it's drawn from the noise distribution. We can sample x ∼ pt(·|z) by flipping a coin per position:

For each position j = 1, ..., d
Sample mask
mj ∼ Bernoulli(κt)
Sample noise
ξj ∼ pinit
Set token
xj = mj · zj + (1 − mj) · ξj
Compare with Gaussian paths. In continuous flow matching, the path pt(x|z) = N(αtz, βt2I) moves probability mass from noise toward data via a smooth interpolation. In the discrete case, there's no notion of "moving" in token space. Instead, the factorized mixture path independently masks/unmasks each token: at t=0, everything is noise; at t=1, everything is the data point z. The scheduler κt controls the rate of unmasking.

The marginal probability path integrates over all data points z:

pt(x) = ∑z∈S pt(x|z) pdata(z)

This satisfies p0 = pinit and p1 = pdata — exactly the interpolation from noise to data that we need.

Factorized Masking with κ Slider

A 12-token sequence being progressively unmasked. At κ=0, all tokens are noise (random). At κ=1, all tokens equal the data. Drag κ to see the interpolation.

κt 0.00
In the factorized mixture path, what does the scheduler κt control?

Chapter 6: Conditional Rate Matrix

Now we need a rate matrix whose CTMC "follows" the conditional probability path. This is the conditional rate matrix Qzt(y|x) — the discrete analogue of the conditional vector field utargett(x|z) from Chapter 3.

For the factorized mixture path, the conditional rate matrix has a beautifully simple form:

Qzt(vi, j | xj) = (κ̇t / (1 − κt)) · (δzj(vi) − δxj(vi))

Let's unpack this case by case for a single position j:

ConditionRateInterpretation
xj = zj (already correct)0 for all viAlready at the data token — don't jump
xj ≠ zj, vi = zj (jump to correct)κ̇t / (1 − κt)Jump to the data token at rate κ̇/(1−κ)
xj ≠ zj, vi ≠ zj (jump to wrong)0Never jump to the wrong token
The conditional rate matrix only allows jumps to zj. If a token is wrong, the only permitted transition is to the correct data token. No detours, no random exploration — the conditional process goes directly to the target. This is why the conditional rate matrix is so simple.

Worked example: conditional rate matrix for d=1. Consider a single position with vocabulary V = {a, b, c} and data token z = b. Current state is x = a (incorrect). With κ̇t = 1 and κt = 0.5:

Rate scale = κ̇/(1−κ) = 1/(1−0.5) = 2.0

The conditional rate matrix for this single position:

From ↓ / To →abc
a (current)−2.02.00
b (target)000
c02.0−2.0

Row a: from state a, jump to b (the target) at rate 2.0, never jump to c. Row b: already at target, all rates zero. Row c: jump to b at rate 2.0, never jump to a. The conditional process moves only toward the data.

Why does the rate increase as κ → 1? At early times (κ small), 1−κ is close to 1, so the rate κ̇/(1−κ) is moderate. As t increases and κ approaches 1, the rate diverges — the process becomes increasingly urgent about reaching the target. This makes sense: near the end, any remaining incorrect tokens must be fixed quickly.

The Discrete Marginalization Trick

Just as in continuous flow matching (Theorem 12), the marginal rate matrix is a weighted average of conditional rate matrices:

Qt(y|x) = ∑z∈S Qzt(y|x) · p1|t(z|x)

where p1|t(z|x) = pt(x|z) pdata(z) / pt(x) is the posterior probability that the clean data is z, given the noisy observation x at time t.

For the factorized mixture path, the marginal rate matrix factorizes beautifully (Theorem 38):

Qt(vi, j | x) = (κ̇t / (1 − κt)) · (p1|t(zj = vi | x) − δxj(vi))
This is just a classifier! The marginal rate matrix is entirely determined by p1|t(zj = vi | x): the posterior probability that position j holds token vi, given the noisy sequence x at time t. Training a CTMC model reduces to training a per-position token classifier.

Proof of the Discrete Marginalization Trick

The proof mirrors the continuous case (Theorem 12) but uses the KFE instead of the continuity equation. We need to show that the marginal rate matrix Qt(y|x) = ∑z Qzt(y|x) p1|t(z|x) satisfies the KFE for the marginal probability path pt(x) = ∑z pt(x|z) pdata(z).

Step 1: Differentiate the marginal path:

(d/dt) pt(x) = ∑z (d/dt) pt(x|z) · pdata(z)

Step 2: Apply the conditional KFE (Qzt generates pt(·|z)):

= ∑zy Qzt(x|y) pt(y|z) pdata(z)

Step 3: Multiply and divide by pt(y):

= ∑y pt(y) [∑z Qzt(x|y) · pt(y|z)pdata(z)/pt(y)]

Step 4: Recognize the marginal rate matrix:

= ∑y Qt(x|y) · pt(y)

This is exactly the KFE for Qt and pt. By the sufficiency of the KFE (Proposition 2), the CTMC with rate matrix Qt generates the marginal probability path. Since p1 = pdata, the model generates from the data distribution at t=1. □

The pattern is identical to continuous flow matching. Compare: (1) define conditional dynamics (conditional vector field / conditional rate matrix), (2) show that the marginal dynamics (posterior-weighted average) generates the marginal probability path, (3) train the neural network to approximate the marginal dynamics. The proofs use KFE in place of the continuity equation, and posterior-weighted sums in place of posterior-weighted integrals.

Deriving the Specific Form for the Factorized Mixture Path

Let's work out the marginal rate matrix for the factorized mixture path in detail. Starting from Theorem 38:

Qt(vi, j | x) = (κ̇t / (1 − κt)) · (p1|t(zj = vi | x) − δxj(vi))

Step 1: Start with the conditional rate matrix for position j (Example 37):

Qzt(vi, j | xj) = (κ̇t/(1−κt)) · (δzj(vi) − δxj(vi))

Step 2: Apply the marginalization trick Qt(vi,j|x) = ∑z Qzt(vi,j|x) p1|t(z|x):

Qt(vi, j | x) = (κ̇t/(1−κt)) ∑zzj(vi) − δxj(vi)) p1|t(z|x)

Step 3: The δxj(vi) term doesn't depend on z, so ∑z δxj(vi) p1|t(z|x) = δxj(vi).

Step 4: For the first term, ∑z δzj(vi) p1|t(z|x) = p1|t(zj=vi|x) by marginalization over all positions except j.

Step 5: Combining: Qt(vi, j | x) = (κ̇t/(1−κt)) (p1|t(zj=vi|x) − δxj(vi)). □

The result is remarkable in its simplicity: the entire rate matrix reduces to the posterior token probabilities p1|t(zj=vi|x), which are exactly what a standard classifier network outputs (logits → softmax → probabilities).

Verifying the Rate Matrix Conditions

Let's verify that the marginal rate matrix Qt(v, j|x) actually satisfies the rate matrix conditions:

Condition 1: Non-negative off-diagonal (v ≠ xj).

Qt(v, j|x) = (κ̇/(1−κ)) · (p1|t(zj=v|x) − 0) = (κ̇/(1−κ)) · p1|t(zj=v|x) ≥ 0 ✓

Since κ̇ ≥ 0, (1−κ) > 0 for t < 1, and probabilities are non-negative.

Condition 2: Diagonal equals negative sum of off-diagonal (v = xj).

Qt(xj, j|x) = (κ̇/(1−κ)) · (p1|t(zj=xj|x) − 1)
v≠xj Qt(v, j|x) = (κ̇/(1−κ)) ∑v≠xj p1|t(zj=v|x) = (κ̇/(1−κ)) (1 − p1|t(zj=xj|x))
Qt(xj, j|x) = −∑v≠xj Qt(v, j|x) ✓

Both conditions verified. The marginal rate matrix is a valid rate matrix for all t ∈ [0,1).

Numerical verification. Suppose at position j, xj = a, and the model predicts p1|t(zj=v|x) = {a: 0.1, b: 0.7, c: 0.2} with κ̇/(1−κ) = 3.0. Then:

Q(a, j|x) = 3.0 · (0.1 − 1) = 3.0 · (−0.9) = −2.7   (diagonal: rate of leaving a)
Q(b, j|x) = 3.0 · (0.7 − 0) = 2.1   (rate of jumping to b)
Q(c, j|x) = 3.0 · (0.2 − 0) = 0.6   (rate of jumping to c)

Check: Q(a) + Q(b) + Q(c) = −2.7 + 2.1 + 0.6 = 0 ✓ (rows sum to zero).

The model predicts b is the most likely clean token (probability 0.7), so the rate of jumping to b (2.1) is highest. This is exactly what we want: the CTMC jumps preferentially toward the model's best guess of the correct token.

What happens in an Euler step? With step size h=0.01:

P(stay at a) = 1 − h·(2.1+0.6) = 1 − 0.027 = 0.973
P(jump to b) = h·2.1 = 0.021
P(jump to c) = h·0.6 = 0.006

At each step, there's a 2.1% chance of jumping to the predicted clean token b, a 0.6% chance of jumping to c, and a 97.3% chance of staying at a. Over 100 steps (h=0.01 each), the cumulative probability of eventually jumping to b is approximately 1−0.973100 ≈ 93%. The process naturally converges to the most likely clean token.

Singularity at t=1. At t=1, κ1=1 so (1−κ) = 0 and the rate matrix has a division-by-zero singularity. In practice, we stop simulation slightly before t=1 (e.g., at t=0.999). By that point, essentially all tokens have been unmasked to their final values.
In the conditional rate matrix for the factorized mixture path, if a token xj already equals the data token zj, what happens?

Chapter 7: Discrete Flow Matching

We now have everything we need to train a discrete generative model. The rate matrix is determined by the posterior probabilities p1|t(zj = v | x). We parameterize these with a neural network and train with cross-entropy loss.

The Discrete Flow Matching (DFM) loss is:

LDFM(θ) = Ez∼pdata, t∼Unif[0,1], x∼pt(·|z) [ ∑j=1d −log pθ1|t(zj | x) ]

This is exactly a cross-entropy classification loss. For each position j, predict which token zj was the original clean token, given the noisy sequence x at time t. Sum over all d positions.

The remarkable reduction. Just as continuous flow matching reduces to regression (MSE loss), discrete flow matching reduces to classification (cross-entropy loss). The network is a standard sequence-to-sequence model (transformer) that takes a corrupted sequence x and outputs per-position logits over the vocabulary V. Training is just multi-position token classification.

The complete training algorithm:

python
# Discrete Flow Matching Training
for z in dataloader:                # z: (B, d) token sequences
    t = torch.rand(B)               # random time per sample
    kappa = scheduler(t)             # kappa_t in [0,1]

    # Sample noisy x from factorized mixture path
    mask = torch.bernoulli(kappa.unsqueeze(1).expand(B, d))  # (B,d)
    noise = torch.randint(0, V, (B, d))  # random tokens
    x = mask * z + (1 - mask) * noise     # keep or corrupt

    # Predict clean tokens from noisy sequence
    logits = model(x, t)             # (B, d, V) logits per position

    # Cross-entropy loss per position
    loss = F.cross_entropy(
        logits.view(-1, V), z.view(-1), reduction='mean'
    )
    loss.backward()
    optimizer.step()

Let's break down why this is so elegant. In continuous flow matching, the training target is the conditional vector field utargett(x|z) = α̇tz + β̇tε — a simple linear function that tells us "which direction to go." Here, the training target is the posterior probability p1|t(zj|x) — a simple classification target that tells us "which token should be here." Both are easy to compute from the training data, and both lead to simulation-free training.

Correspondence table.
Continuous: target = velocity utarget(x|z) ∈ Rd, loss = MSE regression
Discrete: target = token identity zj ∈ V, loss = cross-entropy classification
Both are simulation-free (no ODE/CTMC simulation needed during training), and both are conditioned on the noisy observation x at time t.

Inference uses Algorithm 7: start from pure noise (or all [MASK] tokens), simulate the factorized CTMC for n steps using the Euler approximation, outputting the final sequence X1.

python
# Discrete Flow Matching Inference
x = torch.randint(0, V, (B, d))  # X_0 ~ p_init
h = 1.0 / n_steps
for i in range(n_steps):
    t = i * h
    logits = model(x, t)         # (B, d, V)
    probs = F.softmax(logits, dim=-1)  # p_theta(v|x,t)
    # Per-position Euler step
    kappa_dot = scheduler_deriv(t)
    rate_scale = kappa_dot / (1 - scheduler(t) + 1e-8)
    for j in range(d):  # parallel in practice
        jump_probs = h * rate_scale * probs[:, j, :]
        jump_probs[:, x[:, j]] += 1 - jump_probs.sum(dim=-1, keepdim=True)
        x[:, j] = torch.multinomial(jump_probs, 1).squeeze()
The Discrete Flow Matching loss reduces to which standard machine learning task?

Chapter 8: MDLM

A particularly elegant instantiation of discrete flow matching is the Masked Diffusion Language Model (MDLM). The idea: extend the vocabulary with a special [MASK] token, and use it as the initial noise state. The generation process starts from an all-masked sequence and progressively unmasks tokens.

Formally, we set:

V' = {v1, ..., vV, [MASK]}
pinit = δ[MASK]d

The initial state is always the fully masked sequence [MASK][MASK]...[MASK]. The factorized mixture path then progressively reveals tokens: at time t, each position has probability κt of showing the true data token and probability 1−κt of being [MASK].

MDLM looks like BERT — but is generative. BERT masks ~15% of tokens and predicts them. MDLM uses a continuous masking schedule from 0% to 100%. At inference, it starts from 100% masked and progressively unmasks. The key difference: BERT is a discriminative model (predicts missing tokens from a mostly-complete sequence), while MDLM is a generative model with a rigorous probabilistic framework.

The generation process looks like this:

TimeSequenceStatus
t = 0[MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]Fully masked
t = 0.25[MASK] [MASK] [MASK] on [MASK] [MASK] [MASK]Some tokens revealed
t = 0.75[MASK] cat [MASK] on the mat [MASK]Most tokens revealed
t = 1The cat sat on the mat .Fully unmasked = generated text
MDLM Progressive Unmasking

Watch a sequence progressively unmask from pure [MASK] to complete text. Press Play to animate. The progress bar shows time t from 0 to 1.

Why is this different from BERT? BERT masks a fixed fraction (~15%) of tokens and trains on predicting them. MDLM uses a continuous schedule: the masking fraction varies from 0% (κ=1, all data) to 100% (κ=0, all masked) during training. At each training step, a random time t is sampled, giving a random masking level κt. This continuous schedule is what makes MDLM a rigorous generative model — it learns the rate matrix at all noise levels, not just one.

The inference procedure for MDLM:

1. Initialize
X0 = [MASK][MASK]...[MASK] (d tokens, all masked)
2. For each step
Compute logits = fθ(Xt, t) for all d positions
3. Sample updates
For each masked position j, sample a token from softmax(logitsj) with Euler probabilities
4. Unmask
Replace some [MASK] tokens with sampled tokens
↻ repeat n steps
5. Output
X1 = fully unmasked sequence

State-of-the-art. Current discrete diffusion models like LLaDA 2.0 (2025) use this recipe with transformer backbones trained on web-scale text data, achieving competitive performance with autoregressive language models. The advantages of the diffusion approach include parallel generation (all positions can be updated simultaneously) and natural handling of infilling tasks (fill in the middle of a sequence, not just left-to-right).

Advantages of discrete diffusion over autoregressive LLMs:

FeatureAutoregressive (GPT)Discrete Diffusion (MDLM)
Generation orderLeft-to-right onlyAny order (parallel)
InfillingRequires special trainingNatural (mask the middle)
RevisionCannot change earlier tokensCan revise any position at any step
Speed (sequential)d steps (one per token)n steps (tunable, often n ≪ d)
TrainingCausal maskingRandom masking at varying levels
MaturityHighly optimized (2020–2026)Rapidly improving (2024–2026)
How does MDLM differ from BERT's masked language modeling?

Chapter 9: Connections

This chapter extended flow matching from continuous (images) to discrete (text) data. The principles transferred remarkably cleanly:

ConceptContinuous (Rd)Discrete (Vd)
DynamicsODE/SDECTMC
GeneratorVector field ut(x)Rate matrix Qt(y|x)
Conditional pathGaussian: pt(x|z) = N(αz, β2I)Mixture: (1−κ)pinit + κδz
Marginalization trickut(x) = E[ut(x|z) | x]Qt(y|x) = E[Qz(y|x) | x]
Training lossMSE (regression)Cross-entropy (classification)
Network architectureDiT / U-NetTransformer
SimulationEuler ODE solverEuler CTMC sampler
Generator Matching. The fact that flow matching principles transfer so cleanly to discrete spaces is not a coincidence. The Generator Matching framework (Holderrieth et al., 2024) unifies continuous and discrete generative models under a single umbrella. A "generator" is the abstract concept that subsumes both vector fields and rate matrices. The framework extends to any state space: smooth manifolds, mixed continuous-discrete spaces, and beyond.

This concludes the book. We have traveled from basic probability (Chapter 1) through flow matching (Chapter 3), score functions (Chapter 4), guidance (Chapter 5), large-scale architectures (Chapter 6), and now discrete diffusion (Chapter 7). The complete pipeline — from mathematical foundations to production systems generating images, videos, and text — rests on a single idea: define a path from noise to data, derive the conditional dynamics, and train a neural network to approximate the marginal dynamics.

Generator Matching. The fact that flow matching principles transfer so cleanly to discrete spaces is not a coincidence. The Generator Matching framework (Holderrieth et al., 2024) unifies continuous and discrete generative models under a single umbrella. A "generator" is the abstract concept that subsumes both vector fields (for ODEs), drift+diffusion coefficients (for SDEs), and rate matrices (for CTMCs). The framework extends to any state space where Markov processes can be defined: smooth manifolds (for protein geometry), mixed continuous-discrete spaces (for joint text+image generation), and jump processes (for variable-length sequences).

Open research directions:

Scaling discrete diffusion to match autoregressive LLMs on benchmarks (LLaDA 2.0 with 100B parameters is a promising start).

Multimodal generation: joint text-and-image generation using mixed continuous-discrete state spaces — generate text and images in a single model.

Structured discrete data: protein design (amino acid sequences), drug discovery (molecular graphs), code generation.

Inference speed: reducing the number of sampling steps while maintaining quality. Autoregressive models need d steps for d tokens; can discrete diffusion do better?

Consistency models: learning to denoise in a single step, distilling the multi-step CTMC sampling into one forward pass.

What We Built in This Course

Looking back across all 7 chapters, the complete generative modeling pipeline rests on three pillars:

Pillar 1: Probability Paths
Define an interpolation from noise to data (Gaussian for images, factorized mixture for text)
Pillar 2: Conditional Dynamics
Derive the conditional vector field / rate matrix that follows the path (simple, known in closed form)
Pillar 3: Marginalization + Training
Show that the marginal dynamics (posterior-weighted average) generates pdata. Train a neural network to approximate it. Loss = regression (continuous) or classification (discrete).

Every model we studied — flow matching, DDPM, score matching, Stable Diffusion 3, Movie Gen, MDLM — is an instantiation of this template. The differences are in the choice of probability path, the architecture, and the engineering details. The mathematical framework is universal.

Practical Considerations for Discrete Diffusion

Number of sampling steps. Autoregressive models need exactly d steps to generate a sequence of length d (one token per step). Discrete diffusion models can use any number of steps n. With n < d, multiple tokens can be updated per step, enabling faster generation. Empirically, n ≈ 100–1000 steps gives good results for text, while some methods achieve decent quality with as few as 10–50 steps using improved sampling strategies.

Guidance for text. CFG extends to discrete diffusion just as it extends to continuous models. We train with label dropping (replacing the prompt with ∅) and combine conditional and unconditional rate matrices at inference:

t(v, j | x, y) = (1−w) Qθt(v, j | x, ∅) + w · Qθt(v, j | x, y)

Temperature and top-p sampling. At each step, instead of sampling from the full Euler transition distribution, we can apply temperature scaling or nucleus (top-p) sampling to the token probabilities. This provides additional control over the diversity/quality tradeoff, complementing guidance.

Training efficiency. The cross-entropy loss for discrete flow matching is computed over all d positions simultaneously, making efficient use of GPU parallelism. Unlike autoregressive models (which need causal masking), discrete diffusion models process the full sequence with bidirectional attention — every position can attend to every other position.

Choice of scheduler κt. The scheduler controls the rate of unmasking. Common choices include:

SchedulerκtBehavior
LineartUniform unmasking rate
Cosine(1 − cos(πt/2))Slow start, fast middle, slow end
Square root√tFast early unmasking, slow refinement
Sigmoidsigmoid(a(t−0.5))S-curve, tunable sharpness

The cosine schedule is most common, as it spends more time at intermediate noise levels where the model has the most to learn, rather than wasting time at extremely noisy or nearly clean states.

Applications beyond text. Discrete diffusion models have been applied to several domains beyond language:

Protein design: amino acid sequences (V=20 amino acids, d=100–1000 residues). Models like EvoDiff generate novel protein sequences that fold into stable structures.

DNA/RNA design: nucleotide sequences (V=4 bases, d=thousands). Discrete diffusion naturally handles the discrete nature of genetic sequences.

Code generation: program tokens with bidirectional attention, enabling infilling and completion tasks that autoregressive models struggle with.

Molecular design: discrete representations of molecules (SMILES strings or token-based molecular graphs).

Comparison: LLaDA 2.0 vs GPT

The state-of-the-art discrete diffusion language model is LLaDA 2.0 (2025), scaling to 100 billion parameters. Here is how it compares to autoregressive models:

python
# Autoregressive generation (GPT-style)
for i in range(d):  # d sequential steps
    logits = gpt(x[:i])          # causal attention, O(i^2)
    x[i] = sample(logits[-1])   # next-token prediction
    # Cannot go back and revise x[0..i-1]!

# Discrete diffusion generation (MDLM-style)
x = [MASK] * d  # all masked
for step in range(n):  # n steps (n can be << d)
    logits = mdlm(x, t)         # bidirectional attention, O(d^2)
    # Update ALL positions in parallel
    for j in range(d):  # parallel!
        if should_update(j, t):
            x[j] = sample(logits[j])
    # Can revise any position at any step!

The discrete diffusion approach trades sequential token generation for iterative refinement. Each step processes the full sequence with bidirectional attention, allowing the model to consider the full context (both left and right) when making predictions. This is particularly valuable for tasks like infilling (filling in the middle of a sequence), translation, and editing.

The key architectural difference. Autoregressive models use causal attention masks (each token can only see previous tokens). Discrete diffusion models use bidirectional attention (each token can see all other tokens). This means discrete diffusion models can naturally leverage context from both directions when predicting any position, just like BERT does for understanding tasks — but now for generation.

Summary of Key Equations

For quick reference, all essential formulas from this chapter:

ConceptFormula
State spaceS = Vd, |S| = Vd
Rate matrix conditionsQ(y|x) ≥ 0 for y≠x; Q(x|x) = −∑y≠xQ(y|x)
Rate ↔ transition(d/dh)pt+h|t(y|x)|h=0 = Qt(y|x)
Euler simulationp(y|x) ≈ δy=x + h·Q(y|x)
KFE(d/dt)pt(x) = ∑yQt(x|y)pt(y)
Factorized mixture pathpt(x|z) = ∏j[(1−κ)pinit(xj) + κδzj(xj)]
Conditional rate matrixQz(v,j|x) = (κ̇/(1−κ))(δzj(v) − δxj(v))
Marginal rate matrixQ(v,j|x) = (κ̇/(1−κ))(p1|t(zj=v|x) − δxj(v))
DFM lossL = E[∑j −log pθ1|t(zj|x)]

Verification: The DFM Loss is Tractable

Let's verify that we can compute the DFM loss efficiently. For each training example:

1. Sample data z ∼ pdata: A sequence of d tokens. Cost: O(1) (read from dataset).

2. Sample time t ∼ Unif[0,1]: One random number. Cost: O(1).

3. Sample noisy x ∼ pt(·|z): For each position j, flip a Bernoulli(κt) coin: heads = keep zj, tails = sample from pinit. Cost: O(d).

4. Forward pass: Run the transformer on (x, t) to get logits ∈ Rd×V. Cost: O(d2 · dmodel + d · dmodel · V) — standard transformer cost.

5. Compute loss: Cross-entropy between logits and z at each position. Cost: O(d · V).

The total per-example cost is dominated by the transformer forward pass, which is the same as training any language model. No CTMC simulation is needed during training — only at inference. This "simulation-free training" property is inherited directly from flow matching.

The deepest parallel between continuous and discrete. In continuous flow matching, we never simulate an ODE during training — we compute a loss from a single noisy point x and its target velocity. In discrete flow matching, we never simulate a CTMC during training — we compute a loss from a single noisy sequence x and its target token labels. Both achieve simulation-free training through the marginalization trick, which converts an intractable marginal objective into a tractable conditional one.

Worked Example: DFM Training on a Short Sequence

Let's trace through one training step of discrete flow matching to make everything concrete.

Setup: Vocabulary V = {a, b, c, d, e, [MASK]}, sequence length d=4, scheduler κt = t (linear).

Step 1: Sample data z = (b, a, d, c) from the dataset.

Step 2: Sample time t = 0.6. So κ0.6 = 0.6.

Step 3: Sample noisy x from pt(·|z).

For each position j, flip a coin with P(heads) = 0.6:

• Position 1: heads → x1 = z1 = b (kept)

• Position 2: tails → x2 = [MASK] (masked)

• Position 3: heads → x3 = z3 = d (kept)

• Position 4: tails → x4 = [MASK] (masked)

Noisy sequence: x = (b, [MASK], d, [MASK])

Step 4: Forward pass. The transformer receives x = (b, [MASK], d, [MASK]) and t=0.6. It outputs 4 probability distributions, one per position:

PosP(a)P(b)P(c)P(d)P(e)Target zj
10.050.800.050.050.05b
20.600.100.100.100.10a
30.050.050.050.800.05d
40.100.100.500.100.20c

Step 5: Compute loss. Cross-entropy at each position:

L = −log(0.80) − log(0.60) − log(0.80) − log(0.50) = 0.22 + 0.51 + 0.22 + 0.69 = 1.64

The loss is highest at position 4 (where the model is least confident about the correct token c). Gradient descent will push the model to increase its confidence at position 4. Over many training steps, the model learns to predict the clean token at each position from the noisy context.

That's it — one training step of discrete flow matching. The beauty is in its simplicity: corrupt a sequence, predict the originals, compute cross-entropy. No ODEs, no SDEs, no simulation. Just classification.

Worked Example: CTMC Inference Step

Now let's trace one inference step. We have a partially unmasked sequence x = (b, [MASK], d, [MASK]) at time t = 0.6 and want to advance to t + h = 0.62 (step size h = 0.02).

Step 1: Get the model's predicted probabilities (same as training table above).

Step 2: Compute the rate matrix entries for each position. With κ̇t = 1 (linear schedule, derivative = 1) and 1−κt = 0.4:

Rate scale = κ̇/(1−κ) = 1/0.4 = 2.5

Step 3: Compute jump probabilities for position 2 (currently [MASK]):

P(jump to a) = h · 2.5 · 0.60 = 0.02 · 1.50 = 0.030
P(jump to b) = h · 2.5 · 0.10 = 0.02 · 0.25 = 0.005
P(jump to c) = h · 2.5 · 0.10 = 0.02 · 0.25 = 0.005
P(jump to d) = h · 2.5 · 0.10 = 0.02 · 0.25 = 0.005
P(jump to e) = h · 2.5 · 0.10 = 0.02 · 0.25 = 0.005
P(stay [MASK]) = 1 − (0.030 + 4×0.005) = 1 − 0.050 = 0.950

So at this step, position 2 has a 95% chance of staying [MASK] and a 3% chance of jumping to the correct token "a". Over many steps, the cumulative probability of unmasking increases, and by t≈1, all positions will be unmasked.

Similarly, for position 4 (currently [MASK], model predicts P(c)=0.50):

P(jump to a) = 0.02 · 2.5 · 0.10 = 0.005
P(jump to b) = 0.02 · 2.5 · 0.10 = 0.005
P(jump to c) = 0.02 · 2.5 · 0.50 = 0.025
P(jump to d) = 0.02 · 2.5 · 0.10 = 0.005
P(jump to e) = 0.02 · 2.5 · 0.20 = 0.010
P(stay [MASK]) = 1 − 0.050 = 0.950

Position 4 also stays masked with 95% probability. But when it does unmask, "c" (the correct token) is the most likely destination at 2.5%, followed by "e" at 1%. Over many steps, the tokens gradually unmask, with the model's highest-confidence predictions unmasking first.

For positions already at the correct data token (positions 1 and 3 in our example), the rate matrix produces:

Q(v, j|x) = (κ̇/(1−κ))(p1|t(zj=v|x) − δxj(v))

If the model is confident that xj is already correct (p1|t(zj=xj|x) ≈ 1), then Q(xj, j|x) ≈ (κ̇/(1−κ))(1 − 1) = 0 and Q(v, j|x) ≈ (κ̇/(1−κ))(0 − 0) = 0 for v ≠ xj. All rates are near zero — the token stays put. This is exactly the behavior we want: correct tokens should not be disturbed.

Step 4: Sample from this categorical distribution for each position independently. Most positions stay unchanged; occasionally one unmasks. This is the parallel Euler approximation from Algorithm 7.

Why parallel sampling works. The factorized rate matrix assumes each position can be updated independently per step. In reality, positions are correlated (the correct token at position 2 depends on what's at position 3). The per-step independence is only an approximation valid for small h. The error from simultaneous multi-position updates is O(h2), which vanishes as the step size shrinks. In practice, 100–1000 steps gives accurate results.

Complete DFM Inference Algorithm

python
# Complete Discrete Flow Matching inference
def generate(model, seq_len, vocab_size, n_steps=256):
    # Initialize: all [MASK] tokens
    mask_id = vocab_size  # [MASK] = last token in vocab
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)

    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor([i * h])
        kappa = scheduler(t)
        kappa_dot = scheduler_deriv(t)
        rate_scale = kappa_dot / (1 - kappa + 1e-8)

        # Forward pass: predict clean token probabilities
        logits = model(x, t)                  # (1, d, V+1)
        probs = F.softmax(logits[:, :, :vocab_size], dim=-1)  # exclude [MASK] from targets

        # Per-position Euler step
        for j in range(seq_len):  # parallel in practice
            if x[0, j] == mask_id:  # only update masked positions
                jump_probs = h * rate_scale * probs[0, j]
                stay_prob = 1 - jump_probs.sum()
                all_probs = torch.cat([jump_probs, stay_prob.unsqueeze(0)])
                sampled = torch.multinomial(all_probs, 1)
                if sampled < vocab_size:
                    x[0, j] = sampled  # unmask!

    return x  # fully generated sequence

"Creating noise from data is easy; creating data from noise is generative modeling." — Holderrieth & Erives, 2026

In discrete flow matching, training reduces to what fundamental task?