OneFlow (2025) — Veanors

Chapter 0: The Sequential Bottleneck

Earlier unified multimodal generators such as Chameleon and Transfusion share a constraint: they produce text and images one after the other. Chameleon emits text tokens and image tokens autoregressively, one token at a time. Transfusion generates text autoregressively, then denoises each image with diffusion. In both cases an image cannot start until the text before it is complete.

This creates a bottleneck for interleaved generation — documents where text and images alternate naturally (like a recipe with step-by-step photos, or a scientific report with inline figures).

OneFlow's breakthrough: Generate text and images concurrently using flow matching. Instead of sequentially producing tokens, OneFlow defines a continuous flow from noise to the final mixed-modal output. All positions (text and image) evolve simultaneously through this flow. The result: faster generation and more coherent interleaved outputs because text and images co-evolve.

Think of the difference between building a house room by room (sequential) vs having all construction crews work simultaneously on different rooms (concurrent). The concurrent approach is faster and ensures the rooms fit together coherently.

Sequential vs Concurrent Generation

Compare sequential generation (top) where text comes first, then images, vs OneFlow's concurrent approach (bottom) where both evolve simultaneously.

What is the fundamental limitation of sequential multimodal generation that OneFlow addresses?

Text and images must be generated one after the other, creating a bottleneck for interleaved documents; OneFlow generates both concurrently using flow matching, enabling faster and more coherent mixed-modal output
The model is too slow
Images are always low quality

Chapter 1: Flow Matching Primer

Flow matching is a generative modeling framework that learns to transport samples from a simple distribution (noise) to the data distribution; the variant used here moves them along straight-line paths. Training is a plain regression on velocities, which makes it simpler to set up and often more stable than diffusion.

The core idea

Given a noise sample x₀ ~ N(0,I) and a data sample x₁, define a straight-line path between them:

x_t = (1 − t) · x₀ + t · x₁, t ∈ [0, 1]

The model learns a velocity field v_θ(x_t, t) that predicts the direction from x₀ to x₁ at any point t:

v_θ(x_t, t) ≈ x₁ − x₀

The training loss is simply:

L = E_{t, x₀, x₁} ||v_θ(x_t, t) − (x₁ − x₀)||²
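
To make the path and its regression target concrete, here is a minimal pure-Python sketch on a single 2-D point. The helper names (interpolate, target_velocity, fm_loss) and the toy data value are illustrative assumptions, not part of OneFlow.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field: x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(v_pred, x0, x1):
    """Squared error between a predicted velocity and the target x1 - x0."""
    v_tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_tgt))

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]   # noise sample, x0 ~ N(0, I)
x1 = [2.0, -1.0]                                   # toy "data" sample (made up)
t = rng.random()                                   # t ~ U[0, 1)
x_t = interpolate(x0, x1, t)                       # the point the model would see

# A perfect predictor outputs x1 - x0 and gets zero loss;
# a predictor that always outputs zeros pays ||x1 - x0||^2.
perfect_loss = fm_loss(target_velocity(x0, x1), x0, x1)
zero_pred_loss = fm_loss([0.0, 0.0], x0, x1)
print(perfect_loss, zero_pred_loss)
```

Note that the target does not depend on t: along a straight line the velocity is the same at every point, so the model only has to learn which direction to move from wherever x_t currently sits.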

Why flow matching for multimodal? Diffusion is defined by a specific process of adding and removing Gaussian noise. Flow matching only needs a source sample, a target sample, and a path between them, and here that path is a straight line. The same recipe can therefore be applied to continuous images, to discrete text (via continuous relaxations), or to a mix of both. OneFlow exploits this generality to handle text and images with the same framework.

Flow Matching Visualization

Watch samples flow from noise (t=0) to data (t=1) along straight-line paths. The model learns the velocity field that drives this transport.

What does flow matching learn, and how does it differ from diffusion?

Flow matching learns a velocity field that transports noise to data along straight-line paths, while diffusion learns to reverse a noise-adding process; flow matching is simpler, more general (applies to any data type), and often more stable
Flow matching uses more parameters
Flow matching is identical to diffusion

Chapter 2: Edit Flows

OneFlow's core innovation is the edit flow — a flow that doesn't just go from noise to data, but from a corrupted version of the output to the correct output. This enables concurrent editing of all modalities simultaneously.

Standard flow vs edit flow

Aspect	Standard Flow	Edit Flow
Source	Pure noise	Corrupted version of target
Target	Clean data	Clean data
Corruption	Gaussian noise only	Mask tokens + add noise to images
For text	N/A (text is discrete)	Mask random tokens → predict unmasked
For images	Noise → denoise	Noisy patches → clean patches

The key insight: by defining corruption differently per modality (masking for text, noise for images), OneFlow can generate both modalities with the same flow matching objective. Text tokens are "demasked" while image patches are "denoised" — concurrently, in the same forward pass.

Edit flow: x_t = (1 − t) · corrupt(x₁) + t · x₁

Why "edit" flow? Instead of creating something from noise, the model edits a corrupted version into the correct version. At t=0, you have a heavily corrupted document (many masked tokens, very noisy images). At t=1, you have the clean output. The model learns to progressively fix the corruption — unmasking text and denoising images simultaneously.
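
The per-modality corruption can be sketched in a few lines of pure Python. This is one simple reading of the idea, under stated assumptions: each text token is masked independently with probability 1 − t, and each image value is blended with Gaussian noise using the same t. The function names and the toy inputs are hypothetical, and the actual corruption schedule may differ.

```python
import random

MASK = "[MASK]"

def corrupt_text(tokens, t, rng):
    """Keep each token with probability t, otherwise replace it with [MASK]."""
    return [tok if rng.random() < t else MASK for tok in tokens]

def corrupt_image(latents, t, rng):
    """Blend each latent value with Gaussian noise: (1 - t) * noise + t * clean."""
    return [(1.0 - t) * rng.gauss(0.0, 1.0) + t * x for x in latents]

rng = random.Random(0)
text = ["whisk", "the", "eggs", "until", "fluffy"]
image = [0.5, -0.2, 0.9, 0.1]            # stand-in for VAE latent values

# t = 0: fully masked text and pure-noise image; t = 1: both clean.
for t in (0.0, 0.5, 1.0):
    print(t, corrupt_text(text, t, rng), corrupt_image(image, t, rng))
```

Both functions share the same time variable, which is what lets a single model see text and image positions at matching corruption levels.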

Edit Flow: Concurrent Text + Image Refinement

Drag the slider to see how text (masked tokens revealed) and images (noise removed) are refined concurrently through the edit flow.

How does OneFlow handle the fact that text is discrete and images are continuous?

By defining different corruption types per modality within the same edit flow (text is corrupted by masking tokens, images by adding noise), so the model "demasks" text and "denoises" images concurrently with the same velocity field objective
By converting text to continuous representations
By using separate models for each modality

Chapter 3: Architecture

OneFlow uses a transformer backbone similar to Transfusion but with modifications for flow matching and concurrent generation.

Key architectural choices

Component	Design	Why
Text embedding	Standard token embedding + continuous relaxation	Enables gradient flow through discrete text
Image embedding	VAE latent patches + linear projection	Continuous representation for flow matching
Attention	Full bidirectional (not causal)	All positions generated concurrently
Time conditioning	Sinusoidal time embedding added to all positions	Model knows where in the flow it is
Output	Velocity prediction for all positions	Single unified objective

Bidirectional attention is critical. Unlike Transfusion (which uses causal masking for text), OneFlow uses full bidirectional attention for ALL positions. This is possible because all tokens are generated concurrently — there's no "left-to-right" ordering during generation. Every position can attend to every other position, enabling maximum information flow.
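
The difference between the two masking schemes is easy to see by counting which (query, key) pairs each one allows. The sketch below uses plain Python lists; the function names are illustrative and not taken from any library.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """Bidirectional attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def visible_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)

n = 6
causal_pairs = visible_pairs(causal_mask(n))   # n * (n + 1) / 2 = 21
full_pairs = visible_pairs(full_mask(n))       # n * n = 36
print(causal_pairs, full_pairs)
```

Under a causal mask the first position sees only itself, so early text can never react to a later image. With the full mask every position sees all n positions at every refinement step.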

python
# OneFlow forward pass
def forward(self, x_t, t, modality_mask=None):
    # x_t: [B, L, D], corrupted mixed sequence at time t
    # t: [B], flow time (0=corrupted, 1=clean)
    # modality_mask: [B, L], text/image position flags (not used in this simplified pass)

    # Add time conditioning
    time_emb = self.time_embed(t)              # [B, D]
    x_t = x_t + time_emb.unsqueeze(1)         # broadcast to all positions

    # Bidirectional transformer (no causal mask!)
    for layer in self.layers:
        x_t = layer(x_t, mask=None)           # full attention

    # Predict velocity for all positions
    velocity = self.velocity_head(x_t)         # [B, L, D]
    return velocity  # used to move x_t toward x_1
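
The time_embed call above is not spelled out. A common choice for a sinusoidal embedding looks like the sketch below, written in pure Python for a single scalar t. The frequency schedule and the max_period default follow the usual Transformer-style recipe and are assumptions here, not details confirmed for OneFlow.

```python
import math

def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map a scalar flow time t to a vector of `dim` sin/cos features."""
    assert dim % 2 == 0, "dim must be even"
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half: sines, second half: cosines.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_start = sinusoidal_time_embedding(0.0, 8)   # sines are 0, cosines are 1
emb_mid = sinusoidal_time_embedding(0.5, 8)
print(emb_start)
print(emb_mid)
```

Because the same vector is added to every position, text and image tokens are told the same flow time on each pass.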

OneFlow Architecture

OneFlow's transformer processes all modalities with full bidirectional attention and predicts velocity for concurrent generation.

Why does OneFlow use bidirectional attention instead of causal masking?

Because all positions are generated concurrently (not left-to-right), so there's no temporal ordering to enforce; every position benefits from attending to every other position for maximum information flow during the edit flow
Because bidirectional attention is faster
Because causal masking doesn't work with images

Chapter 4: Concurrent Generation

At inference time, OneFlow generates the entire mixed-modal output through an ODE (ordinary differential equation) solver. Starting from a fully corrupted sequence, it iteratively refines all positions:

Initialize (t=0)

All text positions: [MASK]. All image positions: random noise. The entire output is corrupted.

↓ ODE step (t → t + dt)

Predict Velocity

Forward pass: predict velocity v(x_t, t) for all positions simultaneously. Text tokens become less masked, images become less noisy.

↓ update x_t

Refine

x_{t+dt} = x_t + dt × v. Move all positions toward their clean versions.

↻ repeat for N steps until t=1

Output (t=1)

Clean mixed-modal document: coherent text + sharp images, generated concurrently.

Speed advantage: Because all positions evolve concurrently, the number of forward passes scales with the number of ODE steps (typically 20-50), NOT with the sequence length. Autoregressive models need L forward passes for L tokens. OneFlow generates an entire 2000-token interleaved document in ~30 forward passes.
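
The loop above is a fixed-step Euler solver, and its pass count can be checked with a toy example. In the sketch below the trained model is replaced by an oracle velocity that points exactly at a known target, so the names make_oracle and euler_generate and the toy sequences are assumptions for illustration only.

```python
def euler_generate(x_start, velocity_fn, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x_start)
    dt = 1.0 / n_steps
    passes = 0
    for k in range(n_steps):
        t = k * dt
        v = velocity_fn(x, t)          # one "forward pass" over ALL positions
        passes += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, passes

def make_oracle(target):
    """Stand-in for the trained model: exact velocity toward a known target."""
    def velocity_fn(x, t):
        # On a straight-line path, (x_1 - x_t) / (1 - t) equals x_1 - x_0.
        return [(yi - xi) / (1.0 - t) for xi, yi in zip(x, target)]
    return velocity_fn

for seq_len in (10, 2000):
    target = [float(i % 7) for i in range(seq_len)]   # toy "clean" sequence
    start = [0.0] * seq_len                           # toy "corrupted" start
    out, passes = euler_generate(start, make_oracle(target), n_steps=30)
    err = max(abs(a - b) for a, b in zip(out, target))
    print(seq_len, passes, err)
```

The pass count stays at 30 whether the sequence has 10 positions or 2000. Keep in mind that each pass processes the whole sequence with full attention, so one pass costs more than a single cached autoregressive step; fewer passes does not translate one-to-one into wall-clock speedup.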

Concurrent Generation Simulator

Watch text tokens get unmasked and image patches get denoised simultaneously through the ODE solver. All positions evolve together.

Why is OneFlow's concurrent generation faster than autoregressive approaches?

Because the number of forward passes scales with ODE steps (20-50), not sequence length; generating a 2000-token interleaved document takes ~30 passes instead of 2000, since all positions are refined simultaneously
Because the model is smaller
Because it uses GPU parallelism

Chapter 5: Training

OneFlow trains on interleaved text+image documents with the flow matching objective applied to both modalities simultaneously.

Training procedure

python
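# OneFlow training step (simplified sketch)
import torch
import torch.nn.functional as F

def training_step(model, clean_seq, modality_mask):
    # clean_seq: [B, L, D], mixed text + image embeddings (this is x_1)
    # modality_mask: [B, L], text/image position flags
    B = clean_seq.shape[0]

    # Sample one random time per example
    t = torch.rand(B, device=clean_seq.device)    # [B], uniform in [0, 1)

    # Create corrupted version x_0
    # (random_mask_tokens and apply_corruption are helpers not shown here)
    noise = torch.randn_like(clean_seq)           # for images
    mask_noise = random_mask_tokens(clean_seq)    # for text
    x_0 = apply_corruption(clean_seq, noise, mask_noise, modality_mask)

    # Interpolate: x_t = (1-t)*x_0 + t*x_1
    t_b = t.view(B, 1, 1)                         # broadcast over positions and features
    x_t = (1 - t_b) * x_0 + t_b * clean_seq

    # Predict velocity
    v_pred = model(x_t, t, modality_mask)

    # Target velocity: x_1 - x_0
    v_target = clean_seq - x_0

    # Loss: MSE between predicted and target velocity
    loss = F.mse_loss(v_pred, v_target)
    return loss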

Unified loss for both modalities. Unlike Transfusion (which uses separate losses for text and images), OneFlow uses a single velocity prediction loss for everything. The flow matching framework naturally handles both discrete (masked text) and continuous (noisy image) corruption within the same objective.
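
A single loss can still be broken down by modality for monitoring. The pure-Python sketch below, with made-up per-position velocities and hypothetical helper names, shows that the unified MSE is just the position-weighted average of a text part and an image part.

```python
def mse(pred, target):
    """Mean squared error over a flat list of values."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def split_by_modality(values, is_image):
    """Separate per-position values into (text, image) lists."""
    text = [v for v, img in zip(values, is_image) if not img]
    image = [v for v, img in zip(values, is_image) if img]
    return text, image

# Toy per-position velocities (one scalar per position for brevity).
v_pred   = [0.9, 1.1, 0.0, 0.5, -0.3, 0.2]
v_target = [1.0, 1.0, 0.0, 0.4, -0.1, 0.0]
is_image = [False, False, False, True, True, True]

total = mse(v_pred, v_target)
pred_txt, pred_img = split_by_modality(v_pred, is_image)
tgt_txt, tgt_img = split_by_modality(v_target, is_image)
text_mse, image_mse = mse(pred_txt, tgt_txt), mse(pred_img, tgt_img)

# The unified loss is the position-weighted average of the two parts.
n_txt, n_img = len(pred_txt), len(pred_img)
recombined = (n_txt * text_mse + n_img * image_mse) / (n_txt + n_img)
print(total, text_mse, image_mse, recombined)
```

One practical consequence: a document with many more image positions than text positions will have its loss dominated by the image term unless the positions are reweighted. Whether OneFlow applies any such weighting is not stated here.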

Training Loss Visualization

See how the unified flow matching loss captures both text demasking and image denoising in a single objective.

How does OneFlow's training objective differ from Transfusion's?

OneFlow uses a single velocity prediction loss for both text and images (unified flow matching), while Transfusion uses separate cross-entropy for text and MSE for images; OneFlow's unified objective naturally handles both modalities within the same framework
OneFlow uses a larger batch size
There is no difference

Chapter 6: Results & Showcase

OneFlow achieves competitive results on both text and image generation benchmarks while offering a fundamentally different generation paradigm: concurrent rather than sequential.

Model	Generation Mode	Image FID ↓	Interleaved Quality	Forward Passes
Chameleon	Sequential AR	~24	Good	~2000+
Transfusion	AR text + diffusion img	~6.8	Good	~500+
OneFlow	Concurrent flow	~7.5	Best	~30-50

OneFlow's advantages: While image FID is slightly behind Transfusion, OneFlow excels at interleaved generation quality (text-image coherence in documents with multiple images) and requires 10-60x fewer forward passes. For applications like generating illustrated articles or visual tutorials, OneFlow's concurrent approach produces more coherent output because text and images co-evolve.
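
The "10-60x" figure can be reproduced from the approximate forward-pass counts in the table. The sketch below treats the "+" entries as their stated lower bounds, which is an assumption; the helper name reduction_range is illustrative.

```python
# Approximate forward-pass counts from the table, as (low, high) ranges.
# Entries written with "+" in the table are taken at their stated lower bound.
passes = {
    "Chameleon": (2000, 2000),
    "Transfusion": (500, 500),
    "OneFlow": (30, 50),
}

def reduction_range(baseline, ours):
    """Smallest and largest ratio baseline / ours over the given ranges."""
    lo = baseline[0] / ours[1]   # fewest baseline passes vs most of ours
    hi = baseline[1] / ours[0]   # most baseline passes vs fewest of ours
    return lo, hi

vs_transfusion = reduction_range(passes["Transfusion"], passes["OneFlow"])
vs_chameleon = reduction_range(passes["Chameleon"], passes["OneFlow"])
print("vs Transfusion:", vs_transfusion)
print("vs Chameleon:", vs_chameleon)
```

With these counts the ratio runs from 10x (500 vs 50) up to roughly 67x (2000 vs 30), which matches the rounded 10-60x range quoted above.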

OneFlow vs Baselines

Compare generation speed and quality across different approaches.

Where does OneFlow shine compared to Transfusion?

In interleaved document generation (text-image coherence) and generation speed (30-50 vs 500+ forward passes), because concurrent generation allows text and images to co-evolve, producing more coherent mixed-modal outputs
In pure image generation quality
In text-only tasks

Chapter 7: Connections

OneFlow represents the latest evolution in multimodal generation: from separate models, to unified sequential models, to unified concurrent models.

Generation Paradigm	Models	Speed	Interleaved Quality
Sequential AR	Chameleon, DALL-E	Slow (L passes)	Moderate
AR + Diffusion	Transfusion, LMFusion	Medium	Good
Concurrent Flow	OneFlow	Fast (30-50 passes)	Best

Lesson 1: Concurrency beats sequentiality for interleaved content. When text and images must be coherent within a document, generating them simultaneously produces better results than generating them one at a time.

Lesson 2: Flow matching unifies modalities. The edit flow framework naturally handles both discrete text (via masking) and continuous images (via noise) with a single objective. This is more elegant than Transfusion's dual-objective approach.

Lesson 3: Speed and quality can be complementary. OneFlow is both faster AND better at interleaved generation. Concurrent generation isn't just an efficiency trick — it enables richer cross-modal interaction during the generation process itself.

Generation Paradigm Evolution

From sequential to concurrent: the evolution of multimodal generation.

What is OneFlow's most important conceptual contribution?

Demonstrating that concurrent mixed-modal generation via edit flows produces more coherent interleaved outputs at higher speed than sequential approaches, shifting the paradigm from "generate one modality, then the other" to "all modalities evolve together"
A new transformer architecture
Better image quality than DALL-E 3

OneFlow: Concurrent Mixed-Modal Generation