2025

OneFlow: Concurrent Mixed-Modal Generation

Concurrent Mixed-Modal and Interleaved Generation with Edit Flows — generate text and images simultaneously using flow matching, not sequentially.

Prerequisites: Flow matching + Transfusion concepts + Diffusion basics. That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Sequential Bottleneck

Earlier unified multimodal generators share a fundamental constraint: they produce text and images sequentially. Chameleon emits text and image tokens one at a time in a single autoregressive stream. Transfusion generates text autoregressively, then denoises each image with diffusion. In both cases, an image cannot start generating until the preceding text is complete.

This creates a bottleneck for interleaved generation — documents where text and images alternate naturally (like a recipe with step-by-step photos, or a scientific report with inline figures).

OneFlow's breakthrough: Generate text and images concurrently using flow matching. Instead of sequentially producing tokens, OneFlow defines a continuous flow from noise to the final mixed-modal output. All positions (text and image) evolve simultaneously through this flow. The result: faster generation and more coherent interleaved outputs because text and images co-evolve.

Think of the difference between building a house room by room (sequential) vs having all construction crews work simultaneously on different rooms (concurrent). The concurrent approach is faster and ensures the rooms fit together coherently.

Sequential vs Concurrent Generation

Compare sequential generation (top) where text comes first, then images, vs OneFlow's concurrent approach (bottom) where both evolve simultaneously.

What is the fundamental limitation of sequential multimodal generation that OneFlow addresses?

Chapter 1: Flow Matching Primer

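Flow matching is a generative modeling framework that learns a velocity field transporting samples from a simple distribution (noise) to the data distribution. With the linear interpolation path used here, each noise sample is connected to its data sample by a straight line. Training reduces to a plain regression problem, which is often simpler and more stable than diffusion training.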

The core idea

Given a noise sample x_0 ~ N(0, I) and a data sample x_1, define a straight-line path between them:

x_t = (1 − t) · x_0 + t · x_1,    t ∈ [0, 1]

The model learns a velocity field v_θ(x_t, t) that predicts the direction from x_0 to x_1 at any point t:

v_θ(x_t, t) ≈ x_1 − x_0

The training loss is simply:

L = E_{t, x_0, x_1} ‖v_θ(x_t, t) − (x_1 − x_0)‖^2

Why flow matching for multimodal? Diffusion is built around adding and removing Gaussian noise. Flow matching only needs a source sample, a target sample, and a path between them, so the objective is not tied to one data type. It applies to continuous images directly and to discrete text through a continuous relaxation. OneFlow exploits this generality to handle text and images with the same framework.

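As a minimal sketch of the path and regression target above, the snippet below works on 2-D points in plain Python. The function names are illustrative and not taken from OneFlow. It checks one property of the straight-line path: the velocity x_1 − x_0 is constant in t, so Euler steps driven by that exact velocity land on x_1.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field: x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(v_pred, x0, x1):
    """Squared error between a predicted velocity and the target x1 - x0."""
    v = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v))

def euler_integrate(x0, velocity_fn, n_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x0 = [rng.gauss(0, 1), rng.gauss(0, 1)]   # noise sample
x1 = [2.0, -1.0]                          # toy "data" sample
v = target_velocity(x0, x1)               # what a perfect model would predict
x_end = euler_integrate(x0, lambda x, t: v, n_steps=10)
```

A trained model only approximates this velocity, and it must do so without seeing x_1, which is why the loss averages over many (x_0, x_1, t) triples.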
Flow Matching Visualization

Watch samples flow from noise (t=0) to data (t=1) along straight-line paths. The model learns the velocity field that drives this transport.

What does flow matching learn, and how does it differ from diffusion?

Chapter 2: Edit Flows

OneFlow's core innovation is the edit flow — a flow that doesn't just go from noise to data, but from a corrupted version of the output to the correct output. This enables concurrent editing of all modalities simultaneously.

Standard flow vs edit flow

| Aspect     | Standard Flow          | Edit Flow                             |
|------------|------------------------|---------------------------------------|
| Source     | Pure noise             | Corrupted version of target           |
| Target     | Clean data             | Clean data                            |
| Corruption | Gaussian noise only    | Mask tokens + add noise to images     |
| For text   | N/A (text is discrete) | Mask random tokens → predict unmasked |
| For images | Noise → denoise        | Noisy patches → clean patches         |

The key insight: by defining corruption differently per modality (masking for text, noise for images), OneFlow can generate both modalities with the same flow matching objective. Text tokens are "demasked" while image patches are "denoised" — concurrently, in the same forward pass.

Edit flow: x_t = (1 − t) · corrupt(x_1) + t · x_1

Why "edit" flow? Instead of creating something from noise, the model edits a corrupted version into the correct version. At t=0, you have a heavily corrupted document (many masked tokens, very noisy images). At t=1, you have the clean output. The model learns to progressively fix the corruption — unmasking text and denoising images simultaneously.
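The per-modality corruption can be sketched on a toy sequence of four positions, two text and two image. Everything here is an assumption made for illustration: positions are small embedding vectors, `MASK_VEC` is a hypothetical [MASK] embedding, and every text position is masked at t=0.

```python
import random

MASK_VEC = [0.0, 0.0, 0.0]  # hypothetical embedding for the [MASK] token

def corrupt(x1, is_text, rng):
    """Per-modality corruption: text -> [MASK] embedding, image -> Gaussian noise."""
    out = []
    for vec, text in zip(x1, is_text):
        if text:
            out.append(list(MASK_VEC))
        else:
            out.append([rng.gauss(0.0, 1.0) for _ in vec])
    return out

def edit_flow_point(x1, x0, t):
    """x_t = (1 - t) * x0 + t * x1, where x0 = corrupt(x1) was drawn once."""
    return [[(1 - t) * a + t * b for a, b in zip(c, v)] for c, v in zip(x0, x1)]

rng = random.Random(0)
x1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],      # two text positions
      [0.3, -0.2, 0.5], [0.1, 0.4, -0.6]]    # two image positions
is_text = [True, True, False, False]

x0 = corrupt(x1, is_text, rng)               # fully corrupted sequence (t=0)
x_half = edit_flow_point(x1, x0, 0.5)        # halfway through the flow
```

Both modalities go through the same interpolation. Only `corrupt` knows which positions are text and which are image.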
Edit Flow: Concurrent Text + Image Refinement

Drag the slider to see how text (masked tokens revealed) and images (noise removed) are refined concurrently through the edit flow.

How does OneFlow handle the fact that text is discrete and images are continuous?

Chapter 3: Architecture

OneFlow uses a transformer backbone similar to Transfusion but with modifications for flow matching and concurrent generation.

Key architectural choices

| Component         | Design                                           | Why                                         |
|-------------------|--------------------------------------------------|---------------------------------------------|
| Text embedding    | Standard token embedding + continuous relaxation | Enables gradient flow through discrete text |
| Image embedding   | VAE latent patches + linear projection           | Continuous representation for flow matching |
| Attention         | Full bidirectional (not causal)                  | All positions generated concurrently        |
| Time conditioning | Sinusoidal time embedding added to all positions | Model knows where in the flow it is         |
| Output            | Velocity prediction for all positions            | Single unified objective                    |

Bidirectional attention is critical. Unlike Transfusion (which uses causal masking for text), OneFlow uses full bidirectional attention for ALL positions. This is possible because all tokens are generated concurrently — there's no "left-to-right" ordering during generation. Every position can attend to every other position, enabling maximum information flow.
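To make the contrast concrete, here is a small sketch in plain Python that builds both kinds of attention mask for a six-position sequence and counts how many query-key pairs each one allows. The helper names are illustrative and not part of OneFlow.

```python
def causal_mask(n):
    """allowed[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)

n = 6  # e.g. 3 text positions followed by 3 image positions
causal = causal_mask(n)
full = bidirectional_mask(n)

# Under causal masking the first position cannot see the last one;
# under full attention it can.
first_sees_last_causal = causal[0][n - 1]
first_sees_last_full = full[0][n - 1]
```

With a causal mask, the first text position never sees the image positions that follow it. With full attention, every position sees the whole sequence at every refinement step.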
python
# OneFlow forward pass (a method of the model class; submodules are defined elsewhere)
def forward(self, x_t, t, modality_mask=None):
    # x_t: [B, L, D], corrupted mixed sequence at time t
    # t: [B], flow time (0=corrupted, 1=clean)
    # modality_mask: [B, L], marks text vs image positions; unused in this
    #   simplified pass because every position is treated the same way

    # Add time conditioning
    time_emb = self.time_embed(t)              # [B, D]
    h = x_t + time_emb.unsqueeze(1)            # broadcast to all positions

    # Bidirectional transformer (no causal mask!)
    for layer in self.layers:
        h = layer(h, mask=None)                # full attention

    # Predict velocity for all positions
    velocity = self.velocity_head(h)           # [B, L, D]
    return velocity  # used to move x_t toward x_1

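The `time_embed` module is not spelled out above. One common choice for a sinusoidal time embedding is sketched below in plain Python. The function name, the `max_period` constant, and the sine-then-cosine layout are assumptions, not OneFlow's actual implementation.

```python
import math

def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map a scalar flow time t to a vector of sines and cosines (dim must be even)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb0 = sinusoidal_time_embedding(0.0, dim=8)   # start of the flow
emb1 = sinusoidal_time_embedding(1.0, dim=8)   # end of the flow
```

Because t lies in [0, 1], a real model might rescale it before embedding so that nearby times produce more distinct vectors. The text above does not specify that choice.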
OneFlow Architecture

OneFlow's transformer processes all modalities with full bidirectional attention and predicts velocity for concurrent generation.
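The table lists a "continuous relaxation" for text without saying which one. One simple way to picture it is sketched below, under assumptions of my own: tokens are one-hot vectors over a toy vocabulary, a partly revealed token is a point between the [MASK] vector and the token's vector, and decoding picks the largest coordinate.

```python
VOCAB = ["[MASK]", "the", "cat", "sat"]  # toy vocabulary, illustrative only

def one_hot(token):
    """One-hot vector for a token in VOCAB."""
    return [1.0 if tok == token else 0.0 for tok in VOCAB]

def relax(token, t):
    """Point between the [MASK] one-hot (t=0) and the token's one-hot (t=1)."""
    m, x = one_hot("[MASK]"), one_hot(token)
    return [(1 - t) * a + t * b for a, b in zip(m, x)]

def decode(vec):
    """Read a discrete token back out by picking the largest coordinate."""
    return VOCAB[max(range(len(vec)), key=lambda i: vec[i])]

early = decode(relax("cat", 0.2))   # still closer to [MASK]
late = decode(relax("cat", 0.8))    # now closer to the real token
```

In this picture a token reads as "demasked" once its continuous state crosses the midpoint between [MASK] and the target.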

Why does OneFlow use bidirectional attention instead of causal masking?

Chapter 4: Concurrent Generation

At inference time, OneFlow generates the entire mixed-modal output through an ODE (ordinary differential equation) solver. Starting from a fully corrupted sequence, it iteratively refines all positions:

Initialize (t=0)
All text positions: [MASK]. All image positions: random noise. The entire output is corrupted.
↓ ODE step (t → t + dt)
Predict Velocity
Forward pass: predict velocity v(x_t, t) for all positions simultaneously. Text tokens become less masked, images become less noisy.
↓ update x_t
Refine
x_{t+dt} = x_t + dt × v. Move all positions toward their clean versions.
↻ repeat for N steps until t=1
Output (t=1)
Clean mixed-modal document: coherent text + sharp images, generated concurrently.
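The loop above can be sketched as a fixed-step Euler solver in plain Python. `oracle_velocity` is a hypothetical stand-in for the trained network: it is handed the clean target, which a real model never sees, so the loop can be checked end to end.

```python
import random

def oracle_velocity(x, t, x1):
    """Stand-in for the trained model: exact velocity toward a known target x1
    on a straight-line path, v = (x1 - x) / (1 - t)."""
    return [[(b - a) / (1.0 - t) for a, b in zip(xv, tv)] for xv, tv in zip(x, x1)]

def generate(x_init, velocity_fn, n_steps):
    """Euler solver: x_{t+dt} = x_t + dt * v(x_t, t), updating all positions at once."""
    x = [list(v) for v in x_init]
    dt = 1.0 / n_steps
    passes = 0
    for i in range(n_steps):
        t = i * dt                 # t stays below 1, so (1 - t) is never zero
        v = velocity_fn(x, t)      # one forward pass covers every position
        passes += 1
        x = [[a + dt * b for a, b in zip(xv, vv)] for xv, vv in zip(x, v)]
    return x, passes

rng = random.Random(0)
target = [[1.0, 0.0], [0.0, 1.0], [0.3, -0.7]]    # toy clean sequence
x_init = [[0.0, 0.0], [0.0, 0.0],                   # text: [MASK] embedding
          [rng.gauss(0, 1), rng.gauss(0, 1)]]       # image: random noise
out, passes = generate(x_init, lambda x, t: oracle_velocity(x, t, target), 30)
```

The number of forward passes equals `n_steps`, however many positions the sequence has.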
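Speed advantage: Because all positions evolve concurrently, the number of forward passes scales with the number of ODE steps (typically 20-50), not with the sequence length. An autoregressive model needs L forward passes for L tokens, while OneFlow generates an entire 2000-token interleaved document in ~30 forward passes. Each OneFlow pass processes the full sequence with bidirectional attention and no KV cache, so the pass count alone is not a direct measure of wall-clock speedup.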
Concurrent Generation Simulator

Watch text tokens get unmasked and image patches get denoised simultaneously through the ODE solver. All positions evolve together.

Why is OneFlow's concurrent generation faster than autoregressive approaches?

Chapter 5: Training

OneFlow trains on interleaved text+image documents with the flow matching objective applied to both modalities simultaneously.

Training procedure

python
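# OneFlow training step (sketch; is_text and mask_emb are illustrative arguments)
import torch
import torch.nn.functional as F

def training_step(model, clean_seq, is_text, mask_emb):
    # clean_seq: [B, L, D], mixed text + image tokens (x_1)
    # is_text:   [B, L] bool, True at text positions
    # mask_emb:  [D], embedding of the [MASK] token
    B = clean_seq.shape[0]

    # Sample random time t
    t = torch.rand(B, device=clean_seq.device)    # [B], uniform in [0, 1)
    t_b = t.view(B, 1, 1)                         # broadcastable to [B, L, D]

    # Create corrupted version: [MASK] for text, Gaussian noise for images
    noise = torch.randn_like(clean_seq)
    x_0 = torch.where(is_text.unsqueeze(-1), mask_emb.to(clean_seq.dtype), noise)

    # Interpolate: x_t = (1-t)*x_0 + t*x_1
    x_t = (1 - t_b) * x_0 + t_b * clean_seq

    # Predict velocity
    v_pred = model(x_t, t, is_text)

    # Target velocity: x_1 - x_0
    v_target = clean_seq - x_0

    # Loss: MSE between predicted and target velocity
    loss = F.mse_loss(v_pred, v_target)
    return loss
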
Unified loss for both modalities. Unlike Transfusion (which uses separate losses for text and images), OneFlow uses a single velocity prediction loss for everything. The flow matching framework naturally handles both discrete (masked text) and continuous (noisy image) corruption within the same objective.
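One consequence of a single MSE over all positions is worth spelling out. When every position has the same embedding width, the unified loss equals an average of the text-only and image-only losses, weighted by how many positions each modality occupies. The plain-Python sketch below checks this on made-up numbers; the values and helper names are illustrative only.

```python
def mse(pred, target):
    """Mean squared error over every coordinate of every position."""
    sq = [(p - q) ** 2 for pv, tv in zip(pred, target) for p, q in zip(pv, tv)]
    return sum(sq) / len(sq)

def select(seq, is_text, want_text):
    """Keep only the positions of one modality."""
    return [v for v, txt in zip(seq, is_text) if txt == want_text]

v_pred   = [[0.9, 0.1], [0.2, 0.8], [0.5, -0.5], [0.0, 0.3], [0.2, 0.2]]
v_target = [[1.0, 0.0], [0.0, 1.0], [0.4, -0.6], [0.1, 0.1], [0.0, 0.5]]
is_text  = [True, True, False, False, False]

total = mse(v_pred, v_target)
text_loss = mse(select(v_pred, is_text, True), select(v_target, is_text, True))
image_loss = mse(select(v_pred, is_text, False), select(v_target, is_text, False))

n_text = sum(is_text)
n_image = len(is_text) - n_text
weighted = (n_text * text_loss + n_image * image_loss) / len(is_text)
```

In a document dominated by image patches, the image term therefore carries most of the gradient unless the loss is reweighted. Whether OneFlow reweights is not stated above.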
Training Loss Visualization

See how the unified flow matching loss captures both text demasking and image denoising in a single objective.

How does OneFlow's training objective differ from Transfusion's?

Chapter 6: Results & Showcase

OneFlow achieves competitive results on both text and image generation benchmarks while offering a fundamentally different generation paradigm: concurrent rather than sequential.

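| Model       | Generation Mode         | Image FID ↓ | Interleaved Quality | Forward Passes |
|-------------|-------------------------|-------------|---------------------|----------------|
| Chameleon   | Sequential AR           | ~24         | Good                | ~2000+         |
| Transfusion | AR text + diffusion img | ~6.8        | Good                | ~500+          |
| OneFlow     | Concurrent flow         | ~7.5        | Best                | ~30-50         |
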
OneFlow's advantages: While image FID is slightly behind Transfusion, OneFlow excels at interleaved generation quality (text-image coherence in documents with multiple images) and requires 10-60x fewer forward passes. For applications like generating illustrated articles or visual tutorials, OneFlow's concurrent approach produces more coherent output because text and images co-evolve.
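The quoted range of fewer forward passes follows from the table's approximate pass counts. The arithmetic is sketched below in plain Python, treating "~2000+" and "~500+" as 2000 and 500. The function name is illustrative.

```python
def speedup_range(baseline_passes, oneflow_passes=(30, 50)):
    """Ratio of baseline forward passes to OneFlow's, as (low, high)."""
    fewest, most = oneflow_passes
    return baseline_passes / most, baseline_passes / fewest

chameleon = speedup_range(2000)     # table lists ~2000+ passes
transfusion = speedup_range(500)    # table lists ~500+ passes
```

The low end comes from Transfusion against 50 OneFlow steps and the high end from Chameleon against 30 steps. These ratios count passes only and say nothing about the cost of each pass.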
OneFlow vs Baselines

Compare generation speed and quality across different approaches.

Where does OneFlow shine compared to Transfusion?

Chapter 7: Connections

OneFlow represents the latest evolution in multimodal generation: from separate models, to unified sequential models, to unified concurrent models.

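| Generation Paradigm | Models                | Speed               | Interleaved Quality |
|---------------------|-----------------------|---------------------|---------------------|
| Sequential AR       | Chameleon, DALL-E     | Slow (L passes)     | Moderate            |
| AR + Diffusion      | Transfusion, LMFusion | Medium              | Good                |
| Concurrent Flow     | OneFlow               | Fast (30-50 passes) | Best                |
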
Lesson 1: Concurrency beats sequentiality for interleaved content. When text and images must be coherent within a document, generating them simultaneously produces better results than generating them one at a time.
Lesson 2: Flow matching unifies modalities. The edit flow framework naturally handles both discrete text (via masking) and continuous images (via noise) with a single objective. This is more elegant than Transfusion's dual-objective approach.
Lesson 3: Speed and quality can be complementary. OneFlow is both faster AND better at interleaved generation. Concurrent generation isn't just an efficiency trick — it enables richer cross-modal interaction during the generation process itself.
Generation Paradigm Evolution

From sequential to concurrent: the evolution of multimodal generation.

What is OneFlow's most important conceptual contribution?