Concurrent Mixed-Modal and Interleaved Generation with Edit Flows — generate text and images simultaneously using flow matching, not sequentially.
Most earlier unified multimodal models share a constraint: they generate text and images sequentially. Chameleon emits text and image tokens one at a time in a single autoregressive stream. Transfusion generates text autoregressively, then switches to diffusion denoising for each image. In both cases, an image cannot begin until the preceding text is complete.
This creates a bottleneck for interleaved generation — documents where text and images alternate naturally (like a recipe with step-by-step photos, or a scientific report with inline figures).
Think of the difference between building a house room by room (sequential) vs having all construction crews work simultaneously on different rooms (concurrent). The concurrent approach is faster and ensures the rooms fit together coherently.
Compare sequential generation (top) where text comes first, then images, vs OneFlow's concurrent approach (bottom) where both evolve simultaneously.
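Flow matching is a generative modeling framework that learns to transport samples from a simple distribution (noise) to the data distribution. With the linear interpolation used here, samples travel along straight-line paths. Training is a plain regression on velocities, which is often simpler and more stable than diffusion training.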
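Given a noise sample $x_0 \sim \mathcal{N}(0, I)$ and a data sample $x_1$, define a straight-line path between them:

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]$$

The model learns a velocity field $v_\theta(x_t, t)$ that predicts the direction from $x_0$ to $x_1$ at any point $t$. Along the straight-line path, that target direction is constant:

$$\frac{d x_t}{d t} = x_1 - x_0$$

The training loss is simply the mean squared error between the predicted and target velocities:

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left[\,\lVert v_\theta(x_t, t) - (x_1 - x_0) \rVert^2\,\right]$$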
Watch samples flow from noise (t=0) to data (t=1) along straight-line paths. The model learns the velocity field that drives this transport.
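The straight-line path, its target velocity, and the regression loss can be sketched in a few lines. This is a minimal NumPy illustration on toy 2-D points, not OneFlow's implementation; the function names and the toy data are assumptions made for the example.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path: x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return x1 - x0

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))        # noise samples
x1 = rng.standard_normal((4, 2)) + 3.0  # toy "data" samples

x_half = interpolate(x0, x1, 0.5)       # midpoint of each path
# A predictor that outputs the exact target velocity has zero loss.
perfect_loss = flow_matching_loss(target_velocity(x0, x1), x0, x1)
```

At `t = 0` the interpolation returns the noise sample and at `t = 1` it returns the data sample, which is what makes the velocity `x1 - x0` the right regression target.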
OneFlow's core innovation is the edit flow — a flow that doesn't just go from noise to data, but from a corrupted version of the output to the correct output. This enables concurrent editing of all modalities simultaneously.
| Aspect | Standard Flow | Edit Flow |
|---|---|---|
| Source | Pure noise | Corrupted version of target |
| Target | Clean data | Clean data |
| Corruption | Gaussian noise only | Mask tokens + add noise to images |
| For text | N/A (text is discrete) | Mask random tokens → predict the original tokens |
| For images | Noise → denoise | Noisy patches → clean patches |
The key insight: by defining corruption differently per modality (masking for text, noise for images), OneFlow can generate both modalities with the same flow matching objective. Text tokens are "demasked" while image patches are "denoised" — concurrently, in the same forward pass.
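The per-modality corruption described above can be sketched as two small functions. This is an illustrative NumPy sketch only: the mask id, the rule "keep each token with probability t", and the function names are assumptions for the example, not details confirmed by the source.

```python
import numpy as np

MASK_ID = -1  # hypothetical id standing in for a [MASK] token

def corrupt_text(token_ids, t, rng):
    """Keep each token with probability t; replace it with MASK_ID otherwise.

    t = 0 masks everything, t = 1 leaves the text untouched.
    """
    keep = rng.random(token_ids.shape) < t
    return np.where(keep, token_ids, MASK_ID)

def corrupt_image(patches, t, rng):
    """Blend clean patches with Gaussian noise along a straight line.

    t = 0 is pure noise, t = 1 is the clean patches.
    """
    noise = rng.standard_normal(patches.shape)
    return (1.0 - t) * noise + t * patches

rng = np.random.default_rng(0)
tokens = np.array([5, 17, 42, 8, 99, 3])   # toy text token ids
patches = rng.standard_normal((4, 16))     # toy image latent patches

# Both modalities are corrupted to the same flow time t.
noisy_tokens = corrupt_text(tokens, 0.5, rng)
noisy_patches = corrupt_image(patches, 0.5, rng)
```

Both functions share the same time variable `t`, which is what lets a single model refine the two modalities in step with each other.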
Text (masked tokens revealed) and images (noise removed) are refined concurrently through the edit flow.
OneFlow uses a transformer backbone similar to Transfusion but with modifications for flow matching and concurrent generation.
| Component | Design | Why |
|---|---|---|
| Text embedding | Standard token embedding + continuous relaxation | Enables gradient flow through discrete text |
| Image embedding | VAE latent patches + linear projection | Continuous representation for flow matching |
| Attention | Full bidirectional (not causal) | All positions generated concurrently |
| Time conditioning | Sinusoidal time embedding added to all positions | Model knows where in the flow it is |
| Output | Velocity prediction for all positions | Single unified objective |
```python
# OneFlow forward pass
def forward(self, x_t, t, modality_mask):
    # x_t: [B, L, D], corrupted mixed sequence at time t
    # t:   [B], flow time (0 = corrupted, 1 = clean)

    # Add time conditioning
    time_emb = self.time_embed(t)        # [B, D]
    x_t = x_t + time_emb.unsqueeze(1)    # broadcast to all positions

    # Bidirectional transformer (no causal mask!)
    for layer in self.layers:
        x_t = layer(x_t, mask=None)      # full attention

    # Predict velocity for all positions
    velocity = self.velocity_head(x_t)   # [B, L, D]
    return velocity                      # used to move x_t toward x_1
```
OneFlow's transformer processes all modalities with full bidirectional attention and predicts velocity for concurrent generation.
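The time conditioning listed in the table can be illustrated with a standard sinusoidal embedding. This is a sketch assuming the common transformer-style sin/cos formulation; the function name, `max_period`, and the `scale` factor are illustrative choices, not values taken from OneFlow.

```python
import numpy as np

def sinusoidal_time_embedding(t, dim, max_period=10000.0, scale=1000.0):
    """Map flow times t of shape [B] to embeddings of shape [B, dim].

    dim must be even: the first half holds sines, the second half cosines.
    scale stretches t from [0, 1] so the sinusoids vary noticeably (assumed).
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)    # [half]
    args = scale * np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)    # [B, dim]

t = np.array([0.0, 0.5, 1.0])                 # one flow time per example
emb = sinusoidal_time_embedding(t, dim=8)     # [3, 8]

x = np.zeros((3, 5, 8))                       # [B, L, D] toy sequence
x_cond = x + emb[:, None, :]                  # same embedding at every position
```

Adding one embedding per example to every position mirrors the `unsqueeze(1)` broadcast in the forward pass: all text and image positions see the same flow time.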
At inference time, OneFlow generates the entire mixed-modal output through an ODE (ordinary differential equation) solver. Starting from a fully corrupted sequence at t=0, it repeatedly evaluates the predicted velocity and takes a small step toward t=1, refining all positions at once.
Watch text tokens get unmasked and image patches get denoised simultaneously through the ODE solver. All positions evolve together.
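The simplest ODE solver for this kind of sampling is fixed-step Euler integration. The sketch below assumes Euler steps and uses an oracle velocity for a known target in place of a trained model, so it can run on its own. `euler_sample` and `oracle_velocity` are hypothetical names, and the source does not say which solver OneFlow uses.

```python
import numpy as np

def euler_sample(velocity_fn, x_start, num_steps=40):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x_start, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)   # every position moves in the same step
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal((6, 4))   # toy clean sequence, [L, D]
x0 = rng.standard_normal((6, 4))   # fully corrupted starting point

def oracle_velocity(x, t):
    """Stand-in for the trained model: the exact velocity toward x1.

    On the straight-line path, x1 - x0 equals (x1 - x_t) / (1 - t).
    """
    return (x1 - x) / (1.0 - t)

result = euler_sample(oracle_velocity, x0, num_steps=40)
```

Each step costs one forward pass over the whole sequence, so `num_steps` plays the role of the forward-pass count quoted for concurrent generation.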
OneFlow trains on interleaved text+image documents with the flow matching objective applied to both modalities simultaneously.
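```python
import torch
import torch.nn.functional as F

# OneFlow training step (simplified sketch).
# random_mask_tokens and apply_corruption are helpers not shown in the source.
def training_step(model, clean_seq, modality_mask):
    # clean_seq:     [B, L, D], mixed text + image embeddings
    # modality_mask: [B, L], marks which positions are text vs. image
    B = clean_seq.shape[0]

    # Sample one random flow time per example
    t = torch.rand(B, device=clean_seq.device)   # [B], uniform in [0, 1]

    # Create the corrupted source x_0
    noise = torch.randn_like(clean_seq)          # for images
    mask_noise = random_mask_tokens(clean_seq)   # for text
    x_0 = apply_corruption(clean_seq, noise, mask_noise, modality_mask)

    # Interpolate: x_t = (1 - t) * x_0 + t * x_1
    t_b = t.view(B, 1, 1)                        # reshape so t broadcasts over [L, D]
    x_t = (1 - t_b) * x_0 + t_b * clean_seq

    # Predict velocity
    v_pred = model(x_t, t, modality_mask)

    # Target velocity: x_1 - x_0
    v_target = clean_seq - x_0

    # Loss: MSE between predicted and target velocity
    loss = F.mse_loss(v_pred, v_target)
    return loss
```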
See how the unified flow matching loss captures both text demasking and image denoising in a single objective.
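One way to inspect how a single MSE objective covers both modalities is to split it by position type. The sketch below is a hypothetical diagnostic, not part of the source: `split_loss` and the `is_text` mask are assumptions for illustration. It checks that the unified loss equals the position-weighted average of the text and image terms.

```python
import numpy as np

def split_loss(v_pred, v_target, is_text):
    """Return (total, text, image) mean squared velocity errors.

    v_pred, v_target: [B, L, D]; is_text: boolean [B, L], True at text positions.
    """
    per_pos = ((v_pred - v_target) ** 2).mean(axis=-1)   # [B, L]
    total = float(per_pos.mean())
    text = float(per_pos[is_text].mean())
    image = float(per_pos[~is_text].mean())
    return total, text, image

rng = np.random.default_rng(0)
v_target = rng.standard_normal((2, 6, 8))
v_pred = v_target + 0.1 * rng.standard_normal((2, 6, 8))

is_text = np.zeros((2, 6), dtype=bool)
is_text[:, :4] = True            # first 4 positions are text, last 2 are image

total, text_loss, image_loss = split_loss(v_pred, v_target, is_text)
frac_text = is_text.mean()       # share of positions that are text
```

Because the total is a weighted average, a sequence dominated by one modality also dominates the loss; whether OneFlow reweights the two terms is not stated in the source.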
OneFlow achieves competitive results on both text and image generation benchmarks while offering a fundamentally different generation paradigm: concurrent rather than sequential.
| Model | Generation Mode | Image FID ↓ | Interleaved Quality | Forward Passes |
|---|---|---|---|---|
| Chameleon | Sequential AR | ~24 | Good | ~2000+ |
| Transfusion | AR text + diffusion img | ~6.8 | Good | ~500+ |
| OneFlow | Concurrent flow | ~7.5 | Best | ~30-50 |
Compare generation speed and quality across different approaches.
OneFlow represents the latest evolution in multimodal generation: from separate models, to unified sequential models, to unified concurrent models.
| Generation Paradigm | Models | Speed | Interleaved Quality |
|---|---|---|---|
| Sequential AR | Chameleon, DALL-E | Slow (one pass per token, L total) | Moderate |
| AR + Diffusion | Transfusion, LMFusion | Medium | Good |
| Concurrent Flow | OneFlow | Fast (30-50 passes) | Best |
From sequential to concurrent: the evolution of multimodal generation.