Liang et al. — 2024

Mixture-of-Transformers

A Sparse and Scalable Architecture for Multi-Modal Foundation Models — share attention across modalities but give each modality its own feed-forward experts.

Prerequisites: Transformers + Mixture-of-Experts + Chameleon/Transfusion (recommended). That's it.

Chapter 0: The Density Problem

You've just trained Chameleon or Transfusion — a single transformer that handles both text and images. It works. But there's a nagging inefficiency: every token, regardless of modality, activates every parameter in the network. When the model processes a text token, all the parameters that learned about image patches are activated uselessly. When it processes an image patch, all the text-specific knowledge is wasted computation.

This is the density problem. In a dense transformer, the full model is active for every input token. For a 7B-parameter model, that is roughly 7 billion multiply-accumulate operations (about 14 GFLOPs) per token in the forward pass, regardless of whether that token is text or image. Most of those parameters are in the feed-forward network (FFN), which typically accounts for ~67% of model parameters.

| Component | % of Params (typical) | Shared by Modalities? |
| --- | --- | --- |
| Embeddings | ~5% | Yes (in Chameleon) / No (in Transfusion) |
| Attention (QKV + output) | ~28% | Yes (always shared) |
| FFN layers | ~67% | Yes (dense) / No (MoT) |
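
The roughly two-thirds FFN share can be checked by counting weights in a single block. The sketch below assumes a standard layer with four D×D attention projections and a two-matrix FFN with hidden size 4D; the helper name and the example width are illustrative, and embeddings, biases and norms are ignored, so the shares differ slightly from the table.

```python
def layer_param_share(d_model: int, ffn_mult: int = 4) -> dict:
    """Per-layer parameter split for a standard transformer block (biases and norms ignored)."""
    attn = 4 * d_model * d_model              # Q, K, V and output projections
    ffn = 2 * d_model * (ffn_mult * d_model)  # up and down projections
    total = attn + ffn
    return {"attention": attn / total, "ffn": ffn / total}

share = layer_param_share(4096)
print(f"attention: {share['attention']:.0%}, ffn: {share['ffn']:.0%}")
# prints: attention: 33%, ffn: 67%
```

Because both terms are proportional to the square of the width, the split does not depend on the value of `d_model` chosen here.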

The FFN layers are where most modality-specific processing happens. Text FFNs learn syntactic patterns and word associations. Image FFNs learn spatial features and color relationships. Forcing them to share the same parameters means each modality gets a compromised representation.

MoT's key idea: Keep the attention layers shared (they benefit from cross-modal interaction) but give each modality its own dedicated FFN layers. Text tokens go through text-specific feed-forward experts. Image patches go through image-specific feed-forward experts. The attention mechanism remains the meeting point where modalities exchange information. This is "sparse" because only a subset of parameters is active for each token.

Think of it as a bilingual office. Everyone meets in the same conference room (attention) to discuss projects together. But each language group has its own workspace (FFN) where they do their focused work in their native language. The meetings enable collaboration; the separate workspaces enable specialization.

Dense vs MoT Parameter Usage

Toggle between Dense (all parameters active for every token) and MoT (only modality-specific experts active). Watch how the active parameter count changes.

What is the core inefficiency in dense multimodal transformers that MoT addresses?

Chapter 1: Dense vs Sparse

Before diving into MoT specifically, let's understand the broader idea of sparse computation in transformers and why it matters for multimodal models.

Dense models: every token uses everything

In a dense transformer (GPT, LLaMA, Chameleon), every token passes through every layer, every attention head, and every FFN neuron. If the model has N parameters, the FLOPs per token scale linearly with N. Doubling model size doubles compute cost per token.

Mixture-of-Experts (MoE): routing to subsets

Standard MoE models (like Mixtral and Switch Transformer) have multiple FFN "experts" per layer and a router that selects which experts to use for each token. Each token activates only k out of E experts (for example k=2, E=8 in Mixtral; Switch Transformer routes each token to a single expert, k=1). This means the model can have many more total parameters without increasing per-token compute.

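$$\text{FLOPs}_{\text{dense}} = 2 \cdot N \cdot L \qquad \text{FLOPs}_{\text{MoE}} = 2 \cdot \left(N_{\text{attn}} + \frac{k}{E} \cdot N_{\text{FFN}}\right) \cdot L$$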
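
A small worked example may make the two formulas concrete. It reads N as the parameter count and L as the number of tokens processed, and it uses a hypothetical model split (1B attention parameters, 16B FFN parameters spread over 8 experts). These numbers are illustrative assumptions, not figures from the paper.

```python
def flops_dense(n_params: float, n_tokens: int) -> float:
    """Forward FLOPs for a dense model: about 2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens

def flops_moe(n_attn: float, n_ffn_total: float, k: int, num_experts: int, n_tokens: int) -> float:
    """Forward FLOPs for an MoE model: attention always runs, only k of E experts run."""
    active = n_attn + (k / num_experts) * n_ffn_total
    return 2 * active * n_tokens

n_attn, n_ffn_total = 1e9, 16e9   # hypothetical parameter split
tokens = 1000

dense = flops_dense(n_attn + n_ffn_total, tokens)
moe = flops_moe(n_attn, n_ffn_total, k=2, num_experts=8, n_tokens=tokens)
print(f"dense: {dense:.2e} FLOPs, MoE: {moe:.2e} FLOPs, ratio: {moe / dense:.2f}")
```

With top-2 routing over 8 experts only a quarter of the FFN weights run per token, so the MoE model spends about 29% of the FLOPs of a dense model with the same total parameter count.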

MoT: modality IS the routing signal

Here's MoT's simplification: instead of learning a router network that decides which expert to use (as in standard MoE), the modality of the token determines the expert. Text tokens always go to the text FFN. Image tokens always go to the image FFN. No learned routing, no load balancing loss, no routing collapse — the routing is deterministic and free.

| Property | Dense | Standard MoE | MoT |
| --- | --- | --- | --- |
| Routing | None (all params) | Learned router | Deterministic (by modality) |
| # Experts | 1 (shared) | E (usually 8) | M (one per modality) |
| Routing overhead | None | Router network + aux loss | Zero |
| Active params/token | 100% | ~25-30% | ~60% with two modalities (shared attn + one of two FFNs) |
| Specialization | None | Emergent | Explicit (by modality) |
```python
import torch
import torch.nn as nn

# Schematic only: submodules (self.router, self.attention, self.text_ffn,
# self.image_ffn) are assumed to be created in __init__.

# Standard MoE: learned routing, complex
class MoELayer(nn.Module):
    def forward(self, x):
        router_logits = self.router(x)            # [B, L, num_experts]
        weights, indices = router_logits.topk(2)  # pick top-2 experts
        # Complex routing, load balancing loss, etc.
        ...

# MoT: deterministic routing by modality, simple
class MoTLayer(nn.Module):
    def forward(self, x, modality_mask):
        # x: [B, L, D]; modality_mask: [B, L] with 0 = text, 1 = image
        # Shared attention for ALL tokens
        attn_out = self.attention(x)

        # Separate FFN per modality, no router needed
        text_out = self.text_ffn(attn_out[modality_mask == 0])
        image_out = self.image_ffn(attn_out[modality_mask == 1])

        # Merge back into the original token order
        out = torch.empty_like(attn_out)
        out[modality_mask == 0] = text_out
        out[modality_mask == 1] = image_out
        return out
```
Why deterministic routing works: In standard MoE, the model must learn which expert handles which input — and this learning process is fragile (routing collapse, load imbalance). In MoT, the routing is known a priori because modality is an inherent property of each token. This makes training more stable and eliminates the auxiliary load-balancing losses that plague standard MoE.
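
To see that deterministic routing needs nothing beyond a mask, here is a minimal runnable sketch. The two `nn.Linear` layers are stand-ins for the text and image experts, and the tensor sizes and modality pattern are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8
text_ffn = nn.Linear(dim, dim)    # stand-in for the text expert
image_ffn = nn.Linear(dim, dim)   # stand-in for the image expert

x = torch.randn(2, 5, dim)                      # [B, L, D]
modality = torch.tensor([[0, 0, 1, 1, 0],
                         [1, 1, 1, 0, 0]])      # 0 = text, 1 = image

# Deterministic routing: the mask comes with the data, nothing is learned.
out = torch.empty_like(x)
out[modality == 0] = text_ffn(x[modality == 0])
out[modality == 1] = image_ffn(x[modality == 1])

# Each token's output matches what its own expert alone would produce.
assert torch.allclose(out[0, 0], text_ffn(x[0, 0]), atol=1e-6)
assert torch.allclose(out[0, 2], image_ffn(x[0, 2]), atol=1e-6)
print(out.shape)
```

There is no router module, no top-k selection and no auxiliary loss in this sketch; the only bookkeeping is the boolean mask used to gather and scatter tokens.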
Routing Strategies Compared

See how different routing strategies assign tokens to experts. Dense uses all; MoE uses a learned router; MoT routes deterministically by modality.

How does MoT's routing differ from standard Mixture-of-Experts?

Chapter 2: MoT Architecture

Now let's look at MoT's complete architecture. The design principle is: share what benefits from sharing (attention) and separate what benefits from separation (FFN, embeddings, output heads).

Per-layer structure

Each transformer layer in MoT has three zones:

Modality-Specific Input Norms
Separate RMSNorm for text and image tokens. Each modality enters with its own normalization statistics.
Shared Self-Attention
All tokens (text + image) attend to all other tokens through the same Q, K, V projections. Cross-modal interaction happens here.
Modality-Specific FFN
Text tokens → Text FFN (separate weights). Image tokens → Image FFN (separate weights). No cross-contamination.

What exactly is shared vs separate?

| Component | Shared? | Rationale |
| --- | --- | --- |
| Token embedding | Separate per modality | Text: lookup table. Images: linear projection. Different input types. |
| Positional encoding | Shared (RoPE) | Position in sequence is modality-agnostic. |
| Q, K, V projections | Shared | Shared attention enables cross-modal reasoning. |
| Attention output proj | Shared | Part of the attention mechanism. |
| FFN up + down proj | Separate per modality | FFN learns modality-specific features. |
| Layer norms | Separate per modality | Different activation statistics per modality. |
| Output head | Separate per modality | Text: softmax. Images: noise/continuous pred. |
```python
import torch
import torch.nn as nn
from torch.nn import RMSNorm  # built in from PyTorch 2.4; substitute your own otherwise

# MultiHeadAttention and SwiGLU_FFN are assumed to be defined elsewhere.

# MoT Transformer Block
class MoTBlock(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()  # required before registering submodules

        # Shared attention
        self.attn = MultiHeadAttention(dim, n_heads)

        # Modality-specific norms
        self.norm_text = RMSNorm(dim)
        self.norm_image = RMSNorm(dim)
        self.post_norm_text = RMSNorm(dim)
        self.post_norm_image = RMSNorm(dim)

        # Modality-specific FFNs (the key innovation)
        self.ffn_text = SwiGLU_FFN(dim, dim * 4)
        self.ffn_image = SwiGLU_FFN(dim, dim * 4)

    def forward(self, x, mask, modality):
        # x: [B, L, D]; modality: [B, L] with 0 = text, 1 = image
        # Step 1: modality-specific pre-norm
        # (both norms are computed for every token, then selected per token)
        normed = torch.where(modality.unsqueeze(-1) == 0,
                             self.norm_text(x), self.norm_image(x))

        # Step 2: shared attention (ALL tokens interact)
        attn_out = self.attn(normed, mask=mask) + x  # residual

        # Step 3: modality-specific post-norm + FFN
        text_mask = (modality == 0)
        img_mask = (modality == 1)

        out = torch.empty_like(attn_out)
        if text_mask.any():
            t = self.post_norm_text(attn_out[text_mask])
            out[text_mask] = self.ffn_text(t) + attn_out[text_mask]  # residual
        if img_mask.any():
            m = self.post_norm_image(attn_out[img_mask])
            out[img_mask] = self.ffn_image(m) + attn_out[img_mask]  # residual

        return out
```
Parameter budget insight: In a standard 7B dense transformer, the FFN accounts for ~4.7B parameters. MoT with 2 modalities has 2 FFNs = ~9.4B FFN parameters, but only ~4.7B are active for any given token. So MoT has more total parameters (about 11.7B) but the same per-token compute as the 7B dense model. You get specialization at no extra compute; the price is the memory for the second FFN.
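
The budget above is simple enough to verify by hand. The sketch below redoes the arithmetic under one simplifying assumption that is mine rather than the source's: everything that is not FFN (attention, embeddings, norms) is counted as shared.

```python
dense_total = 7.0e9          # dense model size from the text
ffn_params = 4.7e9           # FFN share quoted above (~67%)
num_modalities = 2

shared = dense_total - ffn_params                 # attention, embeddings, etc.
mot_total = shared + num_modalities * ffn_params  # every modality gets its own FFN
mot_active = shared + ffn_params                  # but each token uses only one

print(f"MoT total:  {mot_total / 1e9:.1f}B")
print(f"MoT active: {mot_active / 1e9:.1f}B per token")
print(f"active fraction: {mot_active / mot_total:.0%}")
```

Under this accounting the model stores about 11.7B parameters and uses 7.0B of them per token, an active fraction of roughly 60%.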
MoT Layer Architecture

Click "Forward Pass" to watch text and image tokens flow through a MoT layer. Note how they share attention but diverge at the FFN.

In MoT, why is the attention mechanism shared across modalities while the FFN is separate?

Chapter 3: Modality Experts

The feed-forward experts in MoT are not just separate copies of the same FFN. They develop genuinely different internal representations. Let's look at what each expert learns and why separation helps.

Text expert specialization

The text FFN processes tokens through a nonlinear transformation that learns linguistic features:

$$\text{FFN}_{\text{text}}(x) = W_2 \left(\text{SiLU}(W_{\text{gate}}\,x) \odot W_1\,x\right) \qquad \text{where} \qquad \text{SiLU}(z) = z \cdot \sigma(z)$$
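
For reference, a SwiGLU feed-forward block in its common formulation (a SiLU-gated product of two projections, followed by a down projection) can be sketched as follows. The class name, layer names and sizes are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Two experts with identical structure but independent weights.
ffn_text = SwiGLUFFN(dim=16, hidden_dim=64)
ffn_image = SwiGLUFFN(dim=16, hidden_dim=64)
tokens = torch.randn(3, 16)
print(ffn_text(tokens).shape)
```

Note that the gated form has three weight matrices rather than two, so with a hidden size of 4D it holds 12D² parameters; implementations often shrink the hidden size to stay near the 8D² budget of a two-matrix FFN.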

The text expert's weights learn patterns like: syntactic agreement, semantic relationships between words, factual associations, logical reasoning patterns. These are fundamentally different from what an image expert needs.

Image expert specialization

The image FFN processes patches through the same architectural structure but learns completely different features: spatial relationships between patches, texture patterns, color distributions, object boundaries. These features are meaningless for text tokens.

Why not share?

The motivation is intuitive. When the FFN is shared (dense model), it must use the same neurons for both "is this token part of a verb phrase?" (text) and "does this patch contain an edge?" (image). These are unrelated computations that interfere with each other. Separation eliminates this interference.

```python
# Conceptual illustration only: hypothetical neuron roles, not measured activations.

# What the text expert learns (neuron index -> pattern it responds to)
text_expert_neurons = {
    42: "high for tokens following 'the' (expects noun)",
    137: "high for closing parentheses (tracking structure)",
    891: "high for tokens in quoted strings",
}

# What the image expert learns (same indices, unrelated features)
img_expert_neurons = {
    42: "high for patches with horizontal edges",
    137: "high for patches in upper-left quadrant",
    891: "high for patches with warm colors (skin tones)",
}
```
The capacity argument: A shared FFN of size D × 4D has $8D^2$ parameters learning both text and image features. Two separate FFNs have $16D^2$ total but each modality gets the full $8D^2$ dedicated capacity. Per-token compute is the same ($8D^2$) but each modality gets an uncompromised representation.

Extending to more modalities

MoT naturally extends to 3+ modalities. For a model handling text, images, and audio: three FFN experts, one shared attention. The per-token compute stays constant (one FFN + shared attention) regardless of the number of modalities. Only total parameter count grows.

| # Modalities | Total FFN Params | Active FFN Params/Token | Overhead vs Dense |
| --- | --- | --- | --- |
| 1 (dense) | $8D^2$ | $8D^2$ | Baseline |
| 2 (text+image) | $16D^2$ | $8D^2$ | 2× params, 1× compute |
| 3 (text+image+audio) | $24D^2$ | $8D^2$ | 3× params, 1× compute |
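
The same routing generalizes to any number of modalities by keeping one expert per modality id. The following is a sketch under stated assumptions: the class name is hypothetical, the experts are plain two-layer GELU MLPs rather than SwiGLU blocks, and the modality ids (0 = text, 1 = image, 2 = audio) are a convention chosen for the example.

```python
import torch
import torch.nn as nn

class ModalityRoutedFFN(nn.Module):
    """One feed-forward expert per modality; tokens are routed by their modality id."""
    def __init__(self, dim: int, hidden_dim: int, num_modalities: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_modalities)
        ])

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: [B, L, D]; modality: [B, L] with integer ids in [0, num_modalities)
        out = torch.zeros_like(x)
        for m, expert in enumerate(self.experts):
            mask = modality == m
            if mask.any():
                out[mask] = expert(x[mask])   # each token passes through exactly one expert
        return out

layer = ModalityRoutedFFN(dim=8, hidden_dim=32, num_modalities=3)
x = torch.randn(2, 6, 8)
modality = torch.tensor([[0, 0, 1, 1, 2, 2],
                         [2, 1, 0, 0, 1, 2]])   # 0 = text, 1 = image, 2 = audio
y = layer(x, modality)

total = sum(p.numel() for p in layer.parameters())
per_expert = sum(p.numel() for p in layer.experts[0].parameters())
print(y.shape, total // per_expert)
```

Total parameters grow linearly with the number of experts (three times one expert here), while each token still runs through a single expert.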
Expert Specialization Viewer

See what different expert neurons respond to. Text expert neurons (teal) fire for linguistic patterns. Image expert neurons (orange) fire for visual patterns. In a dense model, these would interfere.

Why does MoT achieve better results than a dense model with the same per-token compute?

Chapter 4: Shared Attention

MoT's decision to share attention across modalities is not arbitrary — it's grounded in what attention actually does and why cross-modal interaction matters.

What attention computes

Self-attention computes a weighted combination of all positions' values, where the weights are determined by query-key similarity. In a multimodal sequence, this means:

| Interaction | What It Captures | Example |
| --- | --- | --- |
| Text→Text | Standard linguistic attention | "it" attends to "the cat" (coreference) |
| Image→Image | Spatial relationships between patches | Patch of sky attends to adjacent sky patches |
| Text→Image | Grounding text in visual content | "red" attends to patches with red objects |
| Image→Text | Visual content informed by textual context | Patch processing changes based on preceding caption |

The cross-modal interactions (Text→Image and Image→Text) are why attention must be shared. If attention were separate per modality, the model couldn't learn to ground language in vision or condition visual processing on text.
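
The four interaction types correspond to four blocks of a single attention matrix. The sketch below uses made-up pre-softmax scores for two text tokens followed by two image patches, assumes bidirectional attention with no causal mask, and reports how much attention weight each query modality places on each key modality. The function names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def interaction_mass(scores, modality):
    """Attention weight per (query modality -> key modality) block, averaged over queries."""
    names = {0: "Text", 1: "Image"}
    mass, counts = {}, {}
    for q, row in enumerate(scores):
        weights = softmax(row)
        counts[modality[q]] = counts.get(modality[q], 0) + 1
        for k, w in enumerate(weights):
            key = (modality[q], modality[k])
            mass[key] = mass.get(key, 0.0) + w
    return {f"{names[a]}->{names[b]}": v / counts[a] for (a, b), v in mass.items()}

# Toy scores: rows are queries, columns are keys (made-up numbers).
modality = [0, 0, 1, 1]
scores = [
    [2.0, 1.0, 0.5, 0.0],
    [1.0, 2.0, 1.5, 0.0],
    [0.5, 0.5, 2.0, 1.0],
    [0.0, 1.0, 1.0, 2.0],
]
result = interaction_mass(scores, modality)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

If attention were computed separately per modality, the Text->Image and Image->Text blocks would be forced to zero; sharing the attention computation is what lets them carry weight.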

MoT's ablation results

The paper ablates different levels of sharing. The results are clear:

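| Configuration | Shared Components | Text PPL (vs. dense) | Image FID (vs. dense) |
| --- | --- | --- | --- |
| Fully dense | Everything | Baseline | Baseline |
| Separate FFN only (MoT) | Attention | -3.7% | -7.2% |
| Separate attention + FFN | Nothing | -2.1% | -9.1% |
| Separate FFN + embeddings + norms | Attention only | -4.2% | -10.4% |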

Separating the FFN helps significantly. Separating the attention as well gives up part of the text gain (-2.1% instead of -3.7%) while helping images slightly (-9.1% instead of -7.2%). The sweet spot is MoT: shared attention, separate everything else.

The paper's key finding: Shared attention acts as a "communication channel" between modalities. Without it, each modality processes in isolation and can't leverage the other's context. With shared attention, text tokens influence how image patches are processed (and vice versa), enabling richer multimodal reasoning without sacrificing modality-specific quality in the FFN.
Cross-Modal Attention Flow

Watch how attention connects text and image tokens. Each line shows an attention weight — thicker lines mean stronger attention. Notice how text tokens ground themselves in relevant image patches.

Why does MoT keep the attention mechanism shared across modalities?

Chapter 5: Training & Efficiency

MoT's efficiency advantage becomes dramatic at scale. The paper shows that MoT can match a dense model's performance using significantly fewer FLOPs, or achieve better performance at the same compute budget.

FLOP efficiency

The key metric is performance per FLOP. Since MoT activates fewer parameters per token than a dense model with the same total parameter count, it processes more tokens per second. At equal training compute budgets:

MoT achieves the same quality as a dense Chameleon-like model using 37.3% fewer FLOPs for text and 75% fewer FLOPs for image generation.

These are massive savings. A 7B MoT model performs like a ~10B dense model on images and a ~9B dense model on text, while using only 7B-equivalent compute per token.

Memory and implementation

MoT is straightforward to implement with standard deep learning frameworks. The modality-specific routing doesn't require any special CUDA kernels (unlike some MoE implementations). The key implementation trick is gathering/scattering tokens by modality at each layer:

```python
import torch

# Efficient MoT implementation with gather/scatter
def mot_layer(x, modality_mask, attn, text_ffn, img_ffn, norms):
    # x: [B, L, D], modality_mask: [B, L] (0=text, 1=image)
    text = modality_mask == 0   # boolean masks, shape [B, L]
    img = modality_mask == 1

    # 1. Modality-specific pre-norm (vectorized, no Python loop over tokens)
    x_normed = x.clone()
    x_normed[text] = norms['text_pre'](x[text])
    x_normed[img] = norms['img_pre'](x[img])

    # 2. Shared attention (all tokens together) + residual
    h = attn(x_normed) + x

    # 3. Modality-specific FFN (gather, compute, scatter) + residual
    out = h.clone()
    out[text] += text_ffn(norms['text_post'](h[text]))
    out[img] += img_ffn(norms['img_post'](h[img]))

    return out
```
Compute efficiency insight: MoT's FLOP savings come from a simple arithmetic fact. In a dense model, ~67% of FLOPs are in the FFN, and every token pays for all of it, including the capacity that serves the other modality. For a 50/50 text-image batch, a dense model holding both experts' worth of FFN weights would spend half of its FFN FLOPs on weights tuned for the wrong modality. MoT keeps that capacity but runs only ONE modality's FFN per token, so per-token FLOPs stay at the level of a single-FFN dense model.
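
The 50/50 case can be worked through numerically. The sketch below counts FFN forward FLOPs only, assumes a two-matrix FFN with hidden size 4D and about 2 FLOPs per weight per token, and compares MoT against a hypothetical dense model that holds the same total FFN weights. The width and batch sizes are arbitrary.

```python
def ffn_flops_per_token(d_model: int, ffn_mult: int = 4) -> int:
    """Forward FLOPs of one two-matrix FFN for a single token (2 FLOPs per weight)."""
    return 2 * (2 * d_model * ffn_mult * d_model)

d_model = 4096
n_text, n_image = 500, 500          # a 50/50 batch
one_ffn = ffn_flops_per_token(d_model)

# Dense model holding both experts' worth of FFN weights: every token pays for all of them.
dense_double = (n_text + n_image) * 2 * one_ffn
# MoT: each token runs only the FFN of its own modality.
mot = n_text * one_ffn + n_image * one_ffn

print(f"FFN FLOPs, dense with 2x FFN weights: {dense_double:.3e}")
print(f"FFN FLOPs, MoT:                       {mot:.3e}")
print(f"MoT / dense: {mot / dense_double:.0%}")
```

The ratio is 50% regardless of the text/image mix, because each token always skips exactly one of the two experts.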
FLOP Efficiency Comparison

Compare the training efficiency of Dense vs MoT models. Drag the slider to set the compute budget and see the resulting quality for each approach.

How does MoT achieve 37-75% FLOP savings over dense models?

Chapter 6: Results & Showcase

MoT is evaluated at multiple scales and consistently outperforms dense baselines at the same compute budget. The improvements are especially large for image generation.

Performance at 7B scale

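| Metric | Dense Transfusion 7B | MoT 7B | Improvement |
| --- | --- | --- | --- |
| Text Perplexity ↓ | 8.0 | 7.6 | -5.0% |
| Image FID ↓ | 6.8 | 5.1 | -25.0% |
| GenEval Score ↑ | 0.63 | 0.71 | +12.7% |
| Training FLOPs to match dense text quality | Baseline | -37% | 37% savings |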
The headline result: MoT matches the quality of a dense model that's roughly 1.3-1.4x larger. Put differently, MoT-7B performs like Dense-10B on images and Dense-9B on text, while training at 7B compute cost. In compute terms this is close to a free lunch: better quality at the same per-token cost, with minimal architectural complexity. The price is the memory for the extra FFN parameters.

Scaling behavior

MoT's advantage grows with scale. At 0.76B, the gap is ~15% on FID. At 7B, it's ~25%. At 34B+ the gap is expected to be at least as large, since the FFN continues to hold about two-thirds of the layer parameters at every scale (per layer, the FFN has $8D^2$ parameters and attention $4D^2$, so both grow quadratically in $D$ and their ratio stays fixed).

Beyond two modalities

The paper also experiments with speech as a third modality. Adding a speech expert to a two-modality MoT requires only one more FFN (about 40% more total parameters under the 7B accounting used earlier, from ~11.7B to ~16.4B) but maintains the same per-token compute. The three-modality MoT outperforms a dense model on all three tasks simultaneously.

MoT Performance Dashboard

Compare Dense vs MoT across multiple metrics at different model scales. Drag the size slider and toggle between metrics.

What is MoT's most practically significant result?

Chapter 7: Connections

MoT represents the natural evolution of multimodal architectures: from separate models, to unified dense models, to unified sparse models. Each step improves the quality-efficiency tradeoff.

| Model | Architecture | Sharing Strategy | Key Innovation |
| --- | --- | --- | --- |
| LLaVA | Late fusion | Minimal sharing (adapter only) | Simple but effective VLM |
| Chameleon | Dense early fusion | Everything shared | All modalities as tokens |
| Transfusion | Dense, dual objective | Everything shared | Right objective per modality |
| MoT | Sparse early fusion | Attn shared, FFN separate | Right compute per modality |
Lesson 1: Sparsity through structure. MoT shows that the best form of sparsity for multimodal models is structural (per-modality experts) rather than learned (routers). When the routing signal is known a priori, don't learn it.
Lesson 2: Share the bottleneck, separate the capacity. Attention (the bottleneck for long-range dependencies) benefits from sharing. FFN (the capacity for feature learning) benefits from separation. This principle likely generalizes beyond multimodal models.
Lesson 3: The modality tax is avoidable. Chameleon and Transfusion showed a "modality tax" — performance drop on text when images are added. MoT's separate FFNs largely eliminate this tax because text processing doesn't compete with image processing for FFN capacity.
Architecture Evolution

Trace the evolution from separate models to MoT's sparse unified architecture.

What principle does MoT establish for multimodal architecture design?