Perceiver & Perceiver IO — One Architecture for Everything

Chapter 0: Two Walls

The transformer is a beautiful, general idea: let every element attend to every other element. But two walls stop it from being the universal perception machine it looks like it should be. The Perceiver was built to knock both down.

Wall 1 — the quadratic explosion

Self-attention compares every element with every other element. For n elements that’s n² comparisons. For a sentence of 500 tokens, fine. But a single 224×224 image is 50,176 pixels — and 50,176 squared is about 2.5 billion pairwise scores, per layer, per head. Raw audio is hundreds of thousands of samples. Video is millions. You simply cannot run plain self-attention directly on the raw signal. Transformers only work on images at all because we cheat first — chopping the image into a few hundred patches.
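To make the numbers concrete, here is a small sketch that counts pairwise attention scores for a few raw input sizes. Only the 500-token sentence and the 224×224 image come from the text above; the audio and video sizes are illustrative assumptions, and the helper name `self_attention_scores` is made up for this example.

```python
# Count pairwise self-attention scores (n * n) for a few raw input sizes.
# The audio and video sizes below are assumptions chosen for illustration.
sizes = {
    "sentence (500 tokens)": 500,
    "image (224 x 224 pixels)": 224 * 224,
    "audio (10 s at 16 kHz, assumed)": 10 * 16_000,
    "video (16 frames of 224 x 224, assumed)": 16 * 224 * 224,
}


def self_attention_scores(n: int) -> int:
    """Number of query-key scores in one head of one self-attention layer."""
    return n * n


for name, n in sizes.items():
    print(f"{name:42s} n = {n:>9,d}  scores = {self_attention_scores(n):,d}")
```

The image row comes out at 2,517,630,976 scores, which is the "about 2.5 billion" quoted above.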

Wall 2 — the modality silos

And that cheat is the second wall. Every modality gets a hand-built front-end: convolutions or patch-embeddings for images, spectrograms or strided convs for audio, tokenizers for text, special encoders for point clouds. Each assumes a particular structure of its input. There is no single network you can point at anything. We’ve built a zoo of bespoke architectures, one per data type.

The trap: “Attention is general, so transformers are general.” In practice the attention is general but the cost and the front-end are not — n² forces you to pre-digest every modality into a small token set in a modality-specific way. The Perceiver’s claim: feed the raw, enormous input directly to one architecture that scales linearly and assumes nothing about its structure.

The quadratic wall

Slide the input size. A plain transformer’s self-attention cost (orange) grows with the square of the input; the Perceiver’s cross-attention (teal) grows only linearly. Watch the gap become absurd at image scale.

The trick that knocks down both walls at once is almost embarrassingly simple: don’t let the huge input attend to itself. Instead, give the network a small, fixed set of latent vectors, and let those read from the input. The next chapter introduces this bottleneck; the rest of the lesson builds the whole Perceiver and Perceiver IO around it.

Why can’t a plain transformer run self-attention directly on the raw pixels of a 224×224 image?

Images have too many color channels for attention
Self-attention costs n²; with ~50,000 pixels that’s billions of scores per layer, which is intractable
Attention only works on 1-D sequences, never on 2-D data

Chapter 1: The Latent Bottleneck

Here is the central object of the whole architecture: a small, fixed array of latent vectors. Think of it as the model’s workspace — a mental scratchpad. Where the input might have 50,000 elements, the latent array has just a few hundred, say 512, each a vector of some width D. And critically, the number of latents is chosen by you, completely decoupled from the input size.

The latents are learned: they start as trainable parameters (like a fixed set of question-vectors the network always brings to any input) and are refined by attention. The starting array does not depend on the input, and the mechanism is the same whether the input is an image, a sound, or a point cloud. The input pours into this small workspace; all the heavy thinking happens inside the workspace; and the answer is read back out of it.

The mental model: imagine 50,000 people shouting facts (the input) and a committee of 512 note-takers (the latents). Instead of every shouter talking to every other shouter (50,000² conversations — chaos), the 512 note-takers each listen to the crowd, summarize, then deliberate among themselves (512² — manageable), and finally report out. The committee size is fixed no matter how big the crowd gets.

This single design decision — a fixed-size latent workspace that’s smaller than the input — is what makes everything else possible. It is an information bottleneck (the latents must compress the whole input into 512 vectors) and a computational bottleneck (all the expensive self-attention happens on 512, not 50,000). The art is in how information gets into the latents and out of them — and that is just attention, which we build next.

Input array vs. latent workspace

The big column (orange) is the raw input — thousands of elements. The small column (teal) is the fixed latent workspace. Slide the input size: the input grows, the workspace stays exactly the same. That fixed size is the whole point.

What is the defining property of the Perceiver’s latent array?

Its size grows with the input so it can hold every element
It is a small, fixed-size, learned workspace whose size is chosen independently of the input
It is a copy of the input passed through a convolution

Chapter 2: Cross-Attention — reading the input

How does the input get into the latents? With cross-attention — the same attention you know, but asymmetric. In ordinary self-attention, the queries, keys, and values all come from the same set. In cross-attention, they come from different sets: the queries come from the latents (few), and the keys and values come from the input (many).

So each of the 512 latent vectors asks a question (its query) and gathers a weighted summary from all 50,000 input elements (their keys and values). The attention matrix has shape latents × inputs — 512 by 50,000 — not inputs by inputs. The cost is the number of latents times the number of inputs: linear in the input size, because the latent count is fixed.

cost of cross-attention ≈ N_latents × N_inputs (linear in input)
cost of self-attention ≈ N_inputs² (quadratic in input)
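The asymmetry is easiest to see in the shapes. The following is a minimal single-head sketch in NumPy, not the reference implementation: it leaves out layer normalization, residual connections, multiple heads and the MLP, and the function names, toy sizes and random weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latents, inputs, w_q, w_k, w_v):
    """Single-head cross-attention: queries from latents, keys/values from inputs."""
    q = latents @ w_q                         # (n_latents, d)
    k = inputs @ w_k                          # (n_inputs,  d)
    v = inputs @ w_v                          # (n_inputs,  d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_latents, n_inputs)
    weights = softmax(scores, axis=-1)        # each latent's weights sum to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n_latents, n_inputs, c_in, d = 8, 1000, 3, 16   # toy sizes (assumed)
latents = rng.normal(size=(n_latents, d))       # learned parameters in a real model
inputs = rng.normal(size=(n_inputs, c_in))      # e.g. 1000 RGB pixels
w_q = rng.normal(size=(d, d)) / np.sqrt(d)
w_k = rng.normal(size=(c_in, d)) / np.sqrt(c_in)
w_v = rng.normal(size=(c_in, d)) / np.sqrt(c_in)

out, weights = cross_attention(latents, inputs, w_q, w_k, w_v)
print(out.shape, weights.shape)   # (8, 16) (8, 1000)
```

The attention matrix has one row per latent and one column per input element, so doubling `n_inputs` doubles its size rather than quadrupling it.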

Worked example by hand

Take 2 latents reading from 3 input elements in a tiny toy setting: 2-D queries and keys, scalar values. Latent 1’s query is q = [1, 0]. The three inputs have keys k = [1,0], [0,1], [1,1] and values v = 10, 20, 30. Compute latent 1’s output (skip the √d scaling for clarity).

input	key	score = q·k	softmax	value
1	[1,0]	1	0.422	10
2	[0,1]	0	0.155	20
3	[1,1]	1	0.422	30

Softmax of [1, 0, 1] = [0.422, 0.155, 0.422]: the two 1’s tie, so each gets e/(2e+1) ≈ 0.422, and the 0 gets 1/(2e+1) ≈ 0.155. Output = 0.422·10 + 0.155·20 + 0.422·30 ≈ 4.22 + 3.11 + 12.67 = 20.0. Latent 1 has distilled all three inputs into a single number, weighted by how well each input’s key matched its query. Repeat for latent 2 with its own query, and the entire 3-element input has been compressed into 2 latent values. Scale that to 512 latents reading 50,000 inputs and you have the Perceiver’s encoder.
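Recomputing the example numerically is a useful check on the hand arithmetic. This sketch uses only the standard library and the query, keys and values given above, and, like the hand calculation, skips the √d scaling.

```python
import math

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [10.0, 20.0, 30.0]

# Dot product of the query with each key.
scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]   # [1.0, 0.0, 1.0]

# Softmax over the three scores.
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

# Attention output: weighted average of the values.
output = sum(w * v for w, v in zip(weights, values))

print([round(w, 3) for w in weights])   # [0.422, 0.155, 0.422]
print(round(output, 1))                 # 20.0
```

Because inputs 1 and 3 receive equal weight and their values sit symmetrically around 20, the output is exactly the middle value.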

Cross-attention: latents read the input

Left column = latents (queries), right column = input elements (keys/values). Click a latent to see how strongly it attends to each input (line thickness = attention weight). The attention matrix is small-by-large, never large-by-large.

In the Perceiver’s cross-attention, where do the queries vs. keys/values come from, and why does that make it linear?

All from the input; it’s just self-attention with fewer heads
Queries from the (few) latents, keys/values from the (many) inputs, so the attention matrix is latents×inputs, linear in input size
Queries from the inputs, keys from the latents, making it quadratic in the latents

Chapter 3: Thinking in the latent space

Once the input is distilled into 512 latents, the real computation begins — and here we go back to ordinary self-attention, but applied only to the latents. Stack a deep transformer (many self-attention + MLP layers) that operates purely on the 512-vector workspace. The latents attend to each other, refine each other, build up an understanding.

And now the magic of the bottleneck pays off. This deep stack costs latents squared per layer — 512² ≈ 260,000 — completely independent of the input size. Whether the input had 4,000 or 4,000,000 elements, the thinking costs exactly the same. You have decoupled network depth from input size. Want a 48-layer transformer? Its cost depends only on your chosen latent count, never on the data.

The decoupling, made concrete: a normal transformer that wants to be both deep and handle a large input pays depth × input² — ruinous. The Perceiver pays input×latents (once, to read in) plus depth×latents² (to think). The expensive depth multiplies only the small latent count. This is why Perceivers can be deep and swallow raw signals.

Putting the cost together

For an image (50,176 inputs), 512 latents, 8 processing layers, the rough operation counts are:

stage	cost formula	order of magnitude
plain transformer (1 layer)	inputs²	≈ 2,500,000,000
Perceiver cross-attn (read)	latents × inputs	≈ 26,000,000
Perceiver self-attn (8 layers)	8 × latents²	≈ 2,000,000

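Even with eight layers of thinking, the Perceiver does roughly 90× less work than a single layer of the naive transformer (about 28 million operations against 2.5 billion). As the image grows, only the read term grows, and only linearly; the thinking cost stays flat.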
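The table can be reproduced with a few lines of arithmetic. This sketch counts attention scores only (not FLOPs, and ignoring the MLPs), using the sizes from the example above; the variable names are chosen here for illustration.

```python
# Rough attention-score counts for the image example.
n_inputs = 224 * 224   # 50,176 pixels
n_latents = 512
depth = 8

transformer_one_layer = n_inputs ** 2
perceiver_read = n_latents * n_inputs
perceiver_think = depth * n_latents ** 2
perceiver_total = perceiver_read + perceiver_think
ratio = transformer_one_layer / perceiver_total

print(f"plain transformer, 1 layer : {transformer_one_layer:>13,d}")
print(f"Perceiver read             : {perceiver_read:>13,d}")
print(f"Perceiver think (8 layers) : {perceiver_think:>13,d}")
print(f"Perceiver total            : {perceiver_total:>13,d}")
print(f"ratio                      : {ratio:.1f}x")
```

The read term dominates the Perceiver total here, which is why the total still grows (linearly) with the image even though the thinking cost does not.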

Depth is free of input size

Total cost as you add processing depth, for two input sizes. The two Perceiver curves (teal/blue) rise gently with depth and barely separate — input size hardly matters. The transformer (orange, single layer) towers over both before you even add depth.

The Perceiver’s deep self-attention stack costs the same whether the input has 4,000 or 4,000,000 elements. Why?

It downsamples the input to a fixed resolution first with pooling
It only processes every 1000th element
The self-attention runs on the fixed-size latent array, so its cost depends only on the latent count, not the input

Chapter 4: Position & why it eats any modality

There’s a subtlety hiding in cross-attention. Attention is a set operation — it has no built-in sense of order. To the cross-attention, the pixels of an image are just an unordered bag; shuffle them and the output is identical. But position matters enormously: a pixel in the top-left is different from one in the bottom-right. So we must tell the model where each input element lives.
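The "unordered bag" claim can be checked directly. In this sketch (random toy data, a bare single-head attention with the hypothetical name `attend`), the input keys and values are shuffled together and the latents' output does not change.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))      # 4 latent queries
k = rng.normal(size=(100, 8))    # 100 input keys
v = rng.normal(size=(100, 8))    # 100 input values


def attend(q, k, v):
    """Plain scaled dot-product attention with a stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v


perm = rng.permutation(100)      # shuffle the input elements
same = np.allclose(attend(q, k, v), attend(q, k[perm], v[perm]))
print(same)   # True
```

Each key stays paired with its own value, so the shuffle only reorders the terms of a weighted sum. Without position information in the inputs themselves, the model has no way to tell the two orderings apart.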

Fourier positional features

The Perceiver does this by concatenating positional features onto each input element before it’s read. It uses Fourier features — a bank of sines and cosines at many frequencies, evaluated at each element’s coordinates. High frequencies distinguish nearby positions; low frequencies encode coarse location. Each pixel becomes “my color values, plus a code for exactly where I am.”

And here is the beautiful consequence: this is the only thing that changes between modalities. For audio, you attach 1-D position (time). For images, 2-D position (row, column). For video, 3-D (row, column, frame). For a point cloud, 3-D spatial coordinates. The network is identical — only the positional features you tack on differ. No convolutions, no patches, no tokenizers. That is what “modality-agnostic” really means: the architecture makes no structural assumptions, and you inject structure purely through position features.
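A minimal sketch of such position features follows. It assumes coordinates scaled to [-1, 1], frequency bands spaced linearly from 1 up to half a chosen maximum frequency, and the raw coordinate appended alongside the sines and cosines; the function name, band count and `max_freq` values are illustrative choices, not settings taken from the text.

```python
import numpy as np


def fourier_features(coords, num_bands: int, max_freq: float) -> np.ndarray:
    """coords: (n, d) positions in [-1, 1]. Returns (n, d * (2 * num_bands + 1))."""
    coords = np.asarray(coords, dtype=np.float64)
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)            # (num_bands,)
    angles = np.pi * coords[:, :, None] * freqs[None, None, :]     # (n, d, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    feats = feats.reshape(coords.shape[0], -1)                     # (n, d * 2 * num_bands)
    return np.concatenate([coords, feats], axis=-1)


# 1-D (audio): 1000 time steps.
t = np.linspace(-1.0, 1.0, 1000)[:, None]
audio_pos = fourier_features(t, num_bands=4, max_freq=1000.0)

# 2-D (image): an 8 x 8 pixel grid, flattened to 64 (row, column) pairs.
ys, xs = np.meshgrid(np.linspace(-1, 1, 8), np.linspace(-1, 1, 8), indexing="ij")
grid = np.stack([ys.ravel(), xs.ravel()], axis=-1)
image_pos = fourier_features(grid, num_bands=4, max_freq=8.0)

# Each pixel becomes "color values plus position code".
pixels = np.random.default_rng(0).random((64, 3))
image_input = np.concatenate([pixels, image_pos], axis=-1)

print(audio_pos.shape, image_pos.shape, image_input.shape)   # (1000, 9) (64, 18) (64, 21)
```

The same function serves both modalities; only the number of coordinate columns changes.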

Concept → realization: a convolution bakes in the 2-D grid assumption (it slides a kernel over neighbors). The Perceiver refuses to bake in anything — it treats input as a flat set and learns spatial relationships from the Fourier position codes. Trade-off: it gives up the convolution’s built-in locality prior, so it needs more data to learn what a conv assumes for free. Generality has a price.

Fourier positional features

Each curve is one sine/cosine frequency band. Drag the position: the dots show that element’s feature vector — a unique fingerprint of sines/cosines. Many frequencies give a multi-scale code that pins down location exactly. Toggle 1-D (audio) vs 2-D (image).

What is the ONLY thing that changes when you point a Perceiver at audio vs. images vs. point clouds?

The number of self-attention layers
The convolutional front-end is swapped for each modality
The positional (Fourier) features attached to each input element; the network itself is identical

Chapter 5: Iterative Attention & weight sharing

One pass of cross-attention gives the latents a first glance at the input. But a single glance may miss things — like skimming a page once. The Perceiver can re-attend to the input multiple times, interleaving: cross-attend (look at the input), then several self-attention layers (think), then cross-attend again (look again, now with a better idea of what to look for), and so on. This is the “iterative” in iterative attention.

Because each look uses the same kind of operation, the Perceiver often shares the weights across these repeated cross-attend/think blocks — the same parameters applied over and over. This makes the architecture behave like a recurrent network unrolled in depth: a fixed set of weights, applied iteratively, refining the latent state each round.

Why weight sharing is a big deal: it decouples parameter count from depth, just as the bottleneck decoupled compute from input size. You can run 8 refinement rounds with the parameters of 1. Fewer parameters means a smaller model and tends to reduce overfitting. Compute is not reduced: every round still has to be executed. It’s the same trick that makes RNNs parameter-efficient, applied to attention.

There is a tension to manage: the first cross-attention is usually given its own weights (the initial read is special), while later repeats share. Too many shared repeats and the latents stop improving (diminishing returns, like re-reading a page you’ve memorized); too few and they never fully absorb the input. The number of iterations is a knob you tune.
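The loop structure and the parameter saving can be sketched as follows. This is a simplified illustration under several assumptions: single-head attention with a residual update, no MLP or layer normalization, one self-attention layer per round, and made-up toy sizes. The names `attend`, `init` and `n_params` are hypothetical helpers.

```python
import numpy as np


def attend(q_src, kv_src, p):
    """Single-head attention with a residual update of the query side."""
    q, k, v = q_src @ p["wq"], kv_src @ p["wk"], kv_src @ p["wv"]
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return q_src + w @ v


def init(rng, d_q, d_kv):
    """Random projection weights standing in for trained parameters."""
    return {"wq": rng.normal(size=(d_q, d_q)) / np.sqrt(d_q),
            "wk": rng.normal(size=(d_kv, d_q)) / np.sqrt(d_kv),
            "wv": rng.normal(size=(d_kv, d_q)) / np.sqrt(d_kv)}


def n_params(p):
    return sum(w.size for w in p.values())


rng = np.random.default_rng(0)
d, c, iterations = 16, 5, 4
inputs = rng.normal(size=(200, c))
latents = rng.normal(size=(8, d))

first_read = init(rng, d, c)     # the initial cross-attention keeps its own weights
shared_read = init(rng, d, c)    # reused by every later cross-attention
shared_think = init(rng, d, d)   # reused by every latent self-attention

z = attend(latents, inputs, first_read)       # first look
z = attend(z, z, shared_think)                # think
for _ in range(iterations - 1):
    z = attend(z, inputs, shared_read)        # look again, same weights each time
    z = attend(z, z, shared_think)            # think again, same weights each time

shared_total = n_params(first_read) + n_params(shared_read) + n_params(shared_think)
unshared_total = iterations * (n_params(first_read) + n_params(shared_think))
print(shared_total, unshared_total)   # 1600 4736
```

Raising `iterations` leaves `shared_total` unchanged while `unshared_total` keeps growing, which is the decoupling of parameter count from depth described above.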

Iterative refinement, weights shared

The pipeline unrolled: each round is a cross-attend (read input) + self-attend (think). Toggle weight sharing — shared rounds reuse one block of parameters (same color), like an unrolled RNN. Drag the iteration count.

Sharing weights across the Perceiver’s repeated cross-attend/process blocks primarily buys you:

faster attention math per layer
parameter efficiency: many refinement rounds with the parameters of one, like an unrolled RNN
the ability to skip positional encodings

Chapter 6: Any Output — Perceiver IO

The original Perceiver had one weakness: it could only produce simple outputs. To classify, it just averaged the latents into one vector and ran a classifier. Fine for “is this a cat?”, useless for “label every pixel” or “generate this sentence.” Perceiver IO (2021) fixes the output side with the exact same idea it used for the input side: cross-attention.

The output query array

To produce outputs, you supply a query array — one query vector per output element you want. Then you do one more cross-attention, but flipped: now the output queries are the queries, and the latents are the keys and values. Each output query reads from the latent workspace and produces one output element.

This is wonderfully flexible. Want a per-pixel output (optical flow, segmentation)? Make one query per pixel, each encoding that pixel’s position. Want a sentence? One query per output token position. Want a single class? One query. Want multiple tasks at once? Concatenate their query sets. The number and meaning of outputs is set entirely by the query array — the network doesn’t change. And the cost is outputs × latents — linear in the number of outputs.
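The following sketch shows how the query array alone sets the output shape. The latents are random stand-ins for a processed workspace, the queries are random rather than position-encoded, and the sizes and the `decode` helper are assumptions for illustration.

```python
import numpy as np


def decode(queries, latents, w_q, w_k, w_v):
    """Cross-attention with output queries as Q and latents as K/V."""
    q = queries @ w_q                         # (n_outputs, d)
    k = latents @ w_k                         # (n_latents, d)
    v = latents @ w_v                         # (n_latents, d_out)
    s = q @ k.T / np.sqrt(q.shape[-1])        # (n_outputs, n_latents)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v                              # (n_outputs, d_out)


rng = np.random.default_rng(0)
n_latents, d, d_query, d_out = 32, 16, 4, 2
latents = rng.normal(size=(n_latents, d))     # stand-in for the processed workspace
w_q = rng.normal(size=(d_query, d))
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d_out))

class_query = rng.normal(size=(1, d_query))            # one query, one output
pixel_queries = rng.normal(size=(28 * 28, d_query))    # one query per pixel
multi_task = np.concatenate([class_query, pixel_queries], axis=0)

shapes = {name: decode(qs, latents, w_q, w_k, w_v).shape
          for name, qs in [("class", class_query),
                           ("per-pixel", pixel_queries),
                           ("multi-task", multi_task)]}
print(shapes)
```

The same latents and the same weights serve all three calls; only the number of query rows differs, and the attention matrix has `n_outputs × n_latents` entries each time.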

ENCODE

input [huge] → cross-attn → latents [512] (linear in input)

↓

PROCESS

latents → deep self-attn → latents [512] (independent of input)

↓

DECODE

output queries [any #] → cross-attn(latents) → outputs (linear in outputs)

The symmetry: cross-attention reads a huge input into a fixed workspace; the same cross-attention reads a fixed workspace out into an arbitrary output. Input size and output size are both decoupled from the expensive middle. One architecture: any input → fixed latent thinking → any output.

Decoding with output queries

The latent workspace (center) is fixed. Slide the number of output queries: each query (right) cross-attends to the latents to produce one output element. Few queries → a class label; many → per-pixel maps or a sentence. The same latents serve any output shape.

How does Perceiver IO produce arbitrary, structured outputs (like a value per pixel)?

It upsamples the averaged latent vector with transposed convolutions
It runs the processor once per output element
An output query array cross-attends to the latents: one query per desired output element, so any shape is possible

Chapter 7: The Full Perceiver IO Pipeline (showcase)

Assemble the whole thing: encode (cross-attention reads the huge input into the latents), process (deep self-attention thinks in the latents), decode (output queries cross-attend to read the answer out). The simulator runs the full pipeline and tallies the cost against a plain transformer as you scale every dimension.
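The three stages can be written as one attention function used in three roles. This is a deliberately stripped-down sketch (single head, residuals only in the encode and process steps, no MLPs or normalization, random untrained weights, toy sizes); `attention`, `make_weights` and `perceiver_io` are names invented here.

```python
import numpy as np


def attention(query_src, kv_src, w):
    """Single-head attention: query_src supplies queries, kv_src keys and values."""
    q, k, v = query_src @ w["q"], kv_src @ w["k"], kv_src @ w["v"]
    s = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v


def make_weights(rng, d_query_src, d_kv_src, d_out):
    return {"q": rng.normal(size=(d_query_src, d_out)) / np.sqrt(d_query_src),
            "k": rng.normal(size=(d_kv_src, d_out)) / np.sqrt(d_kv_src),
            "v": rng.normal(size=(d_kv_src, d_out)) / np.sqrt(d_kv_src)}


def perceiver_io(inputs, latents, out_queries, enc, proc_layers, dec):
    z = latents + attention(latents, inputs, enc)   # ENCODE: latents read the input
    for layer in proc_layers:                       # PROCESS: latents attend to latents
        z = z + attention(z, z, layer)
    return attention(out_queries, z, dec)           # DECODE: queries read the latents


rng = np.random.default_rng(0)
c, d, d_query, d_out, n_latents = 3, 16, 4, 2, 8
latents = rng.normal(size=(n_latents, d))
enc = make_weights(rng, d, c, d)
proc = [make_weights(rng, d, d, d) for _ in range(2)]
dec = make_weights(rng, d_query, d, d_out)

for n_in, n_out in [(100, 1), (5000, 1), (5000, 400)]:
    x = rng.normal(size=(n_in, c))
    queries = rng.normal(size=(n_out, d_query))
    y = perceiver_io(x, latents, queries, enc, proc, dec)
    print(f"inputs={n_in:5d}  outputs={n_out:4d}  result shape={y.shape}")
```

The weights and the latent array are created once and reused for every input and output size in the loop, which is the decoupling the pipeline is built around.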

Perceiver IO end-to-end — and its cost vs. a transformer

Set the input size, latent count, processing depth, and number of outputs. The diagram shows the three stages; the readout compares total operations against a plain transformer doing the same job. Push the input size up and watch the transformer’s cost explode while the Perceiver’s barely moves. Then shrink the latents too far and watch the bottleneck choke.

Initial settings: input elements 12,544; latents 512; process depth 8; outputs 1.

Notice what the readout teaches. The transformer’s cost is dominated by a single input-squared term, and it is catastrophic. The Perceiver splits its cost into three modest, separable terms: read (input×latents), think (depth×latents²), write (outputs×latents). These are linear in input size, depth, and output count respectively; only the latent count, which you choose and keep small, enters quadratically. That separability is the architecture’s superpower: you can grow the input, the depth, or the output independently, and none of them multiplies another.
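Written as formulas, the comparison looks like this. The symbols are introduced here for convenience and are not from the lesson: N input elements, M latents, L processing layers, O outputs. The transformer is assumed to use the same depth L, and only attention scores are counted.

```latex
\begin{align*}
C_{\text{transformer}} &\approx L \cdot N^{2} \\
C_{\text{Perceiver IO}} &\approx
  \underbrace{M N}_{\text{read}}
  + \underbrace{L M^{2}}_{\text{think}}
  + \underbrace{O M}_{\text{write}}
\end{align*}
```

For example, with N = 12,544, M = 512, L = 8 and O = 1, the first expression is about 1.26 billion and the second about 8.5 million, a gap of roughly 150×.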

What Perceiver IO achieved (2021): a single architecture competitive across wildly different tasks — ImageNet classification (from raw pixels, no conv stem), optical flow (state-of-the-art, per-pixel output), audio-visual classification, StarCraft II unit sets, and even language (matching a BERT-style model on GLUE without tokenization tricks). One design, many modalities, any output shape.

Chapter 8: Scaling, Trade-offs & Where It Lives

The Perceiver’s gift is linear scaling in input and output, with all the expensive depth confined to the latent workspace. But generality and bottlenecks have costs too — understanding them tells you when to reach for it.

The bottleneck cuts both ways

Squeezing 50,000 inputs into 512 latents is a compression. If the task needs fine, distributed detail from everywhere in the input simultaneously, 512 latents may not have the capacity to hold it — information is lost at the bottleneck. Too few latents and the model underperforms; too many and you creep back toward the quadratic cost you were avoiding. The latent count is the central dial trading capacity against cost.
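The cost side of that dial is easy to tabulate. This sketch reuses the image example (50,176 inputs, 8 processing layers) and counts attention scores only; it says nothing about accuracy, which is the other side of the trade. The function name `perceiver_cost` is an illustrative choice.

```python
n_inputs, depth = 50_176, 8   # image example from the text


def perceiver_cost(n_latents: int) -> tuple[int, int]:
    """Return (read, think) attention-score counts for a given latent count."""
    return n_latents * n_inputs, depth * n_latents ** 2


for m in (64, 512, 4096, 50_176):
    read, think = perceiver_cost(m)
    print(f"latents={m:>6,d}  read={read:>14,d}  think={think:>15,d}")

# The think term overtakes the read term once depth * M^2 > M * N,
# i.e. once the latent count M exceeds N / depth.
crossover = n_inputs // depth
print(crossover)   # 6272
```

Below the crossover the linear read term dominates; above it the quadratic think term takes over, and with as many latents as inputs the thinking stack alone costs as much as an 8-layer plain transformer.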

The price of no priors

Because it bakes in no convolutional locality, the Perceiver must learn spatial structure from position features — so on pure vision tasks with limited data, a ConvNet or ViT with built-in locality can be more sample-efficient. The Perceiver shines when you have lots of data, multiple modalities, very large inputs, or unusual output shapes — situations where bespoke priors don’t exist or don’t fit.

Where the idea lives now

The latent-bottleneck-via-cross-attention idea spread far beyond the original papers. The most famous descendant is the Perceiver Resampler in DeepMind’s Flamingo: it cross-attends a variable, large number of vision features down to a fixed small set of tokens that a language model can consume — exactly the encode step, used as a bridge between a vision encoder and an LLM. Many modern multimodal models use this “resampler” pattern to glue modalities together.

Cost vs. input size: the asymptotics

Total compute as the input grows. Transformer (orange): quadratic, unbounded. Perceiver (teal): linear — a gentle slope set by the latent count. The crossover is early; past it, the gap is everything.

When is a ConvNet/ViT likely to beat a Perceiver?

On very large multimodal inputs with little structure
On vision tasks with limited data, where built-in convolutional locality is more sample-efficient than learning structure from position features
When the output must be an arbitrary per-pixel map

Chapter 9: Cheat Sheet & Connections

The whole architecture, and how it sits in the family.

problem

self-attention is n² (can’t eat raw signals) + every modality needs a bespoke front-end

↓ fixed latent workspace

latents

small, learned, fixed-size array — size chosen independent of input

↓ ENCODE (cross-attn: Q=latents, KV=input)

read

linear cost (latents×inputs); + Fourier position features = modality-agnostic

↓ PROCESS (self-attn on latents)

think

deep transformer, cost latents² — independent of input; optional iteration + weight sharing

↓ DECODE (cross-attn: Q=output queries, KV=latents)

write (Perceiver IO)

any output shape; cost outputs×latents

The family

Model	Handles large input?	Modality-agnostic?	Arbitrary output?	Key trick
Transformer	no (n²)	no	seq only	full self-attention
ViT	via patches	no (images)	class / patch	patch embed
Perceiver	yes (linear)	yes	simple (avg)	latent cross-attn
Perceiver IO	yes	yes	any (queries)	+ output query decode
Set Transformer	yes	sets	pooled	inducing points
DETR	via CNN	detection	object queries	learned queries

The unifying thread: a small set of learned query vectors that cross-attend to a large input is a recurring superpower — Perceiver latents, DETR object queries, Set Transformer inducing points, Flamingo’s resampler are all the same idea. Cross-attention into a bottleneck is one of the most reusable patterns in modern architecture design.

When to reach for it

Huge or multimodal inputs (raw audio/video, point clouds, fused sensors) where n² is hopeless and no clean tokenizer exists.
Unusual output shapes (dense per-element predictions, multi-task heads) — Perceiver IO’s query decoder.
Bridging modalities — the resampler pattern, compressing many features into a few tokens for an LLM.
Reach for a ConvNet/ViT instead on data-limited vision, where built-in locality wins.

Keep exploring

→ Attention Variants — cross-attention, MQA, FlashAttention in depth
→ Vision Transformer — the patch-based approach Perceiver sidesteps
→ Linear Attention & RWKV — another route around the n² wall
→ Vision-Language Models — where the resampler bridges vision and text
→ JEPA — another fixed-bottleneck idea, for self-supervised prediction

“What I cannot create, I do not understand.” You just rebuilt Perceiver IO from one move: a fixed latent workspace that cross-attends to read any input in, thinks in private at constant cost, and cross-attends to write any output out. One architecture, any modality, any shape — the n² wall and the modality silos, both gone.